• Machine Learning Week 7 (Andrew Ng)


    Decision tree

    1、Decision tree

    1.1、Decision tree model

    (figure)

    1.2、Learning process

    (figures)

    2、Decision tree learning

    2.1、Measuring purity

    The entropy function
    (figure)
    $H(p_1) = -p_1\log_2(p_1) - p_0\log_2(p_0) = -p_1\log_2(p_1) - (1-p_1)\log_2(1-p_1)$, where $p_0 = 1 - p_1$.

    import numpy as np

    def compute_entropy(y):
        """
        Computes the entropy for the examples at a node.

        Args:
           y (ndarray): Numpy array indicating whether each example at a node is
               edible (`1`) or poisonous (`0`)

        Returns:
            entropy (float): Entropy at that node

        """
        # You need to return the following variables correctly
        entropy = 0.

        ### START CODE HERE ###
        m = len(y)
        if m != 0:
            p1 = len(y[y == 1]) / m
            # A pure node (p1 = 0 or 1) has zero entropy; log2(0) would be undefined
            if p1 != 0 and p1 != 1:
                entropy = -p1 * np.log2(p1) - (1 - p1) * np.log2(1 - p1)
            else:
                entropy = 0.
        ### END CODE HERE ###

        return entropy
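
    A quick sanity check on a couple of hypothetical label arrays (the values are made up for illustration):

    # Pure node -> entropy 0; 50/50 node -> entropy 1
    print(compute_entropy(np.array([1, 1, 1, 1])))      # 0.0
    print(compute_entropy(np.array([1, 0, 1, 0])))      # 1.0
    print(compute_entropy(np.array([1, 1, 1, 0, 0])))   # ~0.971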
    
    2.2、Choosing a split: Information Gain

    $\text{Information Gain} = H(p_1^\text{node}) - \left(w^\text{left} H(p_1^\text{left}) + w^\text{right} H(p_1^\text{right})\right)$
    Pick the feature whose split gives the highest information gain.
    (figure)

    def split_dataset(X, node_indices, feature):
        """
        Splits the data at the given node into
        left and right branches
        
        Args:
            X (ndarray):             Data matrix of shape(n_samples, n_features)
            node_indices (list):  List containing the active indices, i.e., the samples being considered at this step.
            feature (int):           Index of feature to split on
        
        Returns:
            left_indices (list): Indices with feature value == 1
            right_indices (list): Indices with feature value == 0
        """
        
        # You need to return the following variables correctly
        left_indices = []
        right_indices = []
        
        ### START CODE HERE ###
        for i in node_indices:
            if X[i][feature] == 1:
                left_indices.append(i)
            else:
                right_indices.append(i)
        ### END CODE HERE ###
            
        return left_indices, right_indices
    
    
    def compute_information_gain(X, y, node_indices, feature):
        
        """
        Compute the information of splitting the node on a given feature
        
        Args:
            X (ndarray):            Data matrix of shape(n_samples, n_features)
            y (array like):         list or ndarray with n_samples containing the target variable
            node_indices (ndarray): List containing the active indices. I.e, the samples being considered in this step.
       
        Returns:
            cost (float):        Cost computed
        
        """    
        # Split dataset
        left_indices, right_indices = split_dataset(X, node_indices, feature)
        
        # Some useful variables
        X_node, y_node = X[node_indices], y[node_indices]
        X_left, y_left = X[left_indices], y[left_indices]
        X_right, y_right = X[right_indices], y[right_indices]
        
        # You need to return the following variables correctly
        information_gain = 0
        
        ### START CODE HERE ###
        # Weights: fraction of the node's examples that go to each branch
        w_left = len(X_left) / len(X_node)
        w_right = len(X_right) / len(X_node)
        # Entropy at the node and at each branch
        node_entropy = compute_entropy(y_node)
        left_entropy = compute_entropy(y_left)
        right_entropy = compute_entropy(y_right)
        # Information gain = node entropy minus the weighted entropy of the branches
        information_gain = node_entropy - (w_left * left_entropy + w_right * right_entropy)
        ### END CODE HERE ###
        
        return information_gain
    
    
    def get_best_split(X, y, node_indices):   
        """
        Returns the optimal feature to split the node data on
        
        Args:
            X (ndarray):            Data matrix of shape(n_samples, n_features)
            y (array like):         list or ndarray with n_samples containing the target variable
            node_indices (ndarray): List containing the active indices, i.e., the samples being considered in this step.
    
        Returns:
            best_feature (int):     The index of the best feature to split
        """    
        
        # Some useful variables
        num_features = X.shape[1]
        
        # You need to return the following variables correctly
        best_feature = -1
        
        ### START CODE HERE ###
        max_info_gain = 0
        # Try every feature and keep the one with the largest information gain
        for i in range(num_features):
            info_gain = compute_information_gain(X, y, node_indices, i)
            if info_gain > max_info_gain:
                max_info_gain = info_gain
                best_feature = i
        ### END CODE HERE ###
       
        return best_feature
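
    A quick check on a tiny hand-made dataset (the three binary features and labels below are invented for illustration; they are not the lab's mushroom data):

    X_toy = np.array([[1, 1, 0],
                      [1, 0, 1],
                      [0, 1, 1],
                      [0, 0, 0],
                      [1, 1, 1]])
    y_toy = np.array([1, 1, 0, 0, 1])
    root = list(range(len(X_toy)))

    print(split_dataset(X_toy, root, feature=0))                   # ([0, 1, 4], [2, 3])
    print(compute_information_gain(X_toy, y_toy, root, feature=0)) # ~0.971 for this toy data
    print(get_best_split(X_toy, y_toy, root))                      # 0: feature 0 separates the toy labels perfectly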
    
    2.3、Putting it together

    The steps:
    (figure)

    # Not graded
    tree = []
    
    def build_tree_recursive(X, y, node_indices, branch_name, max_depth, current_depth):
        """
        Build a tree using the recursive algorithm that splits the dataset into 2 subgroups at each node.
        This function just prints the tree.
        
        Args:
            X (ndarray):            Data matrix of shape(n_samples, n_features)
            y (array like):         list or ndarray with n_samples containing the target variable
            node_indices (ndarray): List containing the active indices, i.e., the samples being considered in this step.
            branch_name (string):   Name of the branch. ['Root', 'Left', 'Right']
            max_depth (int):        Max depth of the resulting tree. 
            current_depth (int):    Current depth. Parameter used during recursive call.
       
        """ 
    
        # Maximum depth reached - stop splitting
        if current_depth == max_depth:
            formatting = " "*current_depth + "-"*current_depth
            print(formatting, "%s leaf node with indices" % branch_name, node_indices)
            return
       
        # Otherwise, get best split and split the data
        # Get the best feature and threshold at this node
        best_feature = get_best_split(X, y, node_indices) 
        tree.append((current_depth, branch_name, best_feature, node_indices))
        
        formatting = "-"*current_depth
        print("%s Depth %d, %s: Split on feature: %d" % (formatting, current_depth, branch_name, best_feature))
        
        # Split the dataset at the best feature
        left_indices, right_indices = split_dataset(X, node_indices, best_feature)
        
        # continue splitting the left and the right child. Increment current depth
        build_tree_recursive(X, y, left_indices, "Left", max_depth, current_depth+1)
        build_tree_recursive(X, y, right_indices, "Right", max_depth, current_depth+1)
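
    With the helpers above in place, the tree is grown from the root. Assuming X_train and y_train hold the one-hot-encoded features and labels from the lab, a call along these lines prints the splits:

    root_indices = list(range(X_train.shape[0]))   # start with every training example at the root
    build_tree_recursive(X_train, y_train, root_indices, "Root", max_depth=2, current_depth=0)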
    
    
    2.4、Using one-hot encoding of categorical features

    (figure)
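
    A categorical feature with k possible values is replaced by k binary (0/1) features, exactly one of which is 1 for each example. A minimal sketch with pandas (the ear-shape values follow the lecture's cat example; `get_dummies` is just one convenient way to do the encoding, not part of the lab code):

    import pandas as pd

    df = pd.DataFrame({"ear_shape": ["pointy", "floppy", "oval", "pointy"]})
    # One column per category value, each holding 0 or 1
    print(pd.get_dummies(df, columns=["ear_shape"], dtype=int))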

    2.5、Continuous valued features

    (figure)
    Choose the 9 mid-points between the 10 examples as possible splits, and find the split that gives the highest information gain.
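
    A minimal sketch of that search, reusing `compute_entropy` from above; the feature values and labels are invented for illustration (think of the weight feature from the lecture):

    x_cont = np.array([7.2, 8.4, 9.2, 10.2, 11.0, 12.4, 13.3, 15.1, 18.0, 20.0])
    y = np.array([0, 1, 0, 1, 1, 1, 0, 1, 0, 0])

    best_gain, best_threshold = 0.0, None
    H_node = compute_entropy(y)
    # Candidate thresholds: mid-points between consecutive sorted values (9 of them for 10 examples)
    values = np.sort(x_cont)
    thresholds = (values[:-1] + values[1:]) / 2
    for t in thresholds:
        left, right = y[x_cont <= t], y[x_cont > t]
        w_left, w_right = len(left) / len(y), len(right) / len(y)
        gain = H_node - (w_left * compute_entropy(left) + w_right * compute_entropy(right))
        if gain > best_gain:
            best_gain, best_threshold = gain, t
    print(best_threshold, best_gain)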

    2.6、Regression with Decision Trees

    Use the features to predict a continuous target (the animal's weight in the lecture example). The splitting criterion is the regression analogue of information gain: the reduction in variance of the target values, weighted over the left and right branches, as in the sketch below.

    (figure)
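
    A minimal sketch of that criterion (my own helper, not lab code), assuming binary split indices from `split_dataset` and continuous targets `y`:

    def variance_reduction(y, node_indices, left_indices, right_indices):
        """Regression analogue of information gain: drop in weighted variance."""
        y_node = y[node_indices]
        y_left, y_right = y[left_indices], y[right_indices]
        w_left = len(y_left) / len(y_node)
        w_right = len(y_right) / len(y_node)
        return np.var(y_node) - (w_left * np.var(y_left) + w_right * np.var(y_right))

    The prediction at a leaf is then simply the average target value of the examples in that leaf.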

    3、Tree ensembles

    3.1、Using multiple decision trees

    Trees are highly sensitive to small changes in the data, which is why an ensemble of many trees is more robust than a single tree.
    (figure)

    3.2、Sampling with replacement

    Sample from the original training set with replacement (the same example can be picked more than once) to create a new training set of the same size.
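
    A minimal sketch of building one such bootstrap training set with numpy, assuming X and y are the original training arrays:

    m = len(X)
    # Draw m indices uniformly at random *with* replacement; some examples repeat, others are left out
    idx = np.random.choice(m, size=m, replace=True)
    X_boot, y_boot = X[idx], y[idx]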

    3.3、Random forest algorithm

    (figure)
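
    The algorithm described in the lecture: train B trees (typically on the order of 64–128), each on its own bootstrap sample, and at each node consider only a random subset of roughly sqrt(n) of the n features. Rather than hand-rolling this, a library implementation can be used; a minimal sketch with scikit-learn (not part of the course labs), assuming X_train and y_train exist:

    from sklearn.ensemble import RandomForestClassifier

    # n_estimators = number of trees B; max_features="sqrt" samples ~sqrt(n) features per split
    forest = RandomForestClassifier(n_estimators=100, max_features="sqrt")
    forest.fit(X_train, y_train)
    predictions = forest.predict(X_train)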

    3.4、XGBoost

    Like deliberate practice: when building the next decision tree, focus more attention on the subset of examples that the trees built so far are not yet doing well on, so the new tree is trained with a higher chance of picking those misclassified examples.
    (figure)
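
    In practice XGBoost is used through its library rather than implemented from scratch; a minimal sketch for classification, assuming X_train, y_train, and X_test exist:

    from xgboost import XGBClassifier

    model = XGBClassifier()        # use XGBRegressor() for regression tasks
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)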

    3.5、When to use decision trees

    (figure)

  • Original article: https://blog.csdn.net/weixin_62012485/article/details/126458961