• About Random Forest


    Decision trees can suffer from high variance, which makes their results fragile to the specific training data used. Building multiple models from samples of your training data, called bagging, can reduce this variance, but the trees remain highly correlated.

    Random Forest is an extension of bagging that, in addition to building trees from multiple samples of your training data, also constrains the features that can be used to build each tree, forcing the trees to be different.

    After completing this tutorial, you will know:

    • The difference between bagged decision trees and the random forest algorithm.
    • How to construct bagged decision trees with more variance.
    • How to apply the random forest algorithm to a predictive modeling problem.

    1.1.1 Random Forest Algorithm

    Decision trees involve the greedy selection of the best split point from the dataset at each step.

    Random Forest algorithm: force the decision trees to be different by limiting the features (columns) that the greedy algorithm can evaluate at each split point when creating the tree.
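    As a minimal sketch of this idea (the helper name and toy values below are illustrative, not part of the tutorial's code), plain bagging may consider every column at a split, whereas random forest first draws a random subset of column indices and only evaluates those:

    from random import randrange, seed

    # Sketch: draw a random subset of column indices to consider at one split.
    def sample_feature_indices(n_total_features, n_features):
        features = list()
        while len(features) < n_features:
            index = randrange(n_total_features)  # pick a random column index
            if index not in features:            # keep the indices unique
                features.append(index)
        return features

    seed(1)
    # With 10 input columns, only 3 randomly chosen columns are candidates at this split
    print(sample_feature_indices(10, 3))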

    1.2 Tutorial

    • Calculating Splits
    • Sonar Case Study

    These steps provide the foundation you need to implement and apply the Random Forest algorithm to your own predictive modeling problems.

    1.2.1 Calculating Splits

    Split points are chosen by finding the attribute and the value of that attribute that results in the lowest cost.

    Finding the best split point in a decision tree involves evaluating the cost of each value in the training dataset for each input variable.

    Below is a function named get_split() that implements this procedure. It relies on two helpers:

    test_split(): splits the dataset by a candidate split point.

    gini_index(): evaluates the cost of a given split by the groups of rows it creates.

    features: a list of candidate features is created by randomly selecting feature indices and adding them to a list.

    split points: this list of features is then enumerated and specific values in the training dataset are evaluated as split points.

    # Select the best split point for a dataset
    def get_split(dataset, n_features):
        class_values = list(set(row[-1] for row in dataset))
        b_index, b_value, b_score, b_groups = 999, 999, 999, None
        features = list()
        # randomly select n_features candidate feature indices
        while len(features) < n_features:
            index = randrange(len(dataset[0]) - 1)
            if index not in features:
                features.append(index)
        # evaluate every value of every candidate feature as a split point
        for index in features:
            for row in dataset:
                groups = test_split(index, row[index], dataset)
                gini = gini_index(groups, class_values)
                if gini < b_score:
                    b_index, b_value, b_score, b_groups = index, row[index], gini, groups
        return {'index': b_index, 'value': b_value, 'groups': b_groups}
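    As a quick check, get_split() can be exercised on a small contrived dataset. This is only a sketch: the tiny dataset below is made up, and it assumes test_split() and gini_index() (and the randrange import) from the complete listing in the next section are already defined.

    from random import seed

    seed(1)
    # contrived two-class dataset: two numeric inputs, class label in the last column
    dataset = [[2.7, 2.5, 0],
               [1.3, 1.8, 0],
               [3.6, 4.4, 0],
               [7.4, 2.8, 1],
               [9.0, 3.0, 1],
               [7.9, 0.8, 1]]
    # consider only one randomly chosen feature at this split
    split = get_split(dataset, 1)
    print('Split: [X%d < %.3f]' % (split['index'] + 1, split['value']))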

    Now that we know how the decision tree algorithm can be modified for use with the Random Forest algorithm, we can piece this together with an implementation of bagging and apply it to a real-world dataset.

    1.2.2 Sonar Case Study

    load_csv(): loads the dataset file.

    str_column_to_float(): converts the string input values to numeric.

    str_column_to_int(): converts the output column from strings to the integer values 0 and 1.

    These helpers are used to load and prepare the dataset.

    We will use k-fold cross-validation to estimate the performance of the learned model on unseen data. This means that we will construct and evaluate k models and estimate the performance as the mean model error. Classification accuracy will be used to evaluate each model.

    cross_validation_split(): splits the dataset into k folds.

    accuracy_metric(): calculates the classification accuracy percentage.

    evaluate_algorithm(): evaluates an algorithm using a cross-validation split.
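    For example, the mean model error is simply the average of the per-fold accuracy scores returned by evaluate_algorithm(); the numbers below are hypothetical, just to show the calculation:

    # hypothetical accuracy scores from 5 folds (percent)
    scores = [72.5, 68.3, 75.6, 70.7, 74.1]
    mean_accuracy = sum(scores) / float(len(scores))
    print('Mean Accuracy: %.3f%%' % mean_accuracy)  # -> Mean Accuracy: 72.240%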

    We will use an implementation of the Classification and Regression Trees (CART) algorithm adapted for bagging, along with the following helper functions:

    test_split(): splits a dataset into two groups based on an attribute and an attribute value.

    gini_index(): evaluates the quality of a split point.

    get_split(): finds the best split point for a dataset.

    to_terminal(), split() and build_tree(): used to create a single decision tree.

    predict(): makes a prediction with a decision tree.

    subsample(): makes a random subsample of the training dataset with replacement.

    bagging_predict(): makes a prediction with a list of decision trees by majority vote (a short sketch of this vote follows the list).

    random_forest(): creates a list of decision trees from subsamples of the training dataset and then uses them to make predictions.
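    As mentioned above, the vote in bagging_predict() simply returns the most common class predicted across the trees. A minimal sketch with made-up per-tree predictions:

    # hypothetical class labels returned by predict(tree, row) for five trees
    predictions = [0, 1, 1, 0, 1]
    # majority vote: the most frequent prediction wins
    print(max(set(predictions), key=predictions.count))  # -> 1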

    The key difference between Random Forest and bagged decision trees is the one small change to the way that trees are created, here in the get_split() function. The complete example is listed below.

    # Random Forest Algorithm on Sonar Dataset
    from random import seed
    from random import randrange
    from csv import reader
    from math import sqrt

    # Load a CSV file
    def load_csv(filename):
        dataset = list()
        with open(filename, 'r') as file:
            csv_reader = reader(file)
            for row in csv_reader:
                if not row:
                    continue
                dataset.append(row)
        return dataset

    # Convert string column to float
    def str_column_to_float(dataset, column):
        for row in dataset:
            row[column] = float(row[column].strip())

    # Convert string column to integer
    def str_column_to_int(dataset, column):
        class_values = [row[column] for row in dataset]
        unique = set(class_values)
        lookup = dict()
        for i, value in enumerate(unique):
            lookup[value] = i
        for row in dataset:
            row[column] = lookup[row[column]]
        return lookup

    # Split a dataset into k folds
    def cross_validation_split(dataset, n_folds):
        dataset_split = list()
        dataset_copy = list(dataset)
        fold_size = int(len(dataset) / n_folds)
        for i in range(n_folds):
            fold = list()
            while len(fold) < fold_size:
                index = randrange(len(dataset_copy))
                fold.append(dataset_copy.pop(index))
            dataset_split.append(fold)
        return dataset_split

    # Calculate accuracy percentage
    def accuracy_metric(actual, predicted):
        correct = 0
        for i in range(len(actual)):
            if actual[i] == predicted[i]:
                correct += 1
        return correct / float(len(actual)) * 100.0

    # Evaluate an algorithm using a cross validation split
    def evaluate_algorithm(dataset, algorithm, n_folds, *args):
        folds = cross_validation_split(dataset, n_folds)
        scores = list()
        for fold in folds:
            train_set = list(folds)
            train_set.remove(fold)
            train_set = sum(train_set, [])
            test_set = list()
            for row in fold:
                row_copy = list(row)
                test_set.append(row_copy)
                row_copy[-1] = None
            predicted = algorithm(train_set, test_set, *args)
            actual = [row[-1] for row in fold]
            accuracy = accuracy_metric(actual, predicted)
            scores.append(accuracy)
        return scores

    # Split a dataset based on an attribute and an attribute value
    def test_split(index, value, dataset):
        left, right = list(), list()
        for row in dataset:
            if row[index] < value:
                left.append(row)
            else:
                right.append(row)
        return left, right

    # Calculate the Gini index for a split dataset
    def gini_index(groups, classes):
        # count all samples at split point
        n_instances = float(sum([len(group) for group in groups]))
        # sum weighted Gini index for each group
        gini = 0.0
        for group in groups:
            size = float(len(group))
            # avoid divide by zero
            if size == 0:
                continue
            score = 0.0
            # score the group based on the score for each class
            for class_val in classes:
                p = [row[-1] for row in group].count(class_val) / size
                score += p * p
            # weight the group score by its relative size
            gini += (1.0 - score) * (size / n_instances)
        return gini

    # Select the best split point for a dataset
    def get_split(dataset, n_features):
        class_values = list(set(row[-1] for row in dataset))
        b_index, b_value, b_score, b_groups = 999, 999, 999, None
        features = list()
        while len(features) < n_features:
            index = randrange(len(dataset[0]) - 1)
            if index not in features:
                features.append(index)
        for index in features:
            for row in dataset:
                groups = test_split(index, row[index], dataset)
                gini = gini_index(groups, class_values)
                if gini < b_score:
                    b_index, b_value, b_score, b_groups = index, row[index], gini, groups
        return {'index': b_index, 'value': b_value, 'groups': b_groups}

    # Create a terminal node value
    def to_terminal(group):
        outcomes = [row[-1] for row in group]
        return max(set(outcomes), key=outcomes.count)

    # Create child splits for a node or make terminal
    def split(node, max_depth, min_size, n_features, depth):
        left, right = node['groups']
        del(node['groups'])
        # check for a no split
        if not left or not right:
            node['left'] = node['right'] = to_terminal(left + right)
            return
        # check for max depth
        if depth >= max_depth:
            node['left'], node['right'] = to_terminal(left), to_terminal(right)
            return
        # process left child
        if len(left) < min_size:
            node['left'] = to_terminal(left)
        else:
            node['left'] = get_split(left, n_features)
            split(node['left'], max_depth, min_size, n_features, depth + 1)
        # process right child
        if len(right) < min_size:
            node['right'] = to_terminal(right)
        else:
            node['right'] = get_split(right, n_features)
            split(node['right'], max_depth, min_size, n_features, depth + 1)

    # Build a decision tree
    def build_tree(train, max_depth, min_size, n_features):
        root = get_split(train, n_features)
        split(root, max_depth, min_size, n_features, 1)
        return root

    # Make a prediction with a decision tree
    def predict(node, row):
        if row[node['index']] < node['value']:
            if isinstance(node['left'], dict):
                return predict(node['left'], row)
            else:
                return node['left']
        else:
            if isinstance(node['right'], dict):
                return predict(node['right'], row)
            else:
                return node['right']

    # Create a random subsample from the dataset with replacement
    def subsample(dataset, ratio):
        sample = list()
        n_sample = round(len(dataset) * ratio)
        while len(sample) < n_sample:
            index = randrange(len(dataset))
            sample.append(dataset[index])
        return sample

    # Make a prediction with a list of bagged trees
    def bagging_predict(trees, row):
        predictions = [predict(tree, row) for tree in trees]
        return max(set(predictions), key=predictions.count)

    # Random Forest Algorithm
    def random_forest(train, test, max_depth, min_size, sample_size, n_trees, n_features):
        trees = list()
        for i in range(n_trees):
            sample = subsample(train, sample_size)
            tree = build_tree(sample, max_depth, min_size, n_features)
            trees.append(tree)
        predictions = [bagging_predict(trees, row) for row in test]
        return predictions

    # Test the random forest algorithm on the sonar dataset
    seed(2)
    # load and prepare data
    filename = 'sonar.all-data.csv'
    dataset = load_csv(filename)
    # convert string attributes to floats
    for i in range(len(dataset[0]) - 1):
        str_column_to_float(dataset, i)
    # convert class column to integers
    str_column_to_int(dataset, len(dataset[0]) - 1)
    # evaluate algorithm
    n_folds = 5
    max_depth = 10
    min_size = 1
    sample_size = 1.0
    n_features = int(sqrt(len(dataset[0]) - 1))
    for n_trees in [1, 5, 10]:
        scores = evaluate_algorithm(dataset, random_forest, n_folds, max_depth, min_size, sample_size, n_trees, n_features)
        print('Trees: %d' % n_trees)
        print('Scores: %s' % scores)
        print('Mean Accuracy: %.3f%%' % (sum(scores) / float(len(scores))))

     

  • Original article: https://blog.csdn.net/u011868279/article/details/125402941