Decision trees can suffer from high variance, which makes their results fragile to the specific training data used. Building multiple models from samples of your training data, called bagging, can reduce this variance, but the trees remain highly correlated.
Random Forest is an extension of bagging that, in addition to building trees from multiple samples of your training data, also constrains the features that can be used to build the trees, forcing the trees to be different.
After completing this tutorial, you will know how the decision tree procedure is modified for Random Forest and how to apply the algorithm to a real-world predictive modeling problem.
1.1.1 Random Forest Algorithm
Decision trees involve the greedy selection of the best split point from the dataset at each step.
The Random Forest algorithm forces the decision trees to be different by limiting the features (columns) that the greedy algorithm can evaluate at each split point when creating the tree.

These steps provide the foundation you need to implement and apply the Random Forest algorithm to your own predictive modeling problems.
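To make the feature-limiting idea concrete, here is a minimal sketch (not part of the tutorial code; the helper name sample_features() is illustrative only) of drawing a random subset of column indices. The get_split() function described below embeds this same loop directly.

from random import randrange

# Illustrative sketch: draw n_features distinct column indices at random.
# Each tree evaluates only these columns when choosing a split point.
def sample_features(n_total_columns, n_features):
    features = list()
    while len(features) < n_features:
        index = randrange(n_total_columns)
        if index not in features:
            features.append(index)
    return features

# Example: pick 3 of 10 input columns for one split point.
print(sample_features(10, 3))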
Split points are chosen by finding the attribute and the value of that attribute that results in the lowest cost.
Finding the best split point in a decision tree involves evaluating the cost of each value in the training dataset for each input variable.
Below is a function named get_split() that implements this procedure. It makes use of the following helpers and internal steps:
test_split() : split the dataset by a candidate split point.
gini_index() : evaluate the cost of a given split by the groups of rows created.
features : a list of features is created by randomly selecting feature indices and adding them to a list.
split points : this list of features is then enumerated and specific values in the training dataset are evaluated as candidate split points.
# Select the best split point for a dataset
def get_split(dataset, n_features):
    class_values = list(set(row[-1] for row in dataset))
    b_index, b_value, b_score, b_groups = 999, 999, 999, None
    features = list()
    # randomly select n_features distinct feature indices to evaluate
    while len(features) < n_features:
        index = randrange(len(dataset[0])-1)
        if index not in features:
            features.append(index)
    # evaluate every value of every selected feature as a candidate split point
    for index in features:
        for row in dataset:
            groups = test_split(index, row[index], dataset)
            gini = gini_index(groups, class_values)
            if gini < b_score:
                b_index, b_value, b_score, b_groups = index, row[index], gini, groups
    return {'index': b_index, 'value': b_value, 'groups': b_groups}
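As a quick illustration (not from the original tutorial), get_split() can be exercised on a tiny hand-made dataset. This sketch assumes the test_split() and gini_index() functions from the complete listing below are already defined, and the toy values are made up purely for demonstration.

# Hypothetical usage sketch: requires test_split() and gini_index() from the
# complete listing below to be defined in the same module.
from random import seed, randrange

seed(1)
toy_dataset = [
    [2.7, 1.1, 0],
    [1.3, 2.8, 0],
    [7.5, 3.2, 1],
    [9.0, 2.5, 1],
]
# consider only 1 of the 2 input columns when choosing the split
node = get_split(toy_dataset, 1)
print(node['index'], node['value'])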
Now that we know how the decision tree algorithm can be modified for use with the Random Forest algorithm, we can piece this together with an implementation of bagging and apply it to a real-world dataset.
We will use load_csv(), str_column_to_float() and str_column_to_int() to load and prepare the dataset:
load_csv() : load the dataset from file.
str_column_to_float() : convert the string input columns to numeric values.
str_column_to_int() : convert the output column from strings to the integer values 0 and 1.
We will use k-fold cross-validation to estimate the performance of the learned model on unseen data. This means that we will construct and evaluate k models and estimate the performance as the mean model error. Classification accuracy will be used to evaluate each model. These behaviors are provided by the following helper functions:
cross_validation_split() : split the dataset into k folds.
accuracy_metric() : calculate the classification accuracy percentage.
evaluate_algorithm() : evaluate the algorithm using a cross-validation split.
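With n_folds set to 5, cross_validation_split() places int(len(dataset)/5) rows in each fold; for the 208-row sonar dataset that is 41 rows per fold, so the 3 leftover rows are simply not used by this simple splitting procedure.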
We will use an implementation of the classification and regression trees (CART) algorithm adapted for bagging, together with the following helper functions:
test_split() : split a dataset into groups based on an attribute and an attribute value.
gini_index() : evaluate a split point.
get_split() : select the best split point for a dataset.
to_terminal(), split() and build_tree() : used to create a single decision tree.
predict() : make a prediction with a decision tree.
subsample() : make a random subsample of the training dataset with replacement.
bagging_predict() : make a prediction with a list of decision trees.
random_forest() : create a list of decision trees from subsamples of the training dataset and then use them to make predictions.
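Note that if n_features were set to the total number of input columns, get_split() would consider every feature at every split and the procedure would reduce to plain bagged decision trees; the random forest behavior comes from keeping n_features small, the square root of the number of input features in the example below.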
The key difference between Random Forest and bagged decision trees is the one small change to the way that trees are created, here in the get_split() function. The complete example is listed below.
# Random Forest Algorithm on Sonar Dataset
from random import seed
from random import randrange
from csv import reader
from math import sqrt

# Load a CSV file
def load_csv(filename):
    dataset = list()
    with open(filename, 'r') as file:
        csv_reader = reader(file)
        for row in csv_reader:
            if not row:
                continue
            dataset.append(row)
    return dataset

# Convert string column to float
def str_column_to_float(dataset, column):
    for row in dataset:
        row[column] = float(row[column].strip())

# Convert string column to integer
def str_column_to_int(dataset, column):
    class_values = [row[column] for row in dataset]
    unique = set(class_values)
    lookup = dict()
    for i, value in enumerate(unique):
        lookup[value] = i
    for row in dataset:
        row[column] = lookup[row[column]]
    return lookup

# Split a dataset into k folds
def cross_validation_split(dataset, n_folds):
    dataset_split = list()
    dataset_copy = list(dataset)
    fold_size = int(len(dataset) / n_folds)
    for i in range(n_folds):
        fold = list()
        while len(fold) < fold_size:
            index = randrange(len(dataset_copy))
            fold.append(dataset_copy.pop(index))
        dataset_split.append(fold)
    return dataset_split

# Calculate accuracy percentage
def accuracy_metric(actual, predicted):
    correct = 0
    for i in range(len(actual)):
        if actual[i] == predicted[i]:
            correct += 1
    return correct / float(len(actual)) * 100.0

# Evaluate an algorithm using a cross validation split
def evaluate_algorithm(dataset, algorithm, n_folds, *args):
    folds = cross_validation_split(dataset, n_folds)
    scores = list()
    for fold in folds:
        train_set = list(folds)
        train_set.remove(fold)
        train_set = sum(train_set, [])
        test_set = list()
        for row in fold:
            row_copy = list(row)
            test_set.append(row_copy)
            row_copy[-1] = None
        predicted = algorithm(train_set, test_set, *args)
        actual = [row[-1] for row in fold]
        accuracy = accuracy_metric(actual, predicted)
        scores.append(accuracy)
    return scores

# Split a dataset based on an attribute and an attribute value
def test_split(index, value, dataset):
    left, right = list(), list()
    for row in dataset:
        if row[index] < value:
            left.append(row)
        else:
            right.append(row)
    return left, right

# Calculate the Gini index for a split dataset
def gini_index(groups, classes):
    # count all samples at split point
    n_instances = float(sum([len(group) for group in groups]))
    # sum weighted Gini index for each group
    gini = 0.0
    for group in groups:
        size = float(len(group))
        # avoid divide by zero
        if size == 0:
            continue
        score = 0.0
        # score the group based on the score for each class
        for class_val in classes:
            p = [row[-1] for row in group].count(class_val) / size
            score += p * p
        # weight the group score by its relative size
        gini += (1.0 - score) * (size / n_instances)
    return gini

# Select the best split point for a dataset
def get_split(dataset, n_features):
    class_values = list(set(row[-1] for row in dataset))
    b_index, b_value, b_score, b_groups = 999, 999, 999, None
    features = list()
    while len(features) < n_features:
        index = randrange(len(dataset[0])-1)
        if index not in features:
            features.append(index)
    for index in features:
        for row in dataset:
            groups = test_split(index, row[index], dataset)
            gini = gini_index(groups, class_values)
            if gini < b_score:
                b_index, b_value, b_score, b_groups = index, row[index], gini, groups
    return {'index': b_index, 'value': b_value, 'groups': b_groups}

# Create a terminal node value
def to_terminal(group):
    outcomes = [row[-1] for row in group]
    return max(set(outcomes), key=outcomes.count)

# Create child splits for a node or make terminal
def split(node, max_depth, min_size, n_features, depth):
    left, right = node['groups']
    del(node['groups'])
    # check for a no split
    if not left or not right:
        node['left'] = node['right'] = to_terminal(left + right)
        return
    # check for max depth
    if depth >= max_depth:
        node['left'], node['right'] = to_terminal(left), to_terminal(right)
        return
    # process left child
    if len(left) < min_size:
        node['left'] = to_terminal(left)
    else:
        node['left'] = get_split(left, n_features)
        split(node['left'], max_depth, min_size, n_features, depth+1)
    # process right child
    if len(right) < min_size:
        node['right'] = to_terminal(right)
    else:
        node['right'] = get_split(right, n_features)
        split(node['right'], max_depth, min_size, n_features, depth+1)

# Build a decision tree
def build_tree(train, max_depth, min_size, n_features):
    root = get_split(train, n_features)
    split(root, max_depth, min_size, n_features, 1)
    return root

# Make a prediction with a decision tree
def predict(node, row):
    if row[node['index']] < node['value']:
        if isinstance(node['left'], dict):
            return predict(node['left'], row)
        else:
            return node['left']
    else:
        if isinstance(node['right'], dict):
            return predict(node['right'], row)
        else:
            return node['right']

# Create a random subsample from the dataset with replacement
def subsample(dataset, ratio):
    sample = list()
    n_sample = round(len(dataset) * ratio)
    while len(sample) < n_sample:
        index = randrange(len(dataset))
        sample.append(dataset[index])
    return sample

# Make a prediction with a list of bagged trees
def bagging_predict(trees, row):
    predictions = [predict(tree, row) for tree in trees]
    return max(set(predictions), key=predictions.count)

# Random Forest Algorithm
def random_forest(train, test, max_depth, min_size, sample_size, n_trees, n_features):
    trees = list()
    for i in range(n_trees):
        sample = subsample(train, sample_size)
        tree = build_tree(sample, max_depth, min_size, n_features)
        trees.append(tree)
    predictions = [bagging_predict(trees, row) for row in test]
    return predictions

# Test the random forest algorithm on the sonar dataset
seed(2)

# load and prepare data
filename = 'sonar.all-data.csv'
dataset = load_csv(filename)

# convert string attributes to floats
for i in range(len(dataset[0])-1):
    str_column_to_float(dataset, i)

# convert class column to integers
str_column_to_int(dataset, len(dataset[0])-1)

# evaluate algorithm
n_folds = 5
max_depth = 10
min_size = 1
sample_size = 1.0
n_features = int(sqrt(len(dataset[0])-1))
for n_trees in [1, 5, 10]:
    scores = evaluate_algorithm(dataset, random_forest, n_folds, max_depth, min_size, sample_size, n_trees, n_features)
    print('Trees: %d' % n_trees)
    print('Scores: %s' % scores)
    print('Mean Accuracy: %.3f%%' % (sum(scores)/float(len(scores))))
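Running the example requires the sonar dataset to be saved as sonar.all-data.csv in the current working directory. The loop at the bottom builds forests of 1, 5 and 10 trees and prints the per-fold cross-validation scores and the mean classification accuracy for each forest size; because the trees are built from random subsamples and random feature subsets, the exact scores depend on the random seed.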
