Random forests produce highly accurate classifiers and are widely used for pattern-recognition problems. However, a random forest gives every decision tree an equal vote, which to some extent limits the performance of the ensemble. GWRFC (Geographically Weighted Random Forest Classification) addresses this by introducing a second training pass, improving classification accuracy. At present there are few Chinese-language articles documenting GWRFC, so anyone wanting to use the method runs into obstacle after obstacle. This post therefore builds a model in R and walks through the workflow for readers to follow.
First, a quick word on random forests. Attentive readers will notice I have already introduced them in another chapter, so here we only briefly recap.
Early on, Breiman used bagging to grow decision trees from training data drawn at random with replacement from the training set; Dietterich later grew trees by picking, at each node, one split at random from among the top k candidate splits; another approach selects training data from weighted random subsets of the training set. A random forest takes K decision trees as base classifiers and combines them by ensemble learning into one composite classifier. The algorithm runs in three steps. First, bootstrap sampling draws K training sets from the original data; each bootstrap sample is the same size as the original data but contains roughly two-thirds (about 63.2%) of the distinct original observations. Second, a classification-and-regression tree is built on each bootstrap set, yielding a "forest" of K trees, none of which is pruned; as each tree grows, an internal node is split not on the best of all M attributes but on the best of m ≤ M attributes chosen at random. Finally, because the trees are trained independently of one another, the forest can be trained in parallel, which greatly speeds up model building. Combining the K trees trained this way yields a random forest. When a sample is presented for classification, the forest's output is decided by a simple majority vote over the individual trees' outputs.
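The bootstrap and voting steps above can be sketched in a few lines of base R. The numbers and class labels here are invented for illustration; a real forest would grow a tree on each bootstrap sample.

```r
# Step 1: a bootstrap sample the size of the original data contains
# roughly two-thirds (about 63.2%, i.e. 1 - 1/e) of the distinct observations.
set.seed(1)
n <- 10000
boot <- sample(n, n, replace = TRUE)
frac_unique <- length(unique(boot)) / n   # close to 1 - exp(-1) = 0.632

# Final step: the forest's prediction is a simple majority vote over the trees.
votes <- c("forest", "forest", "cleared", "forest", "cleared")  # toy tree outputs
majority <- names(which.max(table(votes)))                      # "forest"
```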
In short, a random forest builds a set of independent, identically distributed decision trees, classifies a sample with each of them, and decides the final class by a vote across the trees. The two randomization steps mean the resulting trees differ in classification strength, yet in the final decision strong and weak classifiers carry the same voting weight, and this drags down the overall performance of the random-forest classifier.
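The problem, and the kind of remedy GWRFC builds on, can be illustrated with a toy weighted vote (the weights are invented for this example): once trees carry individual weights, two strong trees can outvote three weak ones.

```r
# Unweighted voting: class "B" wins 3 votes to 2.
votes   <- c("A", "A", "B", "B", "B")
# Per-tree weights, e.g. derived from each tree's accuracy (assumed values here).
weights <- c(0.9, 0.8, 0.2, 0.3, 0.25)
scores  <- tapply(weights, votes, sum)    # A: 1.70, B: 0.75
winner  <- names(which.max(scores))       # "A" wins the weighted vote
```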
BREIMAN L. Bagging predictors[J]. Machine Learning, 1996, 24(2): 123-140.
DIETTERICH T G. An experimental comparison of three methods for constructing ensembles of decision trees: bagging, boosting, and randomization[J]. Machine Learning, 2000, 40(2): 139-157.
Without further ado, let's look at the code!
1. Install dependencies
#required libraries
list.of.packages <- c("caret","digest","doParallel","foreach","foreign","fpc","ggplot2","gtools","GWmodel","jpeg","kohonen","mclust","NbClust","parallel","plyr","pracma","ranger","raster","reshape",
"rgdal","rgeos","scales","spdep","spgwr","stringr","tmap","zoo")
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[,"Package"])]
if(length(new.packages)){install.packages(new.packages)}
lapply(list.of.packages, require, character.only = TRUE)
#install GWRFC from GitHub (not on CRAN)
require(devtools)
install_github("FSantosCodes/GWRFC")
library(GWRFC)
2. Load the sample data
#view deforestation data
data("deforestation")
tmap_mode("view")
tm_basemap("OpenStreetMap") +
  tm_shape(deforestation) +
  tm_polygons(col="fao",style="cat",title="Annual deforestation rate 2000-2010 (FAO) - categorical",palette="YlOrRd")
3. Apply the model
#run GWRFC
GWRFC(input_shapefile = deforestation, #a spatial dataframe (points or polygons) or the complete filename of the shapefile to analyze.
      remove_columns = c("ID_grid","L_oth"), #removes uninformative variables. Use NA to skip removal.
      dependent_varName = "fao", #the dependent variable to evaluate. It should be of factor or character data type.
      kernel_function = "exponential", #the weighting function. See the help for other available functions.
      kernel_adaptative = T, #TRUE for an adaptive kernel distance, FALSE for a fixed kernel distance.
      kernel_bandwidth = 400, #with an adaptive kernel, 400 is the minimum number of observations used per local model.
      upsampling = T, #improves accuracy (recommended) but is somewhat more computationally costly.
      save_models = T, #save the RF models. Beware of hard-disk space and extra processing time.
      enable_pdp = F, #experimental; use with caution, as it is sensitive to noise.
      number_cores = 3, #the number of CPU cores to use.
      output_folder = "E:/demo/deforestation") #check this folder for the GWRFC outputs.
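To see what the exponential kernel and the adaptive bandwidth of 400 mean, here is the standard geographically weighted exponential form w = exp(-d/b), where b is taken as the distance to the k-th nearest observation. The exact formula GWRFC applies internally may differ, so treat this as an illustrative sketch with made-up distances.

```r
# Exponential distance-decay weights for one local model (illustrative distances).
d  <- c(0, 100, 200, 400, 800)   # distances from the focal location to neighbours
bw <- 400                        # adaptive bandwidth: distance to the k-th neighbour
w  <- exp(-d / bw)               # weight 1 at the focal point, decaying with distance
```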
4. Fuse the results
#clustering the GWRFC LVI outputs
LVIclust(input_LVI = "E:/demo/deforestation/GWRFC_ADP_400_EX_LVI.shp", #filename of the GWRFC LVI output.
         remove_columns = NA, #no variables to remove here.
         method_clustering = "ward.D2", #hierarchical clustering is applied here.
         ncluster = 4, #number of clusters.
         plots = T, #available only for the hierarchical clustering methods and kohonen.
         output_folder = "E:/demo/deforestation") #check this folder for the outputs generated by the function.
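Under the hood, the ward.D2 option is ordinary hierarchical clustering. A base-R stand-in on random data shows the same method and cluster count; the real input would be the importance columns of the LVI shapefile, which are only imitated here.

```r
# Hierarchical clustering with Ward's criterion, as in method_clustering above.
set.seed(7)
X  <- matrix(rnorm(40), ncol = 2)           # stand-in for local variable importances
hc <- hclust(dist(X), method = "ward.D2")   # build the dendrogram
cl <- cutree(hc, k = 4)                     # ncluster = 4: one label per observation
```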
Simple enough, right? I hope you picked something up. Message me if you have any questions!