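The ConsensusClusterPlus function of the ConsensusClusterPlus package determines the number of clusters and cluster membership from stability evidence. The calcICL function computes cluster-consensus and item-consensus.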
ConsensusClusterPlus(d=NULL, maxK=3, reps=10, pItem=0.8, pFeature=1, clusterAlg="hc", title="untitled_consensus_cluster", innerLinkage="average", finalLinkage="average", distance="pearson", ml=NULL, tmyPal=NULL, seed=NULL, plot=NULL, writeTable=FALSE, weightsItem=NULL, weightsFeature=NULL, verbose=F, corUse="everything")
calcICL(res, title="untitled_consensus_cluster", plot=NULL, writeTable=FALSE)
d | data to be clustered: a data matrix with items/samples in columns and features in rows (for example, a gene expression matrix with genes in rows and microarrays in columns), an ExpressionSet object, or a distance object (only when features are not resampled) |
maxK | integer value. maximum cluster number to evaluate. |
reps | integer value. number of subsamples. |
pItem | numerical value. proportion of items to sample. |
pFeature | numerical value. proportion of features to sample. |
clusterAlg | character value. cluster algorithm. 'hc' for hierarchical (hclust), 'pam' for partitioning around medoids, 'km' for k-means upon data matrix, or a function that returns a clustering. See example and vignette for more details. |
title | character value for output directory. Directory is created only if plot is not NULL or writeTable is TRUE. This title can be an absolute or relative path. |
innerLinkage | hierarchical linkage method for subsampling. |
finalLinkage | hierarchical linkage method for consensus matrix. |
distance | character value. 'pearson' (1 - Pearson correlation), 'spearman' (1 - Spearman correlation), 'euclidean', 'binary', 'maximum', 'canberra', 'minkowski', or a custom distance function. |
ml | optional. prior result, if supplied then only do graphics and tables. |
tmyPal | optional character vector of colors for consensus matrix |
seed | optional numerical value. sets random seed for reproducible results. |
plot | character value. NULL - print to screen, 'pdf', 'png', 'pngBMP' for bitmap png, helpful for large datasets. |
writeTable | logical value. TRUE - write output and log to csv. |
weightsItem | optional numerical vector. weights to be used for sampling items. |
weightsFeature | optional numerical vector. weights to be used for sampling features. |
res | result of ConsensusClusterPlus. |
verbose | boolean. If TRUE, print messages to the screen to indicate progress. This is useful for large datasets. |
corUse | optional character value. specifies how to handle missing data in correlation distances: 'everything', 'pairwise.complete.obs', or 'complete.obs'; see cor() for description. |
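To make the 'pearson' and 'spearman' options more concrete, here is a minimal base-R sketch of what "1 - correlation" between samples looks like, followed by one average-linkage clustering. The toy matrix and all variable names are made up for illustration; this is not the package's internal code.

```r
# Toy matrix (hypothetical data): 5 features in rows, 4 samples in columns
set.seed(1)
toy <- matrix(rnorm(20), nrow = 5,
              dimnames = list(paste0("feature", 1:5), paste0("sample", 1:4)))

# cor() correlates columns, so this gives a sample-by-sample distance
pearson_dist  <- as.dist(1 - cor(toy, method = "pearson",  use = "everything"))
spearman_dist <- as.dist(1 - cor(toy, method = "spearman", use = "everything"))

# One hierarchical clustering with average linkage, cut into 2 clusters
tree <- hclust(pearson_dist, method = "average")
clusters <- cutree(tree, k = 2)
print(clusters)
```

The `use` argument plays the role that corUse plays in the argument list above: it controls how cor() treats missing values.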
# if (!require("BiocManager", quietly = TRUE))
#     install.packages("BiocManager")
#
# BiocManager::install("ConsensusClusterPlus")

### 1. Prepare the data
## Rows are features, columns are samples
library(ALL)
data(ALL)
d = exprs(ALL)
d[1:5, 1:5]

# Keep the 5,000 probes with the largest median absolute deviation (MAD)
mads = apply(d, 1, mad)
d = d[rev(order(mads))[1:5000], ]
# order(mads): sorts in ascending order and returns the indices
# rev(order(mads)): the same indices in descending order

d = sweep(d, 1, apply(d, 1, median, na.rm=TRUE))
# sweep: returns an array obtained from an input array
# by sweeping out a summary statistic.
# Here each row has its own median subtracted from it,
# e.g. for the first row: d[1,] - median(d[1,])

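The filtering and centering steps above can be checked on a small matrix. This sketch uses made-up random data and keeps the 4 most variable of 10 rows; the sizes and names are illustrative only.

```r
# Toy matrix (hypothetical data): 10 probes in rows, 6 samples in columns
set.seed(42)
toy <- matrix(rnorm(60), nrow = 10,
              dimnames = list(paste0("probe", 1:10), paste0("sample", 1:6)))

# Rank rows by median absolute deviation and keep the top 4
mads <- apply(toy, 1, mad)
top <- toy[order(mads, decreasing = TRUE)[1:4], ]

# Subtract each row's median from that row (sweep's default FUN is "-")
centered <- sweep(top, 1, apply(top, 1, median, na.rm = TRUE))

# After centering, every row median is zero up to floating-point error
print(round(apply(centered, 1, median), 10))
```

`order(mads, decreasing = TRUE)` gives the same kind of ordering as `rev(order(mads))`; the two can differ only in how ties are ordered.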
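### 2. Run consensus clustering
library(ConsensusClusterPlus)
output_dir = "/Users/zhengxueming/test/test0705"
results = ConsensusClusterPlus(d, maxK=6, reps=50, pItem=0.8, pFeature=1,
                               title=output_dir, clusterAlg="hc", distance="pearson",
                               seed=1213, plot="png")
# str(results)
# str(results[[2]])

## The clustering plots for each k and the cluster-evaluation plots are written to output_dir
# Choose k from the consensus CDF and Delta area plots: starting at k = 2, compare
# the area under the CDF curve at k with that at k-1, and pick the k after which
# the relative increase becomes negligible.
# Tracking plot: items (samples) are columns and each k is a row; the colors show
# each sample's cluster at every k, which helps to spot unstable clusters and
# unstable samples qualitatively.
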
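The idea behind the consensus matrix, the CDF area, and the delta area can be sketched in base R. This is a simplified illustration under stated assumptions (toy data, item resampling only, a trapezoid approximation of the CDF area), not the package's implementation; `consensus_matrix` and `cdf_area` are hypothetical helper names.

```r
# Toy data (hypothetical): 6 features x 12 items forming two opposite groups
set.seed(7)
group_pattern <- c(2, 2, 2, -2, -2, -2)
toy <- cbind(matrix(rnorm(36, mean =  group_pattern), nrow = 6),
             matrix(rnorm(36, mean = -group_pattern), nrow = 6))
colnames(toy) <- paste0("s", 1:12)

# Fraction of subsamples in which each pair of items lands in the same cluster
consensus_matrix <- function(x, k, reps = 50, p_item = 0.8) {
  n <- ncol(x)
  sampled  <- matrix(0, n, n)  # times a pair was drawn together
  together <- matrix(0, n, n)  # times a pair shared a cluster
  for (r in seq_len(reps)) {
    idx  <- sample(n, floor(p_item * n))
    tree <- hclust(as.dist(1 - cor(x[, idx])), method = "average")
    cl   <- cutree(tree, k)
    sampled[idx, idx]  <- sampled[idx, idx] + 1
    together[idx, idx] <- together[idx, idx] + outer(cl, cl, "==")
  }
  together / pmax(sampled, 1)  # pmax avoids 0/0 for pairs never drawn together
}

# Area under the empirical CDF of the pairwise consensus values on [0, 1]
cdf_area <- function(cm) {
  grid <- seq(0, 1, by = 0.01)
  f <- ecdf(cm[lower.tri(cm)])(grid)
  sum(diff(grid) * (head(f, -1) + tail(f, -1)) / 2)
}

ks <- 2:4
areas <- sapply(ks, function(k) cdf_area(consensus_matrix(toy, k)))
# Delta area: the area itself at k = 2, then the relative change from k-1 to k
delta <- c(areas[1], diff(areas) / head(areas, -1))
print(data.frame(k = ks, area = areas, delta = delta))
```

A k after which `delta` stays close to zero is the kind of point the Delta area plot is used to find.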
# The first ten rows and columns of the consensus matrix for k=2:
results[[2]][["consensusMatrix"]][1:10, 1:10]

# Inspect the cluster colors used for k=6
results[[6]][["clrs"]]

# consensusTree: an hclust object
results[[2]][["consensusTree"]]

### 3. Compute cluster-consensus and item-consensus
icl = calcICL(results, title=output_dir, plot="png")
# PNG files whose names start with "icl" are written to output_dir
# icl is a list with the elements "clusterConsensus" and "itemConsensus"
icl[["clusterConsensus"]]
icl[["itemConsensus"]][1:5, ]

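To show what these two quantities measure, the sketch below computes them by hand on a tiny, made-up consensus matrix, using the usual definitions: cluster-consensus is the mean pairwise consensus among a cluster's members, and item-consensus is an item's mean consensus with the members of a given cluster, excluding itself. calcICL may differ in implementation details and output layout; all names and values here are illustrative.

```r
# Hand-written consensus matrix for 4 items (hypothetical values)
items <- c("a", "b", "c", "d")
cm <- matrix(c(1.0, 0.9, 0.1, 0.2,
               0.9, 1.0, 0.2, 0.1,
               0.1, 0.2, 1.0, 0.8,
               0.2, 0.1, 0.8, 1.0),
             nrow = 4, byrow = TRUE, dimnames = list(items, items))
cls <- c(a = 1, b = 1, c = 2, d = 2)  # assumed cluster assignment
ks <- sort(unique(cls))

# Cluster-consensus: mean consensus over all pairs within each cluster
cluster_consensus <- sapply(ks, function(k) {
  m <- cm[cls == k, cls == k, drop = FALSE]
  mean(m[lower.tri(m)])
})

# Item-consensus: each item's mean consensus with the members of each cluster
item_consensus <- sapply(ks, function(k) {
  sapply(names(cls), function(i) {
    others <- setdiff(names(cls)[cls == k], i)
    mean(cm[i, others])
  })
})
colnames(item_consensus) <- paste0("cluster", ks)

print(cluster_consensus)
print(item_consensus)
```

An item whose consensus is high for its own cluster and low for every other cluster is a stable member; comparable values across clusters suggest an ambiguous sample.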
### 4. Choose a suitable k and build a data frame of each sample's cluster
# k = 5 is used here
sample_cluster <- results[[5]]$consensusClass

sample_cluster_df <- data.frame(sample = names(sample_cluster),
                                cluster = sample_cluster)
head(sample_cluster_df)
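Once the assignment is in a data frame, a common next step is to check cluster sizes and save the table. The sketch below uses a made-up named vector in place of `results[[5]]$consensusClass`, and writes to a temporary file so that nothing is assumed about the output directory.

```r
# Stand-in for a consensusClass vector (hypothetical sample names and labels)
sample_cluster <- c(s1 = 1, s2 = 2, s3 = 1, s4 = 3, s5 = 2, s6 = 1)

# row.names = NULL keeps the sample names out of the row names
df <- data.frame(sample = names(sample_cluster),
                 cluster = sample_cluster,
                 row.names = NULL)

# Number of samples per cluster
sizes <- table(df$cluster)
print(sizes)

# Save the assignment; tempfile() is used here only for illustration
out_file <- tempfile(fileext = ".csv")
write.csv(df, out_file, row.names = FALSE)
```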