• GEO生信数据挖掘(九)WGCNA分析


    第六节,我们使用结核病基因数据,做了一个数据预处理的实操案例。例子中结核类型,包括结核,潜隐进展,对照和潜隐,四个类别。第七节延续上个数据,进行了差异分析。 第八节对差异基因进行富集分析。本节进行WGCNA分析。

    更新版本:GEO生信数据挖掘(九)肺结核数据-差异分析-WGCNA分析(900行代码整理注释更新版本)

    目录

    加载数据,进行聚类

    初次聚类观察

    自己定义红线位置,进行切割划分

    载入性状数据

    增加形状信息后,再次聚类

    网络构建

    选取soft-thresholding powers

    基于tom的差异的基因聚类,绘制聚类树

    根据聚类情况,设置颜色

    计算eigengenes

    模块的自动合并

    模块与临床形状的关系热图 (关键数据)

    红色模块样本表达情况(相关性大)

    产生了很多数据(各个模块的和临床性状)

    后续挖掘核心基因时,需要用到Cytoscape,生成绘图所需要的数据


    加载数据,进行聚类

    1. library(WGCNA)
    2. #读取目录名称,方便复制粘贴
    3. dir()
    4. #加载数据
    5. load('DEG_TB_LTBI_step13.Rdata')
    6. #这里行为样品名,列为基因名
    7. ################################样品聚类####################
    8. datExpr = t(dataset_TB_LTBI_DEG)
    9. #初次聚类
    10. sampleTree = hclust(dist(datExpr), method = "average")
    11. # Plot the sample tree: Open a graphic output window of size 20 by 15 inches
    12. # The user should change the dimensions if the window is too large or too small.
    13. sizeGrWindow(12,9)
    14. #pdf(file='sampleCluestering.pdf',width = 12,height = 9)
    15. par(cex=0.6)
    16. par(mar=c(0,4,2,0))
    17. plot(sampleTree, main = "Sample clustering to detect outliers", sub="", xlab="", cex.lab = 1.5,
    18. cex.axis = 1.5, cex.main = 2)
    19. #结果图片自己导出PDF,文件名=1_sampleClustering.pdf
    20. ### Plot a line to show the cut
    21. abline(h = 87, col = "red")##剪切高度不确定,故无红线
    22. dev.off()

    初次聚类观察

    自己定义红线位置,进行切割划分

    本例发现右侧有些样本孤立,适合被剔除,设置红线87切割。

    左侧也被切成两块,需要做处理,保留。

    1. ### Determine cluster under the line
    2. clust = cutreeStatic(sampleTree, cutHeight = 87, minSize = 10)
    3. table(clust)
    4. #clust
    5. #0 1 2
    6. #5 57 40
    7. ### clust 1 contains the samples we want to keep.
    8. keepSamples = (clust==1|clust==2)
    9. datExpr0 = datExpr[keepSamples, ]
    10. dim(datExpr0) #[1] 97 2813
    11. #保存数据
    12. save(datExpr0,file='datExpr0_cluster_filter.Rdata')

    载入性状数据

    匹配样本名称,性状数据与表达数据保证一致

    1. #################### 载入性状数据###########################
    2. #加载自己的性状数据
    3. load('design_TB_LTBI.Rdata')
    4. traitData=design
    5. #Loading clinical trait data
    6. #traitData = read.table("trait_D.txt",row.names=1,header=T,comment.char = "",check.names=F)########trait file name can be changed######性状数据文件名,根据实际修改,如果工作路径不是实际性状数据路径,需要添加正确的数据路径
    7. dim(traitData)
    8. #names(traitData)
    9. # remove columns that hold information we do not need.
    10. #allTraits = traitData
    11. dim(traitData)
    12. names(traitData)
    13. # Form a data frame analogous to expression data that will hold the clinical traits.
    14. fpkmSamples = rownames(datExpr0)
    15. traitSamples =rownames(traitData)
    16. #匹配样本名称,性状数据与表达数据保证一致
    17. traitRows = match(fpkmSamples, traitSamples)
    18. datTraits = traitData[traitRows,]
    19. rownames(datTraits)
    20. collectGarbage()

    增加形状信息后,再次聚类

    1. # Re-cluster samples
    2. sampleTree2 = hclust(dist(datExpr0), method = "average")
    3. # Convert traits to a color representation: white means low, red means high, grey means missing entry
    4. traitColors = numbers2colors(datTraits, signed = FALSE)
    5. # Plot the sample dendrogram and the colors underneath.
    6. #sizeGrWindow(20,20)
    7. ##pdf(file="2_Sample dendrogram and trait heatmap.pdf",width=12,height=12)
    8. plotDendroAndColors(sampleTree2, traitColors,
    9. groupLabels = names(datTraits),
    10. main = "Sample dendrogram and trait heatmap")
    11. dev.off()

    下方红色,大致分成了两类,效果不错。

    网络构建

    1. #############################network constr########################################
    2. # Allow multi-threading within WGCNA. At present this call is necessary.
    3. # Any error here may be ignored but you may want to update WGCNA if you see one.
    4. # Caution: skip this line if you run RStudio or other third-party R environments.
    5. # See note above.
    6. enableWGCNAThreads()
    7. # Choose a set of soft-thresholding powers
    8. powers = c(1:15)
    9. # Call the network topology analysis function
    10. sft = pickSoftThreshold(datExpr0, powerVector = powers, verbose = 5)
    11. # Plot the results:
    12. sizeGrWindow(15, 9)
    13. #pdf(file="3_Scale independence.pdf",width=9,height=5)
    14. #pdf(file="Rplot03.pdf",width=9,height=5)
    15. par(mfrow = c(1,2))
    16. cex1 = 0.9
    17. # Scale-free topology fit index as a function of the soft-thresholding power
    18. plot(sft$fitIndices[,1], -sign(sft$fitIndices[,3])*sft$fitIndices[,2],
    19. xlab="Soft Threshold (power)",ylab="Scale Free Topology Model Fit,signed R^2",type="n",
    20. main = paste("Scale independence"));
    21. text(sft$fitIndices[,1], -sign(sft$fitIndices[,3])*sft$fitIndices[,2],
    22. labels=powers,cex=cex1,col="red");
    23. # this line corresponds to using an R^2 cut-off of h
    24. abline(h=0.90,col="red")
    25. # Mean connectivity as a function of the soft-thresholding power
    26. plot(sft$fitIndices[,1], sft$fitIndices[,5],
    27. xlab="Soft Threshold (power)",ylab="Mean Connectivity", type="n",
    28. main = paste("Mean connectivity"))
    29. text(sft$fitIndices[,1], sft$fitIndices[,5], labels=powers, cex=cex1,col="red")
    30. dev.off()

    选取soft-thresholding powers

    测试阈值,注意观察,突破红线的附近时取值,下方代码时候的是自适应的方法选取 soft-thresholding powers

    1. ######chose the softPower
    2. #datExpr0= datExpr0[,-1]
    3. softPower =sft$powerEstimate
    4. adjacency = adjacency(datExpr0, power = softPower)
    5. ##### Turn adjacency into topological overlap
    6. TOM = TOMsimilarity(adjacency);
    7. dissTOM = 1-TOM
    8. # Call the hierarchical clustering function
    9. geneTree = hclust(as.dist(dissTOM), method = "average");
    10. # Plot the resulting clustering tree (dendrogram)
    11. #sizeGrWindow(12,9)
    12. pdf(file="4_Gene clustering on TOM-based dissimilarity.pdf",width=12,height=9)
    13. plot(geneTree, xlab="", sub="", main = "Gene clustering on TOM-based dissimilarity",
    14. labels = FALSE, hang = 0.04)
    15. dev.off()

    基于tom的差异的基因聚类,绘制聚类树

    根据聚类情况,设置颜色

    1. # We like large modules, so we set the minimum module size relatively high:
    2. minModuleSize = 30
    3. # Module identification using dynamic tree cut:
    4. dynamicMods = cutreeDynamic(dendro = geneTree, distM = dissTOM,
    5. deepSplit = 2, pamRespectsDendro = FALSE,
    6. minClusterSize = minModuleSize);
    7. table(dynamicMods)
    8. # Convert numeric lables into colors
    9. dynamicColors = labels2colors(dynamicMods)
    10. table(dynamicColors)
    11. # Plot the dendrogram and colors underneath
    12. #sizeGrWindow(8,6)
    13. pdf(file="5_Dynamic Tree Cut.pdf",width=8,height=6)
    14. plotDendroAndColors(geneTree, dynamicColors, "Dynamic Tree Cut",
    15. dendroLabels = FALSE, hang = 0.03,
    16. addGuide = TRUE, guideHang = 0.05,
    17. main = "Gene dendrogram and module colors")
    18. dev.off()

    计算eigengenes

    1. # Calculate eigengenes
    2. MEList = moduleEigengenes(datExpr0, colors = dynamicColors)
    3. MEs = MEList$eigengenes
    4. # Calculate dissimilarity of module eigengenes
    5. MEDiss = 1-cor(MEs);
    6. # Cluster module eigengenes
    7. METree = hclust(as.dist(MEDiss), method = "average")
    8. # Plot the result
    9. #sizeGrWindow(7, 6)
    10. pdf(file="6_Clustering of module eigengenes.pdf",width=7,height=6)
    11. plot(METree, main = "Clustering of module eigengenes",
    12. xlab = "", sub = "")
    13. MEDissThres = 0.25######剪切高度可修改
    14. # Plot the cut line into the dendrogram
    15. abline(h=MEDissThres, col = "red")
    16. dev.off()

    模块的自动合并

    1. # Call an automatic merging function
    2. merge = mergeCloseModules(datExpr0, dynamicColors, cutHeight = MEDissThres, verbose = 3)
    3. # The merged module colors
    4. mergedColors = merge$colors
    5. # Eigengenes of the new merged modules:
    6. mergedMEs = merge$newMEs
    7. #sizeGrWindow(12, 9)
    8. pdf(file="7_merged dynamic.pdf", width = 9, height = 6)
    9. plotDendroAndColors(geneTree, cbind(dynamicColors, mergedColors),
    10. c("Dynamic Tree Cut", "Merged dynamic"),
    11. dendroLabels = FALSE, hang = 0.03,
    12. addGuide = TRUE, guideHang = 0.05)
    13. dev.off()
    14. # Rename to moduleColors
    15. moduleColors = mergedColors
    16. # Construct numerical labels corresponding to the colors
    17. colorOrder = c("grey", standardColors(50))
    18. moduleLabels = match(moduleColors, colorOrder)-1
    19. MEs = mergedMEs
    20. # Save module colors and labels for use in subsequent parts
    21. save(MEs, TOM, dissTOM, moduleLabels, moduleColors, geneTree, sft, file = "networkConstruction-stepByStep.RData")

    模块与临床形状的关系热图 (关键数据)

    1. #############################relate modules to external clinical triats######################################
    2. # Define numbers of genes and samples
    3. nGenes = ncol(datExpr0)
    4. nSamples = nrow(datExpr0)
    5. moduleTraitCor = cor(MEs, datTraits, use = "p")
    6. moduleTraitPvalue = corPvalueStudent(moduleTraitCor, nSamples)
    7. #sizeGrWindow(10,6)
    8. pdf(file="8_Module-trait relationships.pdf",width=10,height=6)
    9. # Will display correlations and their p-values
    10. textMatrix = paste(signif(moduleTraitCor, 2), "\n(",
    11. signif(moduleTraitPvalue, 1), ")", sep = "")
    12. dim(textMatrix) = dim(moduleTraitCor)
    13. par(mar = c(6, 8.5, 3, 3))
    14. # Display the correlation values within a heatmap plot #修改性状类型 data.frame
    15. labeledHeatmap(Matrix = moduleTraitCor,
    16. xLabels = names(data.frame(datTraits)),
    17. yLabels = names(MEs),
    18. ySymbols = names(MEs),
    19. colorLabels = FALSE,
    20. colors = greenWhiteRed(50),
    21. textMatrix = textMatrix,
    22. setStdMargins = FALSE,
    23. cex.text = 0.5,
    24. zlim = c(-1,1),
    25. main = paste("Module-trait relationships"))
    26. dev.off()

    挑选相关性最高的,具有统计学意义的(p<0.05),red模块最佳!

    红色模块样本表达情况(相关性大)

    产生了很多数据(各个模块的和临床性状)

    1. ######## Define variable weight containing all column of datTraits
    2. ###MM and GS
    3. # names (colors) of the modules
    4. modNames = substring(names(MEs), 3)
    5. geneModuleMembership = as.data.frame(cor(datExpr0, MEs, use = "p"))
    6. MMPvalue = as.data.frame(corPvalueStudent(as.matrix(geneModuleMembership), nSamples))
    7. names(geneModuleMembership) = paste("MM", modNames, sep="")
    8. names(MMPvalue) = paste("p.MM", modNames, sep="")
    9. #names of those trait
    10. traitNames=names(data.frame(datTraits))
    11. class(datTraits)
    12. geneTraitSignificance = as.data.frame(cor(datExpr0, datTraits, use = "p"))
    13. GSPvalue = as.data.frame(corPvalueStudent(as.matrix(geneTraitSignificance), nSamples))
    14. names(geneTraitSignificance) = paste("GS.", traitNames, sep="")
    15. names(GSPvalue) = paste("p.GS.", traitNames, sep="")
    16. ####plot MM vs GS for each trait vs each module
    17. ##########example:royalblue and CK
    18. module="red"
    19. column = match(module, modNames)
    20. moduleGenes = moduleColors==module
    21. trait="TB"
    22. traitColumn=match(trait,traitNames)
    23. sizeGrWindow(7, 7)
    24. #par(mfrow = c(1,1))
    25. verboseScatterplot(abs(geneModuleMembership[moduleGenes, column]),
    26. abs(geneTraitSignificance[moduleGenes, traitColumn]),
    27. xlab = paste("Module Membership in", module, "module"),
    28. ylab = paste("Gene significance for ",trait),
    29. main = paste("Module membership vs. gene significance\n"),
    30. cex.main = 1.2, cex.lab = 1.2, cex.axis = 1.2, col = module)
    31. ######
    32. for (trait in traitNames){
    33. traitColumn=match(trait,traitNames)
    34. for (module in modNames){
    35. column = match(module, modNames)
    36. moduleGenes = moduleColors==module
    37. if (nrow(geneModuleMembership[moduleGenes,]) > 1){####进行这部分计算必须每个模块内基因数量大于2,由于前面设置了最小数量是30,这里可以不做这个判断,但是grey有可能会出现1个gene,它会导致代码运行的时候中断,故设置这一步
    38. #sizeGrWindow(7, 7)
    39. pdf(file=paste("9_", trait, "_", module,"_Module membership vs gene significance.pdf",sep=""),width=7,height=7)
    40. par(mfrow = c(1,1))
    41. verboseScatterplot(abs(geneModuleMembership[moduleGenes, column]),
    42. abs(geneTraitSignificance[moduleGenes, traitColumn]),
    43. xlab = paste("Module Membership in", module, "module"),
    44. ylab = paste("Gene significance for ",trait),
    45. main = paste("Module membership vs. gene significance\n"),
    46. cex.main = 1.2, cex.lab = 1.2, cex.axis = 1.2, col = module)
    47. dev.off()
    48. }
    49. }
    50. }
    51. #####
    52. names(datExpr0)
    53. probes = names(datExpr0)
    54. #################export GS and MM###############
    55. geneInfo0 = data.frame(probes= probes,
    56. moduleColor = moduleColors)
    57. for (Tra in 1:ncol(geneTraitSignificance))
    58. {
    59. oldNames = names(geneInfo0)
    60. geneInfo0 = data.frame(geneInfo0, geneTraitSignificance[,Tra],
    61. GSPvalue[, Tra])
    62. names(geneInfo0) = c(oldNames,names(geneTraitSignificance)[Tra],
    63. names(GSPvalue)[Tra])
    64. }
    65. for (mod in 1:ncol(geneModuleMembership))
    66. {
    67. oldNames = names(geneInfo0)
    68. geneInfo0 = data.frame(geneInfo0, geneModuleMembership[,mod],
    69. MMPvalue[, mod])
    70. names(geneInfo0) = c(oldNames,names(geneModuleMembership)[mod],
    71. names(MMPvalue)[mod])
    72. }
    73. geneOrder =order(geneInfo0$moduleColor)
    74. geneInfo = geneInfo0[geneOrder, ]
    75. write.table(geneInfo, file = "10_GS_and_MM.xls",sep="\t",row.names=F)
    1. ####################################################Visualizing the gene network#######################################################
    2. nGenes = ncol(datExpr0)
    3. nSamples = nrow(datExpr0)
    4. # Transform dissTOM with a power to make moderately strong connections more visible in the heatmap
    5. plotTOM = dissTOM^7
    6. # Set diagonal to NA for a nicer plot
    7. diag(plotTOM) = NA
    8. # Call the plot function
    9. sizeGrWindow(9,9) #这个耗电脑内存
    10. pdf(file="12_Network heatmap plot_all gene.pdf",width=9, height=9)
    11. TOMplot(plotTOM, geneTree, moduleColors, main = "Network heatmap plot, all genes")
    12. dev.off()
    13. nSelect = 400
    14. # For reproducibility, we set the random seed
    15. set.seed(10)
    16. select = sample(nGenes, size = nSelect)
    17. selectTOM = dissTOM[select, select]
    18. # There's no simple way of restricting a clustering tree to a subset of genes, so we must re-cluster.
    19. selectTree = hclust(as.dist(selectTOM), method = "average")
    20. selectColors = moduleColors[select]
    21. # Open a graphical window
    22. #sizeGrWindow(9,9)
    23. # Taking the dissimilarity to a power, say 10, makes the plot more informative by effectively changing
    24. # the color palette; setting the diagonal to NA also improves the clarity of the plot
    25. plotDiss = selectTOM^7
    26. diag(plotDiss) = NA
    27. pdf(file="13_Network heatmap plot_selected genes.pdf",width=9, height=9)
    28. TOMplot(plotDiss, selectTree, selectColors, main = "Network heatmap plot, selected genes")
    29. dev.off()
    30. ####################################################Visualizing the gene network of eigengenes####################################################
    31. #sizeGrWindow(5,7.5)
    32. pdf(file="14_Eigengene dendrogram and Eigengene adjacency heatmap.pdf", width=5, height=7.5)
    33. par(cex = 0.9)
    34. plotEigengeneNetworks(MEs, "", marDendro = c(0,4,1,2), marHeatmap = c(3,4,1,2), cex.lab = 0.8, xLabelsAngle= 90)
    35. dev.off()
    36. #or devide into two parts
    37. # Plot the dendrogram
    38. #sizeGrWindow(6,6);
    39. pdf(file="15_Eigengene dendrogram_2.pdf",width=6, height=6)
    40. par(cex = 1.0)
    41. plotEigengeneNetworks(MEs, "Eigengene dendrogram", marDendro = c(0,4,2,0), plotHeatmaps = FALSE)
    42. dev.off()
    43. pdf(file="15_Eigengene adjacency heatmap_2.pdf",width=6, height=6)
    44. # Plot the heatmap matrix (note: this plot will overwrite the dendrogram plot)
    45. par(cex = 1.0)
    46. plotEigengeneNetworks(MEs, "Eigengene adjacency heatmap", marHeatmap = c(3,4,2,2), plotDendrograms = FALSE, xLabelsAngle = 90)
    47. dev.off()

    后续挖掘核心基因时,需要用到Cytoscape,生成绘图所需要的数据

    1. ###########################Exporting to Cytoscape all one by one ##########################
    2. # Select each module
    3. '''
    4. Error in exportNetworkToCytoscape(modTOM, edgeFile = paste("CytoscapeInput-edges-", :
    5. Cannot determine node names: nodeNames is NULL and adjMat has no dimnames.
    6. datExpr0 格式需要dataframe
    7. '''
    8. modules =module
    9. for (mod in 1:nrow(table(moduleColors)))
    10. {
    11. modules = names(table(moduleColors))[mod]
    12. # Select module probes
    13. probes = names(data.frame(datExpr0)) #
    14. inModule = (moduleColors == modules)
    15. modProbes = probes[inModule]
    16. modGenes = modProbes
    17. # Select the corresponding Topological Overlap
    18. modTOM = TOM[inModule, inModule]
    19. dimnames(modTOM) = list(modProbes, modProbes)
    20. # Export the network into edge and node list files Cytoscape can read
    21. cyt = exportNetworkToCytoscape(modTOM,
    22. edgeFile = paste("CytoscapeInput-edges-", modules , ".txt", sep=""),
    23. nodeFile = paste("CytoscapeInput-nodes-", modules, ".txt", sep=""),
    24. weighted = TRUE,
    25. threshold = 0.02,
    26. nodeNames = modProbes,
    27. altNodeNames = modGenes,
    28. nodeAttr = moduleColors[inModule])
    29. }

    关系网络的构建完毕,绘图找核心基因,Cytoscape 到底怎么玩?

  • 相关阅读:
    MySQL基本操作
    MySQL案例详解 三:MMM高可用架构及其故障切换
    Jmeter使用
    docker 查看容器挂载情况
    阿里云分布式深度学习训练架构Whale
    [CVE-2021-45105] Apache Log4j2 漏洞复现与原理详细分析
    Benchmarking Detection Transfer Learning with Vision Transformers(2021-11)
    Java基础this关键字02
    如何使用oracle数据库的length()、lengthb()、replace()、regexp_substr()函数
    Ajax(2)
  • 原文地址:https://blog.csdn.net/zzh1464501547/article/details/133872767