码农知识堂 - 1000bd
  •   Python
  •   PHP
  •   JS/TS
  •   JAVA
  •   C/C++
  •   C#
  •   GO
  •   Kotlin
  •   Swift
  • 单细胞聚类,究竟聚类聚成多少类比较合适?全代码分享


    单细胞数据分析到最后一步往往都需要聚类,进而给亚群命名。但是我们通常纠结resolution到底选多大为好,究竟聚成多少类比较合适?今天我们使用Silhouette来确定多少类比较合适。

    关注微信:生信小博士

    选择最优聚类有多种方式,今天着重于Silhouette Method。

    Determining optimal number of clusters (k)

    Before we do the actual clustering, we need to identity the Optimal number of clusters (k) for this data set of wholesale customers. The popular way of determining number of clusters are

    1. Elbow Method
    2. Silhouette Method
    3. Gap Static Method

    Elbow and Silhouette methods are direct methods and gap statistic method is the statistics method.

    In this demonstration, we are going to see how silhouette method is used.

    1 最优聚类 数据前处理

    如果你已经准备好前期的数据,此步可以省略。

    这一步的目的:会得到一个名为inttegtaed_data的seurat对象。

    1. slide_files <- list.files("~/silicosis/spatial/prcessed_visum_for_progeny/data/",recursive = TRUE,
    2. all.files = TRUE,full.names = TRUE,pattern = "_")
    3. integrated_data <- map(slide_files, readRDS)
    4. print("You managed to load everything")
    5. print("Object size")
    6. print(object.size(integrated_data))
    7. # Calculate HVG per sample - Here we assume that batch and patient effects aren't as strong
    8. # since cell-types and niches should be greater than the number of batches
    9. Assays(integrated_data [[1]])
    10. str(integrated_data [[1]])
    11. def_assay="Spatial"
    12. hvg_list <- map(integrated_data, function(x) {
    13. DefaultAssay(x) <- def_assay
    14. x <- FindVariableFeatures(x, selection.method = "vst",
    15. nfeatures = 3000)
    16. x@assays[[def_assay]]@var.features
    17. }) %>% unlist()
    18. hvg_list <- table(hvg_list) %>%
    19. sort(decreasing = TRUE)
    20. gene_selection_plt <- hvg_list %>% enframe() %>%
    21. group_by(value) %>%
    22. mutate(value = as.numeric(value)) %>%
    23. summarize(ngenes = length(name)) %>%
    24. ggplot(aes(x = value, y = ngenes)) +
    25. geom_bar(stat = "identity")
    26. gene_selection <- hvg_list[1:3000] %>% names()
    27. #########
    28. # Create merged object ---------------------------------
    29. integrated_data <- purrr::reduce(integrated_data,
    30. merge,
    31. merge.data = TRUE)
    32. print("You managed to merge everything")
    33. print("Object size")
    34. print(object.size(integrated_data))
    35. # Default assay ---------------------------------------
    36. #DefaultAssay(integrated_data) <- def_assay
    37. # Process it before integration -----------------------
    38. integrated_data <- integrated_data %>%
    39. ScaleData(verbose = FALSE) %>%
    40. RunPCA(features = gene_selection,
    41. npcs = 30,
    42. verbose = FALSE)
    43. original_pca_plt <- DimPlot(object = integrated_data,
    44. reduction = "pca",
    45. pt.size = .1,
    46. group.by = "orig.ident")
    47. head(integrated_data@meta.data)
    48. batch_covars="orig.ident"
    49. # Integrate the data -----------------------
    50. integrated_data <- RunHarmony(integrated_data,
    51. group.by.vars = batch_covars,
    52. plot_convergence = TRUE,
    53. assay.use = def_assay,
    54. max.iter.harmony = 20)
    55. # Corrected dimensions -----------------------
    56. corrected_pca_plt <- DimPlot(object = integrated_data,
    57. reduction = "harmony",
    58. pt.size = .1,
    59. group.by = "orig.ident")
    60. # Create the UMAP with new reduction -----------
    61. integrated_data <- integrated_data %>%
    62. RunUMAP(reduction = "harmony", dims = 1:30,
    63. reduction.name = "umap_harmony") %>%
    64. RunUMAP(reduction = "pca", dims = 1:30,
    65. reduction.name = "umap_original")
    66. integrated_data <- FindNeighbors(integrated_data,
    67. reduction = "harmony",
    68. dims = 1:30)
    69. head(integrated_data@meta.data)
    70. colnames(integrated_data@meta.data)=gsub(pattern ="_snn_res.",replacement = "_before",x = colnames(integrated_data@meta.data) )

    准备好的数据如下。这里的数据你可以使用seurat的pbmc教程中的数据,作为练习。

    然后进行下一步

    2 开始最优聚类

    全代码如下

    1. optimize=TRUE
    2. ################opt cluster--------------
    3. if(optimize) {
    4. # Clustering and optimization -------------------------
    5. print("Optimizing clustering")
    6. seq_res <- seq(0.5, 1.5, 0.1)
    7. # seq_res=1
    8. integrated_data <- FindClusters(integrated_data,
    9. resolution = seq_res,
    10. verbose = F)
    11. clustree_plt <- clustree::clustree(integrated_data,
    12. prefix = paste0(DefaultAssay(integrated_data), "_snn_res."))
    13. # Optimize clustering ------------------------------------------------------
    14. cell_dists <- dist(integrated_data@reductions$harmony@cell.embeddings,
    15. method = "euclidean")
    16. head(cell_dists)
    17. cluster_info <- integrated_data@meta.data[,grepl(paste0(DefaultAssay(integrated_data),"_snn_res"),
    18. colnames(integrated_data@meta.data))] %>%
    19. dplyr::mutate_all(as.character) %>%
    20. dplyr::mutate_all(as.numeric)
    21. head(cluster_info)[,1:9]
    22. # head(cell_dists)[,1:9]
    23. si= silhouette(cluster_info[,1], cell_dists) %>%head()
    24. si
    25. silhouette_res <- apply(cluster_info, 2, function(x){
    26. si <- silhouette(x, cell_dists)
    27. if(!any(is.na(si))) {
    28. mean(si[, 'sil_width'])
    29. } else {
    30. NA
    31. }
    32. })
    33. head(silhouette_res)
    34. integrated_data[["opt_clust_integrated"]] <- integrated_data[[names(which.max(silhouette_res))]]
    35. Idents(integrated_data) = "opt_clust_integrated"
    最终会得到一个opt_clust_integrated,代表最优聚类的清晰度。如果你的数据比较大,可以调整   seq_res <- seq(0.5, 1.5, 0.1),把1.5调到3,甚至更高。

    上述代码的部分运行结果:

    根据下面的图,resolution为0.7是最优聚类

    3 找到最优聚类之后,把多余的聚类结果删掉,节省内存

    全代码如下

    1. # Reduce meta-data -------------------------------------------------------------------------
    2. spam_cols <- grepl(paste0(DefaultAssay(integrated_data), "_snn_res"),
    3. colnames(integrated_data@meta.data)) |
    4. grepl("seurat_clusters",colnames(integrated_data@meta.data))
    5. integrated_data@meta.data <- integrated_data@meta.data[,!spam_cols]
    6. } else {
    7. print("Not Optimizing clustering")
    8. seq_res <- seq(0.2, 1.6, 0.2)
    9. integrated_data <- FindClusters(integrated_data,
    10. resolution = seq_res,
    11. verbose = F)
    12. clustree_plt <- clustree(integrated_data,
    13. prefix = paste0(DefaultAssay(integrated_data),
    14. "_snn_res."))
    15. integrated_data <- FindClusters(integrated_data,
    16. resolution = default_resolution,
    17. verbose = F)
    18. integrated_data[["opt_clust_integrated"]] <- integrated_data[["seurat_clusters"]]
    19. spam_cols <- grepl(paste0(DefaultAssay(integrated_data), "_snn_res"),
    20. colnames(integrated_data@meta.data)) |
    21. grepl("seurat_clusters",colnames(integrated_data@meta.data))
    22. integrated_data@meta.data <- integrated_data@meta.data[,!spam_cols]
    23. }

    4 保存对象

    1. # Save object ------------------------------------------------------
    2. saveRDS(integrated_data, file ="~/silicosis/spatial/integrated_slides/integrated_slides.rds" )

     关注微信:生信小博士

    参考: 

    https://medium.com/codesmart/r-series-k-means-clustering-silhouette-794774b46586#id_token=eyJhbGciOiJSUzI1NiIsImtpZCI6ImEwNmFmMGI2OGEyMTE5ZDY5MmNhYzRhYmY0MTVmZjM3ODgxMzZmNjUiLCJ0eXAiOiJKV1QifQ.eyJpc3MiOiJodHRwczovL2FjY291bnRzLmdvb2dsZS5jb20iLCJhenAiOiIyMTYyOTYwMzU4MzQtazFrNnFlMDYwczJ0cDJhMmphbTRsamRjbXMwMHN0dGcuYXBwcy5nb29nbGV1c2VyY29udGVudC5jb20iLCJhdWQiOiIyMTYyOTYwMzU4MzQtazFrNnFlMDYwczJ0cDJhMmphbTRsamRjbXMwMHN0dGcuYXBwcy5nb29nbGV1c2VyY29udGVudC5jb20iLCJzdWIiOiIxMTIyMzMxNDE1MDU0MDY1NzA5OTMiLCJlbWFpbCI6ImFudGkuY2FuY2VyLmZpZ2h0ZXJAZ21haWwuY29tIiwiZW1haWxfdmVyaWZpZWQiOnRydWUsIm5iZiI6MTY5ODE5ODYzMCwibmFtZSI6InlhbmcgeWFuZyIsInBpY3R1cmUiOiJodHRwczovL2xoMy5nb29nbGV1c2VyY29udGVudC5jb20vYS9BQ2c4b2NKUjVFOGJoU2wxRUVtUDcyVXhPUExGU2VlNlcza2hlZVItVnVaZDZiei09czk2LWMiLCJnaXZlbl9uYW1lIjoieWFuZyIsImZhbWlseV9uYW1lIjoieWFuZyIsImxvY2FsZSI6ImVuIiwiaWF0IjoxNjk4MTk4OTMwLCJleHAiOjE2OTgyMDI1MzAsImp0aSI6ImJlNjYyNmYyMWMwZGE4MjgzZWUzMWEzMGU0ZGY2MWQ0Y2M3YzlkODgifQ.FddjhGbYiuFDFjWFmjGcAXNdbuR3XyqzqWWv8NvyJnREj48qhcFsI4HHv0DWnwewOsmr2jnCnC51nInSyBvgZJSjAQMdUAVpoZTwvroliR3wB6q8In29eXhZ2N_WgqVqoTcMT5yECF4oJrVZgDYxG5t17KAIYso3pvbpCy-5oVq8zBPZ8M8HRd83upKAODDC4bAKtGsGqlFYDV-bD7L1pXsRw3HWebcyY7bfk74FVtnveGyk-A0VIHjDIAdxxjqMiZmuntRX7PUV-nXhSeiBxzI8W4kjHO8uQKlwaCgjyOARTNQ2b-AOY5f5gr6BP0kin9DN6rk7QuPfrvVIgg4yGwicon-default.png?t=N7T8https://medium.com/codesmart/r-series-k-means-clustering-silhouette-794774b46586#id_token=eyJhbGciOiJSUzI1NiIsImtpZCI6ImEwNmFmMGI2OGEyMTE5ZDY5MmNhYzRhYmY0MTVmZjM3ODgxMzZmNjUiLCJ0eXAiOiJKV1QifQ.eyJpc3MiOiJodHRwczovL2FjY291bnRzLmdvb2dsZS5jb20iLCJhenAiOiIyMTYyOTYwMzU4MzQtazFrNnFlMDYwczJ0cDJhMmphbTRsamRjbXMwMHN0dGcuYXBwcy5nb29nbGV1c2VyY29udGVudC5jb20iLCJhdWQiOiIyMTYyOTYwMzU4MzQtazFrNnFlMDYwczJ0cDJhMmphbTRsamRjbXMwMHN0dGcuYXBwcy5nb29nbGV1c2VyY29udGVudC5jb20iLCJzdWIiOiIxMTIyMzMxNDE1MDU0MDY1NzA5OTMiLCJlbWFpbCI6ImFudGkuY2FuY2VyLmZpZ2h0ZXJAZ21haWwuY29tIiwiZW1haWxfdmVyaWZpZWQiOnRydWUsIm5iZiI6MTY5ODE5ODYzMCwibmFtZSI6InlhbmcgeWFuZyIsInBpY3R1cmUiOiJodHRwczovL2xoMy5nb29nbGV1c2VyY29udGVudC5jb20vYS9BQ2c4b2NKUjVFOGJoU2wxRUVtUDcyVXhPUExGU2VlNlcza2hlZVItVnVaZDZiei09czk2LWMiLCJnaXZlbl9uYW1lIjoieWFuZyIsImZhbWlseV9uYW1lIjoieWFuZyIsImxvY2FsZSI6ImVuIiwiaWF0IjoxNjk4MTk4OTMwLCJleHAiOjE2OTgyMDI1MzAsImp0aSI6ImJlNjYyNmYyMWMwZGE4MjgzZWUzMWEzMGU0ZGY2MWQ0Y2M3YzlkODgifQ.FddjhGbYiuFDFjWFmjGcAXNdbuR3XyqzqWWv8NvyJnREj48qhcFsI4HHv0DWnwewOsmr2jnCnC51nInSyBvgZJSjAQMdUAVpoZTwvroliR3wB6q8In29eXhZ2N_WgqVqoTcMT5yECF4oJrVZgDYxG5t17KAIYso3pvbpCy-5oVq8zBPZ8M8HRd83upKAODDC4bAKtGsGqlFYDV-bD7L1pXsRw3HWebcyY7bfk74FVtnveGyk-A0VIHjDIAdxxjqMiZmuntRX7PUV-nXhSeiBxzI8W4kjHO8uQKlwaCgjyOARTNQ2b-AOY5f5gr6BP0kin9DN6rk7QuPfrvVIgg4yGw

  • 相关阅读:
    SCI投稿:投稿附言cover letter的写法和模板!
    汇聚荣拼多多电商好不好?
    【广州华锐互动】VR虚拟仿真技术为航测实践教学提供了哪些帮助?
    用Python实现广度优先搜索
    IIR滤波器的设计与实现(内含设计IIR滤波器的高效方法)
    Mybatis配置文件总览
    Flink系列之Flink流式计算引擎基础理论
    iOS16 中的 3 种新字体宽度样式
    常用彩色模型及转换
    ssh 远程服务器 Host key verification failed.【known_hosts】
  • 原文地址:https://blog.csdn.net/qq_52813185/article/details/134027916
  • 最新文章
  • 攻防演习之三天拿下官网站群
    数据安全治理学习——前期安全规划和安全管理体系建设
    企业安全 | 企业内一次钓鱼演练准备过程
    内网渗透测试 | Kerberos协议及其部分攻击手法
    0day的产生 | 不懂代码的"代码审计"
    安装scrcpy-client模块av模块异常,环境问题解决方案
    leetcode hot100【LeetCode 279. 完全平方数】java实现
    OpenWrt下安装Mosquitto
    AnatoMask论文汇总
    【AI日记】24.11.01 LangChain、openai api和github copilot
  • 热门文章
  • 十款代码表白小特效 一个比一个浪漫 赶紧收藏起来吧!!!
    奉劝各位学弟学妹们,该打造你的技术影响力了!
    五年了,我在 CSDN 的两个一百万。
    Java俄罗斯方块,老程序员花了一个周末,连接中学年代!
    面试官都震惊,你这网络基础可以啊!
    你真的会用百度吗?我不信 — 那些不为人知的搜索引擎语法
    心情不好的时候,用 Python 画棵樱花树送给自己吧
    通宵一晚做出来的一款类似CS的第一人称射击游戏Demo!原来做游戏也不是很难,连憨憨学妹都学会了!
    13 万字 C 语言从入门到精通保姆级教程2021 年版
    10行代码集2000张美女图,Python爬虫120例,再上征途
Copyright © 2022 侵权请联系2656653265@qq.com    京ICP备2022015340号-1
正则表达式工具 cron表达式工具 密码生成工具

京公网安备 11010502049817号