• Coelho2021_GMGCv1: how abundance is computed


    Article link

    A Nature paper from the team led by Xing-Ming Zhao at Fudan University constructs a global microbial gene catalogue:
    Coelho, L.P., Alves, R., del Río, Á.R. et al. Towards the biogeography of prokaryotic genes. Nature 601, 252–256 (2022). https://doi.org/10.1038/s41586-021-04233-4

    Content from the article

    • The abundance-calculation method used in the article is described in prose in the Methods section:

      Metagenomic annotation and abundance profiling

      The metagenomes were mapped to the catalogue using minimap2^81,
      after read trimming and filtering as described in ‘Contig assembly and ORF prediction’. A unigene was considered as detected in a sample if it had reads mapping to it unambiguously. Gene and functional abundance profiles were then computed with NGLess^61 as well as Jug^64 scripts provided in the profiles-all directory of the supplementary software. In brief, abundance was estimated as the number of short reads mapping to a given sequence, with multiple mappers (short reads mapping to more than one sequence) being distributed by unique mapper abundance. For cross-sample comparisons, these results were normalized by library size.
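    The "distributed by unique mapper abundance" step can be sketched in a few lines of Python. This is only an illustrative reading of the method, not the paper's code; the function name `distribute_multimappers` and the data layout are made up:

    ```python
    def distribute_multimappers(unique_counts, multi_hits):
        """Split each multi-mapping read among its candidate genes in
        proportion to the genes' unique-mapper counts (hypothetical sketch).

        unique_counts: {gene: number of uniquely mapped reads}
        multi_hits: one iterable of candidate genes per multi-mapping read
        """
        abundance = dict(unique_counts)
        for genes in multi_hits:
            genes = sorted(set(genes))
            total = sum(unique_counts.get(g, 0) for g in genes)
            for g in genes:
                if total > 0:
                    # weight by the unique-mapper evidence for each gene
                    share = unique_counts.get(g, 0) / total
                else:
                    # no unique evidence at all: split the read evenly
                    share = 1 / len(genes)
                abundance[g] = abundance.get(g, 0) + share
        return abundance
    ```

    For example, a read hitting both geneA (3 unique reads) and geneB (1 unique read) would contribute 0.75 to geneA and 0.25 to geneB.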

    • The ‘Contig assembly and ORF prediction’ section does not seem to appear in the main text, but understanding it is not necessary; simply treat the input as clean reads.

    • Then, minimap2 is used to map the clean reads to the unigenes.

      • Note that the unigenes were clustered by amino-acid identity, while the reads are nucleotide sequences and cannot be aligned to protein sequences directly.
      • The article appears to map against the nucleotide sequence corresponding to each unigene, which ignores nucleotide-level polymorphism and may therefore underestimate the overall mapping rate.
    • The “NGLess and Jug scripts” are provided on GitHub; they are analysed in detail below.

    Code

    count1

    • The repository README does not describe the output directly; only from some of the driver scripts can one see that the gene-abundance files are stored in the outputs.txt folder:
      gfiles = CachedFunction(glob, 'outputs.txt/*.txt.gz')
      
    • The code that generates these abundance files can then be located:
      # get_sample, expand_sort and HEADERS_FILE are defined elsewhere in the script
      @TaskGenerator
      def count1(samfile):
          import os
          import tempfile
          from ngless import NGLess
      
          ofname = 'outputs/{}.txt'.format(get_sample(samfile))
          ofname_u = 'outputs/{}.unique.txt'.format(get_sample(samfile))
          # sort the alignments into a temporary SAM file
          sname = tempfile.NamedTemporaryFile(suffix='.sam', delete=False)
          sname.close()
          try:
              sname = sname.name
              expand_sort(samfile, sname)
      
              sc = NGLess.NGLess('0.7')
              e = sc.env
      
              e.sam = sc.samfile_(sname, headers=HEADERS_FILE)
              # count unique mappers only
              sc.write_(sc.count_(e.sam, features=['seqname'], multiple='{unique_only}', discard_zeros=True),
                      ofile=ofname_u + '.gz')
              # default handling of multiple mappers
              sc.write_(sc.count_(e.sam, features=['seqname'], discard_zeros=True),
                      ofile=ofname + '.gz')
              sc.run(auto_install=False, ncpus='auto')
              return ofname
          finally:
              os.unlink(sname)
      
    • In short: the script takes a SAM file as input and generates abundance files with and without the “{unique_only}” option.
      • The files are produced in the following steps:
        1. sort the file with expand_sort
        2. count with sc.count_
        3. any further processing specified by sc.run
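    The counting over ['seqname'] can be mimicked in plain Python. This is a stand-in sketch assuming SAM-style tab-separated lines, not NGLess's implementation; `count_seqname` is a hypothetical name:

    ```python
    import collections

    def count_seqname(sam_lines, unique_only=False):
        """Count reads per reference sequence from SAM-style lines.

        With unique_only=True, reads that hit more than one reference
        are discarded (the '{unique_only}' behaviour)."""
        hits = collections.defaultdict(set)  # read name -> set of references
        for line in sam_lines:
            if line.startswith('@'):
                continue                      # skip SAM header lines
            fields = line.split('\t')
            read, ref = fields[0], fields[2]
            if ref != '*':                    # '*' marks an unmapped read
                hits[read].add(ref)
        counts = collections.Counter()
        for read, refs in hits.items():
            if unique_only and len(refs) > 1:
                continue
            for ref in refs:
                counts[ref] += 1
        return counts
    ```

    Running both variants over the same alignments yields the pair of outputs that count1 writes (with and without unique-only filtering).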

    expand_sort

    • expand_sort simply sorts the SAM file with the Linux sort command, using the --merge option.
      • Genes on a blacklist can be skipped, and header lines starting with @ are ignored:
      # excerpt: write_block, block, partials, BUFSIZE and _gene_blacklist
      # are defined in the surrounding function
      with TemporaryDirectory() as tdir:
          for line in lzma.open(ifile, 'rt'):
              if line[0] == '@':
                  continue  # skip SAM header lines
              if line.split('\t')[2] in _gene_blacklist:
                  continue  # skip blacklisted genes
              block.append(line)
              if len(block) == BUFSIZE:
                  write_block(tdir)  # flush a sorted chunk to disk
          write_block(tdir)
          subprocess.check_call(['sort', '--merge', '-o', ofile] + partials)
      
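    The buffering pattern around this excerpt (write sorted chunks, then let sort --merge combine them) can be reproduced as a self-contained sketch; `external_sort`, the chunk size, and the temporary-file layout are illustrative, not the repository's exact code:

    ```python
    import os
    import subprocess
    import tempfile

    def external_sort(lines, ofile, bufsize=2):
        """External merge sort: buffer lines, write each buffer pre-sorted
        to a partial file, then merge all partials with `sort --merge`."""
        tdir = tempfile.mkdtemp()
        partials = []
        block = []

        def write_block():
            if not block:
                return
            fname = os.path.join(tdir, 'part{}'.format(len(partials)))
            with open(fname, 'w') as f:
                f.writelines(sorted(block))   # each partial is fully sorted
            partials.append(fname)
            block.clear()

        for line in lines:
            if line.startswith('@'):          # skip SAM header lines
                continue
            block.append(line)
            if len(block) == bufsize:
                write_block()
        write_block()
        # --merge assumes the inputs are already sorted; it only interleaves them
        subprocess.check_call(['sort', '--merge', '-o', ofile] + partials)
    ```

    Using --merge instead of a plain sort is what makes the buffering worthwhile: memory stays bounded by the chunk size regardless of the input size.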

    count_

    • The count_ function cannot be found in the NGLess documentation, but it appears similar to count in version 1.5.0:

      If the features to count are ['seqname'],
      then each read will be assigned to the name of reference it matched
      and only an input set of mapped reads is necessary.

      mode indicates how to handle reads that (partially) overlap one or more features.
      Possible values for mode are
      {union}, {intersection_non_empty} and {intersection_strict}
      (default: {union}).
      For every position of a mapped read, collect all features into a set.
      These sets of features are then handled in different modes.

      • {union} the union of all the sets.
        A read is counted for every feature it overlaps.

      Consider the following illustration of the effect of different mode options:

      Reference *************************
      Feature A      =======
      Feature B            ===========
      Feature C                 ========
      Read_1       -----
      Read_2             -----
      Read_3                    -----
      Position     12345 12345  12345
      
      Read position          1    2    3    4    5
      Read_1 feature sets    -    -    A    A    A
      Read_2 feature sets    A    A  A,B    B    B
      Read_3 feature sets  B,C  B,C  B,C  B,C  B,C
      
                 union  intersection_non_empty  intersection_strict
      Read_1         A                       A                    -
      Read_2     A & B                       -                    -
      Read_3     B & C                   B & C                B & C
      

      How to handle multiple mappers
      (inserts which have more than one “hit” in the reference)
      is defined by the multiple argument:

      • {unique_only}: only use uniquely mapped inserts
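      The mode table above can be checked with a few lines of Python. This is an illustration of the documented semantics, not NGLess's implementation; `assign` is a made-up name:

      ```python
      def assign(mode, position_sets):
          """Combine one read's per-position feature sets according to `mode`."""
          nonempty = [s for s in position_sets if s]
          if mode == 'union':
              # every feature the read overlaps anywhere
              return set().union(*position_sets)
          if mode == 'intersection_non_empty':
              # intersect only positions that have at least one feature
              return set.intersection(*nonempty) if nonempty else set()
          if mode == 'intersection_strict':
              # any feature-less position disqualifies the read
              if len(nonempty) < len(position_sets):
                  return set()
              return set.intersection(*position_sets)
          raise ValueError(mode)

      # the three reads from the illustration above
      read_1 = [set(), set(), {'A'}, {'A'}, {'A'}]
      read_2 = [{'A'}, {'A'}, {'A', 'B'}, {'B'}, {'B'}]
      read_3 = [{'B', 'C'}] * 5
      ```

      Evaluating assign on the three reads reproduces each cell of the table, e.g. assign('intersection_strict', read_1) is the empty set because Read_1 partially hangs off Feature A.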

    Summary

    • To compute gene abundance, the article starts from the minimap2 mapping results, then uses custom scripts (NGLess.NGLess('0.7').count_) to determine every gene each read maps to, and finally corrects the counts by distinguishing unique from multiple mappers.
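    The final library-size normalization mentioned in the Methods amounts to scaling each sample's counts by its total. A hedged sketch; the paper only says "normalized by library size", so the per-million scale factor here is an assumption:

    ```python
    def per_million(counts):
        """Scale raw per-gene counts to counts per million mapped reads,
        making samples with different sequencing depths comparable."""
        total = sum(counts.values())
        return {gene: c * 1e6 / total for gene, c in counts.items()}
    ```

    After this step, the per-gene values within one sample sum to one million regardless of how deeply the sample was sequenced.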
  • Original post: https://blog.csdn.net/weixin_44881875/article/details/128101618