复旦大学赵兴明领衔团队Nature发文构建全球微生物基因目录
Coelho, L.P., Alves, R., del Río, Á.R. et al. Towards the biogeography of prokaryotic genes. Nature 601, 252–256 (2022). https://doi.org/10.1038/s41586-021-04233-4
文章中使用的丰度计算方法在方法部分进行了文字描述:
Metagenomic annotation and abundance profiling
…
The metagenomes were mapped to the catalogue using minimap 281 ^{281} 281,
after read trimming and filtering as described in ‘Contig assembly and ORF prediction’. A unigene was considered as detected in a sample if it had reads mapping to it unambiguously. Gene and functional abundance profiles were then computed with NGLess 61 ^{61} 61 as well as Jug 64 ^{64} 64 scripts provided in the profiles-all directory of the supplementary software. In brief, abundance was estimated as the number of short reads mapping to a given sequence, with multiple mappers (short reads mapping to more than one sequence) being distributed by unique mapper abundance. For cross-sample comparisons, these results were normalized by library size.…
“Contig assembly and ORF prediction” 这一部分似乎没有在正文部分中出现, 不过也没有必要了解, 反正就当作使用 clean reads.
随后, 使用 minimap 将 clean reads 比对到 unigene 上.
“NGLess and Jug scripts” 在 Github 上提供, 接下来将具体分析.
outputs.txt
:gfiles = CachedFunction(glob, 'outputs.txt/*.txt.gz')
@TaskGenerator
def count1(samfile):
from ngless import NGLess
import tempfile
ofname = 'outputs/{}.txt'.format(get_sample(samfile))
ofname_u = 'outputs/{}.unique.txt'.format(get_sample(samfile))
sname = tempfile.NamedTemporaryFile(suffix='.sam', delete=False)
sname.close()
try:
sname = sname.name
expand_sort(samfile, sname)
sc = NGLess.NGLess('0.7')
e = sc.env
e.sam = sc.samfile_(sname, headers=HEADERS_FILE)
sc.write_(sc.count_(e.sam, features=['seqname'], multiple='{unique_only}', discard_zeros=True),
ofile=ofname_u + '.gz')
sc.write_(sc.count_(e.sam, features=['seqname'], discard_zeros=True),
ofile=ofname + '.gz')
sc.run(auto_install=False, ncpus='auto')
return ofname
finally:
os.unlink(sname)
expand_sort
对文件进行排序sc.count_
进行计数sc.run
中规定的其他处理expand_sort
仅通过 linux 中的 sort
命令对 sam 文件进行排序, 使用参数 --merge
,
@
的行with TemporaryDirectory() as tdir:
for line in lzma.open(ifile, 'rt'):
if line[0] == '@':
continue
if line.split('\t')[2] in _gene_blacklist :
continue
block.append(line)
if len(block) == BUFSIZE:
write_block(tdir)
write_block(tdir)
subprocess.check_call(['sort', '--merge', '-o', ofile] + partials)
无法从 NGLess 文档 中找到 count_
函数, 但似乎与 1.5.0
版本中的 count
相似:
If the features to count are
['seqname']
,
then each read will be assigned to the name of reference it matched
and only an input set of mapped reads is necessary.
mode indicates how to handle reads that (partially) overlap one or more features.
Possible values for mode are
{union}
,{intersection_non_empty}
and{intersection_strict}
(default:{union}
).
For every position of a mapped read, collect all features into a set.
These sets of features are then handled in different modes.
{union}
the union of all the sets.
A read is counted for every feature it overlaps.Consider the following illustration of the effect of different
mode
options:Reference ************************* Feature A ======= Feature B =========== Feature C ======== Read_1 ----- Read_2 ----- Read_3 ----- Position 12345 12345 12345 Read position 1 2 3 4 5 Read_1 feature sets - - A A A Read_2 feature sets A A A,B B B Read_3 feature sets B,C B,C B,C B,C B,C union intersection_non_empty intersection_strict Read_1 A A - Read_2 A & B - - Read_3 B & C B & C B & C
- 1
- 2
- 3
- 4
- 5
- 6
- 7
- 8
- 9
- 10
- 11
- 12
- 13
- 14
- 15
- 16
- 17
- 18
How to handle multiple mappers
(inserts which have more than one “hit” in the reference)
is defined by the multiple argument:
{unique_only}
: only use uniquely mapped inserts
NGLess.NGLess('0.7').count_
) 计算一个 reads 对应的全部 gene, 最后再进行唯一匹配和非唯一匹配的矫正