NECAT (https://github.com/xiaochuanle/NECAT) is an error-correction and de novo assembly tool for noisy Nanopore long reads. The work was published in Nature Communications, and the original paper is worth reading:
Chen Y, Nie F, Xie S Q, et al. Efficient assembly of nanopore reads via highly accurate and intact error correction[J]. Nature Communications, 2021, 12(1): 1-10.
At the time of writing, the latest version was the one updated on 2020-08-03.
Installing through conda is the most convenient route:
conda create -n necat
conda activate necat
conda install -c bioconda necat
## Running necat prints a minor warning; it does not affect usage
necat
Smartmatch is experimental at /home/debian/bin/miniconda3/envs/necat/share/necat-0.0.1_update20200803-1/bin/Plgd/Project.pm line 263.
Usage: necat.pl correct|assemble|bridge|config cfg_fname
correct: correct rawreads
assemble: generate contigs
bridge: bridge contigs
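config: generate default config file ## this creates the config file
According to the Quick Start on GitHub, NECAT must first be on your PATH so that it runs correctly. After that, start by generating a config file.
Command: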
necat config config.txt
Contents of config.txt:
PROJECT=
ONT_READ_LIST=
GENOME_SIZE=
THREADS=4
MIN_READ_LENGTH=3000
PREP_OUTPUT_COVERAGE=40
OVLP_FAST_OPTIONS=-n 500 -z 20 -b 2000 -e 0.5 -j 0 -u 1 -a 1000
OVLP_SENSITIVE_OPTIONS=-n 500 -z 10 -e 0.5 -j 0 -u 1 -a 1000
CNS_FAST_OPTIONS=-a 2000 -x 4 -y 12 -l 1000 -e 0.5 -p 0.8 -u 0
CNS_SENSITIVE_OPTIONS=-a 2000 -x 4 -y 12 -l 1000 -e 0.5 -p 0.8 -u 0
TRIM_OVLP_OPTIONS=-n 100 -z 10 -b 2000 -e 0.5 -j 1 -u 1 -a 400
ASM_OVLP_OPTIONS=-n 100 -z 10 -b 2000 -e 0.5 -j 1 -u 0 -a 400
NUM_ITER=2
CNS_OUTPUT_COVERAGE=30
CLEANUP=1
USE_GRID=false
GRID_NODE=0
GRID_OPTIONS=
SMALL_MEMORY=0
FSA_OL_FILTER_OPTIONS=
FSA_ASSEMBLE_OPTIONS=
FSA_CTG_BRIDGE_OPTIONS=
POLISH_CONTIGS=true
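Edit the config.txt above to match your own species and data file names.
The Nanopore data files are listed in read_list.txt. The listed files do not need to share one format: fastq, fasta, and gzip-compressed files can all appear. Our Arabidopsis dataset is a single file: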
## cat read_list.txt
/home/debian/data/08.arabidopsis_t2t_genome/CRR302667/CRR302667.fastq.gz
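When the reads are spread over several files, the read list can be generated instead of typed by hand. The following is a minimal sketch, not part of NECAT: make_read_list is a hypothetical helper, and it assumes the read files carry one of the four extensions shown and that absolute paths are wanted, as in the example above.

```shell
# Hypothetical helper (not part of NECAT): write the absolute paths of all
# read files found under a directory to a read list, one path per line.
make_read_list() {
    reads_dir=$1      # directory holding the Nanopore read files
    out_list=$2       # read list to create, e.g. read_list.txt
    # Resolve the directory to an absolute path so the list works from anywhere.
    abs_dir=$(cd "$reads_dir" && pwd) || return 1
    find "$abs_dir" -type f \
        \( -name '*.fastq' -o -name '*.fastq.gz' \
           -o -name '*.fasta' -o -name '*.fasta.gz' \) \
        | sort > "$out_list"
}

# Example call (the directory is a placeholder):
# make_read_list /path/to/reads read_list.txt
```

Files with other extensions in the same directory are ignored, so notes or checksums stored next to the reads do not end up in the list.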
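Set the required parameter values:
PROJECT=Arabidopsis ## project name, used as the output directory
ONT_READ_LIST=read_list.txt ## file listing the paths of the Nanopore data
GENOME_SIZE=138000000 ## estimated genome size in bp
THREADS=20 ## use 20 threads
MIN_READ_LENGTH=3000 ## minimum read length
PREP_OUTPUT_COVERAGE=40 ## coverage of the longest raw reads selected for correction, 40X here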
OVLP_FAST_OPTIONS=-n 500 -z 20 -b 2000 -e 0.5 -j 0 -u 1 -a 1000
OVLP_SENSITIVE_OPTIONS=-n 500 -z 10 -e 0.5 -j 0 -u 1 -a 1000
CNS_FAST_OPTIONS=-a 2000 -x 4 -y 12 -l 1000 -e 0.5 -p 0.8 -u 0
CNS_SENSITIVE_OPTIONS=-a 2000 -x 4 -y 12 -l 1000 -e 0.5 -p 0.8 -u 0
TRIM_OVLP_OPTIONS=-n 100 -z 10 -b 2000 -e 0.5 -j 1 -u 1 -a 400
ASM_OVLP_OPTIONS=-n 100 -z 10 -b 2000 -e 0.5 -j 1 -u 0 -a 400
NUM_ITER=2
CNS_OUTPUT_COVERAGE=30 ## the longest 30X of corrected reads are used for assembly
CLEANUP=1
USE_GRID=false
GRID_NODE=0
GRID_OPTIONS=
SMALL_MEMORY=0
FSA_OL_FILTER_OPTIONS=
FSA_ASSEMBLE_OPTIONS=
FSA_CTG_BRIDGE_OPTIONS=
POLISH_CONTIGS=true ## whether to polish the final bridged contigs
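Since PROJECT, ONT_READ_LIST, and GENOME_SIZE are empty in the generated config, it can help to confirm they were filled in before starting a long run. The sketch below is a hypothetical pre-flight check, not a NECAT feature; check_necat_config is an invented name, and the function strips trailing "##" annotations like the ones shown above before testing each value.

```shell
# Hypothetical pre-flight check (not part of NECAT): report any of the three
# settings that are empty in the default config and still have no value.
check_necat_config() {
    cfg=$1
    status=0
    for key in PROJECT ONT_READ_LIST GENOME_SIZE; do
        # Take the text after "KEY=", then drop a trailing "## comment"
        # and any trailing whitespace.
        value=$(sed -n "s/^${key}=//p" "$cfg" \
            | sed 's/[[:space:]]*##.*$//; s/[[:space:]]*$//' \
            | head -n 1)
        if [ -z "$value" ]; then
            echo "missing value for $key in $cfg" >&2
            status=1
        fi
    done
    return $status
}

# Example call:
# check_necat_config config.txt && echo "config looks complete"
```

The function returns a non-zero status when any key is empty, so it can gate the later commands with `&&`.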
Error correction with NECAT
## run necat
time necat correct config.txt
Note (quoted from the NECAT README, whose example project is named ecoli; in this run the directory is Arabidopsis):
(1) The pipeline only corrects the longest 40X (PREP_OUTPUT_COVERAGE) of raw reads. The corrected reads are in the files ./ecoli/1-consensus/cns_iter${NUM_ITER}/cns.fasta.gz
(2) The longest 30X (CNS_OUTPUT_COVERAGE) of corrected reads are extracted for assembly; they are in the file ./ecoli/1-consensus/cns_final.fasta.gz
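Because the coverage settings only make sense if the raw data actually reach that depth, a rough estimate of raw coverage (total bases divided by genome size) is a useful sanity check. The following is a sketch under stated assumptions: estimate_coverage is a hypothetical helper, and it assumes standard four-line FASTQ records, optionally gzip-compressed.

```shell
# Hypothetical helper (not part of NECAT): approximate raw coverage as
# total sequenced bases divided by the genome size.
estimate_coverage() {
    fastq=$1          # FASTQ file, plain or ending in .gz
    genome_size=$2    # genome size in bp, e.g. 138000000
    if [ "${fastq##*.}" = "gz" ]; then
        gzip -dc "$fastq"
    else
        cat "$fastq"
    fi | awk -v g="$genome_size" '
        # In four-line FASTQ records the sequence is the 2nd line of each record.
        NR % 4 == 2 { bases += length($0) }
        END { printf "%.1f\n", bases / g }'
}

# Example call (file name is a placeholder):
# estimate_coverage reads.fastq.gz 138000000
```

If the printed value is below PREP_OUTPUT_COVERAGE, the full dataset is less deep than the requested coverage.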
NECAT assembly uses Arabidopsis/1-consensus/cns_final.fasta.gz as input. If the correction step above has not been run, this step runs the correction first:
time necat assemble config.txt
The assembly result is:
Arabidopsis/4-fsa/contigs.fasta
NECAT bridge command:
time necat bridge config.txt
The bridge result is:
Arabidopsis/6-bridge_contigs/bridged_contigs.fasta
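Note (quoted from the NECAT README, whose example project is named ecoli):
If POLISH_CONTIGS is set, the pipeline uses the corrected reads to polish the bridged contigs.
The polished contigs are in the file ./ecoli/6-bridge_contigs/polished_contigs.fasta
Because POLISH_CONTIGS=true was set above, the final bridge result was also polished. The file sizes show a small difference between the two outputs: the polished file is about 17 KB larger (17,632 bytes):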
ll Arabidopsis/6-bridge_contigs/*_contigs.fasta
-rw-r--r-- 1 debian debian 128470228 Sep 2 02:20 Arabidopsis/6-bridge_contigs/bridged_contigs.fasta
-rw-r--r-- 1 debian debian 128487860 Sep 2 04:22 Arabidopsis/6-bridge_contigs/polished_contigs.fasta
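The Arabidopsis assembly obtained here is about 128.48 Mb, judging from the FASTA file size. With 20 threads, the whole run finished in roughly one day on a Debian server with 500 GB of RAM and 28 CPUs. Each command is prefixed with time so that the runtime is reported.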
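File size only approximates assembly length, since it also counts header lines and newlines. The sketch below computes contig count, total length, and N50 directly from a FASTA file; fasta_stats is a hypothetical helper, not a NECAT command, and it assumes an uncompressed FASTA with Unix line endings.

```shell
# Hypothetical helper (not part of NECAT): contig count, total length,
# and N50 of an uncompressed FASTA file.
fasta_stats() {
    fasta=$1
    # First pass: print one length per contig (sequences may span lines).
    awk '
        /^>/ { if (len > 0) print len; len = 0; next }
        { len += length($0) }
        END { if (len > 0) print len }
    ' "$fasta" | sort -rn | awk '
        # Second pass: lengths arrive longest first; N50 is the length of
        # the contig at which the running sum reaches half the total.
        { lens[NR] = $1; total += $1 }
        END {
            half = total / 2
            for (i = 1; i <= NR; i++) {
                running += lens[i]
                if (running >= half) { n50 = lens[i]; break }
            }
            printf "contigs=%d total_bp=%d N50=%d\n", NR, total, n50
        }'
}

# Example call:
# fasta_stats Arabidopsis/6-bridge_contigs/polished_contigs.fasta
```

Running it on both bridged_contigs.fasta and polished_contigs.fasta shows how polishing changed the total length in bases rather than in bytes.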
References:
https://github.com/xiaochuanle/NECAT (GitHub repository)
https://www.nature.com/articles/s41467-020-20236-7 (paper)