Prokka: fast and convenient annotation of prokaryotic and viral genomes

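Contents

Introduction
Installation
Usage
Downstream analysis
- Extracting 16S sequences for taxonomic annotation
- Counting 5S/16S/23S/tRNA genes
- Counting genes and recording their positions
Notes
References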

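Introduction

Prokka: rapid prokaryotic genome annotation
Whole-genome annotation is the process of identifying features of interest in a set of genomic DNA sequences and labelling them with useful information. Prokka is a software tool that rapidly annotates bacterial, archaeal and viral genomes and produces standards-compliant output files.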

Installation

conda create -n prokka -c conda-forge -c bioconda -c defaults prokka=1.14
# Version 1.13 fails with "Prokka needs blastp 2.2 or higher" even though blastp 2.10 is installed

Test
Type prokka and it should output its help screen.
Type prokka --version and you should see an output like prokka 1.x
Type prokka --listdb and it will show you what databases it has installed to use.
If the following error appears at run time, reinstall Prokka with the command above.

[20:43:45] Prokka needs blastp 2.2 or higher. Please upgrade and try again.

Usage

prokka contigs.fa
# Look for a folder called PROKKA_yyyymmdd (today's date) and look at stats
prokka --force --outdir mydir --prefix mygenome contigs.fa

time prokka --force --cpus 100 --outdir ecoli_prokka --prefix ecoli ../Ecoli_k12/Ecoli_k12.fasta
# E. coli: about 1 min 30 s on 8 cores, about 50 s on 100 cores

Predicting viral genes

# Replace PREFIX with the desired output file prefix
nohup time prokka --force --centre X --compliant --cpus 80 --kingdom Viruses --outdir Viruses_kingdom --prefix PREFIX ../Viral_prediction/Virsorter_Virfinder_Deepvirfinder_share_at_least_two_method.fa &>prokka.log&

nohup time prokka --cpus 80 --kingdom Viruses --outdir Viruses_kingdom ../Viral_prediction/Virsorter_Virfinder_Deepvirfinder_share_at_least_two_method.fa &>prokka.log&

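Without --centre X --compliant, contig IDs longer than 37 characters are rejected with an error such as:

Contig ID must <= 37 chars long: k141_4519235_length_122628_cov_55.0330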

By default, Barrnap is used to predict rRNA. RNAmmer can be selected instead:

--rnammer Prefer RNAmmer over Barrnap for rRNA prediction (default OFF)

Downstream analysis

Extracting 16S sequences for taxonomic annotation

# The *.ffn files hold the nucleotide sequences of all predicted genes; sequences can be extracted by matching keywords in the FASTA comment:
# 5S ribosomal RNA
# 16S ribosomal RNA
# 23S ribosomal RNA
# tRNA
# Extract 16S
mkdir -p ../Prokka_16S
for i in *ffn;do bioawk -c fastx -v sample="${i%.ffn}" '$comment~/16S ribosomal RNA/{print ">"sample"_"$name"_len"length($seq)"\t"$comment"\n"$seq}' "$i" >../Prokka_16S/"${i%.ffn}"_16S.fna;done
# Build the PM seq.list (run in the directory holding the *_16S.fna files; files without any 16S sequence are skipped)
mkdir -p ../PM_out; grep -c ">" *|awk -F: '$2!=0{print $1}'|while read i ; do printf "%s\t%s\n" "${i%.fna}" "$PWD/$i";done >../PM_out/16S_seq.list
# Build metadata.tsv (run in the directory containing 16S_seq.list; the first line is ID"\t"Group)
awk 'BEGIN{print "ID\tGroup"}{print $1"\t"$1}' 16S_seq.list  >16S_metadata.tsv
# Extract 23S (an output file is written only when a 23S sequence is found)
mkdir -p ../Prokka_23S
for i in *ffn;do res=$(bioawk -c fastx -v sample="${i%.ffn}" '$comment~/23S ribosomal RNA/{print ">"sample"_"$name"_len"length($seq)"\t"$comment"\n"$seq}' "$i");if [ -n "$res" ];then echo "$res" >../Prokka_23S/"${i%.ffn}"_23S.fna;fi;done

# PM-pipeline
nohup time PM-pipeline -D S -d 0.97 -m 16S_metadata.tsv -i 16S_seq.list -o 16S_Silva_out -f F  -t  100 -L 123456 &>16S.log
# -d: similarity used for alignment

# Summarize the classifications (run in the Single_Sample directory)
for i in *16S;do awk -v id=${i%_16S} 'NR==2{$1=$2=$3="";gsub(/; /,";",$0);print id,$0}' ${i}/classification.txt;done > /mnt/nfs/yutao/972Isolates/829_Comp95_Cont5_Isolates/829_Comp95_Cont5_Isolates_16S_Taxonomy.tsv

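Counting 5S/16S/23S/tRNA genes

Each *.txt file records how many features of each type were annotated; merging these files into one table summarizes the counts across genomes.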

[u@h@101raw_genomes_processed_prokka]$ cat Y322-2.txt
organism: Genus species strain
contigs: 2122
bases: 13615552
CDS: 11428
rRNA: 5
tRNA: 90
tmRNA: 2
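
The merge described above can be sketched with awk. This is a minimal illustration, not part of Prokka: the function name merge_prokka_stats and the fixed column list are assumptions, and a feature type that is absent from a file is reported as 0.

```shell
# merge_prokka_stats DIR
# Hypothetical helper: combine every Prokka *.txt summary in DIR into one
# tab-separated table with one row per genome.
merge_prokka_stats() {
    local dir=${1:-.}
    awk -F': ' '
        BEGIN {
            OFS = "\t"
            # Columns to report; extend this list if other feature types are needed.
            n = split("contigs bases CDS rRNA tRNA tmRNA", keys, " ")
        }
        FNR == 1 {
            # First line of a new file: derive the sample name from the file name.
            sample = FILENAME
            sub(/^.*\//, "", sample)
            sub(/\.txt$/, "", sample)
            samples[++ns] = sample
        }
        { val[sample, $1] = $2 }   # e.g. val["Y322-2", "rRNA"] = 5
        END {
            printf "sample"
            for (k = 1; k <= n; k++) printf "%s%s", OFS, keys[k]
            printf "\n"
            for (s = 1; s <= ns; s++) {
                printf "%s", samples[s]
                for (k = 1; k <= n; k++) {
                    v = val[samples[s], keys[k]]
                    printf "%s%s", OFS, (v == "" ? 0 : v)
                }
                printf "\n"
            }
        }
    ' "$dir"/*.txt
}
```

For example, merge_prokka_stats 101raw_genomes_processed_prokka > prokka_stats.tsv would write one row per *.txt file in that directory; the output file name here is only an example.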

Counting genes and recording their positions

Note that Prokka annotates a relatively small number of genes.


# Run in the Genome_integration directory
(printf "genome\ttotal_num\tnames\tlocus\tlength\n"
head 826.id | while read i; do
    tsv=Prokka_out/${i%.fna}.tsv
    num=$(grep -c "16S ribosomal RNA" "$tsv")
    locus=($(grep "16S ribosomal RNA" "$tsv" | awk '{print $1}'))
    name=($(grep "16S ribosomal RNA" "$tsv" | awk '{print $4}'))
    len=($(grep "16S ribosomal RNA" "$tsv" | awk '{print $3}'))
    if [ "$num" -eq 0 ]; then locus=0; name=0; len=0; fi
    printf "%s\t%s\t%s\t%s\t%s\n" "$i" "$num" "${name[*]}" "${locus[*]}" "${len[*]}"
done)

Averaging the longest 16S sequence of each genome

# Run in the ColdSeepDB_ANI99_to_3179MAGs_Prokka directory
awk -F"\t" '$2!=0 &&  NR>1{print $NF}' tmp/3179MAGs_prokka_16S.tsv |awk '{m=$1;for(i=1;i<=NF;i++)if($i>m)m=$i;print "max of line",NR": ",m}' |awk '{s+=$NF}END{print s/NR}'
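
The same calculation can be folded into a single awk pass. The sketch below assumes the table layout produced in the previous step (a header line, the 16S count in column 2, and space-separated lengths in the last column); the function name mean_longest_16s is hypothetical.

```shell
# mean_longest_16s TABLE.tsv
# Hypothetical helper: for each genome with at least one 16S gene, take the
# longest 16S length, then print the mean of those maxima.
mean_longest_16s() {
    awk -F'\t' '
        NR > 1 && $2 != 0 {
            # The last column holds one length per 16S copy, separated by spaces.
            n = split($NF, len, " ")
            max = len[1] + 0
            for (i = 2; i <= n; i++)
                if (len[i] + 0 > max) max = len[i] + 0
            sum += max
            count++
        }
        END { if (count) print sum / count }
    ' "$1"
}
```

Adding 0 to each length forces a numeric comparison, so a value such as 900 is never compared as a string against 1500.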

Notes

If a contig ID contains "|", the character is replaced with "_".

[15:52:55] This is prokka 1.14.6
[15:54:16] Changing illegal '|' to '_' in sequence name: HTR7|k141_1252805

Sequence IDs that are too long
If a sequence ID is too long, add the --centre X --compliant options.

Contig ID must <= 37 chars long: k127_1068279_length_11625_cov_149.1380
[22:08:01] Please rename your contigs OR try '--centre X --compliant' to generate clean contig names.
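
As the log message suggests, renaming the contigs before annotation is the other way out. The following is a rough sketch of that route, assuming a plain FASTA input: the function name shorten_contig_ids, the default prefix "contig", and the mapping-file layout are illustrative choices, not part of Prokka. The new IDs are 13 characters long with the default prefix and contain no "|".

```shell
# shorten_contig_ids IN.fa OUT.fa MAP.tsv [PREFIX]
# Hypothetical helper: replace every FASTA header with PREFIX_000001,
# PREFIX_000002, ... and record new-ID/old-ID pairs in MAP.tsv.
shorten_contig_ids() {
    local fasta_in=$1 fasta_out=$2 map_out=$3 prefix=${4:-contig}
    awk -v prefix="$prefix" -v map="$map_out" '
        /^>/ {
            old = substr($1, 2)                  # original ID up to the first whitespace
            new = sprintf("%s_%06d", prefix, ++n)
            printf "%s\t%s\n", new, old > map    # keep a record so names can be mapped back
            print ">" new                        # any description after the ID is dropped
            next
        }
        { print }                                # sequence lines pass through unchanged
    ' "$fasta_in" > "$fasta_out"
}
```

A call such as shorten_contig_ids contigs.fa contigs.renamed.fa contig_id_map.tsv would then give an input that Prokka accepts without --centre X --compliant; the file names are examples only.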

References

Prokka GitHub repository


Original article: https://blog.csdn.net/qq_42491125/article/details/114268910