在进行变异检测时,以群体基因组重测序数据为例,涉及到的个体基本都是上百个,而其中大多数流程均是重复的步骤。
本文将基于GATK进行SNP calling的流程写入循环,便于批量分析。
1.工作目录work_dir/
2.参考基因组ref_genome.fa
3.Reads列表read_list.txt
4.测序平台Illumina
5.调用线程数
1.参考基因组ref_genome.fa
2.重测序数据sample1/sample1_1.fq.gz
、sample1/sample1_2.fq.gz
……
3.Reads列表:read_list.txt
生成方法:预先将存放各个个体Reads的文件夹放入一个文件夹work_dir/
然后使用下列命令生成:
ls work_dir/ > read_list.txt
bash GATK_pipeline.sh work_dir/ ref_genome.fa read_list.txt Illumina 10
GATK_pipeline.sh
#---------------------------------------------------------------#
# objection defined by user #
#---------------------------------------------------------------#
set -au
# 1.
# Master dir.:
WORK_dir=$1
# 2.
# Reference genome:
REF=$2
# 3.
# Read list:
READ_list=$3
# 4.
# Seqencing platform:
PL=$4
# 5.
# number of threads:
NT=$5
#---------------------------------------------------------------#
# main loop for SNPs calling by gatk pipeline #
#---------------------------------------------------------------#
#READ_list.txt is a list of read groups.
while read -r READ
do
SAMPLE=SM_${READ}
ID=${READ}
READ1="${WORK_dir}${READ}_1.fq"
READ2="${WORK_dir}${READ}_2.fq"
OUT="${READ}"
#1.
#Alignning reads to reference genome by BWA-MEM2-mem, producing a .sam data
bwa-mem2 \
mem \
-M \
-t ${NT} \
-R "@RG\tID:${ID}\tSM:${SAMPLE}\tPL:${PL}" \
${REF} \
${READ1} \
${READ2} \
> ${OUT}.sam
#2.
#Sorting .sam by gatk-SortSam, producing a .bam data
gatk \
SortSam \
-I ${OUT}.sam \
-O ${OUT}.bam \
-SO coordinate \
-VALIDATION_STRINGENCY LENIENT \
-CREATE_INDEX true \
-TMP_DIR ./${OUT}tmp.sort
#3.
#Marking dupulications in .bam by gatk-MarkDuplicates
#producing a .dup.bam and .dup.txt data
gatk \
MarkDuplicates \
-I ${OUT}.bam \
-O ${OUT}.dup.bam \
-M ${OUT}.dup.txt \
-REMOVE_DUPLICATES true \
-VALIDATION_STRINGENCY LENIENT \
-CREATE_INDEX true \
-TMP_DIR ${OUT}tmp.dup
#4.
#QC by samtools-flagstat, producing a .dup.bam.stat data
samtools \
flagstat \
${OUT}.dup.bam \
> ${OUT}.dup.bam.stat
#5.
#Calling SNPs by gatk-HaplotypeCaller, producing a .dup.vcf data
gatk \
HaplotypeCaller \
-R ${REF} \
-I ${OUT}.dup.bam \
-O ${OUT}.dup.vcf
done < $READ_list
##