• 基于GATK流程化进行SNP calling


    在进行变异检测时,以群体基因组重测序数据为例,涉及到的个体基本都是上百个,而其中大多数流程均是重复的步骤。
    本文将基于GATK进行SNP calling的流程写入循环,便于批量分析。
    在这里插入图片描述

    1 涉及变量

    1.工作目录work_dir/
    2.参考基因组ref_genome.fa
    3.Reads列表read_list.txt
    4.测序平台Illumina
    5.调用线程数

    2 调用数据

    1.参考基因组ref_genome.fa
    2.重测序数据sample1/sample1_1.fq.gzsample1/sample1_2.fq.gz……
    3.Reads列表:read_list.txt
    生成方法:预先将存放各个个体Reads的文件夹放入一个文件夹work_dir/然后使用下列命令生成:

    ls work_dir/ > read_list.txt
    
    • 1

    3 主要脚本

    usage:

    bash GATK_pipeline.sh work_dir/ ref_genome.fa read_list.txt Illumina 10
    
    • 1

    GATK_pipeline.sh

    
    #---------------------------------------------------------------#
    #                objection defined by user                      #
    #---------------------------------------------------------------#
    
    set -au
    
    # 1.
    # Master dir.:
    WORK_dir=$1
    
    # 2.
    # Reference genome:
    REF=$2
    
    # 3.
    # Read list:
    READ_list=$3
    
    # 4.
    # Seqencing platform:
    PL=$4
    
    # 5.
    # number of threads:
    NT=$5
    
    #---------------------------------------------------------------#
    #         main loop for SNPs calling by gatk pipeline           #
    #---------------------------------------------------------------#
    
    #READ_list.txt is a list of read groups.
    while read -r READ
    
    do
    
    SAMPLE=SM_${READ}
    ID=${READ}
    READ1="${WORK_dir}${READ}_1.fq"
    READ2="${WORK_dir}${READ}_2.fq"
    OUT="${READ}"
    
    #1.
    #Alignning reads to reference genome by BWA-MEM2-mem, producing a .sam data
    bwa-mem2 \
    	mem \
    	-M \
    	-t ${NT} \
    	-R "@RG\tID:${ID}\tSM:${SAMPLE}\tPL:${PL}" \
    	${REF} \
    	${READ1} \
    	${READ2} \
    	> ${OUT}.sam
    
    #2.
    #Sorting .sam by gatk-SortSam, producing a .bam data
    gatk \
    	SortSam \
    	-I ${OUT}.sam \
    	-O ${OUT}.bam \
    	-SO coordinate \
    	-VALIDATION_STRINGENCY LENIENT \
    	-CREATE_INDEX true \
    	-TMP_DIR ./${OUT}tmp.sort
    #3.
    #Marking dupulications in .bam by gatk-MarkDuplicates
    #producing a .dup.bam and .dup.txt data
    gatk \
    	MarkDuplicates \
    	-I ${OUT}.bam \
    	-O ${OUT}.dup.bam \
    	-M ${OUT}.dup.txt \
    	-REMOVE_DUPLICATES true \
    	-VALIDATION_STRINGENCY LENIENT \
    	-CREATE_INDEX true \
    	-TMP_DIR ${OUT}tmp.dup
    
    #4.
    #QC by samtools-flagstat, producing a .dup.bam.stat data
    samtools \
    	flagstat \
    	${OUT}.dup.bam \
    	> ${OUT}.dup.bam.stat
    
    #5.
    #Calling SNPs by gatk-HaplotypeCaller, producing a .dup.vcf data
    gatk \
    	HaplotypeCaller \
    	-R ${REF} \
    	-I ${OUT}.dup.bam \
    	-O ${OUT}.dup.vcf
    
    done < $READ_list
    ##
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    • 13
    • 14
    • 15
    • 16
    • 17
    • 18
    • 19
    • 20
    • 21
    • 22
    • 23
    • 24
    • 25
    • 26
    • 27
    • 28
    • 29
    • 30
    • 31
    • 32
    • 33
    • 34
    • 35
    • 36
    • 37
    • 38
    • 39
    • 40
    • 41
    • 42
    • 43
    • 44
    • 45
    • 46
    • 47
    • 48
    • 49
    • 50
    • 51
    • 52
    • 53
    • 54
    • 55
    • 56
    • 57
    • 58
    • 59
    • 60
    • 61
    • 62
    • 63
    • 64
    • 65
    • 66
    • 67
    • 68
    • 69
    • 70
    • 71
    • 72
    • 73
    • 74
    • 75
    • 76
    • 77
    • 78
    • 79
    • 80
    • 81
    • 82
    • 83
    • 84
    • 85
    • 86
    • 87
    • 88
    • 89
    • 90
    • 91
    • 92
    • 93
    • 94
  • 相关阅读:
    职业成功指南:10条核心原则(上)丨三叠云
    Oracle数据库---JDBC连接
    Redis key基本使用
    Spark3.x入门到精通-阶段一(入门&yarn集群&java和scale双语开发)
    48.【C++map映射】
    相似度loss汇总,pytorch code
    idea 常用插件和常用快捷键 - 记录
    排序算法-冒泡排序
    考虑车轮纵向滑动的无人自行车平衡控制实现
    A Recommendation for interface-based programming
  • 原文地址:https://blog.csdn.net/SUN5_The_answer/article/details/134501802