• 开箱即用版本 满分室间质评之GATK Somatic SNV+Indel+CNV+SV


    最近准备为sliverworkspace 图形化生信平台开发报告设计器,需要一个较为复杂的pipeline作为测试数据,就想起来把之前的 满分室间质评之GATK Somatic SNV+Indel+CNV+SV(下)性能优化翻出来用一下。跑了一遍发现还是各种问题,于是想把pipeline改造成免部署、首次运行初始化环境的版本,以便需要时候能够直接运行起来,于是有了本文。

    一句话描述就是:开箱即用的pipeline,能够自动初始化环境、安装所需软件、下载ref文件和数据库的版本

    为了让pipeline能够直接运行,无需部署,这里使用docker容器保证运行环境的一致性:见文章:基于docker的生信基础环境镜像构建,这里采用的方案是带ssh服务的docker+conda环境,整个pipeline在一个通用容器中运行。

    本文代码较长,可能略复杂,想直接运行的可以下载 workflow文件,导入sliverworkspace图形化生信平台直接运行。

    相关代码和workflow文件可以访问笔者github项目地址或这gitee项目地址

    导入操作

    在这里插入图片描述

    分析流程整体概览

    在这里插入图片描述

    docker 镜像拉取及部署配置

    # 拉取docker镜像
    docker     pull     doujiangbaozi/sliverworkspace:1.10
    
    • 1
    • 2

    docker-compose.yml配置文件

    version: "3"
    services:
      GATK:
        image: doujiangbaozi/sliverworkspace:1.10
        container_name: GATK
        volumes:
          - /home/sliver/Data/data:/opt/data:rw                                #挂载输入数据目录
          - /home/sliver/Manufacture/gatk/envs:/root/mambaforge-pypy3/envs     #挂载envs目录
          - /home/sliver/Manufacture/sliver/ref:/opt/ref:rw                    #挂载reference目录
          - /home/sliver/Manufacture/gatk/result:/opt/result:rw                #挂载中间文件和结果目录
        environment:
          - TZ=Asia/Shanghai                                                   #设置时区
          - PS=20191124                                                        #设置ssh密码
          - PT=9024                                                            #设置ssh连接端口
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    • 13
    • 14

    docker 镜像部署运行

    # 在docker-compose.yml文件同级目录下运行
    docker-compose up -d
    
    # 或者docker-compose -f docker-compose.yml所在目录
    docker-compose -f somedir/docker-compose.yml up -d
    
    # 可以通过ssh连接到docker 运行pipeline命令了,连接端口和密码见docker-compose.yml配置文件相关字段
    ssh root@127.0.0.1 -p9024
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8

    测试数据

    测试数据来自2017年卫计委室间质评提供的bed文件(pipeline会自动下载)和测试数据,修改命名以匹配pipeline输入端,也可以替换为自己的数据文件,因为室间质评目前参考基因组还停留在hg19版本,所以本流程仍然使用hg19(GRCH37),有需要的可以自行替换。后期会提供hg38(GRCH38)版本

    文件名(按照需要有调整)文件大小MD5
    B1701_R1.fq.gz4.85G07d3cdccee41dbb3adf5d2e04ab28e5b
    B1701_R2.fq.gz4.77Gc2aa4a8ab784c77423e821b9f7fb00a7
    B1701NC_R1.fq.gz3.04G4fc21ad05f9ca8dc93d2749b8369891b
    B1701NC_R2.fq.gz3.11Gbc64784f2591a27ceede1727136888b9

    变量名称

    # 变量初始化赋值
      sn=1701                               #样本编号
      pn=GS03                               #pipeline 代号
      version_openjdk=8.0.332               #java openjdk 版本                         
      version_cnvkit=0.9.10                 #cnvkit 版本
      version_manta=1.6.0                   #manta 版本
      version_gatk=4.3.0.0                  #gatk 版本                                 
      version_sambamba=1.0.1                #sambamba 版本                             
      version_samtools=1.17                 #samtools 版本                             
      version_bwa=0.7.17                    #bwa 版本                                  
      version_fastp=0.23.2                  #fastp 版本                                
      version_vep=108                       #vep 版本                                  
      envs=/root/mambaforge-pypy3/envs	    #mamba envs 目录                           
      threads=32                         	#最大线程数                                   
      memory=32G                        	#内存占用                                    
      scatter=8                          	#Mutect2 分拆并行运行interval list 份数          
      event=2                          	    #gatk FilterMutectCalls --max-events-in-region 数值
      snv_tlod=16.00                      	#snv 过滤参数 tload 值                        
      snv_vaf=0.01                       	#snv 过滤参数 丰度/突变频率                        
      snv_depth=500                        	#snv 过滤参数 支持reads数/depth 测序深度            
      cnv_dep=1000                       	#cnv 过滤参数 支持reads数/depth 测序深度            
      cnv_min=-0.5                       	#cnv 过滤参数 log2最小值                        
      cnv_max=0.5                        	#cnv 过滤参数 log2 最大值                       
      sv_score=200                        	#sv  过滤参数 score                           
    
    # 以上变量个可以具体根据需求调整
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    • 13
    • 14
    • 15
    • 16
    • 17
    • 18
    • 19
    • 20
    • 21
    • 22
    • 23
    • 24
    • 25
    • 26

    表格:

    变量名变量值备注
    sn1701样本编号
    pnGS03pipeline 代号 GATK Somatic 03版本
    version_openjdk8.0.332java openjdk 版本
    version_cnvkit0.9.10cnvkit 版本
    version_manta1.6.0manta版本
    version_gatk4.3.0.0gatk 版本
    version_sambamba1.0.1sambamba 版本
    version_samtools1.17samtools 版本
    version_bwa0.7.17bwa 版本
    version_fastp0.23.2fastp 版本
    version_vep108vep 版本
    envs/root/mambaforge-pypy3/envsmamba envs 目录
    threads32最大线程数
    memory32G内存占用
    scatter8Mutect2 分拆并行运行interval list 份数
    event2gatk FilterMutectCalls --max-events-in-region 数值
    snv_tlod16.00snv 过滤参数 tload 值
    snv_vaf0.01snv 过滤参数 丰度/突变频率
    snv_depth500snv 过滤参数 支持reads数/depth 测序深度
    cnv_dep1000cnv 过滤参数 支持reads数/depth 测序深度
    cnv_min-0.5cnv 过滤参数 log2最小值
    cnv_max0.5cnv 过滤参数 log2 最大值
    sv_score200sv 过滤参数 score

    Pipeline/workflow 具体步骤:

    1. fastp 默认参数过滤,看下原始数据质量,clean data

      #conda检测环境是否存在,首次运行不存在创建该环境并安装软件
      if [ ! -d "${envs}/${pn}.fastp" ]; then
        echo "Creating the environment ${pn}.fastp"
        mamba create -n ${pn}.fastp -y fastp=${version_fastp}
      fi
      
      mamba	activate ${pn}.fastp
      
      mkdir -p ${result}/${sn}/trimmed
      
      fastp -w 16 \
      	-i ${data}/GS03/${sn}_tumor_R1.fq.gz  \
          -I ${data}/GS03/${sn}_tumor_R2.fq.gz  \
          -o ${result}/${sn}/trimmed/${sn}_tumor_R1_trimmed.fq.gz \
          -O ${result}/${sn}/trimmed/${sn}_tumor_R2_trimmed.fq.gz \
          -h ${result}/${sn}/trimmed/${sn}_tumor_fastp.html \
          -j ${result}/${sn}/trimmed/${sn}_tumor_fastp.json &
      
      fastp -w 16 \
      	-i ${data}/GS03/${sn}_normal_R1.fq.gz  \
          -I ${data}/GS03/${sn}_normal_R2.fq.gz  \
          -o ${result}/${sn}/trimmed/${sn}_normal_R1_trimmed.fq.gz \
          -O ${result}/${sn}/trimmed/${sn}_normal_R2_trimmed.fq.gz \
          -h ${result}/${sn}/trimmed/${sn}_normal_fastp.html \
          -j ${result}/${sn}/trimmed/${sn}_normal_fastp.json &
      wait
      
      mamba	deactivate
      
      • 1
      • 2
      • 3
      • 4
      • 5
      • 6
      • 7
      • 8
      • 9
      • 10
      • 11
      • 12
      • 13
      • 14
      • 15
      • 16
      • 17
      • 18
      • 19
      • 20
      • 21
      • 22
      • 23
      • 24
      • 25
      • 26
      • 27
      • 28
    2. normal文件fastq比对到参考基因组,sort 排序,mark duplicate 得到 marked.bam

      #conda检测环境是否存在,首次运行不存在创建该环境并安装软件
      if [ ! -d "${envs}/${pn}.align" ]; then
        mamba create -n ${pn}.align -y bwa=${version_bwa} samtools=${version_samtools} 
      fi
      
      #从github下载sambamba static 比 mamba 安装的版本速度快1倍以上.这是个很诡异的地方
      if [ ! -f "${envs}/${pn}.align/bin/sambamba" ]; then
      	aria2c https://github.com/biod/sambamba/releases/download/v${version_sambamba}/sambamba-${version_sambamba}-linux-amd64-static.gz -d ${envs}/${pn}.align/bin
      	gzip -cdf ${envs}/${pn}.align/bin/sambamba-${version_sambamba}-linux-amd64-static.gz  >  ${envs}/${pn}.align/bin/sambamba 
      	chmod a+x ${envs}/${pn}.align/bin/sambamba
      fi
      
      mamba	activate ${pn}.align
      
      mkdir	-p /opt/ref/hg19
      #如果没有检测到参考基因组序列,则下载序列并使用bwa创建索引
      if [ ! -f "/opt/ref/hg19/hg19.fasta" ]; then
      	aria2c -x 16 -s 32 ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg19/ucsc.hg19.fasta.gz -d /opt/ref/hg19 -o hg19.fasta.gz
      	cd /opt/ref/hg19 && gzip -d /opt/ref/hg19/hg19.fasta.gz
      	if  [ ! -f /opt/ref/hg19.fasta.amb ] ||
      		[ ! -f /opt/ref/hg19/hg19.fasta.ann ] ||
      		[ ! -f /opt/ref/hg19/hg19.fasta.bwt ] ||
      		[ ! -f /opt/ref/hg19/hg19.fasta.pac ] ||
          	[ ! -f /opt/ref/hg19/hg19.fasta.sa ]; then
          	bwa index /opt/ref/hg19/hg19.fasta
      	fi
      elif [ -f "/opt/ref/hg19/ucsc.hg19.fasta.gz.aria2" ]; then
      		aria2c -x 16 -s 32 ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg19/ucsc.hg19.fasta.gz -c -d /opt/ref/hg19 -o hg19.fasta.gz
      		cd /opt/ref/hg19 && gzip -d /opt/ref/hg19/hg19.fasta.gz
      	if  [ ! -f /opt/ref/hg19.fasta.amb ] ||
      		[ ! -f /opt/ref/hg19/hg19.fasta.ann ] ||
      		[ ! -f /opt/ref/hg19/hg19.fasta.bwt ] ||
      		[ ! -f /opt/ref/hg19/hg19.fasta.pac ] ||
          	[ ! -f /opt/ref/hg19/hg19.fasta.sa ]; then
          	bwa index /opt/ref/hg19/hg19.fasta
      	fi
      fi
      #检测samtools索引是否存在,如不存在则使用samtools创建参考基因组索引
      if [ ! -f "/opt/ref/hg19/hg19.fasta.fai" ]; then
      	samtools faidx /opt/ref/hg19/hg19.fasta
      fi
      
      
      mkdir -p ${result}/${sn}/aligned
      #比对基因组管道输出成bam文件,管道输出排序
      bwa mem \
          -t ${threads} -M \
          -R "@RG\\tID:${sn}_normal\\tLB:${sn}_normal\\tPL:Illumina\\tPU:Miseq\\tSM:${sn}_normal" \
          /opt/ref/hg19/hg19.fasta  ${result}/${sn}/trimmed/${sn}_normal_R1_trimmed.fq.gz ${result}/${sn}/trimmed/${sn}_normal_R2_trimmed.fq.gz \
          | sambamba view -S -f bam -l 0 /dev/stdin \
          | sambamba sort -t ${threads} -m 2G --tmpdir=${result}/${sn}/aligned -o ${result}/${sn}/aligned/${sn}_normal_sorted.bam /dev/stdin
      
      #防止linux打开文件句柄数超过限制,报错退出
      ulimit -n 10240
      
      #使用sambamba对sorted bam文件标记重复
      sambamba markdup \
      	--tmpdir ${result}/${sn}/aligned \
      	-t ${threads} ${result}/${sn}/aligned/${sn}_normal_sorted.bam ${result}/${sn}/aligned/${sn}_normal_marked.bam 
      
      #修改marked bam文件索引名,gatk和sambamba索引文件名需要保持一致
      mv  ${result}/${sn}/aligned/${sn}_normal_marked.bam.bai ${result}/${sn}/aligned/${sn}_normal_marked.bai
      #删除sorted bam文件
      rm -f ${result}/${sn}/aligned/${sn}_normal_sorted.bam*
      
      mamba	deactivate
      
      • 1
      • 2
      • 3
      • 4
      • 5
      • 6
      • 7
      • 8
      • 9
      • 10
      • 11
      • 12
      • 13
      • 14
      • 15
      • 16
      • 17
      • 18
      • 19
      • 20
      • 21
      • 22
      • 23
      • 24
      • 25
      • 26
      • 27
      • 28
      • 29
      • 30
      • 31
      • 32
      • 33
      • 34
      • 35
      • 36
      • 37
      • 38
      • 39
      • 40
      • 41
      • 42
      • 43
      • 44
      • 45
      • 46
      • 47
      • 48
      • 49
      • 50
      • 51
      • 52
      • 53
      • 54
      • 55
      • 56
      • 57
      • 58
      • 59
      • 60
      • 61
      • 62
      • 63
      • 64
      • 65
      • 66
    3. tumor文件fastq比对到参考基因组,sort 排序,mark duplicate 得到 marked.bam,与第2步类似

      if [ ! -d "${envs}/${pn}.align" ]; then
        mamba create -n ${pn}.align -y bwa=${version_bwa} samtools=${version_samtools} 
      fi
      
      #从github下载sambamba static 比 mamba 安装的版本速度快1倍以上.
      if [ ! -f "${envs}/${pn}.align/bin/sambamba" ]; then
      	aria2c https://github.com/biod/sambamba/releases/download/v${version_sambamba}/sambamba-${version_sambamba}-linux-amd64-static.gz -d ${envs}/${pn}.align/bin
      	gzip -cdf ${envs}/${pn}.align/bin/sambamba-${version_sambamba}-linux-amd64-static.gz  >  ${envs}/${pn}.align/bin/sambamba 
      	chmod a+x ${envs}/${pn}.align/bin/sambamba
      fi
      
      mamba	activate ${pn}.align
      
      mkdir	-p /opt/ref/hg19
      
      if [ ! -f "/opt/ref/hg19/hg19.fasta" ]; then
      	aria2c -x 16 -s 32 ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg19/ucsc.hg19.fasta.gz -d /opt/ref/hg19 -o hg19.fasta.gz
      	cd /opt/ref/hg19 && gzip -d /opt/ref/hg19/hg19.fasta.gz
      	if  [ ! -f /opt/ref/hg19.fasta.amb ] ||
      		[ ! -f /opt/ref/hg19/hg19.fasta.ann ] ||
      		[ ! -f /opt/ref/hg19/hg19.fasta.bwt ] ||
      		[ ! -f /opt/ref/hg19/hg19.fasta.pac ] ||
          	[ ! -f /opt/ref/hg19/hg19.fasta.sa ]; then
          	bwa index /opt/ref/hg19/hg19.fasta
      	fi
      elif [ -f "/opt/ref/hg19/ucsc.hg19.fasta.gz.aria2" ]; then
      	aria2c -x 16 -s 32 ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg19/ucsc.hg19.fasta.gz -c -d /opt/ref/hg19 -o hg19.fasta.gz
      	cd /opt/ref/hg19 && gzip -d /opt/ref/hg19/hg19.fasta.gz
      	if  [ ! -f /opt/ref/hg19.fasta.amb ] ||
      		[ ! -f /opt/ref/hg19/hg19.fasta.ann ] ||
      		[ ! -f /opt/ref/hg19/hg19.fasta.bwt ] ||
      		[ ! -f /opt/ref/hg19/hg19.fasta.pac ] ||
          	[ ! -f /opt/ref/hg19/hg19.fasta.sa ]; then
          	bwa index /opt/ref/hg19/hg19.fasta
      	fi
      fi
      
      if [ ! -f "/opt/ref/hg19/hg19.fasta.fai" ]; then
      	samtools faidx /opt/ref/hg19/hg19.fasta
      fi
      
      
      mkdir	-p ${result}/${sn}/aligned
      
      bwa	mem \
          -t ${threads} -M \
          -R "@RG\\tID:${sn}_tumor\\tLB:${sn}_tumor\\tPL:Illumina\\tPU:Miseq\\tSM:${sn}_tumor" \
          /opt/ref/hg19/hg19.fasta  ${result}/${sn}/trimmed/${sn}_tumor_R1_trimmed.fq.gz ${result}/${sn}/trimmed/${sn}_tumor_R2_trimmed.fq.gz \
          | sambamba view -S -f bam -l 0 /dev/stdin \
          | sambamba sort -t ${threads} -m 2G --tmpdir=${result}/${sn}/aligned -o ${result}/${sn}/aligned/${sn}_tumor_sorted.bam /dev/stdin
      ulimit -n 10240
      sambamba  markdup \
      	--tmpdir ${result}/${sn}/aligned \
      	-t ${threads} ${result}/${sn}/aligned/${sn}_tumor_sorted.bam ${result}/${sn}/aligned/${sn}_tumor_marked.bam
      mv  ${result}/${sn}/aligned/${sn}_tumor_marked.bam.bai ${result}/${sn}/aligned/${sn}_tumor_marked.bai
      rm  -f ${result}/${sn}/aligned/${sn}_tumor_sorted.bam*
      
      mamba	deactivate
      
      • 1
      • 2
      • 3
      • 4
      • 5
      • 6
      • 7
      • 8
      • 9
      • 10
      • 11
      • 12
      • 13
      • 14
      • 15
      • 16
      • 17
      • 18
      • 19
      • 20
      • 21
      • 22
      • 23
      • 24
      • 25
      • 26
      • 27
      • 28
      • 29
      • 30
      • 31
      • 32
      • 33
      • 34
      • 35
      • 36
      • 37
      • 38
      • 39
      • 40
      • 41
      • 42
      • 43
      • 44
      • 45
      • 46
      • 47
      • 48
      • 49
      • 50
      • 51
      • 52
      • 53
      • 54
      • 55
      • 56
      • 57
      • 58
    4. 对上述bam文件生成重新校准表,为后续BQSR使用;Generates recalibration table for Base Quality Score Recalibration (BQSR)

      #conda检测环境是否存在,首次运行不存在创建该环境并安装软件
      if [ ! -f "${envs}/gatk/bin/gatk" ]; then
      	mkdir -p ${envs}/gatk/bin
      	#替代下载地址
      	#https://github.com/broadinstitute/gatk/releases/download/${version_gatk}/gatk-${version_gatk}.zip
      	if [ -f ${envs}/gatk/bin/gatk.zip.aria2 ]; then
      		aria2c -x 16 -s 32 https://download.yzuu.cf/broadinstitute/gatk/releases/download/${version_gatk}/gatk-${version_gatk}.zip -c -d ${envs}/gatk/bin -o gatk.zip
      	else 
      		aria2c -x 16 -s 32 https://download.yzuu.cf/broadinstitute/gatk/releases/download/${version_gatk}/gatk-${version_gatk}.zip -d ${envs}/gatk/bin -o gatk.zip
      	fi
      	apt-get install -y unzip
      	cd ${envs}/gatk/bin 
      	unzip -o gatk.zip 
      	mv ${envs}/gatk/bin/gatk-${version_gatk}/* ${envs}/gatk/bin/
      	rm -rf ${envs}/gatk/bin/gatk-${version_gatk}
      	#chmod +x ${envs}/bin/gatk
      	cd ${result}
      fi
      
      if [ ! -x "$(command -v python)" ]; then
      	mamba env create -f ${envs}/gatk/bin/gatkcondaenv.yml
      fi
      
      if [ ! -x "$(command -v java)" ]; then
      	mamba install -y openjdk=${version_openjdk}
      fi
      
      if [ ! -x "$(command -v tabix)" ]; then
      	mamba install -y tabix
      fi
      
      mamba activate gatk
      
      #这里有个巨坑,broadinstitute ftp站点bundle目录提供的hg19版本参考文件,默认格式运行会报错,提示没有索引,使用gatk创建索引仍然报错,其实是gz格式需要使用bgzip重新压缩并且使用tabix创建索引才行
      if [ ! -f "/opt/ref/hg19/dbsnp_138.hg19.vcf.gz" ]; then
      	aria2c -x 16 -s 32 ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg19/dbsnp_138.hg19.vcf.gz -d /opt/ref/hg19
      	gzip -k -f -d /opt/ref/hg19/dbsnp_138.hg19.vcf.gz > /opt/ref/hg19/dbsnp_138.hg19.vcf
      	bgzip -f --threads ${threads} /opt/ref/hg19/dbsnp_138.hg19.vcf > /opt/ref/hg19/dbsnp_138.hg19.vcf.gz
      	tabix -f /opt/ref/hg19/dbsnp_138.hg19.vcf.gz
      else 
      	if [ -f "/opt/ref/hg19/dbsnp_138.hg19.vcf.gz.aria2" ]; then
      		echo 'download continue...'
      		aria2c -x 16 -s 32 ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg19/dbsnp_138.hg19.vcf.gz -c -d /opt/ref/hg19
      	fi
      	if [ ! -f "/opt/ref/hg19/dbsnp_138.hg19.vcf.gz.tbi" ]; then
      		gzip -k -f -d /opt/ref/hg19/dbsnp_138.hg19.vcf.gz > /opt/ref/hg19/dbsnp_138.hg19.vcf
      		bgzip -f --threads ${threads} /opt/ref/hg19/dbsnp_138.hg19.vcf > /opt/ref/hg19/dbsnp_138.hg19.vcf.gz
      		tabix -f /opt/ref/hg19/dbsnp_138.hg19.vcf.gz
      	fi
      fi
      
      if [ ! -f "/opt/ref/hg19/Mills_and_1000G_gold_standard.indels.hg19.sites.vcf.gz" ]; then
      	aria2c -x 16 -s 32 ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg19/Mills_and_1000G_gold_standard.indels.hg19.sites.vcf.gz -d /opt/ref/hg19
      	gzip -k -f -d /opt/ref/hg19/Mills_and_1000G_gold_standard.indels.hg19.sites.vcf.gz > /opt/ref/hg19/Mills_and_1000G_gold_standard.indels.hg19.sites.vcf
      	bgzip -f --threads ${threads} /opt/ref/hg19/Mills_and_1000G_gold_standard.indels.hg19.sites.vcf > /opt/ref/hg19/Mills_and_1000G_gold_standard.indels.hg19.sites.vcf.gz
      	tabix -f /opt/ref/hg19/Mills_and_1000G_gold_standard.indels.hg19.sites.vcf.gz
      else 
      	if [ -f "/opt/ref/hg19/Mills_and_1000G_gold_standard.indels.hg19.sites.vcf.gz.aria2" ]; then
      		echo 'download continue...'
      		aria2c -x 16 -s 32 ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg19/Mills_and_1000G_gold_standard.indels.hg19.sites.vcf.gz -c -d /opt/ref/hg19
      	fi
      	if [ ! -f "/opt/ref/hg19/Mills_and_1000G_gold_standard.indels.hg19.sites.vcf.gz.tbi" ]; then
      		gzip -k -f -d /opt/ref/hg19/Mills_and_1000G_gold_standard.indels.hg19.sites.vcf.gz > /opt/ref/hg19/Mills_and_1000G_gold_standard.indels.hg19.sites.vcf
      		bgzip -f --threads ${threads} /opt/ref/hg19/Mills_and_1000G_gold_standard.indels.hg19.sites.vcf > /opt/ref/hg19/Mills_and_1000G_gold_standard.indels.hg19.sites.vcf.gz
      		tabix -f /opt/ref/hg19/Mills_and_1000G_gold_standard.indels.hg19.sites.vcf.gz
      	fi
      fi
      
      if [ ! -f "/opt/ref/hg19/1000G_phase1.snps.high_confidence.hg19.sites.vcf.gz" ]; then
      	aria2c -x 16 -s 32 ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg19/1000G_phase1.snps.high_confidence.hg19.sites.vcf.gz -d /opt/ref/hg19
      	gzip -k -f -d /opt/ref/hg19/1000G_phase1.snps.high_confidence.hg19.sites.vcf.gz > /opt/ref/hg19/1000G_phase1.snps.high_confidence.hg19.sites.vcf
      	bgzip -f --threads ${threads} /opt/ref/hg19/1000G_phase1.snps.high_confidence.hg19.sites.vcf > /opt/ref/hg19/1000G_phase1.snps.high_confidence.hg19.sites.vcf.gz
      	tabix -f /opt/ref/hg19/1000G_phase1.snps.high_confidence.hg19.sites.vcf.gz
      else 
      	if [ -f "/opt/ref/hg19/1000G_phase1.snps.high_confidence.hg19.sites.vcf.gz.aria2" ]; then
      		echo 'download continue...'
      		aria2c -x 16 -s 32 ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg19/1000G_phase1.snps.high_confidence.hg19.sites.vcf.gz -c -d /opt/ref/hg19
      	fi
      	if [ ! -f "/opt/ref/hg19/1000G_phase1.snps.high_confidence.hg19.sites.vcf.gz.tbi" ]; then
      		gzip -k -f -d /opt/ref/hg19/1000G_phase1.snps.high_confidence.hg19.sites.vcf.gz > /opt/ref/hg19/1000G_phase1.snps.high_confidence.hg19.sites.vcf
      		bgzip -f --threads ${threads} /opt/ref/hg19/1000G_phase1.snps.high_confidence.hg19.sites.vcf > /opt/ref/hg19/1000G_phase1.snps.high_confidence.hg19.sites.vcf.gz
      		tabix -f /opt/ref/hg19/1000G_phase1.snps.high_confidence.hg19.sites.vcf.gz
      	fi
      fi
      
      if [ ! -f "/opt/ref/hg19/1000G_phase1.indels.hg19.sites.vcf.gz" ]; then
      	aria2c -x 16 -s 32 ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg19/1000G_phase1.indels.hg19.sites.vcf.gz -d /opt/ref/hg19
      	gzip -k -f -d /opt/ref/hg19/1000G_phase1.indels.hg19.sites.vcf.gz > /opt/ref/hg19/1000G_phase1.indels.hg19.sites.vcf
      	bgzip -f --threads ${threads} /opt/ref/hg19/1000G_phase1.indels.hg19.sites.vcf > /opt/ref/hg19/1000G_phase1.indels.hg19.sites.vcf.gz
      	tabix -f /opt/ref/hg19/1000G_phase1.indels.hg19.sites.vcf.gz
      else 
      	if [ -f "/opt/ref/hg19/1000G_phase1.indels.hg19.sites.vcf.gz.aria2" ]; then
      		echo 'download continue...'
      		aria2c -x 16 -s 32 ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg19/1000G_phase1.indels.hg19.sites.vcf.gz -c -d /opt/ref/hg19
      	fi
      	if [ ! -f "/opt/ref/hg19/1000G_phase1.indels.hg19.sites.vcf.gz.tbi" ]; then
      		gzip -k -f -d /opt/ref/hg19/1000G_phase1.indels.hg19.sites.vcf.gz > /opt/ref/hg19/1000G_phase1.indels.hg19.sites.vcf
      		bgzip -f --threads ${threads} /opt/ref/hg19/1000G_phase1.indels.hg19.sites.vcf > /opt/ref/hg19/1000G_phase1.indels.hg19.sites.vcf.gz
      		tabix -f /opt/ref/hg19/1000G_phase1.indels.hg19.sites.vcf.gz
      	fi
      fi
      #创建参考序列hg19的dict字典文件
      if [ ! -f "/opt/ref/hg19/hg19.dict" ]; then
      	gatk CreateSequenceDictionary -R /opt/ref/hg19/hg19.fasta -O /opt/ref/hg19/hg19.dict
      fi
      #根据下载的Illumina_pt2.bed 文件创建interval list文件,坐标转换,其实坐标0修改为1
      if [ ! -f "/opt/ref/hg19/Illumina_pt2.interval_list" ]; then
      	#sed 's/chr//; s/\t/ /g' /opt/ref/hg19/Illumina_pt2.bed > /opt/ref/hg19/Illumina_pt2.processed.bed
      	mkdir  -p /opt/ref/hg19
      	rm     -f /opt/ref/hg19/Illumina_pt2.bed
      	
      	aria2c  https://raw.fgit.cf/doujiangbaozi/sliverworkspace-util/main/somatic/projects/Illumina_pt2.bed -d /opt/ref/hg19
      	#aria2c https://raw.githubusercontent.com/doujiangbaozi/sliverworkspace-util/main/somatic/projects/Illumina_pt2.bed -d /opt/ref/hg19
      	gatk BedToIntervalList \
      		-I	/opt/ref/hg19/Illumina_pt2.bed \
      		-SD	/opt/ref/hg19/hg19.dict \
      		-O	/opt/ref/hg19/Illumina_pt2.interval_list
      fi
      
      mkdir -p ${result}/${sn}/recal
      
      gatk BaseRecalibrator \
              --known-sites /opt/ref/hg19/dbsnp_138.hg19.vcf.gz \
              --known-sites /opt/ref/hg19/Mills_and_1000G_gold_standard.indels.hg19.sites.vcf.gz \
      		--known-sites /opt/ref/hg19/1000G_phase1.indels.hg19.sites.vcf.gz \
              -L /opt/ref/hg19/Illumina_pt2.interval_list \
              -R /opt/ref/hg19/hg19.fasta \
              -I ${result}/${sn}/aligned/${sn}_tumor_marked.bam \
              -O ${result}/${sn}/recal/${sn}_tumor_recal.table &
              
      gatk BaseRecalibrator \
              --known-sites /opt/ref/hg19/dbsnp_138.hg19.vcf.gz \
              --known-sites /opt/ref/hg19/Mills_and_1000G_gold_standard.indels.hg19.sites.vcf.gz \
      		--known-sites /opt/ref/hg19/1000G_phase1.indels.hg19.sites.vcf.gz \
              -L /opt/ref/hg19/Illumina_pt2.interval_list \
              -R /opt/ref/hg19/hg19.fasta \
              -I ${result}/${sn}/aligned/${sn}_normal_marked.bam \
              -O ${result}/${sn}/recal/${sn}_normal_recal.table &
      
      wait
      
      mamba deactivate
      
      • 1
      • 2
      • 3
      • 4
      • 5
      • 6
      • 7
      • 8
      • 9
      • 10
      • 11
      • 12
      • 13
      • 14
      • 15
      • 16
      • 17
      • 18
      • 19
      • 20
      • 21
      • 22
      • 23
      • 24
      • 25
      • 26
      • 27
      • 28
      • 29
      • 30
      • 31
      • 32
      • 33
      • 34
      • 35
      • 36
      • 37
      • 38
      • 39
      • 40
      • 41
      • 42
      • 43
      • 44
      • 45
      • 46
      • 47
      • 48
      • 49
      • 50
      • 51
      • 52
      • 53
      • 54
      • 55
      • 56
      • 57
      • 58
      • 59
      • 60
      • 61
      • 62
      • 63
      • 64
      • 65
      • 66
      • 67
      • 68
      • 69
      • 70
      • 71
      • 72
      • 73
      • 74
      • 75
      • 76
      • 77
      • 78
      • 79
      • 80
      • 81
      • 82
      • 83
      • 84
      • 85
      • 86
      • 87
      • 88
      • 89
      • 90
      • 91
      • 92
      • 93
      • 94
      • 95
      • 96
      • 97
      • 98
      • 99
      • 100
      • 101
      • 102
      • 103
      • 104
      • 105
      • 106
      • 107
      • 108
      • 109
      • 110
      • 111
      • 112
      • 113
      • 114
      • 115
      • 116
      • 117
      • 118
      • 119
      • 120
      • 121
      • 122
      • 123
      • 124
      • 125
      • 126
      • 127
      • 128
      • 129
      • 130
      • 131
      • 132
      • 133
      • 134
      • 135
      • 136
      • 137
      • 138
      • 139
      • 140
      • 141
      • 142
    5. 使用校准表对bam碱基质量校准,因为这一步gatk效率感人,所以同时计算insertsize,拆分interval list(后续mutect2并行运行需要),运行cnvkit batch,运行samtools depth计算测序深度,samtools flagstat 统计mapping比例及质量

      mkdir -p ${result}/${sn}/bqsr
      mkdir -p ${result}/${sn}/stat
      mkdir -p ${result}/${sn}/cnv
      mkdir -p ${result}/${sn}/interval
      
      mamba activate gatk
      
      gatk ApplyBQSR \
      	--bqsr-recal-file ${result}/${sn}/recal/${sn}_tumor_recal.table \
      	-L /opt/ref/hg19/Illumina_pt2.interval_list \
      	-R /opt/ref/hg19/hg19.fasta \
          -I ${result}/${sn}/aligned/${sn}_tumor_marked.bam \
          -O ${result}/${sn}/bqsr/${sn}_tumor_bqsr.bam &
      
      
      gatk ApplyBQSR \
          --bqsr-recal-file ${result}/${sn}/recal/${sn}_normal_recal.table \
      	-L /opt/ref/hg19/Illumina_pt2.interval_list \
      	-R /opt/ref/hg19/hg19.fasta \
          -I ${result}/${sn}/aligned/${sn}_normal_marked.bam \
          -O ${result}/${sn}/bqsr/${sn}_normal_bqsr.bam &
      
      
      gatk CollectInsertSizeMetrics \
        -I ${result}/${sn}/aligned/${sn}_tumor_marked.bam \
        -O ${result}/${sn}/stat/${sn}_tumor_insertsize_metrics.txt \
        -H ${result}/${sn}/stat/${sn}_tumor_insertsize_histogram.pdf &
      
      
      gatk CollectInsertSizeMetrics \
        -I ${result}/${sn}/aligned/${sn}_normal_marked.bam \
        -O ${result}/${sn}/stat/${sn}_normal_insertsize_metrics.txt \
        -H ${result}/${sn}/stat/${sn}_normal_insertsize_histogram.pdf &
      
      rm -f ${result}/${sn}/interval/*.interval_list
      gatk SplitIntervals \
      	-L /opt/ref/hg19/Illumina_pt2.interval_list \
      	-R /opt/ref/hg19/hg19.fasta \
          -O ${result}/${sn}/interval \
          --scatter-count ${scatter} &
      
      mamba deactivate
      
      if [ ! -d "${envs}/${pn}.cnvkit" ]; then
        	mamba create -n ${pn}.cnvkit -y cnvkit=${version_cnvkit}
      fi
      
      if [ ! -f "/opt/ref/hg19/refFlat.txt" ]; then
      	aria2c -x 16 -s 16 http://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/refFlat.txt.gz -d /opt/ref/hg19
      	cd /opt/ref/hg19 && gzip -d refFlat.txt.gz
      fi
      
      mamba activate ${pn}.cnvkit
      
      cnvkit.py batch \
      	${result}/${sn}/aligned/${sn}_tumor_marked.bam \
          --normal ${result}/${sn}/aligned/${sn}_normal_marked.bam \
          --method hybrid \
          --targets /opt/ref/hg19/Illumina_pt2.bed \
          --annotate /opt/ref/hg19/refFlat.txt \
          --output-reference ${result}/${sn}/cnv/${sn}_reference.cnn \
          --output-dir ${result}/${sn}/cnv/ \
          --diagram \
          -p ${threads} &
      
      mamba deactivate
      
      mamba activate ${pn}.align
      
      samtools depth -a -b /opt/ref/hg19/Illumina_pt2.bed  --threads ${threads} \
      	${result}/${sn}/aligned/${sn}_tumor_marked.bam > \
          ${result}/${sn}/stat/${sn}_tumor_marked.depth  &
      
      samtools depth -a -b /opt/ref/hg19/Illumina_pt2.bed  --threads ${threads} \
      	${result}/${sn}/aligned/${sn}_normal_marked.bam > \
      	${result}/${sn}/stat/${sn}_normal_marked.depth &
      
      samtools flagstat --threads ${threads} \
      	${result}/${sn}/aligned/${sn}_tumor_marked.bam  > \
          ${result}/${sn}/stat/${sn}_tumor_marked.flagstat   &
      
      samtools flagstat --threads ${threads} \
      	${result}/${sn}/aligned/${sn}_normal_marked.bam > \
          ${result}/${sn}/stat/${sn}_normal_marked.flagstat &
      	
      mamba deactivate
      
      wait
      
      • 1
      • 2
      • 3
      • 4
      • 5
      • 6
      • 7
      • 8
      • 9
      • 10
      • 11
      • 12
      • 13
      • 14
      • 15
      • 16
      • 17
      • 18
      • 19
      • 20
      • 21
      • 22
      • 23
      • 24
      • 25
      • 26
      • 27
      • 28
      • 29
      • 30
      • 31
      • 32
      • 33
      • 34
      • 35
      • 36
      • 37
      • 38
      • 39
      • 40
      • 41
      • 42
      • 43
      • 44
      • 45
      • 46
      • 47
      • 48
      • 49
      • 50
      • 51
      • 52
      • 53
      • 54
      • 55
      • 56
      • 57
      • 58
      • 59
      • 60
      • 61
      • 62
      • 63
      • 64
      • 65
      • 66
      • 67
      • 68
      • 69
      • 70
      • 71
      • 72
      • 73
      • 74
      • 75
      • 76
      • 77
      • 78
      • 79
      • 80
      • 81
      • 82
      • 83
      • 84
      • 85
      • 86
      • 87
      • 88
    6. 计算堆叠数据( pileup metrics )以便后续评估污染,也可以根据拆分的interval list并行处理,处理之后合并。

      #官方巨坑,默认提供的small_exac_common_3_b37.vcf.gz默认染色体坐标不是以chr开头而是数字
      mamba activate gatk
      #这里有个巨坑,从broadinstitute ftp 站点bundle Mutect2目录下载的参考文件,与同样下载的参考序列基因组坐标系不一致,参考基因组参考序列是chr1这种格式,这个af-only-gnomad是1,2,3这种格式,需要编写脚本处理
      if [ ! -f "/opt/ref/hg19/small_exac_common_3_b37.processed.vcf.gz" ]; then
      	if [ ! -f "/opt/ref/hg19/small_exac_common_3_b37.vcf.gz" ]; then
      		aria2c ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/Mutect2/GetPileupSummaries/small_exac_common_3_b37.vcf.gz -d /opt/ref/hg19
      	elif [ -f "/opt/ref/hg19/small_exac_common_3_b37.vcf.gz.aria2" ]; then
      		aria2c ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/Mutect2/GetPileupSummaries/small_exac_common_3_b37.vcf.gz -c -d /opt/ref/hg19
      	fi
      	if [ ! -f "${envs}/VcfProcessUtil.py" ]; then
      		aria2c https://raw.fgit.cf/doujiangbaozi/sliverworkspace-util/main/somatic/VcfProcessUtil.py -d ${envs}/
      		#aria2c https://raw.githubusercontent.com/doujiangbaozi/sliverworkspace-util/main/somatic/VcfProcessUtil.py -d ${envs}/
      		chmod a+x ${envs}/VcfProcessUtil.py
      	fi
      	${envs}/VcfProcessUtil.py \
      		-f /opt/ref/hg19/small_exac_common_3_b37.vcf.gz \
      		-o /opt/ref/hg19/small_exac_common_3_b37.processed.vcf
      	cd /opt/ref/hg19
      	bgzip -f --threads ${threads} small_exac_common_3_b37.processed.vcf
      	tabix -f small_exac_common_3_b37.processed.vcf.gz
      fi
      
      
      for i in `ls ${result}/${sn}/interval/*.interval_list`;
      do
          echo $i
          gatk GetPileupSummaries \
              -R /opt/ref/hg19/hg19.fasta \
              -I ${result}/${sn}/bqsr/${sn}_tumor_bqsr.bam \
              -O ${i%.*}-tumor-pileups.table \
              -V /opt/ref/hg19/small_exac_common_3_b37.processed.vcf.gz \
              -L $i \
              --interval-set-rule INTERSECTION &
          
          gatk GetPileupSummaries \
              -R /opt/ref/hg19/hg19.fasta \
              -I ${result}/${sn}/bqsr/${sn}_normal_bqsr.bam \
              -O ${i%.*}-normal-pileups.table \
              -V /opt/ref/hg19/small_exac_common_3_b37.processed.vcf.gz \
              -L $i \
              --interval-set-rule INTERSECTION &
          
      done
      wait
      
      
      
      tables=
      for i in `ls ${result}/${sn}/interval/*-tumor-pileups.table`;
      do
          tables="$tables -I $i"
      done
      
      gatk GatherPileupSummaries \
          --sequence-dictionary /opt/ref/hg19/hg19.dict \
          $tables \
          -O ${result}/${sn}/stat/${sn}_tumor_pileups.table
      
      nctables=
      for i in `ls ${result}/${sn}/interval/*-normal-pileups.table`;
      do
          nctables="$nctables -I $i"
      done
      
      gatk GatherPileupSummaries \
          --sequence-dictionary /opt/ref/hg19/hg19.dict \
          $nctables \
          -O ${result}/${sn}/stat/${sn}_normal_pileups.table
      	
      mamba deactivate
      
      • 1
      • 2
      • 3
      • 4
      • 5
      • 6
      • 7
      • 8
      • 9
      • 10
      • 11
      • 12
      • 13
      • 14
      • 15
      • 16
      • 17
      • 18
      • 19
      • 20
      • 21
      • 22
      • 23
      • 24
      • 25
      • 26
      • 27
      • 28
      • 29
      • 30
      • 31
      • 32
      • 33
      • 34
      • 35
      • 36
      • 37
      • 38
      • 39
      • 40
      • 41
      • 42
      • 43
      • 44
      • 45
      • 46
      • 47
      • 48
      • 49
      • 50
      • 51
      • 52
      • 53
      • 54
      • 55
      • 56
      • 57
      • 58
      • 59
      • 60
      • 61
      • 62
      • 63
      • 64
      • 65
      • 66
      • 67
      • 68
      • 69
      • 70
    7. 使用GetPileupSummaries计算结果评估跨样本污染,结果用于后面 FilterMutectCall 过滤Mutect2输出结果

      mamba activate gatk
      
      gatk CalculateContamination \
          -matched ${result}/${sn}/stat/${sn}_normal_pileups.table \
          -I ${result}/${sn}/stat/${sn}_tumor_pileups.table \
          -O ${result}/${sn}/stat/${sn}_contamination.table
      	
      mamba deactivate
      
      • 1
      • 2
      • 3
      • 4
      • 5
      • 6
      • 7
      • 8
    8. Mutect2 call 突变,使用拆分的interval list,结束后将结果合并;同时并行运行manta call sv突变

      mkdir -p ${result}/${sn}/sv
      mkdir -p ${result}/${sn}/snv
      
      mamba activate gatk
      #这里有个巨坑,从broadinstitute ftp 站点bundle Mutect2目录下载的参考文件,与同样下载的参考序列基因组坐标系不一致,参考基因组参考序列是chr1这种格式,这个af-only-gnomad是1,2,3这种格式,需要编写脚本处理;hg38貌似没有这个问题,hg19的数据都不维护了么?
      if [ ! -f "/opt/ref/hg19/af-only-gnomad.raw.sites.b37.processed.vcf.gz" ]; then
      	if [ ! -f "/opt/ref/hg19/af-only-gnomad.raw.sites.b37.vcf.gz" ]; then
      		aria2c -x 16 -s 32 ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/Mutect2/af-only-gnomad.raw.sites.b37.vcf.gz -d /opt/ref/hg19
      	elif [ -f "/opt/ref/hg19/af-only-gnomad.raw.sites.b37.vcf.gz.aria2" ]; then
      		aria2c -x 16 -s 32 ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/Mutect2/af-only-gnomad.raw.sites.b37.vcf.gz -c -d /opt/ref/hg19
      	fi
      	if [ ! -f "${envs}/VcfProcessUtil.py" ]; then
      		aria2c https://raw.fgit.cf/doujiangbaozi/sliverworkspace-util/main/somatic/VcfProcessUtil.py -d ${envs}/
      		#aria2c https://raw.githubusercontent.com/doujiangbaozi/sliverworkspace-util/main/somatic/VcfProcessUtil.py -d ${envs}/
      		chmod a+x ${envs}/VcfProcessUtil.py
      	fi
      	${envs}/VcfProcessUtil.py \
      		-f /opt/ref/hg19/af-only-gnomad.raw.sites.b37.vcf.gz \
      		-o /opt/ref/hg19/af-only-gnomad.raw.sites.b37.processed.vcf
      	cd /opt/ref/hg19
      	bgzip -f --threads ${threads} af-only-gnomad.raw.sites.b37.processed.vcf
      	tabix -f af-only-gnomad.raw.sites.b37.processed.vcf.gz
      fi
      
      
      if [ ! -f "/opt/ref/hg19/Illumina_pt2.bed.gz" ]; then
      	bgzip -f -c /opt/ref/hg19/Illumina_pt2.bed > /opt/ref/hg19/Illumina_pt2.bed.gz
      	tabix -f -p bed /opt/ref/hg19/Illumina_pt2.bed.gz
      else
      	if [ ! -f "/opt/ref/hg19/Illumina_pt2.bed.gz.tbi" ]; then
      		tabix -f -p bed /opt/ref/hg19/Illumina_pt2.bed.gz
      	fi
      fi
      mamba deactivate
      
      if [ ! -d "${envs}/${pn}.manta" ]; then
        mamba create -n ${pn}.manta -y manta=1.6.0
      fi
      
      mamba activate ${pn}.manta
      
      rm -f ${result}/${sn}/sv/runWorkflow.py*
      configManta.py  \
          --normalBam ${result}/${sn}/bqsr/${sn}_normal_bqsr.bam \
          --tumorBam  ${result}/${sn}/bqsr/${sn}_tumor_bqsr.bam \
          --referenceFasta /opt/ref/hg19/hg19.fasta \
          --exome \
          --callRegions /opt/ref/hg19/Illumina_pt2.bed.gz \
          --runDir ${result}/${sn}/sv
      
      rm -rf ${result}/${sn}/sv/workspace
      python ${result}/${sn}/sv/runWorkflow.py -m local -j ${threads} &
      
      mamba deactivate
      
      mamba activate gatk
      
      rm -f ${result}/${sn}/snv/vcf-file.list
      touch ${result}/${sn}/snv/vcf-file.list
      for i in `ls ${result}/${sn}/interval/*.interval_list`;
      do
         rm -f ${i%.*}_bqsr.vcf.gz
         gatk Mutect2 \
              -R /opt/ref/hg19/hg19.fasta \
              -I ${result}/${sn}/bqsr/${sn}_tumor_bqsr.bam  -tumor  ${sn}_tumor  \
              -I ${result}/${sn}/bqsr/${sn}_normal_bqsr.bam -normal ${sn}_normal \
              -L $i \
              -O ${i%.*}_bqsr.vcf.gz \
              --max-mnp-distance 10 \
              --germline-resource /opt/ref/hg19/af-only-gnomad.raw.sites.b37.processed.vcf.gz \
              --native-pair-hmm-threads ${threads} &
          echo ${i%.*}_bqsr.vcf.gz >> ${result}/${sn}/snv/vcf-file.list
      done
      wait
      
      rm -f ${result}/${sn}/snv/${sn}_bqsr.vcf.gz.stats
      stats=
      for z in `ls ${result}/${sn}/interval/*_bqsr.vcf.gz.stats`;
      do
          stats="$stats -stats $z"
      done
      
      gatk MergeMutectStats $stats \
      	-O ${result}/${sn}/snv/${sn}_bqsr.vcf.gz.stats
      
      gatk MergeVcfs \
          -I ${result}/${sn}/snv/vcf-file.list \
          -O ${result}/${sn}/snv/${sn}_bqsr.vcf.gz
      	
      mamba deactivate
      
      • 1
      • 2
      • 3
      • 4
      • 5
      • 6
      • 7
      • 8
      • 9
      • 10
      • 11
      • 12
      • 13
      • 14
      • 15
      • 16
      • 17
      • 18
      • 19
      • 20
      • 21
      • 22
      • 23
      • 24
      • 25
      • 26
      • 27
      • 28
      • 29
      • 30
      • 31
      • 32
      • 33
      • 34
      • 35
      • 36
      • 37
      • 38
      • 39
      • 40
      • 41
      • 42
      • 43
      • 44
      • 45
      • 46
      • 47
      • 48
      • 49
      • 50
      • 51
      • 52
      • 53
      • 54
      • 55
      • 56
      • 57
      • 58
      • 59
      • 60
      • 61
      • 62
      • 63
      • 64
      • 65
      • 66
      • 67
      • 68
      • 69
      • 70
      • 71
      • 72
      • 73
      • 74
      • 75
      • 76
      • 77
      • 78
      • 79
      • 80
      • 81
      • 82
      • 83
      • 84
      • 85
      • 86
      • 87
      • 88
      • 89
      • 90
    9. FilterMutectCalls 对Mutect结果突变过滤

      mamba activate gatk
      
      gatk FilterMutectCalls \
          --max-events-in-region ${event} \
          --contamination-table ${result}/${sn}/stat/${sn}_contamination.table \
          -R /opt/ref/hg19/hg19.fasta \
          -V ${result}/${sn}/snv/${sn}_bqsr.vcf.gz \
          -O ${result}/${sn}/snv/${sn}_filtered.vcf.gz
      	
      mamba deactivate
      
      • 1
      • 2
      • 3
      • 4
      • 5
      • 6
      • 7
      • 8
      • 9
      • 10
    10. 使用Vep注释过滤结果

      #conda检测环境是否存在,首次运行不存在创建该环境并安装软件
      if [ ! -d "${envs}/${pn}.vep" ]; then
        echo "Creating the environment ${pn}.vep"
        mamba create -n ${pn}.vep -y ensembl-vep=${version_vep}
      fi
      
      mkdir -p /opt/result/${sn}/vcf
      #检测vep注释数据库是否存在如果不存在则先下载
      if [ ! -d "/opt/ref/vep-cache/homo_sapiens/${version_vep}_GRCh37" ]; then
      	aria2c -x 16 -s 48 https://ftp.ensembl.org/pub/release-${version_vep}/variation/indexed_vep_cache/homo_sapiens_vep_${version_vep}_GRCh37.tar.gz -d /opt/ref/
      	tar -zxvf /opt/ref/homo_sapiens_vep_${version_vep}_GRCh37.tar.gz -C /opt/ref/vep-cache/
      elif [ -f "/opt/ref/homo_sapiens_vep_${version_vep}_GRCh37.tar.gz.aria2" ]; then
      	aria2c -x 16 -s 48 https://ftp.ensembl.org/pub/release-${version_vep}/variation/indexed_vep_cache/homo_sapiens_vep_${version_vep}_GRCh37.tar.gz -c -d /opt/ref/
      	tar -zxvf /opt/ref/homo_sapiens_vep_${version_vep}_GRCh37.tar.gz -C /opt/ref/vep-cache/
      fi
      
      if [ ! -d "/opt/ref/vep-cache/homo_sapiens_refseq/${version_vep}_GRCh37" ]; then
      	aria2c -x 16 -s 48 http://ftp.ensembl.org/pub/release-${version_vep}/variation/vep/homo_sapiens_refseq_vep_${version_vep}_GRCh37.tar.gz -d /opt/ref/
      	tar -zxvf /opt/ref/homo_sapiens_refseq_vep_${version_vep}_GRCh37.tar.gz -C /opt/ref/vep-cache/
      elif [ -f "/opt/ref/homo_sapiens_refseq_vep_${version_vep}_GRCh37.tar.gz.aria2" ]; then
      	aria2c -x 16 -s 48 http://ftp.ensembl.org/pub/release-${version_vep}/variation/vep/homo_sapiens_refseq_vep_${version_vep}_GRCh37.tar.gz -c -d /opt/ref/
      	tar -zxvf /opt/ref/homo_sapiens_refseq_vep_${version_vep}_GRCh37.tar.gz -C /opt/ref/vep-cache/
      fi
      
      mamba activate ${pn}.vep
      
      mkdir -p ${result}/${sn}/annotation
      vep \
          -i ${result}/${sn}/snv/${sn}_filtered.vcf.gz  \
          -o ${result}/${sn}/annotation/${sn}_filtered_vep.tsv \
          --offline \
          --cache \
          --cache_version ${version_vep} \
          --everything \
          --dir_cache /opt/ref/vep-cache/ \
          --dir_plugins /opt/ref/vep-cache/Plugins \
          --species homo_sapiens \
          --assembly GRCh37 \
          --fasta /opt/ref/hg19/hg19.fasta \
          --refseq \
          --force_overwrite \
          --format vcf \
          --tab \
          --shift_3prime 1  \
          --offline
      	
      mamba deactivate
      
      • 1
      • 2
      • 3
      • 4
      • 5
      • 6
      • 7
      • 8
      • 9
      • 10
      • 11
      • 12
      • 13
      • 14
      • 15
      • 16
      • 17
      • 18
      • 19
      • 20
      • 21
      • 22
      • 23
      • 24
      • 25
      • 26
      • 27
      • 28
      • 29
      • 30
      • 31
      • 32
      • 33
      • 34
      • 35
      • 36
      • 37
      • 38
      • 39
      • 40
      • 41
      • 42
      • 43
      • 44
      • 45
      • 46
      • 47
    11. 使用脚本处理注释结果和过滤vcf结果,输出和室间质评要求格式的数据表格

      mamba activate ${pn}.cnvkit
      
      if [ ! -f "${envs}/MatchedSnvVepAnnotationFilter.py" ]; then
      	aria2c https://raw.fgit.cf/doujiangbaozi/sliverworkspace-util/main/somatic/MatchedSnvVepAnnotationFilter.py -d ${envs}/
      	#aria2c https://raw.githubusercontent.com/doujiangbaozi/sliverworkspace-util/main/somatic/MatchedSnvVepAnnotationFilter.py -d ${envs}/
      	chmod a+x ${envs}/MatchedSnvVepAnnotationFilter.py
      fi
      
      ${envs}/MatchedSnvVepAnnotationFilter.py \
      	-e normal_artifact   \
          -e germline   \
          -i strand_bias   \
          -i clustered_events   \
          --min-vaf=${snv_vaf} \
          --min-tlod=${snv_tlod} \
          --min-depth=${snv_depth} \
          -v ${result}/${sn}/snv/${sn}_filtered.vcf.gz   \
          -a ${result}/${sn}/annotation/${sn}_filtered_vep.tsv   \
          -o ${result}/${sn}/annotation/${sn}.result.SNV.tsv
      	
      mamba deactivate
      
      • 1
      • 2
      • 3
      • 4
      • 5
      • 6
      • 7
      • 8
      • 9
      • 10
      • 11
      • 12
      • 13
      • 14
      • 15
      • 16
      • 17
      • 18
      • 19
      • 20
      • 21
    12. 使用cnvkit提供工具输出分布图和热图

      mamba activate ${pn}.cnvkit
      
      cnvkit.py scatter ${result}/${sn}/cnv/${sn}_tumor_marked.cnr \
      	-s ${result}/${sn}/cnv/${sn}_tumor_marked.cns \
      	-i ' ' \
      	-n ${sn}_normal \
          -o ${result}/${sn}/cnv/${sn}_cnv_scatter.png -t  &&
       
      cnvkit.py heatmap ${result}/${sn}/cnv/${sn}_tumor_marked.cns \
      	-o ${result}/${sn}/cnv/${sn}_cnv_heatmap.png
      
      mamba deactivate
      
      • 1
      • 2
      • 3
      • 4
      • 5
      • 6
      • 7
      • 8
      • 9
      • 10
      • 11
      • 12
    13. 使用cnvkit call 根据cnvkit batch输出结果推算拷贝数

      mamba activate ${pn}.cnvkit
      
      cnvkit.py call ${result}/${sn}/cnv/${sn}_tumor_marked.cns \
      	-o ${result}/${sn}/cnv/${sn}_tumor_marked.call.cns
      	
      mamba deactivate
      
      • 1
      • 2
      • 3
      • 4
      • 5
      • 6
    14. 编写脚本处理cnvkit输出,计算cnv基因,exon位置,gain/lost,cn数

      mamba activate ${pn}.cnvkit
      
      if [ ! -f "${envs}/CnvAnnotationFilter.py" ]; then
      	aria2c https://raw.fgit.cf/doujiangbaozi/sliverworkspace-util/main/somatic/CnvAnnotationFilter.py -d ${envs}/
      	#aria2c https://raw.githubusercontent.com/doujiangbaozi/sliverworkspace-util/main/somatic/CnvAnnotationFilter.py -d ${envs}/
      	chmod a+x ${envs}/CnvAnnotationFilter.py
      fi
      
      if [ ! -f "/opt/ref/hg19/hg19_refGene.txt" ]; then
      	aria2c -x 16 -s 16 http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/refGene.txt.gz -d /opt/ref/hg19 -o hg19_refGene.txt.gz
      	cd /opt/ref/hg19 && gzip -d hg19_refGene.txt.gz
      fi
      
      python ${envs}/CnvAnnotationFilter.py  \
        -r /opt/ref/hg19/hg19_refGene.txt \
        -i ${cnv_min} \
        -x ${cnv_max} \
        -D ${cnv_dep} \
        -f ${result}/${sn}/cnv/${sn}_tumor_marked.call.cns \
        -o ${result}/${sn}/cnv/${sn}.result.CNV.tsv
        
      mamba deactivate
      
      • 1
      • 2
      • 3
      • 4
      • 5
      • 6
      • 7
      • 8
      • 9
      • 10
      • 11
      • 12
      • 13
      • 14
      • 15
      • 16
      • 17
      • 18
      • 19
      • 20
      • 21
      • 22
    15. 编写脚本处理manta的输出,获取最终sv输出结果,起始位置,基因、频率等

        mamba activate ${pn}.cnvkit
        
        if [ ! -f "${envs}/SvAnnotationFilter.py" ]; then
        	aria2c https://raw.fgit.cf/doujiangbaozi/sliverworkspace-util/main/somatic/SvAnnotationFilter.py -d ${envs}/
        	#aria2c https://raw.githubusercontent.com/doujiangbaozi/sliverworkspace-util/main/somatic/SvAnnotationFilter.py -d ${envs}/
        	chmod a+x ${envs}/SvAnnotationFilter.py
        fi
        
        if [ ! -f "/opt/ref/hg19/hg19_refGene.txt" ]; then
        	aria2c -x 16 -s 16 http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/refGene.txt.gz -d /opt/ref/hg19 -o hg19_refGene.txt.gz
        	cd /opt/ref/hg19 && gzip -d hg19_refGene.txt.gz
        fi
        
        ${envs}/SvAnnotationFilter.py \
          -r /opt/ref/hg19/hg19_refGene.txt \
          -s ${sv_score} \
          -f ${result}/${sn}/sv/results/variants/somaticSV.vcf.gz \
          -o ${result}/${sn}/sv/${sn}.result.SV.tsv
          
        mamba deactivate
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    • 13
    • 14
    • 15
    • 16
    • 17
    • 18
    • 19
    • 20
    1. 根据之前fastp,samtools depth,samtools flagstat,gatk CollectInsertSizeMetrics等输出,给出综合 QC数据
        mamba activate ${pn}.cnvkit
        
        if [ ! -f "${envs}/MatchedQcProcessor.py" ]; then
        	aria2c https://raw.fgit.cf/doujiangbaozi/sliverworkspace-util/main/somatic/MatchedQcProcessor.py -d ${envs}/
        	#aria2c https://raw.githubusercontent.com/doujiangbaozi/sliverworkspace-util/main/somatic/MatchedQcProcessor.py -d ${envs}/
        	chmod a+x ${envs}/MatchedQcProcessor.py
        fi
        
        ${envs}/MatchedQcProcessor.py  --bed /opt/ref/hg19/Illumina_pt2.bed \
            --out ${result}/${sn}/stat/${sn}.result.QC.tsv \
            --sample-fastp=${result}/${sn}/trimmed/${sn}_tumor_fastp.json \
            --sample-depth=${result}/${sn}/stat/${sn}_tumor_marked.depth \
            --sample-flagstat=${result}/${sn}/stat/${sn}_tumor_marked.flagstat \
            --sample-insertsize=${result}/${sn}/stat/${sn}_tumor_insertsize_metrics.txt \
            --normal-fastp=${result}/${sn}/trimmed/${sn}_normal_fastp.json \
            --normal-depth=${result}/${sn}/stat/${sn}_normal_marked.depth \
            --normal-flagstat=${result}/${sn}/stat/${sn}_normal_marked.flagstat  \
            --normal-insertsize=${result}/${sn}/stat/${sn}_normal_insertsize_metrics.txt
        	
        mamba deactivate
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    • 13
    • 14
    • 15
    • 16
    • 17
    • 18
    • 19
    • 20

    最终输出

    文件名备注
    1701/1701.result.SNV.tsvSNV最终突变结果
    1701/1701/cnv/1701_cnv_heatmap.pngCNV结果热图
    1701/cnv/1701_cnv_scatter.pngCNV结果分布图
    1701/cnv/1701.result.CNV.tsvCNV最终结果
    1701.result.SV.tsvSV最终结果
    1701.result.QC.tsv最终质控结果
  • 相关阅读:
    Linux之FinalShell的安装和使用
    关于使用后端实现动态表单功能的心得
    JMeter —— 接口自动化测试(数据驱动)
    elementUI选择框,value为0时无法获取的问题以及解决办法
    RabbitMQ快速入门(详细)
    VUE+Element添加高德地图的功能
    修改docker 修改容器配置
    MySQL 快速入门之第一章 账号管理、建库以及四大引擎
    Linux【安全 02】OpenSSH漏洞修复(离线升级最新版本流程)网盘分享3个安装包+26个离线依赖
    谨慎使用多线程中的 fork !!!!
  • 原文地址:https://blog.csdn.net/weixin_39900139/article/details/133632711