• 变异检测准确性评估软件hap.py使用


    变异检测准确性评估软件hap.py使用

    1. hap.py简介

    Hap.py 是illumina官方开发的在单倍型水平上比较二倍体基因型的工具。可用于针对金标准变异数据集(例如NA12878样本)对检测的变异结果进行基准测试,以判断检测结果的准确性, 也可以用于比较和评估两个不同变异检测软件检测结果的差异。

    2. hap.py 的使用

    安装就不多说了,怕麻烦的同学建议直接使用docker镜像,省时省力。docker pull pkrusche/hap.py
    下面直接上基本使用命令行吧。

    hap.py 
    	hap.py/example/happy/PG_NA12878_hg38-chr21.vcf.gz \
        hap.py/example/happy/NA12878-GATK3-chr21.vcf.gz \
        -f hap.py/example/happy/PG_Conf_hg38-chr21.bed.gz \
        -r hap.py/example/happy/hg38.chr21.fa \
        --threads 5 \
        -o gatk-all
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7

    其中,

    1. -r hap.py/example/happy/hg38.chr21.fa 用于指定使用的参考序列。 也可以通过设置环境变量HGREF来指定export HGREF=hap.py/example/happy/hg38.chr21.fa , 其中hap.py/example/happy/hg38.chr21.fa 需要换成你实际使用的fasta序列
    2. hap.py/example/happy/PG_NA12878_hg38-chr21.vcf.gzhap.py/example/happy/NA12878-GATK3-chr21.vcf.gz 为待比较的两个VCF文件,替换为你实际VCF文件即可。
    3. -f hap.py/example/happy/PG_Conf_hg38-chr21.bed.gz 指定需要比较的区间,即只有该bed区间内的位点用于比较。
    4. --threads 5 指定线程
    5. -o out_prefix 指定输出文件前缀

    3. 结果说明

    结果包含以下几个文件:

    Output FileContents
    gatk-all.summary.csvSummary statistics
    gatk-all.extended.csvExtended statistics
    gatk-all.roc.all.csvAll precision / recall data points that were calculated
    gatk-all.vcf.gzAnnotated VCF according to https://github.com/ga4gh/benchmarking-tools/tree/master/doc/ref-impl
    gatk-all.vcf.gz.tbiVCF Tabix Index
    gatk-all.metrics.jsonJSON file containing all computed metrics and tables.
    gatk-all.roc.Locations.INDEL.csvROC for ALL indels only.
    gatk-all.roc.Locations.SNP.csvROC for ALL SNPs only.
    gatk-all.roc.Locations.INDEL.PASS.csvROC for PASSing indels only.
    gatk-all.roc.Locations.SNP.PASS.csvROC for PASSing SNPs only.

    其中两个比较重要的文件为:gatk-all.summary.csvgatk-all.vcf.gz
    1)gatk-all.summary.csv包含直接的统计结果,包含变异位点数,精确度,召回率,F1值等统计信息,可以直观的看到两者之间的差异。gatk-all.extended.csv 是所有的统计信息,包含了gatk-all.summary.csv里面的信息,以及其他拓展信息。相关字段说明如下(就不中文解释了,有疑问可以一起讨论):

    Stratification ColumnDescription
    TypeVariant type (SNP / INDEL)
    SubtypeVariant Subtype (ti/tv/indel length, see above
    SubsetSubset of the genome/stratification region
    FilterVariant filters: PASS, SEL, ALL, or a particular filter from the query VCF
    GenotypeGenotype of benchmarked variants (het / homalt / hetalt)
    QQ.FieldWhich field from the original VCF was used to produce QQ values in truth and query
    QQQQ threshold for ROC values
    Metric ColumnDescription
    METRIC.RecallRecall for truth variant representation = TRUTH.TP / (TRUTH.TP + TRUTH.FN)
    METRIC.PrecisionPrecision of query variants = QUERY.TP / (QUERY.TP + QUERY.FP)
    METRIC.Frac_NAFraction of non-assessed query calls = QUERY.UNK / QUERY.TOTAL
    METRIC.F1_ScoreHarmonic mean of precision and recall = 2METRIC.RecallMetric.Precision/(METRIC.Recall + METRIC.Precision)
    TRUTH.TOTALTotal number of truth variants
    TRUTH.TPNumber of true-positive calls in truth representation (counted via the truth sample column)
    TRUTH.FNNumber of false-negative calls = calls in truth without matching query call
    QUERY.TOTALTotal number of query calls
    QUERY.TPNumber of true positive calls in query representation (counted via the query sample column)
    QUERY.FPNumber of false-positive calls in the query file (mismatched query calls within the confident regions)
    QUERY.UNKNumber of query calls outside the confident regions
    FP.gtNumber of genotype mismatches (alleles match, but different zygosity)
    FP.alNumber of allele mismatches (variants matched by position and not by haplotype)
    TRUTH.TOTAL.TiTv_ratioTransition / Transversion ratio for all truth variants
    TRUTH.TOTAL.het_hom_ratioHet/Hom ratio for all truth variants
    TRUTH.FN.TiTv_ratioTransition / Transversion ratio for false-negative variants
    TRUTH.FN.het_hom_ratioHet/Hom ratio for false-negative variants
    TRUTH.TP.TiTv_ratioTransition / Transversion ratio for true positive variants
    TRUTH.TP.het_hom_ratioHet/Hom ratio for true positive variants
    QUERY.FP.TiTv_ratioTransition / Transversion ratio for false positive variants
    QUERY.FP.het_hom_ratioHet/Hom ratio for false-positive variants
    QUERY.TOTAL.TiTv_ratioTransition / Transversion ratio for all query variants
    QUERY.TOTAL.het_hom_ratioHet/Hom ratio for all query variants
    QUERY.TP.TiTv_ratioTransition / Transversion ratio for true positive variants (query representation)
    QUERY.TP.het_hom_ratioHet/Hom ratio for true positive variants (query representation)
    QUERY.UNK.TiTv_ratioTransition / Transversion ratio for unknown variants
    QUERY.UNK.het_hom_ratioHet/Hom ratio for unknown variants
    Subset.SizeWhen using stratification regions, this gives the number of nucleotides contained in the current subset
    Subset.IS_CONF.SizeThis gives the number of confident bases (-f regions) in the current subset

    2)gatk-all.vcf.gz 是VCF格式的文件,文件最后两列(TRUTH和QUERY列)包含两个比较VCF样本的位点对比的情况。
    在这里插入图片描述
    其中涉及GT:BD:BK:BI:BVT:BLT:QQ等多个字段

    fieldDescription
    GTGenotype
    BDDecision for call (TP/FP/FN/N)
    BKSub-type for decision (match/mismatch type)
    BIAdditional comparison information
    QQVariant quality for ROC creation
    BVTHigh-level variant type (SNP/INDEL)
    BLTHigh-level location type (het/homref/hetalt/homalt/nocall)

    hap.py的基本用法就酱了,喜欢可以点赞加收藏留着吃灰去吧。

    参考:
    https://github.com/Illumina/hap.py
    https://github.com/Illumina/hap.py/blob/master/doc/happy.md

  • 相关阅读:
    安卓学习路线参考
    邮戳锁StampedLock
    MySQL高级篇知识点——索引优化与查询优化
    Vuex、localStorage和sessionStorage:如何选择合适的数据存储方式?
    docker的默认路径存储不足
    为什么阿里人能够快速成长?看完他们 Java 架构进化笔记,我秒懂!
    golang 图像验证码
    使用cmd登录阿里云服务器
    ASP.NET MVC企业级程序设计 (接上个作品加了添加)
    NodeRed系列—创建mqtt broker(mqtt服务器),并使用mqttx进行消息发送验证
  • 原文地址:https://blog.csdn.net/qq_28723681/article/details/127836407