Hap.py 是illumina官方开发的在单倍型水平上比较二倍体基因型的工具。可用于针对金标准变异数据集(例如NA12878样本)对检测的变异结果进行基准测试,以判断检测结果的准确性, 也可以用于比较和评估两个不同变异检测软件检测结果的差异。
安装就不多说了,怕麻烦的同学建议直接使用docker镜像,省时省力。docker pull pkrusche/hap.py
下面直接上基本使用命令行吧。
hap.py
hap.py/example/happy/PG_NA12878_hg38-chr21.vcf.gz \
hap.py/example/happy/NA12878-GATK3-chr21.vcf.gz \
-f hap.py/example/happy/PG_Conf_hg38-chr21.bed.gz \
-r hap.py/example/happy/hg38.chr21.fa \
--threads 5 \
-o gatk-all
其中,
-r hap.py/example/happy/hg38.chr21.fa 用于指定使用的参考序列。 也可以通过设置环境变量HGREF来指定export HGREF=hap.py/example/happy/hg38.chr21.fa , 其中hap.py/example/happy/hg38.chr21.fa 需要换成你实际使用的fasta序列hap.py/example/happy/PG_NA12878_hg38-chr21.vcf.gz 和 hap.py/example/happy/NA12878-GATK3-chr21.vcf.gz 为待比较的两个VCF文件,替换为你实际VCF文件即可。 -f hap.py/example/happy/PG_Conf_hg38-chr21.bed.gz 指定需要比较的区间,即只有该bed区间内的位点用于比较。--threads 5 指定线程-o out_prefix 指定输出文件前缀结果包含以下几个文件:
| Output File | Contents |
|---|---|
| gatk-all.summary.csv | Summary statistics |
| gatk-all.extended.csv | Extended statistics |
| gatk-all.roc.all.csv | All precision / recall data points that were calculated |
| gatk-all.vcf.gz | Annotated VCF according to https://github.com/ga4gh/benchmarking-tools/tree/master/doc/ref-impl |
| gatk-all.vcf.gz.tbi | VCF Tabix Index |
| gatk-all.metrics.json | JSON file containing all computed metrics and tables. |
| gatk-all.roc.Locations.INDEL.csv | ROC for ALL indels only. |
| gatk-all.roc.Locations.SNP.csv | ROC for ALL SNPs only. |
| gatk-all.roc.Locations.INDEL.PASS.csv | ROC for PASSing indels only. |
| gatk-all.roc.Locations.SNP.PASS.csv | ROC for PASSing SNPs only. |
其中两个比较重要的文件为:gatk-all.summary.csv 和 gatk-all.vcf.gz。
1)gatk-all.summary.csv包含直接的统计结果,包含变异位点数,精确度,召回率,F1值等统计信息,可以直观的看到两者之间的差异。gatk-all.extended.csv 是所有的统计信息,包含了gatk-all.summary.csv里面的信息,以及其他拓展信息。相关字段说明如下(就不中文解释了,有疑问可以一起讨论):
| Stratification Column | Description |
|---|---|
| Type | Variant type (SNP / INDEL) |
| Subtype | Variant Subtype (ti/tv/indel length, see above |
| Subset | Subset of the genome/stratification region |
| Filter | Variant filters: PASS, SEL, ALL, or a particular filter from the query VCF |
| Genotype | Genotype of benchmarked variants (het / homalt / hetalt) |
| QQ.Field | Which field from the original VCF was used to produce QQ values in truth and query |
| QQ threshold for ROC values |
| Metric Column | Description |
|---|---|
| METRIC.Recall | Recall for truth variant representation = TRUTH.TP / (TRUTH.TP + TRUTH.FN) |
| METRIC.Precision | Precision of query variants = QUERY.TP / (QUERY.TP + QUERY.FP) |
| METRIC.Frac_NA | Fraction of non-assessed query calls = QUERY.UNK / QUERY.TOTAL |
| METRIC.F1_Score | Harmonic mean of precision and recall = 2METRIC.RecallMetric.Precision/(METRIC.Recall + METRIC.Precision) |
| TRUTH.TOTAL | Total number of truth variants |
| TRUTH.TP | Number of true-positive calls in truth representation (counted via the truth sample column) |
| TRUTH.FN | Number of false-negative calls = calls in truth without matching query call |
| QUERY.TOTAL | Total number of query calls |
| QUERY.TP | Number of true positive calls in query representation (counted via the query sample column) |
| QUERY.FP | Number of false-positive calls in the query file (mismatched query calls within the confident regions) |
| QUERY.UNK | Number of query calls outside the confident regions |
| FP.gt | Number of genotype mismatches (alleles match, but different zygosity) |
| FP.al | Number of allele mismatches (variants matched by position and not by haplotype) |
| TRUTH.TOTAL.TiTv_ratio | Transition / Transversion ratio for all truth variants |
| TRUTH.TOTAL.het_hom_ratio | Het/Hom ratio for all truth variants |
| TRUTH.FN.TiTv_ratio | Transition / Transversion ratio for false-negative variants |
| TRUTH.FN.het_hom_ratio | Het/Hom ratio for false-negative variants |
| TRUTH.TP.TiTv_ratio | Transition / Transversion ratio for true positive variants |
| TRUTH.TP.het_hom_ratio | Het/Hom ratio for true positive variants |
| QUERY.FP.TiTv_ratio | Transition / Transversion ratio for false positive variants |
| QUERY.FP.het_hom_ratio | Het/Hom ratio for false-positive variants |
| QUERY.TOTAL.TiTv_ratio | Transition / Transversion ratio for all query variants |
| QUERY.TOTAL.het_hom_ratio | Het/Hom ratio for all query variants |
| QUERY.TP.TiTv_ratio | Transition / Transversion ratio for true positive variants (query representation) |
| QUERY.TP.het_hom_ratio | Het/Hom ratio for true positive variants (query representation) |
| QUERY.UNK.TiTv_ratio | Transition / Transversion ratio for unknown variants |
| QUERY.UNK.het_hom_ratio | Het/Hom ratio for unknown variants |
| Subset.Size | When using stratification regions, this gives the number of nucleotides contained in the current subset |
| Subset.IS_CONF.Size | This gives the number of confident bases (-f regions) in the current subset |
2)gatk-all.vcf.gz 是VCF格式的文件,文件最后两列(TRUTH和QUERY列)包含两个比较VCF样本的位点对比的情况。

其中涉及GT:BD:BK:BI:BVT:BLT:QQ等多个字段
| field | Description |
|---|---|
| GT | Genotype |
| BD | Decision for call (TP/FP/FN/N) |
| BK | Sub-type for decision (match/mismatch type) |
| BI | Additional comparison information |
| Variant quality for ROC creation | |
| BVT | High-level variant type (SNP/INDEL) |
| BLT | High-level location type (het/homref/hetalt/homalt/nocall) |
hap.py的基本用法就酱了,喜欢可以点赞加收藏留着吃灰去吧。
参考:
https://github.com/Illumina/hap.py
https://github.com/Illumina/hap.py/blob/master/doc/happy.md