Genotyping HLA alleles in samples sequenced with Illumina short reads is a challenging task (Brandt et al. (2015)). Several approaches have been developed in recent years (Warren et al. (2012), Liu et al. (2013), Bai et al. (2014), Dilthey et al. (2015), Xie et al. (2017)).
Here I tested the accuracy of Heng Li’s approach, which uses the human reference genome GRCh38 extended with ALT contigs, decoy sequences and, importantly, an array of ~500 known HLA haplotypes taken from the IPD-IMGT/HLA database (v3.18). I also tested whether genotyping accuracy increases when using the most recent version (v3.34) of the HLA database, which now contains ~5000 haplotypes.
Most of the new additions probably correspond to rare haplotypes seen in less well-studied populations. Even so, increasing the number of HLA haplotypes in the reference increases the power to discriminate between alleles, especially when genotyping populations whose HLA haplotypes are not well characterised.
For the comparisons presented here I used 150 samples from three populations (EUR, EAS, AFR) that were sequenced at high coverage by the Polaris project using Illumina sequencing. See here for details on the data.
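One way such a comparison could be scored is allele-level concordance against a set of reference typings. The sketch below is only an illustration under that assumption: the metric, the 2-field resolution, the function names and the sample data are all hypothetical and are not taken from the analysis itself.

```python
from collections import Counter


def truncate(allele, fields=2):
    """Trim an HLA allele name such as 'A*01:01:01:01' to its first `fields` fields."""
    gene, _, rest = allele.partition("*")
    return gene + "*" + ":".join(rest.split(":")[:fields])


def allele_matches(called, truth, fields=2):
    """Count called alleles that match the truth alleles, ignoring their order."""
    called_counts = Counter(truncate(a, fields) for a in called)
    truth_counts = Counter(truncate(a, fields) for a in truth)
    # Multiset intersection keeps the minimum count per allele, so a
    # homozygous truth genotype needs two matching calls to score 2.
    return sum((called_counts & truth_counts).values())


def accuracy(calls, truths, fields=2):
    """Fraction of truth alleles recovered across all (sample, gene) keys."""
    matched = total = 0
    for key, truth in truths.items():
        matched += allele_matches(calls.get(key, []), truth, fields)
        total += len(truth)
    return matched / total if total else float("nan")


# Made-up typings for a single hypothetical sample.
truths = {
    ("S1", "HLA-A"): ["A*01:01:01:01", "A*02:01:01:01"],
    ("S1", "HLA-B"): ["B*07:02:01", "B*08:01:01"],
}
calls = {
    ("S1", "HLA-A"): ["A*02:01:05", "A*01:01:01"],
    ("S1", "HLA-B"): ["B*07:02:01", "B*44:02:01"],
}

print(accuracy(calls, truths))  # 3 of 4 alleles match at 2 fields: 0.75
```

Running the same function on calls made with each database version would give two numbers that can be compared directly, per population or per gene.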
Gourraud, P. 2014. HLA Diversity in the 1000 Genomes Dataset