Truth set reference for the human Y chromosome

A 'truth set' can be defined as the variant calls for a sample on which every reasonable effort has been made to establish all of its correct SNVs and structural variants. These high-confidence call sets are built by combining sequencing approaches, for example by integrating short-read Illumina data with long-read PacBio/Nanopore data. Such data sets have been assembled for only a few reference samples (NA12878, the CEPH family, the Ashkenazim trio, etc.) and, importantly here, always for the autosomal sections of the genome. Knowing the 'truth' lets anyone test any combination of QC steps, mappers, callers, variant filters and the parameters used in each step, and thus assess which combination comes closest to the 'truth' and is therefore the best for a particular purpose. A simple example is comparing two different versions of a similar pipeline.
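
As a minimal sketch of what such a comparison could look like, the Python snippet below scores two hypothetical pipelines against a truth set by exact matching of (chromosome, position, ref, alt) tuples. The function name, the variant representation and the toy coordinates are illustrative assumptions only; real benchmarking would also need to handle variant normalisation, genotypes and high-confidence regions.

```python
from typing import NamedTuple, Set, Tuple

# Assumed representation of a variant: (chromosome, position, ref allele, alt allele).
Variant = Tuple[str, int, str, str]


class Benchmark(NamedTuple):
    true_positives: int
    false_positives: int
    false_negatives: int
    precision: float
    recall: float
    f1: float


def benchmark_calls(calls: Set[Variant], truth: Set[Variant]) -> Benchmark:
    """Compare a call set with a truth set by exact variant matching."""
    tp = len(calls & truth)   # called and present in the truth set
    fp = len(calls - truth)   # called but absent from the truth set
    fn = len(truth - calls)   # in the truth set but not called
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return Benchmark(tp, fp, fn, precision, recall, f1)


# Toy data, made up for illustration only.
truth = {("chrY", 100, "A", "G"), ("chrY", 200, "C", "T"), ("chrY", 300, "G", "A")}
pipeline_a = {("chrY", 100, "A", "G"), ("chrY", 200, "C", "T"), ("chrY", 400, "T", "C")}
pipeline_b = {("chrY", 100, "A", "G")}

for name, calls in (("pipeline_a", pipeline_a), ("pipeline_b", pipeline_b)):
    result = benchmark_calls(calls, truth)
    print(name, f"precision={result.precision:.2f}", f"recall={result.recall:.2f}")
```

In this toy case the second pipeline is more precise but misses more true variants, which is the kind of trade-off a truth set makes visible.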

The project is to build several 'truth' sets from data available for samples that have been sequenced using a wide array of technologies. Relevant here are Illumina, PacBio, 10x Genomics and Nanopore data. As has already been done for the autosomes, it is possible to integrate these data and build a 'truth' set for the Y chromosome of these samples. Such a resource is not currently available to the community doing Y chromosome research, and I believe it would be useful.
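
One very simple way to picture the integration step is a support-count rule: keep a Y-chromosome variant only if it is called independently by at least a given number of technologies. The Python sketch below illustrates that idea. It is an assumption made for illustration, not the method used for the existing autosomal truth sets, and the function name, threshold and toy calls are hypothetical.

```python
from collections import Counter
from typing import Dict, Set, Tuple

# Assumed representation of a variant: (chromosome, position, ref allele, alt allele).
Variant = Tuple[str, int, str, str]


def consensus_calls(calls_by_tech: Dict[str, Set[Variant]],
                    min_support: int = 2) -> Set[Variant]:
    """Return variants called by at least `min_support` technologies."""
    support: Counter = Counter()
    for calls in calls_by_tech.values():
        # Each technology contributes at most one vote per variant,
        # because its calls are stored as a set.
        support.update(calls)
    return {variant for variant, n in support.items() if n >= min_support}


# Toy call sets, made up for illustration only.
snv_a = ("chrY", 100, "A", "G")
snv_b = ("chrY", 200, "C", "T")
snv_c = ("chrY", 300, "G", "A")
snv_d = ("chrY", 400, "T", "C")

calls_by_tech = {
    "illumina": {snv_a, snv_b},
    "pacbio": {snv_a, snv_b, snv_c},
    "nanopore": {snv_a, snv_d},
}

candidate_truth = consensus_calls(calls_by_tech, min_support=2)
print(sorted(candidate_truth))
```

A stricter threshold gives a smaller but more conservative candidate set. Variants supported by a single technology would need separate evaluation rather than being discarded outright.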

References:

https://gatkforums.broadinstitute.org/gatk/discussion/10912/what-is-truth-or-how-an-accident-of-nature-can-illuminate-our-path

https://www.nature.com/articles/s41592-018-0054-7

https://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1006834

https://jimb.stanford.edu/giab/

https://www.illumina.com/platinumgenomes.html

https://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/

https://www.ebi.ac.uk/ena/data/view/PRJEB3246