Classification of human Y chromosome haplogroups based on dense and sparse genetic data using machine learning approaches
Date
2022
Publisher
Tartu Ülikool
Abstract
The genetic data of human Y chromosomes is classified into haplogroup categories based
on the underlying phylogenetic tree, where a haplogroup represents a monophyletic clade
on the tree. Current methods for the assignment of these categories work by representing
a known human Y chromosome phylogeny as a tree data structure. For an individual
Y chromosome to be assigned a haplogroup using this representation, strategies based
on breadth-first search (BFS) are often used. The tree is traversed so that paths
supported by evidence from mutations are explored further, eventually leading
to a leaf node and a final classification. This strategy is highly efficient when
dense genotyping/sequencing data are available. However, for lower-density
genetic data such as genotyping arrays or ancient DNA data, BFS-based strategies often
fail to reach a leaf node because too little information is available to decide where to go next.
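
The traversal described above can be illustrated with a minimal sketch. It is not the implementation of any existing tool: the toy tree, the node names (A, A1, ...) and the marker names (m1, m2, ...) are hypothetical, and the rule "follow a child if at least one of its markers is observed in the derived state" is a simplifying assumption.

```python
from collections import deque

# Toy phylogeny (hypothetical names): each node has defining markers and children.
TREE = {
    "root": {"markers": [], "children": ["A", "B"]},
    "A": {"markers": ["m1"], "children": ["A1", "A2"]},
    "B": {"markers": ["m2"], "children": []},
    "A1": {"markers": ["m3"], "children": []},
    "A2": {"markers": ["m4"], "children": []},
}


def assign_haplogroup(derived, tree=TREE, root="root"):
    """Return the deepest node reached by following mutation-supported branches.

    `derived` is the set of markers observed in the derived state in one sample.
    """
    queue = deque([root])
    depth = {root: 0}
    best = root
    while queue:
        node = queue.popleft()
        if depth[node] > depth[best]:
            best = node
        for child in tree[node]["children"]:
            # Explore a child only if the sample supports at least one of its markers.
            if any(marker in derived for marker in tree[child]["markers"]):
                depth[child] = depth[node] + 1
                queue.append(child)
    return best


dense_call = assign_haplogroup({"m1", "m3"})   # reaches the leaf A1
sparse_call = assign_haplogroup({"m1"})        # stops at the internal node A
```

With the sparse sample, neither m3 nor m4 is observed, so the search has no evidence for choosing between A1 and A2 and ends at an internal node. This is the failure mode that motivates the alternative approach.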
In this work we leverage the increasing availability of world-wide panels of Y chromosome
data with available curated haplogroup categories. We present a novel method
that applies a K-nearest neighbors (KNN) classifier to both low-density and high-density
types of data. The main goal is to assess the extent to which this approach can be useful in
the challenging cases where BFS-based methods fail to produce a tractable and meaningful
result. To achieve this, we have employed different DNA sequence encodings together
with dimensionality reduction techniques. We have also investigated a novel method of
DNA representation using Word2vec contextual embeddings. DNA snippets are
treated as text words and the whole DNA sequence as a text sentence. Encoding the
DNA sequences in this manner gives rich contextual information that helps in haplogroup
classification and can be extended to other applications in genomics.
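
As a rough illustration of the two ideas in this paragraph, the sketch below splits a sequence into overlapping k-mer "words" and runs a plain KNN majority vote over a small labelled panel. Everything in it is an assumption made for illustration: the panel sequences and labels are invented, `N` is taken to mark a missing call, and a simple mismatch fraction stands in for the encodings, dimensionality reduction and Word2vec embeddings used in the work itself.

```python
from collections import Counter


def kmer_words(seq, k=3):
    """Split a DNA string into overlapping k-mers; the list plays the role of a sentence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def distance(a, b, missing="N"):
    """Fraction of mismatches among positions that are called in both samples."""
    compared = [(x, y) for x, y in zip(a, b) if x != missing and y != missing]
    if not compared:
        return 1.0  # no shared information: treat as maximally distant
    return sum(x != y for x, y in compared) / len(compared)


def knn_predict(query, panel, k=3):
    """Majority vote over the k labelled panel samples closest to the query."""
    nearest = sorted(panel, key=lambda item: distance(query, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Invented reference panel: (genotype string, curated haplogroup label).
PANEL = [
    ("ACGTAC", "A1"),
    ("ACGTAA", "A1"),
    ("ACGTTC", "A1"),
    ("TTGCAC", "B2"),
    ("TTGCAA", "B2"),
]

words = kmer_words("ACGTA")                   # ["ACG", "CGT", "GTA"]
prediction = knn_predict("ACNNAC", PANEL)     # low-density query with two missing sites
```

The k-mer sentences produced by `kmer_words` are the kind of input a Word2vec implementation could be trained on. The KNN vote needs no tree traversal, so it still returns a label when some sites are missing.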
The results show that classification accuracy is high (>98%) with next-generation
sequencing (NGS) and genotyping array data, the high-density and lower-density data classes
respectively. Performance, however, is low (<60% on average) when classifying ancient
DNA data, which has the lowest level of resolution and higher levels of error. We observe
that in many of the challenging cases KNN fails to predict the label at its finest
degree of resolution but does classify correctly at the main category level, which can be
useful in practice.
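
The difference between the two levels of resolution can be made concrete with a small sketch. It assumes that the main category is the leading letter of the haplogroup label (for example `R` for `R1b1a`), and the true and predicted labels are invented for illustration; they are not results from this work.

```python
def main_category(label):
    """Assumed convention: the main category is the first letter of the label."""
    return label[:1]


def accuracy(true_labels, predicted_labels, key=lambda label: label):
    """Share of samples whose labels agree after applying `key` to both."""
    hits = sum(key(t) == key(p) for t, p in zip(true_labels, predicted_labels))
    return hits / len(true_labels)


# Invented labels for four samples.
true_labels = ["R1b1a", "R1a1", "I2a", "J2b"]
predicted = ["R1b1a", "R1b1", "I2a", "E1b"]

fine = accuracy(true_labels, predicted)                        # exact label match
coarse = accuracy(true_labels, predicted, key=main_category)   # main category only
```

In this toy example the second sample is wrong at the finest level but still falls in the right main category, so the coarse score is higher than the fine one.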
Keywords
Y chromosome, machine learning, haplogroup classification