Classification of human Y chromosome haplogroups based on dense and sparse genetic data using machine learning approaches

dc.contributor.advisor: Roy, Kallol (supervisor)
dc.contributor.advisor: Karmin, Monika (supervisor)
dc.contributor.author: Espinosa, Jose Rodrigo Flores
dc.contributor.other: Tartu Ülikool. Loodus- ja täppisteaduste valdkond
dc.contributor.other: Tartu Ülikool. Arvutiteaduse instituut
dc.date.accessioned: 2023-08-30T11:26:34Z
dc.date.available: 2023-08-30T11:26:34Z
dc.date.issued: 2022
dc.description.abstract: The genetic data of human Y chromosomes are classified into haplogroup categories based on the underlying phylogenetic tree, where a haplogroup represents a monophyletic clade on the tree. Current methods for assigning these categories represent a known human Y chromosome phylogeny as a tree data structure. To assign a haplogroup to an individual Y chromosome using this representation, strategies based on breadth-first search (BFS) are often used. The tree is traversed so that paths with supporting evidence from mutations are explored further, eventually leading to a leaf node and a final classification. This strategy is highly efficient when dense genotyping/sequencing data are available. However, with lower-density genetic data such as genotyping arrays or ancient DNA data, BFS-based strategies often fail to reach a leaf node because of uncertainty and a lack of information about where to go next. In this work we leverage the increasing availability of worldwide panels of Y chromosome data with curated haplogroup categories. We present a novel method that applies a K-nearest neighbors (KNN) classifier to both low-density and high-density data. The main goal is to assess the extent to which this approach can be useful in the challenging cases where BFS-based methods fail to produce a tractable and meaningful result. To achieve this, we employed different DNA sequence encodings together with dimensionality reduction techniques. We also investigated a novel method of DNA representation using Word2vec contextual embeddings, in which DNA snippets are represented as text words and the whole DNA sequence as a text sentence. Encoding the DNA sequences in this manner gives rich contextual information that helps in haplogroup classification and can be extended to other applications in genomics.
The results show that classification accuracy is high (>98%) with next-generation sequencing (NGS) and genotyping array data, the high-density and lower-density data classes respectively. Performance, however, is low (<60% on average) when classifying ancient DNA data, which has the lowest level of resolution and higher levels of error. We observe that in many of the challenging cases KNN fails to predict the label correctly at its finest degree of resolution but does classify correctly at the main category level, which can be useful in practice.
dc.identifier.uri: https://hdl.handle.net/10062/91803
dc.language.iso: eng
dc.publisher: Tartu Ülikool
dc.rights: openAccess
dc.rights: Attribution-NonCommercial-NoDerivatives 4.0 International
dc.rights.uri: http://creativecommons.org/licenses/by-nc-nd/4.0/
dc.subject: Y chromosome
dc.subject: machine learning
dc.subject: haplogroup classification
dc.subject.other: master's theses
dc.subject.other: informatics
dc.subject.other: infotechnology
dc.title: Classification of human Y chromosome haplogroups based on dense and sparse genetic data using machine learning approaches
dc.type: Thesis

Files

Original bundle
Name: Espinosa_MSc_computer_science_2022.pdf
Size: 595.55 KB
Format: Adobe Portable Document Format

License bundle
Name: license.txt
Size: 1.71 KB
Format: Item-specific license agreed upon to submission