Classification of human Y chromosome haplogroups based on dense and sparse genetic data using machine learning approaches

Espinosa, Jose Rodrigo Flores

dc.contributor.advisor	Roy, Kallol, supervisor
dc.contributor.advisor	Karmin, Monika, supervisor
dc.contributor.author	Espinosa, Jose Rodrigo Flores
dc.contributor.other	Tartu Ülikool. Loodus- ja täppisteaduste valdkond	et
dc.contributor.other	Tartu Ülikool. Arvutiteaduse instituut	et
dc.date.accessioned	2023-08-30T11:26:34Z
dc.date.available	2023-08-30T11:26:34Z
dc.date.issued	2022
dc.description.abstract	The genetic data of human Y chromosomes is classified into haplogroup categories based on the underlying phylogenetic tree, where a haplogroup represents a monophyletic clade on the tree. Current methods for assigning these categories represent a known human Y chromosome phylogeny as a tree data structure. To assign a haplogroup to an individual Y chromosome using this representation, strategies based on breadth-first search (BFS) are often used. The tree is traversed so that paths with supporting evidence from mutations are explored further, eventually leading to a leaf node and a final classification. This strategy is highly efficient when dense genotyping/sequencing data are available. However, with lower-density genetic data such as genotyping arrays or ancient DNA data, BFS-based strategies often fail to reach a leaf node because of uncertainty and a lack of information about where to go next. In this work we leverage the increasing availability of worldwide panels of Y chromosome data with curated haplogroup categories. We present a novel method that applies a K-nearest neighbors (KNN) classifier to both low-density and high-density types of data. The main goal is to assess the extent to which this approach can be useful in the challenging cases where BFS-based methods fail to produce a tractable and meaningful result. To achieve this, we have employed different DNA sequence encodings together with dimensionality reduction techniques. We have also investigated a novel method of DNA representation using Word2vec contextual embeddings. The DNA snippets are represented as text words and the whole DNA sequence as a text sentence. Encoding the DNA sequences in this manner gives rich contextual information that helps in haplogroup classification and can be extended to other applications in genomics.
The results show that classification accuracy is high (>98%) with next-generation sequencing (NGS) and genotyping arrays, the high-density and lower-density data classes, respectively. Performance, however, is low (<60% on average) when classifying ancient DNA data, which has the lowest level of resolution and higher levels of error. We observe that in many of the challenging cases KNN fails to correctly predict the label at its finest degree of resolution but does classify correctly at the main category level, which can be useful in practice.	et
dc.identifier.uri	https://hdl.handle.net/10062/91803
dc.language.iso	eng	et
dc.publisher	Tartu Ülikool	et
dc.rights	openAccess	et
dc.rights	Attribution-NonCommercial-NoDerivatives 4.0 International	*
dc.rights.uri	http://creativecommons.org/licenses/by-nc-nd/4.0/	*
dc.subject	Y chromosome	et
dc.subject	machine learning	et
dc.subject	haplogroup classification	et
dc.subject.other	master's theses	et
dc.subject.other	informaatika	et
dc.subject.other	infotehnoloogia	et
dc.subject.other	informatics	et
dc.subject.other	infotechnology	et
dc.title	Classification of human Y chromosome haplogroups based on dense and sparse genetic data using machine learning approaches	et
dc.type	Thesis	et

Files

Original bundle

Name: Espinosa_MSc_computer_science_2022.pdf
Size: 595.55 KB
Format: Adobe Portable Document Format

Collections

LTAT magistritööd – Master's theses