Potential of Artificial Genomes in Genome-wide Association Studies
Date
2021
Abstract
The ability of genome-wide association studies (GWAS) to identify new disease-associated
variants is highly dependent on sample size. At the same time, obtaining a large enough dataset
is currently a barrier for researchers, since many genomic databases are not freely accessible
due to privacy concerns. As a potential solution to this problem, generative adversarial
networks (GANs) have recently demonstrated the ability to create realistic artificial
human genomes (AGs), which could serve as anonymous surrogates for inaccessible data. This
study describes the preliminary steps towards exploring the possible applicability of AGs in
GWAS. Using Estonian type 2 diabetes (T2D) data, AGs were generated by training the model
independently on case and control groups, using coherent principal component analysis (PCA)
results as a stopping criterion. Due to computational limitations, genomes were split into chunks of
1,000 SNPs for training. The resulting AGs were stitched back into full chromosomes and compared
to real genomes based on population structure, estimated via PCA, and minor allele frequency
(MAF) correlation. Additionally, relationships between case and control groups were
assessed using the same methods. Subsequently, GWAS was conducted on both Estonian data
and AGs. The stitched AGs were found to cluster differently from real genomes. In addition,
AG cases and controls formed distinct pseudo-population structures. Furthermore,
for AGs, differences in MAFs between cases and controls were greater than for real
genomes. Ultimately, AGs performed poorly in GWAS, showing highly inflated results, possibly
due to the systematic differences in MAFs between case and control groups. In this study,
we address several potential barriers to using AGs as anonymous proxies in GWAS applications
and provide directions for future research, suggesting alternative training approaches and
potential technical improvements.
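
The chunking, stitching, and allele-frequency comparison described above can be illustrated with a minimal sketch. It is not the study's implementation: the function names are hypothetical, genotypes are assumed to be coded as 0/1/2 copies of the counted allele in a samples-by-SNPs NumPy array, and the counted allele is assumed to be the minor allele, so that its frequency stands in for MAF.

```python
import numpy as np


def split_into_chunks(genotypes, chunk_size=1000):
    """Split a (samples x SNPs) matrix into consecutive column chunks.

    The last chunk may be shorter than chunk_size.
    """
    n_snps = genotypes.shape[1]
    return [genotypes[:, start:start + chunk_size]
            for start in range(0, n_snps, chunk_size)]


def stitch_chunks(chunks):
    """Concatenate per-chunk matrices back into one matrix, in order."""
    return np.concatenate(chunks, axis=1)


def allele_frequency(genotypes):
    """Per-SNP frequency of the counted allele (assumed coding: 0/1/2)."""
    return genotypes.mean(axis=0) / 2.0


def frequency_correlation(genotypes_a, genotypes_b):
    """Pearson correlation of per-SNP allele frequencies between two sets."""
    freq_a = allele_frequency(genotypes_a)
    freq_b = allele_frequency(genotypes_b)
    return float(np.corrcoef(freq_a, freq_b)[0, 1])


def mean_abs_frequency_difference(cases, controls):
    """Average absolute per-SNP frequency gap between cases and controls."""
    return float(np.mean(np.abs(allele_frequency(cases)
                                - allele_frequency(controls))))
```

In such a setup, `frequency_correlation` would be applied to real versus artificial genomes, and `mean_abs_frequency_difference` to cases versus controls within each dataset; a larger gap in the artificial data than in the real data would be consistent with the systematic differences reported above.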
Keywords
B110 Bioinformatics, medical informatics, biomathematics, biometrics, B790 Clinical genetics