Potential of Artificial Genomes in Genome-wide Association Studies






The ability of genome-wide association studies (GWAS) to identify new disease-associated variants depends strongly on sample size. Assembling a sufficiently large dataset, however, remains a barrier for researchers, since many genomic databases are not freely accessible due to privacy concerns. As a potential solution, generative adversarial networks (GANs) have recently been shown to create realistic artificial human genomes (AGs), which could serve as anonymous surrogates for inaccessible data. This study describes preliminary steps towards exploring the applicability of AGs in GWAS. Using Estonian type 2 diabetes (T2D) data, AGs were generated by training the model independently on case and control groups, using coherent principal component analysis (PCA) results as a stopping criterion. Due to computational limitations, genomes were split into chunks of 1,000 SNPs for training. The resulting AGs were stitched back into full chromosomes and compared to real genomes based on population structure, estimated via PCA, and minor allele frequency (MAF) correlation. Relationships between case and control groups were assessed using the same methods. Subsequently, GWAS was conducted on both the Estonian data and the AGs. Stitched AGs were found to cluster differently from real genomes. In addition, AG cases and controls formed distinct pseudo-population structures. Furthermore, differences in MAFs between cases and controls were greater for AGs than for real genomes. Ultimately, AGs performed poorly in GWAS, showing highly inflated results, possibly due to the systematic differences in MAFs between case and control groups. In this study, we address several potential barriers to AGs serving as anonymous proxies in GWAS applications and provide directions for future research, suggesting alternative training approaches and potential technical improvements.
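As a rough illustration of two steps mentioned in the abstract, the sketch below shows how a chromosome might be divided into 1,000-SNP chunks and how a MAF correlation between two genotype sets could be computed. It is not the study's code. The function names, the 0/1/2 genotype coding, and the choice to define the minor allele on the reference (real) dataset are all assumptions made for this example.

```python
import numpy as np


def chunk_bounds(n_snps, chunk_size=1000):
    """Return (start, end) index pairs covering n_snps in consecutive chunks.

    The default of 1,000 SNPs follows the abstract; the last chunk may be shorter.
    """
    return [(start, min(start + chunk_size, n_snps))
            for start in range(0, n_snps, chunk_size)]


def allele_frequency(genotypes):
    """Frequency of the counted allele per SNP.

    Assumes genotypes is an (individuals x SNPs) array coded 0/1/2
    (number of copies of the counted allele), with no missing values.
    """
    return np.asarray(genotypes, dtype=float).mean(axis=0) / 2.0


def folded_frequencies(reference, other):
    """Frequencies of the reference set's minor allele in both datasets.

    Assumption: the minor allele is defined on the reference set, so that
    the same allele is compared in both sets at every SNP.
    """
    ref_freq = allele_frequency(reference)
    other_freq = allele_frequency(other)
    flip = ref_freq > 0.5  # counted allele is the major allele in the reference
    ref_maf = np.where(flip, 1.0 - ref_freq, ref_freq)
    other_maf = np.where(flip, 1.0 - other_freq, other_freq)
    return ref_maf, other_maf


def maf_correlation(reference, other):
    """Pearson correlation of per-SNP minor allele frequencies."""
    ref_maf, other_maf = folded_frequencies(reference, other)
    return float(np.corrcoef(ref_maf, other_maf)[0, 1])


if __name__ == "__main__":
    # Toy data only: random genotypes standing in for real and artificial genomes.
    rng = np.random.default_rng(0)
    real = rng.integers(0, 3, size=(200, 2500))
    artificial = rng.integers(0, 3, size=(200, 2500))
    print(chunk_bounds(real.shape[1])[:3])
    print(maf_correlation(real, artificial))
```

The same `maf_correlation` function could be applied to cases versus controls within one dataset. Under these assumptions, a lower correlation for AGs than for real genomes would correspond to the larger case-control MAF differences reported above.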



B110 Bioinformatics, medical informatics, biomathematics, biometrics, B790 Clinical genetics