Generalization Capability of the Diet Network Model on Genomic Data

EasyChair Preprint 1638
6 pages • Date: October 11, 2019

Abstract

Machine learning in genomics shows promise for predicting disease risk and revealing new insights into complex diseases. Nonetheless, its application poses an important challenge, the fat data problem: the number of input features is orders of magnitude larger than the number of training examples. Models trained on this type of data are prone to overfitting. The Diet Network, a novel neural architecture, has been proposed to address this problem. It was demonstrated that this architecture can perform a population classification task on the Thousand Genomes Project (1000G), a fat dataset. In our work, we evaluate the generalization capability of the approach. To do so, we retrained the Diet Network on the 1000G dataset using a different set of SNPs, to obtain a model optimized for application to other cohorts. We tested this model on individuals from an alternative dataset, the Human Genome Diversity Project (HGDP). We also evaluated the population classifications made by the model for individuals from previously unseen populations, by training the Diet Network on non-admixed populations and investigating the model's predictions for admixed individuals. We demonstrate that the Diet Network can generalize to new datasets and capture the intricacies of admixed genomes.

Keyphrases: Ancestry Inference, Fat Data, deep learning, generalization, genomics, interpretability