Generalization Capability of the Diet Network Model on Genomic Data

EasyChair Preprint 1638

6 pages. Date: October 11, 2019

Abstract

Machine learning in genomics shows promise for predicting risk and revealing new insights into complex diseases. Nonetheless, its application poses an important challenge, the fat data problem: the number of input features is orders of magnitude larger than the number of training examples. Models trained on this type of data are prone to overfitting. The Diet Network, a novel neural architecture, has been proposed to address this problem. It was demonstrated that this architecture can perform a population classification task on the Thousand Genomes Project (1000G), a fat dataset. In our work, we evaluate the generalization capability of the approach. To do so, we retrained the Diet Network on the 1000G dataset using a different set of SNPs to obtain a model optimized for application to other cohorts. We tested this model on individuals from an alternative dataset, the Human Genome Diversity Project (HGDP). We also evaluated population classifications made by the model on individuals from previously unseen populations, by training the Diet Network on non-admixed populations and investigating the model's predictions for admixed individuals. We demonstrate that the Diet Network is capable of generalizing to new datasets and of capturing the intricacies of admixed genomes.

Keyphrases: Ancestry Inference, Fat Data, deep learning, generalization, genomics, interpretability

BibTeX entry
BibTeX does not have the right entry for preprints. This is a hack for producing the correct reference:
@booklet{EasyChair:1638,
  author    = {Camille Rochefort-Boulanger and Léo Choinière and Jean-Christophe Grenier and Pierre-Luc Carrier and Julie Hussin},
  title     = {Generalization Capability of the Diet Network Model on Genomic Data},
  howpublished = {EasyChair Preprint 1638},
  year      = {EasyChair, 2019}}