Abstract
A critical challenge in developing and applying machine learning (ML) algorithms to biological data lies in performing effective model validation. In this presentation, I will share insights from two case studies. The first focuses on an unsupervised learning problem, inferring 3D DNA models from Hi-C contact maps, and discusses how the choice of validation metric directly influences the perceived performance and ranking of different methods. The second examines the impact of various cross-validation strategies on algorithm performance in the context of inferring effectors from protein sequences. These cases highlight the importance of selecting appropriate validation strategies to ensure reliable and accurate ML models in biological research.
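To illustrate why the cross-validation strategy can matter for sequence data, here is a minimal sketch of a group-aware k-fold split. It assumes each sequence carries a hypothetical cluster label (for example from similarity clustering); the abstract does not say which strategies were compared, so the function name `group_k_fold` and the greedy balancing are illustrative choices, not the speaker's method.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence, Tuple


def group_k_fold(
    groups: Sequence[Hashable], k: int
) -> List[Tuple[List[int], List[int]]]:
    """Return k (train, test) index splits in which no group spans both sides."""
    # Collect the sample indices belonging to each (hypothetical) group label.
    members: Dict[Hashable, List[int]] = defaultdict(list)
    for idx, label in enumerate(groups):
        members[label].append(idx)
    if k < 2 or k > len(members):
        raise ValueError("k must be between 2 and the number of groups")

    # Greedy balancing: assign the largest groups first to the smallest fold.
    folds: List[List[int]] = [[] for _ in range(k)]
    for label in sorted(members, key=lambda g: len(members[g]), reverse=True):
        smallest = min(range(k), key=lambda f: len(folds[f]))
        folds[smallest].extend(members[label])

    splits = []
    for f in range(k):
        test = sorted(folds[f])
        train = sorted(i for j in range(k) if j != f for i in folds[j])
        splits.append((train, test))
    return splits


# Hypothetical cluster labels for six sequences.
labels = ["a", "a", "b", "b", "c", "d"]
for train_idx, test_idx in group_k_fold(labels, k=2):
    print("train:", train_idx, "test:", test_idx)
```

With a plain random split, two sequences from the same cluster can end up on both sides of a fold, which tends to inflate the measured performance; keeping groups intact is one common way to avoid that.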
- 15:00: #67 Aurélie Mercadié
- 15:03: #75 Lucie Khamvongsa-Charbonnier
- 15:06: #80 Rose Marin
- 15:09: #95 Rémi-Vinh Coudert
- 15:12: #105 Morgane Terezol
- 15:15: #118 Anne-Laure Abraham
- 15:18: #122 Julie Lao
- 15:21: #126 Thomas Stosskopf
- 15:24: #134 Karine Massau
- 15:27: #139 Vincent Lombard
- 15:30: #190 Yohan Hernandez Courbevoie
- 15:33: #191 Nicolas Homberg
- 15:36: #195 Eva Mercier
- 15:39: #199 Lou Bergogne
- 15:42: #219 Benoît Bergk Pinto
- 15:45: #231 Juliette Audemard
- 15:48: #232 Victor Lefebvre
- 15:51: #241 Marouane Boumlik
- 15:54: #245 Juliette Cooke
- 15:57: #251 Pierre Berriet
16:30 | Semi-supervised learning improves regulatory sequence prediction with unlabeled sequences ABSTRACT. Background Genome-wide association studies have systematically identified thousands of single nucleotide polymorphisms (SNPs) associated with complex genetic diseases. However, the majority of those SNPs were found in non-coding genomic regions, preventing the understanding of the underlying causal mechanism. Predicting molecular processes based on the DNA sequence represents a promising approach to understand the role of those non-coding SNPs. Over the past years, deep learning has been successfully applied to regulatory sequence prediction using supervised learning. Supervised learning requires DNA sequences associated with functional data for training, whose amount is strongly limited by the finite size of the human genome. Conversely, the amount of mammalian DNA sequences is exponentially increasing due to ongoing large sequencing projects, but without functional data in most cases. Results To alleviate the limitations of supervised learning, we propose a paradigm shift with semi-supervised learning, which exploits not only labeled sequences (e.g. human genome with ChIP-seq experiment), but also unlabeled sequences available in much larger amounts (e.g. from other species without ChIP-seq experiment, such as chimpanzee). Conclusion Our approach is flexible and can be plugged into any neural architecture including shallow and deep networks, and shows strong predictive performance improvements compared to supervised learning in most cases (up to 70%). Availability and implementation https://forgemia.inra.fr/raphael.mourad/deepgnn. Reference Mourad, R. (2023). Semi-supervised learning improves regulatory sequence prediction with unlabeled sequences. BMC Bioinformatics, 24(1), 1-15. |
16:50 | scEVE: a scRNA-seq ensemble clustering algorithm that leverages the extrinsic variability to prevent over-clustering. PRESENTER: Yanis Asloudj ABSTRACT. Clustering analyses play a fundamental role in single-cell data science. Hundreds of methods have been developed to conduct this analysis, but they all generate different results. Benchmarks and reviews make this issue obvious, and they also show that no single clustering method outperforms all the others. Thus, to address this issue and to generate clustering results robust to the method used, scRNA-seq ensemble clustering algorithms have been developed. They usually tackle this issue by trying to minimize the differences across multiple clustering solutions. In this paper, we propose a novel approach to tackle it. We name "extrinsic variability" the variations in the clustering solutions that are due to methodological choices. We hypothesize that this extrinsic variability is not an issue to be minimized, but rather an informative signal, and that it can be leveraged to prevent over-clustering. To verify our hypothesis, we have developed scEVE, an algorithm that embraces this approach. In this paper, we present the algorithm of scEVE. We apply it to a human glioblastoma scRNA-seq dataset, and we compare its performance to three state-of-the-art ensemble clustering algorithms, on two different scRNA-seq datasets. We start by presenting scEVE, and how it effectively prevents over-clustering. Then, we showcase it on the public glioblastoma dataset, and we reveal the existence of a sub-cluster of cancer cells, which we characterize biologically. Finally, we show that scEVE performs well compared to existing methods, on top of addressing two main challenges in scRNA-seq clustering. Overall, our work shows that the extrinsic variability can be informative and we present scEVE, an algorithm that leverages it to generate a multi-resolution clustering with explicit consensus values. |
17:10 | localScore: an R package to highlight optimal and suboptimal segment in a sequence with associated p-values computation PRESENTER: Sabine Mercier ABSTRACT. Highlighting atypical segments of a sequence is an important goal in very diverse domains. For the case where no prior information on the length of the segment to be highlighted is available, Karlin and Altschul defined the local score for biological sequence analysis in 1990, and an asymptotic approximation of its distribution was proposed in 1992. Many other theoretical results now exist to establish the local score p-value in different contexts. We developed an R package gathering these results for a sequence modeled by independent and identically distributed variables. It computes the local score, the suboptimal scores and their positions, and offers the different theoretical methods available so far to establish the local score p-value. An automatic analysis is also provided to select the most appropriate method according to the analyzed sequence. We present here the package and different examples of application. Comparisons with other tools used depending on the context of application are also given. The localScore package is available on the Comprehensive R Archive Network. It is distributed under the GPL-2 license for the core program (and various licenses for the embedded Eigen library). |
16:30 | “PhaseImpute” an NF-Core pipeline for genetic imputation PRESENTER: Louis Le Nézet ABSTRACT. Genotype and low coverage sequencing data provide cost-effective avenues for genomic research but are inherently limited by low resolution and low quality, respectively. This sparsity in genetic data poses challenges, particularly in non-model organisms lacking species-specific phased panels, necessitating accurate haplotype phasing for effective downstream analyses. Genetic imputation serves as a valuable tool to supplement these methods by completing missing data and assigning probabilities to variant observations, thus enhancing data resolution and quality. However, ensuring the reliability of imputed genotypes requires rigorous tool validation and thorough exploration of parameter impacts, critical for maximizing the utility of these techniques in genomic studies. To tackle these challenges, we introduce an NF-Core compliant pipeline tailored for phasing, imputation, and validation. Our pipeline equips users with advanced genomic analysis tools for phasing and imputation analysis. This enables them to leverage the full potential of their genetic data by filling in missing information, harmonizing datasets, and enhancing the resolution of genetic analyses. By adhering to NF-Core guidelines, our pipeline ensures users have access to a suite of high-standard, versioned, and rigorously tested up-to-date tools, complemented by a robust community, a crucial aspect for FAIR (i.e. Findable, Accessible, Interoperable, Reusable) analysis. In summary, our dedicated NF-Core pipeline offers a comprehensive solution for genomic imputation, covering phasing, target file imputation, and validation stages. Using this pipeline streamlines researchers' genomic investigations, harnessing innovative tools to ensure dependable imputation outcomes across various model and non-model species. 
Furthermore, by adhering to FAIR principles, researchers contribute to standardized, reproducible, and community-supported genomic analysis. |
16:50 | Benchmarking feature selection and feature extraction methods to improve the performances of machine-learning algorithms for patient classification using metabolomics biomedical data PRESENTER: Justine Labory ABSTRACT. Objectives: Classification tasks are an open challenge in the field of biomedicine. While several machine-learning techniques exist to accomplish this objective, several peculiarities associated with biomedical data, especially when it comes to omics measurements, prevent their use or limit their performance. Omics approaches aim to understand a complex biological system through systematic analysis of its content at the molecular level. On the other hand, omics data are heterogeneous, sparse and affected by the classical “curse of dimensionality” problem, i.e. having far fewer observations, or samples (n), than omics features (p). Furthermore, a major problem with multi-omics data is the imbalance either at the class or feature level. The objective of this work is to study whether feature extraction and/or feature selection techniques can improve the performances of classification machine-learning algorithms on omics measurements. Methods: Among all omics, metabolomics has emerged as a powerful tool in cancer research, facilitating a deeper understanding of the complex metabolic landscape associated with tumorigenesis and tumor progression. Thus, we selected three publicly available metabolomics datasets, and we applied several feature extraction techniques both linear and non-linear, coupled or not with feature selection methods, and evaluated the performances regarding patient classification in the different configurations for the three datasets. Results: We provide a general workflow and guidelines on when to use those techniques depending on the characteristics of the data available. To further test the extension of our approach to other omics data, we have included a transcriptomics and a proteomics dataset. 
Overall, for all datasets, we showed that applying supervised feature selection improves the performances of feature extraction methods for classification purposes. Scripts used to perform all analyses are available at: https://github.com/Plant-Net/Metabolomic_project/. |
17:10 | Constructing a robust framework to benchmark deconvolution algorithms PRESENTER: Elise Amblard ABSTRACT. Bulk omic data are routinely collected during the course of treatment of cancer patients to be used in diagnostic and prognostic models. Unfortunately, these data are often misinterpreted because of underlying hidden information ignored by analyses. For instance, not considering cell type composition of the sample can lead to erroneous conclusions. Those proportions can be estimated from bulk data using deconvolution algorithms. However, there is currently no comprehensive multi-omic framework for fairly comparing these methods, and no consensus on the optimal algorithm to use. These limitations are holding back the widespread adoption of deconvolution approaches in computational oncology. In order to help define a consensus by better comparing deconvolution methods, we built a reproducible benchmark workflow. It provides a toolkit to judge the performance of deconvolution algorithms and to compare methods and omics. We evaluated algorithms for the deconvolution of transcriptomic and methylation data, using various performance metrics and gold-standard datasets. The workflow involves computing a comprehensive benchmark score and P-values for each algorithm, enabling a straightforward comparison. We ensure the reproducibility and re-usability of our framework, while it is intended to be flexible as well, as we provide the resources to run our benchmark framework with new datasets or algorithms. |
16:30 | Beyond Recombination: Exploring the Impact of Meiotic Frequency on Genome-wide Genetic Diversity PRESENTER: Louis Ollivier ABSTRACT. In population genetics, a key objective is to elucidate the demographic and molecular mechanisms behind genetic diversity across genomes. Recent advances have highlighted the value of forward-in-time simulations, particularly those facilitated by the SLiM framework, for their capacity to simulate intricate evolutionary processes. Canonical models define meiotic recombination as a variable rate per site per generation. Yet, the passage from one generation to the next is not limited to meiotic processes alone, given that mitotic cell divisions are prevalent in most species. This paper studies the interplay between the frequency of meiosis versus mitosis, recombination rates, and selection coefficients on genetic diversity. Our findings reveal that selective sweeps reduce the entire genomic diversity when meiotic frequency is low. We further elucidate that recombination rate per meiosis, rather than meiotic frequency, defines the breadth of the genetic linkage valley surrounding a selective sweep. Additionally, we explore how the dominance coefficient influences the loss of genomic diversity. This paper contributes to a deeper understanding of how recombination rate plays a crucial role in shaping genetic diversity, offering new insights into the complexities of evolutionary genetics. By examining selective sweeps and their genetic consequences, we highlight the importance of incorporating realistic modeling of the most essential parameters, such as recombination rate. |
16:50 | When lianas and trees talk DNA: new insights into genetic exchanges in plants PRESENTER: Emilie Aubin ABSTRACT. Horizontal transfer (HT) refers to the exchange of genetic material between divergent species, without reproduction. HT has played a significant role in bacterial evolution, but its role is underestimated in higher eukaryotes. Several recent studies have demonstrated HTs in eukaryotes, particularly in the context of parasitic relationships and model species. However, very little is known about HTs in natural ecosystems, particularly those involving non-parasitic wild species and the nature of relationships between species that promote these HTs. To fill this knowledge gap, we conducted a pilot study to investigate HTs in a natural ecosystem, the Massane forest located in the south of France, by sequencing the genomes of 17 wild non-model species. To reach this goal, we developed a new computational pipeline called INTERCHANGE, allowing the characterization of HTs at the whole genome level without prior annotation and directly in the raw sequencing reads. Using this pipeline, we identified 12 HT events, half of which occurred between lianas and trees. We found that the transferred sequences were mainly low-copy-number LTR-retrotransposons from the Copia superfamily. This work highlights a new possible route for HTs between non-parasitic plants and provides new insights into genomic characteristics of horizontally transferred DNA in plant genomes. |
17:10 | Unzipped genome assemblies of polyploid root-knot nematodes reveal unusual and clade-specific telomeric repeats PRESENTER: Etienne G.J. Danchin ABSTRACT. Using long-read sequencing, we assembled and unzipped the polyploid genomes of Meloidogyne incognita, M. javanica and M. arenaria, three of the most devastating plant-parasitic nematodes. We found the canonical nematode telomeric repeat to be missing in these and other Meloidogyne genomes. In addition, we find no evidence for the enzyme telomerase or for orthologs of C. elegans telomere-associated proteins, suggesting alternative lengthening of telomeres. Instead, analyzing our assembled genomes we identify species-specific composite repeats enriched mostly at one extremity of contigs. These repeats are G-rich, oriented, and transcribed, similarly to canonical telomeric repeats. We confirm them as telomeric using fluorescent in situ hybridization. These repeats are mostly found at one single end of most chromosomes in these species. The discovery of unusual and specific complex telomeric repeats opens a plethora of perspectives and highlights the evolutionary diversity of telomeres despite their central roles in senescence, aging, and chromosome integrity. |
#26 Nadège Guiglielmoni and Philipp H. Schiffer "Phasing or purging: tackling the genome assembly of a highly heterozygous animal species in the era of high-accuracy long reads"
#27 Baptiste Ruiz, Arnaud Belcour, Samuel Blanquart, Sylvie Buffet-Bataillon, Isabelle Le Huërou-Luron, Anne Siegel and Yann Le Cunff "SPARTA : intégration de connaissances fonctionnelles pour une classification interprétable des microbiomes."
#34 Rémy Costa, Alban Mancheron and William Ritchie "Enhancing genomic data privacy: k-mer based strategy for reducing re-identification risks"
#43 Clémence Su, Olivier Alibert, Florence Glibert, Cécile Dulary, Cédric Fund, Eric Bonnet, Jean-François Deleuze, Solène Brohard and Sophie Chantalat "High-order chromatin contacts in regulatory regions with the Pore-C approach"
#51 Vitushanie Yogaranjan "Artificial self-sustaining microbial consortia for space biomining: predicting the species dynamics via an ecological model."
#54 Franck Samson and Sébastien Aubourg "GBOT, one flew over the ortholog's nest"
#56 Allyson Moureaux and Anne-Muriel Arigon "La place de l'Intelligence Artificielle pour la gestion de la production pharmaceutique aujourd'hui."
#57 Adam Schumacher, Mickaël Lebalch, Léopold Carron and Romain Grall "Characterization of Transcriptomic Signature of an anchor compound for Quality Control of Future Studies."
#60 Margaux Haering and Bianca Habermann "mitoXplorer 3.0 : exploring mitochondrial dynamics in single-cell data."
#61 Arthur Durante, Guillaume Devailly, Katia Feve, Yann Labrune, Laure Gress, Denis Milan, Jean-Luc Gourdine, Hélène Gilbert, David Renaudeau and Juliette Riquet "Climate, heat-stress, and genetics impacts the whole-blood gene expression levels in crossbred pigs"
#62 Elise Maigné, Isabelle Sanchez, David Carayon, Joseph Tran and Jean-François Rey "SK8: an institutional management and hosting service for Shiny applications"
#63 Jean-Christophe Mouren, Antoinette Van Ouwerkerk, Magali Torres, Frederic Gallardo, Salvatore Spicuglia and Benoit Ballester "Genome-wide Characterization of Exons as Regulatory Elements"
#64 Julie Cartier, Chloé-Agathe Azencott, Adeline Fermanian and Florian Massip "Applying the knockoff procedure on transcriptomic data to improve stability"
#65 Bryan Brancotte and Elodie Chapeaublanc "Shiny-K8s, a toolkit to easily and reproducibly deploy your (R)Shiny app thanks to docker and Kubernetes (without mastering them!)"
#67 Aurélie Mercadié, Eléonore Gravier, Gwendal Josse, Nathalie Vialaneix and Céline Brouard "NMFProfiler: a supervised NMF extension for integrating omics data"
#69 Faustine Souc, Thomas Bellembois, Nils Paulhe, Erina Point, Antoine Mahul, Nadia Goué and Franck Giacomoni "MetaboCloud : A catalog of microservices hosted on a Cloud infrastructure and addressing issues linked to FAIR principles and open science."
#70 Veronique Jamilloux, Ariane Bize, Aurélie Gramusset, Cédric Gil, Nicolas Raidelet, Cédric Midoux, Emilie Fernandez, Guilhem Heinrich and Aurore Chapelle "DeepOmics, Digital Environmental Engineering Platform for meta-Omics data"
#71 Solène Pety, Ingrid David, Mahendra Mariadassou and Andrea Rau "A flexible simulation framework for transgenerational hologenomic data"
#72 Caroline Sancho, Nacer Mohellibi, Pierre Renault and Julien Tap "Development of an integrative datawarehouse to ease food microbial data mining"
#76 Ercan Seckin, Dominique Colinet, Marc Bailly-Bechet, Edoardo Sarti and Etienne Danchin "Identification of orphan genes and de novo gene birth in the evolution of plant parasitic nematodes"
#78 Emilie Fernandez, Ariane Bize, Elie Desmond-Le Quemener, Virginie Rossard and Eric Latrille "Information systems for environmental biorefinery: combining omics data with bioprocess data"
#79 Lijiao Ning, Léa Bellenger, Naïra Naouar and Christophe Antoniewski "Offering researchers Interactive Online Companionship to acquire lasting expertise in genomic data analysis"
#81 Roland Barriot, Maxime Bonhomme, Emilie Lecompte, Jérôme Farinas and Gwennaele Fichant "Master de Bioinformatique de l'Université Toulouse 3"
#82 Elodie Girard, Nicolas Servant and Paul-Antoine Nicolas "Prediction of tumor-specific neoantigen based on both DNA and RNA next-generation sequencing data from pan-cancer patients"
#83 Nicolas Haas, Julie Dawn Thompson, Jean-Paul Renaud, Kirsley Chennen and Poch Olivier "StopKB: A comprehensive knowledgebase for nonsense suppression therapies"
#84 Vincent Rocher, Anne Goelzer and Nathalie Vialaneix "Evaluation of gene network inference methods on the fully reconstructed network of Bacillus subtilis"
#85 Sidwell Bobe Rigade and Denis Mestivier "Heterogeneity of V3-V4 region sizes among genus impact 16S rRNA sequence analysis"
#86 Maxime Ben Braiek, Mekki Boussaha, Philippe Bardou, Pierre Faux, Bertrand Servin, Cécile Grohs, Julien Sarry, Florian Besnard, Jeanlin Jourdain, Sébastien Taussat, Chris Hoze, Clémentine Escouflaire, Hélène Leclerc, Patrice Dehais, Florent Woloszyn, Claire Kuchly, Camille Eche, Camille Marcuzzo, Cécile Donnadieu, Marcel Amills, Stéphane Fabre, Didier Boichard, Gwenola Tosser-Klopp and Aurélien Capitan "Whole genome reverse genetic screen for natural deleterious variants in 4000 domestic ruminants to gain insight into mammalian gene function"
#88 Vincent Le Goff, Vincent Guillemot, Cathy Philippe, Gwendoline Mendes, Jean-François Deleuze, Edith Le Floch and Arnaud Gloaguen "Impact of joint Dimension Reduction methods for survival prediction - extension of a multi-omics benchmark study"
#90 Quentin Bouvier, Mohammad Salma, Eric Soler and Charles Lecellier "TF-based regulatory network controlling terminal erythroid differentiation"
#91 Salomé Brunon, Laurent Jourdren and Sophie Lemoine "Generation of gene annotations including UTRs for Bulk and Single-Cell RNASeq analyses"
#92 Corinne Blugeon, Salomé Brunon, Ali Hamraoui, Laurent Jourdren, Sophie Lemoine, Tiphaine Marvillet, Catherine Senamaud-Beaufort, Oumy Seydi, Stephane Le Crom and Morgane Thomas-Chollier "GenomiqueENS, the IBENS Genomics core facility"
#93 Oscar Otero Laudouar, Benoit Samson, Philippe Bertolino, Olivier Gandrillon, Franck Picard and Ghislain Durif "Aggregating nuclei segmentation methods to quantify cell counts in spatial transcriptomics data"
#94 Margaux Labrosse, Maxime Mathieu, Mireille Andre, Emmanuelle Arnaud, Amandine Girousse, Coralie Sengenes and Xavier Contreras "Identification of ASC sub-populations originating from adipose tissue and involved in muscle homeostasis and regeneration"
#95 Rémi-Vinh Coudert, Frédéric Jauffrit, Jean-Philippe Charrier, Jean-Pierre Flandrois and Céline Brochier-Armanet "MPS-Sampling: a novel method allowing the reliable selection of representative genomes"
#97 Sarah Maman, Gaston Rognon, Chloé Bellanger and Léonard Ransan "DeFIS - Detection pipeline with k-mers analysis to identify species."
#98 Noémie Teixido, Elisa Simon, Emmanuelle Labarthe, Emilien Rottier, Benjamin Basso, Fanny Mondet, Alain Vignal, Kamila Canale-Tabet and Thibault Leroy "Metagenomics and population genomics of honey bees through the direct sequencing of honeys"
#99 Ali Hamaroui, Salomé Brunon and Laurent Jourdren "ToulligQC 2.6: fast and comprehensive quality control for Oxford Nanopore sequencing data"
#100 Koloina Rabemanantsoa "Modeling interactions between a host and its gut microbiota"
#101 Anaïs L'Haridon, Thomas Rambaud, Virginie Saillour and Alban Lermine "SeqOIA cancer pipeline for diagnostic: current tools and upcoming evolutions"
#103 Mariene Wan, Aaron Millan-Oropeza, Thomas Lacroix, Jonathan Mineau-Cesari, Sophie Schbath and Valentin Loux "SIDURI : an integrative information system for next generation fermented foods"
#105 David Hirst, Morgane Térézol, Laura Cantini, Paul Villoutreix, Matthieu Vignes and Anaïs Baudot "MOTL: enhancing multi-omics matrix factorization with transfer learning"
#108 Jean Mainguy, Jérôme Arnoux, Guillaume Gautreau, Adelme Bazin, David Vallenet and Alexandra Calteau "PPanGGOLiN V2: technical enhancement and new features to analyze thousands of prokaryotic genomes"
#110 Céline Chevalier, Philippe Baratta, Sevda Rafatov, Benjamin Loire and Anaïs Baudot "scDataPipeline: A Reproducible Framework for Single-Cell RNA-Seq Analysis"
#112 Lucie Lamothe, Yasmina Kermezli, Yuna Blum and Magali Richard "deconvPDAC: a single-cell based quantifier of PDAC tumor cellular heterogeneity"
#133 Sébastien Gradit, Elisabeth Hellec, Julien Fumey, Jérémy Rousseau, Benjamin Loire, Baptiste Imbert, Arthur Durante, Samuel Ortion and Ravy Leon Foun Lin "International Society for Computational Biology Student Council Regional Student Group France (RSG France): Association of Young Bioinformaticians of France (JeBiF)"
#144 Joe Ueda, Adeline Humbert, Johann Peltier, Daniel Gautheret, Olga Soutourina and Claire Toffano-Nioche "Search for ncRNA targets in Clostridioides difficile by cross-referencing MAPS data and RNA-RNA interaction predictions"
#168 Luis Martin Pena, Elise Amblard and Magali Richard "A robust framework for the benchmarking of deconvolution methods for omic data analysis"
#177 Francis Ogereau, Mateo Hiriart, Pierre Marin, Bérénice Batut, Cléa Siguret, Philippe Ruiz, Sophie Desset, Pierre Pouchin, Faustine Souc, Nils Paulhe, Franck Giacomoni, David Grimbichler, Thomas Bellembois, Valérie Legué, Pierre Peyret, Antoine Mahul and Nadia Goué "FAIRing Research Data to Live @AuBi Platform"
#184 Elouan Bethuel, Magali Hennion and Olivier Kirsh "Methylator, a complete workflow for DNA methylation analysis"
#191 Nicolas Homberg, Magali Richard and Florent Chuffart "The backstage of the third Health Data Challenge: Introducing an Organizational Template and Custom, Temporary and Local Codabench Instances."