09:30 | SVJedi-Tag: a novel method for genotyping large inversions with linked-read data PRESENTER: Mélody Temperville ABSTRACT. Structural Variants (SVs) are an important but overlooked aspect of genetic variation, with a strong impact on phenotypes, including disease and traits of agronomic interest. In particular, inversions are known for their role in the evolution of biological diversity and are particularly studied in non-model species using population data. One of the major steps in the study of SVs is genotyping, i.e. determining the presence or absence of each allele of an SV in an individual. While long-read data is ideal for studying SVs, because of the long-distance information it provides, it hardly scales up in a cost-efficient manner for genotyping many individuals. Linked-read data provide a relevant alternative by combining the low sequencing cost of short reads with long-distance information thanks to the use of barcodes tagging long molecules. Whereas several methods have been proposed to discover SVs with linked-reads, there is currently no tool for genotyping with this type of sequencing data. In this paper, we present SVJedi-Tag, the first inversion genotyping tool dedicated to linked-read data. SVJedi-Tag represents the inversions in a variation graph, maps the linked-reads on the graph and then analyzes the barcode signals of reads aligned on each side of the inversion breakpoints to predict the genotype of each inversion. We tested SVJedi-Tag on simulated and real linked-read data in the seaweed fly Coelopa frigida, and showed that SVJedi-Tag is able to genotype with high accuracy large inversions above 25 kb, with a read depth as low as 3X. A last test with a real complex and large inversion of 4 Mb and real linked-read data demonstrated that SVJedi-Tag provides an efficient and accurate method to genotype complex rearrangements in species of ecological interest. |
09:50 | GrAnnoT, a tool for efficient and reliable annotation transfer through pangenome graph PRESENTER: Nina Marthe ABSTRACT. The increasing availability of genome sequences has highlighted the limitations of using a single reference genome to represent the diversity within a species. Pangenomes, encompassing the genomic information from multiple genomes, thus offer a more comprehensive representation of intraspecific diversity. However, pangenomes in the form of graphs often lack annotation information, which limits their utility for downstream analyses. We introduce here GrAnnoT, a tool designed for efficient and reliable annotation transfer using such graphs, by projecting existing annotations from a source genome to the graph and subsequently to other embedded genomes. GrAnnoT was benchmarked against state-of-the-art tools on pangenome graphs and linear genomes from rice, human, and E. coli. The results demonstrate that GrAnnoT is consensual, conservative, and fast, outperforming alignment-based methods in accuracy or speed or both. It provides informative outputs, such as presence-absence matrices for genes, and alignments of transferred features between source and target genomes, aiding in the study of genomic variations and evolution. GrAnnoT's robustness and replicability across different species make it a valuable tool for enhancing pangenome analyses. GrAnnoT is available under the GNU GPLv3 licence at https://forge.ird.fr/diade/dynadiv/grannot. |
10:10 | Facilitating genome annotation using ANNEXA and long-read RNA sequencing PRESENTER: Nicolaï Hoffmann ABSTRACT. With the advent of complete genome assemblies, genome annotation has become essential for the functional interpretation of genomic data. Long-read RNA sequencing (LR-RNAseq) technologies have significantly improved transcriptome annotation by enabling full-length transcript reconstruction for both coding and non-coding RNAs. However, challenges such as transcript fragmentation and incomplete isoform representation persist, highlighting the need for robust quality control (QC) strategies. This study presents an updated version of ANNEXA, a pipeline designed to enhance genome annotation using LR-RNAseq data while also providing QC for reconstructed genes and transcripts. ANNEXA integrates two transcriptome reconstruction tools, StringTie2 and Bambu, applying stringent filtering criteria to improve annotation accuracy. It also incorporates deep learning models to evaluate transcription start sites (TSSs) and employs the tool FEELnc for the systematic annotation of long non-coding RNAs (lncRNAs). Additionally, the pipeline offers intuitive visualizations for comparative analyses of coding and non-coding repertoires. Benchmarking against multiple reference annotations revealed distinct patterns of sensitivity and precision for known and novel genes and transcripts, and for both mRNAs and lncRNAs. To demonstrate its utility, ANNEXA was applied in a comparative oncology study involving LR-RNAseq of two human and eight canine cancer cell lines. The pipeline successfully identified novel genes and transcripts across species, expanding the catalog of protein-coding and lncRNA annotations in both species. Implemented in Nextflow for scalability and reproducibility, ANNEXA is available as an open-source tool: https://github.com/IGDRion/ANNEXA. |
09:30 | Exhaustive Identification of Pleiotropic Loci for Serum Leptin Levels in the NHGRI-EBI Genome-Wide Association Catalog PRESENTER: Anthony Haidamous ABSTRACT. Leptin is an adipokine that regulates energy expenditure and calorie intake by acting on the hypothalamic leptin-melanocortin pathway. It has known effects on obesity, inflammation, and neurodevelopment, hence carrying high pleiotropic potential. Using the Genome-Wide Association Study (GWAS) Catalog, we identified the single nucleotide polymorphisms (SNPs) associated with fluctuations of leptin in the blood, which we expanded into wider genetic blocks, and explored their associations and effect sizes with leptin levels. We then looked into the genetic pleiotropy linked with these genomic regions. Starting from 35 GWAS studies on leptin levels with a minimum discovery sample size > 1000 individuals, we selected 25 SNPs reaching genome-wide significance (p < 5 × 10⁻⁸). This led to the aggregation of 15 genetic blocks with SNPs in high linkage disequilibrium with the leptin SNPs. The blocks were also associated with 574 pleiotropic phenotypes, which were then grouped into 22 categories. The enriched categories included other adipokines (n=3, fold-enrichment against background: e=3.5, effects in the opposite direction to leptin), obesity-related (n=11, e=15, increasing with leptin), inflammation-related (n=47, e=2), cancer-related (n=12, e=1.7, effects in the opposite direction to leptin), body fat (n=33, e=2.3, increasing with leptin), type-2-diabetes-related (n=107, e=1.6), and addiction-related (n=61, e=2.3) cross-traits. The list of genes overlapping with the genetic blocks was used to map a protein-protein interaction network surrounding leptin, with which we identified functional modules enriched in ontological terms such as the leptin-melanocortin pathway, embryogenesis, immunity and transcription regulation. 
This study extends the genetic architecture behind leptin levels to its wider roles in human physiology by deciphering molecular pathways and gene modules implicated in its end effects. |
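The fold-enrichment values (e) quoted in this abstract compare how often a phenotype category appears among the traits linked to the leptin blocks with how often it appears in a background set. The following is a minimal Python sketch, assuming the usual ratio-of-proportions definition; the abstract does not state the exact formula or the background used, and the function name and counts are hypothetical toy values, not data from the study.

```python
def fold_enrichment(hits_in_set, set_size, hits_in_background, background_size):
    """Ratio of a category's proportion in the set to its proportion in the background.

    Assumed definition (not taken from the abstract):
    (hits_in_set / set_size) / (hits_in_background / background_size).
    """
    if set_size <= 0 or background_size <= 0:
        raise ValueError("set and background sizes must be positive")
    if hits_in_background == 0:
        raise ValueError("category absent from background: enrichment undefined")
    observed = hits_in_set / set_size
    expected = hits_in_background / background_size
    return observed / expected


# Toy numbers only: 10 of 100 traits in the set vs 50 of 1000 in the background.
example = fold_enrichment(10, 100, 50, 1000)
```

A value above 1 means the category is over-represented in the set relative to the background; a value below 1 means it is depleted.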
09:50 | jsPCA enables fast, interpretable and parameter-free domain identification in 3D spatial transcriptomics data PRESENTER: Ines Assali ABSTRACT. Spatial Transcriptomics (ST) uncovers gene expression patterns within the elaborate spatial layout of tissues, a level of detail absent in single-cell transcriptomics analysis, which enhances our comprehension of cell-environment interactions. Accurate spatial information is critical for clustering cell domains and for a better understanding of their functional connections in intricate biological tissues. In this study, we propose a novel approach, joint spatial principal component analysis (jsPCA), to efficiently reveal complex gene expression profiles while preserving the spatial context of tissues in multi-slices or multi-samples ST. Our approach consists in identifying the principal components (PCs) that best maximize the product of spatial autocorrelation (Moran’s Index) and transcriptomic covariance, reflecting both the structure of genetic expression and its spatial distribution. By combining dimensionality reduction and emphasis on spatial correlations, jsPCA refines the ability to detect spatial gene expression patterns and variations, thereby improving the outcome of domain clustering. We take advantage of sparse matrices to improve scalability, which makes it ideally adapted to the analysis of large-scale ST datasets. The interpretability of jsPCA arises from its linear structure, which provides a clear understanding of the impact of each variable on the clustering results, in contrast to current more complex approaches based on Graph Neural Networks. jsPCA handles multi-slice or multi-sample analysis. Spatial domains are obtained by Gaussian mixture clustering in this joint space. We evaluated our approach using the Visium 10x dataset of human dorsolateral prefrontal cortex (DLPFC), featured in numerous benchmarks. 
Our approach demonstrated robust performance, comparable to or better than various state-of-the-art methods, while being fast, interpretable and parameter-free. |
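The criterion described in this abstract (the product of Moran's index and transcriptomic variance) can be written compactly. The following is a sketch under assumptions the abstract does not spell out: X is the column-centered n × p expression matrix (spots by genes), W is a row-standardized spatial weight matrix in which every spot has at least one neighbor (for several slices it is assumed block-diagonal), and a is a unit-norm loading vector. The notation is illustrative and is not taken from the authors.

```latex
\begin{aligned}
\operatorname{Var}(Xa) &= \frac{1}{n}\, a^{\top} X^{\top} X a, \\
I(Xa) &= \frac{a^{\top} X^{\top} W X a}{a^{\top} X^{\top} X a}, \\
\operatorname{Var}(Xa)\, I(Xa) &= \frac{1}{n}\, a^{\top} X^{\top} W X a
  = \frac{1}{2n}\, a^{\top} X^{\top} \left(W + W^{\top}\right) X a .
\end{aligned}
```

Under these assumptions, the maximizing loading vector is the leading eigenvector of the symmetric matrix (1/2n) Xᵀ(W + Wᵀ)X. This is consistent with the abstract's points that the method is linear, and therefore interpretable, and that it can exploit sparse matrices.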
10:10 | RITHMS: An advanced stochastic framework for the simulation of transgenerational hologenomic data PRESENTER: Solène Pety ABSTRACT. A holobiont is made up of a host organism together with its microbiota. In the context of animal breeding, the holobiont can be viewed as the single unit upon which selection operates. Therefore, integrating microbiota data into genomic prediction models may be a promising approach to improve predictions of phenotypic and genetic values. Nevertheless, there is a paucity of hologenomic transgenerational data to address this hypothesis, and thus to fill this gap, we propose a new simulation framework. Our approach, RITHMS (an R Implementation of a Transgenerational Hologenomic Model-based Simulator), is an open-source package that builds upon simulated transgenerational genotypes from the MoBPS package and incorporates distinctive characteristics of the microbiota, notably vertical and horizontal transmission as well as modulation due to the environment and host genetics. In addition, RITHMS can account for a variety of selection strategies and is adaptable to different genetic architectures. We simulated transgenerational hologenomic data using RITHMS under a wide variety of scenarios, varying heritability, microbiability, and microbiota heritability. We found that the simulated data accurately preserved key characteristics across generations, notably microbial diversity metrics, and exhibited the expected behavior in terms of correlation between taxa, modulation of vertical and horizontal transmission, response to environmental effects, and evolution of phenotypic values depending on the selection strategy. Our results support the relevance of our simulation framework and illustrate its possible use for building a selection index balancing genetic gain and microbial diversity. 
RITHMS is an advanced, flexible tool for generating transgenerational hologenomic data that incorporate the complex interplay between genetics, microbiota and environment. |
09:30 | rnaends: an R package targeted to study the exact RNA ends at the nucleotide resolution PRESENTER: Tomas Caetano ABSTRACT. 5’ and 3’ RNA-end sequencing protocols have unlocked new opportunities to study aspects of RNA metabolism such as synthesis, maturation and degradation, by enabling the quantification of exact ends of RNA molecules in vivo. From RNA-Seq data that have been generated with one of the specialized protocols, it is possible to identify transcription start sites (TSS) and/or endoribonucleolytic cleavage sites, and even co-translational 5’ to 3’ degradation dynamics in some cases. Furthermore, post-transcriptional addition of ribonucleotides at the 3’ end of RNA can be studied at nucleotide resolution. While RNA-end sequencing library protocols vary, and each has its specificities, the generated RNA-Seq data are very similar and share common processing steps. Most importantly, the major aspect of RNA-end sequencing is that only the mapped location of the 5’ or 3’ end is of interest, contrary to conventional RNA sequencing, which considers genomic ranges for gene expression analysis. This translates to a simple representation of the quantitative data as a count matrix of RNA-end locations on the reference sequences. This representation seems under-exploited and is, to our knowledge, not available in a generic package focused on the analysis of exact transcriptome ends. Here, we present the rnaends R package, which is dedicated to RNA-end sequencing analysis. It offers features for raw read pre-processing, RNA-ends mapping and quantification, RNA-ends count matrix post-processing, and further count matrix downstream analyses such as TSS identification, fast Fourier transform for signal periodic patterns analysis, or differential proportion of RNA-ends analysis. 
The use of rnaends is illustrated with applications in RNA metabolism studies through selected workflows on published RNA-end datasets: (i) TSS identification, (ii) ribosome translation speed and co-translational degradation, (iii) post-transcriptional modifications analysis and differential proportion analysis. |
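The count-matrix representation described in this abstract keeps only the mapped position of one end of each read. A minimal sketch of that idea for 5’ ends follows. It is written in Python for illustration only and does not reproduce the rnaends R API; the function names, the tuple layout (reference, start, end, strand with 1-based inclusive coordinates) and the toy alignments are all assumptions.

```python
from collections import Counter


def five_prime_end(start, end, strand):
    """Return the 5' end of an alignment given 1-based inclusive start/end.

    On the '+' strand the 5' end is the leftmost position; on the '-' strand
    it is the rightmost one.
    """
    if strand == "+":
        return start
    if strand == "-":
        return end
    raise ValueError(f"unknown strand: {strand!r}")


def count_rna_ends(alignments):
    """Count reads per (reference, 5' end position, strand)."""
    counts = Counter()
    for ref, start, end, strand in alignments:
        counts[(ref, five_prime_end(start, end, strand), strand)] += 1
    return counts


# Hypothetical toy alignments: two '+' reads share a 5' end at position 100.
alignments = [
    ("chr", 100, 150, "+"),
    ("chr", 100, 140, "+"),
    ("chr", 60, 120, "-"),
]
ends = count_rna_ends(alignments)
```

Because each read contributes a single position rather than a range, the result is a sparse table of counts per location, which downstream steps such as TSS identification or periodicity analysis could then consume.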
09:50 | Strain-dependency of metabolic pathways within 1,494 genomes of lactic bacteria evidenced with Prolipipe, an in silico screening pipeline PRESENTER: Noé Robert ABSTRACT. Genomes from bacteria of interest to the food industry exhibit significant functional variability, yet evaluating this characteristic remains challenging. As public repositories continue to accumulate more genomes, large-scale assessment of metabolic potential emerges as a promising method to highlight this functional variability. The primary challenge lies in automating a workflow to construct metabolic networks from genomes on a massive scale, with enzyme identification in sequences being a critical bottleneck. Here, we present Prolipipe, a pipeline designed for the large-scale assessment of metabolic potential in bacteria, focusing on specific pathways. Given a large dataset of hundreds to thousands of bacterial genomes with known taxonomy and a list of targeted pathways, Prolipipe identifies gene functions through a comprehensive annotation step using three different tools. It then builds a genome-scale metabolic network for each genome. These networks are then parsed to document the presence or absence of each reaction across all processed genomes and queried for reactions specific to particular pathways. By doing so, the pipeline evaluates the metabolic potential of each genome to carry out the pathway according to its gene content and highlights the best candidates among the large-scale set of genomes. In this study, Prolipipe was applied to 1,494 genomes of lactic acid bacteria, assessing the completion ratio of 761 pathways. We classified pathways according to their maximum completion rate, revealing that 137 pathways can be operated by at least one strain in our dataset. 
By mapping the identifiers of these pathways onto the pathway ontology graph of the MetaCyc database, we highlighted that none of the pathways within four functional classes of MetaCyc can be entirely recovered in the strain dataset. We then investigated infraspecific variability, a strong indicator of functional variability, and compared the species in our genome dataset based on their tendency to exhibit infraspecific variability. This analysis revealed species' potential for strain-dependency, where phenotypes differ among strains of the same species, a feature observable in Prolipipe outputs. |
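The completion ratio and maximum completion rate mentioned in this abstract can be illustrated with a small sketch. It assumes the ratio is simply the fraction of a pathway's reactions found in a genome's metabolic network; the abstract does not give the exact definition, and the function names, strain labels and reaction identifiers below are hypothetical.

```python
def completion_ratio(genome_reactions, pathway_reactions):
    """Fraction of the pathway's reactions that are present in one genome."""
    pathway = set(pathway_reactions)
    if not pathway:
        raise ValueError("pathway has no reactions")
    return len(pathway & set(genome_reactions)) / len(pathway)


def max_completion(genomes, pathway_reactions):
    """Return (best ratio, sorted ids of the genomes reaching it)."""
    if not genomes:
        raise ValueError("no genomes given")
    ratios = {
        genome_id: completion_ratio(reactions, pathway_reactions)
        for genome_id, reactions in genomes.items()
    }
    best = max(ratios.values())
    return best, sorted(g for g, r in ratios.items() if r == best)


# Hypothetical toy data: reaction sets per strain and one three-step pathway.
genomes = {
    "strain_A": {"R1", "R2", "R3"},
    "strain_B": {"R1", "R3"},
    "strain_C": {"R4"},
}
pathway = ["R1", "R2", "R3"]
best, best_strains = max_completion(genomes, pathway)
```

In this toy case only strain_A completes the pathway, so a species containing both strain_A and strain_B would show the kind of strain-dependent difference the abstract describes.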
10:10 | Benchmarking circRNA Detection Tools from Long-Read Sequencing Using Data-Driven and Flexible Simulation Framework PRESENTER: Anastasia Rusakovich ABSTRACT. Circular RNAs (circRNAs) are unique non-coding RNAs with covalently closed loop structures formed through backsplicing events. Their stability, tissue-specific expression patterns, and potential as disease biomarkers have garnered increasing attention. However, their circular structure and diverse size range pose challenges for conventional sequencing technologies. Long-read Oxford Nanopore (ONT) sequencing offers promising capabilities for capturing entire circRNA molecules without fragmentation, yet the effectiveness of bioinformatic tools for analyzing this data remains understudied. This study presents the first comprehensive benchmark comparison of three specialized tools for circRNA detection from ONT long-read data: CIRI-long [1], IsoCIRC [2], and circNICK-Irs [3]. To address the lack of standardized evaluation frameworks, we developed a novel computational pipeline to generate realistic simulated datasets that integrate several molecular features of circRNAs extracted from established databases (CircAtlas [4] and CircBase [5]) and real nanopore sequencing characteristics. Benchmark FASTQ reads were simulated in silico using NanoSim [6] and the Zhang et al. dataset [1] to accurately reflect biological diversity and technical properties. We systematically assessed tool performance across key metrics, including precision, recall, specificity, accuracy, and F1 score. Our analysis revealed distinct performance profiles: while all tools exhibited high specificity, they varied in precision and their ability to detect different circRNA subtypes, often showing limited sensitivity and precision. Notably, the overlap in detected circRNAs among tools was relatively low. Additionally, computational efficiency varied significantly across the tools. 
This suggests that relying on a single tool might not be ideal, and combining tools or improving algorithms could be necessary for more accurate circRNA detection from ONT data. This benchmark provides valuable insights for researchers selecting appropriate tools for circRNA studies using ONT sequencing. Furthermore, our customizable simulation framework is freely available online, offering a resource to optimize detection approaches and advance bioinformatic tool development for circRNA research. |
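The metrics named in this abstract (precision, recall, specificity, accuracy, F1 score) all derive from the four confusion-matrix counts. The sketch below uses the standard textbook definitions; how the benchmark itself decides that a detected circRNA matches a simulated one is not described in the abstract, so the counts here are hypothetical toy values and the function name is illustrative.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard metrics from confusion-matrix counts; empty denominators give 0.0."""

    def ratio(numerator, denominator):
        return numerator / denominator if denominator else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)  # also called sensitivity
    return {
        "precision": precision,
        "recall": recall,
        "specificity": ratio(tn, tn + fp),
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "f1": ratio(2 * precision * recall, precision + recall),
    }


# Toy counts only: 8 true positives, 2 false positives,
# 85 true negatives, 5 false negatives.
metrics = classification_metrics(tp=8, fp=2, tn=85, fn=5)
```

With many true negatives, specificity and accuracy can stay high even when recall is modest, which is one way a tool can show high specificity alongside limited sensitivity.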
Integrating structural variants in genomic studies of rare and complex diseases with long-read sequencing and pangenomes