ICCABS 2025: 13TH INTERNATIONAL CONFERENCE ON COMPUTATIONAL ADVANCES IN BIO AND MEDICAL SCIENCES
PROGRAM FOR SUNDAY, JANUARY 12TH

09:30-10:30 Session 2: Keynote Talk (SCE Auditorium)

Yury Khudyakov, Centers for Disease Control and Prevention (CDC)

Conceptual challenges of viral molecular epidemiology

10:50-12:30 Session 3A: ICCABS 1 (SCE Auditorium)
10:50
Enhancing Protein Side Chain Packing Using Rotamer Clustering and Machine Learning

ABSTRACT. Side chain prediction/packing is one of the major challenges in predicting a protein structure in three-dimensional space, and it is important because of its many applications in protein design. In recent years, many methods have been developed for side chain prediction, such as DLPacker, FASPR, SCWRL4, and OPUS-Rota4. In this research, we address the problem from a different perspective: we employ a machine learning model to predict the side chain packing of protein molecules given only the Cα trace. We analyzed 32,000 protein molecules to extract geometric features that can distinguish between different orientations of side chain rotamers, and we designed and implemented a Random Forest model for the task. Given the accuracy of existing state-of-the-art approaches, our model represents an improvement over other models. The results of our experiments show that Random Forest is highly effective, achieving a total average accuracy of 73.7% for proteins and 73.3% for individual amino acids.
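The abstract does not specify which geometric features are extracted from the Cα trace; as one hedged illustration, a pseudo-bond angle at each interior Cα is a feature that can be derived from a Cα-only trace (function and argument names are ours, not the authors'):

```python
import math

def ca_pseudo_angles(ca_coords):
    """Pseudo-bond angle (degrees) at each interior Ca atom of a Ca trace.

    ca_coords: list of (x, y, z) tuples, one per residue.
    Returns one angle per interior residue.
    """
    def sub(a, b):
        return tuple(ai - bi for ai, bi in zip(a, b))

    def dot(a, b):
        return sum(ai * bi for ai, bi in zip(a, b))

    angles = []
    for i in range(1, len(ca_coords) - 1):
        # Vectors from the central Ca to its two chain neighbors.
        u = sub(ca_coords[i - 1], ca_coords[i])
        v = sub(ca_coords[i + 1], ca_coords[i])
        cos_t = dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))
        # Clamp to guard against floating-point drift outside [-1, 1].
        angles.append(math.degrees(math.acos(max(-1.0, min(1.0, cos_t)))))
    return angles
```

Features like these, computed per residue, could then feed a standard Random Forest classifier; the actual feature set used in the paper may differ.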

11:10
Can Language Models Reason about ICD Codes to Guide the Generation of Clinical Notes?
PRESENTER: Ivan Makohon

ABSTRACT. The past decade has seen a surge in the amount of electronic health record (EHR) data in the United States, attributed to a favorable policy environment created by the Health Information Technology for Economic and Clinical Health (HITECH) Act of 2009 and the 21st Century Cures Act of 2016. Clinical notes for patients’ assessments, diagnoses, and treatments are captured in these EHRs as free-form text by physicians, who spend a considerable amount of time entering them. Manually writing clinical notes can take considerable time, increasing patients’ waiting times and possibly delaying diagnoses. Large language models (LLMs) such as GPT-3 can generate news articles that closely resemble human-written ones. We investigate the use of Chain-of-Thought (CoT) prompt engineering to improve an LLM’s responses in clinical note generation. In our prompts, we incorporate International Classification of Diseases (ICD) codes and basic patient information along with similar clinical case examples to investigate how LLMs can effectively formulate clinical notes. We tested our CoT prompting technique on six clinical cases from the CodiEsp test dataset using GPT-4 as our LLM, and our results show that it outperformed the standard zero-shot prompt.
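The exact prompt template is not given in the abstract; the sketch below is a hypothetical assembly of a CoT-style prompt from ICD codes, basic patient information, and similar-case examples (all field names and wording are illustrative, not the authors' template):

```python
def build_cot_prompt(icd_codes, patient_info, examples):
    """Assemble a chain-of-thought prompt for clinical note generation.

    icd_codes: list of (code, description) pairs
    patient_info: dict of basic demographics, e.g. {"age": 54}
    examples: list of similar clinical case notes used as few-shot context
    """
    lines = ["You are a physician writing a clinical note."]
    lines.append("Patient: " + ", ".join(
        f"{k}={v}" for k, v in sorted(patient_info.items())))
    lines.append("Diagnosis codes (ICD):")
    for code, desc in icd_codes:
        lines.append(f"  - {code}: {desc}")
    for i, ex in enumerate(examples, 1):
        lines.append(f"Similar case {i}:\n{ex}")
    # The CoT instruction: ask the model to reason before writing.
    lines.append("Reason step by step about what each ICD code implies "
                 "for the assessment and plan, then write the note.")
    return "\n".join(lines)
```

The resulting string would be sent to the LLM (GPT-4 in the paper); the reasoning instruction is what distinguishes this from a zero-shot prompt.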

11:30
Link Prediction in Disease-Disease Interactions Network Using a Hybrid Deep Learning Model
PRESENTER: Ashwag Altayyar

ABSTRACT. Discovering disease-disease associations based on the underlying biological mechanisms is an essential biomedical task in modern biology, as understanding these relationships will assist biologists in studying the pathogenesis, diagnosis, and treatment of human diseases. Recently, deep learning on graphs and graph neural networks have achieved promising performance in modeling complex biological structures and learning compact representations of interconnected data. Inspired by the success of graph neural networks in learning subgraph representations, we propose a novel framework, SNN-VGA, designed to predict potential disease comorbid pairs. We first model disease-associated genes as subgraphs in the protein-protein interaction network and learn disentangled disease module representations using a subgraph neural network model. The learned embeddings are leveraged by a variational graph auto-encoder to predict disease comorbidity in the disease-disease interaction network. Empirical results on a benchmark dataset demonstrate that our method performs competitively with the state-of-the-art model, achieving an AUROC of 0.96.

11:50
Explaining Protein Folding Networks Using Integrated Gradients and Attention Mechanisms

ABSTRACT. Protein folding prediction models like AlphaFold and ColabFold have revolutionized structural biology by providing accurate protein structures. However, these models are often criticized for their lack of interpretability. In this paper, we propose the application of Explainable AI (XAI) techniques, specifically Integrated Gradients and Attention Mechanisms, to elucidate the decision-making process of these complex networks. We conduct computational experiments to evaluate the effectiveness of these methods and discuss potential implications for the field.

12:10
Unsupervised Learning for Tertiary Structure Prediction of Protein Molecules: Systematic Review

ABSTRACT. Tertiary structures of molecules represent high-dimensional data containing the spatial information of hundreds (even thousands) of atoms. Unsupervised learning techniques can be applied to such spatial data to uncover hidden organizations that can be subjected to further evaluation. Such techniques have already been employed in a number of relevant applications, e.g., tracking conformational changes in a set of structures, detecting biologically active tertiary structures among computed structures of proteins, and analyzing molecular dynamics simulations of peptides. This paper presents a comprehensive review of clustering techniques for tertiary (3D) molecular structure data, focusing on protein molecules. The article systematically organizes and analyzes the existing approaches in terms of data representation, methodology, proximity measure, and evaluation metric. In addition, it highlights key open challenges and proposes future research directions to advance this domain.

10:50-12:30 Session 3B: CAME 1 (SCE Room 203)
10:50
Antigenic cooperation and cross-immunoreactivity networks

ABSTRACT. Antigenic cooperation serves as an alternative to immune escape by continuous genomic diversification and provides insights into various experimental observations associated with the establishment of chronic infections by highly mutable viruses such as Hepatitis C, HIV, Influenza A, Zika, and SARS-CoV-2. In this mechanism, viral variants arrange themselves into a network of cross-immunoreactivity and take up complementary roles such as altruistic, persistent, and transient. The role of a viral variant in this immune escape mechanism is not an inherent property of the variant; it changes with dynamic changes in the quasi-social ecosystem of the virus, such as the emergence of a new variant or the merging of two intra-host viral populations through a viral transmission between two infected hosts. We explore the interactions between altruistic variants shielding persistent variants from the host immune system and discover that each altruistic variant in a cross-immunoreactivity network operates independently of the others. Connections between altruistic variants change neither their qualitative roles nor the strength of persistent variants that they can shield from the host immune system. While having more altruistic variants certainly increases the number of persistent variants that can be shielded from the host immune system, the number of persistent variants connected to an altruist does not change the level of immune response targeted toward it. Variants strongly compete with each other to become persistent, and altruists have a maximal load of variants that they can shield from the host immune system. We also investigate real datasets of cross-immunoreactivity networks formed by Hepatitis C viruses in acute and chronic infections and find that these two classes of networks have significantly different in-degree distributions, out-degree distributions, and eigenvector centralities.
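The abstract compares the in-degree and out-degree distributions of cross-immunoreactivity networks. A minimal sketch of computing these from a directed edge list follows; the edge-direction semantics in the docstring are an assumption for illustration, not taken from the paper:

```python
from collections import Counter

def degree_distributions(edges):
    """In- and out-degree counts for a directed cross-immunoreactivity network.

    edges: iterable of (source_variant, target_variant) pairs; here an edge
    u -> v is assumed to mean that antibodies raised against u also bind v.
    Returns (in_degree, out_degree) Counters keyed by variant id.
    """
    in_deg, out_deg = Counter(), Counter()
    for u, v in edges:
        out_deg[u] += 1
        in_deg[v] += 1
    return in_deg, out_deg
```

Statistics such as eigenvector centrality would require iterating over the same adjacency structure; degree counts are the simplest of the network properties the abstract mentions.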

11:10
The Use of Google Trends Data to Improve COVID-19 Incidence and Mortality Predictions Over Time in the 50 States of the USA and the District of Columbia

ABSTRACT. Accurate predictions of incidence and mortality trends over time are useful for the immediate allocation of available public health resources as well as for studying the course of the pandemic in the long run. Surveillance data, however, may come with reporting delays, so auxiliary data sources that are available immediately can provide valuable additional information for a given time period. In this work, a broad set of Google search queries performed by individuals is analyzed, including keywords related to COVID-19 incidence and mortality that may provide insights into future epidemic behavior. The identified search keywords were evaluated for their associations and predictive abilities using Granger tests and cross-correlation coefficients against reported incidence and mortality time series. In the second step, ARIMA, Prophet, and XGBoost models were used to generate predictions using either only reported incidence and mortality (baseline model) or reported data together with the search keywords selected for their predictive abilities (predictors model). The prediction results across models were evaluated using benchmark statistics. In summary, adding the keywords identified as most appropriate by the Granger tests and cross-correlation coefficients significantly enhanced prediction accuracy across all three models and can be recommended for use; keywords that were less promising by these criteria were also considered, but their effects were smaller. The predictors and the corresponding improvements were more pronounced for incidence and related search keywords, while the results were less accurate for mortality trends and the corresponding predictions.
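The keyword-screening step pairs Granger tests with cross-correlation coefficients. As a hedged sketch of the cross-correlation half only (the Granger test would normally come from a statistics package such as statsmodels), the function below finds the lead at which a search-volume series best correlates with reported incidence; names and the lag range are illustrative:

```python
import math

def pearson(x, y):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return sxy / (sx * sy)

def best_lagged_correlation(keyword_series, incidence_series, max_lag=14):
    """Largest cross-correlation over leads of 0..max_lag time steps,
    with the keyword series leading incidence. Returns (best_lag, best_r)."""
    best = (0, pearson(keyword_series, incidence_series))
    for lag in range(1, max_lag + 1):
        # Shift: keyword at time t is compared with incidence at t + lag.
        x = keyword_series[:-lag]
        y = incidence_series[lag:]
        if len(x) > 2:
            r = pearson(x, y)
            if abs(r) > abs(best[1]):
                best = (lag, r)
    return best
```

A keyword whose best lagged correlation clears some threshold would then be admitted as an extra regressor in the ARIMA/Prophet/XGBoost predictors model.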

11:30
The evolution of cancer progression risk: a phylogenetic and machine learning analysis

ABSTRACT. Understanding the evolution of cancer in early stages is critical to identifying the key drivers of cancer progression and developing better early diagnostics or prophylactic treatments. Early cancer is difficult to observe, though, since it is generally asymptomatic until extensive genetic damage has been accumulated. In this study, we use computational methods to infer how once-healthy cells enter into and become committed to a pathway of aggressive cancer through a strategy of using tumor phylogeny methods to look backwards in time to earlier stages of tumor development and machine learning to infer how progression risk changes over those stages. We apply this paradigm to point mutation data from a set of cohorts from the Cancer Genome Atlas (TCGA) to formulate models of how progression risk evolves from the earliest stages of tumor growth and how this portrait varies within and between cohorts. The results suggest general mechanisms by which risk develops as well as variability among them. They suggest limits to the potential for early diagnosis while also providing reason for hope in extending it beyond current practice.

13:30-14:30 Session 4: Keynote Talk (SCE Auditorium)

Srinivas Aluru, School of Computational Science and Engineering, Georgia Institute of Technology

Genome graphs: Algorithms and applications

14:40-16:00 Session 5A: ICCABS 2 (SCE Auditorium)
14:40
Resistance genes are distinct in protein-protein interaction networks according to drug class and gene mobility

ABSTRACT. With growing calls for increased surveillance of antibiotic resistance as an escalating global health threat, improved bioinformatic tools are needed to track antibiotic resistance genes (ARGs) across One Health domains. Most studies to date profile ARGs using sequence homology, but such approaches provide limited information about the broader context or function of the ARG in bacterial genomes. Here we introduce a new pipeline, PPI-ARG-finder, for identifying ARGs in genomic data that employs machine learning analysis of Protein-Protein Interaction Networks (PPINs) as a means to improve predictions of ARGs while also providing vital information about the genetic context, such as gene mobility. A random forest model was trained to effectively differentiate between ARGs and nonARGs and was validated using the PPINs of ESKAPE pathogens (Enterococcus faecium, Staphylococcus aureus, Klebsiella pneumoniae, Acinetobacter baumannii, Pseudomonas aeruginosa, and Enterobacter cloacae), which represent urgent threats to human health because they tend to be multi-antibiotic resistant. The pipeline exhibited robustness in discriminating ARGs from nonARGs, achieving an average area under the precision-recall curve of 88%. We further identified that the neighbors of ARGs, i.e., genes connected to ARGs by only one edge, were disproportionately associated with mobile genetic elements, which is consistent with the understanding that ARGs tend to be more mobile compared to randomly sampled genes in the PPINs. This pipeline showcases the utility of PPINs in discerning distinctive characteristics of ARGs within a broader genomic context and in differentiating ARGs from nonARGs through network-based attributes and interaction patterns.
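One of the network-based attributes the abstract describes is the set of genes connected to an ARG by exactly one edge in the PPIN. A minimal sketch of extracting those one-hop neighbors from an undirected edge list (identifiers are illustrative, not from the pipeline):

```python
from collections import defaultdict

def one_hop_neighbors(edges, arg_nodes):
    """Genes connected to a known ARG by exactly one edge in a PPIN.

    edges: iterable of undirected (gene_a, gene_b) pairs
    arg_nodes: set of gene ids annotated as ARGs
    Returns the set of non-ARG genes adjacent to at least one ARG.
    """
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    neighbors = set()
    for arg in arg_nodes:
        neighbors |= adj[arg]
    # Exclude ARGs themselves so only non-ARG neighbors remain.
    return neighbors - arg_nodes
```

In the paper, the fraction of such neighbors associated with mobile genetic elements is what supports the gene-mobility finding.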

15:00
MetaEdit: Computational Identification of RNA editing in Microbiomes

ABSTRACT.

Motivation: RNA editing is a pivotal post-transcriptional mechanism that plays a critical role in the regulation of some genes by altering their mRNA sequences, thereby influencing the resulting protein sequence and structure and the downstream functional and cellular responses. While extensively studied in eukaryotes, its significance and prevalence in prokaryotic microbiomes remain underexplored. Given the crucial role of microbiomes in various biological processes and their potential impact on human health and disease, understanding RNA editing within these communities could reveal new insights into microbial gene regulation and adaptation. The lack of studies detecting RNA editing in microbiomes motivates the development of bioinformatic strategies to bridge this research gap. Results: This study introduces MetaEdit, a computational tool designed to detect RNA editing in bacterial microbiomes. We apply MetaEdit to metatranscriptomic and metagenomic datasets to identify and characterize RNA editing events in the human gut microbiome. Our results demonstrate the presence of RNA editing in Escherichia coli and provide a foundation for future investigations into the functional implications of RNA editing in microbiomes. Our findings are supported by previously reported research but need validation with laboratory experiments. The developed pipeline is generic and can be applied to find RNA editing in any sequencing dataset containing both metagenomic and metatranscriptomic data.

15:20
Improving inter-helical residue contact prediction in α-helical Transmembrane proteins using structural neighborhood crowdedness information
PRESENTER: Aman Sawhney

ABSTRACT. Residue contact maps are a useful compressed representation that can be used as constraints for structural modeling, but they can also help identify inter-helical binding sites and are hence effective on their own. In this work, we hypothesize that crowdedness around a target residue pair influences whether it is a contact point. We develop two measures of crowdedness in the 3-D neighborhood of a residue: bin counts, defined in terms of relative residue distance, and a residue contact number measure adapted for inter-helical TM proteins. Since unsupervised language models are very accurate but also complementary to our approach, we combine the MSA transformer score with our proposed features to assess the impact of crowdedness on residue contact prediction. We find that crowdedness measures can in fact increase the upper-bound performance by at least 7.65% average precision in cross-validation experiments and by at least 11.59% average precision in our held-out experiments. Further, we develop a method to transfer this information when true crowdedness information is unavailable; this approach outperforms the MSA transformer by at least 1.15% average precision in cross-validation experiments and 1.85% in held-out experiments.
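The "bin counts" crowdedness measure is described only as counts defined by relative residue distance; one plausible reading, sketched below under that assumption, counts residues falling in concentric distance shells around a target residue (the shell radii are invented for illustration):

```python
import math

def crowdedness_bins(coords, center_idx, bin_edges=(4.0, 8.0, 12.0)):
    """Count residues in concentric distance shells around a target residue.

    coords: list of (x, y, z) residue coordinates
    center_idx: index of the target residue
    bin_edges: outer radii (angstroms) of each shell; values are assumptions.
    Returns one count per shell.
    """
    center = coords[center_idx]
    counts = [0] * len(bin_edges)
    for i, p in enumerate(coords):
        if i == center_idx:
            continue
        d = math.dist(p, center)
        lo = 0.0
        for k, hi in enumerate(bin_edges):
            if lo <= d < hi:
                counts[k] += 1
                break
            lo = hi
    return counts
```

For a residue pair, the feature vector could concatenate the bin counts of both residues; the paper's exact binning and distance definition may differ.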

15:40
Novel Molecular Markers of the Inferior Colliculus identified via Single Nuclei Transcriptomics

ABSTRACT. The inferior colliculus (IC) of the midbrain is a major hub in the central auditory system. Altered IC activity is implicated in tinnitus, speech processing issues, and deficits in temporal processing. In past decades, various neuron types in the IC have been identified using approaches such as morphology, histochemistry, or electrophysiology. In this work, we aim to develop an unbiased approach that comprehensively identifies all cell types, whether neuronal or non-neuronal. We sequenced over 70K single nuclei from the left and right ICs of adult male and female CBA/CaJ mice after bilateral noise exposure and from age- and sex-matched controls to produce comprehensive single-cell RNA-Seq profiles. Our computational analysis workflow identifies the cellular composition of both the neuronal and non-neuronal cell populations in the IC. We identify novel differential transcriptomic expression in several neuronal subtypes. Our analysis establishes the baseline cellular and molecular profiles of the neuronal populations present in the IC of adult mice, both with normal hearing and after noise exposure.

14:40-16:00 Session 5B: CAME 2 (SCE Room 203)
14:40
Leveraging Multiple Dimensions of Public Data to Characterize the Evolution of a Staphylococcal Plasmid

15:00
Phylogenetic inference of migration histories of viral populations under evolutionary and structural constraints

15:20
Leveraging large language models for predicting viral subtypes from sequence data

15:40
Mixed HCV Infection: A novel genetic marker for surveillance of high-risk populations

16:10-17:10 Session 6A: CANGS 1 (SCE Auditorium)
16:10
scDisInFact: disentangled learning for integration and prediction of multi-batch multi-condition single-cell RNA-sequencing data

ABSTRACT. Single-cell RNA-sequencing (scRNA-seq) has been widely used for disease studies, where sample batches are collected from donors under different conditions including demographic groups, disease stages, and drug treatments. It is worth noting that the differences among sample batches in such a study are a mixture of technical confounders caused by batch effect and biological variations caused by condition effect. However, current batch effect removal methods often eliminate both technical batch effect and meaningful condition effect, while perturbation prediction methods solely focus on condition effect, resulting in inaccurate gene expression predictions due to unaccounted batch effect.

Here we introduce scDisInFact, a deep learning framework that models both batch effect and condition effect in scRNA-seq data. scDisInFact learns latent factors that disentangle condition effect from batch effect, enabling it to simultaneously perform three tasks: batch effect removal, condition-associated key gene detection, and perturbation prediction. We evaluate scDisInFact on both simulated and real datasets, and compare its performance with baseline methods for each task. Our results demonstrate that scDisInFact outperforms existing methods that focus on individual tasks, providing a more comprehensive and accurate approach for integrating and predicting multi-batch multi-condition single-cell RNA-sequencing data.

16:30
A Multi-Model Ensemble Learning Framework with Uncertainty Quantification for Enhanced Poliovirus Surveillance (M-SURE)

ABSTRACT. Sequence data deposited in the Sequence Read Archive (SRA) from NCBI may provide valuable information about current or emerging infectious diseases. Identifying poliovirus sequences in such data requires accurate and reliable computational methods to classify viral sequences. Aligning all public sequence data to the poliovirus genome and then identifying poliovirus sequences requires a large amount of human intervention. This can lead to incorrect classifications, particularly for ambiguous examples as opposed to direct ones, and often lacks clear measures of confidence in the results (uncertainty quantification). To address this challenge, we introduce M-SURE (Machine Learning Surveillance and Uncertainty Reporting Engine), a multi-model machine learning framework that provides detailed uncertainty metrics for robust classification of poliovirus sequences found in the SRA database. Our framework processes sequence data through several metrics, including alignment percent identity, alignment lengths, and log-transformed features, which are fed into an ensemble of models. Specifically, it utilizes PyCaret AutoML for model selection, Ludwig deep learning with a custom architecture, TensorFlow neural networks with dual dense layers, and Bayesian inference for uncertainty quantification. The models were trained and tested on 50,000 randomly selected BLAST alignments between poliovirus types 1, 2, and 3 and either known poliovirus samples or known non-poliovirus samples. Known poliovirus samples included 26 poliovirus control samples submitted to the SRA database. Known non-poliovirus samples included non-polio enteroviruses, bacteria, fungi, vertebrates, and all other non-poliovirus sequences in the RefSeq database. The system enforces a 70% confidence threshold for high-confidence predictions and uses a multi-model voting approach for final classification decisions.

Samples with low uncertainty (<15%) are clearly classified, while those with medium uncertainty (15–30%) or high uncertainty (>30%) are flagged for expert review. In the test dataset, only a small proportion (3.3%) of the analyzed alignments required expert review. Analysis of the framework's performance shows robust classification capabilities, achieving high model agreement (>70% confidence) in identifying poliovirus and non-poliovirus alignments. The system successfully separates clear classifications from ambiguous examples, employing a novel alert mechanism that categorizes predictions based on both model agreement and confidence levels. Performance metrics demonstrate high accuracy on clear classifications, with Bayesian models providing comprehensive uncertainty quantification. Out of 50,000 randomly selected alignments, 1,639 samples (3.278%) were flagged as requiring further review. M-SURE is designed to be compatible with existing surveillance systems, scalable to large datasets, and adaptable to different sequence features. Future enhancements will include adding sequence features; expanding to multi-class virus classification, enabling the system to distinguish between poliovirus, non-polio enteroviruses, and other viral pathogens; integrating active learning components; and developing explainable AI capabilities.
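The triage thresholds quoted in the abstract (70% confidence, <15% / 15–30% / >30% uncertainty) can be sketched as a routing rule; the function and label names below are ours, and how M-SURE actually combines agreement with uncertainty is an assumption:

```python
def triage(uncertainty, agreement, conf_threshold=0.70):
    """Route one prediction using the thresholds described for M-SURE.

    uncertainty: model uncertainty in [0, 1]
    agreement: fraction of ensemble members voting for the majority class
    Per the abstract: <15% uncertainty is a clear classification, 15-30%
    (medium) and >30% (high) are flagged for expert review, and predictions
    below the 70% confidence threshold are never auto-accepted.
    """
    if agreement < conf_threshold:
        return "low_confidence"   # fails the multi-model voting threshold
    if uncertainty < 0.15:
        return "clear"
    if uncertainty <= 0.30:
        return "review_medium"    # flagged for expert review
    return "review_high"          # flagged for expert review
```

Applied over all alignments, the fraction returning a review label corresponds to the 3.3% of samples the abstract reports as needing expert review.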

16:50
HAPLOQ: A simple strategy for correcting sequencing errors on viral amplicons

16:10-17:10 Session 6B: CASCODA (SCE Room 203)
16:10
Supervised Poisson Factorization for Uncovering Latent Gene Expression Patterns and Predicting Disease Risk

ABSTRACT. Predicting personalized disease risk enables clinicians to implement treatment strategies that may prevent or delay the onset of complex diseases. Accurate disease risk prediction using gene expression data remains challenging due to the reliance of current methods on gene-level features. Identifying latent disease-relevant structures within these data could inform personalized interventions and improve patient outcomes. Existing approaches, however, do not adequately model the complex relationships between genes and patient specific factors, including sociodemographic variables, limiting their predictive power and interpretability. In this work, we develop a Bayesian supervised Poisson factorization model to learn the latent gene expression patterns that enhance disease risk prediction while incorporating environmental risk factors. Our method captures complex, multi-gene dependency patterns, resulting in more accurate predictions. By uncovering structured latent patterns that integrate with patient specific characteristics and sociodemographic factors, our method may provide a deeper understanding of underlying disease mechanisms.

16:30
A Fused Transformer-based Model for Gene Expression Prediction using Histopathology Images

ABSTRACT. Spatial transcriptomics (ST) is a cutting-edge technology that enables the spatial localization and analysis of gene expression within tissue sections. Despite its transformative potential, ST is constrained by high costs and limited spatial resolution, making it less accessible and challenging to implement widely. To overcome these limitations, we propose a fused transformer-based approach designed to predict high-density gene expression profiles directly from Whole Slide Images (WSIs). Our model capitalizes on the multi-scale hierarchical structure of WSIs, integrating several key components: ResNet50 and transformer encoders to extract features at various scales, and a Fused Transformer Block (FTB) that effectively aggregates these features. The FTB incorporates Spatial Positional Embeddings (SPE) to maintain spatial context, cross attention to prioritize relevant features across different scales, and Random Mask Attention (RMA) to concentrate the model's attention on the most significant patterns. To validate our approach, we conducted extensive experiments using three public spatial transcriptomics datasets (HBC, HER2+, SCC) and three additional Visium datasets from 10X Genomics. Our results demonstrate that the proposed model not only preserves essential spatial relationships for mapping gene expressions to tissue morphology but also outperforms current state-of-the-art methods.

16:50
Cancer progression inference from single-cell data with recurrences and mutation losses

ABSTRACT. The inference of cancer evolutionary histories is a key step toward understanding and treating the disease; thus, many tools have been developed in the last decade to attack this important problem. However, methods for inferring tumor phylogenies need to strike a balance between keeping running times small and employing sophisticated evolution models. Binary characters, such as single-nucleotide variants and known mutations, are our focus: they are an example of a simple model able to capture most relevant cases, though not copy number variants. On binary characters, most methods are designed for simpler models where mutations can only be accumulated under the infinite sites assumption; those models, however, tend to be too simplistic for real-case scenarios.

While the most explored direction is to allow mutation losses, in this paper we introduce an even more general model, where each mutation can be acquired and lost more than once. We describe this model, provide a simulated annealing approach exploiting this novel evolutionary framework, and demonstrate its accuracy in several sets of experimental evaluations compared to less general models, as well as its potential application to real data.
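The abstract names simulated annealing but does not spell out the move set, cost function, or acceptance rule; the snippet below is only the generic Metropolis acceptance step that any such search would use (names and the injectable `rng` parameter are ours):

```python
import math
import random

def anneal_step(current_cost, proposed_cost, temperature, rng=random.random):
    """Metropolis acceptance rule for simulated annealing over tree proposals.

    Always accept an improvement; accept a worse candidate with probability
    exp(-delta / temperature), so worse moves become rarer as the search cools.
    rng is injectable to make the decision deterministic in tests.
    """
    delta = proposed_cost - current_cost
    if delta <= 0:
        return True
    return rng() < math.exp(-delta / temperature)
```

In a full implementation, this step would sit inside a loop that proposes local edits to the candidate tumor phylogeny and gradually lowers the temperature.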