ICCABS 2026: THE 14TH INTERNATIONAL CONFERENCE ON COMPUTATIONAL ADVANCES IN BIO AND MEDICAL SCIENCES
PROGRAM FOR MONDAY, FEBRUARY 16TH


09:00-10:40 Session 7A: CANGS-1
Location: Room C/D
09:00
Enhancement of the Genetic Algorithm using Saltation

ABSTRACT. Since the beginning of the COVID-19 pandemic, an unprecedented growth in the number of SARS-CoV-2 variants has been observed, and their epistatic networks exhibit a saltation-based evolutionary pattern. The occurrence of multiple correlated mutations leads to synergistic epistasis. To exploit the epistasis network formed by Clique-SNV, in which dense subgraphs correspond to emerging saltation, we incorporate Clique-SNV into a genetic algorithm to simulate saltational evolution. Stasis is defined as a fixed number of generations without improvement in the fitness landscape; once this condition is met, Clique-SNV is triggered to induce an evolutionary jump. We validate the proposed approach on the NP-hard Traveling Salesman Problem, demonstrate the performance of the enhanced GA, and further improve the TSP routes with route planarization to remove intersections.
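The stasis-triggered jump described above can be illustrated with a toy sketch (hypothetical example only; the real system derives its jumps from dense Clique-SNV subgraphs rather than random large mutations):

```python
import random

def evolve(fitness, init_pop, n_gen=200, stasis_limit=20, seed=0):
    """Toy GA in which prolonged stasis triggers a saltational jump
    (a burst of large mutations) instead of the usual small drift."""
    rng = random.Random(seed)
    pop = [list(ind) for ind in init_pop]
    best = max(pop, key=fitness)
    stasis = 0  # generations without improvement of the best individual
    for _ in range(n_gen):
        parent = rng.choice(pop)
        # saltation: once stasis is detected, mutate with a much larger step
        sigma = 1.0 if stasis >= stasis_limit else 0.1
        child = [g + rng.gauss(0.0, sigma) for g in parent]
        pop.append(child)
        pop = sorted(pop, key=fitness, reverse=True)[:len(init_pop)]
        if fitness(pop[0]) > fitness(best):
            best, stasis = pop[0], 0
        else:
            stasis += 1
    return best
```

On a one-dimensional landscape such as f(x) = -(x - 3)^2, the enlarged mutation step lets the search escape long plateaus that small drift alone cannot leave quickly.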

09:20
Structured Modeling of Allelic-specific Gene Expression

ABSTRACT. Abnormalities in gene expression can serve as important diagnostic signals in cases of undiagnosed genetic disease. Of particular relevance to the search for gene regulatory defects is allelic imbalance in expression. We describe a family of probabilistic graphical models for dissecting allelic signals in both endogenous genes and massively parallel reporter assays. For endogenous genes, we show that detailed modeling of haplotype structure can improve estimation accuracy when short read data are used, and in the context of pedigrees we show that joint modeling of the sharing of haplotypes and expression imbalance between individuals leads to improved identification of inheritance patterns of genetic causes of imbalance, even when causal factors are unobserved. In the case of pooled, multi-sample reporter assays, we show that imposing structure on the sample pools leads to higher estimation accuracy, due to heterogeneity in allele frequencies and an induced Poisson-binomial structure in the data-generating process. We also identify a number of important future directions for these applications.

09:40
A classification pipeline for B-ALL genomic subtypes using Nanopore mRNA-sequencing

ABSTRACT. The clinical classification of B-cell acute lymphoblastic leukemia (B-ALL) genomic subtypes (driven by gross aneuploidies and fusion oncogenes) has significant prognostic implications. Previously, we showed that a partial least squares regression-support vector machine (PLSR-SVM) composite model can classify major leukemia lineages and genomic subtypes based on Oxford Nanopore Technologies whole-transcriptome sequencing (ONT-WTS). Additionally, we showed that direct fusion calling at high depth detects most B-ALL gene fusions. Compared to conventional methods for B-ALL genomic subtyping, ONT-WTS has the advantages of lower capital and sequencing costs, making it accessible to a broader range of resource settings. Here, we present an updated B-ALL classification pipeline enhanced by more granular target B-ALL genomic subtypes, additional training data, data quality filtering, and direct fusion detection.

Previously, the PLSR-SVM model was trained on 1,036 transcriptomes derived from Illumina short-read mRNA-sequencing and 134 in-house transcriptomes derived from ONT-WTS. In addition to the short-read transcriptomes, the current PLSR-SVM model is trained on a larger cohort of 221 in-house and 168 external ONT-WTS transcriptomes. RNA was extracted from cryopreserved bone marrow and peripheral blood mononuclear cells of patients with acute leukemia. Full-length cDNA libraries were prepared and sequenced on the nanopore platform. Gene expression quantification was based on alignments to the reference transcriptome (GRCh38) and assignment of reads to the transcripts with the highest mapping scores. To improve data quality, confounding transcripts stemming from hemoglobin, mitochondrial, and ribosomal RNA were removed. For direct fusion detection, B-ALL fusions were called from putative fusion transcripts resulting from spliced alignment of full-length cDNA reads. Candidate B-ALL fusion transcripts were called based on split-read support after filtering ambiguous alignments. B-ALL gene fusions detected with high confidence take precedence over PLSR-SVM classifications.

Previously, the PLSR-SVM model achieved a B-ALL genomic subtype accuracy of 94.1%, spanning four distinct subtypes (ETV6-RUNX1, Ph/Ph-like, KMT2Ar, TCF3-PBX1) and two B-ALL subtype groupings, hyperdiploid/near-haploid and “other” (the groupings primarily reflecting low sample representation of these less frequently occurring subtypes). Cross-validation of the updated B-ALL classification pipeline resulted in a B-ALL genomic subtype accuracy of 86%, spanning the distinct subtypes above and six additional distinct subtypes: near-haploid, low hypodiploid, hyperdiploid, DUX4r, MEF2Dr, and ZNF384r. Direct fusion detection and data quality filtering correct genomic subtypes in 11 samples, including ETV6-RUNX1, TCF3-PBX1, and MEF2Dr. Remaining discrepancies involve aneuploid subgroups, which account for 80% of the misclassifications. Future work involves developing a ploidy-specific model and further exploring the few grossly misclassified samples.

10:00
Integrating Sparse Sequence, Experimental, and AI-Predicted Structural Information for Genome-Scale Protein–Nucleic Acid Interaction Prediction

ABSTRACT. Sequence-specific protein-nucleic acid interactions underpin essential processes in gene regulation, yet generalizable computational methods for simultaneously predicting protein recognition sites and binding affinities with DNA and RNA remain limited, largely due to the sparsity of experimental binding data. To address this challenge, our group has developed physics-informed, data-driven models for precise predictions of protein-nucleic acid interactions. By integrating higher-order structural information with sequence features of protein-nucleic acid complexes into an optimized energy model, our model achieves state-of-the-art accuracy in predicting sequence-specific protein-DNA and protein-RNA binding affinities. Leveraging recent advances in AI-predicted complex structures, we further demonstrate the model's effectiveness even in the absence of experimentally resolved training structures. Beyond quantitative affinity prediction, the model identifies favorable interaction motifs for given protein targets, facilitating the rational design of therapeutic nucleic acid aptamers. Importantly, its computational efficiency enables high-throughput, genome-scale binding-site predictions for DNA-binding proteins and can be further enhanced by integrating with sequencing-derived data, including whole-genome bisulfite sequencing and chromatin accessibility data. Together, our work establishes an integrative computational platform that links sequencing-based regulatory data to molecular binding mechanisms, reduces experimental costs for assessing nucleic acid recognition, and enables mechanistic studies of various genetic and epigenetic processes at genomic scales.

10:20
Computational Genomics for Polyploid Organisms: Distinguishing paradigms for post-polyploid evolution

ABSTRACT. Using POInT (the Polyploidy Orthology Inference Tool), we have modeled the post-polyploidy evolution of more than 60 genomes from across the eukaryotic tree of life. Conventional wisdom holds that the duplicated genes produced by polyploidy are rapidly lost after these genome doublings. However, we have identified a class of polyploidy where the continued occurrence of four-way recombination events in meiosis after polyploidy maintains the duplicated material and gives rise to complex patterns in the resulting gene trees. Coupled with our new understanding of the biases in losses between the subgenomes contributing to these polyploidy events, these results again illustrate the power of polyploid organisms to probe complex features of genome structure.

09:00-10:40 Session 7B: CAME-II
Location: Room A/B
09:00
Using gene-specific ADAR editing profiles as blood-based biomarkers for neuropsychiatric disorders

ABSTRACT. As part of post-transcriptional RNA editing, Adenosine Deaminases Acting on RNA (ADARs) can modify a genetically encoded adenosine (A) to inosine (I). A-to-I editing targets double-stranded RNA structures, and the modified adenosine is interpreted as guanosine (G) by the cellular machinery. Multiple studies have reported widespread A-to-I editing changes in viral infections and neuropsychiatric/neurodegenerative diseases, including Parkinson’s Disease (PD). Specifically, alterations in editing profiles of genes such as 5HT2CR and PDE8A have been linked with suicide and mood disorders. However, limited research, if any, has been done to explore changes in ADAR editing during the progression of neuropsychiatric symptoms. Here, we propose a transcriptome-wide approach to measure gene-specific changes in ADAR editing profiles to serve as blood-based prognostic biomarkers for neuropsychiatric disorders. We compared whole blood transcriptome data from healthy controls and patients with the prodromal (initial non-motor) stage of PD. Pairwise distances were computed across samples to measure the magnitude of difference in A-to-I editing, and gene-specific comparisons were made to identify differentially edited genes based on the magnitude of difference in editing. Our preliminary results show differences in the magnitude of gene-specific editing, including PD risk genes such as GBP2. Gene Ontology analysis for molecular function of differentially edited genes shows enrichment for categories associated with transcriptome regulation. For instance, some of the genes with the highest dissimilarity in editing profiles encode transcription repressor proteins, e.g., HDAC1, and pro-inflammatory signaling proteins, e.g., ISG15. These results suggest a potential role of ADAR editing in the progression of PD, possibly through its effect on the expression and translation of PD-related and proinflammatory genes.
The ongoing analysis focuses on confirming this hypothesis through analyzing the relationship (if any) between ADAR editing and the expression of genes in a protein-protein interaction network.
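The pairwise-distance idea above can be sketched minimally (illustrative only; the site identifiers and the mean-absolute-difference metric are assumptions, not the authors' exact distance): represent each sample as per-site A-to-I editing fractions and compare samples over their shared sites.

```python
def editing_distance(profile_a, profile_b):
    """Mean absolute difference in A-to-I editing fraction over shared sites.

    Each profile maps an editing-site identifier to the fraction of reads
    carrying G at a genomically encoded A (a value between 0.0 and 1.0).
    """
    shared = sorted(set(profile_a) & set(profile_b))
    if not shared:
        raise ValueError("no shared editing sites")
    return sum(abs(profile_a[s] - profile_b[s]) for s in shared) / len(shared)
```

Sites present in only one sample are ignored so that missing coverage does not inflate the distance.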

09:20
Comparing Hamming distance vs LLM Embedding distance metrics in Evolutionary Space

ABSTRACT. TBD

09:40
Interpreting Infectious Disease Surveillance During Armed Conflict: War-Driven Epidemiological Divergence

ABSTRACT. A primary challenge in crisis epidemiology is not only studying the effects of disruptive events on public health, but also distinguishing true changes in infectious disease transmission from artifacts of surveillance system collapse. This paper presents a rigorous, data-driven analysis of these challenges using Ukraine’s Kharkiv region as a critical case study. We employ a decade of detailed monthly incidence data (2013-2024), including the period of active war, to quantify and interpret a profound divergence between different infection types. Using time series models (e.g., ARIMA, Prophet) fitted to pre-war data to generate expected incidence, we apply parametric P-scores to measure the deviation of observed cases for acute respiratory infections (COVID-19, influenza) and intestinal infections (salmonellosis, rotavirus). The analysis reveals a stark, war-driven epidemiological divergence. First, reported respiratory infections plummeted in areas experiencing active combat. Our models confirm this decline, with an approximately 90–98% drop in reported cases relative to model-based expectations. Observed respiratory infection counts were significantly negatively correlated with air raid alert intensity, reflecting widespread surveillance collapse and underreporting rather than a true reduction in transmission. Second, in stark contrast, intestinal infections exhibited a gradual resurgence across most subregions, with observed counts exceeding expected levels by more than 100% in several districts by late 2024. This increase in intestinal infections, attributed to the deterioration of water, sanitation, and hygiene (WASH) infrastructure and to overcrowding, was significantly positively correlated with air raid alerts, suggesting reporting persisted due to symptom acuteness. These findings demonstrate the complex, condition-specific impacts of armed conflict.
Critically, this study establishes a replicable methodological framework for interpreting unreliable surveillance data, enabling public health organizations to better distinguish surveillance artifacts from genuine epidemiological trends in active conflict zones.
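The P-score computation can be sketched as follows (a simplified stand-in: the study fits ARIMA/Prophet models to pre-war data for the expected value, whereas here a pre-war seasonal mean takes that role):

```python
def monthly_baseline(prewar_counts):
    """Expected incidence per calendar month from pre-war history.

    prewar_counts maps a month (1-12) to that month's counts across
    pre-war years; the mean stands in for a proper model forecast.
    """
    return {m: sum(v) / len(v) for m, v in prewar_counts.items()}

def p_score(observed, expected):
    """Percentage deviation of observed incidence from expected incidence."""
    return 100.0 * (observed - expected) / expected
```

A P-score of -95 corresponds to the 90–98% respiratory drop the abstract reports, while a P-score above +100 matches the intestinal-infection exceedance.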

10:00
Trustworthy Multimodal LLMs for Medical Diagnostics via Confidence Calibration

ABSTRACT. TBA

10:20
Machine Learning Approaches for Radiotherapy in Head and Neck Cancer

ABSTRACT. Recent advances in large-scale clinical and imaging data have enabled personalized treatment strategies aimed at improving patient outcomes. In radiotherapy (RT) for head and neck cancer, multimodal imaging data—such as computed tomography (CT), positron emission tomography (PET), and magnetic resonance imaging (MRI)—combined with patient-specific clinical information offer significant potential for treatment optimization. Knowledge-based response-adapted radiotherapy (KBR) seeks to personalize RT dose prescriptions by adapting treatment based on predicted patient response, thereby maximizing therapeutic efficacy while minimizing toxicity.

In this study, we investigate the use of CT imaging data integrated with clinical and dosimetric information to predict the RT prescribed dosage for head and neck cancer patients. We conduct experiments using two publicly available datasets, TCIA-HNSCC and HN1. A range of machine learning (ML) models is employed to extract informative features from both imaging and tabular data and to perform dose prediction via regression. Initial results indicate that regression models trained solely on tabular clinical and dosimetric data exhibit below-average predictive performance, highlighting the limitations of unimodal approaches. These findings motivate the need for multimodal learning frameworks that effectively leverage imaging-derived features alongside clinical data to improve RT dose prediction and support response-adapted treatment planning.

10:40-11:00 Coffee Break
11:00-12:40 Session 8A: ICCABS-II
Location: Room A/B
11:00
Exploring the Causal Relationship between the Gut Microbiome and Polycystic Ovarian Syndrome

ABSTRACT. Polycystic ovarian syndrome (PCOS) is an endocrine disorder characterized by irregular menstrual cycles, lack of ovulation, ovarian cyst morphology, and hyperandrogenism. The syndrome affects 10% of women of reproductive age and, if left untreated, can lead to complications such as Type II diabetes, stroke, and endometrial cancer. There is presently no cure for PCOS, its etiology remains largely unknown, and treatment options are fairly limited, often coming with side effects. While previous studies have investigated correlations between gut dysbiosis, PCOS, and clinical parameters, exploration of causal relationships is still quite limited and restricted to taxa abundance. In this study, we analyze publicly available metagenomic data from healthy participants (n=19) and women with PCOS (n=24). We first produce a set of Amplicon Sequence Variants (ASVs), analyze them using standard macroscale approaches (diversity, composition, differential abundance), and compare the results to both the literature and the pilot study. We then build causal Bayesian networks, visualizing relationships between microbial abundance, hormone levels, and other clinical factors. Preliminary results confirm relationships from the literature, including increased gut dysbiosis in PCOS patients and an almost universally lower level of Sex Hormone Binding Globulin (SHBG). Differential and compositional analyses also suggest an increased presence of Erysipelotrichaceae and Faecalibacterium taxa in PCOS samples, and our Bayesian networks suggest these taxa may play roles in the downregulation of SHBG. Additionally, our networks suggest Estrone (elevated in PCOS) may limit the presence of Short Chain Fatty Acid (SCFA)-producing bacteria in the gut, further contributing to dysbiosis.
Finally, our networks suggest collective roles played by Follicle Stimulating Hormone (FSH, elevated in PCOS patients) and Calprotectin (a protein released during inflammation) in supporting the presence of Olsenella, previously associated with endodontic infections. Collectively, these results suggest specific areas of the complex underlying web of interactions to target in future biological experimentation when developing PCOS treatments.

11:20
“You Can’t Deny Who You Are”: Cross-Sectional Dimensionality Reduction Reveals Strongest Connection Between Identity and Lower GI Microbial Composition

ABSTRACT. Numerous connections have now been established between the microbiome and human health, particularly in the gut, where the largest number and variety of bacterial cells are present. Dimensionality reduction has historically been used to project taxa abundance data from a high- to a low-dimensional space, enabling visual analysis of the differentiation of samples into groups. We apply two unsupervised dimensionality reduction techniques (t-SNE and UMAP) to Inflammatory Bowel Disease gut microbiome samples from the Integrative Human Microbiome Project (iHMP), which include samples from multiple lower GI locations in individuals with Crohn’s Disease (CD) and Ulcerative Colitis (UC), in addition to healthy controls. Our analysis reveals that identity, more than age, gender, disease state, or even lower GI tract location, is the most revealing factor when differentiating these gut microbiome samples.

11:40
LLM-GeneTxGraph: A LLM-based Knowledge Graph Generation Framework for Gene-gene Network and Downstream Tasks

ABSTRACT. Currently, graph-based methods for analyzing genomic sequencing data leverage gene-gene networks (e.g., STRING and BioGRID) as prior knowledge to build graph structures. However, these databases suffer from significant drawbacks: incompleteness, a static nature, and a lack of descriptive information. Recently, large language models (LLMs) have demonstrated a powerful ability to understand and reason in the biomedical field without any fine-tuning. Therefore, we propose LLM-GeneTxGraph, a zero-shot LLM-based computational framework designed to dynamically generate descriptive, biologically validated gene-gene interaction networks, together with a downstream task module that serves as a backbone model for various downstream tasks. LLM-GeneTxGraph employs retrieval-augmented generation (RAG) techniques to incorporate existing databases and relevant biomedical literature, leveraging an ensemble of three state-of-the-art LLMs alongside an LLM-as-a-judge to ensure accuracy and biological significance. The downstream task module applies a gated edge-aware graph attention network (GEGAT) model to the generated textual graphs for multiple downstream tasks. Our experiments demonstrate that LLM-GeneTxGraph not only provides more descriptive and information-rich interaction descriptions than existing databases but also uncovers new interactions absent from current databases. The downstream task module further confirms the benefit of utilizing such knowledge graphs for genomic analysis.

12:00
POSTER TALK: LV-Vis: A Memory-Efficient Large Volume Visualization System Using a Multilevel Octree

ABSTRACT. The real-time visualization of large-scale 3D volumetric images, often hundreds of gigabytes to terabytes, is challenging on single machines. Limited RAM/VRAM capacity and I/O bandwidth are significant constraints. To address this, we propose LV-Vis, a memory-efficient system for interactive browsing and multi-volume collaborative analysis. Its core is an octree level-of-detail (LOD) system for hierarchical storage and loading, combined with ROI-oriented progressive zoom. LV-Vis maintains smooth user interaction and navigation while preserving global context through Maximum Intensity Projection and a dual-canvas view. As a result, the system can handle datasets up to 1 TB using only 3 GB of memory with a four-level LOD preprocessing (0.3% of the original size). In scalability tests, the latency for zoom-in operations shows a near-sublinear increase, demonstrating the effectiveness of our cache-sharing and on-the-fly sub-volume loading strategies. In conclusion, LV-Vis enables real-time, scalable, consistent, and memory-efficient visualization of terabyte-scale volumetric data on a single machine.
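The memory arithmetic behind such an LOD hierarchy can be sketched as follows (illustrative only; LV-Vis's actual layout is not described beyond the abstract): each octree level halves every axis, so a level's voxel count shrinks eightfold per step, and a viewer that keeps only coarse levels plus ROI sub-volumes resident stays far below the raw data size.

```python
from math import prod

def lod_level_bytes(shape, bytes_per_voxel, level):
    """Size of one octree LOD level.

    Each level halves every axis of the volume, so the voxel count
    drops by a factor of 8 per level (floored at one voxel per axis).
    """
    return bytes_per_voxel * prod(max(1, s >> level) for s in shape)
```

For a 1024^3 single-byte volume, level 3 occupies 128^3 bytes, 1/512 of the full-resolution level.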

12:10
POSTER TALK: Computerized Tomography Localized Starburst Artifact Reduction using View Angle Shift Bayesian Optimization

ABSTRACT. In contrast to the traditional “step and shoot” sequential computerized tomography (CT) acquisition method, the continuous acquisition method scans the sample without stopping the rotation. While beneficial for shortening scan acquisition time, a novel artifact, termed the "localized starburst artifact" (LSA), appears in the reconstructed images. In this study, we propose and verify the hypothesis that the LSA is a consequence of cumulative deviation in view angles during the rotational CT acquisition process. Additionally, due to the view-angle shifts, the traditional rotational axis misalignment (RAM) correction procedure becomes infeasible, leading to further artifacts. In response, an iterative Bayesian optimization-based LSA and RAM reduction method is introduced, which parametrizes the view-angle deviation and the rotational-axis tilt, iteratively optimizing reconstruction quality with the acutance metric. Experiments with both simulated and real-world datasets show the effectiveness of the proposed approach in mitigating the LSA and enhancing the fidelity of the reconstructed images.

12:20
POSTER TALK: Vibration Correction For High-Resolution Synchrotron Computerized Tomography Using SIREN and Gradient Descent

ABSTRACT. While high-resolution synchrotron tomography (CT) achieves nanometer scales, mechanical vibrations are often unavoidable and can lead to significant artifacts in the reconstructed images, compromising their quality and interpretability. In this work, we propose a novel method for CT vibration correction that leverages Sinusoidal Representation Networks (SIREN) and gradient descent optimization. Our approach begins with a naive keypoint tracking method to obtain rough displacement estimates, followed by a refined correction using SIREN to model the sinogram as a continuous function. By optimizing the displacement parameters through gradient descent, we effectively reduce high-frequency artifacts caused by vibrations. Additionally, we incorporate an optional Bayesian optimization step for further refinement in datasets with a large number of projection angles. Experimental results on both synthetic and real-world datasets demonstrate that the proposed method significantly improves reconstruction quality and successfully corrects vibrations in challenging scenarios with sparse keypoints, where the previous feature-based approach failed.

12:30
POSTER TALK: From Activity to Potency: An Integrated Machine Learning Framework for HIV-1 Integrase Inhibitor Discovery

ABSTRACT. Introduction: The HIV-1 integrase enzyme is a known antiviral target for which experimental assessment is costly. Most machine learning analyses for this target have been based on binary activity. In this study, we develop machine learning models capable of predicting both the activity class and the potency of a compound acting as an HIV-1 integrase inhibitor. Methods: IC50 values for the HIV-1 integrase target protein (target CHEMBL3471) were extracted from the ChEMBL database, yielding a pool of 3186 unique compounds that passed data quality filtering. A variety of 2D fingerprints, including physicochemical, functional group, charge, and Morgan fingerprints, was generated for each candidate using the RDKit library. Candidates with a pIC50 higher than five were considered active. GridSearchCV was used to optimize each model, with the available data split into training and test sets at 80% and 20%, respectively, and each model's performance evaluated on the test set. Results: The best-performing classifier was the RandomForest classifier (accuracy=0.839, F1=0.886, ROC-AUC=0.911, average precision=0.961). The best-performing regression model was XGBoost (R²=0.779, RMSE=0.660, MAE=0.484, Pearson=0.884). Conclusion: The integrated RF-XGBoost approach yields strong and interpretable models for virtual screening and potency ranking of HIV-1 integrase inhibitors.
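The activity-labeling step can be sketched as follows (assuming, as is typical for ChEMBL activity records, that IC50 is expressed in nM; the helper names are illustrative):

```python
import math

def pic50_from_ic50_nM(ic50_nM):
    """pIC50 = -log10(IC50 in molar) = 9 - log10(IC50 in nM)."""
    return 9.0 - math.log10(ic50_nM)

def label_active(ic50_nM, threshold=5.0):
    """Binary label used for the classifier: pIC50 above 5 counts as active."""
    return pic50_from_ic50_nM(ic50_nM) > threshold
```

A 1 µM compound (1000 nM) maps to pIC50 6 and is labeled active, while a 100 µM compound maps to pIC50 4 and is not.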

11:00-12:40 Session 8B: ICCABS-III
Location: Room C/D
11:00
XMorph: Explainable Brain Tumor Analysis Via LLM-Assisted Hybrid Deep Intelligence

ABSTRACT. Deep learning has significantly advanced automated brain tumor diagnosis, yet clinical adoption remains limited by interpretability and computational constraints. Conventional models often act as opaque “black boxes” and fail to quantify the complex, irregular tumor boundaries that characterize malignant growth. To address these challenges, we present XMorph, an explainable and computationally efficient framework for fine-grained classification of three prominent brain tumor types: glioma, meningioma, and pituitary tumors. We propose an Information-Weighted Boundary Normalization (IWBN) mechanism that emphasizes diagnostically relevant boundary regions alongside nonlinear chaotic and clinically validated features, enabling a richer morphological representation of tumor growth. A dual-channel XAI module combines GradCAM++ visual cues with LLM-generated textual rationales, translating model reasoning into clinically interpretable insights. The proposed framework achieves a classification accuracy of 96.0%, demonstrating that explainability and high performance can co-exist in AI-based medical imaging systems.

11:20
Using Cluster Quality Metrics for Record Linkage Evaluation

ABSTRACT. Accurate evaluation of record linkage and entity resolution systems typically requires labeled ground truth data to compute pairwise precision, recall, and F1 scores. However, such labeled data are rarely available in large-scale or privacy-sensitive linkage applications. In this paper, we propose an unsupervised framework for evaluating record linkage results by leveraging pairwise linkage uncertainty to derive cluster-level quality indicators. We first formalize the notion of pairwise linkage uncertainty as a measure of whether two records truly correspond to the same entity. This pairwise uncertainty is then propagated through the linkage graph to quantify the internal cohesion and separation of the resulting clusters. Using this uncertainty-weighted representation, we adapt classical cluster validity metrics (such as the Davies–Bouldin Index, Dunn Index, and Silhouette coefficient) to the record linkage context. Our experimental evaluation on synthetic and real-world datasets demonstrates that these adapted cluster quality metrics correlate strongly with supervised evaluation measures when ground truth is available, and remain informative when it is not. The results suggest that internal clustering indices, when derived from linkage uncertainty probabilities, can provide a practical and interpretable means of assessing record linkage performance in unsupervised settings. Such uncertainty metrics could then be used to link records when no ground truth data are available.
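One adaptation can be sketched minimally (an assumed formulation, not necessarily the authors' exact one: the match probability between two records becomes a distance d = 1 - P(match), over which the standard silhouette is computed):

```python
def silhouette_scores(labels, dist):
    """Per-record silhouette over an uncertainty-derived distance matrix,
    where dist[i][j] = 1 - P(records i and j refer to the same entity).

    Assumes at least two clusters and symmetric distances.
    """
    n = len(labels)
    scores = []
    for i in range(n):
        # a: mean distance to other members of the same cluster (cohesion)
        same = [dist[i][j] for j in range(n) if j != i and labels[j] == labels[i]]
        a = sum(same) / len(same) if same else 0.0
        # b: mean distance to the nearest other cluster (separation)
        other_means = [
            sum(dist[i][j] for j in range(n) if labels[j] == c)
            / sum(1 for j in range(n) if labels[j] == c)
            for c in set(labels) - {labels[i]}
        ]
        b = min(other_means)
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return scores
```

High scores indicate clusters whose members are confidently linked to each other and confidently separated from other entities, which is the cohesion/separation notion the abstract describes.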

11:40
Leveraging Vision Transformers and Traditional Classifiers for Accurate Gastrointestinal Disease Detection

ABSTRACT. Accurate analysis of gastrointestinal endoscopic images is critical for early disease detection, yet manual interpretation remains subjective and resource intensive. This work presents a comprehensive computational framework for gastrointestinal disease classification that integrates deep feature extraction with traditional machine learning classifiers. Convolutional neural networks and Vision Transformer models are employed as feature extractors and combined with Support Vector Machines, Random Forest, and XGBoost classifiers. Experiments are conducted on the publicly available Kvasir dataset comprising eight gastrointestinal classes. The results demonstrate that transformer-based representations consistently outperform convolutional features across all classifiers, with the Vision Transformer and Support Vector Machine pipeline achieving the highest accuracy of 91.92 percent. In addition, we evaluate multiple deep learning architectures in terms of performance, computational complexity, and inference efficiency, where EfficientNet-B0 provides the best balance between accuracy and efficiency. To address limited annotation scenarios common in clinical practice, few-shot learning experiments are performed using transformer-based models, showing consistent performance gains as labeled samples increase. Model interpretability is further analyzed using Grad-CAM visualizations, revealing that correct predictions focus on clinically relevant regions, while misclassifications are associated with diffuse or irrelevant activations. Overall, the proposed framework demonstrates an accurate, efficient, and interpretable solution for automated gastrointestinal disease classification with strong potential for clinical decision support.

12:00
Selene: Faster Detection of Microsatellite Instability in Cancer Transcriptomics Using k-mer Filtering

ABSTRACT. Endometrial and colorectal cancers are frequently caused by high microsatellite instability (MSI-H). MSI-H samples contain an excess of insertions and deletions (indels) in mononucleotide repeats, regions of the genome where the same base occurs repeatedly. The detection of MSI-H in DNA sequencing is mainstream, and existing bioinformatic approaches require mapping the sequencing reads to the human reference genome, a computationally intensive task that takes several hours. MSI-H can also be detected in RNA-sequencing, which motivates us to design a faster method to detect MSI-H by processing small sequences of length k, commonly known as k-mers. We designed and implemented Selene, a tool to detect MSI-H cases in transcriptomic data by identifying target k-mers in the human reference genome originating from one-base-pair indels in mononucleotide repeats. Using KMC3, a k-mer processing tool, Selene finds a small set of k-mers that intersect the target k-mers. These k-mers are then mapped to the reference genome to find indels, which are then used by a logistic regression model to predict MSI-H. Selene uses an evenly split training and testing dataset of 69 paired tumor/normal transcriptomes with a median of 50 million reads, which were filtered down to a median of 1,486 k-mers per case. Selene has an accuracy of 91%, specificity of 86%, sensitivity of 93%, and an AUC score of 89% on the testing cohort. Selene compares favorably against the state-of-the-art MSIsensor-pro, which has an accuracy of 87%, sensitivity of 87%, and specificity of 86%. Selene's median processing time is less than 2 minutes; approaches like MSIsensor-pro, which require mapping, took a median of 25 minutes using the RNA-sequencing aligner STAR.
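The core filtering idea can be sketched with a toy example (illustrative only, not Selene's implementation; the value of k and the run-length cutoff here are arbitrary): locate mononucleotide runs in the reference, then enumerate the k-mers that a one-base deletion in such a run would create but that the reference itself lacks. Reads containing those k-mers are candidate indel evidence.

```python
def mono_runs(seq, min_len=8):
    """Mononucleotide repeats in seq, as (start, length, base) tuples."""
    runs, i = [], 0
    while i < len(seq):
        j = i
        while j < len(seq) and seq[j] == seq[i]:
            j += 1
        if j - i >= min_len:
            runs.append((i, j - i, seq[i]))
        i = j
    return runs

def deletion_kmers(seq, run_start, k):
    """k-mers created by deleting one base at run_start (inside a
    mononucleotide run) that are absent from the reference itself."""
    mutant = seq[:run_start] + seq[run_start + 1:]  # 1-bp deletion
    ref = {seq[i:i + k] for i in range(len(seq) - k + 1)}
    mut = {mutant[i:i + k] for i in range(len(mutant) - k + 1)}
    return mut - ref
```

Because a homopolymer shortened by one base changes every k-mer spanning the run's flanks, the resulting target set is small but diagnostic, which is what makes the k-mer filter fast compared to full read mapping.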

12:20
ImmuneBoost: A Gradient Boosting Machine Learning Model for Neoepitopes Classification

ABSTRACT. Neoepitopes resulting from cancer-specific variants have shown considerable promise for personalized cancer immunotherapy. The main challenge in using neoepitopes for cancer immunotherapy is determining whether a neoepitope will elicit a tumor rejection response. The binding affinity of the neoepitope to the MHC molecule, or the difference between its binding affinity and that of its wild-type counterpart, used to be the most common criterion for selecting neoepitopes used in therapy. However, more recent research efforts use multiple features of the neoepitope to classify it as immunogenic or not. In this work we present and test two machine learning models for neoepitope classification: a gradient boosting model and an artificial neural network model. We compare the two models and conclude that the gradient boosting model, ImmuneBoost, performs better than the artificial neural network model. We then compare ImmuneBoost with three existing classification methods, two of which use numeric features while the third uses sequence features. We show that ImmuneBoost has higher precision and specificity but lower recall compared to these methods. We also analyze feature importance for the two models we built.

12:40-14:00 Lunch Break (lunch provided)
15:10-15:30 Coffee Break
15:30-17:10 Session 10A: CAMeRA-II
Location: Room C/D
15:30
Critical Evaluation of Long Read Taxonomic Profiling of the Gut Microbiome

ABSTRACT. The gut microbiome has been implicated in a wide range of human health conditions, from inflammatory bowel disease to neuropsychiatric disorders. With the arrival of long-read sequencing technologies for microbiome characterization, we now have access to high-fidelity assessments of microbial compositions. However, the relative impacts of reference database composition and taxonomic profiling accuracy require further elucidation. To address this need, we performed an exhaustive evaluation of six taxonomic classification and profiling methods on PacBio HiFi and Oxford Nanopore long-read sequencing datasets spanning simulated, mock, and clinical samples. To disentangle algorithmic effects from reference-driven bias, each method was evaluated using both its default database and a unified RefSeq-based reference (RefSeq v228).

We report three key findings: (1) profiling bias arises from the interaction between database composition and algorithmic design, with different algorithmic approaches exhibiting distinct susceptibilities to reference database composition; (2) long-read-optimized k-mer and sketch-based methodologies provide high accuracy and computational efficiency across many benchmarking scenarios; (3) clinical gut microbiome samples exhibit reduced taxonomic profiling concordance compared to simulated or mock datasets, underscoring the limitations of simulated and mock benchmarking datasets and the necessity for standardized clinical microbiome analysis.
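The abstract does not name the concordance metric used; one common way to compare a predicted taxonomic profile against a ground truth is the L1 distance over relative abundances, sketched below with hypothetical example data.

```python
# Illustrative profile-comparison metric (hypothetical abundances;
# the study's actual evaluation metrics are not specified here).
def l1_distance(truth, pred):
    """Sum of absolute abundance differences over the union of taxa."""
    taxa = set(truth) | set(pred)
    return sum(abs(truth.get(t, 0.0) - pred.get(t, 0.0)) for t in taxa)

truth = {"E. coli": 0.6, "B. fragilis": 0.4}
pred = {"E. coli": 0.5, "B. fragilis": 0.3, "S. aureus": 0.2}
dist = l1_distance(truth, pred)   # 0.1 + 0.1 + 0.2 = 0.4
```

A metric over the union of taxa penalizes both missed taxa and false positives such as the spurious "S. aureus" call above, which is why clinical samples with many rare taxa tend to score worse than mock communities.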

15:50
Whole Metagenome Sequencing: Not Deep Enough for Complete Microbial Function Recovery

ABSTRACT. Background: Whole metagenome shotgun sequencing (WMS) is widely used to profile microbial function. However, technical variability in sequencing and analysis often obscures true biological patterns. Large-scale studies are particularly susceptible to batch effects, such as differences in sequencing depth, platform, and annotation strategy. However, the relative effects of these factors on functional inference in such studies have yet to be systematically evaluated. We analyzed oral-rinse WMS data from a study cohort of 671 Nigerian youths aged 9-18, sequenced on two Illumina platforms. Microbial molecular functionality encoded in these data was annotated using the mi-faser/Fusion pipeline, to capture the broad functional repertoire, and the HUMAnN 3/EC-number pipeline, to characterize curated enzymatic activities. We then quantified how technical factors shaped the recovery of microbial functionality. Results: Three findings of our work were most salient. First, we observed that the choice of annotation strategy trades off breadth against specificity of functional coverage. Second, we found that low-prevalence functions were disproportionately lost at shallow sequencing depths, indicating that, e.g., in case-control studies with few representatives of the minor class, sequencing depth could critically impact study resolution. Finally, using our newly developed model relating sequencing depth to functional recovery, we demonstrated that increasing sequencing depth does not directly or proportionally improve functional recall. That is, at as little as 10% of this study’s sequencing depth, 30% of the estimated complete microbiome functional repertoire was detectable. However, even at the full depth used in this study, we were only able to recover an estimated 60% of that complete functional repertoire.
Conclusions: Together, these findings and our depth-to-function mapping framework provide practical guidelines for the design and interpretation of WMS studies. Coordinating sequencing depth planning with annotation strategy, experimental design, and rigorous batch control is thus essential for robust detection of microbial functions and for ensuring reproducible microbiome insights.
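The abstract does not give the form of the authors' depth-to-function model, so the sketch below is only illustrative: a saturating exponential with made-up parameters, showing the diminishing-returns behavior the abstract describes (doubling depth does not double recovery).

```python
# Illustrative saturating recovery curve (hypothetical functional form
# and parameters; not the paper's fitted model).
import math

def recovered_fraction(depth, f_max=1.0, d0=1.0):
    """Fraction of the functional repertoire detected at a given depth,
    under an assumed f(d) = f_max * (1 - exp(-d / d0)) form."""
    return f_max * (1.0 - math.exp(-depth / d0))

low = recovered_fraction(0.5)    # half the reference depth
high = recovered_fraction(1.0)   # full reference depth
# Diminishing returns: high < 2 * low for any saturating curve like this.
```

Any model of this shape makes the abstract's practical point concrete: past the knee of the curve, extra sequencing buys progressively less functional recall, so depth planning should be coordinated with the study's sensitivity requirements.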

16:10
Efficient Metagenomic Analysis via Quantization

ABSTRACT. TBA

16:30
Reconstruction of Cancer Evolutionary History From Single Cell Data

ABSTRACT. There are duplication events, and we are trying to reconstruct the evolutionary history from these events.

16:50
Pushing the Limits of Long-Read, Low-Biomass Metagenomics

ABSTRACT. There are many reasons to choose long-read shotgun sequencing for microbiome samples: improved taxonomic assignment, high-quality MAGs, direct quantification of pathways, and detection of antimicrobial resistance genes. However, low-biomass microbiome samples are predominantly sequenced as short-read amplicons. While previous work has explored low-biomass metagenomic sequencing, the samples used were mock communities with minimal dynamic range. Here we build on this work by testing actual microbial communities with high dynamic range. Our results are of particular interest for studying the microbiome of the built environment, where many samples of interest are low-biomass. This work includes a cautionary tale, which reminds us of the importance of negative controls and the pernicious hazards of barcode crosstalk.

15:30-17:10 Session 10B: CANGS-II + CAME-III + CASCODA-III
Location: Room A/B
15:30
HexSplit: An Improved Computational Approach for Detecting Partial Gene Transfer

ABSTRACT. Horizontal transfer of genetic material plays a critical role in microbial evolution. Such horizontal transfer events can result in the transfer of multiple complete genes, single genes, or partial gene fragments. However, current approaches to studying microbial evolution and horizontal transfer using phylogenetic methods often overlook partial gene transfer (PGT), leading to potential biases and errors in subsequent inferences regarding gene family evolution.

In this work, we address the problem of reliably identifying whether a given gene family has been affected by sufficient PGT to impact gene tree reconstruction. Our work builds upon an existing method, called trippd, which uses sequence alignment tripartitions to detect the presence of phylogenetically disruptive PGT within gene families. Our improved method, HexSplit, extends the trippd algorithm to identify additional PGT cases that may not be detected by trippd. Our analysis using simulated data, accounting for various evolutionary parameters, reveals that the proposed extension improves the sensitivity of the algorithm (seen through decreases in the false negative rate of PGT detection) without significantly increasing false positive detection. We also apply HexSplit to two real biological datasets and demonstrate how it can be used in biological analyses. HexSplit is open-source and freely available from https://github.com/shreyaaseshadri/hexsplit.

15:50
On-campus dormitories as viral transmission sinks: Phylodynamic insights into student housing networks during the COVID-19 pandemic

ABSTRACT. University student housing environments are often viewed as hotspots for infectious disease transmission due to their high-density living conditions and high frequency of interpersonal interactions. During the COVID-19 pandemic, concerns arose that on-campus dormitories could serve as amplifiers of viral spread, seeding outbreaks into surrounding off-campus student residences. However, whether on-campus housing acts as a primary driver of transmission or as a recipient of infections introduced from the broader off-campus community remains unresolved. Here, we analyzed 1,431 SARS-CoV-2 genomes collected from students residing on and off campus at the University of North Carolina at Charlotte (UNCC) between September 2020 and May 2022. Sequencing was conducted using an amplicon-based whole genome sequencing approach on the Oxford Nanopore PromethION platform. Using Bayesian phylodynamic and ancestral state reconstruction approaches, we traced viral transmission pathways to determine the directionality of spread between residential settings. Our results indicate that transmission from off-campus housing consistently seeded on-campus dormitory outbreaks. In contrast, viral movement from on-campus to off-campus housing was minimal. These patterns persisted across all major pandemic waves, regardless of shifting mitigation strategies, and suggest that on-campus residences acted as transmission sinks rather than sources of broader student outbreaks. These findings raise the possibility that on-campus residences may be more vulnerable than often considered, functioning as epidemiological ‘islands’ that primarily receive infections from off-campus sources.

16:10
DANCE: Deep Learning-Assisted Analysis of ProteiN Sequences Using Chaos Enhanced Kaleidoscopic Images

ABSTRACT. Cancer is a complex disease characterized by uncontrolled cell growth and requires an accurate classification for effective treatment. T cell receptors (TCRs), crucial proteins in the immune system, play a pivotal role in antigen recognition. Advancements in sequencing technologies have facilitated the comprehensive profiling of TCR repertoires, uncovering TCRs with potent anticancer activity and enabling TCR-based immunotherapies. Performing an effective analysis of these complex biomolecules requires representations that accurately capture both their structural and functional characteristics. T cell protein sequences pose unique challenges because of their relatively shorter lengths compared to other biomolecules. Traditional vector-based embedding methods may encounter issues such as information loss. Therefore, an image-based representation approach becomes a preferred choice for efficient embedding, allowing the preservation of essential details and enabling a comprehensive analysis of T cell protein sequences. We propose generating images from protein sequences using the concept of Chaos Game Representation (CGR). We design images using the Kaleidoscopic images approach. This Deep Learning-Assisted Analysis of ProteiN Sequences Using Chaos Enhanced Kaleidoscopic Images approach (called DANCE) provides a unique way to visualize protein sequences by recursively applying chaos game rules around a central seed point. The resulting kaleidoscopic images exhibit symmetrical patterns that offer a visual representation of the protein sequences. To investigate the effectiveness of this approach, we perform classification of the T cell receptor (TCR) protein sequences in terms of their respective target cancer cells, since TCRs are known for their immune response against cancer. The DANCE technique is used to turn the TCR sequences into images before classification.
We employ deep learning vision models to classify the generated images to obtain insight into the relationship between the visual patterns in the generated kaleidoscopic images and the underlying protein properties. By combining CGR-based image generation with deep learning classification, this study opens new possibilities in protein analysis.
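The chaos game step itself is simple to state: each residue moves the current point halfway toward a fixed vertex assigned to that amino acid. The sketch below shows a plain protein CGR trajectory (20 vertices on a unit circle); it is an illustration of the general technique, not the paper's kaleidoscopic variant, and the vertex layout is an assumption.

```python
# Plain Chaos Game Representation for a protein sequence (illustrative;
# the DANCE kaleidoscopic construction adds symmetry around a seed point).
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
VERTICES = {
    aa: (math.cos(2 * math.pi * i / 20), math.sin(2 * math.pi * i / 20))
    for i, aa in enumerate(AMINO_ACIDS)
}

def cgr_points(seq):
    """Return the CGR trajectory of a protein sequence, starting at the origin."""
    x, y = 0.0, 0.0
    points = []
    for aa in seq:
        vx, vy = VERTICES[aa]
        # Move halfway toward the vertex assigned to this residue.
        x, y = (x + vx) / 2.0, (y + vy) / 2.0
        points.append((x, y))
    return points

trajectory = cgr_points("MKTAY")   # one point per residue, all inside the unit disk
```

Rasterizing such a trajectory into a 2-D histogram yields the image that a vision model can then classify, which is the general pipeline the abstract describes.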

16:30
AtlasCollect: A Single Cell Data Atlas Tool and Unified Platform for Datasets Collection and Integration

ABSTRACT. Single cell sequencing techniques are rapidly developing into one of the most powerful tools for high-throughput next generation sequencing, facilitating the analysis of cell types, states and dynamics. Both the volume and quality of single cell datasets have dramatically increased over the past decade, and with them, the rise of so-called single cell atlases. Typically, a single cell atlas is a manually curated set of datasets released by a research group, presenting their results and findings with pre-calculated analyses as a black box. So far, there are no atlas development tools that allow for interactive and dynamic aggregation of single cell datasets, comprehensively collecting and curating single cell sequencing datasets while simultaneously allowing for interactive dataset integration. In this work we present a Single Cell Data Atlas Tool, AtlasCollect, an innovative web-based interactive platform designed to streamline the collection, mapping, and live integration of single-cell RNA sequencing (scRNA-seq) data. Developed using the Next.js framework as well as custom R code and packages, our tool provides a user-friendly interface that simplifies complex data management and analysis processes. It efficiently manages datasets using SQLite and file systems, automates exploratory analysis workflows for data validation and visualization, and offers a centralized hub for various integration algorithms. The proposed platform features a comprehensive homepage displaying existing datasets with their metadata, a robust upload system that initiates automated processing workflows, and detailed individual dataset view pages for in-depth exploration. The tool generates essential latent space projections such as PCA, UMAP, and t-SNE plots, and facilitates comparative gene expression views across datasets.
A key strength of our platform lies in its integration of multiple established dataset merging methods, including CCA [1], Harmony [2], Joint PCA [1], and RPCA [3]. By consolidating these popular integration algorithms in one accessible interface, we enable researchers to easily apply and compare different approaches for creating unified analytical spaces from multiple datasets. Furthermore, this platform facilitates seamless integration with downstream comprehensive analysis tools and frameworks such as Seurat [4] and SC1 [5]. To conclude, the AtlasCollect tool aims to empower researchers with varying levels of computational expertise to gain immediate insights into cellular heterogeneity and gene expression dynamics of dataset collectives. By providing a unified platform that leverages existing, cutting-edge integration methods, we strive to overcome challenges in heterogeneous data integration and hence support discoveries in fields such as immunology, cancer research, and neuroscience, where scRNA-Seq data is a staple research tool, making complex single-cell analysis more accessible and efficient for the scientific community.

16:50
Reconstructing intra-tumor fitness landscapes from scSeq CNA profiles via simulation-based Bayesian inference and Deep Learning

ABSTRACT. N/A