Program for Tuesday, December 12th

PROGRAM FOR TUESDAY, DECEMBER 12TH

Days:

previous day

next day

all days

View: session overview talk overview

08:00-09:00 Registrations - Coffee & Pastry

Registration

Location: OMU Molly Shi Boren Ballroom

09:00-10:00 Session 6: 3rd Keynote Talk

3^rdKeynote Talk

Chairs:

Marmar Moussa and Sanguthevar Rajasekaran

Location: OMU Molly Shi Boren Ballroom

09:00

Jie Chen

Computational and modeling aspects of analyzing high-throughput genomic data

ABSTRACT. The high-throughput sequencing technology has provided new opportunities in scientific discovery of genetic information. While the sequencing data pre-processing techniques have been very well developed and widely accepted by the research community, modeling and analyzing high-dimensional genomic data still post computational challenges. In this talk, I will present two of our recent projects. The first one is a framework on modelling sequencing data of multiple subjects for genomic feature discovery. Taking into consideration of the correlated structure of high throughput genomic data, we use the framework of a fused Lasso latent feature model to solve the problem, and further propose a modified information criterion for the tuning parameter selection when searching for common features shared by multiple samples. Simulation studies and application on DNA- sequencing data showed that the proposed approach can effectively identify individual genomic features of a single subject profile and common genomic features for multiple subjects. The second one is for jointly analyzing multiple types of genomics data, along with prognostic information, available within and across different studies. It has been a challenging and common task in modern statistical research to use all types of data to infer disease-prone genetic information and to link those features to cancer survival. We modelled the genomic, prognostic and survival datasets under a framework of an accelerated failure time with frailty (AFTF) to infer patients' survival time. Simulation results confirmed the good performance of the approach. The approach was applied to the analysis of the Cancer Genome Atlas (TCGA) multiple genomic datasets of Giloblastoma Multiforme (GBM), a lethal brain cancer, and interesting genomic features are identified, and biological interpretations are explored.

10:00-10:30 Coffee Break

Coffeee Break

Location: OMU Molly Shi Boren Ballroom

10:30-12:30 Session 7A: ICCABS - Sequence Analysis

ICCABS - Sequence Analysis

Chairs:

Pavel Skums and Sanguthevar Rajasekaran

Location: OMU Regents & Associates

10:30	Dottie Yu, Ram Ayyala, Sarah Sadek, Hafsa Farooq, Likhitha Chittampalli, Junghyun Jung, Jieting Hu, Tianze Tao, Ryan Alomair, Sungmin Park, Austin Nguyen, Nicholas Mancuso, Jong Wha Joo, Reid Thompson, Houda Alachkar and Serghei Mangul A rigorous benchmarking of alignment-based HLA typing algorithms for RNA-seq data
10:54	Paola Bonizzoni, Clelia De Felice, Yuri Pirola, Raffaella Rizzi, Rocco Zaccagnino and Rosalba Zizza Identification of Chimeric RNAs: a novel machine learning perspective ABSTRACT. Chimeric RNAs are transcripts generated by gene fusion and intergenic splicing events, thus comprising nucleotide sequences from different genes. Recent studies have shown that some chimeric RNAs can play a role in cancer development, and so can be used as diagnostics biomarkers when specifically expressed in cancerous cells and tissues. Most gene fusion prediction tools rely on an initial alignment step. However, alignments might be biased, especially for chimeric reads, creating many false positives. Therefore, developing alignment-free prediction methods of fusion genes would be helpful and may provide new insights into the genomic breakage phenomenon in the cell. In this direction, machine learning could pave the way for new solutions, due to their success in predicting genomic regulatory elements and alternative junction events from the genomic context. To date, however, these techniques have had a marginal supporting role, and, furthermore, manually-curated data sets, that are crucial for model training, are often expensive, unreliable or simply unavailable. Here we propose a novel ML-based method that learn to recognize the hidden patterns that allow to identify chimeric RNAs deriving from oncogenic gene fusions. Preliminary comparison with another state-ofthe- art method shows promising results.
11:18	Ahmed Soliman and Sanguthevar Rajasekaran A Novel String Map-Based Approach for Distance Calculations with Applications to Faster Record Linkage ABSTRACT. Numerous applications call for pairwise distance calculations among a set of input points. For example, given a set of n points in a Euclidean space, pairwise distance computations form the basis for clustering the n points. As another example, given records from multiple data sources, the record linkage problem is solved with the aid of pairwise distance computations of the records. Often, these distance computations are time intensive. For instance, in record linkage, each record can be thought of as a string of characters. If we use the edit distance to measure the distance between two records, the standard dynamic programming algorithm takes quadratic time. In this paper we present a technique based on mapping that can be used to compute an approximate distance between two points. Based on this computation, we can make a decision as to whether it is necessary to compute the actual distance between the two points or not. Since the approximate distance computation is very fast, the overall run time of the algorithm could reduce significantly. This technique is generic and can be used in solving a variety of problems. In this paper we demonstrate the applicability of this approach in the context of solving the record linkage problem.
11:42	Sarwan Ali, Haris Mansoor, Prakash Chourasia, Imdadullah Khan and Murray Patterson Preserving Hidden Hierarchical Structures: Poincaré Distance for Enhanced Genomic Sequence Analysis
12:06	Arnab Ganguly, Daniel Gibney and Sharma V. Thankachan On the Hardness of Wildcard Pattern Matching on de Bruijn Graphs ABSTRACT. In the pattern matching on labeled graphs problem, given an edge labeled graph $G = (V, E)$ and a string $P$, one seeks to identify if there exists a walk in the graph whose concatenation of edge labels (approximately) matches $P$. This is an elementary subproblem for utilizing genome graphs to represent collections of genetic sequences where patterns arise as reads in the sequencing data. Unfortunately, for general graphs, it is known that an algorithm running in $O(\|E\|\|P\|^{1-\varepsilon} + \|E\|^{1-\varepsilon}\|P\|)$ time for constant $\varepsilon > 0$ is not possible under the Strong Exponential Time Hypothesis (SETH). de Bruijn graphs provide a valuable exception, allowing for a path exactly matching a pattern to be found in $O(\|E\| + \|P\|)$ for constant-sized alphabets. This property has led de Bruijn graphs to be applied as indexes in the popular tool vg-toolkit. In this work, we consider the case where wildcards (that match with any edge label) are included in the pattern, and the graph is a de Bruijn graph. We demonstrate that adding these wildcards to the pattern is enough to again prove quadratic lower bounds conditioned on SETH for pattern matching on de Bruijn graphs, even when restricted to alphabets of size at most three and $k$-mer length $\Theta(\log \|V\|)$.

10:30-12:30 Session 7B: ASI Workshop - I

ASI Workshop - I

Chairs:

Marmar Moussa and Ion Mandoiu

Location: OMU Scholars

10:30	Wei Chen Single-cell RNA sequencing reveals the atlas of a breast tumor microenvironment and the remodeling by photo-immunotherapy
10:54	Alexandra "Sasha" Gavryushkina, Holly Pinkney, Sarah Diermeier and Alex Gavryushkin Single cell RNA phylogenetics of cancer
11:18	Serghei Mangul Adaptive receptor repertoire profiling methods for RNA sequencing
11:42	Aanuoluwa Adekoya and Carolyn Ibberson Investigating Microbial Community Functions in Chronic Bacterial Infections ABSTRACT. Chronic polymicrobial infections (cPMIs) affect over 30 million people in the US and place a financial burden on healthcare systems. These complex polymicrobial communities harbor multiple bacterial species with a wide range of metabolic capacities. Although we’ve known multiple bacteria comprise these cPMIs for over 150 years, previous studies have focused on describing the physiology of only a handful of pathogens, and knowledge about the microbial community function in human infection is lacking. The lack of representative polymicrobial infection models to investigate the molecular mechanisms that drive microbial interactions also limits the study of microbial community functions in these cPMIs. To address these knowledge gaps, our initial work focused on analyzing 102 previously published metatranscriptomes collected from cystic fibrosis (CF) sputum and chronic wounds (CW) to identify key active bacterial members and community functions in these chronic infections. Community composition analysis revealed the high prevalence of pathogens, particularly Staphylococcus and Pseudomonas, and anaerobic members of the microbiota, including Porphyromonas, Anaerococcus, and Prevotella. Functional profiling revealed that while bacterial competition, oxidative stress response, and virulence functions are conserved across chronic infection types, catabolic functions drive disease progression in the CW community, and the CF community is driven by biosynthetic processes. Taken together, we showed that the infection environment strongly influences bacterial physiology and that community structure influences function. Further investigations around understanding microbe-microbe interactions that may exist between key bacteria in chronic wound infection using a polymicrobial community model are ongoing. Understanding the bacterial community functions and the microbe-microbe interactions that drive the expression of these functions is a critical step for developing novel therapeutics for these complex infections.
12:06	Abdulrafeh Naqash and Hassan Abushukair Association of baseline tumor-specific neoantigens and CD8+ T infiltration with immune-related adverse events secondary to immune checkpoint inhibitors

12:30-13:30 Lunch

Boxed Lunch

Location: OMU Molly Shi Boren Ballroom

13:30-15:30 Session 8A: ICCABS - Machine Learning Models

ICCABS - Machine Learning Models

Chairs:

Pavel Skums and Murray Patterson

Location: OMU Regents & Associates

13:30	Namitha Pais, Nalini Ravishanker, Sanguthevar Rajasekaran and George Weinstock Repeated Measures Latent Dirichlet Allocation for Longitudinal Microbiome Analysis ABSTRACT. Topic modeling algorithms generally examine a set of documents, referred to as a corpus in Natural Language Processing (NLP), and analyze the words observed in a document to uncover themes that run through each document in a collection. In the microbiome framework, they are used to identify co-occurring microbial species and reveal hidden patterns or relationships within the microbial communities. Longitudinal microbiome data analysis provides a robust framework for studying microbiome compositions over time. By collecting multiple samples from the same individuals at different time points, researchers can capture the temporal variation within an individual's microbiome and evaluate its impact on the subjects' health status during each of their visits. This paper extends the Latent Dirichlet Allocation (LDA) modeling technique to a repeated measures framework. We propose Repeated Measures Latent Dirichlet Allocation (RM-LDA) where each document (subject) is assumed to be a collection of multiple sub-documents (visits associated with a given subject). In this study, we examine microbiome data on subjects making multiple visits to a medical facility to provide data on their microbiome counts. Our model allows us to analyze hidden patterns in the microbiome data over multiple visits, estimate the latent topic correlation structure within each subject, and study their association with the individual's health status over each visit.
13:54	Chen Song, Yuzhou Chen, Huanmei Wu and Xinghua Shi GenoDiffusion: Conditional Denoising Diffusion Model for Genomic Data Augmentation ABSTRACT. With recent advancements in biotechnology, a substantial volume of genomic sequences has become readily accessible. However, the analysis and sharing of this data present noteworthy challenges. These challenges primarily stem from the inherent imbalance and biases within genomic data, attributed to factors such as the rarity of diseases and the affordability of testing. Moreover, sharing genomic data is hindered by apprehensions regarding privacy, security, and consent. To address these challenges, we introduce an innovative probabilistic approach, termed GenoDiffusion, to enhance genomic data by generating realistic synthetic data that is balanced and free to share. GenoDiffusion employs conditional denoising diffusion models, a type of generative model that learns the underlying distribution and captures complex dependencies among input data features. By leveraging the original genomic data as input, our GenoDiffusion method generates synthetic new data with similar population structures, variant frequency distributions, and linkage disequilibrium patterns. Our experimental results demonstrate that GenoDiffusion outperforms existing methods across various genomics datasets including genotypes related to the human leukocyte antigen region and prostate cancer.
14:18	Aysegul Bumin, Kejun Huang and Tamer Kahveci PartialFibers: An efficient method for predicting Drug-Drug Interactions ABSTRACT. Drug resistance is one of the fundamental challenges in modern medicine. Using combinations of drugs is an effective solution to counter drug resistance as is harder to develop resistance to multiple drugs simultaneously. Finding the correct dosage for each drug in the combination remains to be a challenging task. Testing all possible drug-drug combinations on various cell lines for different dosages in wet-lab experiments is infeasible since there are many combinations of drugs as well as their dosages yet the drugs and the cell lines are limited in availability and each wet-lab test is costly and time-consuming. Efficient and accurate in silico prediction methods are surely needed. Here we present a novel computational method, PartialFibers, to address this challenge. Unlike existing prediction methods PartialFibers takes advantage of the distribution of the missing drug-drug interactions and effectively predicts the dosage of a drug in the combination. Our results on real datasets demonstrate that PartialFibers is more flexible, scalable, and achieves higher accuracy in less time than the state of the art algorithms.
14:42	Hanxia Li and Sixia Chen Machine Learning Application for Predicting Smoking Cessation Among US Adults: An Analysis of Waves 1-5 of the PATH Study ABSTRACT. Objective: This study aims to harness advanced machine learning (ML) techniques to predict smoking cessation trajectories among US adults, thereby informing the development of tailored intervention strategies. Methods: Leveraging longitudinal data from the Population Assessment of Tobacco and Health (PATH) study, spanning Waves 1-5, we employed a suite of ML algorithms, including Logistic Regression, Decision Tree, Random Forest, Support Vector Machines (SVM), XGBoost, Neural Network, and Deep Learning. Model selection and validation were rigorously conducted to ensure optimal performance and generalizability. Results: Preliminary analyses identified key determinants of smoking cessation, including patterns of e-cigarette use, smoking initiation age, duration of tobacco use, poly-tobacco consumption, and BMI metrics. The models showcased robust predictive capabilities with significant accuracies, underscoring their potential in real-world cessation prediction scenarios. Conclusions: Our findings emphasize the transformative potential of ML in enhancing smoking cessation efforts. By identifying nuanced cessation determinants and leveraging cutting-edge predictive modeling, this research paves the way for the next generation of personalized and effective tobacco intervention strategies.
15:06	Arghya Kusum Das ParMEC: A Set of Algorithms for Parallel Metagenomic Error Correction ABSTRACT. Metagenomic error correction is challenging. Since low abundance microbiomes promulgate truly low coverage k-mers in the dataset, the low coverage k-mers cannot be treated as errors and rectified as it is done in whole genome sequence. Further, rapidly increasing metagenomic library sizes pose significant computational challenges in terms of their optimal error correction in a reasonable amount of time. Here we present ParMEC, a set of parallel, fully distributed algorithms for metagenomic error correction. In this presentation, we show our algorithms for substitution errors only which are common in short read sequences. We propose a Pearson skew coefficient-based novel methodology for detecting the errors and a majority vote-based method to rectify them.

13:30-15:30 Session 8B: ASI Workshop - II / CASCODA Workshop - I

ASI Workshop II / CASCODA Workshop I

Chairs:

Ion Mandoiu and Marmar Moussa

Location: OMU Scholars

13:30	Samson Weiner, Bingjun Li, Hari Patchigulla and Sheida Nabavi Improved inference of single-cell haplotype-aware copy numbers in cancer
13:54	Walker Hoolehan, Adeline Machalinski and Willard Freeman Epigenetic regulation of endogenous viral elements in the aging brain
14:18	Michael K.B Ford, Ananth Hari and Cenk Sahinalp Genotyping complex immune loci using integer linear programming
14:42	Russell Schwartz Incorporating structural variation into single-cell clonal lineage inference
15:06	Siyuan Luo, Pierre-Luc Germain, Mark Robinson and Ferdinand von Meyenn Benchmarking computational methods for single-cell chromatin data analysis

15:30-16:00 Coffee Break

Coffee Break

Location: OMU Molly Shi Boren Ballroom

16:00-17:36 Session 9A: ICCABS - Deep Learning and Biomedical Imaging

ICCABS - Deep Learning and Biomedical Imaging

Chairs:

Sanguthevar Rajasekaran and Pavel Skums

Location: OMU Regents & Associates

16:00	Qinggong Tang Enhancing Biomedical Research Using Novel Optical Imaging Technologies and Deep Learning ABSTRACT. Enhancing Cancer Research Using Novel Optical Imaging Technologies and Deep Learning
16:24	Xihan Qin and Li Liao Improving Disease Comorbidity Prediction with Biologically Supervised Graph Embedding ABSTRACT. Comorbidity is vital for disease understanding and management. In graph machine learning, it is seen as a result of mutations in disease-associated genes linked through the protein-protein interactions of the human interactome. Yet, the incomplete human interactome presents challenges in extracting useful features for comorbidity prediction. In this study, we introduce a new method called Biologically Supervised Graph Embedding (BSE) to select the most relevant features, thereby improving the accuracy of predicting comorbid disease pairs. Our investigation into BSE's impact on both centered and uncentered embedding methods showcases its consistent superiority over the state-of-the-art techniques and its adeptness in selecting dimensions enriched with vital biological insights, thereby improving prediction performance significantly, up to 50% when measured by ROC. Further analysis indicates that BSE consistently and substantially improves the ratio of disease associations to gene connectivity, affirming its potential in uncovering latent biological factors affecting comorbidity. The study also reveals additional statistically significant enhancements across various metrics, further highlighting BSE's potential to introduce novel avenues for precise disease comorbidity predictions and other potential applications. The GitHub repository containing the source code can be accessed at the following link: https://github.com/xihan-qin/Biologically-Supervised-Graph-Embedding.
16:48	Armen Kasparian, Guohua Cao and Wu-Chun Feng A 3D Deep Learning Architecture for Denoising Low-Dose CT Scans ABSTRACT. Low-dose CT (LDCT) scans reduce the radiation dose of CT scans but come at the expense of image quality. Deep-learning (DL) image denoising techniques can enhance these LDCT images to match the quality of their regular-dose CT counterparts. To achieve better denoising performance than the current state of the art, we present a novel 3D DL architecture for LDCT image denoising called 3D-DDnet. The architecture leverages the inter-slice correlation in volumetric CT scans to obtain better denoising performance and employs distributed data parallel (DDP) strategies along with transfer learning to achieve faster training. The DDP training strategy enables a scalable multi-GPU approach on Nvidia A100 GPUs, which allows the training of previously prohibitively large volumetric samples. Our results show that 3D-DDnet achieves 10% better mean square error (MSE) on LDCT scans than its 2D predecessor (i.e., DDnet). In addition, the transfer learning in 3D-DDnet leverages existing trained 2D models to “jump start” the weights and biases of our DL model and reduces training time by 50% while maintaining accuracy.
17:12	Suba S and Nita Parekh Lightweight and Generalizable Model for COVID-19 Detection Using Chest Xray Images ABSTRACT. Deep learning (DL) has revolutionized the field of medical imaging, including chest radiology, by offering advanced tools for accurate and efficient detection of diseases for over a decade now. Analysis of Chest radiology images (chest X-ray - CXR and Computed tomography - CT) using DL models has widened its scope as a triaging tool since COVID-19 pandemic due to its speed, accuracy, and objectivity of disease detection, leading to better patient outcomes and more efficient healthcare delivery. CNNs are particularly well suited for image analysis tasks due to their ability to capture hierarchical features. Earlier work on Severe Acute Respiratory Syndrome (SARS) and Middle East Respiratory Syndrome (MERS) have also shown their applicability in the diagnosis of pulmonary diseases. This has led to much recent attention on the analysis of chest radiographs (CXR) using deep learning architectures for the detection of COVID-19 in a clinical setting. Applications developed for medical image analysis require high sensitivity, precision and generalizability along with reliability and that can provide radiologists and clinicians with an additional layer of information to aid in their diagnoses. To address this, here we propose a pixel-based attention mechanism into a lightweight CNN model (Attn-CNN) trained on one of the largest publicly available COVIDx CXR-3 dataset. The performance of the model is compared with four state-of-the-art (SOTA) DL models and is shown to be better. We also discuss the generalizability of our model by performing analysis on external dataset. With portable chest radiography (CXR) now being commonly used for early disease detection and follow up of lung abnormalities, there is a clear scope of the proposed model for assisting health experts in triaging the patients in the pandemic-like situations. The data and code used are available at https://github.com/aleesuss/c19.

16:00-17:36 Session 9B: CASCODA Workshop - II

CASCODA Workshop - II

Chairs:

Ion Mandoiu and Murray Patterson

Location: OMU Scholars

16:00	Mohsen Sharifitabar, Shiva Kazempour Dehkordi, Javad Razavian, Sogand Sajedi, Soroosh Solhjoo and Habil Zare A deep neural network to de-noise single-cell RNA sequencing data ABSTRACT. Single--cell RNA sequencing (scRNA-seq) is a powerful technique that allows researchers to investigate the transcriptome at the single-cell level, enabling the discovery of cell heterogeneity, rare cell populations, and transcriptional dynamics in individual cells. However, scRNA-seq has some limitations, one of which is the issue of dropouts. Addressing this limitation, our paper introduces ZiPo, a deep neural network model to denoise scRNA-seq data. ZiPo is specifically crafted to denoise scRNA-seq data, an essential endeavor given the prevalent amplification and measurement dropouts. Importantly, ZiPo can capture the dropouts by incorporating zero–inflation in the distribution. While ZiPo builds upon established concepts in the field—such as utilizing deep autoencoders and adopting the Poisson and negative binomial distributions-it incorporates novel strategies to improve the overall performance (e.g., library size prediction and residual connections). One novelty is introducing a scale-invariant loss term, making the weights sparse and, hence, the model biologically more interpretable. ZiPo can handle very large datasets (or even a mixture of several datasets) very quickly, with the processing time directly proportional to the number of cells. We train ZiPo on 3 different datasets and show its advantages over the existing ones. The codes for ZiPo which has been used to produce the results are available at: \url{https://bitbucket.org/habilzare/alzheimer/src/master/code/deep/ZiPo/}.
16:24	Akshay Juyal, Zahra Tayebi, Alex Zelikovsky, Simone Ciccolella, Gianluca Della Vedova and Murray Patterson Plastic: An easy to use and modular tool for designing tumor phylogeny reconstruction pipelines
16:48	Marmar Moussa Computational framework for spatial omics analysis of colon polyp tissue
17:12	Vrishabhadev Sathish Kumar A Computational Atlas for Transcriptomic Exploration of M. buryatense Using Unsupervised Machine Learning and Interactive Data Visualization ABSTRACT. Methanotrophs are organisms that naturally consume methane for energy, and through metabolic engineering hold potential to mitigate contributions of atmospheric methane to global warming. Characterized by its robust growth both in nature and experimental settings, Methylotuvimicrobium buryatense is a promising candidate. To develop tangible solutions at scale, biologists first require a deeper understanding of its genome. Here, I present an open-source software tool designed for biologists to interactively probe the M. buryatense transcriptome for exploratory analysis. By aggregating bulk RNA-seq datasets from the past decade of experimentation and applying unsupervised machine learning clustering algorithms, we cluster genes by their expression profiles in differing growth conditions. These gene clusters are annotated with statistically significant gene ontology (GO) terms using gene-set enrichment analysis for functional interpretation. To enhance domain-expert researchers’ ability to explore and drill-down into specific queries, I unify these cluster-specific analyses in a web application using interactive data visualization techniques built on a ReactJS frontend and Azure Cloud backend. With both exploratory and query-focused use cases, this software tool can support M. buryatense biologist workflows for confirming putative molecular biology, investigate existing biological questions, and possibly generate new experimental hypotheses for further testing in the wetlab.

18:30-21:30 Banquet

Banquet

Location: OMU Molly Shi Boren Ballroom