Computational and modeling aspects of analyzing high-throughput genomic data
ABSTRACT. High-throughput sequencing technology has provided new opportunities for the scientific discovery of genetic information. While sequencing data pre-processing techniques are well developed and widely accepted by the research community, modeling and analyzing high-dimensional genomic data still pose computational challenges. In this talk, I will present two of our recent projects. The first is a framework for modeling sequencing data from multiple subjects for genomic feature discovery. Taking into consideration the correlated structure of high-throughput genomic data, we use a fused lasso latent feature model to solve the problem, and further propose a modified information criterion for tuning parameter selection when searching for common features shared by multiple samples. Simulation studies and an application to DNA-sequencing data showed that the proposed approach can effectively identify individual genomic features of a single subject profile as well as common genomic features across multiple subjects. The second project jointly analyzes multiple types of genomic data, along with prognostic information, available within and across different studies. Using all of these data types to infer disease-prone genetic information and to link those features to cancer survival is a common yet challenging task in modern statistical research. We modeled the genomic, prognostic, and survival datasets under an accelerated failure time with frailty (AFTF) framework to infer patients' survival time. Simulation results confirmed the good performance of the approach. The approach was applied to The Cancer Genome Atlas (TCGA) multiple genomic datasets for Glioblastoma Multiforme (GBM), a lethal brain cancer; interesting genomic features were identified and their biological interpretations explored.
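For orientation, the fused lasso signal approximator underlying such a latent feature model balances a sparsity penalty with a smoothness penalty over adjacent genomic positions. The sketch below, using cvxpy, is a minimal illustration under assumed notation; the BIC-style criterion in `select_by_ic` is a placeholder standing in for the paper's modified information criterion, not its actual formula:

```python
# Minimal fused-lasso signal approximator sketch (illustrative only;
# not the authors' implementation). Requires: pip install cvxpy numpy
import cvxpy as cp
import numpy as np

def fused_lasso(y, lam1, lam2):
    """Fit beta minimizing ||y - beta||^2 + lam1*||beta||_1
    + lam2*sum_i |beta_i - beta_{i-1}| (sparsity + local smoothness)."""
    beta = cp.Variable(len(y))
    obj = (cp.sum_squares(y - beta)
           + lam1 * cp.norm1(beta)
           + lam2 * cp.norm1(cp.diff(beta)))
    cp.Problem(cp.Minimize(obj)).solve()
    return beta.value

def select_by_ic(y, grid):
    """Pick (lam1, lam2) minimizing a BIC-style criterion; the paper's
    modified information criterion would replace this placeholder."""
    n = len(y)
    best, best_ic = None, np.inf
    for lam1, lam2 in grid:
        b = fused_lasso(y, lam1, lam2)
        rss = np.sum((y - b) ** 2)
        df = np.count_nonzero(np.abs(np.diff(b)) > 1e-6) + 1  # segment count
        ic = n * np.log(rss / n) + np.log(n) * df
        if ic < best_ic:
            best, best_ic = (lam1, lam2), ic
    return best
```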
Identification of Chimeric RNAs: a novel machine learning perspective
ABSTRACT. Chimeric RNAs are transcripts generated by gene fusion and intergenic splicing events, and thus comprise nucleotide sequences from different genes. Recent studies have shown that some chimeric RNAs can play a role in cancer development and can therefore serve as diagnostic biomarkers when specifically expressed in cancerous cells and tissues. Most gene fusion prediction tools rely on an initial alignment step. However, alignments might be biased, especially for chimeric reads, creating many false positives. Therefore, developing alignment-free methods for predicting fusion genes would be helpful and may provide new insights into the genomic breakage phenomenon in the cell.
In this direction, machine learning could pave the way for new solutions, given its success in predicting genomic regulatory elements and alternative junction events from the genomic context. To date, however, these techniques have had only a marginal supporting role; furthermore, the manually curated data sets that are crucial for model training are often expensive, unreliable, or simply unavailable.
Here we propose a novel ML-based method that learns to recognize the hidden patterns that identify chimeric RNAs deriving from oncogenic gene fusions. A preliminary comparison with another state-of-the-art method shows promising results.
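As a sketch of the alignment-free idea, reads can be featurized by their k-mer content and fed to a standard classifier. This is a generic baseline rather than the proposed method; the reads, labels, and choice of k = 5 below are placeholders:

```python
# Generic alignment-free baseline: k-mer counts + logistic regression
# (illustrative; not the proposed method).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

reads = ["ACGTACGTGA", "TTGCACGTAC"]   # placeholder read sequences
labels = [1, 0]                        # 1 = chimeric, 0 = non-chimeric

# Treat each read as a "document" of overlapping 5-mers.
model = make_pipeline(
    CountVectorizer(analyzer="char", ngram_range=(5, 5), lowercase=False),
    LogisticRegression(max_iter=1000),
)
model.fit(reads, labels)
print(model.predict(["ACGTATTGCA"]))
```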
A Novel String Map-Based Approach for Distance Calculations with Applications to Faster Record Linkage
ABSTRACT. Numerous applications call for pairwise distance calculations among a set of input points. For example, given a set of n points in a Euclidean space, pairwise distance computations form the basis for clustering the n points. As another example, given records from multiple data sources, the record linkage problem is solved with the aid of pairwise distance computations on the records. Often, these distance computations are time-intensive. For instance, in record linkage, each record can be thought of as a string of characters. If we use the edit distance to measure the distance between two records, the standard dynamic programming algorithm takes quadratic time. In this paper we present a mapping-based technique for computing an approximate distance between two points. Based on this computation, we can decide whether it is necessary to compute the actual distance between the two points. Since the approximate distance computation is very fast, the overall run time of the algorithm can be reduced significantly. This technique is generic and can be used to solve a variety of problems. In this paper we demonstrate the applicability of this approach in the context of the record linkage problem.
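The filtering idea can be sketched with a simple pivot-based (Lipschitz-style) embedding: map each string to its vector of edit distances from a few reference strings, use the cheap Euclidean distance in that space as a screen, and run the quadratic DP only when the screen passes. The mapping below is a generic illustration, not the paper's specific scheme:

```python
# Pivot-embedding screen for record linkage (generic illustration,
# not the paper's specific string map).
import numpy as np

def edit_distance(a, b):
    """Standard O(|a||b|) dynamic-programming edit distance."""
    m, n = len(a), len(b)
    d = np.zeros((m + 1, n + 1), dtype=int)
    d[:, 0] = np.arange(m + 1)
    d[0, :] = np.arange(n + 1)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i, j] = min(d[i - 1, j] + 1, d[i, j - 1] + 1,
                          d[i - 1, j - 1] + (a[i - 1] != b[j - 1]))
    return d[m, n]

def embed(s, pivots):
    """Map a string to its edit distances from a few pivot strings."""
    return np.array([edit_distance(s, p) for p in pivots])

def linked(a, b, pivots, threshold, slack=1.0):
    """Cheap screen first; exact DP only when the screen passes."""
    approx = np.linalg.norm(embed(a, pivots) - embed(b, pivots))
    if approx > slack * threshold:   # clearly far apart: skip exact DP
        return False
    return edit_distance(a, b) <= threshold
```

In practice the embeddings would be computed once per record and cached, so each pairwise screen costs only a few arithmetic operations.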
On the Hardness of Wildcard Pattern Matching on de Bruijn Graphs
ABSTRACT. In the pattern matching on labeled graphs problem, given an edge-labeled graph $G = (V, E)$ and a string $P$, one seeks to identify whether there exists a walk in the graph whose concatenation of edge labels (approximately) matches $P$. This is an elementary subproblem for utilizing genome graphs to represent collections of genetic sequences, where patterns arise as reads in the sequencing data. Unfortunately, for general graphs, it is known that an algorithm running in $O(|E||P|^{1-\varepsilon} + |E|^{1-\varepsilon}|P|)$ time for constant $\varepsilon > 0$ is not possible under the Strong Exponential Time Hypothesis (SETH). De Bruijn graphs provide a valuable exception, allowing a path exactly matching a pattern to be found in $O(|E| + |P|)$ time for constant-sized alphabets. This property has led de Bruijn graphs to be applied as indexes in the popular tool vg-toolkit. In this work, we consider the case where wildcards (which match any edge label) are included in the pattern and the graph is a de Bruijn graph. We demonstrate that adding these wildcards to the pattern is enough to again prove quadratic lower bounds conditioned on SETH for pattern matching on de Bruijn graphs, even when restricted to alphabets of size at most three and $k$-mer length $\Theta(\log |V|)$.
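The contrast with general graphs comes from the structure of de Bruijn graphs: in the node-centric convention (a simplification of the edge-labeled formulation above), a pattern of length at least k spells a walk exactly when every one of its k-mers is a vertex, which hash-set lookups answer in O(|E| + |P|) expected time. A minimal sketch of that reduction, which wildcards in the pattern break, assuming toy input sequences:

```python
# Node-centric de Bruijn matching sketch: a pattern of length >= k
# matches iff every k-mer of the pattern is a vertex of the graph.
def build_kmer_set(sequences, k):
    """Collect all k-mers of the input sequences (the graph's vertices)."""
    return {s[i:i + k] for s in sequences for i in range(len(s) - k + 1)}

def matches(pattern, kmers, k):
    """Exact match: every k-mer of the pattern must be a vertex."""
    if len(pattern) < k:
        raise ValueError("pattern shorter than k")
    return all(pattern[i:i + k] in kmers
               for i in range(len(pattern) - k + 1))

kmers = build_kmer_set(["ACGTACGT"], 4)
print(matches("CGTACG", kmers, 4))   # True: CGTA, GTAC, TACG all present
```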
Investigating Microbial Community Functions in Chronic Bacterial Infections
ABSTRACT. Chronic polymicrobial infections (cPMIs) affect over 30 million people in the US and place a financial burden on healthcare systems. These complex polymicrobial communities harbor multiple bacterial species with a wide range of metabolic capacities. Although it has been known for over 150 years that multiple bacteria comprise these cPMIs, previous studies have focused on describing the physiology of only a handful of pathogens, and knowledge about microbial community function in human infection is lacking. The lack of representative polymicrobial infection models for investigating the molecular mechanisms that drive microbial interactions further limits the study of microbial community functions in these cPMIs. To address these knowledge gaps, our initial work focused on analyzing 102 previously published metatranscriptomes collected from cystic fibrosis (CF) sputum and chronic wounds (CW) to identify key active bacterial members and community functions in these chronic infections. Community composition analysis revealed a high prevalence of pathogens, particularly Staphylococcus and Pseudomonas, and of anaerobic members of the microbiota, including Porphyromonas, Anaerococcus, and Prevotella. Functional profiling revealed that while bacterial competition, oxidative stress response, and virulence functions are conserved across chronic infection types, catabolic functions drive disease progression in the CW community, whereas the CF community is driven by biosynthetic processes. Taken together, we showed that the infection environment strongly influences bacterial physiology and that community structure influences function. Further investigations into the microbe-microbe interactions between key bacteria in chronic wound infection, using a polymicrobial community model, are ongoing. Understanding bacterial community functions and the microbe-microbe interactions that drive the expression of these functions is a critical step toward developing novel therapeutics for these complex infections.
Association of baseline tumor-specific neoantigens and CD8+ T-cell infiltration with immune-related adverse events secondary to immune checkpoint inhibitors
Repeated Measures Latent Dirichlet Allocation for Longitudinal Microbiome Analysis
ABSTRACT. Topic modeling algorithms generally examine a set of documents, referred to as a corpus in Natural Language Processing (NLP), and analyze the words observed in each document to uncover themes that run through the collection. In the microbiome setting, they are used to identify co-occurring microbial species and reveal hidden patterns and relationships within microbial communities. Longitudinal microbiome data analysis provides a robust framework for studying microbiome compositions over time. By collecting multiple samples from the same individuals at different time points, researchers can capture the temporal variation within an individual's microbiome and evaluate its impact on the subject's health status at each visit. This paper extends the Latent Dirichlet Allocation (LDA) modeling technique to a repeated measures framework. We propose Repeated Measures Latent Dirichlet Allocation (RM-LDA), where each document (subject) is assumed to be a collection of multiple sub-documents (the visits associated with that subject). In this study, we examine microbiome count data from subjects making multiple visits to a medical facility. Our model allows us to analyze hidden patterns in the microbiome data over multiple visits, estimate the latent topic correlation structure within each subject, and study their association with the individual's health status at each visit.
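For orientation, a standard (non-repeated-measures) LDA fit on a microbiome count table looks like the sketch below; RM-LDA extends this by tying together the visit-level sub-documents of each subject, which this baseline does not do. The count matrix here is synthetic:

```python
# Standard LDA baseline on a taxa-count table (rows = samples/visits);
# illustrative only, with synthetic counts. RM-LDA's repeated-measures
# structure is not shown.
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.default_rng(0)
counts = rng.poisson(3, size=(40, 200))   # placeholder: 40 samples x 200 taxa

lda = LatentDirichletAllocation(n_components=5, random_state=0)
theta = lda.fit_transform(counts)         # per-sample topic proportions
phi = lda.components_                     # per-topic taxa loadings
print(theta.shape, phi.shape)             # (40, 5) (5, 200)
```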
GenoDiffusion: Conditional Denoising Diffusion Model for Genomic Data Augmentation
ABSTRACT. With recent advancements in biotechnology, a substantial volume of genomic sequences has become readily accessible. However, analyzing and sharing these data presents notable challenges. These challenges primarily stem from the inherent imbalance and biases within genomic data, attributable to factors such as the rarity of diseases and the affordability of testing. Moreover, sharing genomic data is hindered by concerns regarding privacy, security, and consent. To address these challenges, we introduce an innovative probabilistic approach, termed GenoDiffusion, to augment genomic data by generating realistic synthetic data that is balanced and free to share. GenoDiffusion employs conditional denoising diffusion models, a type of generative model that learns the underlying distribution and captures complex dependencies among input data features. Taking the original genomic data as input, our GenoDiffusion method generates new synthetic data with similar population structures, variant frequency distributions, and linkage disequilibrium patterns. Our experimental results demonstrate that GenoDiffusion outperforms existing methods across various genomic datasets, including genotypes related to the human leukocyte antigen region and prostate cancer.
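The mechanics of a conditional denoising diffusion model can be conveyed by a single toy training step: noise the data forward according to a variance schedule, then train a network to predict the added noise given the noisy sample, the timestep, and a class label. Everything below (the MLP denoiser, dimensions, schedule) is a placeholder, not GenoDiffusion's architecture or hyperparameters:

```python
# Toy conditional DDPM training step (placeholders throughout; not
# GenoDiffusion's architecture or hyperparameters).
import torch
import torch.nn as nn

T, D, C = 1000, 64, 2                     # diffusion steps, feature dim, classes
betas = torch.linspace(1e-4, 0.02, T)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)

denoiser = nn.Sequential(nn.Linear(D + 1 + C, 256), nn.ReLU(),
                         nn.Linear(256, D))          # predicts the noise
opt = torch.optim.Adam(denoiser.parameters(), lr=1e-3)

x0 = torch.randn(32, D)                   # placeholder genotype features
y = nn.functional.one_hot(torch.randint(0, C, (32,)), C).float()

t = torch.randint(0, T, (32,))
noise = torch.randn_like(x0)
a = alpha_bar[t].unsqueeze(1)
xt = a.sqrt() * x0 + (1 - a).sqrt() * noise          # forward noising
inp = torch.cat([xt, t.unsqueeze(1) / T, y], dim=1)  # condition on t and label
loss = nn.functional.mse_loss(denoiser(inp), noise)  # predict the added noise
opt.zero_grad(); loss.backward(); opt.step()
```

Sampling would then run the learned denoiser backward from pure noise, with the class label steering generation toward under-represented groups.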
PartialFibers: An efficient method for predicting Drug-Drug Interactions
ABSTRACT. Drug resistance is one of the fundamental challenges in modern medicine. Using combinations of drugs is an effective way to counter drug resistance, as it is harder to develop resistance to multiple drugs simultaneously. Finding the correct dosage for each drug in the combination, however, remains a challenging task. Testing all possible drug-drug combinations on various cell lines at different dosages in wet-lab experiments is infeasible: the number of drug combinations and dosages is enormous, the drugs and cell lines are limited in availability, and each wet-lab test is costly and time-consuming. Efficient and accurate in silico prediction methods are therefore needed. Here we present a novel computational method, PartialFibers, to address this challenge. Unlike existing prediction methods, PartialFibers takes advantage of the distribution of the missing drug-drug interactions and effectively predicts the dosage of a drug in the combination. Our results on real datasets demonstrate that PartialFibers is more flexible and scalable, and achieves higher accuracy in less time than state-of-the-art algorithms.
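As a generic point of reference, predicting missing entries of a partially observed drug-drug interaction matrix can be framed as low-rank matrix completion. The factorization below is a standard baseline for that framing, not the PartialFibers algorithm:

```python
# Low-rank matrix-completion baseline for missing drug-drug interactions
# (standard technique shown for context; not the PartialFibers algorithm).
import numpy as np

def complete(M, mask, rank=5, lr=0.01, epochs=500, seed=0):
    """Factor M ~= U @ V.T using only observed entries (mask == 1)."""
    rng = np.random.default_rng(seed)
    n, m = M.shape
    U = 0.1 * rng.standard_normal((n, rank))
    V = 0.1 * rng.standard_normal((m, rank))
    for _ in range(epochs):
        R = mask * (U @ V.T - M)          # residual on observed entries only
        U -= lr * (R @ V)                 # gradient step on the factors
        V -= lr * (R.T @ U)
    return U @ V.T                        # predictions for all drug pairs
```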
Machine Learning Application for Predicting Smoking Cessation Among US Adults: An Analysis of Waves 1-5 of the PATH Study
ABSTRACT. Objective: This study aims to harness advanced machine learning (ML) techniques to predict smoking cessation trajectories among US adults, thereby informing the development of tailored intervention strategies.
Methods: Leveraging longitudinal data from the Population Assessment of Tobacco and Health (PATH) study, spanning Waves 1-5, we employed a suite of ML algorithms, including Logistic Regression, Decision Tree, Random Forest, Support Vector Machines (SVM), XGBoost, Neural Network, and Deep Learning. Model selection and validation were rigorously conducted to ensure optimal performance and generalizability.
Results: Preliminary analyses identified key determinants of smoking cessation, including patterns of e-cigarette use, smoking initiation age, duration of tobacco use, poly-tobacco consumption, and BMI. The models achieved robust predictive performance with high accuracy, underscoring their potential in real-world cessation prediction scenarios.
Conclusions: Our findings emphasize the transformative potential of ML in enhancing smoking cessation efforts. By identifying nuanced cessation determinants and leveraging cutting-edge predictive modeling, this research paves the way for the next generation of personalized and effective tobacco intervention strategies.
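A sketch of the cross-validated model-comparison loop described in the Methods, with synthetic features standing in for the PATH-derived predictors (the actual tuning grids, and the XGBoost and neural network models, are omitted):

```python
# Cross-validated comparison of candidate classifiers (synthetic data;
# the PATH-derived predictors and tuning grids are not reproduced here).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
models = {
    "logistic": LogisticRegression(max_iter=1000),
    "tree": DecisionTreeClassifier(random_state=0),
    "forest": RandomForestClassifier(random_state=0),
    "svm": SVC(),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    print(f"{name}: AUC = {scores.mean():.3f} +/- {scores.std():.3f}")
```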
ParMEC: A Set of Algorithms for Parallel Metagenomic Error Correction
ABSTRACT. Metagenomic error correction is challenging. Because low-abundance microbes legitimately contribute low-coverage k-mers to the dataset, low-coverage k-mers cannot simply be treated as errors and rectified as is done in whole-genome sequencing. Further, rapidly increasing metagenomic library sizes pose significant computational challenges for correcting errors optimally in a reasonable amount of time. Here we present ParMEC, a set of parallel, fully distributed algorithms for metagenomic error correction. In this presentation, we show our algorithms for substitution errors only, which are common in short-read sequences. We propose a novel Pearson-skew-coefficient-based methodology for detecting the errors and a majority-vote-based method to rectify them.
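For intuition, Pearson's second skewness coefficient, 3*(mean - median)/std, summarizes the asymmetry of a k-mer coverage distribution; error k-mers typically pile up in the low-coverage tail. The sketch below computes that statistic over a toy k-mer count table; the thresholding interpretation is illustrative, not ParMEC's actual decision rule or its distributed implementation:

```python
# Skewness of a k-mer coverage distribution (illustrative; not ParMEC's
# actual decision rule or its parallel, distributed implementation).
from collections import Counter
import numpy as np

def kmer_counts(reads, k):
    """Count every k-mer occurring in the reads."""
    c = Counter()
    for r in reads:
        for i in range(len(r) - k + 1):
            c[r[i:i + k]] += 1
    return c

def pearson_skew(values):
    """Pearson's second skewness coefficient: 3*(mean - median)/std."""
    v = np.asarray(values, dtype=float)
    return 3.0 * (v.mean() - np.median(v)) / v.std()

counts = kmer_counts(["ACGTACGTACGT", "ACGTACGAACGT"], k=5)
skew = pearson_skew(list(counts.values()))
print(f"coverage skew = {skew:.2f}")   # strong skew hints at error k-mers
```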
Improving Disease Comorbidity Prediction with Biologically Supervised Graph Embedding
ABSTRACT. Comorbidity is vital for disease understanding and management. In graph machine learning, it is seen as a result of mutations in disease-associated genes linked through the protein-protein interactions of the human interactome. Yet the incompleteness of the human interactome makes it challenging to extract useful features for comorbidity prediction. In this study, we introduce a new method called Biologically Supervised Graph Embedding (BSE) to select the most relevant features and thereby improve the accuracy of predicting comorbid disease pairs. Our investigation of BSE's impact on both centered and uncentered embedding methods shows that it consistently outperforms state-of-the-art techniques and selects dimensions enriched with vital biological signal, improving prediction performance by up to 50% as measured by ROC. Further analysis indicates that BSE consistently and substantially improves the ratio of disease associations to gene connectivity, affirming its potential for uncovering latent biological factors affecting comorbidity. The study also reveals statistically significant enhancements across various other metrics, further highlighting BSE's potential to open novel avenues for precise disease comorbidity prediction and other applications. The source code is available on GitHub: https://github.com/xihan-qin/Biologically-Supervised-Graph-Embedding.
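A generic sketch of supervising the choice of embedding dimensions with labels: embed the graph, score each dimension against the labels, and keep the most informative ones. The toy graph, spectral embedding, and mutual-information criterion below are placeholders; BSE's actual selection criterion and the interactome data are not reproduced here:

```python
# Label-guided selection of graph-embedding dimensions (generic sketch;
# BSE's criterion and the human interactome are not reproduced here).
import numpy as np
import networkx as nx
from sklearn.feature_selection import mutual_info_classif

G = nx.karate_club_graph()                # placeholder graph
emb = nx.spectral_layout(G, dim=8)        # 8-dim spectral embedding

X = np.array([emb[v] for v in G.nodes()])
y = np.array([G.nodes[v]["club"] == "Mr. Hi" for v in G.nodes()], dtype=int)

scores = mutual_info_classif(X, y, random_state=0)  # score each dimension
top = np.argsort(scores)[::-1][:3]                  # keep most informative
print("selected dimensions:", top)
```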
A 3D Deep Learning Architecture for Denoising Low-Dose CT Scans
ABSTRACT. Low-dose CT (LDCT) scans reduce the radiation dose of CT scans but come at the expense of image quality. Deep-learning (DL) image denoising techniques can enhance these LDCT images to match the quality of their regular-dose CT counterparts. To achieve better denoising performance than the current state of the art, we present a novel 3D DL architecture for LDCT image denoising called 3D-DDnet. The architecture leverages the inter-slice correlation in volumetric CT scans to obtain better denoising performance, and employs distributed data parallel (DDP) training along with transfer learning to achieve faster training. The DDP strategy enables a scalable multi-GPU approach on Nvidia A100 GPUs, making it possible to train on volumetric samples that were previously prohibitively large. Our results show that 3D-DDnet achieves 10% lower mean squared error (MSE) on LDCT scans than its 2D predecessor (i.e., DDnet). In addition, transfer learning in 3D-DDnet leverages existing trained 2D models to "jump start" the weights and biases of our DL model, reducing training time by 50% while maintaining accuracy.
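In outline, DDP training runs one process per GPU, each holding a model replica, with gradients all-reduced automatically during the backward pass. The skeleton below uses a stand-in Conv3d and random volumes in place of 3D-DDnet and LDCT data; the 2D-to-3D transfer-learning initialization is only indicated by a comment:

```python
# Minimal DistributedDataParallel skeleton (placeholder model and data;
# launch with: torchrun --nproc_per_node=<num_gpus> train.py).
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group("nccl")                 # one process per GPU
rank = dist.get_rank()
torch.cuda.set_device(rank)

model = torch.nn.Conv3d(1, 16, kernel_size=3).cuda(rank)  # stand-in for 3D-DDnet
# Transfer learning would initialize these 3D kernels from a trained 2D
# model here (e.g., by replicating 2D filters along the slice axis).
model = DDP(model, device_ids=[rank])           # gradients sync automatically

opt = torch.optim.Adam(model.parameters(), lr=1e-4)
x = torch.randn(2, 1, 16, 64, 64, device=rank)  # placeholder LDCT volume batch
loss = model(x).pow(2).mean()                   # placeholder loss
loss.backward()                                 # all-reduce happens here
opt.step()
dist.destroy_process_group()
```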
Lightweight and Generalizable Model for COVID-19 Detection Using Chest X-ray Images
ABSTRACT. Deep learning (DL) has revolutionized the field of medical imaging, including chest radiology, by offering advanced tools for accurate and efficient disease detection for over a decade now. DL-based analysis of chest radiology images (chest X-ray, CXR, and computed tomography, CT) has widened in scope as a triaging tool since the COVID-19 pandemic, owing to its speed, accuracy, and objectivity of disease detection, leading to better patient outcomes and more efficient healthcare delivery. CNNs are particularly well suited to image analysis tasks because of their ability to capture hierarchical features. Earlier work on Severe Acute Respiratory Syndrome (SARS) and Middle East Respiratory Syndrome (MERS) has also shown their applicability to the diagnosis of pulmonary diseases. This has led to much recent attention on the analysis of chest radiographs (CXR) using deep learning architectures for detecting COVID-19 in a clinical setting. Applications developed for medical image analysis require high sensitivity, precision, and generalizability, along with reliability, so that they can provide radiologists and clinicians with an additional layer of information to aid their diagnoses. To this end, we propose incorporating a pixel-based attention mechanism into a lightweight CNN model (Attn-CNN), trained on COVIDx CXR-3, one of the largest publicly available datasets. The model's performance is compared with four state-of-the-art (SOTA) DL models and shown to be better. We also assess the generalizability of our model on an external dataset. With portable chest radiography now commonly used for early disease detection and follow-up of lung abnormalities, the proposed model has clear scope for assisting health experts in triaging patients in pandemic-like situations. The data and code used are available at https://github.com/aleesuss/c19.
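A generic pixel-wise (spatial) attention block of the kind the abstract describes can be sketched as follows. This is a common construction shown for illustration, not necessarily the exact Attn-CNN module:

```python
# Generic pixel-wise spatial attention block (a common construction;
# not necessarily the exact Attn-CNN module).
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Produce a per-pixel weight in [0, 1] and rescale the features."""
    def __init__(self, channels):
        super().__init__()
        self.score = nn.Conv2d(channels, 1, kernel_size=1)  # 1 score per pixel

    def forward(self, x):                 # x: (batch, channels, H, W)
        attn = torch.sigmoid(self.score(x))
        return x * attn                   # emphasize informative pixels

x = torch.randn(4, 32, 224, 224)          # placeholder CXR feature maps
print(SpatialAttention(32)(x).shape)      # torch.Size([4, 32, 224, 224])
```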
A deep neural network to de-noise single-cell RNA sequencing data
ABSTRACT. Single-cell RNA sequencing (scRNA-seq) is a powerful technique that allows researchers to investigate the transcriptome at the single-cell level, enabling the discovery of cell heterogeneity, rare cell populations, and transcriptional dynamics in individual cells. However, scRNA-seq has some limitations, one of which is the issue of dropouts. To address this limitation, we introduce ZiPo, a deep neural network model for denoising scRNA-seq data, an essential task given the prevalent amplification and measurement dropouts. Importantly, ZiPo captures the dropouts by incorporating zero-inflation in the distribution. While ZiPo builds upon established concepts in the field, such as deep autoencoders and the Poisson and negative binomial distributions, it incorporates novel strategies to improve overall performance, including library size prediction and residual connections. One novelty is a scale-invariant loss term, which makes the weights sparse and hence the model more biologically interpretable. ZiPo can handle very large datasets (or even a mixture of several datasets) quickly, with processing time directly proportional to the number of cells. We train ZiPo on three different datasets and show its advantages over existing methods. The code used to produce these results is available at https://bitbucket.org/habilzare/alzheimer/src/master/code/deep/ZiPo/.
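The zero-inflated likelihood at the core of such a model can be illustrated with a zero-inflated Poisson negative log-likelihood, where a predicted dropout probability lets the model explain excess zeros separately from low expression. The surrounding autoencoder, library-size head, and scale-invariant loss term are omitted, and all tensors below are placeholders:

```python
# Zero-inflated Poisson negative log-likelihood sketch (the surrounding
# autoencoder, library-size head, and scale-invariant term are omitted).
import torch

def zip_nll(x, rate, pi, eps=1e-8):
    """x: observed counts; rate: Poisson rate; pi: dropout probability.
    P(x=0) = pi + (1-pi)*exp(-rate); P(x=k>0) = (1-pi)*Poisson(k|rate)."""
    log_pois = x * torch.log(rate + eps) - rate - torch.lgamma(x + 1)
    zero = torch.log(pi + (1 - pi) * torch.exp(-rate) + eps)
    nonzero = torch.log(1 - pi + eps) + log_pois
    return -torch.where(x == 0, zero, nonzero).mean()

x = torch.tensor([0., 0., 3., 1.])         # placeholder counts for one gene
rate = torch.tensor([2.0, 0.5, 2.5, 1.0])  # decoder-predicted rates
pi = torch.tensor([0.4, 0.4, 0.1, 0.2])    # predicted dropout probabilities
print(zip_nll(x, rate, pi))
```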
A Computational Atlas for Transcriptomic Exploration of M. buryatense Using Unsupervised Machine Learning and Interactive Data Visualization
ABSTRACT. Methanotrophs are organisms that naturally consume methane for energy and, through metabolic engineering, hold potential to mitigate the contribution of atmospheric methane to global warming. Characterized by robust growth both in nature and in experimental settings, Methylotuvimicrobium buryatense is a promising candidate. To develop tangible solutions at scale, biologists first require a deeper understanding of its genome. Here, I present an open-source software tool designed to let biologists interactively probe the M. buryatense transcriptome for exploratory analysis. By aggregating bulk RNA-seq datasets from the past decade of experimentation and applying unsupervised machine learning, we cluster genes by their expression profiles across differing growth conditions. These gene clusters are annotated with statistically significant gene ontology (GO) terms using gene-set enrichment analysis for functional interpretation. To enhance domain experts' ability to explore and drill down into specific queries, I unify these cluster-specific analyses in a web application using interactive data visualization techniques built on a ReactJS frontend and an Azure Cloud backend. With both exploratory and query-focused use cases, this software tool can support M. buryatense biologists' workflows by confirming putative molecular biology, investigating existing biological questions, and possibly generating new experimental hypotheses for further testing in the wet lab.
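A sketch of the clustering step on an expression matrix, with a synthetic matrix standing in for the aggregated RNA-seq data (the GO enrichment and the web front end are beyond this snippet):

```python
# Clustering genes by expression profile across conditions (illustrative;
# the RNA-seq aggregation, GO enrichment, and web UI are not shown).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
expr = rng.lognormal(size=(1000, 12))   # placeholder: genes x growth conditions

# Row-standardize log expression so clusters reflect profile shape,
# not absolute expression level.
logx = np.log1p(expr)
z = (logx - logx.mean(axis=1, keepdims=True)) / logx.std(axis=1, keepdims=True)

labels = KMeans(n_clusters=8, random_state=0, n_init=10).fit_predict(z)
print(np.bincount(labels))              # cluster sizes
```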