ICCABS 2023: 2023 12TH INTERNATIONAL CONFERENCE ON COMPUTATIONAL ADVANCES IN BIO AND MEDICAL SCIENCES
PROGRAM FOR MONDAY, DECEMBER 11TH

09:15-10:15 Session 1: 1st Keynote Talk

1st Keynote Talk

09:15
Growing Opportunities for Computational Advancements in Biomedical Research

ABSTRACT. The last few years have seen both advances in technologies to generate whole genome sequencing, spatial and single-cell spatial transcriptomic, high-throughput proteomic, high-resolution molecular, and quantitative imaging data, and advances in, and increased accessibility of, artificial intelligence and deep learning methods. These data and methods have the potential to revolutionize our understanding of any number of health-related phenomena but are virtually useless without the ability to process and analyze them in an efficient way. This offers an unprecedented opportunity to the data science community. The purpose of my talk is to discuss these opportunities, including the emerging technologies that are motivating them and the data they are creating. By presenting this information along with examples of the application of integrative analytical methods, specifically network-based and machine learning approaches to clinical and genomic data, I hope to help data scientists and medical researchers optimize, expand, and innovate their current research practices.

10:30-12:30 Session 2A: ICCABS - Immunotherapy and Cancer Research

ICCABS - Immunotherapy and Cancer

10:30
Single-cell RNA-sequencing analysis of breast tumor infiltrating lymphocytes treated with localized ablative cancer immunotherapy

ABSTRACT. Lymphocytes, comprising both B and T cells, serve as pivotal regulators in antitumor immunity, yet their activation pathways in response to localized ablative immunotherapy (LAIT) remain underexplored. We used LAIT, a combination of photothermal therapy (PTT) and intratumor delivery of N-dihydrogalactochitosan (GC), on mice bearing MMTV-PyMT tumors. Single-cell RNA sequencing (scRNAseq) was employed to analyze transcriptional changes in B and T cells within the tumor microenvironment (TME). LAIT extended survival in tumor-bearing mice and increased the proportions of tumor-infiltrating B cells, activated CD8+ T cells, and naïve/memory CD4+ T cells. GC and PTT+GC treatments induced activation signatures in both cell types. B cells exhibited upregulated interferon response genes and antigen presentation, evolving from a resting state towards an effector phenotype. LAIT consistently induced the expression of co-stimulatory molecules and a series of antitumor cytokines, including type I/II IFNs, Tnf, Il1 and Il17, in CD8+ and CD4+ T cells. Meanwhile, LAIT downregulated immune-suppressing TGFβ signaling. LAIT also slightly induced the immune checkpoints Pdcd1 and Ctla4, laying the foundation for combining LAIT with immune checkpoint blockade (ICB) for effective treatment. Both B and T cells demonstrated gene expression patterns that correlated with longer survival in breast cancer patients. Our findings reveal that LAIT induces a broad proinflammatory response in lymphocytes, including activation of interferon signatures and antigen presentation in B cells, thereby reshaping the TME to potentiate antitumor immunity. These results provide the rationale for the combined use of LAIT with immune checkpoint inhibitors in metastatic, nonresponsive cancers and highlight the therapeutic potential of lymphocyte activation in cancer treatment.

10:54
Cancer and Tissue Prediction Using Mutational Signatures in Highly Mutated Cancers

ABSTRACT. Around three to five percent of all cancers have an unknown primary origin, and identifying their tissue type is crucial for clinical purposes, especially for highly mutated cancers, which can benefit from immunotherapy. A mutational signature describes a distinct pattern of mutations caused by a specific mutagenic process and is usually associated with a specific tissue type. For example, tobacco exposure causes a high number of C to A mutations, which are frequent in lung cancer, while UV light induces a high number of CC to TT mutations, which occur in melanomas. This observation motivates the goal of our study, which is to use mutational signature contributions to predict the cancer and tissue type of highly mutated tumor samples. We use the Mutational Signatures v.3.3 cohort from the Catalogue of Somatic Mutations in Cancer (COSMIC) and consider only nine highly mutated cancer types, resulting in a set of 1,477 samples. We remove artifactual signatures and consider frequently occurring signatures, resulting in a core set of twenty signatures, which we used as features for our models. We tested regression and tree-based models to predict cancer and tissue type. Random forests produced superior results, predicting cancer type with an accuracy, specificity and sensitivity of 83.4%, 97.9%, and 76.4%, and predicting tissue type with an accuracy, specificity and sensitivity of 89.5%, 98.0%, and 84.8%. Our approach is limited in cancers that share similar mutational signatures, e.g. our lowest accuracy (76.7%) occurs in defective mismatch repair cases from endometrial, stomach, and colorectal cancers.
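
To make the reported metrics concrete, the following is a minimal sketch of how accuracy, sensitivity and specificity can be computed for a multi-class classifier in a one-vs-rest fashion. The abstract does not state how the authors averaged across classes, so the macro-averaging used here is an assumption, and the function name and toy labels are hypothetical.

```python
def one_vs_rest_metrics(y_true, y_pred):
    """Return (accuracy, macro sensitivity, macro specificity).

    Each class is treated in turn as the positive class (one-vs-rest);
    macro-averaging over classes is an assumption, not the authors' stated choice.
    """
    n = len(y_true)
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / n

    sensitivities, specificities = [], []
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        tn = n - tp - fn - fp
        if tp + fn:  # class has at least one true sample
            sensitivities.append(tp / (tp + fn))
        if tn + fp:  # class has at least one negative sample
            specificities.append(tn / (tn + fp))

    macro_sens = sum(sensitivities) / len(sensitivities)
    macro_spec = sum(specificities) / len(specificities)
    return accuracy, macro_sens, macro_spec


# Hypothetical toy labels, for illustration only.
truth = ["lung", "lung", "skin", "skin", "colon", "colon"]
pred = ["lung", "skin", "skin", "skin", "colon", "lung"]
acc, sens, spec = one_vs_rest_metrics(truth, pred)
```

With many classes, specificity tends to be much higher than sensitivity because each class has far more negatives than positives, which is consistent with the pattern of figures quoted in the abstract.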

11:18
Decoding Heterogeneity in Quadruple-Negative Breast Cancer: A Data-Driven Clustering Approach

ABSTRACT. In a quest to decipher the complexities of Quadruple-Negative Breast Cancer (QNBC), this research harnesses advanced analytics applied to RNAseq gene expression data. Employing unsupervised clustering techniques, our rigorous methodology entails data preprocessing for enhanced interpretability, dimensionality reduction via autoencoders and Principal Component Analysis (PCA), and fine-tuning k-means clustering with internal validation indices. The analysis effectively discriminates two distinct QNBC subtypes, substantiated by high Silhouette (0.08) and Calinski-Harabasz (6.92) Scores. The profiles of these clusters are further unveiled through statistical analyses of top variant genes. Cluster 1 is typified by genes such as C9ORF57 and OR2AT4, while Cluster 2 presents distinctive genetic features, including KRTAP10-10 and ADAM3A. These data-driven clusters hold the promise of personalized assessments and interventions, contingent upon clinical validation. This study underscores the potential of integrated machine learning and statistical analysis, marking a pathway to more effective QNBC management.
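
For context on the internal validation index mentioned above, here is a minimal sketch of the mean Silhouette score, which by definition lies between -1 and 1. It assumes Euclidean distance and at least two clusters; the abstract does not specify the authors' distance metric or tooling, and the function name and toy points are hypothetical.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette over all points (Euclidean distance, >= 2 clusters)."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)

    scores = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            # Convention: a point alone in its cluster scores 0.
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


# Hypothetical toy data: two well-separated groups.
pts = [[0, 0], [0, 1], [10, 0], [10, 1]]
good = silhouette_score(pts, [0, 0, 1, 1])  # labels match the groups
bad = silhouette_score(pts, [0, 1, 0, 1])   # labels cut across the groups
```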

11:42
A Unified Machine Learning Framework for Multi-subtype Tumour Classification across Diverse Datasets

ABSTRACT. In the domain of oncology, Machine Learning (ML) has emerged as a pivotal tool for the precise classification of digitized tissue images, which is instrumental in identifying cancer subtypes for targeted therapies. A common challenge is the narrow focus of most models on single tumor types using datasets from single origins, limiting generalizability across diverse cancer types and data sources. To address this, we formulated a unified ML framework employing a suite of ML models, illustrating our approach to Bone, Colon, and Prostate Cancer datasets from varied origins while asserting the framework's potential applicability across any histopathology datasets. Our framework is meticulously crafted to harmonize and learn from data variances, offering a robust classification model. Rigorous testing showcased Framework 1 achieving an overall accuracy of 88.48%, while Framework 2, with classification corrections, displayed an enhanced overall accuracy of 90.28%. The Area Under the Curve (AUC) values exceeded 0.97, with average specificity and sensitivity surpassing 0.98 and 0.99 across frameworks for individual classes. Additionally, when Framework 2 was extended to an unseen Breast Cancer dataset, it demonstrated a notable accuracy of 88.8% for Normal vs Tumor tile classification, indicating its efficacy even in unseen datasets. This reflects not only the model's robustness but also the potential of manual feature extraction in predictive models. This endeavor significantly underscores the deployment of a unified ML framework in enhancing the accuracy and reliability of tumor subtype classification across diverse datasets, paving the path for substantial advancements in precision oncology and personalized medicine.

12:06
DNA Methylation Based Subtype Classification of Breast Cancer

ABSTRACT. Aberrant genome-wide DNA methylation patterns are common in cancers. Understanding how these affect the transcriptome can provide insights into subtype-specific development and progression of tumorigenesis. In this study we carried out genome-wide analysis of DNA methylation and gene expression profiles in breast cancer patients from TCGA-BRCA and propose a novel set of 35 methylation-based prognostic markers that may shed light on molecular subtype-specific disease stratification. Gene-set enrichment and pathway analysis of the predicted markers using MSigDB and DAVID revealed their role in the mammary gland development pathway, various signaling pathways (ERBB2, NOTCH, etc.), and other cancer pathways, and showed clear association with genes affected by hormone receptor status. We show that the reported DNA methylation signature has high discriminative power in classifying breast cancer samples into three molecular subtypes, viz., Luminal, HER2-enriched and Triple Negative. An accuracy of 94.12% and an MCC of 0.87 are obtained in stratified 5-fold cross-validation for the three-class classification using SVM-RBF.
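
Because the MCC above is reported for a three-class problem, the sketch below shows one common multi-class generalization of the Matthews correlation coefficient, computed from per-class true and predicted counts. Whether the authors used this exact generalization is an assumption; the function name and toy labels are hypothetical.

```python
import math


def multiclass_mcc(y_true, y_pred):
    """Multi-class Matthews correlation coefficient.

    c = number of correct predictions, s = number of samples,
    t_k = times class k truly occurs, p_k = times class k is predicted.
    MCC = (c*s - sum_k t_k*p_k) / sqrt((s^2 - sum_k p_k^2) * (s^2 - sum_k t_k^2))
    """
    classes = sorted(set(y_true) | set(y_pred))
    s = len(y_true)
    c = sum(t == p for t, p in zip(y_true, y_pred))
    t_k = [sum(t == k for t in y_true) for k in classes]
    p_k = [sum(p == k for p in y_pred) for k in classes]

    numerator = c * s - sum(t * p for t, p in zip(t_k, p_k))
    denominator = math.sqrt(
        (s * s - sum(p * p for p in p_k)) * (s * s - sum(t * t for t in t_k))
    )
    # Convention: return 0 when the coefficient is undefined.
    return numerator / denominator if denominator else 0.0


# Hypothetical toy labels using the three subtypes named in the abstract.
truth = ["Luminal", "Luminal", "HER2", "HER2", "TN", "TN"]
perfect = multiclass_mcc(truth, truth)
```

For two classes this expression reduces to the familiar binary MCC, so it can be used unchanged for either setting.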

10:30-12:30 Session 2B: CAME Workshop - I

CAME Workshop - I

Location: OMU Scholars
10:30
Antigenic cooperation and cross-immunoreactivity networks
10:54
Hashed Embeddings: A Promising Technique for Efficient Protein Sequence Representation
11:18
Assessment of PHRED Score Characteristics in Illumina MiSeq Amplicon Sequencing
11:42
Gaussian Beltrami-Klein Model for Protein Sequence Classification: A Hyperbolic Approach
12:06
Mitogenome Assembly from Ultraconserved Elements Sequencing Reads
13:30-14:30 Session 3: 2nd Keynote Talk

2nd Keynote Talk

13:30
Unraveling the Early-Life Origins of Nonalcoholic Fatty Liver Disease (NAFLD) Using Multi-'Omics Approaches

ABSTRACT. The use of multi-'omics, a comprehensive approach integrating genomics, transcriptomics, proteomics, metabolomics, and epigenetics, has emerged as a transformative paradigm in biomedical research. This holistic strategy provides a deep understanding of biological systems, revealing intricate molecular mechanisms and the complex interplay of factors underlying health and disease. Our research program applies a systems-biology approach to investigate cellular processes and molecular mechanisms in liver, gut, fat, and bone marrow, with a focus on understanding the early life origins of adult disease. An increasing body of evidence suggests that maternal obesity, approaching 50% in the US, primes offspring in utero and increases their risk of developing NAFLD and obesity in later life. NAFLD and obesity are also key risk factors for increased prevalence of diabetes and liver cancer in both children and adults. Recent mouse studies from our lab have shown that the developing immune system is particularly susceptible to maternal diet and gut microbiome dysregulation, with metabolic and inflammatory reprogramming of innate immune cells in the liver and bone marrow associated with accelerated NAFLD progression in later life, characterized by oxidative stress, inflammation, and fibrosis. We have identified the antioxidant pyrroloquinoline quinone and microbial metabolites indole and indole-3-acetate as dietary interventions that, when administered during pregnancy and/or lactation, effectively halt the programming of inflammatory sequelae associated with metabolic disease risk in offspring. Our work, spanning cells and tissues in mice and non-human primates, uses single cell sequencing, metabolomics, and microbiome and bioinformatic approaches to elucidate microbiome - immune cell - tissue crosstalk with the goal of understanding complex mechanisms and interacting pathways underlying NAFLD’s early origins. 
The identification of circulating and tissue biomarkers, molecular pathways, and potential therapeutic targets is a crucial aspect of our work, propelling us towards more effective means for preventing developmental programming by designing interventions to mitigate the long-term health impacts of NAFLD.

14:30-16:30 Session 4A: ICCABS - Deep Learning and Predictive Models

ICCABS - Deep Learning and Predictive Models

14:30
SpaNN: a flexible deep neural network framework for predicting the transcript distribution of the spatial transcriptomics

ABSTRACT. Spatial transcriptomic sequencing is an innovative high-throughput sequencing technology that profiles the spatial distribution of gene expression in tissues or organs, thereby advancing research into gene regulation and disease mechanisms. Currently, spatial transcriptomic sequencing methods based on in situ hybridization and imaging can obtain cell location information and transcriptome profiles at single-cell resolution, but they are limited in the number of detected genes, which hinders discovery-driven studies using such spatial transcriptomic data. It is therefore crucial to predict the spatial distribution of undetected genes in such data. Here, we propose a data enhancement method, called SpaNN, to predict transcriptome expression levels in spatial configuration. SpaNN designs a tailored similarity loss using the location information from spatial transcriptome data to train a deep neural network to capture a joint embedding, then uses weighted k-nearest-neighbor to predict the spatial expression levels of unmeasured genes. Our experimental results demonstrate that SpaNN can restore the unmeasured gene expression levels in the spatial transcriptomic data and improve the effect of cell clustering. In addition, the scalability analysis reveals that SpaNN is scalable to large-scale datasets. The source code of SpaNN is openly available at https://github.com/CSUBioGroup/SpaNN.
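
The weighted k-nearest-neighbor step described above can be illustrated with a small sketch. This is not the SpaNN implementation (see the repository linked in the abstract for that): the inverse-distance weighting, the Euclidean metric, the function name and the toy data are all assumptions made for illustration, and the embeddings are taken as given rather than learned.

```python
import math


def weighted_knn_impute(query_emb, ref_embs, ref_expr, k=3, eps=1e-8):
    """Predict unmeasured gene expression for one spatial cell.

    query_emb: embedding of the spatial cell in the joint space
    ref_embs:  embeddings of reference cells in the same space
    ref_expr:  expression vectors of the reference cells for the
               genes that were not measured spatially (same order)
    Weights are inverse Euclidean distances (an assumed scheme).
    """
    dists = [math.dist(query_emb, emb) for emb in ref_embs]
    nearest = sorted(range(len(ref_embs)), key=dists.__getitem__)[:k]
    weights = [1.0 / (dists[i] + eps) for i in nearest]
    total = sum(weights)
    n_genes = len(ref_expr[0])
    return [
        sum(w * ref_expr[i][g] for w, i in zip(weights, nearest)) / total
        for g in range(n_genes)
    ]


# Hypothetical toy data: three reference cells, one unmeasured gene.
ref_embs = [[1.0, 0.0], [0.0, 1.0], [10.0, 10.0]]
ref_expr = [[2.0], [4.0], [100.0]]
predicted = weighted_knn_impute([0.0, 0.0], ref_embs, ref_expr, k=2)
```

In this toy case the two nearest reference cells are equidistant from the query, so the prediction is their plain average and the distant third cell is ignored.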

14:54
An Explainable Deep Learning Framework for Mandibular Canal Segmentation from Cone Beam Computed Tomography volume

ABSTRACT. Cone Beam Computed Tomography (CBCT) is an indispensable imaging modality in oral radiology, offering comprehensive dental anatomical information. Accurate detection of the mandibular canal (MC), a crucial anatomical structure in the lower jaw, within CBCT volumes is essential to support clinical dentistry workflows, including diagnosis, preoperative treatment planning, and postoperative evaluation. In this study, we present a deep learning-based (DL) approach for MC segmentation using 3D U-Net and 3D Attention U-Net networks. We collected a unique dataset of CBCT scans from 20 anonymous hemisected mandibular bones, which were further processed for analysis. The samples were scanned using a CBCT scanner after inserting a wire through the whole length of the MC to identify its location in space (as a gold standard). Our experimental results demonstrate that the 3D Attention U-Net outperforms the standard 3D U-Net in detecting the MC's location, with Dice coefficient, precision, and recall values of 0.65, 0.75, and 0.60, respectively. Unlike current DL-enabled methods for MC segmentation, which face deployment and trust challenges due to their "black-box" nature, our approach incorporates a post-hoc visual explainability feature through the Grad-CAM++ (Gradient-weighted Class Activation Mapping) algorithm. This tool highlights important regions within the CBCT volumes that influence the model's predictions, providing valuable insights into the segmentation process and bridging the gap between cutting-edge DL technology and clinical practice.
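
As a reference for the three segmentation metrics quoted above, the following minimal sketch computes the Dice coefficient, precision and recall from two binary masks. It assumes the 3D volumes have been flattened into equal-length sequences of 0/1 voxels; the function name and toy masks are hypothetical and unrelated to the authors' data.

```python
def segmentation_scores(pred, truth):
    """Return (dice, precision, recall) for two flattened binary masks."""
    tp = sum(1 for p, t in zip(pred, truth) if p and t)
    fp = sum(1 for p, t in zip(pred, truth) if p and not t)
    fn = sum(1 for p, t in zip(pred, truth) if not p and t)

    # Dice = 2|A ∩ B| / (|A| + |B|); two empty masks count as a perfect match.
    dice = 2 * tp / (2 * tp + fp + fn) if (tp + fp + fn) else 1.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return dice, precision, recall


# Hypothetical toy masks of five voxels.
dice, precision, recall = segmentation_scores([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
```

The Dice coefficient is the harmonic mean of precision and recall, which is why a model with higher precision than recall, as reported above, has a Dice value between the two.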

15:18
FedDP: Secure Federated Learning with Differential Privacy for Disease Prediction

ABSTRACT. Integrative analysis of distributed biomedical data is essential for maximizing knowledge discovery, accelerating medical breakthroughs, and improving patient care through collaborative research and practices. However, it is challenging to share and aggregate biomedical data distributed among multiple institutions or computing resources due to various concerns including data privacy, security, and confidentiality. The federated learning (FL) framework can effectively enable multiple institutions to jointly perform machine learning by training a robust model without sharing local data, satisfying the requirements of user privacy protection as well as data security. However, conventional FL methods are exposed to the risk of gradient leakage and cannot be directly applied to genomic data since they cannot address the unique challenges of data imbalance typically seen in biomedicine. To provide secure and efficient disease prediction based on biomedical data distributed across multiple parties, we propose an FL framework enhanced with differential privacy (FedDP) on trained model parameters. The key idea of FedDP is to deploy differential privacy on intermediate gradients that are computed and transmitted by optimizers from local parties. In addition, the unique weighted min-max loss in FedDP is deployed to address the challenge of fair prediction on highly imbalanced datasets. Our experiments on label-imbalanced datasets for cancer prediction demonstrate that FedDP provides a powerful tool to implement and evaluate various strategies in support of privacy preservation and model performance guarantee to overcome data imbalance.
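
The idea of applying differential privacy to gradients before they leave a local party can be sketched as follows. The abstract does not describe FedDP's actual mechanism, so this example assumes the widely used recipe of L2 clipping followed by Gaussian noise; the function name, parameters and values are hypothetical, and no privacy budget accounting is performed.

```python
import math
import random


def privatize_gradient(grad, clip_norm=1.0, noise_multiplier=1.0, rng=None):
    """Clip a gradient to a maximum L2 norm, then add Gaussian noise.

    Assumed mechanism (not taken from the FedDP paper):
      1. scale the gradient so its L2 norm is at most clip_norm
      2. add N(0, (noise_multiplier * clip_norm)^2) noise to each coordinate
    """
    if rng is None:
        rng = random.Random()
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    clipped = [g * scale for g in grad]
    sigma = noise_multiplier * clip_norm
    return [g + rng.gauss(0.0, sigma) for g in clipped]


# A local party would transmit the noisy gradient instead of the raw one.
raw_gradient = [3.0, 4.0]  # hypothetical values, L2 norm 5
noisy = privatize_gradient(raw_gradient, clip_norm=1.0,
                           noise_multiplier=0.5, rng=random.Random(0))
```

Clipping bounds how much any single update can influence the aggregate, and the noise scale is tied to that bound; a larger noise multiplier gives stronger privacy at the cost of noisier training.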

15:42
Optimizing Deep Learning for Biomedical Imaging

ABSTRACT. With the significant increase in the use of deep learning (DL) for biomedical imaging, the corresponding DL models have become increasingly complex and computationally intensive to achieve high accuracy. This work presents both architecture-aware optimizations and sparsity optimizations to efficiently utilize underlying parallel hardware resources and reduce the computational demand of DL models while maintaining their accuracy, respectively. We demonstrate the efficacy of our optimization techniques on an existing DL model in the biomedical domain, i.e., DDNet, short for Densenet and Deconvolution Network, which is designed to enhance the quality of CT images. Our techniques reduce the total training time by 1.7× while delivering the same accuracy.

16:06
Automated Formalization of Biological Model Properties into Temporal Logics using Large Language Models
PRESENTER: Sunny Raj

ABSTRACT. The engineering of in-silico biological models can greatly benefit from an automated verification procedure that can guarantee the correctness of the model using monitoring algorithms. The key bottleneck in adopting such a high-assurance design flow is the robust translation of properties of biological models from natural languages to executable formalisms, such as temporal logics. In this paper, we investigate the capacity of open-source and proprietary large language models in transforming natural language statements into temporal logic specifications. Our results suggest that large language models can be used to automatically formalize properties of biological models into temporal logics.
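
As a purely illustrative sketch of what such a formalization and its monitoring can look like (this is not the authors' method or benchmark), a statement such as "the concentration eventually exceeds 5 and is never negative" corresponds to a temporal logic formula of the form F(x > 5) ∧ G(x ≥ 0). The monitors below evaluate these operators over a finite simulation trace; the property, threshold values and trace are hypothetical.

```python
def always(pred, trace):
    """G pred: the predicate holds at every step of the finite trace."""
    return all(pred(state) for state in trace)


def eventually(pred, trace):
    """F pred: the predicate holds at some step of the finite trace."""
    return any(pred(state) for state in trace)


def eventually_always(pred, trace):
    """F G pred: from some step onward, the predicate holds until the end."""
    return any(
        all(pred(state) for state in trace[i:]) for i in range(len(trace))
    )


# Hypothetical simulated concentration of a species over time.
trace = [0.5, 2.0, 6.1, 5.4, 5.2]

# "The concentration eventually exceeds 5 and is never negative."
holds = eventually(lambda x: x > 5, trace) and always(lambda x: x >= 0, trace)

# "The concentration eventually settles above 5."
settles = eventually_always(lambda x: x > 5, trace)
```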

14:30-16:30 Session 4B: CAME Workshop - II

CAME Workshop - II

Location: OMU Scholars
14:30
A rigorous benchmarking of methods for SARS-CoV-2 lineage abundance estimation in wastewater
14:54
Tracking and modeling the dynamics of evolving SARS-CoV-2 Variants
15:18
Clustering-Based Multi-Characteristic Comparative Analysis of COVID-19 Epidemiology in Europe
15:42
Early detection of emerging viral variants with altered phenotypes using viral epistatic networks
16:06
On Multi-Phase Metagenomics Reads Binning

ABSTRACT. Metagenomics is the study of heterogeneous microbial samples extracted directly from their natural environment, e.g., from soil, water, or the human body. The detection and quantification of species that populate microbial communities have been the subject of many recent studies based on classification and clustering, motivated by their role as the first step in more complex pipelines (e.g. for functional analysis, de novo assembly or comparison of metagenomes). In this paper we explore the idea of improving the overall quality of metagenomics binning at the read level by proposing a framework that sequentially combines two complementary read binning approaches: one based on determining species abundances and another relying on read overlaps to cluster reads together. Our preliminary results show that the combination of the two tools can improve clustering quality in realistic conditions where the number of species is not known beforehand.

17:00-18:30 Session 5: Poster Session

Poster Session

VISTA: An integrated framework for structural variant discovery

ABSTRACT. Structural variation (SV) refers to insertions, deletions, inversions, and duplications in human genomes. With advances in whole genome sequencing (WGS) technologies, a plethora of SV detection methods have been developed. However, dissecting SVs from WGS data remains a challenge, with the majority of SV detection methods prone to a high false-positive rate, and no existing method able to precisely detect the full range of SVs present in a sample. Previous studies have shown that none of the existing SV callers can maintain high accuracy across various SV lengths and genomic coverages. Here, we report an integrated structural variant calling framework, VISTA (Variant Identification and Structural Variant Analysis), that leverages the results of individual callers using a novel and robust filtering and merging algorithm. In contrast to existing consensus-based tools, which ignore length and coverage, VISTA executes various combinations of top-performing callers based on variant length and genomic coverage to generate SV events with high accuracy. We evaluated the performance of VISTA using comprehensive gold-standard datasets across varying organisms and coverage. We benchmarked VISTA using the Genome-in-a-Bottle (GIAB) gold standard SV set and haplotype-resolved de novo assemblies from the Human Pangenome Reference Consortium (HPRC), along with an in-house PCR-validated mouse gold standard set. VISTA maintained the highest F1 score among top consensus-based tools measured using a comprehensive gold standard across both mouse and human genomes. VISTA also has an optimized mode, where the calls can be optimized for precision or recall. VISTA-optimized is able to attain 100% precision and the highest sensitivity among other variant callers. 
In conclusion, VISTA represents a significant advancement in structural variant calling, offering a robust and accurate framework that outperforms existing consensus-based tools and sets a new standard for SV detection in genomic research.

Link to Biorxiv: https://www.biorxiv.org/content/10.1101/2023.08.11.553053v1.full

Genetic Predisposition to Alcoholism Significantly Alters the Functional Connectivity of the Brain

ABSTRACT. Alcoholism is a common disorder that affects many throughout the world and leads to cognitive deficiencies. Although much has been done to understand the effects of excessive alcohol consumption, less is known about the effects of genetic predisposition to alcoholism. Here, we focus on one such effect which alters the working memory of children who inherited genetic traits from alcoholic parents. More specifically, we focused on the functional connectivity changes of subjects that performed a working memory task while their brain activity was recorded using electroencephalography (EEG). We show that there is a significant increase in the functional connectivity density of subjects with a predisposition to alcoholism, which could indicate potential difficulties in coordination between different brain regions while conducting a working memory task. Furthermore, we show that in subjects with a predisposition to alcoholism, the posterior region of the brain acts more independently, which points to potential difficulties of spreading the encoded spatial information to other parts of the brain during a working memory task.

A Model for Elucidating the Single Molecule Synthesis Process through Temporal Information

ABSTRACT. Single molecule sequencing technologies have seen remarkable advancements due to their capacity to reveal heterogeneities and unique characteristics. However, the issue of relatively higher error rates remains a concern. To enhance the effectiveness of sequencing, we aim to extract more profound features from limited experimental data. Notably, the recorded temporal information in the biochemical processes of single molecule synthesis, accessible in next generation sequencing, bears valuable information. Simple curve fitting to statistical data often falls short in extracting substantial insights. This work introduces two probabilistic models for interpreting the reaction times in single molecule synthesis processes. These models offer enhanced insights into the biochemical reactions' interpretation and sequencing accuracy. By capitalizing on the recorded reaction times, we aim to provide a deeper understanding of single molecule synthesis processes, offering a potential avenue to improve sequencing results while minimizing resource utilization. In summary, these models provide opportunities for a richer comprehension of the biochemical reactions inherent in single molecule synthesis, thereby advancing the accuracy and efficiency of sequencing technology.

A rigorous benchmarking of methods for SARS-CoV-2 lineage abundance estimation in wastewater

ABSTRACT. As the SARS-CoV-2 virus continues to spread and evolve, there is a pressing need for a scalable, cost-effective, and long-term passive surveillance tool to accurately monitor the prevalence rates of known, emerging, and cryptic variants in the population. Wastewater-based genomic surveillance (WWGS) has proven repeatedly to provide early outbreak warnings and accurately assess the prevalence trends of different viral lineages over time. However, the inherent physical and chemical heterogeneity of wastewater samples, the low viral load, viral RNA degradation, background genetic material, and PCR inhibitors can significantly affect the quality of the sequencing data, negatively impacting the performance of bioinformatics algorithms employed for wastewater sequencing data analysis. In this study, we create extensive benchmarking datasets using in silico data mirroring past prevalence ratios of co-occurring lineages and sublineages in the United States. We use these datasets to run comprehensive benchmarks of more than 20 bioinformatics methods used for SARS-CoV-2 lineage detection and abundance estimation based on both classification and deconvolution approaches. We plan to assess how the performance of these methods is impacted by not only the extent of RNA degradation under different physical and chemical conditions, but also the design of the sequencing experiment, in particular the sequencing technology used, sequencing error rate, read length, sequencing coverage, and spike gene targeting compared to whole genome sequencing.

Role of walnut-derived urolithins in reducing colonic inflammation in a patient cohort elucidated by an integrated omics analysis

ABSTRACT. Diet can directly affect colon health and cancer risk. For example, the polyphenolic ellagitannins that are present in walnuts are converted by the microbiome to a panel of bioactive urolithin metabolites. Urolithin A has been studied for its anti-inflammatory and anti-cancer properties. In this pilot clinical trial, we evaluate the effects of walnut consumption (2-oz per day) on urinary urolithins and blood serum inflammatory markers, as well as on the transcriptome of normal colon and polyp tissue in 38 healthy obese and non-obese subjects. Our in-depth computational analysis of metabolomics and proteomics data suggests that subjects with higher urolithin A formation show significantly lower levels of several key serological markers of inflammation, including C-Peptide, sICAM-1, sIL6-R, Ghrelin, TRAIL, sVEGFR2, MCP2, and Haptoglobin. Furthermore, special IMC proteomics analysis of select colon polyp tissue taken from proximal and distal colon shows significant reduction of Vimentin expression in subjects with higher urolithin A formation. These studies indicate that higher urolithin A formation is linked to an anti-inflammatory response, warranting further studies to better understand the role of urolithins in the inflammatory process.

Computational tumor progression analysis via seriation based trajectory inference

ABSTRACT. Precise lineage determination plays a crucial role in discerning the dynamic temporal patterns of gene expression in single cell RNA-Seq data analysis workflows. In this work, we present a novel computational approach for trajectory inference of normal and tumor cell populations at single cell resolution by ordering their cellular transcriptional profiles and modeling their progression along differentiation or tumor evolution paths. We implement a seriation-based pseudotime inference method using optimally reordered heatmaps and provide advanced visualization for trajectory inference in a 3-D latent space representation of scRNA-Seq data.

Predicting Antibiotic Metabolites with One and Two-Steps in Silico Biotransformations to Support Identification of Unknown Chemicals in Exposome and Metabolome Studies

ABSTRACT. Metabolomics and exposomics are emerging fields of research that aim to identify products of both secondary and primary metabolism that are linked to disease. Identifying these products in biological samples is often difficult because of the challenge of characterizing 'unknown' chemicals. It can take months or years to conclusively identify metabolites that are not currently available in chemical databases. In silico metabolite prediction strategies that predict potential metabolic products are one way to identify detected chemicals in untargeted metabolomics and exposomics studies.

Exploring Protein Design Landscapes with Semi-Supervised Adaptive Sampling

ABSTRACT. Impactful protein design is an enduring pursuit for chemical and bioengineers associated with therapeutics and other industries. Conventional experimental techniques, such as directed evolution, often face challenges in formulating rare, optimized variants of the target protein. In this study, I investigate the effectiveness of a semi-supervised adaptive sampling approach while it explores the extensive search space for enhancements to protein design. Diverging away from conventional black-box predictive models that are susceptible to bias stemming from training distribution, our model takes a different approach. I forgo direct optimization of the oracle to improve predictive power. Instead, I developed a technique to estimate a distribution reliant on desired properties. Consequently, I achieved a noteworthy navigation of the design space. Through a representative case study involving the beta 1 (GB1) domain of streptococcal protein G, I demonstrate the viability of machine-guided protein design in locating meaningful high-fitness protein variants. Our inventive approach represents a critical step towards leveraging costly laboratory experiments without sacrificing throughput, thus propelling a revolution in the field of protein design.

A Simple and Interpretable Deep Learning Model for Diagnosing Pneumonia from Chest X-Ray Images

ABSTRACT. Pneumonia is an infectious disease that has afflicted humanity for centuries. Its origins can be diverse, including bacterial, viral, fungal, or chemical agents. It is one of the diseases that cause the most deaths among children and adults worldwide. There are many ways to treat pneumonia; however, the sooner it is detected, the greater the chances of successful treatment. Developing ways to make diagnosis faster, and thereby facilitate early treatment, is therefore of general interest. The present work presents a neural network for the analysis of chest radiographs for the diagnosis of pneumonia. The proposed method showed remarkable results when compared to similar methods in the literature. In addition, the proposed method presents a more transparent diagnosis through relevance aggregation, highlighting the regions of images that were recognized by the neural network to perform the diagnosis, which also contributes to interpretable results.

Network-Based Bioinformatics Highlights Broad Importance of Human Milk Hyaluronan

ABSTRACT. Background: Human milk (HM) is rich in bioactive factors thought to promote postnatal development of the small intestine (SI) and maturation of the microbiome. HM is also protective against necrotizing enterocolitis (NEC), a devastating inflammatory condition of the SI predominately occurring in preterm infants. The HM glycosaminoglycan, hyaluronan (HA), is present at high levels in colostrum and early milk. Our group has demonstrated that HA 35 kDa (HA35, a HM HA mimic) is protective in two murine NEC models. Further, HA35 promotes maturation of the murine neonatal SI through increased villus length and crypt depth, more abundant goblet and Paneth cells, and changes in the microbiome. However, the molecular mechanisms underpinning HA35-induced changes in the SI epithelium are unclear.

Objective: Determine if pathway and network analysis of bulk RNA-seq transcriptomics can establish mechanisms contributing to SI differentiation and maturation.

Design/Methods: Bulk RNA-seq was conducted on SI samples collected on postnatal day (P) 14 from CD-1 mouse pups treated with HA35 (30 mg/kg, P7-P14, n = 4) or control (n = 4). Using the CORALL Kit, stranded RNA-seq libraries were constructed from 200-500 ng RNA and indexed for multiplexed sequencing on the Illumina NovaSeq 6000. Reads were mapped to the mouse genome (mm10) with GSNAP, and expression levels were quantified as FPKM using Cufflinks. Sample and treatment heterogeneity were assessed via principal component analysis (PCA). Log2 fold changes of differentially expressed genes (DEGs, false discovery rate [FDR]-adjusted p < 0.05) were uploaded to Qiagen Ingenuity Pathway Analysis (IPA) for pathway enrichment analysis, and gene set enrichment analysis (GSEA) was performed using the Hallmark molecular signature database. Mixed cell deconvolution, using the robust linear regression algorithm in the granulator R package with the scRNA-seq Mouse Cell Atlas as reference, was used to delineate cell-type compositional changes. Protein-protein interaction (PPI) networks were generated in Cytoscape using the stringApp plugin, and functional enrichment of the largest subnetwork was performed with ClueGO.
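The DEG filtering step described above (FDR-adjusted p < 0.05, followed by log2 fold changes) can be illustrated with a minimal sketch. The abstract does not name the FDR procedure, so the Benjamini-Hochberg adjustment used here is an assumption, as are the pseudocount, the function names, and the gene names in the usage note. It is not the authors' pipeline.

```python
import math


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order.

    Assumption: the abstract only says "FDR-adjusted"; BH is a common choice.
    """
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def select_degs(genes, fdr_cutoff=0.05, pseudocount=1.0):
    """Keep genes whose adjusted p-value is below the cutoff.

    genes: list of (name, mean_treated, mean_control, raw_p) tuples.
    The pseudocount is an illustrative assumption to avoid log2 of zero.
    Returns a list of (name, log2_fold_change, adjusted_p) tuples.
    """
    adjusted = benjamini_hochberg([g[3] for g in genes])
    degs = []
    for (name, treated, control, _), q in zip(genes, adjusted):
        if q < fdr_cutoff:
            log2fc = math.log2((treated + pseudocount) / (control + pseudocount))
            degs.append((name, log2fc, q))
    return degs
```

For example, `select_degs([("geneA", 7.0, 1.0, 0.001), ("geneB", 1.0, 1.0, 0.9)])` keeps only the hypothetical `geneA`, with a log2 fold change of 2.0.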

Results: HA35-treated pups distinctly separated in PCA clustering. A total of 247 DEGs were observed (200 upregulated, 47 downregulated) in HA35-treated pups compared to controls. Key upregulated pathways included 14-3-3-mediated, ERK/MAPK, hypoxia inducible factor-1 alpha (HIF-1α), mechanistic target of rapamycin (mTOR), erythropoietin, epidermal growth factor receptor (EGFR) family, hepatocyte growth factor (HGF), vascular endothelial growth factor (VEGF), insulin-like growth factor-1 (IGF-1), senescence, platelet-derived growth factor (PDGF) signaling, and NRF2-mediated oxidative stress response, while apoptosis and peroxisome proliferator-activated receptor (PPAR) signaling were downregulated. GSEA indicated HA35 associations with epithelial-mesenchymal transition, MYC targets, E2F targets, genes downregulated in response to UV radiation, unfolded protein response, protein secretion, and G2M checkpoint. Cell deconvolution revealed intriguing increases in both Paneth and stem cells. Notably, ClueGO identified changes to antioxidant and growth pathways, KEAP1-NFE2L2 and mTOR, respectively.

Conclusions: The observed upregulation of pathways associated with cellular growth, differentiation, and response to stress, coupled with downregulation of apoptosis and inflammatory signaling, underscores HA35’s potential as a HM bioactive factor that protects against NEC. Further confirmation of these findings is ongoing in human preterm enteroid models.

mRNA stability prediction

ABSTRACT. SARS-CoV-2 required the rapid development of vaccines. While conventional vaccine design techniques struggled due to the extensive genomic characteristics of the virus, the advent of mRNA-based vaccines presented a promising alternative. In this study, we introduce a two-stage pipeline for mRNA stability prediction that harnesses a bidirectional GRU (bi-GRU) and three RNA secondary structure prediction tools: Vienna, Contrafold, and RNAstructure. The pipeline achieves a mean column-wise root mean square error (MCRMSE) as low as 0.3031. Additionally, on more than half of the sequences, the secondary structures predicted by the three bioinformatics tools yielded statistically significantly better stability predictions than the secondary structures supplied with the dataset; the three tools themselves performed comparably, with no statistically significant differences among them.
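For readers unfamiliar with the metric, MCRMSE is computed by taking the root mean square error of each target column separately and then averaging those values across columns. The sketch below is a minimal illustration of that definition; the function name and the row-by-column input layout are assumptions and are not taken from the authors' pipeline.

```python
import math


def mcrmse(y_true, y_pred):
    """Mean column-wise RMSE for two equally shaped lists of rows.

    Each row is one observation and each column is one target variable.
    The layout is an illustrative assumption.
    """
    n_rows = len(y_true)
    n_cols = len(y_true[0])
    column_rmse = []
    for j in range(n_cols):
        # Sum of squared errors for target column j over all rows.
        sq_err = sum((y_true[i][j] - y_pred[i][j]) ** 2 for i in range(n_rows))
        column_rmse.append(math.sqrt(sq_err / n_rows))
    # Average the per-column RMSE values.
    return sum(column_rmse) / n_cols
```

With two targets whose predictions are off by exactly 1 and 2 on every row, the per-column RMSE values are 1 and 2, so the function returns 1.5.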

PHViT4Lung: Parallel Hybrid Vision Transformers Augmented by Transfer Learning to Enhance Lung Cancer Diagnosis

ABSTRACT. Lung cancer poses a significant global health threat, underscoring the urgency for accurate early detection of lung nodules. Despite the central role of chest CT scans, their subjective evaluation by radiologists often yields low accuracy compared to post-surgery tests. To address this issue, the paper introduces PHViT4Lung, a pioneering hybrid framework merging parallel Transformers and CNNs, leveraging transfer learning to extract features from chest CT images. This work represents a continuation of the authors’ previous research [BioSmart, 2023], furthering their exploration of innovative techniques for improved lung cancer diagnosis and detection. The approach shows great promise, boasting impressive accuracies of over 99% in training, validation, and testing for classifying 1190 lung CT scans into normal, benign, and malignant categories.

Integrating Neural Networks into DNA Sequence-Based Predictions of RSS Functionality and Activity Level

Fast Multi-level Neural Networks to Overcome Bias in Healthcare Applications

ABSTRACT. Healthcare datasets often suffer from both class and racial imbalance, which leads to underperforming machine learning models for the underrepresented groups. To overcome the bias, we perform graph coarsening to reduce the size of the data, train a neural network on the resulting balanced dataset, and gradually refine the model using boundary points that estimate and modify the decision boundary of the trained network.

Investigating Microbial Community Functions in Chronic Bacterial Infections

ABSTRACT. Chronic polymicrobial infections (cPMIs) affect over 30 million people in the US and place a financial burden on healthcare systems. These complex polymicrobial communities harbor multiple bacterial species with a wide range of metabolic capacities. Although it has been known for over 150 years that cPMIs comprise multiple bacteria, previous studies have focused on describing the physiology of only a handful of pathogens, and knowledge about microbial community function in human infection is lacking. The lack of representative polymicrobial infection models for investigating the molecular mechanisms that drive microbial interactions also limits the study of microbial community functions in these cPMIs. To address these knowledge gaps, our initial work focused on analyzing 102 previously published metatranscriptomes collected from cystic fibrosis (CF) sputum and chronic wounds (CW) to identify key active bacterial members and community functions in these chronic infections. Community composition analysis revealed a high prevalence of pathogens, particularly Staphylococcus and Pseudomonas, and of anaerobic members of the microbiota, including Porphyromonas, Anaerococcus, and Prevotella. Functional profiling revealed that while bacterial competition, oxidative stress response, and virulence functions are conserved across chronic infection types, catabolic functions drive disease progression in the CW community, and the CF community is driven by biosynthetic processes. Taken together, these results show that the infection environment strongly influences bacterial physiology and that community structure influences function. Further investigations of the microbe-microbe interactions that may exist between key bacteria in chronic wound infection, using a polymicrobial community model, are ongoing.
Understanding the bacterial community functions and the microbe-microbe interactions that drive the expression of these functions is a critical step for developing novel therapeutics for these complex infections.

A computational framework for identifying T Cell Receptor specificity at single cell resolution

ABSTRACT. We present a novel approach to identifying and profiling T Cell Receptor (TCR) specificities in silico and to assessing their diversity and phenotype in an integrated, interactive web-based tool. The method can be applied to viral as well as cancer-specific epitopes using TCR sequencing data and single cell immune profiling assays. It implements metrics of sequence similarity, clonal expansion, and gene usage, and defines a novel metric for repertoire diversity based on graph-based network modularity optimization. We use published antigen- and epitope-specific TCR sequencing data of cancer neoepitopes presented on MHC-I in mice to extract and engineer CDR3 sequence-based features and train a machine learning classifier for TCR specificity prediction. Additionally, our method integrates single cell RNA-Seq functional analysis for each tested epitope with its corresponding specificity and repertoire features.
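To make the sequence similarity and clonal expansion metrics concrete, the sketch below shows one common way such quantities can be computed from CDR3 amino acid sequences. The abstract does not specify its metrics, so the use of Levenshtein edit distance, the distance threshold, the function names, and the example sequences in the test data are all illustrative assumptions. The resulting edge list is the kind of similarity graph on which a modularity-based analysis could then be run.

```python
from collections import Counter


def edit_distance(a, b):
    """Levenshtein distance between two sequences (assumed similarity metric)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            # Minimum over deletion, insertion, and substitution.
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def clonal_frequencies(cdr3_per_cell):
    """Fraction of cells carrying each CDR3 sequence (a simple expansion proxy)."""
    counts = Counter(cdr3_per_cell)
    total = len(cdr3_per_cell)
    return {seq: n / total for seq, n in counts.items()}


def similarity_edges(clonotypes, max_distance=1):
    """Connect pairs of distinct clonotypes within the edit-distance threshold.

    The threshold of 1 is a hypothetical default, not a value from the abstract.
    """
    edges = []
    for i, a in enumerate(clonotypes):
        for b in clonotypes[i + 1:]:
            if edit_distance(a, b) <= max_distance:
                edges.append((a, b))
    return edges
```

Pairwise comparison is quadratic in the number of clonotypes, which is acceptable for a sketch but would likely need indexing or clustering heuristics on large repertoires.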