ABSTRACT. Principal Component Analysis (PCA) has long been a cornerstone of dimensionality reduction for high-dimensional data, including single-cell RNA sequencing (scRNA-seq). However, PCA's performance typically degrades as data size grows, it can be sensitive to outliers, and it assumes linearity. Recently, Random Projection (RP) methods have emerged as promising alternatives that address some of these limitations. This study systematically evaluates PCA approaches, including Singular Value Decomposition (SVD) and randomized SVD, alongside Sparse and Gaussian Random Projection algorithms, with a focus on computational efficiency and downstream analysis effectiveness. We benchmark performance on multiple publicly available scRNA-seq datasets, both labeled and unlabeled. We apply Hierarchical Clustering and Spherical K-Means clustering algorithms to assess downstream clustering quality. For labeled datasets, clustering accuracy is measured using the Hungarian algorithm and Mutual Information. For unlabeled datasets, the Dunn Index and Gap Statistic capture cluster separation. Across both dataset types, the Within-Cluster Sum of Squares (WCSS) metric is used to assess variability. Additionally, locality preservation is examined, with RP outperforming PCA on several of the evaluated metrics.
Our results demonstrate that RP not only surpasses PCA in computational speed but also rivals and, in some cases, exceeds PCA in preserving data variability and clustering quality. By providing a thorough benchmarking of PCA and RP methods, this work offers valuable insights into selecting optimal dimensionality reduction techniques, balancing computational performance, scalability, and the quality of downstream analyses.
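As a minimal illustration of the methods compared above (a sketch using scikit-learn on a synthetic stand-in for an expression matrix, not the study's benchmarking code), the following reduces a cells-by-genes matrix to 50 dimensions with randomized-SVD PCA and with the two RP variants:

```python
# Minimal sketch: PCA (randomized SVD) vs. Gaussian/Sparse random projection
# on a cells-by-genes matrix X, e.g., a normalized scRNA-seq expression matrix.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.random_projection import GaussianRandomProjection, SparseRandomProjection

rng = np.random.default_rng(0)
X = rng.poisson(1.0, size=(5000, 2000)).astype(float)  # stand-in for expression data

k = 50  # target dimensionality
X_pca = PCA(n_components=k, svd_solver="randomized").fit_transform(X)
X_grp = GaussianRandomProjection(n_components=k, random_state=0).fit_transform(X)
X_srp = SparseRandomProjection(n_components=k, dense_output=True, random_state=0).fit_transform(X)

print(X_pca.shape, X_grp.shape, X_srp.shape)  # all (5000, 50)
```

RP methods only multiply the data by a random matrix and never compute a covariance decomposition, which is the source of the speed advantage reported here.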
DuoHash: fast hashing of spaced seeds with application to spaced k-mers counting
ABSTRACT. Alignment-free genomic sequence analysis has facilitated high-throughput processing within numerous bioinformatics workflows. A central task in alignment-free applications is hashing $k$-mers, commonly used for indexing, querying, and fast similarity searches. Recently, spaced seeds, patterns designed to accommodate errors or mutations, have increasingly replaced contiguous $k$-mers, enhancing sensitivity in various applications. However, spaced seed hashing is computationally intensive, introducing significant delays.
This paper addresses the challenge of efficient spaced seed hashing and presents DuoHash, a framework that enables the efficient computation of several hash functions. Our experimental results demonstrate that the proposed method substantially outperforms existing algorithms, achieving speedups of up to 11x. To illustrate practical utility, we further applied DuoHash to the problem of counting spaced $k$-mers. The code of DuoHash is available at \url{https://github.com/leonardoGemin/DuoHash}.
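To make the task concrete, here is the naive window-by-window baseline for spaced-seed hashing (our illustration with a hypothetical helper; DuoHash's speedups come precisely from avoiding this recomputation across overlapping windows):

```python
# Naive spaced-seed hashing sketch: for each window of the sequence, keep only
# the symbols at the seed's match ('1') positions and hash them. Don't-care
# ('0') positions are skipped, which is what tolerates errors or mutations.
def spaced_seed_hashes(seq, seed):
    match_pos = [i for i, c in enumerate(seed) if c == '1']
    span = len(seed)
    for start in range(len(seq) - span + 1):
        spaced_kmer = ''.join(seq[start + i] for i in match_pos)
        yield start, hash(spaced_kmer)  # any hash function can be plugged in here

for pos, h in spaced_seed_hashes("ACGTACGTACGT", "1101"):
    print(pos, h)
```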
ABSTRACT. Durbin's positional Burrows-Wheeler transform (PBWT) enables algorithms with the optimal time complexity of $O(MN)$ for reporting all-versus-all haplotype matches in a population panel with $M$ haplotypes and $N$ variant sites. However, even this may be too slow when the number of haplotypes reaches millions. To further reduce the run time, this paper introduces HP-PBWT (haplotype-based parallel PBWT), a parallel version of the PBWT algorithms for all-versus-all haplotype matching. HP-PBWT executes the PBWT in parallel by splitting a haplotype panel into blocks of haplotypes. The HP-PBWT algorithms parallelize PBWT construction, reporting of all-versus-all L-long matches, and reporting of all-versus-all set-maximal matches while maintaining memory efficiency. HP-PBWT has $O((\frac{M}{T}+T)N)$ time complexity for PBWT construction, and $O((\frac{M}{T}+T+c^*)N)$ time complexity for reporting all-versus-all L-long matches and all-versus-all set-maximal matches, where $T$ is the number of threads and $c^*$ is the maximum number of matches per haplotype per site (matches of length $L$ for L-long matching, or matches at the maximum divergence value for set-maximal matching). HP-PBWT achieves a 4-fold speed-up on UK Biobank genotyping array data with 60 threads in IO-included benchmarks. When applied to a dataset of 8 million randomized haplotypes (random binary strings of equal length) in IO-excluded benchmarks, it achieves a 22-fold speed-up with 60 cores on an Amazon EC2 server. With further hardware optimization, HP-PBWT is expected to handle billions of haplotypes efficiently.
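For reference, the serial recurrence that HP-PBWT parallelizes is Durbin's positional-prefix-array update; a minimal sketch (assuming a binary panel, and omitting the divergence arrays and match reporting):

```python
# Serial PBWT positional-prefix-array construction (after Durbin's Algorithm 1).
# HP-PBWT splits the panel into blocks of haplotypes and runs this update in
# parallel; only the single-threaded recurrence is shown here.
def pbwt_prefix_arrays(panel):
    """panel: list of M binary haplotype sequences, each of length N."""
    M, N = len(panel), len(panel[0])
    a = list(range(M))               # positional prefix array before site 0
    arrays = [a]
    for k in range(N):
        zeros = [i for i in a if panel[i][k] == 0]
        ones = [i for i in a if panel[i][k] == 1]
        a = zeros + ones             # stable partition by allele at site k
        arrays.append(a)
    return arrays

panel = [[0, 1, 0], [1, 1, 0], [0, 0, 1], [0, 1, 1]]
for k, a in enumerate(pbwt_prefix_arrays(panel)):
    print(f"site {k}: {a}")
```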
Fast and Succinct Compression of k-mer Sets with Plain Text Representation of Colored de Bruijn Graphs
ABSTRACT. A fundamental operation in computational genomics is the reduction of input sequences into their constituent $k$-mers. Designing space-efficient ways to represent a $k$-mer collection is essential to improve the scalability of bioinformatics analyses. A widely used approach involves converting the $k$-mer set into a de Bruijn graph and then producing a compact plain text representation by identifying the minimum path cover.
In this article, we present USTAR-CR, a novel algorithm for compressing multiple $k$-mer sets. USTAR-CR leverages node connectivity principles in the colored de Bruijn graph to obtain a more compact plain text representation, combined with an efficient encoding of $k$-mer colors. We tested USTAR-CR on real read datasets and compared it with the state-of-the-art tool GGCAT. USTAR-CR demonstrated superior compression performance, requiring less memory and being significantly faster (up to 51x).
\url{https://github.com/enricorox/USTAR-CR}
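As background, a minimal sketch of the node-centric de Bruijn graph on which such path covers are computed (illustrative only; USTAR-CR's connectivity heuristics and color encoding are not reproduced here):

```python
# Node-centric de Bruijn graph sketch: nodes are k-mers, and an edge u -> v
# exists when u's (k-1)-suffix equals v's (k-1)-prefix. A path cover of this
# graph lets the k-mer set be written as a small set of plain text strings:
# each path of p k-mers becomes one string of length p + k - 1.
from collections import defaultdict

def debruijn_edges(kmers):
    by_prefix = defaultdict(list)
    for v in kmers:
        by_prefix[v[:-1]].append(v)
    for u in kmers:
        for v in by_prefix[u[1:]]:
            yield u, v

kmers = {"ACG", "CGT", "GTA", "TAC"}
for u, v in debruijn_edges(kmers):
    print(u, "->", v)
```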
Phylodynamics Analysis of HIV Epidemic History in Belarus in 1987-2022
ABSTRACT. This work presents the first systematic genomic epidemiology analysis of the HIV epidemic in Belarus, an Eastern European country that, like much of Eastern Europe and the Post-Soviet region, has been largely understudied in relation to HIV epidemics. A total of 867 HIV sequences collected nationwide between January 2018 and May 2022 were analyzed using phylogenetic and phylodynamic methods. The findings reveal two distinct epidemic waves spanning 1997–2005 and 2009–2018, each driven by different dominant modes of transmission. The study also identifies potential introduction and intra-country transmission routes, emphasizing the pivotal role of the capital city and eastern industrial hubs in shaping the epidemic’s trajectory. This work addresses an important gap in understanding HIV dynamics in Eastern Europe.
Benchmarking metagenomic software with the new CAMI web portal
ABSTRACT. To enable objective and comprehensive benchmarking of metagenomic software, the community-led initiative for the Critical Assessment of Metagenome Interpretation (CAMI) promotes community-agreed standards and good practices by providing comprehensive benchmark datasets for developers to assess their software on, together with evaluation software, benchmarking guidelines, and challenges. However, assessing methods in between challenges and comparing their performance to that of other techniques still requires substantial time and technical expertise. We introduce the new CAMI benchmarking web portal, which frees users from the need to install and execute other methods and evaluation software for a comprehensive assessment of individual tools.
Model Selection for Sparse Microbial Network Inference using Variational Approximation
ABSTRACT. Microbial communities are often composed of taxa from different taxonomic groups.
The associations among the constituent members in a microbial community play an important role in determining the functional characteristics of the community, and these associations can be modeled using an edge-weighted graph (microbial network). A microbial network is typically inferred from a sample-taxa matrix that is obtained by sequencing multiple biological samples and identifying the taxa abundance in each sample. Motivated by microbiome studies that involve a large number of samples collected across a range of study parameters, here we consider the computational problem of identifying the number of microbial networks underlying the observed sample-taxa abundance matrix. Specifically, we consider determining the number of sparse microbial networks in this setting. We use a mixture model framework to address this problem, and present formulations to model both count data and proportion data. We propose several variational approximation based algorithms that allow the incorporation of the sparsity constraint while estimating the number of components in the mixture model. We evaluate these algorithms on a large number of simulated datasets generated using a collection of different graph structures (band, hub, cluster, random, and scale-free).
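In a generic formulation (our notation, not necessarily the authors'), the mixture model has the form

\[
p(x_i) \;=\; \sum_{k=1}^{K} \pi_k \, f(x_i \mid \Theta_k), \qquad \sum_{k=1}^{K} \pi_k = 1,
\]

where $f$ is a count likelihood (e.g., for sequencing read counts) or a proportion likelihood (e.g., for relative abundances), each component's microbial network is the graph of nonzero off-diagonal entries of its parameter matrix $\Theta_k$, sparsity enters through an $\ell_1$-type penalty on $\Theta_k$, and variational approximation is used to estimate the posterior over component assignments and the number of components $K$.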
ABSTRACT. Breast cancer remains a significant global health concern, and machine learning algorithms and computer-aided detection systems have shown great promise in enhancing the accuracy and efficiency of mammography image analysis. However, there is a critical need for large benchmark datasets for training deep learning models for breast cancer detection. In this work, we developed Mammo-Bench, a large-scale benchmark dataset of mammography images, by collating data from seven well-curated resources, viz., DDSM, INbreast, KAU-BCMD, CMMD, CDD-CESM, DMID, and the RSNA Screening Dataset. To ensure consistency across images from diverse sources while preserving clinically relevant features, we propose a preprocessing pipeline that includes breast segmentation, pectoral muscle removal, and intelligent cropping. The dataset consists of 74,436 high-quality mammographic images from 26,500 patients across 8 countries and is, to the best of our knowledge, one of the largest open-source mammography databases. To show the efficacy of training on the large dataset, the performance of a ResNet101 architecture was evaluated on Mammo-Bench and compared with results from training independently on a few member datasets and on an external dataset, VinDr-Mammo. Accuracies of 78.8% (with data augmentation of the minority classes) and 77.8% (without data augmentation) were achieved on the proposed benchmark dataset, compared with accuracies of 25–69% on the other datasets. Noticeably, improved prediction of the minority classes is observed with the Mammo-Bench dataset. These results establish baseline performance and demonstrate Mammo-Bench's utility as a comprehensive resource for developing and evaluating mammography analysis systems.
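A hedged sketch of the kind of segmentation-and-cropping step described above (our illustration using OpenCV's Otsu thresholding and largest-component cropping; the actual Mammo-Bench pipeline, including pectoral muscle removal, is more involved):

```python
# Illustrative breast segmentation + cropping: Otsu-threshold the mammogram,
# keep the largest connected component (the breast), crop to its bounding box.
import cv2
import numpy as np

def segment_and_crop(img_gray):
    _, mask = cv2.threshold(img_gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    breast = max(contours, key=cv2.contourArea)     # largest foreground component
    x, y, w, h = cv2.boundingRect(breast)
    return img_gray[y:y + h, x:x + w]

img = np.zeros((100, 100), dtype=np.uint8)
img[20:90, 10:60] = 180                             # toy "breast" region
print(segment_and_crop(img).shape)                  # (70, 50)
```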
ABSTRACT. Wastewater genomic surveillance of SARS-CoV-2 emerged as a scalable, cost-effective, passive surveillance tool to monitor viral variants circulating in the human population. However, accurate estimation of viral lineage prevalence in communities critically relies on the performance of computational methods for analyzing wastewater sequencing data. We perform a comprehensive benchmarking of bioinformatics methods designed for quantifying SARS-CoV-2 (sub)lineages from wastewater sequencing data, along with RNA-Seq and metagenomics methods repurposed for this task. We systematically compare the accuracy of these methods in estimating the relative abundances of viral (sub)lineages, including closely related and low-abundance (sub)lineages. Preliminary results on simulated error-free Illumina reads and a few computational methods show that the repurposed RNA-Seq methods RSEM, Kallisto, and Salmon, which use a read classification approach and the expectation-maximization (EM) algorithm, consistently achieve higher accuracy than the wastewater-surveillance methods Alcov and PiGx, which use a deconvolution approach employing the least squares method. In particular, RSEM, Kallisto, and Salmon show higher accuracy (lower L1 error) on mixtures of lineages (L1 errors of 0.1%, 0.1%, and 1.8%, respectively) and especially on mixtures of closely related (sub)lineages (3.8%, 3.9%, and 4.9%) when compared to the wastewater-surveillance methods Alcov and PiGx (lineages: 2.4% and 8.3%, respectively; (sub)lineages: 18.5% and 42.2%). The distribution of the prediction errors follows the same pattern. Thus, for the mixtures of lineages, all prediction errors for RSEM and Kallisto are smaller than 0.005%, while for Salmon, Alcov, and PiGx, roughly 100%, 97%, and 87% of the errors, respectively, are smaller than 0.04%. By contrast, for the mixtures of closely related (sub)lineages, roughly 85% of the absolute error values for RSEM, Kallisto, and Salmon are lower than 0.04%, compared to approximately 75% for Alcov and only 25% for PiGx.
We will extend this benchmark to a total of 21 identified methods and contrast their performance on simulated short- and long-read data, assessing the impact of varying amplicon/read lengths and sequencing error rates. We will additionally benchmark the methods on simulated wastewater sequencing data, accounting simultaneously for effects such as variable amplicon abundance and amplicon dropout, RNA degradation, and PCR and sequencing errors. Importantly, we will determine which of the evaluated bioinformatics methods can be repurposed to quantify (sub)lineages of the human respiratory syncytial virus (hRSV) and compare their accuracy in this context.
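For clarity, the L1 error reported above is, under the standard definition,

\[
L_1(\hat{a}, a) \;=\; \sum_{\ell=1}^{L} \lvert \hat{a}_\ell - a_\ell \rvert,
\]

where $a_\ell$ and $\hat{a}_\ell$ are the true and estimated relative abundances of (sub)lineage $\ell$ in the mixture.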
ABSTRACT. The microbial world plays a fundamental role in shaping Earth's biosphere, steering global processes such as carbon and nitrogen cycling, soil rejuvenation, and ecological fortification. An overwhelming majority of microbial entities, however, remain unstudied. Metagenomics stands to elucidate this microbial “dark matter” by directly sequencing the microbial community DNA from environmental samples. Yet, our ability to explore these metagenomic sequences is limited to establishing their similarity to curated datasets of organisms or genes/proteins. Aside from the difficulties in establishing such similarity, the reference-based approaches, by definition, forgo discovery of any entities sufficiently unlike the reference collection.
Presenting a paradigm shift, language model-based methods offer promising avenues for reference-free analysis of metagenomic reads. In this talk, we will introduce two language models: REMME, a pretrained foundation model aimed at understanding the DNA context of metagenomic reads, and REBEAN, a fine-tuned model for predicting the enzymatic potential encoded within the read-corresponding genes. By emphasizing function over gene identification, REBEAN is able to label known functions carried both by previously explored genes and by new (orphan) sequences. Furthermore, even though it is not explicitly trained to do so, REBEAN identifies the functionally relevant parts of a gene. Our models demonstrate the potential for metagenomic read annotation and the unearthing of novel enzymes, thus enriching our understanding of microbial communities.
Probabilistic Assembly and Uncertainty Quantification of Polyploid Haplotype Phase
ABSTRACT. The computation of haplotypes from sequencing data, i.e., haplotype assembly, is an important component of foundational molecular and population genetics problems, including interpreting the effects of genetic variation on complex traits and reconstructing genealogical relationships. Assembling the haplotypes of polyploid genomes remains a significant challenge due to the exponential search space of haplotype phasings and read-assignment ambiguity; the latter challenge is particularly difficult for polyploid haplotype assemblers, since the information contained within the observed sequence reads is insufficient for unambiguous haplotype assignment in the presence of identical haplotype segments. We present a probabilistic haplotype assembler for diploid and polyploid genomes that can leverage the uncertain evidence from sequence reads of any length. The joint distribution over haplotype phasings is modeled by a Markov random field whose conditional independence structure is defined by observed haplotype segments (e.g., from DNA-seq). We develop graph-theoretic algorithms to transform the Markov random field into a representation that enables statistical inference and uncertainty quantification despite an exponential space of possible phasings. Moreover, we develop forward filtering backward sampling and clique decomposition algorithms to efficiently sample haplotype phasings and quantify uncertainty over these large state spaces. Our benchmarking of polyploid haplotype assemblers on synthetic and experimental data demonstrates the utility of maintaining explicit discrete distributions over large combinatorial spaces when subject to uncertain evidence.
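In generic Markov random field notation (ours, not necessarily the authors'), the model is

\[
P(H \mid R) \;\propto\; \prod_{c \in \mathcal{C}} \psi_c(H_c),
\]

where $H$ assigns alleles at heterozygous sites to the $k$ haplotypes of a $k$-ploid genome, each clique $c$ couples the sites jointly covered by an observed haplotype segment, and the potentials $\psi_c$ score the uncertain read evidence; forward filtering backward sampling over a clique decomposition then draws phasings from this posterior.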
DCCNV: Enhanced Copy Number Variation Detection in Single-Cell Sequencing Using Diffusion Processes and Contrastive Learning
ABSTRACT. Detecting copy number variations (CNVs) in single-cell DNA sequencing (scDNA-seq) data is challenging due to substantial noise and variability. To address this, we present DCCNV, a novel method that integrates diffusion processes, contrastive learning, and circular binary segmentation (CBS) for reliable CNV detection. Our method employs adaptive k-nearest neighbors (KNN) and multi-scale diffusion to reduce noise while preserving key biological signals, followed by contrastive learning to distinguish true genomic alterations from technical noise. The CBS algorithm is then used to partition the enhanced signals into discrete copy number segments.
We compared the performance of DCCNV with that of several current single-cell CNV detection methods, including DeepCopy, rcCAE, SCOPE, SCONE, HMMcopy, and SeCNV, as well as with filtering-based CNV detection approaches that employ commonly used filters such as Wavelet, Median, and Gaussian filters. This comparison was conducted using both simulated and real data.
The results show that DCCNV outperforms these approaches in terms of accuracy and computational efficiency.
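As an illustration of the KNN-plus-diffusion denoising idea (a sketch under our own assumptions, not the DCCNV algorithm; its adaptive KNN, contrastive learning, and CBS stages are not reproduced here):

```python
# Illustrative diffusion smoothing of per-cell copy-number signals: build a
# KNN graph over cells, row-normalize it into a transition matrix P, and
# average signals over t diffusion steps (several values of t = multi-scale).
import numpy as np
from sklearn.neighbors import kneighbors_graph

def diffuse(X, k=10, t=3):
    """X: cells x bins matrix of raw copy-number signals."""
    A = kneighbors_graph(X, n_neighbors=k, mode="connectivity").toarray()
    A = np.maximum(A, A.T)                    # symmetrize the KNN graph
    P = A / A.sum(axis=1, keepdims=True)      # row-stochastic transition matrix
    Xs = X.copy()
    for _ in range(t):
        Xs = P @ Xs                           # one diffusion (smoothing) step
    return Xs

X = np.random.default_rng(0).normal(2.0, 0.5, size=(200, 50))
print(diffuse(X).shape)  # (200, 50)
```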
Leveraging RNA LLMs for 3D Structure Prediction via Novel Data Augmentation
ABSTRACT. Ribonucleic acid (RNA) is a complex macromolecule essential for the proper functioning of living organisms. Understanding its three-dimensional (3D) structure is critical for elucidating its cellular roles. However, computational prediction of RNA 3D structures remains a significant challenge due to the vast conformational space that RNA molecules can adopt. Although machine learning, particularly deep learning-based methods, has recently shown promise in this area, progress has been hindered by the limited availability of large datasets containing native RNA structures for model training.
In this study, we leverage pre-trained RNA large language models to predict 3D conformations directly from RNA sequences. To address the issue of data scarcity, we propose a novel data augmentation method tailored for RNA 3D structure prediction. Our approach focuses on predicting backbone conformations to evaluate the method's efficacy. Preliminary results demonstrate promising accuracy, with predicted structures achieving average root-mean-square deviations (RMSDs) of approximately 3.5 Å when compared to native 3D structures in the Protein Data Bank (PDB). These findings highlight the potential of integrating RNA large language models and data augmentation strategies for advancing RNA structure and function prediction.
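For reference, the RMSD evaluation quoted above is standardly computed after optimal superposition; a minimal sketch (our illustration, assuming matched atom orderings):

```python
# Kabsch superposition + RMSD: optimally rotate the predicted backbone onto
# the native one, then compute root-mean-square deviation over atoms.
import numpy as np

def kabsch_rmsd(P, Q):
    """P, Q: (n_atoms, 3) coordinates for predicted and native structures."""
    P = P - P.mean(axis=0)
    Q = Q - Q.mean(axis=0)
    U, _, Vt = np.linalg.svd(P.T @ Q)
    d = np.sign(np.linalg.det(U @ Vt))        # guard against reflections
    R = U @ np.diag([1.0, 1.0, d]) @ Vt       # optimal rotation
    return float(np.sqrt(np.mean(np.sum((P @ R - Q) ** 2, axis=1))))

rng = np.random.default_rng(0)
native = rng.normal(size=(30, 3))
pred = native + rng.normal(scale=0.5, size=(30, 3))
print(f"RMSD: {kabsch_rmsd(pred, native):.2f} Å")
```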
Nanopore metagenomics as a rapid diagnostic for emerging high-threat bacterial pathogens on environmental interfaces
ABSTRACT. Metagenomics has opened a new window into understanding the genomic diversity of microbial communities, with a variety of sequencing methods and open-source analysis tools. A key challenge in applying metagenomics to the rapid, early detection of emerging pathogens on the nanopore platform is sifting signal from noise, such that read hits exceed the signal-to-noise threshold ($P(\text{read hits}) > S/N$) and indicate a real outbreak rather than background levels of a microbial genome or sequencing errors. We have developed and tested custom scripts and open-source tools for nanopore metagenomic analysis of microbial communities and clinical specimens from animals in Alaska, including living and stranded marine mammals, seabirds, and environmental (wastewater) specimens. We distinguish overlapping kraken2 read-classifier assignments to conserved genomic loci from bona fide pathogen detection, the latter established by reference-based assembly of reads landing across a pathogen genome or in specific virulence loci. Using these training data, we are developing criteria for genomics-based risk assessment of high-threat bacterial pathogens in Alaska and the Arctic, on land and at sea.
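One simple way to formalize such a signal-versus-noise decision (our illustration only, not the authors' exact criterion) is to flag a taxon when its observed read hits are improbable under a background model:

```python
# Illustrative detection rule: flag a taxon as a candidate detection when the
# observed read hits are improbable under a Poisson model of background hits
# (contamination, classifier misassignment, and sequencing errors combined).
from scipy.stats import poisson

def is_signal(observed_hits, expected_background, alpha=1e-3):
    # P(X >= observed | background) = survival function evaluated at observed - 1
    p = poisson.sf(observed_hits - 1, expected_background)
    return p < alpha, p

flag, p = is_signal(observed_hits=25, expected_background=3.0)
print(flag, f"p = {p:.2e}")
```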
AmpliconHunter: A Scalable Tool for Accurate Amplicon Prediction from Microbiome Samples using Degenerate Primers
ABSTRACT. Sequencing of PCR amplicons generated using degenerate primers (typically targeting a region of the 16S ribosomal gene) is widely used in metagenomics to profile the taxonomic composition of complex microbial samples. To reduce taxonomic biases in primer selection, it is important to conduct in silico PCR analyses of the primers against large collections of up to millions of bacterial genomes. However, existing in silico PCR tools have impractical running times for analyses at this scale. In this paper, we introduce AmpliconHunter, a highly scalable in silico PCR package distributed as an open-source command-line tool and publicly available through a user-friendly web interface at https://ah1.engr.uconn.edu/. AmpliconHunter implements an accurate nearest-neighbor model for melting temperature calculations, allowing for mismatches between primers and target genomes, along with three complementary methods for estimating off-target amplification. By taking advantage of multi-core parallelism and SIMD operations available on modern CPUs, the AmpliconHunter web server can complete in silico PCR analyses of commonly used degenerate primer pairs against the 2.4M genomes in the latest AllTheBacteria collection in under 7 hours.
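A hedged sketch of two ingredients of this kind of analysis, expanding IUPAC degeneracy codes into concrete primer variants and scoring each with a nearest-neighbor melting temperature (here via Biopython's Tm_NN; AmpliconHunter's mismatch-aware Tm model and scalable search are not shown):

```python
# Expand a degenerate primer into its concrete variants and compute a
# nearest-neighbor melting temperature for each (perfect-match Tm only).
from itertools import product
from Bio.SeqUtils import MeltingTemp as mt

IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "R": "AG", "Y": "CT",
         "S": "GC", "W": "AT", "K": "GT", "M": "AC", "B": "CGT",
         "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT"}

def expand(degenerate_primer):
    for combo in product(*(IUPAC[b] for b in degenerate_primer)):
        yield "".join(combo)

for variant in expand("GTGYCAGCMGCCGCGGTAA"):  # widely used 16S 515F primer
    print(variant, f"Tm = {mt.Tm_NN(variant):.1f} °C")
```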
Opening an icy black box: diversity and functioning of soil bacteria in Antarctica’s largest ice-free region
ABSTRACT. The McMurdo Dry Valleys (MDV) of Antarctica possess a uniquely pristine model microbial soil ecosystem for understanding fundamental ecological phenomena; the impact of a warming climate on ecosystem functioning, community structure, and composition; and the dynamics of adaptation. This system is an ideal model because, first, perennial low temperatures, low soil moisture, frequent freeze-thaw cycles, short growing seasons, oligotrophy, high salinity and pH, and geographical isolation have reduced biodiversity and biomass in the MDV such that bacteria and microeukaryotes predominate and vascular plants and macrofauna are absent. Second, despite being one of the driest terrestrial sites on Earth, the MDV is surrounded by large bodies of ice (in the form of lake ice, glaciers, and ice sheets) that will release an increasing amount of liquid water into the system as it warms. Despite the scientific value of this model system, we still know little about the functional ecology of its biota, especially the bacteria. Here, we analyze the bacterial taxonomic and functional diversity of 18 previously generated shotgun metagenomes using the metagenome processing pipeline VEBA. We recovered 832 medium-to-high quality MAGs (completeness > 50% and contamination < 10%) and 207 high quality MAGs (completeness > 80% and contamination < 10%). We found that 1) similar to previous amplicon-sequencing based approaches, Actinomycetota (Actinobacteria) predominate in the more extreme, drier soils; 2) MDV soil bacteria mapped poorly to standard reference bacterial genome libraries and therefore may be either taxonomically novel or genotypically distinct from non-Antarctic sister lineages; 3) genes and gene pathways associated with virulence factors, antibiotic synthesis, photosynthesis, and nitrogen species processing were rare; and 4) genes associated with virulence factors and extremotolerance did not appear to vary along gradients of extremeness, presumably because sites did not differ significantly in their abiotic parameters. This work lays the groundwork for a more comprehensive understanding of this model ecosystem's functioning and ecological resilience, as well as the autecology of its bacterial taxa.
Assessment of Host and Bacterial Depletion Methods to Enhance RNA Virus Detection by Next-Generation Sequencing
ABSTRACT. Accurate viral detection using untargeted next-generation sequencing (NGS) often requires the effective removal of abundant host and bacterial reads to ensure comprehensive genome coverage. In this study, we evaluated multiple depletion methods and their impact on RNA virus detection using NGS. Specifically, we compared our established rRNA depletion protocol with filtration, benzonase, and neuraminidase treatments, both individually and in combination. Additionally, we assessed the benefits of DNase pretreatment prior to rRNA depletion for enhanced results. Clinical swabs were collected from specific-pathogen-free chickens infected with the highly pathogenic H5N1 avian influenza virus. RNA extracts, both treated and untreated, were subjected to sequence-independent, single-primer amplification. Subsequently, NGS libraries were prepared using the Illumina DNA Prep kit and sequenced with the 600-cycle MiSeq Reagent Kit v3 on an Illumina MiSeq instrument.
Our results revealed that all depletion methods effectively reduced host-specific reads compared to untreated samples in both oral and cloacal swabs. However, viral read recovery and increased genome breadth of coverage were not consistent across all treatments. Cloacal swabs showed the most significant improvement, with an increased breadth of viral genome coverage observed for all depletion methods. In contrast, oral swabs exhibited reduced genome coverage despite successful host-read depletion after most treatments, except for benzonase, which slightly increased genome coverage, though the increase was not statistically significant. DNase pretreatment before rRNA depletion did not provide significant improvement in cloacal swabs and negatively affected oral swabs by reducing viral read recovery and genome breadth of coverage.
Overall, oral swabs did not benefit from any tested depletion treatment. However, our established rRNA depletion protocol and swab filtration with 0.45 μm filters demonstrated statistically significant (p < 0.0005) improvements in viral genome breadth for cloacal swabs. Even though both methods performed similarly, our rRNA method also preserved an approximately 300 bp region of depleted bacterial reads, enabling further metagenomic analysis, which could be crucial for identifying co-infections in disease settings. These findings emphasize the necessity of tailoring depletion strategies to specific sample types to achieve optimal results and advance RNA virus detection through NGS.
Earth Mover's Distance to recognize duplicate datasets in an automated molecular epidemiology system
ABSTRACT. The GHOST system accepts samples from many different laboratories. Users regularly upload samples that have been sequenced more than once. However, the staff of the GHOST system do not have access to metadata that would indicate this has happened. These duplicates complicate the analysis of GHOST results. Because of the inherent noise of the laboratory sequencing process, recognizing such duplication is not straightforward. We investigate the use of the Earth Mover's Distance between frequency distributions of GHOST datasets to find these duplicates in the system.
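As a minimal illustration (our sketch with synthetic data, not the GHOST pipeline), SciPy's `wasserstein_distance` computes the 1D Earth Mover's Distance between two empirical distributions; near-zero distances between per-dataset frequency distributions suggest candidate duplicates:

```python
# EMD-based duplicate screening sketch: a re-sequenced sample yields a noisy
# copy of the same frequency distribution, so its EMD to the original is small.
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
run1 = rng.normal(0.30, 0.05, 1000)          # e.g., per-site variant frequencies
run2 = run1 + rng.normal(0, 0.01, 1000)      # re-sequenced: noisy copy of run1
other = rng.normal(0.45, 0.05, 1000)         # unrelated sample

print(wasserstein_distance(run1, run2))      # small -> likely duplicate
print(wasserstein_distance(run1, other))     # larger -> distinct samples
```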
Applying Genetic Algorithm with Saltations to MAX-3SAT
ABSTRACT. Solving the MAX-3SAT problem, a classic NP-hard optimization challenge with numerous real-world applications, poses significant computational difficulties. While genetic algorithms (GAs) have proven effective in tackling such problems, their tendency to converge prematurely on suboptimal solutions often hampers their performance. To address this, we propose an enhanced genetic algorithm with evolutionary jumps (GA+EJ). Inspired by biological systems in which correlated mutations allow viral populations to escape fitness plateaus, evolutionary jumps introduce statistically linked modifications to the population, enabling the algorithm to explore distant, high-fitness regions of the search space. Using the CliqueSNV algorithm, originally developed to detect minority haplotypes in noisy sequencing data, these jumps help prevent stagnation and enhance the diversification of SAT solutions. Additionally, to accelerate computation, we parallelize the genetic algorithm across multiple CPUs, achieving significant speed-ups in execution time, with plans to extend this parallelization to GPUs using the CUDA framework. Preliminary results show that GA+EJ consistently outperforms traditional (parallelized) GAs, solving 5–9% more clauses across small (20 variables, 91 clauses), medium (50 variables, 218 clauses), and large (250 variables, 1065 clauses) problem instances. These findings highlight the potential of GA+EJ, especially when combined with parallel computing, as an efficient method for solving MAX-3SAT and similar optimization problems.
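For concreteness, a minimal plain-GA baseline for MAX-3SAT (our sketch; the evolutionary jumps and CliqueSNV-derived linked mutations of GA+EJ are not reproduced here):

```python
# Plain GA for MAX-3SAT. A clause is 3 nonzero ints: +v means variable v,
# -v means its negation; fitness = number of satisfied clauses.
import random

def satisfied(clauses, assign):
    return sum(any((lit > 0) == assign[abs(lit) - 1] for lit in c) for c in clauses)

def ga_max3sat(clauses, n_vars, pop_size=50, generations=200, mut_rate=0.02):
    pop = [[random.random() < 0.5 for _ in range(n_vars)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda a: satisfied(clauses, a), reverse=True)
        elite = pop[: pop_size // 2]                  # truncation selection
        children = []
        while len(children) < pop_size - len(elite):
            p1, p2 = random.sample(elite, 2)
            cut = random.randrange(1, n_vars)         # one-point crossover
            child = p1[:cut] + p2[cut:]
            child = [not b if random.random() < mut_rate else b for b in child]
            children.append(child)
        pop = elite + children
    best = max(pop, key=lambda a: satisfied(clauses, a))
    return best, satisfied(clauses, best)

clauses = [(1, -2, 3), (-1, 2, 4), (2, -3, -4), (-2, 3, 4)]
_, score = ga_max3sat(clauses, n_vars=4)
print(f"{score}/{len(clauses)} clauses satisfied")
```

A plain GA like this is exactly the baseline that stagnates on fitness plateaus; evolutionary jumps add statistically linked multi-bit moves intended to escape them.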
ABSTRACT. Molecular sequence analysis is crucial for comprehending several biological processes, including protein-protein interactions, functional annotation, and disease classification. The large number of sequences and the inherently complicated nature of protein structures make it challenging to analyze such data. Finding patterns and enhancing subsequent research requires the use of dimensionality reduction and feature selection approaches.
Recently, a method called Correlated Clustering and Projection (CCP) has been proposed as an effective dimensionality reduction method for biological sequencing data.
The CCP technique is still costly to compute even though it is effective for sequence visualization. Furthermore, its utility for classifying molecular sequences is still uncertain. To solve these two problems, we present a Nearest Neighbor Correlated Clustering and Projection (CCP-NN)-based technique for efficiently preprocessing molecular sequence data.
To group related molecular sequences and produce representative supersequences, CCP makes use of sequence-to-sequence correlations. Unlike conventional methods, CCP does not rely on matrix diagonalization and can therefore be applied to a range of machine-learning problems.
We estimate the density map and compute the correlation using a nearest-neighbor search technique.
We performed molecular sequence classification using the CCP and CCP-NN representations to assess the efficacy of our proposed approach. Our findings show that CCP-NN considerably improves classification accuracy and significantly outperforms CCP in computational runtime.
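A heavily hedged sketch of the cluster-then-project idea behind CCP-style preprocessing (illustrative only; CCP and CCP-NN define their own correlation measure, density map, and projection rules, none of which are reproduced here):

```python
# Illustrative feature clustering + projection: group correlated features,
# then collapse each cluster into one "super-feature" per sample.
import numpy as np
from sklearn.cluster import KMeans

def cluster_and_project(X, n_super=10, seed=0):
    """X: samples x features matrix (e.g., numeric sequence embeddings)."""
    labels = KMeans(n_clusters=n_super, n_init=10, random_state=seed).fit_predict(X.T)
    return np.column_stack([X[:, labels == c].sum(axis=1) for c in range(n_super)])

X = np.random.default_rng(0).normal(size=(100, 200))
print(cluster_and_project(X).shape)  # (100, 10)
```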
DANCE: Deep Learning-Assisted Analysis of ProteiN Sequences Using Chaos Enhanced Kaleidoscopic Images
ABSTRACT. Cancer, a complex disease characterized by uncontrolled cell growth, requires accurate identification of the cancer type to determine suitable treatment strategies. T cell receptors (TCRs), crucial proteins in the immune system, play a key role in recognizing antigens, including those associated with cancer. Recent advancements in sequencing technologies have facilitated comprehensive profiling of TCR repertoires, uncovering TCRs with potent anti-cancer activity and enabling TCR-based immunotherapies. However, analyzing these intricate biomolecules necessitates efficient representations that capture their structural and functional information.
T-cell protein sequences pose unique challenges due to their relatively smaller lengths than other biomolecules. Traditional vector-based embedding methods may encounter problems such as loss of information when representing these sequences. Therefore, an image-based representation approach becomes a preferred choice for efficient embeddings, allowing for the preservation of essential details and enabling comprehensive analysis of T-cell protein sequences.
In this paper, we propose to generate images from the protein sequences using the idea of Chaos Game Representation (CGR). For this purpose, we design images using the Kaleidoscopic images approach. This \textbf{D}eep Learning-Assisted \textbf{A}nalysis of Protei\textbf{N} Sequences Using \textbf{C}haos \textbf{E}nhanced Kaleidoscopic Images (called DANCE) provides a unique way to visualize protein sequences by recursively applying chaos game rules around a central seed point. The resulting kaleidoscopic images exhibit symmetrical patterns that offer a visually captivating representation of the protein sequences.
To investigate this approach's effectiveness, we classify T cell receptor (TCR) protein sequences by their respective target cancer cells, as TCRs are known for their immune response against cancer. Before classification, the TCR sequences are converted into images using the DANCE method. We employ deep-learning vision models to classify the generated images and obtain insights into the relationship between the visual patterns observed in the kaleidoscopic images and the underlying protein properties.
By combining CGR-based image generation with deep learning classification, this study opens novel possibilities in the protein analysis domain.
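For background, a sketch of classic CGR (our illustration; DANCE's kaleidoscopic variant, which recursively applies chaos-game rules around a central seed point, is not reproduced here):

```python
# Classic Chaos Game Representation: place alphabet symbols on the vertices
# of a polygon and repeatedly move halfway from the current point toward the
# vertex of the next residue; the visited points form the CGR image.
import numpy as np

def cgr_points(sequence, alphabet):
    n = len(alphabet)
    angles = 2 * np.pi * np.arange(n) / n
    vertices = {a: np.array([np.cos(t), np.sin(t)])
                for a, t in zip(alphabet, angles)}
    point, points = np.zeros(2), []
    for residue in sequence:
        point = (point + vertices[residue]) / 2.0   # move halfway to the vertex
        points.append(point.copy())
    return np.array(points)

amino_acids = "ACDEFGHIKLMNPQRSTVWY"                # 20-vertex polygon
pts = cgr_points("CASSLGTDTQYF", amino_acids)       # a toy TCR CDR3 sequence
print(pts.shape)                                    # (12, 2)
```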
Computing Gram Matrix for SMILES Strings using RDKFingerprint and Sinkhorn-Knopp Algorithm
ABSTRACT. SMILES (Simplified Molecular Input Line Entry System) strings are widely used to represent molecular structures in cheminformatics and drug discovery. However, effectively transforming these string-based representations into meaningful numerical features for machine learning remains a significant challenge due to the complex, non-Euclidean nature of molecular structures. Traditional fingerprint-based and deep learning approaches often struggle with scalability, interpretability, or computational efficiency.
Our approach leverages the Morgan Fingerprint to generate molecular feature representations, followed by a pairwise kernel function to compute a structured similarity matrix. We then refine this matrix using the Sinkhorn-Knopp algorithm, ensuring it satisfies probabilistic constraints. To reduce dimensionality, we apply Kernel Principal Component Analysis (kernel PCA), producing compact embeddings suitable for downstream machine learning tasks.
We conduct a comprehensive empirical evaluation of the proposed method on drug subcategory prediction (a classification task) and AlogPS solubility prediction, i.e., aqueous solubility and octanol/water partition coefficient (a regression task), using benchmark SMILES string datasets. The outcomes show that the proposed method outperforms baseline methods in supervised analysis and has potential uses in molecular design and drug discovery. By integrating kernel-based learning with probabilistic refinement, our method offers a promising alternative to existing cheminformatics techniques.
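A hedged sketch of this pipeline (Morgan fingerprints from RDKit, an RBF kernel as the pairwise similarity, and Sinkhorn-Knopp normalization toward a doubly stochastic matrix; parameter values are illustrative, not the paper's):

```python
# Morgan fingerprints -> RBF kernel Gram matrix -> Sinkhorn-Knopp refinement.
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem

def morgan_features(smiles_list, radius=2, n_bits=1024):
    fps = []
    for s in smiles_list:
        mol = Chem.MolFromSmiles(s)
        fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
        fps.append(np.array(list(fp), dtype=float))
    return np.array(fps)

def sinkhorn_knopp(K, n_iter=100):
    for _ in range(n_iter):                 # alternate row/column normalization
        K = K / K.sum(axis=1, keepdims=True)
        K = K / K.sum(axis=0, keepdims=True)
    return K                                # approximately doubly stochastic

X = morgan_features(["CCO", "CCN", "c1ccccc1", "CC(=O)O"])
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq / X.shape[1])                # RBF kernel Gram matrix
print(sinkhorn_knopp(K).round(3))
```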
Neuromorphic Spiking Neural Network Based Classification of COVID-19 Spike Sequences
ABSTRACT. The volume of publicly available SARS-CoV-2 (severe acute respiratory syndrome coronavirus 2) data has grown enormously since the COVID-19 pandemic, opening research doors to analyze the virus's behavior. Researchers have conducted various studies, such as genomic surveillance, to gain a deeper understanding of the virus so that efficient prevention mechanisms can be developed. However, the unstable nature of the virus (rapid mutations, multiple hosts, etc.) creates challenges in designing analytical systems for it. Therefore, we propose a neural network-based (NN) mechanism to perform an efficient analysis of SARS-CoV-2 data, as NNs exhibit generalized behavior upon training. Moreover, rather than using the full-length genome of the virus, we apply our method to its spike region, as this region harbors the predominant mutations and is used to attach to the host cell membrane. In this paper, we introduce a pipeline that first converts the spike protein sequences into a fixed-length numerical representation and then uses a Neuromorphic Spiking Neural Network to classify those sequences. We compare the performance of our method with various baselines using real-world SARS-CoV-2 spike sequence data and show that our method achieves higher predictive accuracy than the recent baselines.
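A hedged sketch of the first pipeline stage, a fixed-length numerical representation of spike sequences via k-mer counts (our illustration; the paper's exact embedding and the spiking-network classifier are not reproduced here):

```python
# Fixed-length k-mer count vector for a protein sequence: every sequence,
# regardless of length, maps to a vector of dimension 20**k.
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def kmer_vector(seq, k=2):
    index = {"".join(p): i for i, p in enumerate(product(AMINO_ACIDS, repeat=k))}
    vec = [0] * len(index)
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in index:                   # skip k-mers with ambiguous residues
            vec[index[kmer]] += 1
    return vec

v = kmer_vector("MFVFLVLLPLVSSQCVNLT")      # start of the spike protein
print(len(v), sum(v))                       # 400, 18
```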
Assessing microbial genome representation across various reference databases: a comprehensive evaluation
ABSTRACT. Metagenomics research offers significant insights into the composition, diversity, and functions of microbial communities across different environments. Accurate identification of bacterial species relies on mapping sequencing reads to references in bacterial reference databases. However, inconsistencies in genomic representation and taxonomic identifiers across databases can hinder analysis. This project aimed to evaluate and resolve these discrepancies by comparing two widely used bacterial reference databases, PATRIC and RefSeq, using NCBI’s taxonomic identifiers as a benchmark for species-level agreement.
Species common to both databases were identified through shared taxonomic IDs, while the genomic representation was assessed using the BLAST tool to align contigs from one database to those in the other. This comparison extended to overlapping species with available strain information. Visualization of alignment data highlighted gaps in genomic representation, emphasizing the need to consolidate contigs for higher-quality references.
These findings underscore the importance of addressing discrepancies in reference databases to create a unified and comprehensive resource. Such efforts will enhance the accuracy and scalability of metagenomics research, facilitating new biological discoveries through improved secondary and large-scale analyses.
Innovative Tools and Strategies for Outbreak Preparedness: Integrating Environmental Microbiomes and Prospective Pathogen Collections in Hospital Settings
ABSTRACT. Hospital settings are dynamic ecosystems where microbial transmission can lead to outbreaks, particularly in high-risk units like Neonatal Intensive Care Units (NICUs). This work highlights novel tools and strategies to enhance outbreak preparedness, integrating insights from environmental microbiome mapping, prospective pathogen collection, and genomic epidemiology. Drawing on data from the CHOP NICU environments, we will explore how metagenomics and whole-genome sequencing reveal microbial reservoirs, resistomes, and cryptic transmission pathways. Additionally, we will showcase the development of bioinformatics tools, such as the CURED pipeline, enabling rapid, cost-effective diagnostic screening for emerging clones of pathogens such as Staphylococcus aureus. Together, these approaches aim to revolutionize infection control, bridging gaps in global genomic surveillance and equipping hospitals with actionable insights to mitigate nosocomial infections effectively.
ABSTRACT. We present a novel application of Hidden Markov Models (HMMs) to analyze temporal patterns in genetic linkage disequilibrium using time-sorted haplotype data. Our model employs two hidden states, linked and unlinked, to detect changes in linkage patterns between pairs of columns in genetic sequences over time. The emission probabilities are derived from minor allele frequencies (MAF) and linkage disequilibrium parameters, while transition probabilities are set to maintain state persistence, with a minimal transition rate (0.05) between states.
We validated our approach using a dataset spanning October to December 2021, evaluating linkage disequilibrium coefficients (D') ranging from 0.01 to 0.75. The model successfully identified significant temporal shifts in linkage patterns, particularly around late October and early December 2021. Through permutation testing with 20 iterations, in which values were randomly swapped within each date group as a control, we demonstrated the robustness of the detected state changes.
Our results suggest that this HMM-based approach provides a reliable method for detecting and analyzing temporal variations in genetic linkage patterns, offering potential insights into population genetic dynamics and evolutionary processes.
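A hedged sketch of a two-state HMM of the kind described, with sticky transitions at the stated 0.05 switching rate and a Viterbi decoder (the emission model here is a placeholder; the paper derives emissions from MAF and linkage disequilibrium parameters):

```python
# Two-state (linked/unlinked) HMM with persistent transitions + Viterbi decoding.
import numpy as np

states = ["linked", "unlinked"]
trans = np.array([[0.95, 0.05],
                  [0.05, 0.95]])            # state persistence, switching rate 0.05
start = np.array([0.5, 0.5])

def viterbi(log_emit, trans, start):
    """log_emit: (T, 2) log-likelihood of each observation under each state."""
    T = log_emit.shape[0]
    dp = np.log(start) + log_emit[0]
    back = np.zeros((T, 2), dtype=int)
    for t in range(1, T):
        scores = dp[:, None] + np.log(trans)          # (from_state, to_state)
        back[t] = scores.argmax(axis=0)
        dp = scores.max(axis=0) + log_emit[t]
    path = [int(dp.argmax())]
    for t in range(T - 1, 0, -1):                     # backtrack
        path.append(int(back[t][path[-1]]))
    return [states[s] for s in reversed(path)]

# Placeholder emissions: 5 observations favoring "linked", then 5 "unlinked".
log_emit = np.log(np.array([[0.9, 0.1]] * 5 + [[0.2, 0.8]] * 5))
print(viterbi(log_emit, trans, start))
```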
Epistatic Density of Viral Variants in Acute and Chronic HCV patients
ABSTRACT. RNA viruses exhibit high mutation rates due to the lack of proofreading mechanisms during replication, leading to diverse intra-host viral populations. Variants with higher fitness tend to dominate the population due to enhanced transmissibility and immune escape. Fitness of viral variants depends on individual SNVs and epistatic links between pairs of SNVs, as well as on competition with other viral variants within the population. Recent machine learning methods have successfully predicted emerging COVID-19 variants based on epistatic SNV links, implying that SNV links contribute to the fitness of viral variants.
We define the epistatic density of a viral variant as the number of positively linked SNV pairs between mutated positions in its genome. We computed the epistatic density of intra-host Hepatitis C Virus (HCV) populations sampled from 85 chronic and 28 acute patients with HCV genotype 1a. On average, epistatic density was higher in chronic patients than in acute cases. Additionally, the epistatic density distributions were more irregular in acute populations. Finally, we applied these epistatic density properties to distinguish between intra-host populations of chronic and acute HCV patients.
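In symbols (our notation for the definition given above), the epistatic density of a variant $v$ is

\[
d(v) \;=\; \bigl|\{\, (i,j) : i,j \in M(v),\ i<j,\ \operatorname{link}(i,j) > 0 \,\}\bigr|,
\]

where $M(v)$ is the set of mutated positions in the genome of $v$ and $\operatorname{link}(i,j) > 0$ indicates a positive epistatic link between the SNVs at positions $i$ and $j$.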