BiCOB 2019: Papers with Abstracts

Abstract. The near exponential growth in sequence data available to bioinformaticists, and the emergence of new fields of biological research, continue to fuel an incessant need for increases in sequence alignment performance. Today, more than ever before, bioinformatics researchers have access to a wide variety of HPC architectures including high core count Intel Xeon processors and the many-core Intel Xeon Phi.
In this work, we present the implementation of a distributed, NCBI-compliant BLAST+ (C++ toolkit) code targeted at multi- and many-core clusters, such as those built around the Intel Xeon Phi line of products. The solution is robust: distributed BLAST runs can use the CPU only, the Xeon Phi processor or coprocessor only, or both, by pairing the CPU or Xeon Phi processor with a Xeon Phi coprocessor. The distributed BLAST implementation employs static load balancing, fault tolerance, and contention-aware I/O. The resulting implementation, HPC-BLAST, maintains greater than 90% weak-scaling efficiency on up to 160 Xeon Phi (Knights Landing) nodes.
The source code and instructions are available under the Apache License, Version 2.0 at
Abstract. cn.MOPS is a frequently cited model-based algorithm used to quantitatively detect copy-number variations in next-generation DNA-sequencing data. Previous work has implemented the algorithm as an R package and has achieved considerable yet limited performance improvement by employing multi-CPU parallelism (the maximum achievable speedup was experimentally determined to be 9.24). In this paper, we propose an alternative mechanism of process acceleration. Using one CPU core and a GPU device, our proposed solution, gcn.MOPS, achieves a speedup factor of 159 and reduces memory usage by more than half compared to cn.MOPS running on one CPU core.
Abstract. Diabetes is a disease reported to be the 8th leading cause of death across the world. Nearly 38 million people worldwide have Type I diabetes, caused by a dysfunction of beta cells that impairs insulin production. A better understanding of the mechanisms related to gene expression in beta cells might help in the development of novel strategies for the effective treatment of diabetes. Two known transcription factors, Pdx-1 and NeuroD1, have been shown to regulate gene expression in beta cells. Recently, gene targets that are regulated by both Pdx-1 and NeuroD1 have been identified experimentally [7]; however, the motifs for this set of genes have not yet been found. Here we undertake the task of finding statistically overrepresented motifs in genes regulated by Pdx-1 and NeuroD1. The challenge of this project is to identify statistically significant pairs of motifs, one motif of each pair for Pdx-1 and the other for NeuroD1. Commonly known motif-finding methods are usually restricted to finding a set of potential candidates, each of which is a single motif.
Abstract. In bioinformatics, DNA sequence assembly refers to the reconstruction of an original DNA sequence by the alignment and merging of fragments that can be obtained from several sequencing methods. The main sequencing methods process thousands or even millions of these fragments, which can be short (hundreds of base pairs) or long (thousands of base pairs) read sequences. This is a highly computational task, which usually requires the use of parallel programs and algorithms so that it can be performed with desirable accuracy and within suitable time limits. In this paper, we evaluate the performance of the DALIGNER long-read sequence aligner on a system using the Intel Xeon Phi 7210 processor. We are looking for scalable architectures that could provide higher throughput applicable to future sequencing technologies.
Abstract. Mapping short reads to a reference genome is an essential step in many next-generation sequencing (NGS) analyses. In plants with large genomes, a large fraction of the reads can align to multiple locations of the genome with equally good alignment scores. How to map these ambiguous reads to the genome is a challenging problem with a large impact on downstream analysis. Traditionally, the default method is to assign an ambiguous read randomly to one of the many potential locations. In this study, we explore an enrichment method that assigns an ambiguous read to the location that has produced the most reads among all the potential locations. Our results show that the enrichment method produced better results than the default random method in the discovery of single nucleotide polymorphisms (SNPs). Not only did it produce more SNP markers, but it also produced markers of better quality, as demonstrated by higher trait-marker correlation in genome-wide association studies (GWAS).
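The enrichment rule described in this abstract can be illustrated with a short Python sketch (a minimal in-memory illustration only; the function and input names are hypothetical, a real pipeline would read alignments from BAM files, and whether assigned ambiguous reads should reinforce subsequent counts is an assumption):

```python
from collections import Counter

def assign_ambiguous_reads(unique_hits, ambiguous_hits):
    """Assign each ambiguous read to the candidate location that has
    attracted the most reads so far (the enrichment method), instead of
    choosing a location at random (the default method).

    unique_hits    : list of locations of uniquely mapped reads
    ambiguous_hits : dict mapping read id -> list of candidate locations
    """
    # Seed per-location counts with the uniquely mapped reads.
    coverage = Counter(unique_hits)
    assignments = {}
    for read_id, candidates in ambiguous_hits.items():
        # Pick the candidate location with the highest read count
        # (ties resolve to the first candidate via max's stability).
        best = max(candidates, key=lambda loc: coverage[loc])
        assignments[read_id] = best
        # Assumption: assigned ambiguous reads also reinforce the count.
        coverage[best] += 1
    return assignments
```

A location already enriched by uniquely mapped reads therefore attracts the ambiguous reads that list it as a candidate, which is the intuition behind the method's improved SNP discovery.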
Abstract. The aim of this paper is to identify differences in type I interferon expression between 2-day-old neonatal and six-to-eight-week-old adult mice infected with Sendai virus (SeV), a single-stranded RNA virus of the family Paramyxoviridae. Sendai virus mimics the effects of respiratory syncytial virus (RSV) on humans, but does not infect humans. Although RSV has a fatal impact on people across age groups, little is understood about this common virus and the disparity between neonatal and adult immune responses to it. Past findings have suggested that type I interferon mRNA is present at higher levels in adults than in neonates; however, there is a greater amount of interferon protein in neonates than in adults. To test the hypothesis that neonates are more capable of producing interferon and preventing the translation of viral protein, I observed mouse models of respiratory viral infection and determined the expression of IFN-α1, IFN-α2, IFN-α5, IFN-α6, IFN-α7, and IFN-β in archived mouse lung tissue samples harvested on different days post-infection using quantitative real-time PCR. Expression of glyceraldehyde 3-phosphate dehydrogenase (GAPDH), a housekeeping gene expressed constitutively in all mouse models, was used as a positive control for the experiment. To determine the ideal primer concentration for qPCR, primer reconstitution, primer optimization, and gel electrophoresis were conducted in advance. In addition, technical and biological replicates were used to reduce error and confirm the qPCR results. In accordance with previous findings, I found an upward trend in adult interferon expression from post-infection day 1 to day 5, leveling off by day 7. In contrast, neonatal levels were much higher on day 1 and remained high over the course of infection. This suggests how type I interferon expression is altered in neonates to help them clear the virus as efficiently as adults without causing inflammation.
Future research on immune response differences in human infection should focus on the evaluation of interferon protein amounts, as well as on the analysis of the activation of molecules downstream of the type I interferon receptors, such as the signal transducer and activator of transcription (STAT) protein family. It is also crucial to compare the activity of immune cells such as macrophages and natural killer cells in adult and neonatal mice during viral infection.
Abstract. Sequencing depth, which refers to the expected coverage of nucleotides by reads, is computed based on the assumption that reads are synthesized uniformly across chromosomes. In reality, read coverage across genomes is not uniform. Although a coverage of 10x, for example, means a nucleotide is covered 10 times on average, in certain parts of a genome nucleotides are covered much more or much less. One factor that influences coverage is the ability of a read aligner to align reads to genomes. If a part of a genome is complex, e.g., having many repeats, aligners might have trouble aligning reads to that region, resulting in low coverage.
We introduce a systematic approach to predict the effective coverage of genomes by short-read aligners. The effective coverage of a chromosome is defined as the actual number of bases covered by reads. We show that this quantity is highly correlated with the repeat complexity of genomes; specifically, the more repeats a genome has, the less it is covered by short reads. We demonstrate this strong correlation with five popular short-read aligners in three species: Homo sapiens, Zea mays, and Glycine max. Additionally, we show that, compared to other measures of sequence complexity, repeat complexity is the most appropriate. This work makes it possible to predict the effective coverage of genomes at a given sequencing depth.
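The definition of effective coverage can be made concrete with a small sketch (hypothetical names; in practice the intervals would come from alignment records in a BAM file rather than an in-memory list):

```python
def effective_coverage(chrom_length, alignments):
    """Effective coverage of a chromosome: the number of bases covered
    by at least one aligned read.

    chrom_length : length of the chromosome in bases
    alignments   : list of (start, end) half-open read placements
    Returns (covered_bases, covered_fraction).
    """
    covered = bytearray(chrom_length)  # one flag per base
    for start, end in alignments:
        for i in range(max(start, 0), min(end, chrom_length)):
            covered[i] = 1
    n_covered = sum(covered)
    return n_covered, n_covered / chrom_length
```

Repeat-rich regions attract few confidently placed reads, so their bases stay unflagged and the effective coverage falls below the nominal sequencing depth, which is the correlation the abstract describes.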
Abstract. Comparing whole genomes and finding variation is an important and difficult bioinformatic task. We present the Polygraph, a data structure for reference-free, multiple whole-genome alignment that can be used to identify genomic structural variation. This data structure is built from assembled genomes and preserves the genomic structure from the assembly. It avoids the “hairball” graph structure that can occur in other graph methods such as de Bruijn graphs. The Polygraph can easily be visualized and used for the identification of structural variants. We apply the Polygraph to Escherichia coli and Saccharomyces cerevisiae to find structural variants.
Abstract. Wavelet pooling methods can improve the classification accuracy of Convolutional Neural Networks (CNNs). Combining wavelet pooling with the Nesterov-accelerated Adam (NAdam) gradient calculation method can further improve the accuracy of the CNN. In this work, we have implemented wavelet pooling with NAdam using both a Haar wavelet (WavPool-NH) and a Shannon wavelet (WavPool-NS). The WavPool-NH and WavPool-NS methods are the most accurate of the methods we considered for the MNIST and LIDC-IDRI lung tumor data sets. The WavPool-NH and WavPool-NS implementations have an accuracy of 95.92% and 95.52%, respectively, on the LIDC-IDRI data set, an improvement over the 92.93% accuracy obtained on this data set with the max pooling method. The WavPool methods also avoid overfitting, which is a concern with max pooling. We also found WavPool performed fairly well on the CIFAR-10 data set; however, overfitting was an issue with all the methods we considered. Wavelet pooling, especially when combined with an adaptive gradient method and wavelets chosen specifically for the data, has the potential to outperform current methods.
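As an illustration of the pooling step alone (not the authors' full WavPool-NH model, which also involves NAdam training), a single-level Haar wavelet pool applies a one-level 2D Haar transform and keeps only the approximation (LL) subband, halving each spatial dimension; for the orthonormal Haar basis the LL coefficient of each 2x2 block is the block sum divided by 2:

```python
import numpy as np

def haar_wavelet_pool(x):
    """Single-level Haar wavelet pooling of a 2D feature map: keep the
    approximation (LL) subband of a one-level 2D Haar DWT, so each
    spatial dimension is halved. Assumes even spatial dimensions."""
    h, w = x.shape
    assert h % 2 == 0 and w % 2 == 0, "even spatial dims expected"
    a = x[0::2, 0::2]  # top-left of each 2x2 block
    b = x[0::2, 1::2]  # top-right
    c = x[1::2, 0::2]  # bottom-left
    d = x[1::2, 1::2]  # bottom-right
    # Orthonormal 2D Haar approximation coefficients.
    return (a + b + c + d) / 2.0
```

Unlike max pooling, which keeps only the largest activation in each block and discards the rest, the LL subband is a smoothed summary of the whole block, which is one intuition for the reduced overfitting the abstract reports.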
Abstract. Healthcare is considered a data-intensive industry, offering large data volumes that can, for example, serve as the basis for data-driven decisions in hospital resource planning. A significant aspect in that context is the prediction of cost-intensive patients. This paper introduces prediction models to identify patients at risk of causing extensive costs to the hospital. Based on a data set from a private Australian hospital group, four logistic regression models were designed and evaluated to predict cost-intensive patients. Each model utilizes a different feature set, including attributes that gradually become available throughout a patient episode. The results show that, in particular, variables reflecting hospital resources have a high influence on the probability of becoming a cost-intensive patient. The corresponding prediction model, which incorporates attributes describing resource utilization, achieves a sensitivity of 94.32% and thus enables effective prediction of cost-intensive patients.
Abstract. %MinMax, a model of intra-gene translational elongation rate, relies on codon usage frequencies. Historically, %MinMax has used tables that measure codon usage bias over all genes in an organism, such as those found at HIVE-CUT. In this paper, we provide evidence that codon usage bias based on all genes is insufficient to accurately measure absolute translation rate. We show that alternative “High-φ” codon usage tables, generated by another model (ROC-SEMPPR), are a promising alternative. By creating a hybrid model, future codon usage analyses and their applications (e.g., codon harmonization) are likely to more accurately measure the “tempo” of translation elongation. We also suggest a High-φ alternative to the Codon Adaptation Index (CAI), a classic metric of codon usage bias based on highly expressed genes. Significantly, our new alternative correlates with empirical data as well as traditional CAI does, without using experimentally determined expression counts as input.
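For context, the %MinMax idea can be sketched as a sliding-window score over a gene's codons (a simplified illustration based on the published definition of %MinMax; the authors' exact normalization and window size may differ, and the toy inputs below are hypothetical). Each window's mean codon usage frequency is compared against the means obtained if every position used its most frequent (max), least frequent (min), or average synonymous codon:

```python
def percent_minmax(codons, usage, syn, window=17):
    """Sliding-window %MinMax-style score.

    codons : list of codons for one gene, in order
    usage  : codon -> usage frequency from a codon usage table
    syn    : codon -> list of synonymous codons (same amino acid)
    Positive scores mean relatively common codons; negative, rare ones.
    Assumes each window contains at least one multi-codon amino acid.
    """
    scores = []
    for i in range(len(codons) - window + 1):
        win = codons[i:i + window]
        actual = sum(usage[c] for c in win) / window
        x_max = sum(max(usage[s] for s in syn[c]) for c in win) / window
        x_min = sum(min(usage[s] for s in syn[c]) for c in win) / window
        x_avg = sum(sum(usage[s] for s in syn[c]) / len(syn[c])
                    for c in win) / window
        if actual >= x_avg:      # window enriched in common codons
            scores.append(100 * (actual - x_avg) / (x_max - x_avg))
        else:                    # window enriched in rare codons
            scores.append(-100 * (x_avg - actual) / (x_avg - x_min))
    return scores
```

The paper's proposal amounts to changing only the `usage` table, from an all-gene table (e.g., HIVE-CUT) to a High-φ table derived from ROC-SEMPPR, while the window computation stays the same.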
Abstract. MoRFs usually act as "hub" sites in the interaction networks of intrinsically disordered proteins. As more and more serious diseases are found to be associated with disordered proteins, identifying MoRFs has become increasingly important. In this study, we introduce a multichannel convolutional neural network (CNN) model for MoRF prediction. This model is generated by expanding the standard one-dimensional CNN model into multiple parallel CNNs that read the sequence with different n-gram sizes (groups of residues). In addition, we add an averaging step to refine the output of the machine-learning model. When compared with other methods on the same dataset, our approach achieved a balanced accuracy of 0.682 and an AUC of 0.723, the best performance among the single-model-based approaches.
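The multichannel idea, several parallel 1D convolutions reading the same encoded sequence with different n-gram (window) sizes, can be sketched with NumPy (an illustrative skeleton only: the function name and sizes are hypothetical, and random filters stand in for the learned weights of the actual trained CNN):

```python
import numpy as np

def multichannel_features(seq_matrix, ngram_sizes=(2, 4, 6)):
    """Sketch of a multichannel 1D convolution over a residue encoding.

    seq_matrix  : (sequence_length, n_features) per-residue encoding
    ngram_sizes : window sizes, one per parallel channel
    Returns one pooled value per channel, concatenated.
    """
    rng = np.random.default_rng(0)        # stand-in for learned weights
    length, n_feat = seq_matrix.shape
    channels = []
    for n in ngram_sizes:
        filt = rng.standard_normal((n, n_feat))   # one filter per channel
        # Valid 1D convolution: slide the n-gram window along the sequence.
        conv = np.array([np.sum(seq_matrix[i:i + n] * filt)
                         for i in range(length - n + 1)])
        channels.append(conv.max())               # global max pooling
    return np.array(channels)
```

Because each channel sees a different window size, the concatenated features capture residue patterns at several n-gram granularities at once, which is the motivation for the multichannel expansion described in the abstract.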
Abstract. Noise remains a particularly challenging and ubiquitous problem in clustering cancer gene expression data; it can produce inaccurate results and obscure the underlying biological meaning. A clustering method that is robust to noise is therefore highly desirable. Despite the vast number of methods available, no single clustering method performs best across all data sets. Cluster ensembles provide an approach to automatically combine results from multiple clustering methods to improve robustness and accuracy. We propose a novel noise-robust fuzzy cluster ensemble algorithm. It employs an improved fuzzy clustering approach with different initializations as its base clusterings to avoid or alleviate the effects of noise in data sets. Its results show effective improvements on most of the examined noisy real cancer gene expression data sets when compared with most of the evaluated benchmark clustering methods: it is the top performer on three of the eight data sets, more than any other method evaluated, and it performs well on most of the remaining data sets. Our fuzzy cluster ensemble is also robust on highly noisy synthetic data sets. Moreover, it is computationally efficient.
Abstract. Prostate cancer is widely known to be one of the most common cancers among men around the world. Due to its high heterogeneity, many of the studies carried out to identify the molecular-level causes of the cancer have been only partially successful. Among the techniques used in cancer studies, gene expression profiling is one of the most popular. Gene expression profiles reveal information about the functionality of genes in different body tissues under different conditions. To identify cancer-decisive genes, differential gene expression analysis is carried out using statistical and machine learning methodologies; it extracts information about genes that have significant expression differences between healthy and cancerous tissues. In this paper, we discuss a comprehensive supervised classification approach using Support Vector Machine (SVM) models to investigate differentially expressed Y-chromosome genes in prostate cancer. Eight SVM models, tuned to an average accuracy of 98.3%, were used for the analysis. Genes such as CD99 (MIC2), ASMTL, DDX3Y, and TXLNGY emerged as the best candidates. Some of our results support existing findings, while others introduce novel candidate genes for prostate cancer.
Abstract. Digital pathology (DP) is a new research area which falls under the broad umbrella of health informatics. Owing to its potential for major public health impact, DP has attracted much research attention in recent years. Nevertheless, a wide breadth of significant conceptual and technical challenges remains, few of them greater than those encountered in digital oncology. The automatic analysis of digital pathology slides of cancerous tissues is particularly problematic due to the inherent heterogeneity of the disease, the extremely large images, and numerous other factors. In this paper, we introduce a novel machine learning based framework for the prediction of colorectal cancer outcome from whole haematoxylin & eosin (H&E) stained histopathology slides. Using a real-world data set, we demonstrate the effectiveness of the method and present a detailed analysis of its elements, which corroborates its ability to extract and learn salient, discriminative, and clinically meaningful content.
Abstract. Regulation of gene expression is one of the most important problems analyzed in systems biology. It involves, among others, interactions of mRNA with miRNA, a small (21-25 nt) single-stranded non-coding RNA molecule. The main function of miRNA is post-transcriptional regulation of gene expression leading to gene silencing, achieved either by inhibition of translation or by degradation of mRNA. The detailed mechanisms include inhibition of attachment of the 60S ribosomal subunit, premature ribosome drop-off, inhibition of the protein elongation process, and cleavage or destabilization of mRNA. Another mechanism of gene expression regulation involves reactive oxygen species (ROS: radical and non-radical oxygen species formed by the partial reduction of oxygen), which, by promoting the release of cytochrome C from mitochondria and inducing DNA damage, activate the apoptosis pathway. ROS levels can be regulated by the antioxidant systems existing in a cell. This paper presents an analysis of a model of gene regulation based on these molecules, in which a Petri net is used to find the key reactions and, subsequently, an ODE-based model is used to verify these conclusions.
Abstract. A central challenge in template-free protein structure prediction is controlling the quality of computed tertiary structures, also known as decoys. Given the size, dimensionality, and inherent characteristics of the protein structure space, this is non-trivial. The current mechanism employed by decoy generation algorithms relies on generating as many decoys as can be afforded, which is impractical and uninformed by any metrics of interest on a decoy dataset. In this paper, we propose to equip a decoy generation algorithm with an evolving map of the protein structure space. The map utilizes low-dimensional representations of protein structure and serves as a memory whose granularity can be controlled. Evaluations on diverse target sequences show that drastic reductions in storage do not sacrifice decoy quality, indicating the promise of the proposed mechanism for decoy generation algorithms in template-free protein structure prediction.
Abstract. The Eastern oyster, Crassostrea virginica, influences a range of economic and ecological systems. The effects of agricultural runoff, such as the herbicide atrazine, on marine species have not been well documented. To assess the effect of atrazine, the oyster shell surface was examined. The shell protects the internal organs from predators, and the shells of adult oysters are also required for the spawning of new oysters. This study analyzed the effects of atrazine on oyster shells using Scanning Electron Microscopy (SEM). SEM images were taken of juvenile oysters exposed to three different concentrations of atrazine (30, 10, and 3 μg/L) to determine whether there is a statistically significant difference in microstructure frequencies and GLCM texture features across three regions of the shell. Using a multivariate t-test, a significant difference was found in the texture features of the edge regions of shells treated with 30 μg/L versus 0 μg/L of atrazine. The same t-test found a significant difference in microstructures near the edge regions for shells treated with 10 μg/L versus 0 μg/L of atrazine. These results provide computational strategies to distinguish shells treated with high concentrations of atrazine. Future work will tie evidence collected from imaging analysis into transcriptome data to illustrate the genetic impacts of atrazine exposure.
Abstract. Molecular dynamics simulation software now provides us with a view of the structure space accessed by a molecule. Increasingly, Markov state models are proposed to integrate various simulations of a molecule and extract its equilibrium structural dynamics. The approach relies on organizing the structures accessed in simulation into states in an attempt to identify thermodynamically stable and semi-stable (macro)states among which transitions can then be quantified. Typically, off-the-shelf clustering algorithms are used for this purpose. In this paper, we investigate two additional, complementary approaches to state identification that rely on graph embeddings of the structures. In particular, we show that doing so reveals basins in the energy landscape associated with the accessed structure space. Moreover, we demonstrate that basins, directly tied to stable and semi-stable states, yield a better model of dynamics in a proof-of-concept application.
Abstract. piRNAs are a particular type of small non-coding RNA. They are distinct from miRNAs in size as well as in other characteristics, such as lack of sequence conservation and increased complexity compared to their miRNA counterparts. piRNA is considered the largest class of sRNA and is expressed especially in animal cells. In contrast to other small silencing RNAs, piRNAs are derived from long single-stranded RNAs that are transcribed from genomic clusters. It has been speculated that one locus could generate more than one piRNA. piRNAs corresponding to repetitive elements are fewer in mammals than in other species such as Drosophila and Danio rerio, which suggests that piRNAs might have gained some additional functionality in mammals. While the functionality of piRNAs may not be fully understood, they are believed to be involved in gene silencing. In this paper, we examine a novel approach to identify potential piRNA clusters based on the location and order of downstream and upstream genes.