AI and Big Data Analytics for Health and Bioinformatics
ABSTRACT. Technological advances now allow high-throughput profiling of biological systems at low cost, and this low cost of data generation is leading us into the "big data" era. The availability of big data provides unprecedented opportunities but also raises new challenges for data mining and analysis. In this talk, I will start with the concepts in the analysis of big data, specifically the AI algorithms.
My group, the Biomedical Informatics Lab (BIL), is a research centre that is the focus of education, research and development, and human-resource training in health informatics and bioinformatics at NTU. The mission of BIL is to provide an interdisciplinary environment and training for students and researchers to engage in leading, cutting-edge research in bioinformatics, and thereby become a part of the life sciences workforce in Singapore and elsewhere.
By presenting selected research activities, this talk will provide an overview of some of the innovative and creative approaches that apply AI in big data analytics to address challenges in both health and bioinformatics.
The effect of reference species on reference-guided genome assembly
ABSTRACT. The rapid improvement of next-generation sequencing (NGS) technologies has enabled unprecedented production of huge amounts of DNA sequence data at low cost. However, NGS technologies are still limited to generating short DNA sequences, which has led to the development of many assembly algorithms to recover whole genome sequences from those short sequences. Unfortunately, the assembly algorithms alone can only construct scaffold sequences, which are generally much shorter than chromosome sequences. To generate chromosome sequences, additional expensive experimental data is required. To overcome this problem, many studies have developed new computational algorithms that further merge the scaffold sequences and produce chromosome-level sequences by utilizing an existing genome assembly of a related species, called a reference. However, even though the quality of the chosen reference assembly is critical for generating a good final assembly, its effect is not yet well understood. In this study, we measured the effect of the reference genome assembly on the quality of the final assembly generated by reference-guided assembly algorithms. Using the genome assemblies of a total of eleven reference species (eight primates and three rodents), human genome sequences were assembled from scaffold sequences by one of the reference-guided assembly algorithms, RACA, and were compared with known genome sequences to measure their quality in terms of the number of misassemblies. The effect of the quality of the reference assemblies was investigated in terms of divergence time from human, alignment coverage between the reference and human, and the extent to which core eukaryotic genes are included. We found that the divergence time is a good indicator of the quality of the final assembly when high-quality reference assemblies are used.
We believe this study will contribute to broadening our understanding of the effect and importance of a reference assembly in the reference-guided assembly task.
Position-Residue Specific Dynamic Gap Penalty Scoring Strategy for Multiple Sequence Alignment
ABSTRACT. Multiple Sequence Alignment (MSA) is a basic tool for biological sequence analysis. Effective alignment of multiple biologically relevant sequences is still an open problem. Nevertheless, MSA is a crucial step used by biologists in phylogenetic analysis, gene regulation studies, homology marker identification, drug discovery, and the prediction of protein structure and function. The accuracy of MSA is highly dependent on the scoring function, which aligns a given residue to its appropriate position during alignment. A scoring function has three possible cases when scoring a pair of residues: i) a residue with the same residue, ii) a residue with a different residue, and iii) a residue with a gap. A number of biologically meaningful approaches have been developed for the first two cases. However, for the third case, most approaches use a default gap penalty score, which is provided as an input by an expert. In this study, we propose a new, biologically relevant, position-residue specific dynamic scoring approach for the gap penalty. The Position-Residue Specific Dynamic Gap Penalty (PRSDGP) scoring function is tested on a benchmark dataset. The proposed PRSDGP scoring approach is compared with the CLUSTAL O program, and the improvement in the Quality metric ranges from 46.2% to 81.5%.
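The three scoring cases named in the abstract above can be illustrated with a minimal sketch. It is not the PRSDGP function, which the abstract does not specify: the match and mismatch values, the `gap_penalties` lookup table, and the function names are all assumptions chosen for illustration. The sketch only shows how a gap penalty that depends on both column position and residue could plug into a sum-of-pairs alignment score.

```python
from itertools import combinations

MATCH_SCORE = 2      # assumed score for case i: identical residues
MISMATCH_SCORE = -1  # assumed score for case ii: different residues
DEFAULT_GAP = -2     # assumed fallback penalty for case iii


def pair_score(a, b, position, gap_penalties):
    """Score one pair of symbols taken from the same alignment column."""
    if a == "-" and b == "-":
        return 0  # two gaps carry no information
    if a == "-" or b == "-":
        residue = b if a == "-" else a
        # Case iii: penalty looked up by (column position, residue);
        # the table itself is hypothetical.
        return gap_penalties.get((position, residue), DEFAULT_GAP)
    return MATCH_SCORE if a == b else MISMATCH_SCORE


def sum_of_pairs(alignment, gap_penalties):
    """Sum pair scores over every column and every pair of sequences."""
    total = 0
    for position, column in enumerate(zip(*alignment)):
        for a, b in combinations(column, 2):
            total += pair_score(a, b, position, gap_penalties)
    return total


alignment = ["AC-T", "ACGT", "A-GT"]
penalties = {(1, "C"): -4, (2, "G"): -1}  # hypothetical dynamic penalties
score = sum_of_pairs(alignment, penalties)
baseline = sum_of_pairs(alignment, {})  # every gap uses DEFAULT_GAP
```

Comparing `score` with `baseline` shows how replacing the single expert-supplied default with position-residue specific values changes the score of the same alignment.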
Proposal of application method of Inductive Logic Programming to microarray data
ABSTRACT. This paper describes a method of specifying common terms of genes from microarray data in 3 steps.
First, we use random forest to extract disease-related genes; it gives each gene a variable importance. The higher the variable importance, the more effective the feature is for classification. We extract genes whose variable importance is greater than 0 and set them as positive samples for Inductive Logic Programming (ILP), and set the rest as negative samples.
Next, we annotate the extracted genes using Gene Ontology (GO) and use the terms as predicates for ILP. Annotation is the process of assigning GO terms to gene products.
Finally, we obtain rules about common terms in positive samples by using ILP. ILP is a subfield of machine learning that uses logic programming as a uniform representation technique for examples, background knowledge and hypotheses. ILP learns based on background knowledge, which is represented in first-order logic.
As a result, we extracted 1051 mRNAs as positive samples for ILP from random forest, and its F-measure score was 65.1%. We obtained about 4000 terms for each dataset and used them as predicates for ILP. Eventually, we obtained some rules about positive samples.
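The first step above, splitting genes into positive and negative ILP samples by variable importance, can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the gene names and importance values are invented, and the importances are assumed to have already been produced by a random forest. The F-measure helper uses the standard harmonic-mean definition.

```python
def split_by_importance(importances, threshold=0.0):
    """Split genes into positive (importance > threshold) and negative samples."""
    positives = sorted(g for g, v in importances.items() if v > threshold)
    negatives = sorted(g for g, v in importances.items() if v <= threshold)
    return positives, negatives


def f_measure(precision, recall):
    """Harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical variable importances, as a random forest might report them.
importances = {"GENE_A": 0.12, "GENE_B": 0.0, "GENE_C": 0.03, "GENE_D": 0.0}
positives, negatives = split_by_importance(importances)
```

With the toy values, `GENE_A` and `GENE_C` become positive samples and the other two become negative samples.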
Meta-analysis of whole-transcriptome data for prediction of novel genes associated with autism spectrum disorder
ABSTRACT. Autism spectrum disorder (ASD) is a neurodevelopmental disorder with typical symptoms such as impaired social interaction, language and communication abnormalities, and stereotypical behavior. Since the genetics of ASD is so diverse, information on genome function as provided by transcriptomic data is essential to further our understanding, because the transcriptome is a key link between measuring protein levels and genetic information. Transcriptomic studies have often been performed by comparing groups of individuals with ASD and control samples, using statistical techniques to identify which genes are dysregulated in the ASD group. However, these statistical techniques can only find genes that individually account for ASD, and cannot reflect relationships among genes that could underlie the etiology of ASD. In this study, we propose a novel method to find ASD-associated genes that are predictive for ASD. To this end, we analyze whole-transcriptome data from previous ASD studies, which were performed using different expression profiling platforms on different tissues of interest. These predictive genes, which can classify a sample as either ASD or non-ASD, are selected by an optimization process. Comparing subsets selected from different tissues/platforms, we conclude that different tissues contain different gene sets that account for ASD. In addition, one platform can supply ASD-associated genes that other platforms cannot. The genes found are compared to those that are well documented in SFARI, which is the most comprehensive and up-to-date database for ASD. Interestingly, we find 16 novel genes, supported by evidence from the literature, that have not yet been recorded in this database. Taken together, meta-analysis of whole-transcriptome data for ASD could shed light on the etiology of ASD.
Revealing deep proteome diversity with community-scale proteomics big data
ABSTRACT. Translating the growing volumes of proteomics mass spectrometry data into reusable evidence of the occurrence and provenance of proteomics events requires the development of novel algorithms and community-scale computational workflows. MassIVE (http://massive.ucsd.edu) proposes to address this challenge in three stages.
First, systematic annotation of human proteomics big data requires automated reanalysis of all public data using open source workflows with detailed records of search parameters and of individual Peptide Spectrum Matches (PSMs). As such, our large-scale reanalysis of tens of terabytes of human data has now increased the total number of proper public PSMs by over 10-fold to over 320 million PSMs whose coverage includes over 95% of public human HCD data.
Second, proper synthesis of community-scale search results into a reusable knowledge base (KB) requires scalable workflows imposing strict statistical controls. Our MassIVE-KB spectral library has thus properly assembled 2+ million precursors from over 1.5 million peptides covering over 6.2 million amino acids in the human proteome, all of which at least double the numbers covered by the popular NIST spectral libraries. Moreover, MassIVE-KB detects 723 novel proteins (PE 2-5) for a total of 16,852 proteins observed in non-synthetic LCMS runs and 19,610 total proteins when including the recent ProteomeTools data.
Third, we show how advanced identification algorithms combine with public data to reveal dozens of unexpected putative modifications supported by multiple highly-correlated spectra. These show that protein regions can be observed in over 100 different variants with various combinations of post-translational modifications and cleavage events, thus suggesting that current coverage of proteome diversity (at ~1.3 variants per protein region) is far below what is observable in experimental data.
A machine learning approach for drug discovery from herbal medicine: Metabolite profiles to Therapeutic effects
ABSTRACT. Vietnam has an abundance of traditional herbal medicine, with experience accumulated over thousands of years. These medicines play an important role in drug development. However, several therapeutic effects of these plants remain unknown. To explore active ingredients in effective Vietnamese herbal medicine formulations for individual diseases and to understand therapeutic effects from a scientific viewpoint, this project predicts therapeutic effects based on metabolite profiles. The herbal medicine database was processed to extract useful information with the support of computational approaches, particularly the Random Forest algorithm, Generalized Boosted Model and Support Vector Machine. Binary classification models relating metabolites to three specific therapeutic effects were built to deal with the multi-class classification and imbalanced class data problems. Since this project can reveal the main predictors of a specific therapeutic effect, its results are valuable information for further research in drug development.
Drug Repurposing: Targeting mTOR inhibitors for Anticancer Activity
ABSTRACT. In the search for better and safer drugs, and because of the high cost and decreasing productivity of novel drug discovery programs, scientists are becoming more interested in finding new therapeutic indications for existing drugs, an approach popularly known as drug repurposing. In drug repurposing, a conventional drug is used to treat a condition for which it was not previously known to be therapeutically effective. Many drugs that failed clinical trials because they were not effective in their intended therapeutic indication have been repurposed, and have thus brought large profits to the pharmaceutical industry. For instance, sildenafil failed its clinical trials, was repurposed, and is currently in use as a repurposed drug. Many methods are available for drug repurposing, but computational docking, which uses computer software to find a possible binding site of a drug within a protein, is a very cheap and convenient one. Because of these advantages, a computational docking approach was used for the present drug repurposing study of the mTOR protein, where the drugs chosen were metformin, aspirin and rosuvastatin. AutoDock Vina and PyMOL were used to complete the study, and it was found that aspirin and metformin have poor affinity (-5.8 kcal/mol) for this protein, which is upregulated in various types of cancer such as breast cancer and ovarian cancer. On the other hand, rosuvastatin was found to have a high affinity (-7.8 kcal/mol in flexible docking and -10.2 kcal/mol in rigid docking) for mTOR, and it binds to the same binding pocket as the immunosuppressant and anticancer drug rapamycin. The study therefore indicates that rosuvastatin might have significant immunosuppressive and anticancer activity by downregulating the activity of mTOR; further studies are needed to confirm this.
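To give a feel for what the reported docking scores mean, the sketch below converts a binding free energy into an approximate dissociation constant using the standard thermodynamic relation ΔG = RT ln(Kd). This is an illustrative calculation only: docking scores are rough estimates of ΔG, the temperature of 298.15 K is an assumption, and the function name is hypothetical.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol*K)


def dissociation_constant(delta_g_kcal, temperature_k=298.15):
    """Estimate Kd in mol/L from a binding free energy via dG = RT ln(Kd)."""
    return math.exp(delta_g_kcal / (R_KCAL * temperature_k))


# Docking scores quoted in the abstract (kcal/mol).
scores = {
    "aspirin and metformin": -5.8,
    "rosuvastatin (flexible docking)": -7.8,
    "rosuvastatin (rigid docking)": -10.2,
}
kd_estimates = {name: dissociation_constant(dg) for name, dg in scores.items()}

for name, kd in kd_estimates.items():
    print(f"{name}: Kd ~ {kd:.2e} M")
```

Under these assumptions, each additional 1.36 kcal/mol of binding energy corresponds to roughly a tenfold tighter estimated Kd, which is why a 2 kcal/mol difference between compounds is treated as meaningful.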
In silico Structure Based Designing of Dihydrofolate Reductase Enzyme Antagonists and Potential Small Molecules That Target DHFR Protein to Inhibit the Folic Acid Biosynthetic Pathways
ABSTRACT. Molecular docking has exerted a profound effect on the concept of drug discovery in recent years. Novel parameters and procedures have enabled us to discover numerous new molecules in the field of cancer research. DHFR, which is protein in nature, permits the genetic evolution of carcinoma cells. In order to mitigate such consequences, the DHFR pathway needs to be halted. Various principles such as structure-based drug design, computational issues, complementarity, docking strategies, and rigid and flexible approaches to docking will provide solutions for developing new drugs in the future. At present, with numerous drugs and discovery tools, identifying effective and potent small-molecule candidates is expedient.
The DHFR (dihydrofolate reductase) protein and the drugs of interest are downloaded from databases such as RCSB PDB (Protein Data Bank) and PubChem. PDB provides the pdb files of the protein and PubChem gives the sdf files of the drugs, which are later converted with Open Babel, a tool that converts almost every protein and drug format into the format of interest. If sequence similarities need to be examined for further information or research purposes, MEGA6 is handy for viewing the conserved regions of the protein sequence against a reference sequence. CLUSTAL Omega, software provided by EMBL-EBI in Europe, can also be used. To find the binding site for the drug within the protein, POCASA 1.2 and CASTp are used. CASTp provides more detailed information about the binding pocket, whereas POCASA gives the specific and exact positions of the amino acids involved in binding the drug within the protein of interest. AutoDock, PyMOL, PyRx and Vina use rectangular boxes to identify the binding site in the protein or receptor.
AutoDockTools provides an interactive method for defining the torsional tree, mostly for a given ligand and rarely for the receptor. PyRx was also used to determine the binding affinity of the drug towards the protein, as well as the binding site. PyRx is automated software in the sense that every parameter and algorithm is preset or default.
In the analysis of established cancer drugs and small molecules of different classes, different binding affinities towards the DHFR protein are seen. Among the established drugs, a second-generation antifolate drug showed a higher affinity of -11.6 kcal/mol, and a small molecule that is also a glucosidase inhibitor showed an affinity of -13.0 kcal/mol. Our algorithm provides a distinct and clear picture of the efficiency of the established anticancer drugs that have worked through the DHFR pathway across generations. On the other hand, small molecules from different classes also show good binding affinity towards the DHFR protein, which leads us to an extended version of this work: finding the anticancer activity of very specific classes of molecules other than those related to cancer.
Formal Validation of Neural Networks as Timed Automata
ABSTRACT. We propose a formalisation of spiking neural networks based on timed automata networks. Neurons are modelled as timed automata waiting for inputs on a number of different channels (synapses), for a given amount of time (the accumulation period). When this period is over, the current potential value is computed taking into account the current inputs and the previous decayed potential value. If the current potential overcomes a given threshold, the automaton emits a broadcast signal over its output channel, otherwise it restarts another accumulation period. After each emission, the automaton is constrained to remain inactive for a fixed refractory period. Spiking neural networks are formalised as sets of automata, one for each neuron, running in parallel and sharing channels according to the structure of the network. The model is then validated against some crucial properties defined via proper temporal logic formulae.
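The neuron behaviour described in the abstract above can be sketched as a small discrete simulation. This is a simplification under stated assumptions, not the timed-automata model itself: each loop iteration stands for one accumulation period, the refractory period is counted in whole accumulation periods, the potential is assumed to reset to zero after a spike, and the parameter values are arbitrary.

```python
def simulate_neuron(input_sums, threshold=1.0, decay=0.5, refractory_periods=1):
    """Return the indices of accumulation periods in which the neuron spikes.

    input_sums[k] is the weighted sum of inputs received during period k.
    """
    potential = 0.0
    resting = 0  # remaining refractory periods
    spikes = []
    for k, weighted_input in enumerate(input_sums):
        if resting > 0:
            resting -= 1  # inputs are ignored while the neuron is refractory
            continue
        # Current inputs plus the decayed potential from the previous period.
        potential = weighted_input + decay * potential
        if potential > threshold:
            spikes.append(k)       # emit on the output channel
            potential = 0.0        # assumed reset after emission
            resting = refractory_periods
        # Otherwise a new accumulation period simply starts.
    return spikes


spike_times = simulate_neuron([0.6] * 9)
```

With a constant input of 0.6, the potential builds up over three periods before crossing the threshold, the neuron then stays silent for one refractory period, and the cycle repeats.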
Estimating respiration rate using an accelerometer sensor
ABSTRACT. Breathing activity can be independently measured electronically, e.g., using a thoracic belt or a nasal thermistor, or be reconstructed from noninvasive measurements such as an ECG. In this paper, the use of an accelerometer sensor to measure respiratory activity is presented. Movement of the chest was recorded by an accelerometer sensor attached to a belt around the chest. Signals were acquired in different states (normal, apnea, deep breathing, or after exhaustion) and in different postures: vertical (sitting, standing) or horizontal (lying down). The results of the experimental evaluation indicate that a chest accelerometer can correctly detect the waveform and the respiration rate. This method could, therefore, be suitable for automatic identification of some respiratory malfunctions, for example during obstructive apnea.
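One simple way to turn a chest-accelerometer trace into a respiration rate is to count how often the signal rises through its mean. The sketch below is an assumption-laden illustration and not the method used in the paper: it works on a clean synthetic sinusoid, whereas real recordings would need filtering to suppress noise and motion artefacts before counting. The sampling rate, breathing frequency and amplitude are invented.

```python
import math


def respiration_rate_bpm(samples, sampling_hz):
    """Estimate breaths per minute by counting upward crossings of the mean."""
    mean = sum(samples) / len(samples)
    centered = [s - mean for s in samples]
    crossings = sum(
        1 for prev, cur in zip(centered, centered[1:]) if prev < 0 <= cur
    )
    duration_min = len(samples) / sampling_hz / 60.0
    return crossings / duration_min


# Synthetic trace: 0.25 Hz breathing riding on a constant gravity offset,
# sampled at 10 Hz for 60 seconds.
fs = 10
signal = [
    9.81 + 0.2 * math.sin(2 * math.pi * 0.25 * (n / fs) + 1.0)
    for n in range(fs * 60)
]
rate = respiration_rate_bpm(signal, fs)
```

For the synthetic 0.25 Hz trace this yields 15 breaths per minute; a flat trace, as during apnea, produces no crossings and therefore a rate of zero.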
Modelling and Formal Verification of Neuronal Archetypes Coupling
ABSTRACT. In the literature, neuronal networks are often represented as graphs where each
node symbolizes a neuron and each arc stands for a synaptic connection. Some
specific neuronal graphs have biologically relevant structures and behaviors
and we call them archetypes. Six of them have already been characterized and
validated using formal methods. In this work, we tackle the next logical step and
proceed to the study of the properties of their couplings. For this purpose, we
rely on Leaky Integrate and Fire neuron modeling and we use the synchronous
programming language Lustre to implement the neuronal archetypes and to
formalize their expected properties. Then, we exploit an associated model checker
called kind2 to automatically validate these behaviors. We show that, when the
archetypes are coupled, either these behaviors are slightly modulated or
they give way to a brand new behavior. We can also observe that different
archetype couplings can give rise to strictly identical behaviors. Our results
show that time coding modeling is better suited than rate coding modeling for this
kind of study.
Identifying microRNA targets in epithelial-mesenchymal transition using joint-intervention causal inference
ABSTRACT. microRNAs (miRNAs) are important gene regulators, controlling a wide range of biological processes and being involved in several types of cancers. Thus, exploring miRNA functions is important for diagnostics and therapeutics. Currently, several computational approaches have been developed to elucidate the miRNA-mRNA regulatory relationships. However, these approaches have their own limitations and we are still far from understanding the miRNA-mRNA relationships, especially in specific biological processes. In this paper, we adapt a causal inference method to infer miRNA targets from the Epithelial Mesenchymal Transition (EMT) dataset. EMT is a key process of cancer metastasis, and therefore elucidating miRNA-mRNA relationships in EMT plays an important role in understanding cancer metastasis. Our method utilises a causality-based method that estimates the causal effect of each miRNA on an mRNA while controlling the effects of other miRNAs on the mRNA. The inferred causal effect is similar to the effect of a miRNA on an mRNA when we knock out all the other miRNAs. The experimental results show that our method is better than existing benchmark methods in finding experimentally confirmed miRNA targets. Moreover, we have found that the miR-200 family members (miR-141, miR-200a/b/c, and miR-429) synergistically regulate a number of target genes in EMT, suggesting their roles in controlling cancer metastasis. In addition, functional and pathway enrichment analyses show that the discovered miRNA-mRNA regulatory relationships are highly enriched in EMT, implying the validity of the proposed method. Novel miRNA-mRNA regulatory relationships discovered by our method provide a rich resource for follow-up wet-lab experiments and EMT-related studies.
Extraction of disease-related genes from PubMed paper using word2vec
ABSTRACT. Finding disease-related genes is important in drug discovery. Many genes are involved in a given disease, and many studies have been conducted and reported for each disease. However, it is very costly to check these one by one, so machine learning is a suitable way to address this problem. By extracting study results from research papers through text mining, it is possible to make use of that knowledge. In this research, we aim to extract disease-related genes from PubMed papers using word2vec, which is a text mining method. The method extracts the top 10 genes whose word2vec vectors are closest to those of known disease genes. From these, genes other than known disease-related genes are extracted and treated as candidate disease-related genes. We conducted experiments on schizophrenia and assessed the plausibility of the extracted disease-related genes using random forest with three patterns of input genes. Pattern 1: Only known genes. Pattern 2: Pattern 1 plus disease-related genes extracted in this study. Pattern 3: Pattern 1 plus the same number of random genes. Using these three patterns, we ran a random forest on microarray data and compared the classification accuracy. Pattern 2 had the highest accuracy. Therefore, our method was able to extract disease-related genes.
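The nearest-vector step described in the abstract above can be sketched with cosine similarity over toy embeddings. This follows one possible reading of the abstract, ranking each candidate by its best similarity to any known disease gene; the gene names and three-dimensional vectors are invented, whereas a real run would take the vectors from a word2vec model trained on PubMed text.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def candidate_genes(vectors, known_genes, top_k=10):
    """Rank genes outside known_genes by best similarity to any known gene."""
    scores = {}
    for gene, vec in vectors.items():
        if gene in known_genes:
            continue  # only genes not already known are candidates
        scores[gene] = max(cosine(vec, vectors[k]) for k in known_genes)
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


# Hypothetical embeddings; real ones would have hundreds of dimensions.
vectors = {
    "GENE_K1": [1.0, 0.0, 0.0],
    "GENE_K2": [0.0, 1.0, 0.0],
    "GENE_A": [0.9, 0.1, 0.0],
    "GENE_B": [0.0, 0.2, 0.9],
    "GENE_C": [0.1, 0.8, 0.1],
}
known = {"GENE_K1", "GENE_K2"}
top = candidate_genes(vectors, known, top_k=2)
```

Here `GENE_A` and `GENE_C` are returned because their vectors point in nearly the same direction as one of the known genes, while `GENE_B` is ranked last.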
19:00-21:30 Gala Dinner at Champa Island, 304, 2/4 road, Nha Trang