The website introduces a suite of deep learning tools we have developed for learning patterns and making predictions from data sets in biomedicine.
Recent advances in next-generation sequencing have allowed biologists to profile large volumes of DNA sequence, gene expression, and chromatin patterns across many cell types covering the full human genome. These datasets have been made available through large-scale repositories such as ENCODE, REMC, and TCGA. Processing and understanding this “big” data poses computational challenges that conventional bioinformatics analysis cannot handle.
We have designed novel and robust representation-learning and deep learning algorithms to process this flood of genome-wide datasets.
Here are two figures showing how our tools can be categorized, either by the biology datasets we work on or by the deep learning techniques we developed.
Here is a table of these tools, linking to the page that describes each tool in more detail:
|No.||Tool Name||BioData||Short Description||Venue|
|0||AttentiveChrome||Epigenomics||Attend and Predict: Using Deep Attention Model to Understand Gene Regulation by Selective Attention on Chromatin||NIPS17|
|1||DeepChrome||Epigenomics||Deep-learning for predicting gene expression from histone modifications||Bioinformatics16|
|2||DeepMotif||Functional Genomics||Visualizing and Understanding Genomic Sequences Using Deep Neural Networks||PSB17|
|3||Prototype Matching Net||Functional Genomics||Prototype Matching Networks: A novel deep learning architecture for Large-Scale Multi-label Genomic Sequence Classification||Submission18|
|4||Memory Matching Net||Functional Genomics||Memory Matching Networks for Genomic Sequence Classification||ICLRwkp17|
|5||MUST-CNN||Protein Tagging||A Multilayer Shift-and-Stitch Deep Convolutional Architecture for Sequence-based Protein Structure Prediction||AAAI 16|
|6||GakCo-SVM||Biomedical Sequences||A Fast GApped k-mer string Kernel using COunting||ECML17|
|7||MultitaskProteinTag||Protein Tagging||A unified multitask architecture for predicting local protein properties||PLOS 12|
|8||TransferSK-SVM||Functional Genomics||Transfer String Kernel for Cross-Context DNA-Protein Binding Prediction||TCBB15|
Background of Learning: Representation Learning and Deep Learning
The performance of machine learning algorithms depends largely on the data representation (or features) to which they are applied. Deep learning aims to discover learning algorithms that can find multiple levels of representation directly from data, with higher levels representing more abstract concepts. In recent years, deep learning has led to groundbreaking performance in many applications such as computer vision, speech understanding, natural language processing, and computational biology.
Background of Biology Relevant to Our DeepChrome and AttentiveChrome Tools
DNA is a long string of paired chemical units that fall into four different types (A, T, C, and G). DNA carries information organized into units such as genes. The set of DNA in a cell is called its genome.
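Because DNA is a string over four letters, it has to be turned into numbers before a learning algorithm can use it. The following is a minimal sketch, assuming a one-hot encoding, which is a common choice but not necessarily the preprocessing used by every tool listed above. The helper name `one_hot_dna` and the column order A, C, G, T are assumptions made for illustration.

```python
# Hypothetical helper (not taken from any of the tools above):
# one-hot encode a DNA string over the alphabet A, C, G, T.
ALPHABET = "ACGT"  # assumed column order


def one_hot_dna(sequence):
    """Return one row per base; each row has a single 1 marking that base.

    Characters outside A/C/G/T (for example N) become an all-zero row.
    """
    rows = []
    for base in sequence.upper():
        rows.append([1 if base == letter else 0 for letter in ALPHABET])
    return rows


encoded = one_hot_dna("ATCG")
for base, row in zip("ATCG", encoded):
    print(base, row)
```

Each base becomes a length-4 vector, so a sequence of length n becomes an n x 4 matrix that a convolutional or recurrent model can consume directly.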
Gene regulation is the process by which a cell controls which genes in its genome are turned on (expressed) or off (not expressed). The human body contains hundreds of different cell types, from liver cells to blood cells to neurons. Although these cells carry the same DNA information, they function differently.
The regulation of different genes controls the destiny and function of each cell.
In addition to DNA sequence information, many factors, especially those in its environment (i.e., chromatin), can affect which genes the cell expresses. Our tools aim to provide novel machine learning architectures, especially deep learning based ones, that learn from data how different chromatin factors, DNA sequences, and other environmental factors influence gene expression in a cell. Such an understanding of gene regulation can enable new insights into the principles of life, the study of disease, and drug development.
‘Chromatin’ denotes the complex of DNA, histones, and other structural proteins. A cell uses specialized proteins to organize DNA in a condensed structure. These proteins include histones, which form ‘bead’-like structures that DNA wraps around, making the DNA more organized and compact. An important aspect of histone proteins is that they are prone to chemical modifications that can change the spatial arrangement of DNA. As a result, certain DNA regions become accessible or restricted, which affects the expression of genes in the neighboring region. Researchers have proposed the ‘Histone Code Hypothesis’, which explores the role of histone modifications in controlling gene regulation. Unlike genetic mutations, chromatin changes such as histone modifications are potentially reversible. This crucial difference makes understanding how chromatin factors determine gene regulation even more impactful, because the knowledge can help develop drugs targeting genetic diseases.
At the whole genome level, researchers are trying to chart the locations and intensities of all the chemical modifications, referred to as marks, over the chromatin. In biology this field is called epigenetics. ‘Epi’ in Greek means over. The epigenome in a cell is the set of chemical modifications over the chromatin that alter gene expression.
Recent advances in next-generation sequencing have allowed biologists to profile large volumes of gene expression and chromatin patterns as signals (or read counts) across many cell types covering the full human genome.
These datasets have been made available through large-scale repositories, the latest being the Roadmap Epigenomics Project (REMC, publicly available).
REMC recently released 2,804 genome-wide datasets, among which 166 datasets are gene expression reads (RNA-Seq datasets) and the rest are signal reads of various chromatin marks across 100 different ‘normal’ human cells/tissues (1,821 datasets for histone modification marks).
The fundamental aim of processing and understanding this repository of ‘big’ data is to understand gene regulation. For each cell type, we want to know which chromatin marks are the most important and how they work together in controlling gene expression. However, previous machine learning studies on this task either failed to model spatial dependencies among mark signals or required additional feature analysis to explain the predictions.
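To make "spatial dependencies among mark signals" concrete, one way to frame the input is as a small matrix with one row per chromatin mark and one column per genomic bin near a gene. The sketch below is illustrative only: the function names, the toy mark names, and the bin size are assumptions, not the preprocessing of DeepChrome or AttentiveChrome.

```python
# Illustrative sketch only; names, marks, and bin size are assumptions.


def bin_signal(read_counts, bin_size):
    """Sum per-position read counts into consecutive bins of width bin_size.

    A trailing partial bin is kept rather than dropped.
    """
    return [
        sum(read_counts[start:start + bin_size])
        for start in range(0, len(read_counts), bin_size)
    ]


def build_input_matrix(signals_by_mark, bin_size):
    """Stack binned signals, one row per mark, in sorted mark-name order."""
    marks = sorted(signals_by_mark)
    matrix = [bin_signal(signals_by_mark[mark], bin_size) for mark in marks]
    return marks, matrix


# Toy per-position read counts for two made-up marks around one gene.
toy_signals = {
    "markA": [0, 1, 2, 3, 4, 5],
    "markB": [5, 0, 0, 1, 0, 0],
}
marks, matrix = build_input_matrix(toy_signals, bin_size=3)
for mark, row in zip(marks, matrix):
    print(mark, row)
```

Keeping the bins in genomic order preserves the neighborhood structure along the genome, which is what lets a model with local receptive fields or sequential state pick up patterns that span adjacent bins and multiple marks.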
Thanks for reading!