Deep learning to study the fundamental biological processes underlying human disease

The study of cellular structure and core biological processes—transcription, translation, signalling, metabolism, etc.—in humans and model organisms will greatly impact our understanding of human disease over the long horizon.

Predicting how cellular systems respond to environmental perturbations and are altered by genetic variation remains a daunting task. Deep learning offers new approaches for modelling biological processes and integrating multiple types of omic data, which could eventually help predict how these processes are disrupted in disease. Recent work has already advanced our ability to identify and interpret genetic variants, study microbial communities and predict protein structures, which also relates to the problems discussed in the drug development section. In addition, unsupervised deep learning has enormous potential for discovering novel cellular states from gene expression, fluorescence microscopy and other types of data that may ultimately prove to be clinically relevant.

Progress has been rapid in genomics and imaging, fields where important tasks are readily adapted to well-established deep learning paradigms. One-dimensional CNNs and RNNs are well suited for tasks related to DNA- and RNA-binding proteins, epigenomics and RNA splicing. Two-dimensional CNNs are ideal for segmentation, feature extraction and classification in fluorescence microscopy images. Other areas, such as cellular signalling, are biologically important but have been studied less frequently to date, with some exceptions. This may be a consequence of data limitations or greater challenges in adapting neural network architectures to the available data. Here, we highlight several areas of investigation and assess how deep learning might move these fields forward.

Gene expression

Gene expression technologies characterize the abundance of many thousands of RNA transcripts within a given organism, tissue or cell. This characterization can represent the underlying state of the given system and can be used to study heterogeneity across samples as well as how the system reacts to perturbation. While gene expression measurements were traditionally made by quantitative polymerase chain reaction, low-throughput fluorescence-based methods and microarray technologies, the field has shifted in recent years to primarily performing RNA sequencing (RNA-seq) to catalogue whole transcriptomes. As RNA-seq continues to fall in price and rise in throughput, sample sizes will increase and training deep models to study gene expression will become even more useful.

Several deep learning approaches have already been applied to gene expression data with varying aims. For instance, many researchers have applied unsupervised deep learning models to extract meaningful representations of gene modules or sample clusters. Denoising autoencoders have been used to cluster yeast expression microarrays into known modules representing cell cycle processes and to stratify yeast strains based on chemical and mutational perturbations. Shallow (one hidden layer) denoising autoencoders have also been fruitful in extracting biological insight from thousands of Pseudomonas aeruginosa experiments and in aggregating features relevant to specific breast cancer subtypes. These unsupervised approaches applied to gene expression data are powerful methods for identifying gene signatures that may otherwise be overlooked. An additional benefit of unsupervised approaches is that ground-truth labels, which are often difficult to acquire or are incorrect, are not required. However, the genes that have been aggregated into features must be interpreted carefully. Attributing each node to a single specific biological function risks over-interpreting models. Batch effects could cause models to discover non-biological features, and downstream analyses should take this into consideration.

Deep learning approaches are also being applied to gene expression prediction tasks. For example, a deep neural network with three hidden layers outperformed linear regression in inferring the expression of over 20 000 target genes based on a representative, well-connected set of about 1000 landmark genes. However, while the deep learning model outperformed existing algorithms in nearly every scenario, its absolute performance remained poor. The paper was also limited by computational bottlenecks that required the data to be split randomly between two distinct models that were trained separately. It is unclear how much performance would have increased if not for computational restrictions.
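Returning to the unsupervised models discussed above, the idea of a shallow denoising autoencoder can be illustrated with a small sketch. It is not the implementation used in any of the cited studies: the toy data, the masking-noise corruption, the tied weights, the squared-error loss and all names (`train_dae`, `toy_expression`, `corruption`) are assumptions chosen only to keep the example short.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def reconstruct(X, W, b_hidden, b_out):
    """Encode X into hidden features, then decode with tied weights (W transposed)."""
    hidden = sigmoid(X @ W + b_hidden)
    return sigmoid(hidden @ W.T + b_out), hidden

def reconstruction_loss(X, W, b_hidden, b_out):
    """Half the squared reconstruction error, averaged over samples."""
    X_hat, _ = reconstruct(X, W, b_hidden, b_out)
    return 0.5 * np.sum((X_hat - X) ** 2) / X.shape[0]

def train_dae(X, n_hidden=2, corruption=0.2, lr=0.5, epochs=300, seed=0):
    """Full-batch gradient descent on a one-hidden-layer denoising autoencoder."""
    rng = np.random.default_rng(seed)
    n_samples, n_genes = X.shape
    W = rng.normal(0.0, 0.1, size=(n_genes, n_hidden))
    b_hidden = np.zeros(n_hidden)
    b_out = np.zeros(n_genes)
    for _ in range(epochs):
        # Masking noise: randomly zero a fraction of the inputs.
        keep = rng.random(X.shape) >= corruption
        X_noisy = X * keep
        X_hat, hidden = reconstruct(X_noisy, W, b_hidden, b_out)
        # The target is the clean input, which is what makes this "denoising".
        err = X_hat - X
        d_out = err * X_hat * (1.0 - X_hat) / n_samples
        d_hidden = (d_out @ W) * hidden * (1.0 - hidden)
        # Tied weights receive gradient from both the encoder and the decoder.
        grad_W = X_noisy.T @ d_hidden + d_out.T @ hidden
        W -= lr * grad_W
        b_hidden -= lr * d_hidden.sum(axis=0)
        b_out -= lr * d_out.sum(axis=0)
    return W, b_hidden, b_out

def toy_expression(n_per_group=10, seed=1):
    """Hypothetical data: 8 genes in two modules, expression scaled to [0, 1]."""
    rng = np.random.default_rng(seed)
    X = np.full((4 * n_per_group, 8), 0.1)
    X[:n_per_group, :4] = 0.9                   # module A active
    X[n_per_group:2 * n_per_group, 4:] = 0.9    # module B active
    X += rng.normal(0.0, 0.03, size=X.shape)
    return np.clip(X, 0.0, 1.0)

X = toy_expression()
W, b_hidden, b_out = train_dae(X)
print("loss after training:", reconstruction_loss(X, W, b_hidden, b_out))
print("gene weights per hidden node:\n", np.round(W, 2))
```

Each column of `W` assigns a weight to every gene, which is the sense in which a hidden node aggregates genes into a feature. As noted above, reading a single biological function into any one node risks over-interpretation.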

Epigenomic data, combined with deep learning, may have sufficient explanatory power to infer gene expression. For instance, the DeepChrome CNN improved the prediction accuracy of high or low gene expression from histone modifications over existing methods. AttentiveChrome added a deep attention model to further enhance DeepChrome. Deep learning can also integrate different data types. For example, Liang et al. combined RBMs to integrate gene expression, DNA methylation and miRNA data to define ovarian cancer subtypes. While these approaches are promising, many convert gene expression measurements to categorical or binary variables, thus discarding the complex gene expression signatures carried by intermediate and relative expression levels.

Deep learning applied to gene expression data is still in its infancy, but the future is bright. Many previously untestable hypotheses can now be interrogated as deep learning enables analysis of increasing amounts of data generated by new technologies. For example, the effects of cellular heterogeneity on basic biology and disease aetiology can now be explored by single-cell RNA-seq and high-throughput fluorescence-based imaging, techniques we discuss below that will benefit immensely from deep learning approaches.


Splicing

Pre-mRNA transcripts can be spliced into different isoforms by retaining or skipping subsets of exons or including parts of introns, creating enormous spatio-temporal flexibility to generate multiple distinct proteins from a single gene. This remarkable complexity can lend itself to defects that underlie many diseases. For instance, splicing mutations in the lamin A (LMNA) gene can lead to specific variants of dilated cardiomyopathy and limb-girdle muscular dystrophy. A recent study found that quantitative trait loci that affect splicing in lymphoblastoid cell lines are enriched within risk loci for schizophrenia, multiple sclerosis and other immune diseases, implicating mis-splicing as a more widespread feature of human pathologies than previously thought. Therapeutic strategies that aim to modulate splicing are also currently being considered for disorders such as Duchenne muscular dystrophy and spinal muscular atrophy.

Sequencing studies routinely return thousands of unannotated variants, but which cause functional changes in splicing and how are those changes manifested? Prediction of a ‘splicing code’ has been a goal of the field for the past decade. Initial machine learning approaches used a naive Bayes model and a two-layer Bayesian neural network with thousands of hand-derived sequence-based features to predict the probability of exon skipping. With the advent of deep learning, more complex models provided better predictive accuracy. Importantly, these new approaches can take in multiple kinds of epigenomic measurements as well as tissue identity and RNA-binding partners of splicing factors. Deep learning is critical in furthering these kinds of integrative studies where different data types and inputs interact in unpredictable (often nonlinear) ways to create higher-order features. Moreover, as in gene expression network analysis, interrogating the hidden nodes within neural networks could potentially illuminate important aspects of splicing behaviour. For instance, tissue-specific splicing mechanisms could be inferred by training networks on splicing data from different tissues, then searching for common versus distinctive hidden nodes, a technique employed by Qin et al. for tissue-specific transcription factor (TF) binding predictions.

A parallel effort has been to use more data with simpler models. An exhaustive study using readouts of splicing for millions of synthetic intronic sequences uncovered motifs that influence the strength of alternative splice sites. The authors built a simple linear model using hexamer motif frequencies that successfully generalized to exon skipping. In a limited analysis using single-nucleotide polymorphisms (SNPs) from three genes, it predicted exon skipping with three times the accuracy of an existing deep learning-based framework. This case is instructive in that clever sources of data, not just more descriptive models, are still critical.
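To make the notion of a linear model over hexamer frequencies concrete, the following sketch counts overlapping 6-mers in a sequence and combines their frequencies with per-hexamer weights. The function names and the weights are hypothetical placeholders; in the study described above the weights were learned from the synthetic splicing readouts, and the exact feature definition may differ from this simplified version.

```python
from collections import Counter

def kmer_frequencies(seq, k=6):
    """Frequency of each overlapping k-mer, normalized by the number of windows."""
    seq = seq.upper()
    n_windows = len(seq) - k + 1
    if n_windows <= 0:
        raise ValueError("sequence is shorter than k")
    counts = Counter(seq[i:i + k] for i in range(n_windows))
    return {kmer: count / n_windows for kmer, count in counts.items()}

def linear_score(seq, weights, bias=0.0, k=6):
    """Weighted sum of k-mer frequencies; k-mers without a weight contribute zero."""
    freqs = kmer_frequencies(seq, k)
    return bias + sum(weights.get(kmer, 0.0) * f for kmer, f in freqs.items())

# Hypothetical weights, for illustration only (not fitted to any data).
example_weights = {"ACGTAC": 0.5, "GTACGT": -0.25}
print(linear_score("ACGTACGTAC", example_weights))
```

A model of this form has one interpretable parameter per hexamer, which is part of why it is considered simple relative to a deep network.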

We already understand how mis-splicing of a single gene can cause diseases such as limb-girdle muscular dystrophy. The challenge now is to uncover how genome-wide alternative splicing underlies complex, non-Mendelian diseases such as autism, schizophrenia, Type 1 diabetes and multiple sclerosis. As a proof of concept, Xiong et al. sequenced five autism spectrum disorder and 12 control samples, each with an average of 42 000 rare variants, and identified mis-splicing in 19 genes with neural functions. Such methods may one day enable scientists and clinicians to rapidly profile thousands of unannotated variants for functional effects on splicing and nominate candidates for further investigation. Moreover, these nonlinear algorithms can deconvolve the effects of multiple variants on a single splice event without the need to perform combinatorial in vitro experiments. The ultimate goal is to predict an individual’s tissue-specific, exon-specific splicing patterns from their genome sequence and other measurements to enable a new branch of precision diagnostics that also stratifies patients and suggests targeted therapies to correct splicing defects. However, to achieve this we expect that methods to interpret the ‘black box’ of deep neural networks and integrate diverse data sources will be required.

Transcription factors

TFs are proteins that bind regulatory DNA in a sequence-specific manner to modulate the activation and repression of gene transcription. High-throughput in vitro experimental assays that quantitatively measure the binding specificity of a TF to a large library of short oligonucleotides provide rich datasets to model the naked DNA sequence affinity of individual TFs in isolation. However, in vivo TF binding is affected by a variety of other factors beyond sequence affinity, such as competition and cooperation with other TFs, TF concentration and chromatin state (chemical modifications to DNA and to the packaging proteins around which DNA is wrapped). TFs can thus exhibit highly variable binding landscapes over the same genomic DNA sequence across diverse cell types and states. Several experimental approaches such as chromatin immunoprecipitation followed by sequencing (ChIP-seq) have been developed to profile in vivo binding maps of TFs. Large reference compendia of ChIP-seq data are now freely available for a large collection of TFs in a small number of reference cell states in humans and a few other model organisms. Owing to fundamental material and cost constraints, it is infeasible to perform these experiments for all TFs in every possible cellular state and species. Hence, predictive computational models of TF binding are essential to understand gene regulation in diverse cellular contexts.

Several machine learning approaches have been developed to learn generative and discriminative models of TF binding from in vitro and in vivo TF binding datasets that associate collections of synthetic DNA sequences or genomic DNA sequences to binary labels (bound/unbound) or continuous measures of binding. The most common class of TF binding models in the literature are those that only model the DNA sequence affinity of TFs from in vitro and in vivo binding data. The earliest models were based on deriving simple, compact, interpretable sequence motif representations such as position weight matrices (PWMs) and other biophysically inspired models. These models were outperformed by general k-mer-based models including support vector machines (SVMs) with string kernels.

In 2015, Alipanahi et al. developed DeepBind, the first CNN to classify bound DNA sequences based on in vitro and in vivo assays against random DNA sequences matched for dinucleotide sequence composition. The convolutional layers learn pattern detectors reminiscent of PWMs from a one-hot encoding of the raw input DNA sequences. DeepBind outperformed several state-of-the-art methods from the DREAM5 in vitro TF-DNA motif recognition challenge. Although DeepBind was also applied to RNA-binding proteins, in general, RNA binding is a separate problem and accurate models will need to account for RNA secondary structure. Following DeepBind, several optimized convolutional and recurrent neural network architectures as well as novel hybrid approaches that combine kernel methods with neural networks have been proposed that further improve performance. Specialized layers and regularizers have also been proposed to reduce parameters and learn more robust models by taking advantage of specific properties of DNA sequences such as their reverse complement equivalence.
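The relationship between one-hot encoded DNA, convolutional filters and PWM-like pattern detectors can be sketched in a few lines of NumPy. This is not DeepBind's code: the filter here is a hand-written exact-match detector rather than a learned one, and the names (`one_hot`, `scan`, `strand_invariant_score`) are illustrative assumptions. The last function shows one simple way to respect reverse-complement equivalence, by scoring both strands and keeping the maximum.

```python
import numpy as np

BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as a (length, 4) matrix; unknown bases stay all-zero."""
    index = {base: i for i, base in enumerate(BASES)}
    x = np.zeros((len(seq), 4))
    for pos, base in enumerate(seq.upper()):
        if base in index:
            x[pos, index[base]] = 1.0
    return x

def scan(x, filt):
    """Slide a (width, 4) filter along the sequence, as a 1D convolution layer would."""
    width = filt.shape[0]
    n_windows = x.shape[0] - width + 1
    return np.array([np.sum(x[i:i + width] * filt) for i in range(n_windows)])

def reverse_complement_one_hot(x):
    """With columns ordered A, C, G, T, flipping both axes gives the reverse complement."""
    return x[::-1, ::-1]

def strand_invariant_score(seq, filt):
    """Max-pool the filter response over positions and over both strands."""
    x = one_hot(seq)
    forward = scan(x, filt).max()
    reverse = scan(reverse_complement_one_hot(x), filt).max()
    return max(forward, reverse)

# A toy filter that counts matches to the motif TGA (a real filter would hold learned weights).
motif_filter = one_hot("TGA")
print(scan(one_hot("CCTGACC"), motif_filter))
print(strand_invariant_score("GGTCAGG", motif_filter))
```

In a trained network the filter entries are real-valued weights, so the response at each position resembles a PWM score rather than a match count.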

While most of these methods learn independent models for different TFs, in vivo multiple TFs compete or cooperate to occupy DNA binding sites, resulting in complex combinatorial co-binding landscapes. To take advantage of this shared structure in in vivo TF binding data, multi-task neural network architectures have been developed that explicitly share parameters across models for multiple TFs. Some of these multi-task models train and evaluate classification performance relative to an unbound background set of regulatory DNA sequences sampled from the genome rather than using synthetic background sequences with matched dinucleotide composition.

The above-mentioned TF binding prediction models that use only DNA sequences as inputs have a fundamental limitation. Because the DNA sequence of a genome is the same across different cell types and states, a sequence-only model of TF binding cannot predict different in vivo TF binding landscapes in new cell types not used during training. One approach for generalizing TF binding predictions to new cell types is to learn models that integrate DNA sequence inputs with other cell-type-specific data modalities that modulate in vivo TF binding such as surrogate measures of TF concentration (e.g. TF gene expression) and chromatin state. Arvey et al. showed that combining the predictions of SVMs trained on DNA sequence inputs and cell-type specific DNase-seq data, which measures genome-wide chromatin accessibility, improved in vivo TF binding prediction within and across cell types. Several ‘footprinting’-based methods have also been developed that learn to discriminate bound from unbound instances of known canonical motifs of a target TF based on high-resolution footprint patterns of chromatin accessibility that are specific to the target TF. However, the genome-wide predictive performance of these methods in new cell types and states has not been evaluated.

Recently, a community challenge known as the ‘ENCODE-DREAM in vivo TF Binding Site Prediction Challenge’ was introduced to systematically evaluate the genome-wide performance of methods that can predict TF binding across cell states by integrating DNA sequence and in vitro DNA shape with cell-type-specific chromatin accessibility and gene expression. A deep learning model called FactorNet was among the top three performing methods in the challenge. FactorNet uses a multimodal hybrid convolutional and recurrent architecture that integrates DNA sequence with chromatin accessibility profiles, gene expression and evolutionary conservation of sequence. It is worth noting that FactorNet was slightly outperformed by an approach that does not use neural networks. This top ranking approach uses an extensive set of curated features in a weighted variant of a discriminative maximum conditional likelihood model in combination with a novel iterative training strategy and model stacking. There appears to be significant room for improvement because none of the current approaches for cross cell-type prediction explicitly account for the fact that TFs can co-bind with distinct cofactors in different cell states. In such cases, sequence features that are predictive of TF binding in one cell state may be detrimental to predicting binding in another.

Singh et al. developed transfer string kernels for SVMs for cross-context TF binding. Domain adaptation methods that allow training neural networks which are transferable between differing training and test set distributions of sequence features could be a promising avenue going forward. These approaches may also be useful for transferring TF binding models across species.

Another class of imputation-based cross cell type in vivo TF binding prediction methods leverages the strong correlation between combinatorial binding landscapes of multiple TFs. Given a partially complete panel of binding profiles of multiple TFs in multiple cell types, a deep learning method called TFImpute learns to predict the missing binding profile of a target TF in some target cell type in the panel based on the binding profiles of other TFs in the target cell type and the binding profile of the target TF in other cell types in the panel. However, TFImpute cannot generalize predictions beyond the training panel of cell types and requires TF binding profiles of related TFs.

It is worth noting that TF binding prediction methods in the literature based on neural networks and other machine learning approaches choose to sample the set of bound and unbound sequences in a variety of different ways. These choices and the choice of performance evaluation measures significantly confound systematic comparison of model performance (see Discussion).

Several methods have also been developed to interpret neural network models of TF binding. Alipanahi et al. visualize convolutional filters to obtain insights into the sequence preferences of TFs. They also introduced in silico mutation maps for identifying important predictive nucleotides in input DNA sequences by exhaustively forward propagating perturbations to individual nucleotides to record the corresponding change in output prediction. Shrikumar et al. proposed efficient backpropagation-based approaches to simultaneously score the contribution of all nucleotides in an input DNA sequence to an output prediction. Lanchantin et al. developed tools to visualize TF motifs learned from TF binding site classification tasks. These and other general interpretation techniques (see Discussion) will be critical to improve our understanding of the biologically meaningful patterns learned by deep learning models of TF binding.
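The in silico mutation maps described above amount to a simple loop: substitute every alternative base at every position, rerun the model and record the change in its output. The sketch below assumes a generic `predict` callable that maps a DNA string to a score. The toy scorer stands in for a trained network and is purely hypothetical.

```python
import numpy as np

BASES = "ACGT"

def mutation_map(seq, predict):
    """Return a (4, length) matrix of score changes for every single-base substitution.

    Entry [b, i] is predict(mutant) - predict(seq), where the mutant carries
    base BASES[b] at position i. Entries for the reference base stay zero.
    """
    reference_score = predict(seq)
    deltas = np.zeros((4, len(seq)))
    for pos in range(len(seq)):
        for row, base in enumerate(BASES):
            if base == seq[pos]:
                continue
            mutant = seq[:pos] + base + seq[pos + 1:]
            deltas[row, pos] = predict(mutant) - reference_score
    return deltas

def toy_binding_score(seq, motif="TGA"):
    """Hypothetical stand-in for a model: best count of matches to a motif over all windows."""
    width = len(motif)
    return max(
        sum(a == b for a, b in zip(seq[i:i + width], motif))
        for i in range(len(seq) - width + 1)
    )

deltas = mutation_map("CCTGACC", toy_binding_score)
# The most damaging substitution per position highlights the predictive nucleotides.
print(deltas.min(axis=0))
```

The cost is three forward passes per nucleotide, which is why the backpropagation-based approaches mentioned above, which score all nucleotides at once, are described as more efficient.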

Promoters and enhancers

From transcription factor binding to promoters and enhancers. Multiple TFs act in concert to coordinate changes in gene regulation at the genomic regions known as promoters and enhancers. Each gene has an upstream promoter, essential for initiating that gene’s transcription. The gene may also interact with multiple enhancers, which can amplify transcription in particular cellular contexts. These contexts include different cell types in development or environmental stresses.

Promoters and enhancers provide a nexus where clusters of TFs and binding sites mediate downstream gene regulation, starting with transcription. The gold standard to identify an active promoter or enhancer requires demonstrating its ability to affect transcription or other downstream gene products. Even extensive biochemical TF binding data has thus far proven insufficient on its own to accurately and comprehensively locate promoters and enhancers. We lack sufficient understanding of these elements to derive a mechanistic ‘promoter code’ or ‘enhancer code’. But extensive labelled data on promoters and enhancers lends itself to probabilistic classification. The complex interplay of TFs and chromatin leading to the emergent properties of promoter and enhancer activity seems particularly apt for representation by deep neural networks.

Promoters. Despite decades of work, computational identification of promoters remains a stubborn problem. Researchers used neural networks for promoter recognition as early as 1996. Recently, a CNN recognized promoter sequences with sensitivity and specificity exceeding 90%. Most activity in computational prediction of regulatory regions, however, has moved to enhancer identification. Because one can identify promoters with straightforward biochemical assays, the direct rewards of promoter prediction alone have decreased. But the reliable ground-truth provided by these assays makes promoter identification an appealing test bed for deep learning approaches that can also identify enhancers.

Enhancers. Recognizing enhancers presents additional challenges. Enhancers may be up to 1 000 000 bp away from the affected promoter, and even within introns of other genes. Enhancers do not necessarily operate on the nearest gene and may affect multiple genes. Their activity is frequently tissue- or context-specific. No biochemical assay can reliably identify all enhancers. Distinguishing them from other regulatory elements remains difficult, and some believe the distinction somewhat artificial. While these factors make the enhancer identification problem more difficult, they also make a solution more valuable.

Several neural network approaches have yielded promising results in enhancer prediction. Both Basset and DeepEnhancer used CNNs to predict enhancers. DECRES used a feed-forward neural network to distinguish between different kinds of regulatory elements, such as active enhancers and promoters. DECRES had difficulty distinguishing between inactive enhancers and promoters. The DECRES authors also investigated the power of sequence features to drive classification, finding that beyond CpG islands, few were useful.

Comparing the performance of enhancer prediction methods illustrates the problems in using metrics created with different benchmarking procedures. Both the Basset and DeepEnhancer studies include comparisons to a baseline SVM approach, gkm-SVM. The Basset study reports gkm-SVM attains a mean area under the precision-recall curve (AUPR) of 0.322 over 164 cell types. The DeepEnhancer study reports for gkm-SVM a dramatically different AUPR of 0.899 on nine cell types. This large difference means it is impossible to directly compare the performance of Basset and DeepEnhancer based solely on their reported metrics. DECRES used a different set of metrics altogether. To drive further progress in enhancer identification, we must develop a common and comparable benchmarking procedure.
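One general reason why AUPR values from different benchmarks are hard to compare is that AUPR depends on the ratio of positive to negative examples. The sketch below illustrates this with made-up scores. It does not reproduce either study's evaluation, and class balance is only one possible contributor to the discrepancy reported above (an assumption, not a claim made by those studies). AUPR is approximated here by average precision.

```python
import numpy as np

def average_precision(labels, scores):
    """Average of the precision values at the rank of each positive example."""
    order = np.argsort(-scores, kind="stable")
    ranked = labels[order]
    true_positives = np.cumsum(ranked)
    precision = true_positives / np.arange(1, len(ranked) + 1)
    return float(np.sum(precision * ranked) / ranked.sum())

def auroc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return float((greater + 0.5 * ties) / (len(pos) * len(neg)))

# Made-up scores: the same positives, with the negatives either balanced or five times as common.
pos_scores = [0.9, 0.6]
neg_scores = [0.8, 0.3]

balanced_labels = np.array([1, 1, 0, 0])
balanced_scores = np.array(pos_scores + neg_scores)

imbalanced_labels = np.array([1, 1] + [0] * 10)
imbalanced_scores = np.array(pos_scores + [0.8] * 5 + [0.3] * 5)

for name, y, s in [("balanced", balanced_labels, balanced_scores),
                   ("imbalanced", imbalanced_labels, imbalanced_scores)]:
    print(name, "AUPR ~", round(average_precision(y, s), 3), "AUROC", auroc(y, s))
```

In this toy case the ranking quality per class is identical in both settings, so AUROC is unchanged, yet the average precision drops when negatives are more common. A shared benchmark with fixed positive and negative sets removes this source of variation.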

Promoter–enhancer interactions. In addition to the location of enhancers, identifying enhancer–promoter interactions in three-dimensional space will provide critical knowledge for understanding transcriptional regulation. SPEID used a CNN to predict these interactions with only sequence and the location of putative enhancers and promoters along a one-dimensional chromosome. It compared well to other methods using a full complement of biochemical data from ChIP-seq and other epigenomic methods. Of course, the putative enhancers and promoters used were themselves derived from epigenomic methods. But one could easily replace them with the output of one of the enhancer or promoter prediction methods above.

MicroRNA binding

Prediction of miRNAs and miRNA targets is of great interest, as they are critical components of gene regulatory networks and are often conserved across great evolutionary distance. While many machine learning algorithms have been applied to these tasks, they currently require extensive feature selection and optimization. For instance, one of the most widely adopted tools for miRNA target prediction, TargetScan, trained multiple linear regression models on 14 hand-curated features including structural accessibility of the target site on the mRNA, the degree of site conservation and predicted thermodynamic stability of the miRNA–mRNA complex. Some of these features, including structural accessibility, are imperfect or empirically derived. In addition, current algorithms suffer from low specificity.

As in other applications, deep learning promises to achieve equal or better performance in predictive tasks by automatically engineering complex features to minimize an objective function. Two recently published tools use different recurrent neural network-based architectures to perform miRNA and target prediction with solely sequence data as input. Though the results are preliminary and still based on a validation set rather than a completely independent test set, they were able to predict microRNA target sites with higher specificity and sensitivity than TargetScan. Excitingly, these tools seem to show that RNNs can accurately align sequences and predict bulges, mismatches and wobble base pairing without requiring the user to input secondary structure predictions or thermodynamic calculations. Further incremental advances in deep learning for miRNA and target prediction will likely be sufficient to meet the current needs of systems biologists and other researchers who use prediction tools mainly to nominate candidates that are then tested experimentally.

Protein secondary and tertiary structure

Proteins play fundamental roles in almost all biological processes, and understanding their structure is critical for basic biology and drug development. UniProt currently has about 94 million protein sequences, yet fewer than 100 000 proteins across all species have experimentally solved structures in Protein Data Bank (PDB). As a result, computational structure prediction is essential for a majority of proteins. However, this is very challenging, especially when similar solved structures, called templates, are not available in PDB. Over the past several decades, many computational methods have been developed to predict aspects of protein structure such as secondary structure, torsion angles, solvent accessibility, inter-residue contact maps, disorder regions and side-chain packing. In recent years, multiple deep learning architectures have been applied, including DBNs, LSTMs, CNNs and deep convolutional neural fields.

Here, we focus on deep learning methods for two representative sub-problems: secondary structure prediction and contact map prediction. Secondary structure refers to local conformation of a sequence segment, while a contact map contains information on all residue–residue contacts. Secondary structure prediction is a basic problem and an almost essential module of any protein structure prediction package. Contact prediction is much more challenging than secondary structure prediction, but it has a much larger impact on tertiary structure prediction. In recent years, the accuracy of contact prediction has greatly improved.
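A contact map can be derived from a solved structure by thresholding pairwise distances between residues. The sketch below assumes one representative atom per residue, a distance cut-off of 8 Å and a minimum sequence separation of 6 residues. These values follow common practice but are assumptions here, since the text above does not specify them, and the hairpin coordinates are invented for illustration.

```python
import numpy as np

def contact_map(coords, threshold=8.0, min_separation=6):
    """Boolean (L, L) matrix marking residue pairs closer than `threshold` angstroms.

    coords: (L, 3) array with one representative atom per residue.
    Pairs fewer than `min_separation` positions apart are ignored, because
    neighbours along the chain are trivially close.
    """
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    idx = np.arange(len(coords))
    separation = np.abs(idx[:, None] - idx[None, :])
    return (dist < threshold) & (separation >= min_separation)

# Invented hairpin: four residues along x, then four more running back 5 angstroms away.
hairpin = np.array([
    [0.0, 0.0, 0.0], [3.8, 0.0, 0.0], [7.6, 0.0, 0.0], [11.4, 0.0, 0.0],
    [11.4, 5.0, 0.0], [7.6, 5.0, 0.0], [3.8, 5.0, 0.0], [0.0, 5.0, 0.0],
])
contacts = contact_map(hairpin)
print(np.argwhere(np.triu(contacts)))
```

Contact prediction methods try to recover this L × L matrix from sequence alone, so every residue pair is a prediction target, in contrast to the per-residue labels of secondary structure.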

One can represent protein secondary structure with three different states (α-helix, β-strand and loop regions) or eight finer-grained states. The accuracy of a three-state prediction is called Q3, and accuracy of an eight-state prediction is called Q8. Several groups applied deep learning to protein secondary structure prediction but were unable to achieve significant improvement over the de facto standard method PSIPRED, which uses two shallow feed-forward neural networks. In 2014, Zhou & Troyanskaya demonstrated that they could improve Q8 accuracy by using a deep supervised and convolutional generative stochastic network. In 2016, Wang et al. developed a DeepCNF model that improved Q3 and Q8 accuracy as well as prediction of solvent accessibility and disorder regions. DeepCNF achieved a higher Q3 accuracy than the standard maintained by PSIPRED for more than 10 years. This improvement may be mainly due to the ability of convolutional neural fields to capture long-range sequential information, which is important for β-strand prediction. Nevertheless, the improvements in secondary structure prediction from DeepCNF are unlikely to result in a commensurate improvement in tertiary structure prediction because secondary structure mainly reflects coarse-grained local conformation of a protein structure.
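Q3 and Q8 are per-residue accuracies, so both can be computed with one function once the eight states are mapped onto three. The mapping below (H, G, I to helix; E, B to strand; everything else to coil) is a commonly used reduction of DSSP states, but other reductions exist, so treat it and the example strings as assumptions rather than the definition used by any particular method above.

```python
# One common reduction of the eight DSSP states to three (an assumption; variants exist).
EIGHT_TO_THREE = {
    "H": "H", "G": "H", "I": "H",            # helices
    "E": "E", "B": "E",                      # strands and bridges
    "T": "C", "S": "C", "L": "C", "-": "C",  # turns, bends and loops
}

def reduce_to_three(states8):
    """Map an eight-state string onto the three-state alphabet H/E/C."""
    return "".join(EIGHT_TO_THREE[s] for s in states8)

def q_accuracy(true_states, predicted_states):
    """Fraction of residues whose predicted state equals the true state."""
    if len(true_states) != len(predicted_states):
        raise ValueError("sequences must have the same length")
    matches = sum(t == p for t, p in zip(true_states, predicted_states))
    return matches / len(true_states)

# Invented example strings for a ten-residue segment.
true8 = "HHHGGTEEBL"
pred8 = "HHHHGTEEEL"

q8 = q_accuracy(true8, pred8)
q3 = q_accuracy(reduce_to_three(true8), reduce_to_three(pred8))
print("Q8 =", q8, "Q3 =", q3)
```

Because several eight-state labels collapse into the same three-state label, a confusion between, say, H and G lowers Q8 but not Q3. For a fixed prediction reduced with the same mapping, Q3 is therefore never lower than Q8.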

Protein contact prediction and contact-assisted folding (i.e. folding proteins using predicted contacts as restraints) represent a promising new direction for ab initio folding of proteins without good templates in PDB. Coevolution analysis is effective for proteins with a very large number (more than 1000) of sequence homologues, but fares poorly for proteins without many sequence homologues. By combining coevolution information with a few other protein features, shallow neural network methods such as MetaPSICOV and CoinDCA-NN have shown some advantage over pure coevolution analysis for proteins with few sequence homologues, but their accuracy is still far from satisfactory. In recent years, deeper architectures have been explored for contact prediction, such as CMAPpro, DNCON and PConsC. However, blindly tested in the well-known CASP competitions, these methods did not show any advantage over MetaPSICOV.

Recently, Wang et al. proposed the deep learning method RaptorX-Contact, which significantly improves contact prediction over MetaPSICOV and pure coevolution methods, especially for proteins without many sequence homologues. It employs a network architecture formed by one 1D residual neural network and one 2D residual neural network. Blindly tested in the latest CASP competition (i.e. CASP12), RaptorX-Contact ranked first in F1 score on free-modelling targets as well as the whole set of targets. In CAMEO (which can be interpreted as a fully automated CASP), its predicted contacts were also able to fold proteins with a novel fold and only 65–330 sequence homologues. This technique also worked well on membrane proteins even when trained on non-membrane proteins. RaptorX-Contact performed better mainly due to the introduction of residual neural networks and exploitation of contact occurrence patterns by simultaneously predicting all the contacts in a single protein.

Taken together, ab initio folding is becoming much easier with the advent of direct evolutionary coupling analysis and deep learning techniques. We expect further improvements in contact prediction for proteins with fewer than 1000 homologues by studying new deep network architectures. The deep learning methods summarized above also apply to interfacial contact prediction for protein complexes but may be less effective because on average protein complexes have fewer sequence homologues. Beyond secondary structure and contact maps, we anticipate increased attention to predicting 3D protein structure directly from amino acid sequence and single residue evolutionary information.

Structure determination and cryo-electron microscopy

Complementing computational prediction approaches, cryo-electron microscopy (cryo-EM) allows near-atomic resolution determination of protein models by comparing individual electron micrographs. Detailed structures require tens of thousands of protein images. Technological development has increased the throughput of image capture. New hardware, such as direct electron detectors, has made large-scale image production practical, while new software has focused on rapid, automated image processing.

Some components of cryo-EM image processing remain difficult to automate. For instance, in particle picking, micrographs are scanned to identify individual molecular images that will be used in structure refinement. In typical applications, hundreds of thousands of particles are necessary to determine a structure to near-atomic resolution, making manual selection impractical. Typical selection approaches are semi-supervised; a user will select several particles manually, and these selections will be used to train a classifier. Now CNNs are being used to select particles in tools like DeepPicker and DeepEM. In addition to addressing shortcomings from manual selection, such as selection bias and poor discrimination of low-contrast images, these approaches also provide a means of full automation. DeepPicker can be trained by reference particles from other experiments with structurally unrelated macromolecules, allowing for fully automated application to new samples.

Downstream of particle picking, deep learning is being applied to other aspects of cryo-EM image processing. Statistical manifold learning has been implemented in the software package ROME to classify selected particles and elucidate the different conformations of the subject molecule necessary for accurate 3D structures. These recent tools highlight the general applicability of deep learning approaches for image processing to increase the throughput of high-resolution cryo-EM.

Protein–protein interactions

Protein–protein interactions (PPIs) are highly specific and non-accidental physical contacts between proteins, which occur for purposes other than generic protein production or degradation. Abundant interaction data have been generated in part thanks to advances in high-throughput screening methods, such as yeast two-hybrid and affinity-purification with mass spectrometry. However, because many PPIs are transient or dependent on biological context, high-throughput methods can fail to capture a number of interactions. The imperfections and costs associated with many experimental PPI screening methods have motivated an interest in high-throughput computational prediction.

Many machine learning approaches to PPI have focused on text mining the literature, but these approaches can fail to capture context-specific interactions, motivating de novo PPI prediction. Early de novo prediction approaches used a variety of statistical and machine learning tools on structural and sequential data, sometimes with reference to the existing body of protein structure knowledge. In the context of PPIs—as in other domains—deep learning shows promise both for exceeding current predictive performance and for circumventing limitations from which other approaches suffer.

One of the key difficulties in applying deep learning techniques to binding prediction is the task of representing peptide and protein sequences in a meaningful way. DeepPPI made PPI predictions from a set of sequence and composition protein descriptors using a two-stage deep neural network that trained two subnetworks for each protein and combined them into a single network. Sun et al. applied autocovariances, a coding scheme that returns uniform-size vectors describing the covariance between physico-chemical properties of the protein sequence at various positions. Wang et al. used deep learning as an intermediate step in PPI prediction. They examined 70 amino acid protein sequences from each of which they extracted 1260 features. A stacked sparse autoencoder with two hidden layers was then used to reduce feature dimensions and noisiness before a novel type of classification vector machine made PPI predictions.
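The autocovariance idea can be illustrated with a short sketch. It is not the implementation used by Sun et al.; the choice of two properties (Kyte-Doolittle hydropathy and a crude net charge at neutral pH), the function name autocovariance_features and the max_lag value are assumptions made for illustration, and the property normalization usually applied beforehand is omitted. The point is only that sequences of different lengths map to vectors of the same size.

```python
import numpy as np

# Illustrative property table (an assumption for this sketch):
# Kyte-Doolittle hydropathy and a crude net charge at neutral pH.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}
CHARGE = {aa: 0.0 for aa in HYDROPATHY}
CHARGE.update({"K": 1.0, "R": 1.0, "D": -1.0, "E": -1.0})


def autocovariance_features(sequence, max_lag=3):
    """Return a vector of length max_lag * 2, whatever the sequence length."""
    n = len(sequence)
    if n <= max_lag:
        raise ValueError("sequence must be longer than max_lag")
    # One row per residue, one column per physico-chemical property.
    props = np.array([[HYDROPATHY[aa], CHARGE[aa]] for aa in sequence])
    centred = props - props.mean(axis=0)
    features = []
    for lag in range(1, max_lag + 1):
        # Mean product of centred property values `lag` residues apart.
        ac = (centred[:-lag] * centred[lag:]).sum(axis=0) / (n - lag)
        features.extend(ac)
    return np.array(features)


short_vec = autocovariance_features("MKTAYIAKQR")
long_vec = autocovariance_features("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")
print(short_vec.shape, long_vec.shape)  # both vectors have the same length
```

Because the output length depends only on the number of lags and properties, such vectors can be fed directly to a fixed-input network, at the cost of discarding the exact residue order.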

Beyond predicting whether or not two proteins interact, Du et al. employed a deep learning approach to predict the residue contacts between two interacting proteins. Using features that describe how each residue of a protein compares with the residues of similar proteins at the same position, the authors extracted uniform-length features for each residue in the protein sequence. A stacked autoencoder took two such vectors as input for the prediction of contact between two residues. The authors evaluated the performance of this method with several classifiers and showed that a deep neural network classifier paired with the stacked autoencoder exceeded the accuracy of classical machine learning classifiers.
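The pairing of per-residue vectors can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline of Du et al.: the function name residue_pair_inputs, the feature dimension and the random feature values are hypothetical, and the downstream autoencoder and classifier are omitted.

```python
import numpy as np


def residue_pair_inputs(features_a, features_b):
    """Concatenate every residue vector of protein A with every one of B.

    Row i * n_b + j of the result describes the pair (residue i of A,
    residue j of B) and has twice the per-residue feature length.
    """
    n_a, n_b = len(features_a), len(features_b)
    left = np.repeat(features_a, n_b, axis=0)  # each A residue, n_b times
    right = np.tile(features_b, (n_a, 1))      # all B residues, n_a times
    return np.hstack([left, right])


# Placeholder features: 4 and 3 residues, 5 values per residue.
rng = np.random.default_rng(0)
protein_a = rng.normal(size=(4, 5))
protein_b = rng.normal(size=(3, 5))

pairs = residue_pair_inputs(protein_a, protein_b)
print(pairs.shape)  # one row per candidate residue-residue contact
```

Each row would then be scored independently, which is why uniform-length per-residue features are needed in the first place.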

Because many studies used predefined higher-level features, one of the benefits of deep learning, automatic feature extraction, is not fully leveraged. More work is needed to determine the best ways to represent raw protein sequence information so that the full benefits of deep learning as an automatic feature extractor can be realized.

Major histocompatibility complex-peptide binding

An important type of PPI involves the immune system’s ability to recognize the body’s own cells. The major histocompatibility complex (MHC) plays a key role in regulating this process by binding antigens and displaying them on the cell surface to be recognized by T cells. Owing to its importance in immunity and immune response, peptide–MHC binding prediction is a useful problem in computational biology, and one that must account for the allelic diversity in the MHC-encoding gene region.

Shallow, feed-forward neural networks are competitive methods and have made progress towards pan-allele and pan-length peptide representations. Sequence alignment techniques are useful for representing variable-length peptides as uniform-length features. For pan-allelic prediction, NetMHCpan used a pseudo-sequence representation of the MHC class I molecule, which included only polymorphic peptide contact residues. The sequences of the peptide and MHC were then represented using both sparse vector encoding and BLOSUM encoding, in which amino acids are encoded by matrix score vectors. A method comparable to the NetMHC tools is MHCflurry, which shows superior performance on peptides of lengths other than nine. MHCflurry adds placeholder amino acids to transform variable-length peptides into peptides of length 15. When training the MHCflurry feed-forward neural network, the authors imputed missing MHC-peptide binding affinities using a Gibbs sampling method, showing that imputation improves performance for datasets with roughly 100 or fewer training examples. MHCflurry’s imputation method increases its performance on poorly characterized alleles, making it competitive with NetMHCpan for this task. Kuksa et al. developed a shallow, higher-order neural network (HONN) comprising both mean and covariance hidden units to capture some of the higher-order dependencies between amino acid locations. Pre-training this HONN with a semi-RBM, the authors found that the performance of the HONN exceeded that of a simple deep neural network, as well as that of NetMHC.
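Padding to a fixed length followed by sparse encoding can be sketched in a few lines. This is an illustration rather than MHCflurry's own code: the placeholder symbol "X", its placement in the middle of the peptide and the function names pad_peptide and sparse_encode are assumptions of this sketch.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PLACEHOLDER = "X"  # hypothetical padding symbol for this sketch
ALPHABET = AMINO_ACIDS + PLACEHOLDER
INDEX = {aa: i for i, aa in enumerate(ALPHABET)}


def pad_peptide(peptide, target_length=15):
    """Insert placeholders in the middle so both peptide ends stay aligned."""
    if len(peptide) > target_length:
        raise ValueError("peptide is longer than the target length")
    half = len(peptide) // 2
    gap = PLACEHOLDER * (target_length - len(peptide))
    return peptide[:half] + gap + peptide[half:]


def sparse_encode(peptide, target_length=15):
    """One-hot encode a padded peptide as a (target_length, 21) matrix."""
    padded = pad_peptide(peptide, target_length)
    encoded = np.zeros((target_length, len(ALPHABET)))
    encoded[np.arange(target_length), [INDEX[aa] for aa in padded]] = 1.0
    return encoded


print(pad_peptide("SIINFEKL"))          # SIINXXXXXXXFEKL
print(sparse_encode("SIINFEKL").shape)  # (15, 21)
```

A BLOSUM encoding would follow the same pattern, with each one-hot row replaced by the corresponding row of a substitution matrix.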

Deep learning’s unique flexibility was recently leveraged by Bhattacharya et al., who used a gated RNN method called MHCnuggets to overcome the difficulty of multiple peptide lengths. Under this framework, they used smoothed sparse encoding to represent amino acids individually. Because MHCnuggets had to be trained for every MHC allele, performance was far better for alleles with abundant, balanced training data. Vang et al. developed HLA-CNN, a method which maps amino acids onto a 15-dimensional vector space based on their context relation to other amino acids before making predictions with a CNN. In a comparison of several current methods, Bhattacharya et al. found that the top methods—NetMHC, NetMHCpan, MHCflurry and MHCnuggets—showed comparable performance, but large differences in speed. Convolutional neural networks (in this case, HLA-CNN) showed comparatively poor performance, while shallow networks and RNNs performed the best. They found that MHCnuggets—the recurrent neural network—was by far the fastest-training among the top performing methods.
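The way a gated RNN sidesteps the fixed-length requirement can be shown with a small sketch. It is not the MHCnuggets architecture: the GRU-style cell, the class name TinyGRU, the hidden size, the plain (unsmoothed) one-hot encoding and the random, untrained weights are all assumptions chosen for illustration. The cell consumes one residue at a time and returns a state vector whose size does not depend on peptide length.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot(peptide):
    """Encode a peptide as a (length, 20) one-hot matrix."""
    x = np.zeros((len(peptide), len(AMINO_ACIDS)))
    x[np.arange(len(peptide)), [INDEX[aa] for aa in peptide]] = 1.0
    return x


def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))


class TinyGRU:
    """GRU-style cell with random, untrained weights (illustration only)."""

    def __init__(self, input_size, hidden_size, seed=0):
        rng = np.random.default_rng(seed)
        # Index 0: update gate, 1: reset gate, 2: candidate state.
        self.W = rng.normal(scale=0.1, size=(3, hidden_size, input_size))
        self.U = rng.normal(scale=0.1, size=(3, hidden_size, hidden_size))
        self.b = np.zeros((3, hidden_size))
        self.hidden_size = hidden_size

    def encode(self, inputs):
        """Run over a sequence of any length and return the final state."""
        h = np.zeros(self.hidden_size)
        for x in inputs:
            z = sigmoid(self.W[0] @ x + self.U[0] @ h + self.b[0])
            r = sigmoid(self.W[1] @ x + self.U[1] @ h + self.b[1])
            candidate = np.tanh(
                self.W[2] @ x + self.U[2] @ (r * h) + self.b[2]
            )
            h = (1.0 - z) * h + z * candidate
        return h


gru = TinyGRU(input_size=20, hidden_size=8)
states = {p: gru.encode(one_hot(p)) for p in ["SIINFEKL", "GILGFVFTL", "KLGGALQAKA"]}
for peptide, state in states.items():
    print(len(peptide), state.shape)  # state size is the same for every length
```

In a trained model, a final dense layer would map this fixed-size state to a predicted binding affinity, so no padding or alignment step is needed.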

Article presented as excerpt of original text: Opportunities and obstacles for deep learning in biology and medicine.

Travers Ching, Daniel S. Himmelstein, Brett K. Beaulieu-Jones, Alexandr A. Kalinin, Brian T. Do, Gregory P. Way, Enrico Ferrero, Paul-Michael Agapow, Michael Zietz, Michael M. Hoffman, Wei Xie, Gail L. Rosen, Benjamin J. Lengerich, Johnny Israeli, Jack Lanchantin, Stephen Woloszynek, Anne E. Carpenter, Avanti Shrikumar, Jinbo Xu, Evan M. Cofer, Christopher A. Lavender, Srinivas C. Turaga, Amr M. Alexandari, Zhiyong Lu, David J. Harris, Dave DeCaprio, Yanjun Qi, Anshul Kundaje, Yifan Peng, Laura K. Wiley, Marwin H. S. Segler, Simina M. Boca, S. Joshua Swamidass, Austin Huang, Anthony Gitter, Casey S. Greene

Published 4 April 2018. DOI: 10.1098/rsif.2017.0387