Main methods used to predict functional annotations in GO


Can someone provide me some information about the main method used to predict the inferred electronic annotations in Gene Ontology?


I believe the most common source of electronic annotations is the analysis of peptide sequences. A collection of InterPro-to-GO mappings was created manually and is used to generate GO annotations. For example, DNA-binding domains of transcription factors would be given the "DNA binding" GO annotation.

This method has its flaws: if the detected domain has evolved away from the function on which the InterPro-to-GO mapping is based, the transferred annotation may be wrong.
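The domain-based transfer described above can be sketched in a few lines of Python. This is only an illustration: the mapping lines imitate the layout of the public interpro2go file (that layout is an assumption here), and the protein names and domain hits are hypothetical.

```python
import re
from collections import defaultdict

# Illustrative lines in the assumed style of the interpro2go mapping file.
MAPPING_TEXT = """\
!comment lines start with an exclamation mark
InterPro:IPR001356 Homeobox domain > GO:DNA binding ; GO:0003677
InterPro:IPR000719 Protein kinase domain > GO:protein kinase activity ; GO:0004672
InterPro:IPR000719 Protein kinase domain > GO:ATP binding ; GO:0005524
"""

LINE_RE = re.compile(r"^InterPro:(IPR\d+)\s.*>\s*GO:.*;\s*(GO:\d{7})\s*$")


def parse_interpro2go(text):
    """Build a dict: InterPro accession -> set of GO identifiers."""
    mapping = defaultdict(set)
    for line in text.splitlines():
        if line.startswith("!") or not line.strip():
            continue  # skip comments and blank lines
        match = LINE_RE.match(line)
        if match:
            mapping[match.group(1)].add(match.group(2))
    return mapping


def annotate(domain_hits, mapping):
    """Turn domain hits into (protein, GO id, evidence code, source domain) tuples."""
    annotations = []
    for protein, domains in domain_hits.items():
        for ipr in domains:
            for go_id in sorted(mapping.get(ipr, ())):
                annotations.append((protein, go_id, "IEA", ipr))
    return annotations


mapping = parse_interpro2go(MAPPING_TEXT)
# Hypothetical domain hits, e.g. as reported by a domain scanner.
hits = {"protein_A": ["IPR001356"], "protein_B": ["IPR000719", "IPR999999"]}
annotations = annotate(hits, mapping)
for row in annotations:
    print(row)
```

A domain without a mapping entry (IPR999999 here) simply yields no annotation, and every transferred term carries the IEA evidence code so that it can be told apart from curated annotations.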


You can refer to the IEA documentation maintained by the GO Consortium to get an idea of how annotations with the IEA evidence code are assigned automatically. Also note that IEA is distinct from the Computational Analysis evidence codes:

  • ISS: Inferred from Sequence or Structural Similarity
  • ISO: Inferred from Sequence Orthology
  • ISA: Inferred from Sequence Alignment
  • ISM: Inferred from Sequence Model
  • IGC: Inferred from Genomic Context
  • IBA: Inferred from Biological aspect of Ancestor
  • IBD: Inferred from Biological aspect of Descendant
  • IKR: Inferred from Key Residues
  • IRD: Inferred from Rapid Divergence
  • RCA: Inferred from Reviewed Computational Analysis

Predicting protein function and annotating complex pathways with machine learning

Proteins are the main working units of biology. Identifying and understanding what proteins do is crucial for biologists hoping to solve the complex interactions and systems that drive cellular processes. Although protein function needs to be ultimately validated by hand in the wet lab, researchers first need a hypothesis in order to design assays, which can then define the probable function of a protein.

Bioinformatics for predicting protein function
Biologists can build such hypotheses of gene function with computers. As genome sequencing becomes routine in experimental laboratories, computational gene function prediction has become increasingly important. Computational methods are well suited to function prediction because information about a gene’s function can be inferred from a database search that identifies similarity between the gene and known proteins or experimental data. Sequence similarity tools such as the Basic Local Alignment Search Tool (BLAST) are one such method: BLAST searches a query against all previously recorded sequences and suggests a scored list of possible roles for it.

Problems with previous computational methods
However, existing bioinformatic tools can’t always predict protein function accurately, and often end up incorrectly annotating proteins within a biological system. Traditional protein function prediction tools like BLAST are usually reliable when a high sequence similarity is detected, but their accuracy falls quickly for sequences with lower similarities. For example, enzyme functions differ immensely when similarity scores fall below a certain level. Moreover, in many cases traditional methods do not annotate any function if highly similar sequences are not found, leaving many genes unannotated. Other sources of evidence, such as similarity in three-dimensional structure, gene expression, or interaction data, could be used instead. However, these data are often missing for the proteins under investigation, which limits their applicability.

New tools for better accuracy
Recently, several new protein annotation methods have been developed to improve overall prediction accuracy. One such developer is Dr Daisuke Kihara from Purdue University, who develops function prediction methods with new logical frameworks. In 2009, his team created an automated predictive algorithm, called the extended similarity group (ESG) method, which runs a continual comparing system, instead of a single search. From each sequence found from the first inquiry, the ESG algorithm runs a second search through the database. By combining results from this multi-levelled tactic, the ESG method significantly improves functional scoring for query proteins and outperforms previous function prediction algorithms.

Yet the team did not stop here. In a 2019 paper, they combined phylogenetic tree construction tools with traditional sequence-based prediction in a method called Phylo-PFP. They first confirmed that close similarities of protein sequences did not align with the proteins’ distances on a phylogenetic tree. Adding these distances to the sequence homology score made the ranking of hits for a query protein more reliable, so that the hits could be more accurately linked to their gene source. The study established that Phylo-PFP significantly improved function prediction accuracy over existing methods.

Protein group function annotation
Protein function annotation is typically run on a one-protein-one-function approach, yet this mindset can grossly oversimplify the protein function universe. In fact, most experiments find dozens of interacting proteins related to a single biological event. To understand the role of an entire protein set, their function should be determined from the group as a whole, even if the function of each individual protein is unknown. This is no simple task.

Therefore, Dr Kihara’s team focused on a new computational approach for annotating the functions of protein groups. In 2019, they proposed an iterative Group Function Prediction (iGFP) method, which has a completely new logical framework at its core. The iGFP algorithm takes a set of proteins as input and predicts the function of the entire group, as well as of its individual members. The iGFP algorithm blends sequence data from multiple sources and builds a complementary network. The method then separates the proteins into clusters that have functional relevance and compares them based on functional and interaction relationships.

The iGFP algorithm iteratively assigns functions to protein groups and to individual proteins in the groups.

Moreover, the system automatically assumes that some proteins are unknown and uses a range of other comparative features to make an accurate prediction. During this scan, the algorithm considers protein-protein interactions, phylogenetic profile similarity, gene co-expression, large-scale pathway similarity, and gene ontology similarity. This type of comprehensive group function prediction could be an altogether improved reflection of the real mechanisms at work in, for example, developmental or disease-causing pathways.

Identifying proteins with multiple functions
In addition to analysing protein groups, the Kihara team has taken another step away from the one-protein-one-function scheme by studying multi-functional proteins. Most bioinformatic tools do not take into account that proteins, enzymes in particular, can be multi-functional. The Kihara lab has thus aimed to predict whether a query protein is a moonlighting protein – one that has multiple autonomous and often unrelated functions. These proteins are difficult to annotate, since their functions are not genome or protein family specific, nor linked to other indicators, such as a shared switching mechanism. Yet these proteins play key roles in cellular disease states such as cancers, and so identifying them is important.

Aashish Jain and Dr Kihara discuss functions assigned to a metabolic pathway.

To solve the problem, Dr Kihara’s team has developed a new systematic approach to study moonlighting proteins. In 2016, the team proposed an automated prediction framework which uses several types of non-sequence-based data to identify moonlighting proteins. They used machine learning classifiers to predict multi-functional proteins, after which they cross-validated the results using existing databases. Dr Kihara’s team could predict moonlighting proteins that had previous gene sequence data with 98% accuracy. Even when no sequence data were available, the system showed an impressive 75% accuracy.

Furthermore, in a 2018 paper the team used deep learning to sniff out moonlighting proteins from previously published literature. Their text mining tool DextMP could find out whether a protein had multiple functions or not based on information from journal publications and functional descriptions from protein databases. Using systematic literature processing tools, the researchers could significantly reduce time to annotate moonlighting proteins and move closer to clarifying the complex interplay of proteins within the cell.

Improvements and future predictions
Computational biology desperately needs new ways to accurately reflect the true nature of biological processes. Dr Kihara’s team has made innovative strides to step away from a traditional one-protein-one-function effort and identified functions for entire protein groups. Their algorithms outperform previous sequence-based methods by layering multiple protein characteristics and taking into account evolutionary relationships, which can be better indicators of shared functions than the simple amino acid backbone. Further, the team’s machine learning methods can predict whether a protein serves a double role, and whether such proteins have unknowingly been described in previous literature.

Despite these promising developments, bioinformatic prediction tools are only as intelligent as their design, and there is still a way to go towards fully automated, AI-driven research in protein function annotation. Overall, Dr Kihara’s team suggests that combining previous methods with emerging ones from omics experiments and evolution distance analysis will further solidify functional prediction accuracies in the future.

Personal Response

What kind of role will machine learning play in protein function prediction and understanding biological processes?

Machine learning has already been playing a big role in protein function prediction, and more widely, in bioinformatics. It is particularly effective in identifying subtle signatures that are easily overlooked by humans in input data including protein sequences that are relevant to particular functions. It is also very suitable for integrating many different types of data together to make predictions.


1 Introduction

1.1 Background

Gene annotation databases capture the current biological knowledge allowing researchers to interpret the results of life science experiments. In spite of their unquestionable importance, significant problems concerning the annotation databases still exist. One problem is that the annotation databases are currently incomplete. For virtually all sequenced organisms, only a subset of genes is known, and an even smaller subset of genes is functionally annotated [28]. As more knowledge is accumulated, genes and annotations are gradually added to such databases. This means that at any moment in time, it is likely that an annotation database will contain only a subset of all genes of the given organism, and even for those genes that are included, possibly only a subset of their functions is present in the database. In addition to this, most of the annotations are introduced by curators who manually examine the literature. In this process, it is possible that certain confirmed facts reported in existing publications might get overlooked [25]. Another problem is caused by the way these annotations are stored in the structure of the Gene Ontology (GO). There are, for instance, genes that are annotated for a particular molecular function but are not annotated for the corresponding biological process. This is not a problem for a database curator or a life scientist looking for the annotations of a specific gene, since a human can easily make obvious extrapolations. However, this is not how such databases are used most of the time. In a more typical scenario, the researcher will try to interpret the results of a high-throughput experiment using a software that performs an ontological analysis [11], [12], [24], [27], [26], [2], [4], [21], [35], [42], [43]. Such software will query an annotation database in each of the three main branches of the GO graph and calculate a statistical significance based strictly on the data retrieved, making no extrapolations. 
This type of analysis fails to correctly compute the statistical significance of the genes involved if they are not correctly annotated for each of the three GO categories. We should note here that no matter how thorough the annotators are, as our knowledge improves, new functions will continue to be added, and some of the older ones will be changed or revoked. Thus, due to the intrinsic evolution of scientific knowledge, gene annotations are likely to maintain a dynamic character and hence are unlikely to be considered complete anytime in the near future.

To overcome some of these problems, we previously proposed a method capable of finding gene-function associations that are not explicitly represented in the annotation databases [25]. This technique employs a latent semantic indexing (LSI) approach and was demonstrated using the human genome annotations. This first attempt used a binary representation of the relationships between genes and their functional annotations. However, the binary representation fails to properly capture the hierarchical relationships between various terms. Previous research in information retrieval (IR) has shown that the use of a weighted representation, rather than a binary one, can improve the quality of retrieval operations. Intuitively, IR term weighting attempts to exploit two simple observations: 1) terms that appear repeatedly in a document are better suited to describe the topic of the document than terms that are rarely used, and 2) infrequent terms across the document collection are better differentiators between documents than terms that appear in most or in all documents. Similar relationships might exist between genes and their annotations. Functions that are only associated with few genes carry more information about the genes and can better differentiate between them. Conversely, several closely related functions associated with a given gene will better describe what the gene actually does.
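The effect of weighting can be illustrated with a minimal Python sketch. It implements only an inverse-document-frequency style weight (rare functions count more) and compares genes by cosine similarity; the gene and term identifiers are hypothetical, and the specific weighting schemes evaluated in the paper are not reproduced here.

```python
import math

# Toy gene -> annotated functions (hypothetical identifiers).
gene_terms = {
    "gene1": {"GO:A", "GO:B"},
    "gene2": {"GO:A", "GO:C"},
    "gene3": {"GO:A"},
}


def idf_weights(gene_terms):
    """Weight each term by log(N / number of genes annotated with it)."""
    n_genes = len(gene_terms)
    counts = {}
    for terms in gene_terms.values():
        for term in terms:
            counts[term] = counts.get(term, 0) + 1
    return {term: math.log(n_genes / c) for term, c in counts.items()}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


idf = idf_weights(gene_terms)
binary = {g: {t: 1.0 for t in terms} for g, terms in gene_terms.items()}
weighted = {g: {t: idf[t] for t in terms} for g, terms in gene_terms.items()}

binary_sim = cosine(binary["gene1"], binary["gene2"])
weighted_sim = cosine(weighted["gene1"], weighted["gene2"])
print(binary_sim, weighted_sim)
```

In the binary representation gene1 and gene2 look half-similar because they share GO:A, but GO:A annotates every gene and therefore receives zero weight, so the weighted similarity drops to zero.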

This paper explores the use of vector space model (VSM) weighting schemes in the context of a semantic analysis of biological annotations. The technique described here is able to discover implicit gene-function relationships and propose them to researchers and database curators as novel annotations. We present the results obtained with several weighting schemes on the annotations of the human genome stored in the Onto-Tools database [11], [24], which includes all known annotations from the GO Consortium.

1.2 Related Work

A VSM [5], [6], [16] has been used previously to cluster genes by creating a vector space of genes and MEDLINE abstracts of papers discussing those particular genes [17]. The similarity between genes was assessed by computing a distance between the vectors that were representing them. It was found that weighted vectors improved the results significantly over Boolean vectors [17]. VSM was also used to compute the similarity between GO terms, and the results were compared with two other nonlexical methods for analyzing the GO graph [7]. LSI [5], [6], [9] has recently been utilized for genome-wide expression data analysis [3]. LSI was also employed to identify relations between genes by creating a vector space of genes and MEDLINE abstracts [20]. Earlier IR research has shown that LSI is 30 percent more effective than word matching methods [9]. Ontologies were used in the recent past to overcome the limitations of keyword-based search, especially after the emergence of the Semantic Web [32], [39]. In [39], the authors describe an IR method that combines document annotation and query expansion using ontology terms and results ranking using VSM. Similar techniques are employed by MELISA [1] and Textpresso [30], two medical literature search tools. MELISA uses MEDLINE’s own ontology, MeSH, to semantically enrich the user queries. Textpresso builds an ontology, 80 percent of which is based on GO terms, and uses it for document annotation and query expansion.

Other approaches for predicting functional annotations for a given gene also exist. The most commonly used approach for function prediction uses sequence similarity. This approach is based on the hypothesis that a function can be transferred between similar sequences in different organisms since such similarity has been conserved over long periods of evolution [10]. This method of annotation transfer can result in incorrect function predictions due to reasons such as divergence of function within homologous proteins. Furthermore, this type of inference can also be incorrect because the annotations are only transferred from the closest homolog [23]. In order to overcome these problems, approaches combining sequence similarity data with structural information have been proposed [14], [38]. The guilt by association (GBA) approach [33], [40], [44], based on the observation that functionally related genes tend to share similar mRNA expression profiles, has also been widely applied to predict gene functions [8], [13], [22], [36], [41]. This approach clusters the genes based on their expression profiles in order to predict the gene functions. The GBA approaches are affected by issues such as data transformation [15], [31] and filtering intended to boost the signal-to-noise ratio [19]. An alternative approach uses sequence similarity and protein domain data in order to predict functional annotations [37]. Raychaudhuri et al. [34] proposed a natural language processing approach for automatically extracting gene-function associations from the literature abstracts.


Methods

Experiment overview

The time line for the second CAFA experiment followed that of the first experiment and is illustrated in Fig. 1. Briefly, CAFA2 was announced in July 2013 and officially started in September 2013, when 100,816 target sequences from 27 species were made available to the community. Teams were required to submit prediction scores within the (0,1] range for each protein–term pair they chose to predict on. The submission deadline for depositing these predictions was set for January 2014 (time point t0). We then waited until September 2014 (time point t1) for new experimental annotations to accumulate on the target proteins and assessed the performance of the prediction methods. We will refer to the set of all experimentally annotated proteins available at t0 as the training set and to a subset of target proteins that accumulated experimental annotations during (t0, t1] and used for evaluation as the benchmark set. It is important to note that the benchmark proteins and the resulting analysis vary based on the selection of time point t1. For example, a preliminary analysis of the CAFA2 experiment was provided during the Automated Function Prediction Special Interest Group (AFP-SIG) meeting at the Intelligent Systems for Molecular Biology (ISMB) conference in July 2014.

Time line for the CAFA2 experiment

The participating methods were evaluated according to their ability to predict terms in GO [3] and Human Phenotype Ontology (HPO) [8]. In contrast with CAFA1, where the evaluation was carried out only for the Molecular Function Ontology (MFO) and Biological Process Ontology (BPO), in CAFA2 we also assessed the performance for the prediction of Cellular Component Ontology (CCO) terms in GO. The set of human proteins was further used to evaluate methods according to their ability to associate these proteins with disease terms from HPO, which included all sub-classes of the term HP:0000118, “Phenotypic abnormality”.

In total, 56 groups submitting 126 methods participated in CAFA2. From those, 125 methods made valid predictions on a sufficient number of sequences. Further, 121 methods submitted predictions for at least one of the GO benchmarks, while 30 methods participated in the disease gene prediction tasks using HPO.

Evaluation

The CAFA2 experiment expanded the assessment of computational function prediction compared with CAFA1. This includes the increased number of targets, benchmarks, ontologies, and method comparison metrics.

We distinguish between two major types of method evaluation. The first, protein-centric evaluation, assesses performance accuracy of methods that predict all ontological terms associated with a given protein sequence. The second type, term-centric evaluation, assesses performance accuracy of methods that predict if a single ontology term of interest is associated with a given protein sequence [2]. The protein-centric evaluation can be viewed as a multi-label or structured-output learning problem of predicting a set of terms or a directed acyclic graph (a subgraph of the ontology) for a given protein. Because the ontologies contain many terms, the output space in this setting is extremely large and the evaluation metrics must incorporate similarity functions between groups of mutually interdependent terms (directed acyclic graphs). In contrast, the term-centric evaluation is an example of binary classification, where a given ontology term is assigned (or not) to an input protein sequence. These methods are particularly common in disease gene prioritization [9]. Put otherwise, a protein-centric evaluation considers a ranking of ontology terms for a given protein, whereas the term-centric evaluation considers a ranking of protein sequences for a given ontology term.

Both types of evaluation have merits in assessing performance. This is partly due to the statistical dependency between ontology terms, the statistical dependency among protein sequences, and also the incomplete and biased nature of the experimental annotation of protein function [6]. In CAFA2, we provide both types of evaluation, but we emphasize the protein-centric scenario for easier comparisons with CAFA1. We also draw important conclusions regarding method assessment in these two scenarios.

No-knowledge and limited-knowledge benchmark sets

In CAFA1, a protein was eligible to be in the benchmark set if it had no experimentally verified annotations in any of the GO ontologies at time t0 but accumulated at least one functional term with an experimental evidence code between t0 and t1; we refer to such benchmark proteins as no-knowledge benchmarks. In CAFA2 we introduced proteins with limited knowledge, which are those that had been experimentally annotated in one or two GO ontologies (but not in all three) at time t0. For example, for the performance evaluation in MFO, a protein without any annotation in MFO prior to the submission deadline was allowed to have experimental annotations in BPO and CCO.

During the growth phase, the no-knowledge targets that have acquired experimental annotations in one or more ontologies became benchmarks in those ontologies. The limited-knowledge targets that have acquired additional annotations became benchmarks only for those ontologies for which there were no prior experimental annotations. The reason for using limited-knowledge targets was to identify whether the correlations between experimental annotations across ontologies can be exploited to improve function prediction.

The selection of benchmark proteins for evaluating HPO-term predictors was separated from the GO analyses. We created only a no-knowledge benchmark set in the HPO category.
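The no-knowledge and limited-knowledge rules for the GO ontologies can be expressed as a short Python sketch. The data layout (protein -> ontology -> set of experimental terms at t0 and t1) and the protein names are assumptions made for illustration, not the format used by the CAFA organizers.

```python
ONTOLOGIES = ("MFO", "BPO", "CCO")


def classify_benchmarks(exp_t0, exp_t1):
    """Return ontology -> {"NK": proteins, "LK": proteins}.

    exp_t0 / exp_t1 map protein -> ontology -> set of experimental terms
    known at the submission deadline (t0) and at evaluation time (t1).
    """
    result = {o: {"NK": set(), "LK": set()} for o in ONTOLOGIES}
    for protein, after in exp_t1.items():
        before = exp_t0.get(protein, {})
        annotated_before = {o for o in ONTOLOGIES if before.get(o)}
        if len(annotated_before) == len(ONTOLOGIES):
            continue  # annotated in all three ontologies at t0: neither NK nor LK
        kind = "NK" if not annotated_before else "LK"
        for o in ONTOLOGIES:
            gained = after.get(o, set()) - before.get(o, set())
            # A protein is a benchmark only in ontologies with no prior annotation.
            if o not in annotated_before and gained:
                result[o][kind].add(protein)
    return result


exp_t0 = {
    "P2": {"BPO": {"GO:b1"}},
    "P3": {"MFO": {"GO:m1"}, "BPO": {"GO:b1"}, "CCO": {"GO:c1"}},
}
exp_t1 = {
    "P1": {"MFO": {"GO:m1"}},
    "P2": {"BPO": {"GO:b1", "GO:b2"}, "MFO": {"GO:m2"}},
    "P3": {"MFO": {"GO:m1", "GO:m3"}, "BPO": {"GO:b1"}, "CCO": {"GO:c1"}},
}
benchmarks = classify_benchmarks(exp_t0, exp_t1)
print(benchmarks)
```

Here P1 becomes a no-knowledge MFO benchmark, P2 becomes a limited-knowledge MFO benchmark (its new BPO term does not count because BPO was already annotated at t0), and P3 is excluded.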

Partial and full evaluation modes

Many function prediction methods apply only to certain types of proteins, such as proteins for which 3D structure data are available, proteins from certain taxa, or specific subcellular localizations. To accommodate these methods, CAFA2 provided predictors with an option of choosing a subset of the targets to predict on as long as they computationally annotated at least 5,000 targets, of which at least ten accumulated experimental terms. We refer to the assessment mode in which the predictions were evaluated only on those benchmarks for which a model made at least one prediction at any threshold as partial evaluation mode. In contrast, the full evaluation mode corresponds to the same type of assessment performed in CAFA1 where all benchmark proteins were used for the evaluation and methods were penalized for not making predictions.

In most cases, for each benchmark category, we have two types of benchmarks, no-knowledge and limited-knowledge, and two modes of evaluation, full mode and partial mode. Exceptions are all HPO categories that only have no-knowledge benchmarks. The full mode is appropriate for comparisons of general-purpose methods designed to make predictions on any protein, while the partial mode gives an idea of how well each method performs on a self-selected subset of targets.

Evaluation metrics

Precision–recall curves and remaining uncertainty–misinformation curves were used as the two chief metrics in the protein-centric mode [10]. We also provide a single measure for evaluation of both types of curves as a real-valued scalar to compare methods; however, we note that any choice of a single point on those curves may not match the intended application objectives for a given algorithm. Thus, a careful understanding of the evaluation metrics used in CAFA is necessary to properly interpret the results.

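Precision (pr), recall (rc), and the resulting $F_{\max}$ are defined as

$$\mathrm{pr}(\tau) = \frac{1}{m(\tau)} \sum_{i=1}^{m(\tau)} \frac{\sum_{f} \mathbb{1}\left(f \in P_i(\tau) \wedge f \in T_i\right)}{\sum_{f} \mathbb{1}\left(f \in P_i(\tau)\right)},$$

$$\mathrm{rc}(\tau) = \frac{1}{n_e} \sum_{i=1}^{n_e} \frac{\sum_{f} \mathbb{1}\left(f \in P_i(\tau) \wedge f \in T_i\right)}{\sum_{f} \mathbb{1}\left(f \in T_i\right)},$$

$$F_{\max} = \max_{\tau} \left\{ \frac{2 \cdot \mathrm{pr}(\tau) \cdot \mathrm{rc}(\tau)}{\mathrm{pr}(\tau) + \mathrm{rc}(\tau)} \right\},$$

where $P_i(\tau)$ denotes the set of terms that have predicted scores greater than or equal to $\tau$ for a protein sequence $i$, $T_i$ denotes the corresponding ground-truth set of terms for that sequence, $m(\tau)$ is the number of sequences with at least one predicted score greater than or equal to $\tau$, $\mathbb{1}(\cdot)$ is an indicator function, and $n_e$ is the number of targets used in a particular mode of evaluation. In the full evaluation mode $n_e = n$, the number of benchmark proteins, whereas in the partial evaluation mode $n_e = m(0)$, i.e., the number of proteins that were chosen to be predicted using the particular method. For each method, we refer to $m(0)/n$ as the coverage because it provides the fraction of benchmark proteins on which the method made any predictions.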
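A minimal Python sketch of the protein-centric F max computation follows, with the threshold swept from 0.01 to 1.00 and a switch between the full and partial evaluation modes. The data layout and the toy values are assumptions for illustration; ground-truth term sets are taken to be non-empty and already propagated through the ontology.

```python
def f_max(predictions, truth, mode="full"):
    """Protein-centric F max.

    predictions: protein -> {term: score in (0, 1]}
    truth:       protein -> non-empty set of true terms (benchmark proteins)
    mode:        "full" divides recall by all benchmark proteins,
                 "partial" only by proteins with at least one prediction.
    """
    n = len(truth)
    m0 = sum(1 for p in truth if predictions.get(p))
    n_e = n if mode == "full" else m0
    best = 0.0
    for step in range(1, 101):
        tau = step / 100
        pr_sum, rc_sum, m_tau = 0.0, 0.0, 0
        for protein, true_terms in truth.items():
            scores = predictions.get(protein, {})
            predicted = {t for t, s in scores.items() if s >= tau}
            tp = len(predicted & true_terms)
            if predicted:  # precision averages only over proteins with predictions
                m_tau += 1
                pr_sum += tp / len(predicted)
            rc_sum += tp / len(true_terms)
        if m_tau == 0 or n_e == 0:
            continue
        pr, rc = pr_sum / m_tau, rc_sum / n_e
        if pr + rc > 0:
            best = max(best, 2 * pr * rc / (pr + rc))
    return best


truth = {"P1": {"a", "b"}, "P2": {"a"}}
predictions = {"P1": {"a": 0.9, "c": 0.4}}  # no predictions for P2
full_score = f_max(predictions, truth, mode="full")
partial_score = f_max(predictions, truth, mode="partial")
print(full_score, partial_score)
```

The method in this toy example covers only one of two benchmark proteins, so it is penalized in full mode and scores higher in partial mode.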

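The remaining uncertainty (ru), misinformation (mi), and the resulting minimum semantic distance ($S_{\min}$) are defined as

$$\mathrm{ru}(\tau) = \frac{1}{n_e} \sum_{i=1}^{n_e} \sum_{f} \mathrm{ic}(f) \cdot \mathbb{1}\left(f \notin P_i(\tau) \wedge f \in T_i\right),$$

$$\mathrm{mi}(\tau) = \frac{1}{n_e} \sum_{i=1}^{n_e} \sum_{f} \mathrm{ic}(f) \cdot \mathbb{1}\left(f \in P_i(\tau) \wedge f \notin T_i\right),$$

$$S_{\min} = \min_{\tau} \left\{ \sqrt{\mathrm{ru}(\tau)^2 + \mathrm{mi}(\tau)^2} \right\},$$

where $\mathrm{ic}(f)$ is the information content of the ontology term $f$ [10]. It is estimated in a maximum likelihood manner as the negative binary logarithm of the conditional probability that the term $f$ is present in a protein’s annotation given that all its parent terms are also present. Note that here, $n_e = n$ in the full evaluation mode and $n_e = m(0)$ in the partial evaluation mode applies to both ru and mi.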
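The information content estimate and the minimum semantic distance can be sketched in Python as follows. The tiny ontology, the annotation sets, and the function names are hypothetical; the sketch assumes every term of interest occurs at least once together with its parents.

```python
import math


def information_content(term, parents, annotation_sets):
    """-log2 P(term present | all parents of the term present)."""
    with_parents = [a for a in annotation_sets if parents[term] <= a]
    with_term = [a for a in with_parents if term in a]
    return -math.log2(len(with_term) / len(with_parents))


def s_min(predictions, truth, ic, n_e=None):
    """Minimum semantic distance over thresholds 0.01 .. 1.00."""
    n_e = n_e or len(truth)
    best = float("inf")
    for step in range(1, 101):
        tau = step / 100
        ru = mi = 0.0
        for protein, true_terms in truth.items():
            scores = predictions.get(protein, {})
            predicted = {t for t, s in scores.items() if s >= tau}
            ru += sum(ic[t] for t in true_terms - predicted)  # missed true terms
            mi += sum(ic[t] for t in predicted - true_terms)  # wrong predictions
        best = min(best, math.hypot(ru / n_e, mi / n_e))
    return best


# Toy ontology: root <- x <- y, with propagated annotation sets.
parents = {"root": set(), "x": {"root"}, "y": {"x"}}
annotation_sets = [{"root", "x"}, {"root"}, {"root", "x", "y"}, {"root"}]
ic = {t: information_content(t, parents, annotation_sets) for t in parents}

truth = {"P1": {"root", "x"}}
predictions = {"P1": {"root": 1.0, "y": 0.6, "x": 0.3}}
distance = s_min(predictions, truth, ic)
print(ic, distance)
```

The root term is present in every annotation and so carries no information, which means that predicting it neither helps nor hurts under this metric.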

In addition to the main metrics, we used two secondary metrics. Those were the weighted version of the precision–recall curves and the version of the remaining uncertainty–misinformation curves normalized to the [0,1] interval. These metrics and the corresponding evaluation results are shown in Additional file 1.

For the term-centric evaluation we used the area under the receiver operating characteristic (ROC) curve (AUC). The AUCs were calculated for all terms that have acquired at least ten positively annotated sequences, whereas the remaining benchmarks were used as negatives. The term-centric evaluation was used both for ranking models and to differentiate well and poorly predictable terms. The performance of each model on each term is provided in Additional file 1.
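For a single ontology term, the AUC can be computed directly as the probability that a positively annotated protein receives a higher score than a negative one. The Python sketch below does this by pairwise comparison; treating a missing prediction as a score of 0 is an assumption of this sketch, and the protein names are hypothetical.

```python
def term_auc(benchmark, positives, scores, min_positives=10):
    """AUC for one term: P(score of a positive > score of a negative).

    benchmark:  iterable of all benchmark proteins
    positives:  set of proteins experimentally annotated with the term
    scores:     protein -> predicted score for this term
    Returns None if the term has too few positives or no negatives.
    """
    pos = [scores.get(p, 0.0) for p in benchmark if p in positives]
    neg = [scores.get(p, 0.0) for p in benchmark if p not in positives]
    if len(pos) < min_positives or not neg:
        return None
    # Ties count as half a win, which matches the area under the ROC curve.
    wins = sum((sp > sn) + 0.5 * (sp == sn) for sp in pos for sn in neg)
    return wins / (len(pos) * len(neg))


benchmark = ["A", "B", "C", "D"]
positives = {"A", "B"}
scores = {"A": 0.9, "B": 0.4, "C": 0.5}  # D has no prediction
auc_value = term_auc(benchmark, positives, scores, min_positives=1)
skipped = term_auc(benchmark, positives, scores)  # default: at least ten positives
print(auc_value, skipped)
```

The pairwise form is quadratic in the number of proteins, which is acceptable for a sketch; a rank-based formulation gives the same value more efficiently.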

As we required all methods to keep two significant figures for prediction scores, the threshold τ in all metrics used in this study was varied from 0.01 to 1.00 with a step size of 0.01.

Data sets

Protein function annotations for the GO assessment were extracted, as a union, from three major protein databases that are available in the public domain: Swiss-Prot [11], UniProt-GOA [12] and the data from the GO consortium web site [3]. We used evidence codes EXP, IDA, IPI, IMP, IGI, IEP, TAS, and IC to build benchmark and ground-truth sets. Annotations for the HPO assessment were downloaded from the HPO database [8].
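Restricting annotations to the evidence codes listed above is a simple filter. The sketch below assumes annotations have already been parsed into (protein, GO term, evidence code, aspect) tuples, which is a hypothetical layout; the rows are made up for illustration.

```python
# Evidence codes used to build the benchmark and ground-truth sets.
EXPERIMENTAL_CODES = {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP", "TAS", "IC"}


def experimental_annotations(rows):
    """Keep only rows whose evidence code is in the accepted set."""
    return [row for row in rows if row[2] in EXPERIMENTAL_CODES]


rows = [
    ("P1", "GO:0003677", "IDA", "F"),
    ("P1", "GO:0005634", "IEA", "C"),  # electronic annotation: dropped
    ("P2", "GO:0006355", "TAS", "P"),
    ("P2", "GO:0003700", "ISS", "F"),  # sequence similarity: dropped
]
kept = experimental_annotations(rows)
print(kept)
```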

Figure 2 summarizes the benchmarks we used in this study. Figure 2 a shows the benchmark sizes for each of the ontologies and compares these numbers to CAFA1. All species that have at least 15 proteins in any of the benchmark categories are listed in Fig. 2 b.

CAFA2 benchmark breakdown. a The benchmark size for each of the four ontologies. b Breakdown of benchmarks for both types over 11 species (with no less than 15 proteins) sorted according to the total number of benchmark proteins. For both panels, dark colors (blue, red, and yellow) correspond to no-knowledge (NK) types, while their light color counterparts correspond to limited-knowledge (LK) types. The distributions of information contents corresponding to the benchmark sets are shown in Additional file 1. The sizes of the CAFA1 benchmarks are shown in gray. BPO Biological Process Ontology, CCO Cellular Component Ontology, HPO Human Phenotype Ontology, LK limited-knowledge, MFO Molecular Function Ontology, NK no-knowledge

Comparison between CAFA1 and CAFA2 methods

We compared the results from CAFA1 and CAFA2 using a benchmark set that we created from CAFA1 targets and CAFA2 targets. More precisely, we used the stored predictions of the target proteins from CAFA1 and compared them with the new predictions from CAFA2 on the overlapping set of CAFA2 benchmarks and CAFA1 targets (a sequence had to be a no-knowledge target in both experiments to be eligible for this evaluation). For this analysis only, we used an artificial GO version by taking the intersection of the two GO snapshots (versions from January 2011 and June 2013) so as to mitigate the influence of ontology changes. We, thus, collected 357 benchmark proteins for MFO comparisons and 699 for BPO comparisons. The two baseline methods were trained on respective Swiss-Prot annotations for both ontologies so that they serve as controls for database change. In particular, SwissProt2011 (for CAFA1) contained 29,330 and 31,282 proteins for MFO and BPO, while SwissProt2014 (for CAFA2) contained 26,907 and 41,959 proteins for the two ontologies.

To conduct a head-to-head analysis between any two methods, we generated B=10,000 bootstrap samples and let methods compete on each such benchmark set. The performance improvement δ from CAFA1 to CAFA2 was calculated as

$$\delta(m_2, m_1) = \frac{1}{B} \sum_{b=1}^{B} F_{\max}^{(b)}(m_2) - \frac{1}{B} \sum_{b=1}^{B} F_{\max}^{(b)}(m_1),$$

where $m_1$ and $m_2$ stand for methods from CAFA1 and CAFA2, respectively, and $F_{\max}^{(b)}(\cdot)$ represents the $F_{\max}$ of a method evaluated on the $b$-th bootstrapped benchmark set.
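The bootstrap comparison can be sketched in Python as below. To keep the example short, the `evaluate` function is a hypothetical stand-in that averages made-up per-protein quality values instead of computing F max; the resampling logic is the part being illustrated.

```python
import random


def bootstrap_delta(proteins, evaluate, m1, m2, n_boot=10_000, seed=0):
    """Average improvement of m2 over m1 across bootstrap resamples.

    evaluate(method, sample) must return a performance value (F max in the
    text) for the method on the resampled benchmark set.
    Returns (mean difference, fraction of resamples won by m2).
    """
    rng = random.Random(seed)
    total, wins = 0.0, 0
    for _ in range(n_boot):
        # Resample the benchmark proteins with replacement.
        sample = rng.choices(proteins, k=len(proteins))
        diff = evaluate(m2, sample) - evaluate(m1, sample)
        total += diff
        wins += diff > 0
    return total / n_boot, wins / n_boot


# Hypothetical per-protein quality values standing in for a real metric.
quality = {
    "cafa1_method": {"P1": 0.4, "P2": 0.5, "P3": 0.6},
    "cafa2_method": {"P1": 0.5, "P2": 0.6, "P3": 0.7},
}


def evaluate(method, sample):
    return sum(quality[method][p] for p in sample) / len(sample)


delta, win_rate = bootstrap_delta(
    ["P1", "P2", "P3"], evaluate, "cafa1_method", "cafa2_method", n_boot=1000
)
print(delta, win_rate)
```

Evaluating both methods on the same resample, as done here, keeps the comparison paired, so differences in benchmark composition affect both methods equally.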

Baseline models

We built two baseline methods, Naïve and BLAST, and compared them with all participating methods. The Naïve method simply predicts each term with a score equal to the frequency with which that term is annotated in a database [13]. BLAST was based on search results using the Basic Local Alignment Search Tool (BLAST) software against the training database [14]. The score for a term is the highest local alignment sequence identity among all BLAST hits annotated with that term. Both of these methods were trained on the experimentally annotated proteins available in Swiss-Prot at time t0, except for HPO where the two baseline models were trained using the annotations from the t0 release of the HPO.
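Both baselines are simple enough to sketch in a few lines of Python. The training annotations and the BLAST hits below are made up, and the hits are assumed to be already parsed into (training protein, percent identity) pairs; running BLAST itself is outside the scope of the sketch.

```python
def naive_predict(training):
    """Score each term by its annotation frequency in the training set.

    The same scores are returned for every target protein.
    """
    n = len(training)
    counts = {}
    for terms in training.values():
        for term in terms:
            counts[term] = counts.get(term, 0) + 1
    return {term: c / n for term, c in counts.items()}


def blast_predict(hits, training):
    """Score each term by the best sequence identity among hits carrying it.

    hits: list of (training protein, percent identity) for one query.
    """
    scores = {}
    for subject, identity in hits:
        for term in training.get(subject, ()):
            scores[term] = max(scores.get(term, 0.0), identity / 100)
    return scores


training = {
    "T1": {"a", "b"},
    "T2": {"a"},
    "T3": {"a", "c"},
    "T4": {"c"},
}
naive_scores = naive_predict(training)
blast_scores = blast_predict([("T1", 80.0), ("T3", 45.0)], training)
print(naive_scores, blast_scores)
```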


Conclusions

We introduce a new concept for the prediction of GO terms, namely annotation transfer based on the similarity of embeddings obtained from deep learning language models (LMs). This approach conceptually replaces sequence information by complex embeddings that capture some non-local information beyond sequence similarity. The underlying LMs (SeqVec & ProtBert) are highly involved and complex, and their training is time-consuming and data intensive. Once that is done, the pre-trained LMs can be applied and their abstracted understanding of the language of life, as captured by protein sequences, can be transferred to yield an extremely simple yet effective novel method for annotation transfer. This novel prediction method complements homology-based inference. Despite its simplicity, the new method outperformed homology-based inference (“BLAST”) by statistically significant margins, with Fmax differences (Fmax(embedding)-Fmax(sequence)) of BPO +11 ± 2%, MFO +8 ± 3%, and CCO +11 ± 2% (Table 1, Fig. 1); it might even have reached the top ten had it participated in CAFA3 (Fig. 1). Embedding-based transfer remained above the average for sequence-based transfer even for protein pairs with PIDE < 20% (Fig. 2), i.e., embedding similarity worked for proteins that diverged beyond recognition in pairwise alignments (Figs. S2 & S3). Embedding-based transfer is also blazingly fast to compute, i.e., around 0.05 s per protein. The only time-consuming step is computing embeddings for all proteins in the lookup database, which needs to be done only once; it took about 30 min for the entire human proteome. GO annotations added from 2017 to 2020 improved both sequence- and embedding-based annotation transfer significantly (Table 1). 
Another aspect of the simplicity is that, at least in the context of the CAFA3 evaluation, neither of the two free parameters really mattered: embeddings from both LMs tested performed, on average, equally well, and the number of best hits (k-nearest neighbors) did not matter much (Table S2). The power of this new concept stems from the degree to which embeddings implicitly capture information relevant for protein structure and function prediction. One reason for the success of our new concept was the limited correlation between embeddings and sequence (Table 2). Additionally, the abstraction of sequence information in embeddings appeared to make crucial information readily available (Fig. S6). This implies that embeddings have the potential to revolutionize the way sequence comparisons are carried out.
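The idea of annotation transfer by embedding similarity can be illustrated with a small k-nearest-neighbor lookup. This is only a sketch under stated assumptions: the two-dimensional vectors stand in for real SeqVec or ProtBert embeddings, the function name is hypothetical, and the choice of Euclidean distance with the score 1 / (1 + distance) is an assumption made for illustration rather than the authors' documented formula.

```python
import numpy as np

def transfer_annotations(query_emb, lookup_embs, lookup_terms, k=1):
    """Transfer GO terms from the k nearest lookup proteins in embedding space.

    Returns {term: score}; the score of a term is the largest similarity among
    the neighbours that carry it (similarity = 1 / (1 + Euclidean distance),
    an assumption of this sketch).
    """
    dists = np.linalg.norm(lookup_embs - query_emb, axis=1)
    nearest = np.argsort(dists)[:k]
    scores = {}
    for idx in nearest:
        sim = 1.0 / (1.0 + dists[idx])
        for term in lookup_terms[idx]:
            scores[term] = max(scores.get(term, 0.0), sim)
    return scores

# Toy lookup database: three annotated proteins with 2-d "embeddings".
lookup_embs = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]])
lookup_terms = [{"GO:a"}, {"GO:a", "GO:b"}, {"GO:c"}]
pred = transfer_annotations(np.array([0.9, 0.0]), lookup_embs, lookup_terms, k=2)
```

Because the lookup embeddings are computed once and reused, each query costs only one distance computation against the database, which is consistent with the speed reported above.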


Discussion

DeepPheno can predict sets of gene–phenotype associations from gene functional annotations. Specifically, it is designed to predict phenotypes which arise from a loss of function (where functions are represented using the Gene Ontology), and we have illustrated how DeepPheno relates loss of functions to their downstream phenotypic effects. While DeepPheno was trained using phenotypes arising from the loss of function of a gene, its reliance on functions (instead of structural features) may allow it to also be applied to different alterations of gene function, such as partial loss of function. Together with function prediction methods such as DeepGOPlus [31], DeepPheno can, in principle, predict phenotype associations for protein-coding genes using only the protein’s amino acid sequence. However, DeepGOPlus was trained on experimentally annotated sequences of many organisms, including several animal model organisms. It further combines global sequence similarity and a deep learning model which learns to recognize sequence motifs as well as some elements of protein structure. DeepGOPlus implicitly uses this combination of information in its predictions and is therefore able to predict physiological functions that are closely related to the abnormal phenotypes predicted by DeepPheno.

Evaluation

We evaluated DeepPheno on two datasets and compared its predictions with the top-performing methods in the CAFA2 challenge. DeepPheno showed the best overall performance in the evaluation with a time-based split. However, when we compared the performance of DeepPheno in 5-fold cross-validation on the CAFA2 challenge training set with other hierarchical classification methods such as PhenoStruct [15] and HTD/TPR [34], our method did not outperform HTD/TPR methods combined with support-vector machine classifiers and achieved the same performance as PhenoStruct. We think that the main reason for this is that we rely only on function annotations, whereas the other methods use additional features such as protein–protein interactions, literature, and disease-causing variants associated through gene-disease associations from HPO [10]. We did not use gene expression data because it was not available during the CAFA2 challenge. However, in our experiment with recent data, we have shown that DeepPheno can easily combine features from multiple sources, which improved its performance.

Hierarchical classifier

We implemented a novel hierarchical classification neural network in DeepPheno. It was inspired by our previous hierarchical classifier in DeepGO [32]. However, the version used in DeepPheno is significantly faster and more scalable. The main difference is that DeepPheno uses only one layer, which stores the ontology structure, whereas DeepGO had a layer for each class in the ontology, each of which required a connection to its children classes. Also, our new model achieves hierarchical consistency by a simple matrix multiplication operation followed by a MaxPooling layer and does not require complex operations. In DeepGO, the largest model can predict around 1,000 classes, while DeepPheno predicts around 4,000.
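The multiply-then-max idea can be sketched with NumPy. This is an illustration only, not the DeepPheno layer itself: the four-class toy ontology, the matrix name `M`, and the function name are hypothetical, and the sketch assumes scores lie in [0, 1] so that masked-out entries (zeros) never win the maximum.

```python
import numpy as np

# Toy ontology (hypothetical): class 0 is the root, 1 and 2 are its children,
# and 3 is a child of 1. M[i, j] = 1 if j is class i itself or a descendant of i.
M = np.array([
    [1, 1, 1, 1],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
])

def make_consistent(scores, M):
    """Set each class score to the max over the class and its descendants."""
    # Broadcasting multiplies the score vector into every row of M, then the
    # row-wise max plays the role of the pooling step.
    return (scores[None, :] * M).max(axis=1)

raw = np.array([0.2, 0.1, 0.6, 0.7])   # class 3 scores higher than its ancestors
consistent = make_consistent(raw, M)
```

After this step no class scores lower than any of its descendants, so thresholding the output always yields a set of classes that is closed under the ancestor relation.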

We specifically compare DeepPheno with other hierarchical classification methods such as PhenoStruct [15] and HTD/TPR [34]. Also, we use the true path rule [27] to fix hierarchical dependencies of DeepPhenoFlat classifiers and compare them with our hierarchical classifiers. In all cases, the DeepPheno models outperform flat classifiers that apply the true path rule after predictions.

Hierarchical deep neural networks have also been used to simulate interactions between processes within a cell and predict (cellular) phenotypes, notably in the DCell model [59]. DCell established a correspondence between the components of a deep neural network and ontology classes, both to model the hierarchical organization of a cell and to provide a means to explain genotype–phenotype predictions by identifying which parts of the neural network (and therefore which cell components or functions) are active when a prediction is made. DeepPheno uses ontologies both as input and output and to ensure that predictions are consistent with the HPO, but it does not directly enable the interpretability of models such as DCell. DeepPheno also solves a different problem compared to DCell: while DCell relates (yeast) genotypes to growth phenotypes, DeepPheno predicts the phenotypic consequences of a loss of function; and while DCell can simulate the processes within a cell, DeepPheno aims to simulate some aspects of human physiology and the phenotypes resulting from altering physiological functions.

Limitations and future research

Currently, DeepPheno suffers from several limitations. Firstly, we use mainly function annotations and gene expression as features. This gives our model the ability to predict phenotypes for many genes; however, phenotypes do not depend only on the functions of individual gene products but also arise from complex genetic and environmental interactions. Including such information may further improve our model. Specifically, we plan to include different types of interactions between genes in order to improve the prediction of complex phenotypes.

Secondly, DeepPheno can currently predict only a limited number of phenotypes, namely those for which we find at least 10 annotated genes. This limitation is caused by the need to train our neural network model, and it limits DeepPheno’s ability to predict specific phenotypes, which are the most informative. One way to overcome this limitation is to include phenotype associations with different evidence, such as those derived from GWAS studies, instead of using only phenotypes resulting from Mendelian disease as included in the HPO database.

Finally, DeepPheno uses a simple fully connected layer and a sparse representation of functional annotations and does not consider the full set of axioms in GO and HPO. Although this model gave us the best performance in our experiments, we think that more “complex” learning methods which encode all semantics in the ontologies need to be considered in the future.


Protein Function Prediction Using Deep Restricted Boltzmann Machines.

Proteins are the major components of living cells; they are the main material basis that forms and maintains life activities. Proteins take part in various biological activities, ranging from catalysis of biochemical reactions and transport to signal transduction [1, 2]. High-throughput biotechniques have produced explosive growth of biological data. Owing to the limitations of experimental techniques and research bias in biology [3, 4], the gap between newly discovered genome sequences and functional annotations of these sequences is becoming larger and larger. The Human Proteome Project consortium recently claimed that we still have very little information about the cellular functions of approximately two-thirds of human proteins [5]. Wet-lab experiments can precisely verify functions of proteins, but doing so is time-consuming and costly. In practice, wet-lab techniques can verify only a portion of the functions of proteins. In addition, it is difficult to efficiently verify functional annotations of a massive number of proteins by wet-lab techniques. Therefore, it is important and necessary to develop computational models that make use of available functional annotations of proteins and a variety of types of genomic and proteomic data to automatically infer protein functions [2, 6].

Various computational methods have been proposed to predict functional annotations of proteins. These methods are often driven by data-intensive computational models. Data may come from amino acid sequences [7], protein-protein interactions [8], pathways [9], and the fusion of multiple types of biological data [10-12]. Gene Ontology (GO) is a major bioinformatics tool for unifying the attributes of gene products across all species. It uses GO terms to describe these attributes [13], and the terms are structured in a directed acyclic graph (DAG). Each GO term in the graph can be viewed as a functional label and is associated with a distinct alphanumeric identifier, for example, GO:0008150 (biological process). GO is not static: researchers and the GO Consortium update GO as biological knowledge evolves. Currently, most functional annotations of proteins are shallow and far from complete [3-5]. Given the true path rule of GO [13], if a protein is annotated with a GO term, then all the ancestor terms of that term are also annotated to the protein, but it is uncertain whether its descendant terms should be annotated to the protein or not. Therefore, it is more desirable to know the specific annotations of a protein than the general ones, because the specific terms provide more biological information than their shallow ancestor terms. In this work, we investigate how to predict deep (or specific) annotations of a protein based on the available annotations of proteins.
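The true path rule mentioned above amounts to closing a protein's annotation set under the ancestor relation of the DAG. A minimal sketch follows; the term names and the `parents` mapping are hypothetical toy data, not real GO identifiers.

```python
def propagate(terms, parents):
    """Return the given terms plus all of their ancestors in the DAG."""
    closed, stack = set(), list(terms)
    while stack:
        term = stack.pop()
        if term not in closed:
            closed.add(term)
            stack.extend(parents.get(term, ()))
    return closed

# Toy DAG (hypothetical term names): child -> list of parents.
# "leaf" has two parents, which is allowed in a DAG but not in a tree.
parents = {"leaf": ["mid1", "mid2"], "mid1": ["root"], "mid2": ["root"]}
annotated = propagate({"leaf"}, parents)
```

The rule only works upward: annotating "mid1" would imply "root" but says nothing about "leaf", which is exactly the uncertainty about descendant terms described above.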

Functional associations between proteins and GO structure have been directly employed to predict protein functions [14-18]. Functional annotations of proteins can be encoded by a protein function association matrix, in which each row corresponds to a protein and each column represents a type of function. King et al. [14] directly applied a decision tree classifier (or a Bayes classifier) to the pattern of annotations to infer additional annotations of proteins. However, these two classifiers need sufficient annotations, and they perform rather poorly on specific GO terms, which are annotated to fewer than 10 proteins. Khatri et al. [15] used truncated singular value decomposition (tSVD) to replenish the missing functions of proteins based on the protein function matrix. This approach is able to predict missing annotations in existing annotation databases and improve prediction accuracy, but it does not take advantage of the hierarchical and flat relationships between GO terms. Previous research has demonstrated that the ontology hierarchy plays an important role in predicting protein function [2,16,18]. Done et al. [16] used a vector space model and a number of weighting schemes, along with a latent semantic indexing approach, to extract implicit semantic relationships between proteins and between functions in order to predict protein functions. This method is called NtN [16]. NtN takes into account the GO hierarchical structure and can weigh GO terms differently depending on their location in the GO DAG [19]. Tao et al. [17] proposed a method called information theory-based semantic similarity (ITSS). ITSS first calculates the semantic similarity between pairs of GO terms in a hierarchy and then sums these pairwise similarities over the GO terms annotated to two proteins. Next, it uses a kNN classifier to predict novel annotations of a protein. Yu et al. [18] proposed downward random walks (dRW) to predict missing (or new) functions of partially annotated proteins. In particular, dRW applies downward random walks with restart [20] on the GO DAG, started on terms annotated to a protein, to predict additional annotations of the protein.
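To make the tSVD idea concrete, the sketch below reconstructs a small protein function association matrix from its top singular components and reads the reconstructed values as scores for candidate annotations. It is a generic low-rank reconstruction under stated assumptions, not the implementation of Khatri et al.; the matrix, the rank, and the function name are chosen purely for illustration.

```python
import numpy as np

def tsvd_scores(A, rank):
    """Rank-`rank` reconstruction of the protein function association matrix A."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U[:, :rank] * s[:rank]) @ Vt[:rank]

# Rows are proteins, columns are GO terms (toy data).
A = np.array([
    [1, 1, 1, 0],
    [1, 1, 1, 0],
    [1, 1, 0, 0],   # protein 2 resembles proteins 0 and 1 but lacks term 2
    [0, 0, 0, 1],
], dtype=float)
scores = tsvd_scores(A, rank=2)
```

In this toy case the entry for protein 2 and term 2 becomes clearly positive in the reconstruction, while its entry for the unrelated term 3 stays near zero, which is how a low-rank model suggests a missing annotation.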

A protein is often engaged in several biological activities and is thus annotated with several GO terms. Each term can be regarded as a functional label, and protein function prediction can be modeled as a multilabel learning problem [21, 22]. From this viewpoint, protein function prediction using incomplete annotations can be modeled as a multilabel weak learning problem [22]. More recently, Yu et al. [23] proposed a method called PILL to replenish missing functions for partially annotated proteins using incomplete hierarchical label information. Fu et al. [24] proposed a method called dHG to predict novel functions of proteins using a directed hybrid graph, which consists of the GO DAG, a protein-protein interaction network, and available functional associations between GO terms and proteins. The aforementioned methods can be regarded as shallow machine learning approaches [25]; they do not capture deep associations between proteins and GO terms.

In this paper, we investigate a recently and widely applied technique, deep learning [25], to capture deep associations between proteins and GO terms and to replenish the missing annotations of incompletely annotated proteins. For this investigation, we apply deep restricted Boltzmann machines (DRBM) to predict functional annotations of proteins. DRBM utilizes the archived annotations of four model species (Homo sapiens, Saccharomyces cerevisiae, Mus musculus, and Drosophila) to explore the hidden associations between proteins and GO terms and the structural relationship between GO terms. At the same time, it optimizes the parameters of DRBM. After that, we validate the performance of DRBM by comparing its predictions with recently archived GO annotations of these four species. The empirical and comparative study shows that DRBM achieves better results than other related methods. DRBM also runs faster than some of the compared methods.

The structure of this paper is organized as follows. Section 2 briefly reviews some related deep learning techniques that are recently applied for protein function prediction. Section 3 introduces the restricted Boltzmann machine and deep restricted Boltzmann machine for protein function prediction. The experimental datasets, setup, and results are discussed in Section 4. Conclusions are provided in Section 5.

Some pioneers have already applied deep learning for some bioinformatics problems [26], but few works have been reported for protein function prediction. Autoencoder neural networks (AE) can process complex structural data better than shallow machine learning methods [25, 27, 28]. AE has been applied in computer vision [28], speech recognition [25, 27], and protein residue-residue contacts prediction [26]. Chicco et al. [29] recently used deep AE to predict protein functions. Experiments show that deep AE can explore the deep associations between proteins and GO terms and achieve better performance than other shallow machine learning based function prediction methods, including tSVD [29].

Deep AE takes much more time to fine-tune, and if the network is very deep, training suffers from the vanishing gradient problem. In this work, we suggest using deep restricted Boltzmann machines (DRBM), instead of AE, to predict functional annotations of proteins. DRBM has rapid convergence speed and good stability. Restricted Boltzmann machines have been used to construct deep belief networks [30] and have been applied to speech recognition [31, 32], collaborative filtering [33], computational biology [34], and other fields. Recently, Wang and Zeng [34] proposed to predict drug-target interactions using restricted Boltzmann machines and achieved good prediction performance. More recently, Li et al. [35] used conditional restricted Boltzmann machines to capture high-order label dependence relationships and facilitate multilabel learning with incomplete labels. Experiments have demonstrated the efficacy of restricted Boltzmann machines in addressing multilabel learning with incomplete labels.

To the best of our knowledge, few teams have investigated DRBM for large-scale prediction of missing functions. For this purpose, we study it for predicting functions of proteins of Homo sapiens, Saccharomyces cerevisiae, Mus musculus, and Drosophila and compare it with a number of related methods. The experimental results show that DRBM achieves better results than the compared methods on various evaluation metrics.

In this section, we will describe the deep restricted Boltzmann machines to predict missing GO annotations of proteins.

3.1. Restricted Boltzmann Machine. A restricted Boltzmann machine (RBM) is an undirected graphical model with stochastic binary units [32]. As shown in Figure 1, an RBM is a two-layer bipartite graph with two types of units: a set of visible units $v \in \{0, 1\}$ and a set of hidden units $h \in \{0, 1\}$. Visible units and hidden units are fully connected; there is no connection between nodes in the same layer. In this paper, the number of visible units is equal to the number of GO terms, and these units take the protein function association matrix as inputs.

RBM is an unsupervised method; it learns one layer of hidden features. When the number of hidden units is smaller than that of visible units, the hidden layer can deal with nonlinear complex dependencies and structure in the data, capture deep relationships from the input data [30], and represent the input data more compactly. Latent feature values are represented by the hidden units, and visible units encode the available GO annotations of proteins. Suppose there are $c$ (the number of GO terms) visible units and $m$ hidden units in an RBM. $v_i$ ($i = 1, \dots, c$) indicates the state of the $i$th visible unit, where $v_i = 1$ means the $i$th term is annotated to the protein and $v_i = 0$ means the $i$th term is not associated with the protein. The binary variable $h_j$ ($j = 1, \dots, m$) indicates the state of the $j$th hidden unit, and $h_j = 1$ denotes that the $j$th hidden unit is active. Let $W_{ij}$ be the weight associated with the connection between $v_i$ and $h_j$. $(v, h)$ is a joint configuration of an RBM.

The energy function capturing the interaction patterns between the visible layer and the hidden layer can be modeled as follows:

$$E(v, h \mid \theta) = -\sum_{i=1}^{c} a_i v_i - \sum_{j=1}^{m} b_j h_j - \sum_{i=1}^{c} \sum_{j=1}^{m} v_i W_{ij} h_j, \tag{1}$$

where $\theta = \{W_{ij}, a_i, b_j\}$ are the parameters of the RBM, and $a_i$ and $b_j$ are biases for the visible and hidden variables, respectively. $W \in \mathbb{R}^{c \times m}$ encodes the weights of the connections between the $c$ visible variables and the $m$ hidden variables. Then, the joint probability of a configuration of $v$ and $h$ can be defined as

$$P(v, h \mid \theta) = \frac{\exp(-E(v, h \mid \theta))}{Z(\theta)}, \tag{2}$$

where $Z$ is a normalization constant or partition function, $Z(\theta) = \sum_{v, h} \exp(-E(v, h \mid \theta))$. The marginal distribution over visible data is

$$P(v \mid \theta) = \frac{1}{Z(\theta)} \sum_{h} \exp(-E(v, h \mid \theta)). \tag{3}$$

Because there is no connection between visible units (or between hidden units) in an RBM, the conditional distributions over the visible and hidden units are given by logistic functions as follows:

$$P(v_i = 1 \mid h) = \sigma\Big(a_i + \sum_{j} h_j W_{ij}\Big), \tag{4}$$

$$P(h_j = 1 \mid v) = \sigma\Big(b_j + \sum_{i} v_i W_{ij}\Big), \tag{5}$$

where $\sigma(x) = 1/(1 + \exp(-x))$ is the logistic sigmoid function.

It is difficult to train an RBM with a large number of parameters. To train an RBM efficiently and optimize its parameters, we maximize the likelihood of the visible data with respect to the parameters. To achieve this goal, the derivative of the log probability of the training data, derived from (3), can be used to adjust the weights incrementally:

$$\frac{\partial \log P(v \mid \theta)}{\partial W_{ij}} = \langle v_i h_j \rangle_{\text{data}} - \langle v_i h_j \rangle_{\text{model}}, \tag{6}$$

where $\langle \cdot \rangle$ indicates an expectation under the distribution given by the subscript. This yields a simple learning rule for ascending the log-likelihood of the training data:

$$\Delta W_{ij} = \epsilon \left( \langle v_i h_j \rangle_{\text{data}} - \langle v_i h_j \rangle_{\text{model}} \right), \tag{7}$$

where $\epsilon$ controls the learning rate. Since there are no direct connections within the hidden layer of an RBM, we can easily get an unbiased sample of $\langle v_i h_j \rangle_{\text{data}}$. Unfortunately, it is difficult to compute an unbiased sample of $\langle v_i h_j \rangle_{\text{model}}$, since it requires exponential time. To avoid this problem, a fast learning algorithm called Contrastive Divergence (CD) [36] was proposed by Hinton [37]. CD sets the visible variables to the training data. Then the binary states of the hidden units are all computed in parallel using (5). Once the states have been chosen for the hidden units, a "reconstruction" is produced by setting each $v_i$ to 1 with a probability given by (4). The weights are then adjusted in each training pass as follows:

$$\Delta W_{ij} = \epsilon \left( \langle v_i h_j \rangle_{\text{data}} - \langle v_i h_j \rangle_{\text{recon}} \right). \tag{8}$$

$\langle v_i h_j \rangle_{\text{data}}$ is the average value over all input data for each update, and $\langle v_i h_j \rangle_{\text{recon}}$ is the average value over the reconstruction; the latter is considered a good approximation to $\langle v_i h_j \rangle_{\text{model}}$.
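The Contrastive Divergence procedure described above can be sketched as a single-step (CD-1) update in NumPy. This is a minimal illustration, not the authors' implementation: the function names, the toy annotation matrix, the small random weight initialization, and the bias updates (which the text does not spell out) are assumptions of the sketch.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(V, W, a, b, lr, rng):
    """One CD-1 step on a batch V (n_proteins x c) of binary annotation vectors."""
    ph_data = sigmoid(b + V @ W)                        # P(h_j = 1 | v), as in (5)
    h = (rng.random(ph_data.shape) < ph_data) * 1.0     # sample binary hidden states
    pv_recon = sigmoid(a + h @ W.T)                     # P(v_i = 1 | h), as in (4)
    v_recon = (rng.random(pv_recon.shape) < pv_recon) * 1.0
    ph_recon = sigmoid(b + v_recon @ W)
    n = V.shape[0]
    # Weight update in the spirit of (8): data statistics minus reconstruction statistics.
    W += lr * (V.T @ ph_data - v_recon.T @ ph_recon) / n
    # Bias updates follow the same pattern (an assumption; not given in the text).
    a += lr * (V - v_recon).mean(axis=0)
    b += lr * (ph_data - ph_recon).mean(axis=0)
    return W, a, b

rng = np.random.default_rng(0)
V = np.array([[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1]], dtype=float)  # toy data
c, m = V.shape[1], 2
W = 0.01 * rng.standard_normal((c, m))   # small random init (assumption)
a, b = np.zeros(c), np.zeros(m)
for _ in range(200):
    W, a, b = cd1_update(V, W, a, b, lr=0.1, rng=rng)
```

Using the hidden probabilities rather than sampled states in the update statistics is a common variance-reduction choice; sampling is still needed for the hidden states that drive the reconstruction.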

3.2. Deep RBM. In this paper, we use fully connected restricted Boltzmann machines and consider learning a multilayer stack of RBMs (as shown in Figure 2). In this network structure, each layer captures complicated correlations between the hidden layer and the layer beneath it.

DRBM is adopted for several reasons [38]. Firstly, DRBM, like deep belief networks, has the potential to learn internal representations that become increasingly complex; it is regarded as a promising way to solve complex problems [30]. Secondly, high-level representations can be built from large volumes of incomplete sensory inputs and scarce labeled data and then be used to unfold the model. Finally, DRBM can propagate uncertainty information well and hence deal robustly with ambiguous inputs. Hinton et al. [30] introduced a greedy, layer-by-layer unsupervised learning algorithm that consists of learning a stack of RBMs. After the stacked RBMs have been learned, the whole stack can be viewed as a single probabilistic model. In this paper, we use that greedy algorithm to optimize the parameters of DRBM. DRBM greedily trains a stack of more than two RBMs, and the modification only needs to be used for the first and last RBMs in the stack. Pretraining consists of learning a stack of RBMs, each of which has only one layer of feature detectors. The learned feature activations of one RBM are used as the input data to train the next RBM in the stack. After that, these RBMs are unfolded to create a DRBM. Through the above training, we can optimize the parameters of DRBM and then take the outputs of the network as the results of protein function prediction.
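The greedy layer-by-layer procedure can be sketched as follows: train one RBM, feed its hidden activations to the next, and repeat. This is an illustrative sketch rather than the authors' code. The function names and synthetic data are hypothetical, the per-layer trainer is a compact CD-1 loop that uses reconstruction probabilities instead of sampled visible states, and the unfolding step is omitted. The halving of layer sizes, the learning rate of 0.01, and the 25 iterations follow the setup reported in Section 4.1.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(X, n_hidden, rng, lr=0.01, epochs=25):
    """Train one RBM with CD-1; return its weights and hidden biases."""
    W = 0.01 * rng.standard_normal((X.shape[1], n_hidden))
    a, b = np.zeros(X.shape[1]), np.zeros(n_hidden)
    for _ in range(epochs):
        ph = sigmoid(b + X @ W)
        h = (rng.random(ph.shape) < ph).astype(float)
        v = sigmoid(a + h @ W.T)          # reconstruction probabilities
        ph2 = sigmoid(b + v @ W)
        W += lr * (X.T @ ph - v.T @ ph2) / len(X)
        a += lr * (X - v).mean(axis=0)
        b += lr * (ph - ph2).mean(axis=0)
    return W, b

def train_stack(X, n_layers, rng):
    """Greedy layer-wise training; each layer has half the units of the one below."""
    layers, data = [], X
    for _ in range(n_layers):
        n_hidden = max(1, data.shape[1] // 2)
        W, b = train_rbm(data, n_hidden, rng)
        layers.append((W, b))
        data = sigmoid(b + data @ W)      # activations become the next RBM's input
    return layers

rng = np.random.default_rng(0)
X = (rng.random((20, 16)) < 0.3).astype(float)   # synthetic binary annotations
layers = train_stack(X, n_layers=3, rng=rng)
```

Each RBM is trained in isolation on a fixed input, which is what makes the procedure greedy: lower layers are never revisited while higher layers are being learned.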

4.1. Datasets and Experimental Setup. To study the performance of DRBM on predicting missing GO annotations of incompletely annotated proteins, we downloaded the GO file (http://geneontology.org/page/downloadontology) (archived date: 2015-10-22), which describes hierarchical relationships between GO terms using a DAG. These GO terms are divided into three branches, describing the molecular function (MF), cellular component (CC), and biological process (BP) aspects of proteins. We also downloaded the Gene Ontology Annotation (GOA) (archived date: 2014-10-27) files (http://geneontology.org/page/downloadannotations) of Saccharomyces cerevisiae, Homo sapiens, Mus musculus, and Drosophila. We preprocessed the GO file to exclude the GO terms tagged "obsolete." To avoid circular prediction, we processed the GOA file to exclude the annotations with evidence code "IEA" (Inferred from Electronic Annotation). The missing annotations of a protein often correspond to the descendants of the terms currently annotated to the protein. The terms corresponding to these missing annotations are therefore located at deeper levels than their ancestor terms, and they characterize more specific biological functions of proteins than their ancestors. These specific terms are usually annotated to no more than 30 proteins; they are regarded as sparse functions. On the other hand, the root terms, GO:0008150 for BP, GO:0003674 for MF, and GO:0005575 for CC, are annotated to the majority of proteins. Predictions on these terms are not interesting, so we removed these three root terms. We kept the terms annotated to at least one protein in the GOA file for experiments. The statistics of the preprocessed GO annotations of proteins in these four model species are listed in Table 1.
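The filtering steps described above (dropping IEA annotations, root terms, and obsolete terms) might look as follows for annotation lines in the tab-separated GAF layout, where the second, fifth, and seventh columns hold the object identifier, the GO identifier, and the evidence code. This is a sketch under stated assumptions: the function name and sample records are hypothetical, the set of obsolete terms is assumed to be supplied by the caller, and qualifiers such as NOT are ignored for brevity.

```python
ROOT_TERMS = {"GO:0008150", "GO:0003674", "GO:0005575"}

def load_annotations(lines, obsolete=frozenset()):
    """Collect {protein_id: set of GO terms} from GAF-formatted lines."""
    annotations = {}
    for line in lines:
        if line.startswith("!"):              # GAF header/comment line
            continue
        cols = line.rstrip("\n").split("\t")
        protein, go_id, evidence = cols[1], cols[4], cols[6]
        if evidence == "IEA" or go_id in ROOT_TERMS or go_id in obsolete:
            continue
        annotations.setdefault(protein, set()).add(go_id)
    return annotations

# Toy input with hypothetical identifiers; only the first seven columns are filled.
sample = [
    "!gaf-version: 2.0",
    "UniProtKB\tP00001\tGENE1\t\tGO:0003677\tPMID:1\tIDA",
    "UniProtKB\tP00001\tGENE1\t\tGO:0005634\tPMID:2\tIEA",
    "UniProtKB\tP00002\tGENE2\t\tGO:0008150\tPMID:3\tND",
]
annotations = load_annotations(sample)
```

In the toy input, the second protein disappears entirely because its only annotation is a root term, which mirrors how uninformative annotations drop out of the training matrix.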

We also downloaded recently archived GOA files (date: 2015-10-12) of these four species to validate the performance of DRBM and processed these GOA files in a similar way. We use the data archived in 2014 to train DRBM and then use the data archived in 2015 for validation.

To comparatively evaluate the performance of DRBM, we compare it with SVD [15], NtN [16], dRW [18], and AE [29]. SVD, NtN, and dRW are shallow machine learning algorithms. AE and DRBM are deep machine learning methods. DRBM is set with a learning rate of 0.01 for 25 iterations [29]. L2 regularization is used on all weights, which are initialized randomly from the uniform distribution between 0 and 1. We set the hidden unit function to sigmoid, the number of units in the first hidden layer to half the number of visible units, the number of units in the second hidden layer to half that of the first hidden layer, and so on. The number of hidden layers is 5. In the following experiments, to prevent overfitting, we used weight decay and dropout. Weight decay adds an extra term to the normal gradient. This extra term is the derivative of a function that penalizes large weights. We used the simplest L2 penalty function. Dropout is a regularization technique for reducing overfitting in neural networks by preventing complex coadaptations on training data [39].

The accuracy of protein function prediction can be evaluated by different evaluation metrics, and the performance of different prediction models is affected by the adopted evaluation metrics. To make a fair and comprehensive comparison, we used four evaluation metrics: MacroAvgF1, AvgROC, RankingLoss, and Fmax. These evaluation metrics measure the performance of protein function prediction from different aspects. The first three metrics have been applied to evaluate the results of multilabel learning [40]. AvgROC and Fmax are recommended metrics for evaluating protein function prediction [6, 41]. MacroAvgF1 computes the F1-score of each term and then takes the average of the F1-scores across all the terms. AvgROC first calculates the area under the receiver operating characteristic curve of each term and then takes the average of these areas to measure the overall performance. Fmax [6] is the overall maximum harmonic mean of recall and precision across all possible thresholds on the predicted protein function association matrix. RankingLoss computes the average fraction of wrongly predicted annotations ranked ahead of the ground-truth annotations of proteins. To be consistent with the other evaluation metrics, we use 1 - RankingLoss instead of RankingLoss, so that for all four metrics a higher value means better performance. The formal definitions of these metrics can be found in [6, 22, 40]. Since these metrics capture different aspects of a function prediction method, it is difficult for an approach to consistently outperform the others across all the evaluation metrics.
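Two of these metrics can be sketched directly from their verbal definitions. The code below is an illustration, not the evaluation code used in the paper: the formal definitions are in the cited references, and the sketch assumes a protein-centric averaging for Fmax (precision averaged over proteins with at least one prediction, recall over all proteins), a fixed threshold grid, and that tied scores count as ranking errors.

```python
import numpy as np

def fmax(Y, S, thresholds=np.linspace(0.01, 1.0, 100)):
    """Maximum F-measure over thresholds (protein-centric averaging, an assumption)."""
    best = 0.0
    for t in thresholds:
        P = S >= t                                   # predicted annotations at threshold t
        tp = (P & (Y == 1)).sum(axis=1)
        has_pred = P.sum(axis=1) > 0
        if not has_pred.any():
            continue
        precision = (tp[has_pred] / P.sum(axis=1)[has_pred]).mean()
        recall = (tp / np.maximum(Y.sum(axis=1), 1)).mean()
        if precision + recall > 0:
            best = max(best, 2 * precision * recall / (precision + recall))
    return best

def one_minus_ranking_loss(Y, S):
    """1 - average fraction of (true, false) term pairs ranked in the wrong order."""
    losses = []
    for y, s in zip(Y, S):
        pos, neg = s[y == 1], s[y == 0]
        if len(pos) == 0 or len(neg) == 0:
            continue                                 # no pairs to compare for this protein
        losses.append((pos[:, None] <= neg[None, :]).mean())
    return 1.0 - float(np.mean(losses))

# Rows are proteins, columns are GO terms; Y holds true labels, S predicted scores.
Y = np.array([[1, 0, 1, 0], [0, 1, 0, 0]])
S = np.array([[0.9, 0.2, 0.8, 0.1], [0.3, 0.7, 0.2, 0.1]])
S_bad = np.array([[0.1, 0.9, 0.8, 0.2], [0.3, 0.7, 0.2, 0.1]])
```

Here `S` ranks every true term above every false one, so both metrics reach their maximum, whereas `S_bad` misorders several pairs for the first protein and scores lower on both.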

4.2. Experimental Results. Based on the experimental protocols introduced above, we conduct experiments to investigate the performance of DRBM on protein function prediction.

In Table 2, we report the experimental results on proteins of Homo sapiens annotated with BP, CC, and MF terms, respectively. The results on Mus musculus, Saccharomyces cerevisiae, and Drosophila are provided in Tables 3-5. In these tables, the best results are in boldface.

From these tables, we can see that DRBM achieves better results than NtN, dRW, SVD, and AE in most cases. We further analyzed the differences between DRBM and the compared methods with the Wilcoxon signed rank test [42, 43]. We find that DRBM performs significantly better than NtN, dRW, and SVD on the first three metrics (where p values are all smaller than 0.004), and it also performs better than deep AE across these four metrics (p value smaller than 0.001). dRW often obtains a larger Fmax than DRBM; a possible reason is that dRW uses a threshold to filter out some predictions and thus increases the true positive rate.

dRW applies downward random walks with restart on the GO directed acyclic graph to predict protein function. It takes into account the hierarchical structural relationship between GO terms and achieves better results than NtN and SVD. This observation confirms that the hierarchical relationship between terms plays an important role in protein function prediction. Although dRW utilizes the hierarchical structural relationship between terms, it is still a shallow machine learning method and does not capture the deep associations between proteins and GO terms as DRBM does, so it is often outperformed by DRBM.
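The general mechanism of a downward random walk with restart can be sketched as below. This is a generic illustration of the technique and not the dRW method as published: the uniform transition probabilities from a term to its children, the restart probability of 0.5, the toy DAG, and the function name are all assumptions of the sketch.

```python
import numpy as np

def downward_rwr(children, start_terms, terms, restart=0.5, n_iter=50):
    """Random walk with restart that only moves from a term to its children."""
    idx = {t: i for i, t in enumerate(terms)}
    n = len(terms)
    T = np.zeros((n, n))                 # T[i, j]: probability of stepping from i to child j
    for parent, kids in children.items():
        for kid in kids:
            T[idx[parent], idx[kid]] = 1.0 / len(kids)
    p0 = np.zeros(n)                     # restart distribution over the annotated terms
    for t in start_terms:
        p0[idx[t]] = 1.0 / len(start_terms)
    p = p0.copy()
    for _ in range(n_iter):
        p = (1 - restart) * (T.T @ p) + restart * p0
    return dict(zip(terms, p))

# Toy DAG (hypothetical term names): parent -> list of children.
terms = ["root", "a", "b", "a1"]
children = {"root": ["a", "b"], "a": ["a1"]}
scores = downward_rwr(children, start_terms={"a"}, terms=terms)
```

Starting from the annotated term "a", probability mass reaches only its descendant "a1" and never the sibling "b" or the ancestor "root", which is how a downward walk proposes more specific terms as candidate annotations.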

The results of NtN and SVD are always lower than those of AE and DRBM. A possible reason is that singular value decomposition of a sparse matrix is not suitable for this kind of protein function prediction problem, in which there are complex hierarchical relationships between GO terms. NtN uses the ontology hierarchy to adjust the weights of protein function associations, but it does not achieve better results than SVD. The reason is that NtN gives large weights to specific annotations but small weights to shallow annotations, whereas, by the true path rule, ancestor terms are generally annotated to more proteins than their descendant terms. For this reason, NtN is often outperformed by SVD, to say nothing of AE and DRBM. Both AE and DRBM are deep machine learning techniques, but DRBM frequently performs better than AE. That is because the generalization ability of AE is not as good as that of DRBM, and AE easily falls into local optima. In summary, these results and comparisons demonstrate that DRBM can capture deep associations between proteins and GO terms, and thus it achieves better performance than other related methods across different evaluation measures. DRBM is an effective alternative approach for protein function prediction.

4.3. Runtime Analysis. Here, we study the runtime cost (including the training and test phases) of the compared methods on Homo sapiens and Mus musculus in the GO BP subontology, since this subontology includes many more annotations and GO terms than the others. The experimental platform is Windows Server 2008, Intel Xeon E7-4820, 64 GB RAM. The recorded runtimes of the compared methods are reported in Table 6.

From this table, we can see that DRBM is faster than the compared methods, except SVD. NtN and dRW spend a lot of time computing the semantic similarity between GO terms, so they take more time than the others. In contrast, SVD directly applies matrix decomposition to the protein function association matrix, and the matrix is sparse, so SVD takes less time than DRBM. AE employs back-propagation neural networks to tune parameters, which costs a large amount of time. DRBM utilizes Contrastive Divergence, which is a fast learning algorithm, to optimize the parameters, so its runtime is lower than that of AE. This comparison further confirms that DRBM is an efficient and effective alternative solution for protein function prediction.

In this paper, we study how to predict additional functional annotations of annotated proteins. We investigate deep restricted Boltzmann machines (DRBM) for this purpose. Our empirical study on the proteins of Saccharomyces cerevisiae, Homo sapiens, Mus musculus, and Drosophila shows that DRBM outperforms several competitive related methods, especially shallow machine learning models. This paper will drive more research on using deep machine learning techniques for protein function prediction. As part of our future work, we will integrate other types of proteomic data with DRBM to further boost the prediction performance.

The authors declare that there are no conflicts of interest regarding the publication of this paper.

This work is partially supported by Natural Science Foundation of China (no. 61402378), Natural Science Foundation of CQ CSTC (nos. cstc2014jcyjA40031 and cstc2016jcyjA0351), Science and Technology Development of Jilin Province of China (20150101051JC and 20160520099JH), Science and Technology Foundation of Guizhou (Grant no. QKHJC20161076), the Science and Technology Top-Notch Talents Support Project of Colleges and Universities in Guizhou (Grant no. QJHKY2016065), and Fundamental Research Funds for the Central Universities of China (nos. XDJK2016B009 and 2362015XK07).

[1] R. J. Roberts, "Identifying protein functiona call for community action," PLoS Biology, vol. 2, no. 3, p. e42, 2004.

[2] G. Pandey, V. Kumar, and M. Steinbach, in Computational approaches for protein function prediction: a survey, pp. 6-28, Department of Computer Science and Engineering, University of Minnesota, A survey, 2006.

[3] A. M. Schnoes, D. C. Ream, A. W. Thorman, P. C. Babbitt, and I. Friedberg, "Biases in the experimental annotations of protein function and their effect on our understanding of protein function space," PLoS Computational Biology, vol. 9, no. 5, Article ID e1003063, 2013.

[4] P D. Thomas, V. Wood, C. J. Mungall, S. E. Lewis, and J. A. Blake, "On the use of gene ontology annotations to assess functional similarity among orthologs and paralogs: a short report," PLoS Computational Biology, vol. 8, no. 2, Article ID e1002386, 2012.

[5] P Legrain, R. Aebersold, A. Archakov et al., "The human proteome project: current state and future direction," Molecular & CellularProteomics, vol. 10, no. 7, article 009993, 2011.

[6] P Radivojac, W. Clark, T. Oron et al., "A large-scale evaluation ofcomputational protein function prediction," Nature Methods, vol. 10, no. 3, pp. 221-227, 2013.

[7] D. Lee, O. Redfern, and C. Orengo, "Predicting protein function from sequence and structure," Nature Reviews Molecular Cell Biology, vol. 8, no. 12, pp. 995-1005, 2007.

[8] R. Sharan, I. Ulitsky, and R. Shamir, "Network-based prediction of protein function," Molecular Systems Biology, vol. 3, p. 88, 2007.

[9] M. Cao, C. M. Pietras, X. Feng et al., "New directions for diffusion-based network prediction of protein function: Incorporating pathways with confidence," Bioinformatics, vol. 30, no. 12, pp. I219-I227, 2014.

[10] N. Cesa-Bianchi, M. Re, and G. Valentini, "Synergy of multilabel hierarchical ensembles, data fusion, and cost-sensitive methods for gene functional inference," Machine Learning, vol. 88, no. 1-2, pp. 209-241, 2012.

[11] G. Yu, C. Domeniconi, H. Rangwala, G. Zhang, and Z. Yu, "Transductive multi-label ensemble classification for protein function prediction," in Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2012, pp. 1077-1085, chn, August 2012.

[12] G. Yu, G. Fu, J. Wang, and H. Zhu, "Predicting Protein Function via Semantic Integration of Multiple Networks," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 13, no. 2, pp. 220-232, 2016.

[13] M. Ashburner, C. A. Ball, J. A. Blake et al., "Gene ontology: tool for the unification of biology," Nature Genetics, vol. 25, no. 1, pp. 25-29, 2000.

[14] O. D. King, R. E. Foulger, S. S. Dwight, J. V. White, and F. P. Roth, "Predicting gene function from patterns of annotation," Genome Research, vol. 13, no. 5, pp. 896-904, 2003.

[15] P. Khatri, B. Done, A. Rao, A. Done, and S. Draghici, "A semantic analysis of the annotations of the human genome," Bioinformatics, vol. 21, no. 16, pp. 3416-3421, 2005.

[16] B. Done, P Khatri, A. Done, and S. Draghici, "Predicting novel human gene ontology annotations using semantic analysis," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 7, no. 1, pp. 91-99, 2010.

[17] Y. Tao, L. Sam, J. Li, C. Friedman, and Y. A. Lussier, "Information theory applied to the sparse gene ontology annotation network to predict novel gene function," Bioinformatics, vol. 23, no. 13, pp. i529-i538, 2007.

[18] G. Yu, H. Zhu, C. Domeniconi, and J. Liu, "Predicting protein function via downward random walks on a gene ontology," BMC Bioinformatics, vol. 16, no. 1, article no. 271, 2015.

[19] G. Salton, A. Wong, and C. S. Yang, "A vector space model for automatic indexing," Communications of the ACM, vol. 18, no. 11, pp. 613-620, 1975.

[20] H. Tong, C. Faloutsos, and J.-Y. Pan, "Random walk with restart: Fast solutions and applications," Knowledge and Information Systems, vol. 14, no. 3, pp. 327-346, 2008.

[21] G. Yu, H. Rangwala, C. Domeniconi, G. Zhang, and Z. Yu, "Protein function prediction with incomplete annotations," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 11, no. 3, pp. 579-591, 2013.

[22] G. Yu, C. Domeniconi, H. Rangwala, and G. Zhang, "Protein function prediction using dependence maximization," in Proceedings of the Joint European Conference on Machine Learning and Knowledge Discovery in Databases, vol. 8188 of Lecture Notes in Computer Science, pp. 574-589, Springer Berlin Heidelberg.

[23] G. Yu, H. Zhu, and C. Domeniconi, "Predicting protein functions using incomplete hierarchical labels," BMC Bioinformatics, vol. 16, no. 1, article no. 1, 2015.

[24] G. Fu, G. Yu, J. Wang, and Z. Zhang, "Novel protein function prediction using a direct hybrid graph," Science China-Information Science, vol. 46, no. 4, pp. 461-475, 2016.

[25] L. Deng and D. Yu, "Deep learning: methods and applications," Foundations and Trends in Signal Processing, vol. 7, no. 3-4, pp. 197-387, 2013.

[26] J. Eickholt and J. Cheng, "Predicting protein residue-residue contacts using deep networks and boosting," Bioinformatics, vol. 28, no. 23, pp. 3066-3072, 2012.

[27] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436-444, 2015.

[28] G. E. Hinton and R. R. Salakhutdinov, "Reducing the dimensionality of data with neural networks," American Association for the Advancement of Science. Science, vol. 313, no. 5786, pp. 504-507, 2006.

[29] D. Chicco, P. Sadowski, and P Baldi, "Deep autoencoder neural networks for gene ontology annotation predictions," in Proceedings of the 5th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics, ACMBCB 2014, pp. 533-540, usa, September 2014.

[30] G. E. Hinton, S. Osindero, and Y.-W. Teh, "A fast learning algorithm for deep belief nets," Neural Computation, vol. 18, no. 7, pp. 1527-1554, 2006.

[31] I. Fasel and J. Berry, "Deep belief networks for real-time extraction of tongue contours from ultrasound during speech," in Proceedings of the 20th International Conference on Pattern Recognition, ICPR 2010, pp. 1493-1496, tur, August 2010.

[32] A. Fischer and C. Igel, "An Introduction to Restricted Boltzmann Machines," in Progress in Pattern Recognition, Image Analysis, Computer Vision, andApplications, vol. 7441 of Lecture Notes in Computer Science, pp. 14-36, Springer Berlin Heidelberg, Berlin, Heidelberg, 2012.

[33] R. Salakhutdinov, A. Mnih, and G. Hinton, "Restricted Boltzmann machines for collaborative filtering," in Proceedings of the 24th International Conference on Machine learning (ICML '07), vol. 227, pp. 791-798, Corvallis, Oregon, June 2007.

[34] Y. Wang and J. Zeng, "Predicting drug-target interactions using restricted Boltzmann machines," Bioinformatics, vol. 29, no. 13, pp. 1126--1134, 2013.

[35] X. Li, F. Zhao, and Y. Guo, "Conditional restricted boltzmann machines for multi-label learning with incomplete labels," in Proceedings of the in Proceedings of 18th International Conference on Artificial Intelligence and Statistics, pp. 635-643, 2015.

[36] G. E. Hinton, "Training products of experts by minimizing contrastive divergence," Neural Computation, vol. 14, no. 8, pp. 1771-1800, 2002.

[37] G. Hinton, "A practical guide to training restricted Boltzmann machines," in Neural Networks: Tricks of the Trade, G. Montavon, G. B. Orr, and K.-R. Muller, Eds., vol. 7700 of Lecture Notes in Computer Science, pp. 599-619, Springer, Berlin, Germany, 2nd edition, 2012.

[38] R. Salakhutdinov and G. E. Hinton, "Deep Boltzmann Machines," in Proceedings of the In Proceedings of 12th International Conference on Artificial Intelligence and Statistics, pp. 448-455, 2009.

[39] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: a simple way to prevent neural networks from overfitting," The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929-1958, 2014.

[40] M.-L. Zhang and Z.-H. Zhou, "A review on multi-label learning algorithms," IEEE Transactions on Knowledge and Data Engineering, vol. 26, no. 8, pp. 1819-1837, 2014.

[41] Y. Jiang, "An expanded evaluation of protein function prediction methods shows an improvement in accuracy," Genome Biology, vol. 17, no. 1-19, pp. 1819-1837, 2016.

[42] L. Wilcoxon, "Individual comparison by ranking methods," Biometrics, vol. 1, no. 6, pp. 80-83, 1945.

[43] J. Demsar, "Statistical comparisons of classifiers over multiple data sets," The Journal of Machine Learning Research, vol. 7, no. 1, pp. 1-30, 2006.

Xianchun Zou, Guijun Wang, and Guoxian Yu

College of Computer and Information Science, Southwest University, Chongqing, China



Function SIG: Gene and Protein Function Annotation

COSI Track Presentations

The analysis of correlated evolution between genes can be used to infer functional interactions between the proteins they encode. Co-evolutionary analyses are often validated by their ability to identify proteins engaged in a physical complex or sharing a metabolic pathway. However, in addition to predicting interactions, they may also provide valuable information about independence. Using folate metabolism as a case study, we find a pair of enzymes that co-evolve with one another, statistically and experimentally, but independently from the rest of the pathway. A strategy for identifying groups of proteins that adapt and function as self-contained units would help render cellular systems more tractable and predictable, and suggest practical strategies for metabolic engineering.

  • Giuseppe Profiti, University of Bologna and ELIXIR Italy, Italy
  • Castrense Savojardo, University of Bologna, Italy
  • Pier Luigi Martelli, University of Bologna, Italy
  • Rita Casadio, University of Bologna, Italy

Critical Assessment of protein Function Annotation algorithms (CAFA) is a scientific challenge run every two years that consists of predicting Gene Ontology (GO) terms from protein sequences.
The organizers release a set of protein sequences, participants’ predictions must be deposited by the following January, and the evaluation is performed on the experimental annotations accumulated over the following months (at least 6).
A paper with the results is usually published before the next instalment of the challenge: CAFA1 (2010-2011) results were published in 2013 and CAFA2 (2013-2014) results in 2016, while the CAFA3 (2016-2017) evaluation is still in progress.
Journals such as the NAR Web Server issue require CAFA results for predictors submitted for publication. However, such results become available years after the method was tested in CAFA, and in any case the challenge is run only every two years. This leads to a gap: scientists must either use old scores or perform “in house” CAFA-like evaluations.
Given this scenario, we propose a centralized continuous evaluation system for CAFA-like assessments. This would help provide consistent and certified scores, clear dataset references and openness. Existing benchmarking platforms like OpenEBench could be exploited for this purpose.

  • Peter Freddolino, University of Michigan, United States
  • Mehdi Rahimpour, University of Michigan, United States
  • Chengxin Zhang, University of Michigan, United States
  • Yang Zhang, University of Michigan, United States

Computational functional annotation is frequently hampered by the lack of high-identity templates for any new target of interest. We have recently developed a hybrid pipeline combining structural prediction/alignment, sequence alignment, and protein-protein interaction information to obtain combined structure predictions and functional annotations for entire proteomes. We find that including structural information makes our workflow perform unusually well on difficult targets with limited sequence identity to annotated proteins. Importantly, we also observe that in silico structure prediction can now replace experimental structures for the purposes of functional annotation pipelines. The combined structure/function predictions produced by our pipeline offer an unusual richness of information, and we show several use cases where insight from these predictions accurately guided follow-up experiments.

Examination of our predictions on several model proteomes reveals a range of commonly over-represented functionalities among poorly annotated proteins, including transcription factors, kinases/phosphatases, and pathogenicity genes. Our findings provide fundamental new insight into the genetic capacity encoded in proteomes across all domains of life, yield a rich new source of information to seed detailed investigation of the functions of many previously mysterious protein-coding genes, and pave the way for large-scale structure/function annotation for a broader range of proteomes of interest.

  • Linhua Wang, Icahn School of Medicine at Mount Sinai, United States
  • Jeffrey Law, Virginia Tech, United States
  • Shiv Kale, Virginia Tech, United States
  • T. M. Murali, Virginia Tech, United States
  • Gaurav Pandey, Icahn School of Medicine at Mount Sinai, United States

An effective approach to leveraging the complementarity of methods proposed for protein function prediction (PFP) is to assimilate them into heterogeneous ensembles. We have illustrated that such ensembles can provide significant performance gains over individual PFP predictors. However, our previous work has been limited to a few GO terms due to the computational costs of constructing these ensembles. Here, we report the results of large-scale PFP using heterogeneous ensembles.

Specifically, we constructed and evaluated ensembles for 277 GO terms using 12 diverse base classifiers and two types of ensemble methods, namely stacking with 8 different meta-classifiers and Caruana et al.’s Ensemble Selection algorithm (CES). Stacking using Logistic Regression (SLR) was the best-performing stacker, and also performed competitively with CES. SLR generally outperformed the best base classifier, with the median Fmax improvement increasing with GO term size: 0.010 (p = 0.21), 0.027 (p = 1.1 × 10⁻⁷) and 0.033 (p = 1.7 × 10⁻¹⁰) for small (200-500 proteins), medium (500-1000 proteins) and large (over 1000 proteins) terms, respectively. Furthermore, the entire computation took less than 48 hours on a sizeable computing cluster. These results demonstrate that large-scale PFP using heterogeneous ensembles constructed systematically using stacking and CES can be predictive and computationally feasible.
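Fmax is commonly defined as the maximum F-measure obtained over all decision thresholds. The sketch below computes it for a single GO term from predicted scores and binary labels. The abstract does not describe the exact protocol used, so the threshold grid, the function name `fmax`, and the toy inputs are assumptions.

```python
def fmax(scores, labels, thresholds=None):
    """Maximum F1 over decision thresholds for one GO term.

    scores: predicted scores in [0, 1], one per protein.
    labels: 1 if the protein is annotated with the term, else 0.
    """
    if thresholds is None:
        # Assumed grid: 0.00, 0.01, ..., 1.00.
        thresholds = [i / 100 for i in range(101)]
    best = 0.0
    for t in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y == 1)
        if tp == 0:
            # Precision or recall is zero or undefined at this threshold.
            continue
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        best = max(best, 2 * precision * recall / (precision + recall))
    return best


# Hypothetical scores for four proteins, two of which carry the term.
print(round(fmax([0.9, 0.8, 0.4, 0.2], [1, 0, 1, 0]), 3))  # 0.8
```

In the toy example the best threshold keeps the three highest-scoring proteins, giving precision 2/3 and recall 1.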

  • Sergey Nepomnyachiy, Tel Aviv University, Israel
  • Nir Ben-Tal, Tel Aviv University, Israel
  • Rachel Kolodny, University of Haifa, Israel

Reuse – the co-option of segments from unrelated proteins to produce new proteins – underlies protein evolution. Thus, characterizing reuse can offer insights into protein function and evolution. To study reuse patterns, we developed an algorithm that identifies 'themes' – reused segments of similar sequence and structure – from protein alignments. Our algorithm finds themes of varying minimal lengths, ranging from 35 to 200 residues. Using it, we quantify and study reuse in the ECOD database of domains and in the PDB. Indeed, theme reuse is prevalent, and reuse is more extensive when shorter themes are included. Structural domains, which are autonomously folded protein parts and the best-characterized form of reuse in proteins, are just one of many complex and intertwined evolutionary traces. Others include long themes shared among a few proteins, which encompass and overlap with shorter themes that recur in more proteins. The observed complexity is consistent with evolution by duplication and divergence, suggesting that some of the themes might include descendants of ancestral segments. The observed recursive footprints, where the same amino acid can simultaneously participate in several intertwined themes, have interesting ramifications for characterizing evolution and predicting protein function.

  • Magdalena Antczak, University of Kent, United Kingdom
  • Mark Wass, University of Kent, United Kingdom

Nearly 20 years after the first human genome sequence was published, our knowledge and understanding of gene and protein functions remain limited. This is exemplified by the recent identification of the minimal bacterial genome, which revealed that one third (149 of 438) of the proteins in this genome were of unknown function. These genes perform essential roles, yet we have no idea what functions they perform.
We performed an extensive in silico analysis to expand our understanding of the minimal genome. Overall, our analysis inferred more informative functions for 59 of the 149 proteins of unknown function. The inferred functions cover multiple areas including protein synthesis, cell division and transport. Our results suggest that >50% of the minimal genome is required for the fundamental life processes of preserving and expressing genetic information. Interestingly, we identified many transmembrane proteins in the set of uncharacterised proteins and predict that >70% of these have transporter functions. Our analysis provides insight into the functions of proteins in the minimal bacterial genome, which will now be of interest for experimental characterisation. Further, it highlights the ability of computational approaches to expand our knowledge and understanding of protein function.

  • Nirvana Nursimulu, University of Toronto, Canada
  • Leon Xu, University of Toronto, Canada
  • James Wasmuth, University of Calgary, Canada
  • Ivan Krukov, University of Calgary, Canada
  • John Parkinson, Hospital for Sick Children, Canada

Metabolic modelling is an effective way to understand factors affecting organisms’ growth. Ultimately, such models are key for purposes such as metabolic engineering and drug design. However, sequence similarity searches, which are typically used to annotate enzymatic function for these models, produce false positive enzyme predictions and fail to consider sequence diversity within enzyme classes. Therefore, various methods have been developed that look beyond sequence similarity to elements such as domain and catalytic site presence. Here, we start by presenting DETECT (Density Estimation Tool for Enzyme ClassificaTion). In DETECT, the sequence diversity within each enzyme class is captured through density profiles. DETECT then calculates likelihood scores for a query sequence given its matches to sequences of different enzyme classes. The use of enzyme-specific score cutoffs calculated from cross-validation gives DETECT higher precision and recall than existing methods. Nevertheless, different methods remain better suited to predicting some enzyme classes than others. Thus, in a second part, we present an integrative approach for enzyme annotation, where enzyme-specific rules are used for combining the predictions of different tools. Overall, we propose methods for creating high-confidence metabolic models to drive biological discovery.

  • Kokulapalan Wimalanathan, Iowa State University, United States
  • Iddo Friedberg, Iowa State University, United States
  • Carson Andorf, USDA-ARS, United States
  • Carolyn Lawrence-Dill, Iowa State University, United States

Maize is both a crop species and a model for genetics and genomics research. Maize GO annotations from Gramene and Phytozome are widely used to derive hypotheses for crop improvement and basic science. The maize-GAMER project is an effort to assess existing maize GO annotations and to improve the quality and quantity of annotations. We designed and implemented a plant-specific reproducible meta-annotator (GO-MAP) that uses diverse component methods, including sequence similarity, domain presence, and three CAFA tools (Argot2, FANN-GO, and Pannzer), to predict GO terms for maize genes and combines the predicted annotations into an aggregate dataset. Annotations from Gramene, Phytozome, and maize-GAMER were assessed and compared. Compared to Gramene and Phytozome, the maize-GAMER dataset annotates more genes and assigns more GO terms per gene. The quality of annotations was evaluated using an independent gold-standard dataset (2,002 GO annotations for 1,619 genes) from MaizeGDB. In the CC category, maize-GAMER was the top performer, but it ranked slightly behind Gramene in both the MF and BP categories. The maize-GAMER GO annotations have been released publicly, and the containerized GO-MAP tool will soon be released to facilitate annotation of other plant proteomes.

CAZymes (carbohydrate-active enzymes) are among the most important enzymes for the bioenergy and agricultural industries. CAZymes are also important for human health, because microbes living in the human gut encode the highest percentage of CAZymes to degrade various dietary and host carbohydrates, and changing the dietary carbohydrates will impact the gut microbiota structure and further influence human health. We have built an online database, dbCAN-seq (http://cys.bios.niu.edu/dbCAN_seq), to provide pre-computed CAZyme sequence and annotation data for 5,349 bacterial genomes. Compared to other CAZyme resources, dbCAN-seq has the following new features: (i) a convenient download page to allow batch download of all the sequence and annotation data; (ii) an annotation page for every CAZyme to provide the most comprehensive annotation data; (iii) a metadata page to organize the bacterial genomes according to species metadata such as disease, habitat, oxygen requirement, temperature and metabolism; (iv) a very fast tool to identify physically linked CAZyme gene clusters (CGCs); and (v) a powerful search function to allow fast and efficient data query. With these unique utilities, dbCAN-seq will become a valuable web resource for CAZyme research, with a focus complementary to dbCAN (automated CAZyme annotation server) and CAZy (CAZyme family classification and reference database).

  • Tunca Dogan, EMBL-EBI, CanSyL, METU, United Kingdom
  • Ahmet Süreyya Rifaioğlu, Middle East Technical University, Turkey
  • Rabie Saidi, EMBL-EBI, United Kingdom
  • Maria Martin, EMBL-EBI, United Kingdom
  • Volkan Atalay, Middle East Technical University, Turkey
  • Rengul Atalay, METU, Turkey

Functional annotation of biomolecules in the gene and protein databases is mostly incomplete. This is especially true for multi-domain proteins. There is a grey area in the protein function data resources, where truly negative functions reside together with functions that the protein possesses but that have not yet been discovered or documented (i.e. false negatives). In many cases the information about the functions absent from the target biomolecule can be as important as the assigned functions. It is possible to resolve a portion of this grey area by predicting the functions that the target proteins most probably do not possess. In this study, we present an approach to produce negative functional annotations for protein sequences, along with regular positive associations. Using this approach, we have developed an automated function prediction tool, "UniGOPred". The negative prediction performance (recall) was measured as 0.82 for both MF and BP, and 0.66 for CC GO terms (with prediction scores ≤ 0.3), in cross-validation. To the best of our knowledge, the ability of a protein function prediction method to predict negative functions using sequence features is investigated here for the first time. UniGOPred is available as an open access tool at http://cansyl.metu.edu.tr/UniGOPred.html.

  • Morteza Pourreza Shahri, Montana State University, United States
  • Madhusudan Srinivasan, Montana State University, United States
  • Upulee Kanewala, Montana State University, United States
  • Indika Kahanda, Montana State University, United States

The Critical Assessment of protein Function Annotation algorithms (CAFA) is a large-scale experiment for assessing computational models for automated function prediction (AFP). The models presented in CAFA have shown excellent promise in terms of prediction accuracy, but quality assurance has received relatively little attention. The main challenge associated with conducting systematic testing on AFP software is the lack of a test oracle, which determines whether a test case passes or fails; unfortunately, the exact expected outcomes are not well defined for the AFP task. Metamorphic testing (MT) is a technique used to test programs that face the oracle problem by defining metamorphic relations (MRs). An MR determines whether a test has passed or failed by specifying how the output should change according to a specific change made to the input. In this work, we use MT to test five web-based CAFA2 AFP tools by defining a set of MRs that apply input transformations at the protein level. According to this initial testing, we observe MR violations. Currently, we are working on developing domain-specific MRs based on sequence modifications. In the future, we plan to develop a comprehensive MT tool that is readily available for the AFP community.
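A metamorphic relation can be expressed as a check that needs no ground-truth labels. The sketch below is illustrative only: the two relations (input order invariance and consistency for a duplicated protein), the toy predictors, and all names are hypothetical and are not the MRs or tools evaluated in the abstract.

```python
def toy_predictor(proteins):
    """Hypothetical stand-in for an AFP tool.

    Maps {protein_id: sequence} to {protein_id: set of predicted labels}.
    """
    out = {}
    for pid, seq in proteins.items():
        labels = set()
        if "K" in seq:
            labels.add("label_A")
        if len(seq) > 5:
            labels.add("label_B")
        out[pid] = labels
    return out


def order_sensitive_predictor(proteins):
    """Deliberately faulty tool: output depends on the input order."""
    out = {pid: set() for pid in proteins}
    first = next(iter(proteins))
    out[first].add("label_A")
    return out


def mr_order_invariance(predict, proteins):
    # MR: reversing the order of the input proteins must not change
    # the labels predicted for any protein.
    forward = predict(dict(proteins))
    backward = predict(dict(reversed(list(proteins.items()))))
    return forward == backward


def mr_duplicate_consistency(predict, proteins, pid):
    # MR: a copy of a protein submitted under a new identifier must
    # receive the same labels as the original.
    extended = dict(proteins)
    extended[pid + "_copy"] = proteins[pid]
    result = predict(extended)
    return result[pid] == result[pid + "_copy"]


proteins = {"p1": "MKTAYIAK", "p2": "MSE"}

print(mr_order_invariance(toy_predictor, proteins))              # True
print(mr_duplicate_consistency(toy_predictor, proteins, "p1"))   # True
print(mr_order_invariance(order_sensitive_predictor, proteins))  # False
```

The third call reports a violation without anyone knowing the correct labels for either protein, which is the point of metamorphic testing when no oracle exists.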

  • Naihui Zhou, Iowa State University, United States
  • Yuxiang Jiang, Indiana University Bloomington, United States
  • Michael Gerten, Iowa State University, United States
  • Timothy Bergquist, University of Washington, United States
  • Md Nafiz Hamid, Iowa State University, United States
  • Deborah A. Hogan, Geisel School of Medicine at Dartmouth, United States
  • Kimberley A. Lewis, Geisel School of Medicine at Dartmouth, United States
  • Alex W. Crocker, Dartmouth College, United States
  • George Georghiou, EMBL-EBI, United Kingdom
  • Maria Martin, EMBL-EBI, United Kingdom
  • Claire O'Donovan, EMBL-EBI, United Kingdom
  • Sandra Orchard, EMBL-EBI, United Kingdom
  • Sean D. Mooney, University of Washington, United States
  • Casey S. Greene, University of Pennsylvania, United States
  • Predrag Radivojac, Indiana University Bloomington, United States
  • Iddo Friedberg, Iowa State University, United States

The third CAFA challenge (CAFA3) released its prediction targets in September 2016, and preliminary results were announced in July 2017. CAFA3 featured a term-centric track where predictors were asked to associate a large set of genes (the complete genomes of Candida albicans and Pseudomonas aeruginosa) with a limited set of functions. By collaborating with experimental biologists, we were able to use unpublished whole-genome screen results to evaluate these predictions. To specifically address this question, we hosted an additional challenge CAFA 3.14 (CAFA-Pi) that is dedicated to evaluating term-centric predictions. The final CAFA3 results as well as preliminary CAFA-Pi results will be released and discussed, in addition to highlights of the term-centric evaluations and benchmark proteins.

  • Ying Zhang, University of Rhode Island, United States
  • Jon Steffensen, University of Rhode Island, United States
  • Keith Dufault-Thompson, University of Rhode Island, United States

Metabolism forms the basis for understanding cellular processes in all living organisms and is essential in mediating microbial community and host-microbe associations. Despite the broad application of genome-scale models to studying the function and evolution of metabolic networks, a comprehensive understanding of diverse metabolic processes is still lacking due to the great complexity and variability of metabolic interactions among different species. To enable the annotation and visualization of complex metabolic networks beyond the scope of existing metabolic pathway databases, we have developed a new algorithm, FindPrimaryPairs, for automatically predicting the element-transferring reactant/product pairs and hence tracing the primary connections of metabolites in metabolic networks. The algorithm has been applied to enable the visualization of metabolic pathways. In the presentation, we will demonstrate new applications of our approach to annotating host-microbe metabolic collaborations and discuss the further integration of protein structural and functional information into the study of the evolution of metabolic interactions among different species.

  • Vladimir Gligorijevic, Flatiron Institute, United States
  • Meet Barot, Flatiron Institute, United States
  • Da Chen Emily Koo, New York University, United States
  • Richard Bonneau, New York University, United States

The prevalence of high-throughput experimental methods has resulted in an abundance of large-scale molecular and functional interaction networks. The connectivity of these networks provides a rich source of information for inferring functional annotations for genes and proteins. An important challenge has been to develop methods for combining these heterogeneous networks to extract useful protein feature representations for function prediction. Most of the existing approaches for network integration use shallow models that cannot capture complex and highly nonlinear network structures. Thus, we propose deepNF, a network fusion method based on Multimodal Deep Autoencoders to extract high-level features of proteins from multiple heterogeneous interaction networks. We apply deepNF to 6 STRING networks to construct a compact low-dimensional representation containing high-level protein features. We present an extensive performance analysis comparing our method with state-of-the-art network integration methods such as GeneMANIA and Mashup. In addition to cross-validation, the analysis also includes a temporal holdout validation similar to the evaluation used in CAFA. Our method outperforms previous methods for both human and yeast STRING networks. Features learned by our method lead to substantial improvements in protein function prediction accuracy, which could enable novel protein function discoveries.

  • Yannick Mahlich, Technical University of Munich, Germany
  • Martin Steinegger, Max-Planck-Institute, Republic of Korea
  • Burkhard Rost, Technical University of Munich, Germany
  • Yana Bromberg, Rutgers University, United States

Motivation: The rapid drop in sequencing costs has produced many more (predicted) protein sequences than can feasibly be functionally annotated with wet-lab experiments. Thus, many computational methods have been developed for this purpose. Most of these methods employ homology-based inference, approximated via sequence alignments, to transfer functional annotations between proteins. The increase in the number of available sequences, however, has drastically increased the search space, thus significantly slowing down alignment methods.
Results: Here we describe HFSP, a novel computational method that uses results of a high-speed alignment algorithm, MMseqs2, to infer functional similarity of proteins on the basis of their alignment length and sequence identity. We show that our method is accurate (83% accuracy) and fast (more than 40-fold speed increase over state-of-the-art). HFSP can help correct at least a 20% error in legacy curations, even for a resource of as high quality as Swiss-Prot. These findings suggest HFSP as an ideal resource for large-scale functional annotation efforts.

  • Rabie Saidi, EMBL-EBI, United Kingdom
  • Maryam Abdollahyan, Queen Mary University of London, United Kingdom
  • James Lee, EMBL-EBI, United Kingdom
  • Tunca Dogan, EMBL-EBI, CanSyL, METU, United Kingdom
  • Ahmet Süreyya Rifaioğlu, Middle East Technical University, Turkey
  • Maria Martin, EMBL-EBI, United Kingdom

Both UniProt automatic and manual pipelines use sets of family and domain signatures to infer functional annotations of proteins. Recently, a number of studies have suggested that the same set of signatures does not necessarily imply the same annotations, and that other factors, such as the order of signatures in the protein sequence, may have an impact on its function. However, this impact has not yet been quantified. In this work, we present an information theory based approach to measure the consistency between signature sets and annotations. We propose a new entropy measure which takes the dynamic nature of the annotation process into account by assigning different weights to the presence and absence of an annotation. The results show a high consistency between signature sets and annotations in UniProt Knowledgebase. Apart from quantifying the annotation consistency, our analysis has a few additional implications. One is detection of signatures having complete annotation consistency which can then be used as seeds for generating new annotation rules. Moreover, to gain a better understanding of the reasons behind inconsistency in some signature sets, we used formal concepts to identify proteins with incomplete annotations and discover potential new subfamilies sharing the same annotations.
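The idea of measuring consistency between signature sets and annotations can be illustrated with a plain Shannon entropy. This is only a minimal sketch: it is not the weighted entropy measure the abstract proposes, and the function names, signature identifiers and toy data below are hypothetical. Proteins are grouped by their exact signature set, and for one annotation the binary entropy of "annotated or not" is computed per group, so 0 means fully consistent and 1 means maximally inconsistent.

```python
from collections import defaultdict
from math import log2


def binary_entropy(p):
    """Shannon entropy (in bits) of a yes/no outcome with probability p."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * log2(p) + (1 - p) * log2(1 - p))


def annotation_entropy(proteins, annotation):
    """Entropy of one annotation within each group of identical signature sets.

    proteins: iterable of (signatures, annotations) pairs, both iterables of str.
    Returns a dict mapping frozenset(signatures) -> entropy in bits.
    """
    groups = defaultdict(list)
    for signatures, annotations in proteins:
        groups[frozenset(signatures)].append(annotation in annotations)
    return {
        sigs: binary_entropy(sum(flags) / len(flags))
        for sigs, flags in groups.items()
    }


# Toy, made-up data: signature and annotation names are placeholders.
toy_proteins = [
    (["SIG_A", "SIG_B"], ["DNA binding"]),
    (["SIG_A", "SIG_B"], ["DNA binding"]),
    (["SIG_C"], ["DNA binding"]),
    (["SIG_C"], []),
]
entropies = annotation_entropy(toy_proteins, "DNA binding")
```

In this toy input the group sharing SIG_A and SIG_B is fully consistent, while the SIG_C group is split evenly and therefore has the maximum entropy of one bit.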

  • Taylor Brooks, Bethune-Cookman University, United States
  • Remi Jones, Bethune-Cookman University, United States
  • Antoinesha Hollman, Jackson State University, United States
  • Raphael Isokpehi, Bethune-Cookman University, United States

Bacteria of the genus Actinomyces are able to grow, reproduce and cause infections in multiple sites of the human body, including sites where the conditions for bacterial growth are unfavorable. Genes encoding the universal stress proteins (USPs) enable bacteria to respond to stress and grow in unfavorable conditions such as limited nutrients and acidic conditions. The goal of the research reported here was to predict the functions of the universal stress proteins encoded in genomes of Actinomyces species. A combination of bioinformatics and visual analytics techniques was used to construct data sets and identify the function, transcription direction and operonic arrangement of genes adjacent to the universal stress proteins of Actinomyces. Gene neighborhood analysis revealed a 4-gene operon that includes a USP gene and is associated with the genome of an oral Actinomyces. The operon had function annotations for a sucrose transporter and an enzyme for the breakdown of sucrose. The presence of double-domain USPs could indicate capacity for biofilm formation. Sugar metabolism is central to the behavior of dental Actinomyces species, which are able to persist in biofilms, produce acid and store glycogen-like molecules. Further studies could evaluate the expression levels of the members of the operon in diverse environmental conditions.

  • Elad Segev, Holon Institute of Technology, Israel
  • Noam Chapnik, Holon Institute of Technology, Israel
  • Roy Yosef, Holon Institute of Technology, Israel
  • Edouard Jurkevitch, The Hebrew University of Jerusalem, Israel
  • Zohar Pasternak, The Hebrew University of Jerusalem, Israel

Some 99.6% of all known proteins have never been tested experimentally, nor has their expression even been observed; predicting their function therefore relies mainly on comparing their sequences to annotated homologs. However, even with new automated tools for high-throughput functional annotation, the function of many proteins remains unknown because they have no annotated homologs. To identify function and discover protein-protein interaction networks, our study aimed to identify proteins that are functionally linked to each other. We analyzed the co-occurrence patterns of 406,000 orthologous and 118,000 homologous proteins from the fully sequenced non-draft genomes of 4,350 bacteria, 166 eukaryotes and 226 archaea. Validation successfully recovered known networks from various pathways, including nitrogen fixation, glycolysis and ribosomal proteins. For example, using the query protein AmoA (a subunit of ammonia monooxygenase), the resulting calculated functional network included AmoB and AmoC, the two other subunits.
This method was found to be practical and efficient both biologically and computationally; it therefore promises to remain efficient even as more and more genomes are sequenced.
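Co-occurrence analysis of this kind can be sketched as a comparison of presence/absence profiles across genomes. The example below is a minimal illustration, not the authors' method: it assumes Jaccard similarity as the co-occurrence score, and the function names, the `min_score` cutoff and the toy genome identifiers are all hypothetical.

```python
def jaccard(profile_a, profile_b):
    """Jaccard similarity between two sets of genome identifiers."""
    union = profile_a | profile_b
    if not union:
        return 0.0
    return len(profile_a & profile_b) / len(union)


def co_occurring_partners(profiles, query, min_score=0.5):
    """Rank proteins whose presence pattern across genomes resembles the query's.

    profiles: dict mapping protein name -> set of genomes in which it occurs.
    Returns (name, score) pairs with score >= min_score, best first.
    """
    query_profile = profiles[query]
    scored = [
        (name, jaccard(query_profile, genomes))
        for name, genomes in profiles.items()
        if name != query
    ]
    kept = [(name, score) for name, score in scored if score >= min_score]
    return sorted(kept, key=lambda pair: (-pair[1], pair[0]))


# Toy, made-up presence data (assumption): protein -> genomes containing it.
toy_profiles = {
    "AmoA": {"g1", "g2", "g3"},
    "AmoB": {"g1", "g2", "g3"},
    "AmoC": {"g1", "g2", "g3", "g4"},
    "RpoB": {"g1", "g2", "g3", "g4", "g5", "g6", "g7", "g8"},
}
partners = co_occurring_partners(toy_profiles, "AmoA")
```

With this toy input, a protein found in nearly every genome (here RpoB) scores low against AmoA, which is why a normalized measure such as Jaccard is a reasonable first choice over raw shared-genome counts.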

  • Jeffrey Law, Virginia Tech, United States
  • Shiv Kale, Virginia Tech, United States
  • T. M. Murali, Virginia Tech, United States

Thousands of bacterial genomes have been sequenced and annotated. A very large fraction of GO functional annotations for bacterial genes are based on sequence similarity and have not been reviewed by any curator. We sought to examine afresh how well we can predict bacterial gene annotations with experimental evidence using network-based methods.

As a proof of concept, we selected 19 clinically-relevant pathogenic bacteria and created a cross-species network based on protein sequence similarity. We integrated this network with species-specific functional association networks for each pathogen from STRING. We hypothesized that the integrated network would have higher predictive power, despite the large network size and sparsity of annotated nodes.

We evaluated the ability of multiple network-based prediction algorithms to predict experimental annotations and non-IEA annotations using five-fold cross-validation. We found that the SinkSource algorithm consistently outperformed (higher F-max values) GeneMANIA, FunctionalFlow, and other BLAST-based methods. While incorporating STRING with the sequence similarity network did not improve F-max values for non-IEA annotations, the integrated network did yield higher F-max values for experimental annotations (median F-max increased from 0.46 to 0.51 for SinkSource across all BP terms). These results demonstrate that integrating multiple types of data improves predictive power for experimental annotations.
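F-max, the metric reported above, is the maximum F1 score obtained over all score thresholds. The following is a minimal per-term sketch under stated assumptions: the function name, the threshold grid of 0.01 to 0.99 and the toy scores are illustrative choices, not taken from the abstract, and the evaluation protocol used by the authors may differ in detail.

```python
def f_max(scores, truth, thresholds=None):
    """Maximum F1 over score thresholds for a single GO term.

    scores: dict mapping protein -> predicted score in [0, 1].
    truth:  set of proteins that truly carry the annotation.
    """
    if thresholds is None:
        # Assumed grid: 0.01, 0.02, ..., 0.99.
        thresholds = [i / 100 for i in range(1, 100)]
    best = 0.0
    for t in thresholds:
        predicted = {p for p, s in scores.items() if s >= t}
        if not predicted:
            continue  # precision is undefined when nothing is predicted
        true_positives = len(predicted & truth)
        precision = true_positives / len(predicted)
        recall = true_positives / len(truth) if truth else 0.0
        if precision + recall > 0:
            f1 = 2 * precision * recall / (precision + recall)
            best = max(best, f1)
    return best


# Toy, made-up scores for four proteins; p1 and p3 are the true positives.
toy_scores = {"p1": 0.9, "p2": 0.8, "p3": 0.4, "p4": 0.1}
toy_fmax = f_max(toy_scores, {"p1", "p3"})
```

Because F-max picks the best threshold per method, it lets methods whose scores sit on different scales be compared without calibrating a shared cutoff.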

  • Seokjun Seo, Seoul National University, South Korea
  • Minsik Oh, Seoul National University, South Korea
  • Youngjune Park, Seoul National University, South Korea
  • Sun Kim, Seoul National University, South Korea

A large number of newly sequenced proteins are generated by next-generation sequencing technologies, and assigning biochemical functions to these proteins is an important task. However, biological experiments are too expensive to characterize such a large number of protein sequences, so protein function prediction is primarily done by computational modeling methods, such as profile Hidden Markov Model (pHMM) and k-mer based methods. Nevertheless, existing methods have limitations: k-mer based methods are not accurate enough to assign protein functions, and pHMM is not fast enough to handle the large number of protein sequences from numerous genome projects. Therefore, a more accurate and faster protein function prediction method is needed.
In this paper, we introduce DeepFam, an alignment-free method that can extract functional information directly from sequences without the need for multiple sequence alignments. In extensive experiments using the Clusters of Orthologous Groups (COGs) and G protein-coupled receptor (GPCR) datasets, DeepFam achieved better accuracy and runtime for predicting protein functions than state-of-the-art methods, both alignment-free and alignment-based. Additionally, we showed that DeepFam can capture conserved regions to model protein families. In fact, DeepFam was able to detect conserved regions documented in the Prosite database while predicting functions of proteins. Our deep learning method will be useful in characterizing the functions of the ever-increasing number of protein sequences.
Code is available at https://bhi-kimlab.github.io/DeepFam.
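For readers unfamiliar with the k-mer based baselines mentioned above, the sketch below shows the general idea: a sequence is represented by counts of its overlapping length-k substrings, and two sequences are compared without any alignment. This is not DeepFam and not any specific published tool; the function names, the choice of k = 3 and the cosine similarity are assumptions made for illustration.

```python
from collections import Counter


def kmer_counts(sequence, k=3):
    """Count overlapping k-mers in a protein sequence."""
    sequence = sequence.upper()
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def cosine_similarity(counts_a, counts_b):
    """Cosine similarity between two k-mer count vectors."""
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[key] * counts_b[key] for key in shared)
    norm_a = sum(v * v for v in counts_a.values()) ** 0.5
    norm_b = sum(v * v for v in counts_b.values()) ** 0.5
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a sequence shorter than k has no k-mers
    return dot / (norm_a * norm_b)


# Toy, made-up peptide fragments.
profile = kmer_counts("MKVMKV")
similarity = cosine_similarity(profile, kmer_counts("MKVMKVA"))
```

Counting k-mers is fast because it needs a single pass over each sequence, but it discards the order in which k-mers occur, which is one reason such methods can lose accuracy.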

  • Amir Karger, Harvard University, United States
  • Victor Luria, Harvard University, United States
  • Anne O'Donnell-Luria, Broad Institute of MIT and Harvard, United States
  • Taran Gujral, Fred Hutchinson Cancer Research Center, United States
  • John Cain, Harvard University, United States
  • Marc Kirschner, Harvard University, United States

How new protein-coding genes and new protein domains appear in evolution are major questions in biology. While new genes are often built by duplicating existing genes, new genes were recently found to arise de novo from genomic DNA. To understand how new genes may arise de novo, we first built a mathematical birth-and-death model based on gene and genome dimensions and dynamic factors such as mutation, recombination and selection. We found that most genomes should contain many new genes, with few being maintained. Second, we identified thousands of candidate de novo genes in 20 eukaryotic genomes, using phylostratigraphy and proteomics, and evaluated their predicted biophysical properties. Compared to ancient proteins, new proteins are shorter, more vulnerable to proteases, disordered, likely to bind other proteins, yet less prone to toxic aggregation. To test structural predictions, we performed biophysical experiments comparing human new proteins to ancient proteins. We found that new genes encode short proteins that have distinct structural features and are expressed in brain and male germline, readily providing an avenue for evolutionary testing of function. The continuous creation and destruction of new genes provides a dynamic reservoir of molecular variation that enables genomic exploratory behavior to find new structures and new functions.


Conclusions

In this analysis, we predicted the genome-wide PPI network of sweet orange using ortholog identification and domain-combination methods, and then employed a highly accurate KNN algorithm to filter the predicted interactions. The resultant PPI network contains 8,195 proteins and 124,491 interactions. We employed GO and Mapman annotation to assess the predicted network. We further predicted 159 protein complexes in sweet orange using orthologs of the yeast protein complexes and employed them to assess CitrusNet. We finally constructed a PPI sub-network related to hormone-signaling proteins, and found that TOR serves as the central hub for hormone crosstalk. CitrusNet provides a valuable resource for protein-protein interactions in sweet orange.


Additional file

Additional file 1:

Supplemental Material. Figure S1. Performance of PFP evaluated on exact GO terms from BP and MF categories. Figure S2. Performance of PFP and ESG evaluated on exact GO terms from all three categories. Figure S3. Performance of PFP using IEA and non-IEA GO terms from BP and MF categories. Figure S4. Performance of PFP using IEA and non-IEA GO terms of all three GO categories. Figure S5. Ranks of CONS and FPM among the benchmarked methods. (DOCX 202 kb)


