A holy grail of biological research is deciphering the workings of a cell—the elementary unit of life. The main building blocks of the cell are macromolecules called proteins; they are key factors in driving cellular processes and determining the structure and function of cells. Proteins do not work in isolation but rather physically interact to form cellular machineries or transmit molecular signals. A modification of a single protein may have dramatic effects on the cell; indeed, many diseases (for example, Huntington’s disease26) are the result of small changes to a single protein and, consequently, to its set of interacting partners and functionality. The mapping of proteins and their interactions and the interpretation of this data are thus a fundamental challenge in modern biology with important applications in disease diagnosis and therapy.15
Key Insights
- The explosion of biological network data necessitates methods to filter, interpret, and organize this data into modules of cellular machinery.
- The comparative analysis of networks from multiple species has proven to be a powerful tool in detecting significant biological patterns that are conserved across species and in enabling their interpretation.
- Comparative network analysis presents hard computational challenges such as graph and subgraph isomorphism and detecting heavy subgraphs; these can tackled to near optimality by a combination of heuristic, parameterized, and integer programming-based approaches.
The last two decades have witnessed a great shift in biological research. While classical research focused on a single gene or subsystem of a specific organism, the emergence of high-throughput technologies for measuring different molecular aspects of the cell has led to a different, systems-level approach. By this approach, genome-wide data is used to build computational models of certain aspects of the cell, thereby generating new biological hypotheses that can be experimentally tested and used to further improve the models in an iterative manner.
A prime example for this technological revolution is the development of techniques for measuring proteinprotein interactions (PPIs). Historically, such interactions were measured at small scale—one or few interactions at a time. The development of automated, large-scale measurement technologies such as the yeast two-hybrid system10 and the co-immunoprecipitation assay1 has enabled the mapping of the entire interactome of a species in a single experiment.
Since the first publication of PPI data in yeast,37 dozens of large-scale assays have been employed to measure PPIs in a variety of organisms including bacteria,25 yeast, worm,20 fly,12 and human.36,27 Protein interaction data is being accumulated and assessed in numerous databases including DIP,28 BioGRID,35 and more. Nevertheless, PPI data remains noisy and incomplete. The reliability of different experimental sources for protein-protein interactions has been estimated to be in the range of 25%60%.8 A recent experimental assessment of PPIs in yeast39 estimated that even in this well-mapped organism, the set of reproducible and highly confident interactions covers only 20% of the yeast’s interaction repertoire.
The low quality of the data has driven the use of cross-species conservation criteria to focus on the more reliable parts of the network and infer likely functional components. The basic paradigm was borrowed from the genomic sequence world, where sequence conservation (across species) often implies that the conserved region is likely to retain a similar biological function.3,24 This evolutionary principle has motivated a series of works that aim at comparing multiple networks to extract conserved functional components at two different levels: the protein level and the subnetwork level. On the protein level, proteins whose network context is conserved across multiple species are likely to share similar functions.34 On the subnetwork level, conserved subnetworks are likely to correspond to true functional components, such as protein complexes, and to have similar function.32 In both cases, biological knowledge in any one of the species can be transferred to the others, allowing the annotation of networks in an efficient and accurate manner.30
In this review, we survey the field of comparative network analysis with an emphasis on the arising computational problems and the different methods that have been used to tackle them, starting from heuristic approaches, going through parameterized algorithms that perform well on practical instances, and ending with optimal integer linear programming (ILP)-based solutions that rely on powerful, yet available, industrial solvers. We demonstrate the applications of these methods to predict protein function and interaction, infer the organization of protein-protein interaction networks into their underlying functional modules, and link biological processes within and across species.
A Roadmap to Network Comparison Techniques
We view a PPI network of a given species as a graph G = (V, E), where V is the set of proteins of the given species and E is the set of pairwise interactions among them. In a network comparison problem, one is given two or more networks along with sequence information for their member proteins. The goal is to identify similarities between the compared networks, which could be either local or global in nature (Figure 1). The underlying assumption is that the networks have evolved from a common ancestral network, and hence, evolutionarily related proteins should display similar sequence and interaction patterns. For ease of presentation, we focus in the description below on pairwise comparisons, but the problems and their solutions generalize to multiple networks.
Most algorithms for network comparison score the similarity of two subnetworks by first computing a many-to-many mapping between their vertices (with possibly some unmatched vertices in either network) and then scoring the similarity of proteins and interactions under this mapping. Proteins are commonly compared by their associated amino-acid sequences, using a sequence comparison tool such as BLAST.3 The similarity score of any two sequences is given as a p-value, denoting the chance of observing such sequence similarity at random. Significant p-values imply closer evolutionary distance and, hence, higher chances of sharing similar functions. Interactions are compared in a variety of ways; the simplest and most common of which is to count the number of conserved interactions. Formally, given a mapping Φ of proteins between two networks (associating proteins of one network with sets of proteins in the other network), an interaction (u, v) in one species is said to be conserved in the other species if there exist u‘ Φ(u) and v‘ Φ(v) such that u‘ and v‘ interact.
Historically, the first considered problem variant was local network alignment (Figure 1a), where the goal is to identify local regions that are similar across the networks being compared. To this end, one defines a scoring function that measures the similarity of a pair of subnetworks, one from each species, in terms of their topology and member proteins. To guide the search for high scoring, or significant matches, the scoring function is often designed to favor a certain class of subnetworks, such as dense subnetworks that serve as a model for protein complexes,13,16,32 or paths that serve as a model for protein pathways.17,18 In the related network querying problem (illustrated in Figure 1b in an astronomical context), a match is sought between a query subnetwork, representing a known functional component of a well-studied species, and a relatively unexplored network of some other organism. The match could be exact (that is, an isomorphic subgraph under some mapping of the proteins between the two species) or inexact, allowing unmatched nodes on either subnetwork. This problem was first studied by Kelley et al.17 in the context of local network alignment; its later development accompanied the growth in the number of mapped organisms.5,7,9,33 The third problem that has been considered is global network alignment (Figure 1c), where one wishes to align whole networks, one against the other.4,34 In its simplest form, the problem calls for identifying a 1-1 mapping between the proteins of two species so as to optimize some conservation criterion, such as the number of conserved interactions between the two networks.
All these problems are NP-hard as they generalize graph and subgraph isomorphism problems. However, heuristic, parameterized, and ILP approaches for solving them have worked remarkably well in practice. Here, we review these approaches and demonstrate their good performance in practice both in terms of solution quality and running time.
Heuristic Approaches
As in other applied fields, many problems in network biology are amenable to heuristic approaches that perform well in practice. Here, we highlight two such methods: a local search heuristic for local network alignment and an eigenvector-based heuristic for global network alignment.
NetworkBLAST32 is an algorithm for local network alignment that aims to identify significant subnetwork matches across two or more networks. It searches for conserved paths and conserved dense clusters of interactions; we focus on the latter in our description. To facilitate the detection of conserved subnetworks, NetworkBLAST first forms a network alignment graph,17,23 in which nodes correspond to pairs of sequence-similar proteins, one from each species, and edges correspond to conserved interactions (see Figure 2). The definition of the latter is flexible and allows, for instance, a direct interaction between the proteins of one species versus an indirect interaction (via a common network neighbor) in the other species. Any subnetwork of the alignment graph naturally corresponds to a pair of potentially matching subnetworks. NetworkBLAST scores such a subnetwork by the density of its corresponding intra-species subnetworks versus the chance that they arise at random, assuming a random model that preserves the node degrees.
After constructing the alignment graph, the algorithm proceeds to identify high-scoring subnetworks. This is done by starting with a seed of at most four nodes, and applying a local search to expand it. Each node serves as the center of a seed, along with at most three of its neighbors. The search iteratively adds or removes a node that contributes most to the score, as long as the score increases (and up to an upper bound of 15 nodes). The effectiveness of this search strategy can be quantified by comparing to an exhaustive search when such is possible. Figure 3(a) presents such a comparison when analyzing a single (yeast) network from Yu et al.,39 searching the best cluster containing each of the network’s proteins. It can be seen that the greedy heuristic produces near-optimal clusters (up to 20% deviation in score) in about 75% of the cases, with an average of merely 13% deviation from the optimal score. Notably, NetworkBLAST requires only a few minutes to run, while the exhaustive (ILP-based) approach took several hours while limiting the solver to five minutes per seed. For seven out of a total of 326 seeds, the solver could not find an optimal solution within the allotted time.
While NetworkBLAST can be used to align multiple networks, the size of the alignment graph grows exponentially with the number k of networks and becomes prohibitive for k = 4. Interestingly, the NetworkBLAST alignment strategy can be mimicked without having to explicitly construct the alignment graph. Instead, Kalaev et al.16 show that one can build a linearsize layered alignment graph where each layer contains the PPI network of a single species and inter-layer edges connect similar proteins. The main observation is that a set of proteins that are sequence similar, one from each species, translates to a size-k subgraph that contains a protein (vertex) from each species and is connected through the sequence similarity edges. Such a subgraph must have a spanning tree, which can be looked for using dynamic programming.
To exemplify the algorithm, consider the implementation of NetworkBLAST’s local search strategy and let us focus on the addition of new k-protein “nodes” (that is, these would have been nodes of the alignment graph) to the growing seed. The latter requires identification of k inter-species proteins that induce a connected graph on the interlayer edges and contribute most to the seed’s weight. As the contribution of each protein to the score can be easily computed and the total contribution is the sum of individual protein contributions, the optimal “node” to be added can be identified in a recursive fashion. That is, the corresponding spanning tree is obtained by merging two neighboring subtrees that span distinct species subsets whose union is the entire species set. This computation takes O(3kl) time in total, where l is the number of inter-layer edges.
For global network alignment, both heuristic and exact (ILP) approaches exist. Here, we highlight one such approach by Singh et al.34 that is based on Google’s PageRank algorithm. The idea is to score pairs of proteins, one from each species, based on their sequence similarity as well as the similarity of their neighbors. A maximum matching algorithm is then used to find a high-scoring 1-1 alignment between the compared networks.
Singh et al. formulate the computation of the pairwise scores as an eigenvalue problem. Denote by R the score vector to be computed (over all pairs of interspecies proteins). Let N(v) denote the set of nodes adjacent to v, and let A be a stochastic matrix over pairs of inter-species proteins, where A(u, v),(u′, v′) = 1/|N(u′)||N(v′)| if and only if {u, u’, v, v’} induce a conserved interaction (that is, (u, u’) and (v, v’) interact). Finally, denote by B a normalized pairwise sequence similarity vector. The goal is to find a score vector R in which the similarity score of a pair of proteins combines the prior sequence information with a weighted average of the similarity scores of their neighbors. Thus,
where α is a parameter balancing the network-based and sequence-based terms. R can be found efficiently using the power method algorithm, which iteratively updates R according to the above equation and converges to the analytical solution R = (I − αA)−1(1 − α)B.
As mentioned earlier, some manifestations of the global alignment problem can be solved to optimality using ILP. For instance, in Klau,19 a 1-1 mapping that maximizes the number of conserved edges between the two networks is sought. Interestingly, while the eigenvector solution is heuristic in nature, it performs as well as an exact solution in terms of the number of conserved edges it reveals and its correspondence to a biological ground truth. Indeed, in a recent paper,22 the performance of the heuristic approach of Singh et al. was compared to that of the exact ILP formulation. The algorithms were used to pairwise align the PPI networks of yeast, worm, and fly. Notably, for all three species pairs, the number of conserved edges in the alignment proposed by the heuristic method was equal to that of the ILP approach. As further shown in Mongioví and Sharan,22 both approaches gave comparable results when their alignments were assessed against a gold standard database of cross-species protein matches (precisely, the HomoloGene database of clusters of orthologous genes29).
Exact Approaches
In contrast to the heuristic methods highlighted here, which do not provide any guarantee of the quality of the obtained solution, exact approaches guarantee optimality at the cost of speed. Two general methodologies for efficient, yet exact, solutions have been common in network analysis: fixed parameter and ILP formulations. Fixed parameter tractable problems can be solved in time that is, typically, exponential in some carefully chosen parameter of the problem and polynomial in the size of the input networks. As we will describe, many variants of the network querying problem are amenable to fixed parameter approaches, as the query subnetworks can be often assumed to have a small, bounded size. The other methodology we demonstrate is based on reformulating the problem at hand as an integer linear program and applying an industrial solver such as CPLEX14 to optimize it. While arriving at a solution in a timely fashion is not guaranteed (as integer programming is NP-hard11), in practice, on current networks, many of these formulations are solved in reasonable time.
We start by describing a parameterized approach—color coding—that has been extensively used in network querying applications. Color coding was originally developed by Alon et al.2 for searching for structured size-k subgraphs, such as simple paths and trees, within a graph. Its complexity is 2O(k) m, where m is the size of the searched graph. Color coding is based on the idea that by randomly assigning k distinct colors to the vertices of the graph, the task of finding a simple subgraph translates to that of finding a colorful subgraph, namely, one spanning k distinct colors. For certain classes of tree-like subgraphs, the subsequent search can be efficiently implemented using dynamic programming. Since a particular subgraph need not be colorful in a specific color assignment, multiple color assignments should be considered to retrieve a desired subgraph with high probability. Precisely, the probability that a graph of size k is colorful is k!/kk > e−k; hence, in expectation, ek iterations of the algorithm suffice to detect the desired subgraph.
In the context of comparative network analysis, color coding was mainly used to tackle network querying problems, where typically the query subnetwork is small (515 proteins), motivating the use of this parameterized approach. One specific example is the QPath33 method for querying paths in a network. QPath extends the basic color coding formulation by querying weighted networks for inexact matches (see the accompanying sidebar “Finding a Hairpin in a Haystack“). The algorithm takes minutes to run with length-7 queries and up to three insertions (unaligned match vertices) and three deletions (unaligned query vertices). Efficient heuristics to color the network can be used to reduce its time even further.9,21 In a follow-up work,9 the QPath approach was extended to handle queries of bounded treewidth and a heuristic solution was offered for general queries.
Our last highlighted method uses an ILP formulation to optimally solve a different variant of the network querying problem, where the topology of the query is not known. This scenario is very common when querying for protein complexes, where the underlying interaction structure is rarely available39 but the member proteins are assumed to be connected. Hence, instead of searching for a particular interaction pattern, the goal is to find a matching subgraph that is connected. In Bruckner et al.,7 an ILP solution to this problem is given. The main challenge in formulating the problem as an ILP is to express the connectivity of the solution sought. The Torque algorithm7 solves this problem by modeling the solution subgraph as a flow network, where one of its nodes is arbitrarily designated as a sink, capable of draining k 1 units of flow, and all the other nodes are set as sources, each generating one unit of flow. A set of constraints requires that the total flow in the system is preserved. The detailed program is given in the accompanying sidebar “Querying via an Integer Linear Program.”
Notably, there is also a parameterized approach to this querying problem. The approach is based on the observation that a connected subgraph can be represented by its spanning tree, so the querying problem translates to that of finding a tree of k distinct vertices. The latter problem can be solved using the color coding technique.6,7 Interestingly, for most instances, the dynamic programming approach is empirically slower than running the ILP formulation through a solver, as demonstrated in Figure 3(b).
The Power of Comparative Network Analysis
The successful application of comparative network analysis approaches depends not only on their computational attributes but also on their biological relevance. Here, we give examples for the applications of several of the reviewed approaches and the biological insights they have enabled. We demonstrate the power of comparative network analysis approaches by comparing their performance with that of methods that are either sequence-based and, thus, cannot exploit the network information, or single-species-based and as a result are more prone to noise in the network data.
The most intuitive use of comparative network analysis is to gather support for computational predictions from multiple species. A prime example for this use is the inference of protein complexes or pathways from PPI data. For instance, in Sharan et al.,31 a cross-species analysis is used to identify yeast-bacterial conserved complexes. By comparing the inferred complexes to known complexes in yeast, it is shown that the comparative analysis increases the specificity of the predictions (compared to a yeast-only analysis) by a significant margin, albeit at the price of reduced sensitivity.
In Sharan et al.,32 the local alignment of three networks (yeast, worm, and fly) was used to identify their conserved protein machineries (Figure 4a); those were used in a systematic manner to infer protein function (Figure 4b) and interaction (Figure 4c), showing superior performance compared to that of a sequence-based analysis. In brief, NetworkBLAST was used to identify high-scoring triplets of matching subnetworks across the three networks. Whenever the proteins in a conserved subnetwork were enriched for a certain function and at least half the proteins in the subnetwork were annotated with that function, the rest of the subnetwork’s proteins were predicted to have that function as well. This prediction strategy yielded 58%63% accuracy across the three species (in a cross-validation test), versus 37%53% accuracy for a sequence-based method that predicts the function of a protein based on its most sequence-similar protein in the other species. The conserved subnetworks were further used to predict novel PPIs in the following manner: a pair of proteins were predicted to interact if two sequence-similar proteins were known to interact in another species (directly or via a common network neighbor) and, additionally, if the four proteins co-occurred in one of the conserved subnetworks. Remarkably, this strategy yielded >99% specificity in cross-validation. Experimental validation of 65 of these predictions gave a success rate in the range of 40%52%. In comparison, sequence-based predictions that do not use the conserved subnetwork information yielded success rates in the range of 16%31%.18,32
The transfer of annotations across species can go beyond single proteins to whole subnetworks. For instance, in Shlomi et al.,33 paths in the yeast network served as queries for the fly network. The resulting matches were annotated with the function of the query (whenever the query was significantly enriched with some functional annotation; see Figure 5a), and the predictions were tested versus the known functional annotations in the fly. Overall, the annotation was accurate in 64% of the cases, compared to 40% for sequence-based annotation.
Network comparison schemes can be used to gain additional insights on protein function, interaction, and evolution. Both local and global network alignments suggest aligned pairs (or, more generally, sets) of proteins as functionally similar. Accordingly, they have been used to identify proteins that have evolved from a common ancestor and retained their function (so called functional orthologs).4, 38 In Bandyopadhyay et al.,4 it is shown that in a majority of the cases, the aligned pairs were not the highest sequence similar ones. In addition, the conserved subnetworks often connect cellular processes that may work together in a coordinated manner (Figure 4b). The evidence is particularly compelling when the link is supported by multiple species.
Conclusion
The explosion of molecular data in the last two decades is revolutionizing biological and, consequently, computational research. The arising computational problems require practical solutions that can cope with an evergrowing scale. In addition to heuristic approaches, it is often of interest to compute exact solutions that can potentially lead to new insights about the biological problem at hand. The combination of parameterized approaches and powerful linear programming solvers has enabled the development of efficient, yet exact, methods to solve some of the key problems in comparative network analysis.
The application of comparative analysis tools to available network data has provided valuable insights on the function and interplay among molecular components in the cell. While much progress has already been made, new computational techniques will need to be developed to cope with the flood of genomic data that is expected to arrive in the coming years. These will span thousands of organisms and diverse molecular aspects. The arising challenges will involve the organization of this data into high-quality networks, data imputation through the integration of multiple information sources, multiple network alignment, and, ultimately, the propagation of curated or experimentally derived annotations through the aligned networks. Hybrid solution approaches that try to combine different techniques depending on the problem instance (see, for example, Bruckner et al.7) may be key to meeting those challenges.
Acknowledgments
We thank Sharon Bruckner for contributing Figure 2b. Atias was partially funded by the Edmond J. Safra Bioinformatics Program. This work was supported by a research grant from the Israel Science Foundation (Grant no. 241/11).
Figures
Figure 1. Computational problems in comparative network analysis.
Figure 2. The NetworkBLAST local network alignment algorithm. Given two input networks, a network alignment graph is constructed. Nodes in this graph correspond to pairs of sequence-similar proteins, one from each species, and edges correspond to conserved interactions. A search algorithm identifies highly similar subnetworks that follow a prespecified interaction pattern. Adapted from Sharan and Ideker.
Figure 3. Performance comparison of computational approaches.
Figure 4. Insights derived from a multiple network alignment.
Join the Discussion (0)
Become a Member or Sign In to Post a Comment