Adaptively inferring human transcriptional subnetworks

Abstract
Although the human genome has been sequenced, progress in understanding gene regulation in humans has been particularly slow. Many computational approaches developed for lower eukaryotes to identify cis‐regulatory elements and their associated target genes often do not generalize to mammals, largely due to the degenerate and interactive nature of such elements. Motivated by the switch‐like behavior of transcriptional responses, we present a systematic approach that allows adaptive determination of active transcriptional subnetworks (cis‐motif combinations, the direct target genes and physiological processes regulated by the corresponding transcription factors) from microarray data in mammals, with accuracy similar to that achieved in lower eukaryotes. Our analysis uncovered several new subnetworks active in human liver and in cell‐cycle regulation, with functional characteristics similar to those of the known ones. We present biochemical evidence for our predictions, and show that the recently discovered G2/M‐specific E2F pathway is wider than previously thought; in particular, E2F directly activates certain mitotic genes involved in hepatocellular carcinomas. Additionally, we demonstrate that this method can predict subnetworks in a condition‐specific manner, as well as regulatory crosstalk across multiple tissues. Our approach allows systematic understanding of how phenotypic complexity is regulated at the transcription level in mammals and offers a marked advantage in systems where little or no prior knowledge of transcriptional regulation is available.

### Synopsis

The importance of achieving an accurate quantitative understanding of gene regulation in humans can hardly be overstated. Deregulation of gene expression is a recurring theme in the development and progression of several diseases, including cancer.
The emergence of new experimental platforms that probe transcription globally promises a comprehensive view of these fundamental biological processes in a large number of mammalian systems for which very little is known about transcriptional regulation. By integrating expression profiles with genomic sequence information computationally, it is now possible to obtain a snapshot of the active transcriptional subnetworks in lower eukaryotes with reasonable accuracy ([Das et al, 2004][1]; [Wang et al, 2005][2]). We define a transcriptional subnetwork as the set of transcription factors (TFs), as represented by the combinations of cognate cis‐regulatory motifs, together with their target genes and the physiological processes they regulate ([Figure 2][3]). The generalization of such approaches to mammals remains challenging, however (see, e.g., Figure 1 in [Tompa et al, 2005][4]). This is due to multiple factors, including enhanced degeneracy of TF binding sites, a significantly elevated role of interactions between TFs in promoter recognition and the multicellular architecture of mammals. Current computational methods, which are primarily clustering‐based, do not adequately address these complicating factors. Moreover, many genes do not cluster tightly enough for their regulatory motifs to be discovered reliably. There is also marked subjectivity in how targets are determined. This work presents a minimally biased approach, motivated by the switch‐like behavior of transcriptional response, that overcomes the aforementioned limitations. It identifies potentially active motif combinations in proximal promoters by examining their correlation with mRNA expression levels across genes. In this approach, both the active motif combinations and their target genes are learnt directly from the expression data in a condition‐specific manner, and thus adaptively.
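The idea of correlating motif combinations with expression across genes can be illustrated with a deliberately simplified sketch. It is not the algorithm used in this work: it replaces the fitted model with a plain Pearson correlation, treats motif presence in a promoter as a binary indicator (a crude stand-in for a switch‐like response), and considers only pairs of motifs. The motif names (`M1` to `M3`), gene names, expression values and function names are all hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def score_motif_pairs(motif_hits, expression):
    """Correlate co-occurrence of each motif pair with expression across all genes.

    motif_hits: motif name -> set of genes whose promoter contains the motif
    expression: gene name -> log expression level in one condition
    """
    genes = sorted(expression)
    y = [expression[g] for g in genes]
    scores = {}
    for a, b in combinations(sorted(motif_hits), 2):
        # Indicator is 1 only when both motifs occur in the same promoter.
        x = [1 if (g in motif_hits[a] and g in motif_hits[b]) else 0 for g in genes]
        scores[(a, b)] = pearson(x, y)
    return scores


def predicted_targets(motif_hits, pair, expression):
    """Genes carrying both motifs of the chosen pair (assumed target definition)."""
    a, b = pair
    return [g for g in sorted(expression) if g in motif_hits[a] and g in motif_hits[b]]


# Hypothetical toy data: six genes, three motifs, one condition.
motif_hits = {
    "M1": {"g1", "g2", "g3"},
    "M2": {"g1", "g2", "g4"},
    "M3": {"g3", "g5"},
}
expression = {"g1": 3.0, "g2": 2.8, "g3": 0.2, "g4": 0.1, "g5": 0.0, "g6": 0.1}

scores = score_motif_pairs(motif_hits, expression)
best = max(scores, key=scores.get)
targets = predicted_targets(motif_hits, best, expression)
print(best, round(scores[best], 3), targets)
```

In this toy example the pair whose co‐occurrence tracks high expression ranks first, and its targets are read off from the same data rather than from a predefined cluster. Repeating the scoring on a different condition's expression values would, in general, yield a different ranking, which is the sense in which such an analysis is condition‐specific.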
We demonstrate that this method can systematically infer transcriptional subnetworks in mammals from expression data with accuracy similar to that obtained for lower eukaryotes. We applied our algorithm to the expression profile of adult human liver measured under a normal condition ([Su et al, 2004][5]) and discovered three functional liver‐specific motif combinations. The inferred model was used to obtain their target genes; a Gene Ontology enrichment analysis of these targets then revealed the over‐represented biological pathways, thus leading to the transcriptional subnetworks active in the profiled sample ([Figure 2][3]). The motif for HNF‐1, the pleiotropic regulator of liver‐specific genes, is among the three liver‐specific combinations. We observe, a posteriori, that >70% of our predicted HNF‐1 targets have been previously validated in biochemical assays. The other two liver‐specific combinations are novel: one regulates sugar metabolism pathways, and the other regulates lipid transport and metabolism. This approach has further advantages. For instance, we are able to identify the mRNA mixing effects present in tissue samples derived from a whole organ such as the liver. Additionally, we notice that several targets achieve their maximum expression in a tissue different from the one in which the motif combination has its maximal regulatory effect. This suggests that genes are coregulated across multiple tissues, as one would expect in a synexpression group ([Niehrs and Pollet, 1999][6]). A distinct advantage of our method is that expression profiles from only a few conditions are necessary to reach this conclusion. TFs regulate genes in a condition‐specific manner. Hence, a particular TF can activate different sets of genes under different conditions ([Zhu et al, 2005a][7]). Application to human cell‐cycle data ([Whitfield et al, 2002][8]) revealed that this technique can model such condition‐specific gene regulation.
Namely, the predicted targets of E2F in the G1/S and G2/M phases are significantly different, as one would expect biologically ([Zhu et al, 2005a][7]). This is a natural outcome of the fact that...