The myriad cells of the human body are all built from the same blueprint: the human genome. Yet these cells differ widely in form and function. At the heart of this diversity lies gene regulation, the process that determines which genes are used where and when. Genes do not function as on/off buttons, but rather like a volume control spanning the range from completely muted to cranked up to maximum. The volume, in this case, is the production rate of proteins. This production is the result of a two-step procedure: i) transcription, in which a small part of the DNA in the genome (a gene) is transcribed into an RNA molecule (an mRNA); and ii) translation, in which the mRNA is translated into a protein. This thesis focuses on the first of these steps, transcription, and specifically on its initiation.
In simplified terms, initiation is preceded by the binding of several proteins, known as transcription factors (TFs), to DNA. This binding takes place mostly near the start of the gene, in a region known as the promoter. The promoter contains sequence patterns scattered in the DNA that the TFs can recognize and bind to. Such binding can prompt the assembly of the pre-initiation complex, which ultimately leads to transcription of the gene. To achieve the regulation necessary to produce the multitude of tissues we observe, there exists a wide range of TFs with different binding preferences that target different genes. By activating different TFs in a context-dependent manner, the organism can produce a customized set of proteins for each cell, resulting in different cell types.
This thesis presents several methods for the analysis and description of promoters. We focus particularly on the binding sites of TFs and on computational methods for locating them. We contribute to the field by compiling a database of TF binding preferences that can be used for site prediction, and by providing tools that help investigators use them. In addition, we developed a de novo motif discovery tool that locates these patterns in DNA sequences. This tool compared favorably to many contemporary methods.
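To illustrate what site prediction from a binding preference can look like, the sketch below scores every window of a DNA sequence against a position weight matrix and reports windows above a threshold. It is a minimal example under stated assumptions, not the method or database of this thesis: the count matrix, the uniform background, the pseudocount, the threshold, and the function names `build_pwm` and `scan` are all hypothetical, and only the forward strand is scanned.

```python
import math

# Hypothetical count matrix for a 4-bp motif (one dict per position).
# The counts are invented for illustration and describe no real TF.
COUNTS = [
    {"A": 1, "C": 1, "G": 0, "T": 8},
    {"A": 8, "C": 0, "G": 1, "T": 1},
    {"A": 1, "C": 0, "G": 1, "T": 8},
    {"A": 8, "C": 1, "G": 1, "T": 0},
]


def build_pwm(counts, pseudocount=1.0, background=0.25):
    """Convert counts to log2-odds scores against a uniform background (an assumption)."""
    pwm = []
    for column in counts:
        total = sum(column.values()) + 4 * pseudocount
        pwm.append({
            base: math.log2(((column[base] + pseudocount) / total) / background)
            for base in "ACGT"
        })
    return pwm


def scan(sequence, pwm, threshold):
    """Return (position, score) for each forward-strand window scoring >= threshold."""
    sequence = sequence.upper()
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        # Skip windows containing ambiguous bases such as N.
        if any(base not in "ACGT" for base in window):
            continue
        score = sum(pwm[i][base] for i, base in enumerate(window))
        if score >= threshold:
            hits.append((start, score))
    return hits


if __name__ == "__main__":
    pwm = build_pwm(COUNTS)
    for position, score in scan("GGTATAGG", pwm, threshold=3.0):
        print(position, round(score, 2))
```

In practice the threshold trades sensitivity against false positives, and a full scan would also score the reverse complement of each window.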
A novel experimental method, cap-analysis of gene expression (CAGE), was recently published, providing an unbiased overview of transcription start site (TSS) usage in a tissue. We paired this method with high-throughput sequencing technology to produce a library of unprecedented depth (DeepCAGE) for the mouse hippocampus. We investigated this library in detail, focusing particularly on what characterizes a hippocampus promoter. By pairing CAGE with TF binding site prediction, we identified a likely key regulator of the hippocampus.
Finally, we developed a method for CAGE exploration. Although the DeepCAGE library characterized a full 1.4 million transcription initiation events, it did not capture the complete TSS-ome of the hippocampus. We fitted two statistical models to the CAGE data and extrapolated how deep the sequencing needs to be to capture most of the events. We concluded that while most genes have been discovered, tag clusters and TSSs are not fully explored.