
moter is to increase or repress the transcription from the core promoter (basal tran-
scription). Thus, any given gene will have a specific regulatory region determined by
the binding sites of the transcription factors that ensure that the gene is transcribed
in the appropriate cell type and at the proper point in development. Transcriptional
activation is determined not only by the presence of the binding sites but also
by the availability of the corresponding transcription factors. These transcription
factors are themselves subject to regulation and activation, e.g., through signaling
pathways, and the whole process can entail complex mechanisms such as
transcriptional cascades and feedback control loops (Pedersen et al. 1999).
8.2.2 Sequence-based Prediction of Promoter Elements
This section discusses promoter prediction algorithms that rely solely on the
genome sequence. As described in the previous section, promoters are complex and
diverse, which makes promoter prediction a difficult task. Early reviews of promoter
recognition programs can be found in Fickett and Hatzigeorgiou (1997) and Stormo
(2000); a more recent review of promoter prediction algorithms can be found in
Werner (2003).
The modeling of gene transcription regulation follows its combinatorial nature:
it starts with the detection of individual binding sites (5–25 bp in length), moves
on to the detection of specific combinations of binding sites, so-called composite
regulatory elements (Kel et al. 1995), and ends with the detection of the promoter.
The detection of individual binding sites is the first level in that process. TFBSs
have high sequence variability, which distinguishes them, for example, from
restriction sites, i.e., the recognition sequences of restriction enzymes. Whereas
restriction sites are almost exact, in the sense that sites differing by only a single
mismatch are cut less efficiently by orders of magnitude, transcription factor binding
can tolerate considerable sequence variability in the TFBSs (Stormo 2000). This
variation makes biological sense in that it gives the regulatory system greater
flexibility and allows promoters to have different activity levels.
To accommodate this flexibility, the known TFBSs for a given transcription factor,
which may vary slightly, are often represented by a consensus sequence that is close
to each individual motif according to some criterion. Consensus sequences involve a
tradeoff between the number of mismatches that are allowed and the precision of
the representation, and thus between the sensitivity and the specificity of
the algorithms. A consensus sequence is typically written in the IUPAC code, which
describes ambiguities in nucleotide composition (Fig. 8.1b).
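
The consensus idea described above can be sketched in a few lines of Python. This is only an illustrative sketch, not a published algorithm: the function names, the example sites, and the example sequence are hypothetical, and the criterion used to build the consensus (each position is encoded by the IUPAC code that covers every base observed at that position) is just one possible choice. The `max_mismatches` parameter makes the tradeoff explicit: raising it recovers more true sites but also admits more false hits.

```python
# IUPAC nucleotide codes mapped to the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
# Reverse lookup: set of bases -> IUPAC letter.
BASES_TO_CODE = {frozenset(bases): code for code, bases in IUPAC.items()}


def consensus_from_sites(sites):
    """Build an IUPAC consensus from aligned binding sites of equal length.

    Assumed criterion: each column is encoded by the code that covers
    every base observed in that column.
    """
    if len({len(s) for s in sites}) != 1:
        raise ValueError("sites must be aligned and of equal length")
    return "".join(BASES_TO_CODE[frozenset(col)] for col in zip(*sites))


def count_mismatches(consensus, window):
    """Count positions where the base is not permitted by the IUPAC code."""
    return sum(base not in IUPAC[code] for code, base in zip(consensus, window))


def scan(sequence, consensus, max_mismatches=0):
    """Return (start, matched_substring, mismatches) for every hit."""
    k = len(consensus)
    hits = []
    for i in range(len(sequence) - k + 1):
        window = sequence[i:i + k]
        mm = count_mismatches(consensus, window)
        if mm <= max_mismatches:
            hits.append((i, window, mm))
    return hits


# Hypothetical example data, chosen only for illustration.
known_sites = ["TATAAA", "TATATA", "TATAAG"]
consensus = consensus_from_sites(known_sites)
sequence = "CCTATATAGGTACAAACC"

print(consensus)                                   # TATAWR
print(scan(sequence, consensus))                   # exact IUPAC matches only
print(scan(sequence, consensus, max_mismatches=1)) # tolerant scan, more hits
```

In this toy example the strict scan reports a single hit, whereas allowing one mismatch reports three, which illustrates how loosening the consensus raises sensitivity at the cost of specificity.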
An alternative to consensus sequences is the use of positional weight matrices
(PWMs). A PWM is a matrix representation of a TFBS in which each row represents one
of the bases “A,” “C,” “G,” and “T” and each column represents a position within the
motif (Fig. 8.1b). Each entry in the matrix is a numerical value indicating
the confidence for the specific base at that position. The PWM approach is somewhat
more general than the consensus sequence approach in the sense that each
consensus can be represented by a PWM (for example, through frequency counts
8 Modeling of Gene Expression