Experimenting with other DNA composition analyses
While you’re at the Pasteur Institute site, use the opportunity to further
explore the list of available EMBOSS modules, such as chips (for codon usage
statistics) or the CpG-rich region finder cpgreport.
Finding internal repeats in your sequence
Another useful type of composition analysis involves locating segments that
occur more than once within your sequence. Such segments are called repeats.
There is no sharp boundary between long words (6-tuples, 8-tuples, and larger)
and repeats. Here’s a common-sense rule: A repeat is a word long enough that
it’s unlikely to occur very often by chance in a random sequence. For
instance, a GTC triplet found 4 times within a 500-nucleotide-long sequence
doesn’t qualify as a repeat: with 64 possible triplets, any given triplet is
expected about 8 times in a random sequence of that length (assuming equal
base frequencies).
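The common-sense rule above can be checked with a little arithmetic. The sketch below is only an illustration: it assumes the four bases are equally likely and independent, and the function name expected_word_count is ours, not part of any tool mentioned in this chapter.

```python
def expected_word_count(seq_length: int, word_length: int) -> float:
    """Expected occurrences of one specific word in a random DNA sequence.

    Assumption: A, C, G, and T are equally likely and independent.
    """
    positions = seq_length - word_length + 1  # possible start positions
    if positions <= 0:
        return 0.0
    return positions / 4 ** word_length


# A triplet such as GTC in a 500-nucleotide sequence: roughly 8 expected hits,
# so finding it 4 times is nothing special.
print(expected_word_count(500, 3))

# A 12-letter word is expected far less than once, so even two copies stand out.
print(expected_word_count(500, 12))
```

Real sequences are rarely this uniform, so treat the numbers as a rough yardstick rather than a statistical test.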
One difference between word counting and repeat analysis is that repeats
can be imperfect. Unlike words, similar repeats don’t need to be identical.
Finally, biologists like to distinguish tandem repeats (similar subsequences
along the same DNA strand) from inverted repeats (similar sequences occurring
on the direct and reverse strands). Biologists are interested in repeats
because they are often involved in genome rearrangements or regulatory
mechanisms of gene expression.
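To make the same-strand versus inverted distinction concrete, here is a minimal sketch that looks for exact words only, so it misses the imperfect repeats mentioned above. The function names and the test sequences are made up for illustration.

```python
from collections import defaultdict

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the sequence read on the opposite strand."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def index_words(seq: str, k: int) -> dict:
    """Map every word of length k to the list of its start positions."""
    positions = defaultdict(list)
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i)
    return positions


def same_strand_repeats(seq: str, k: int) -> dict:
    """Words of length k that occur at least twice on the given strand."""
    return {w: p for w, p in index_words(seq, k).items() if len(p) > 1}


def inverted_repeats(seq: str, k: int) -> list:
    """Pairs (word, starts, reverse complement, starts) found in the sequence."""
    idx = index_words(seq, k)
    pairs = []
    for word, starts in idx.items():
        rc = reverse_complement(word)
        # word < rc reports each pair once and skips self-complementary words
        if word < rc and rc in idx:
            pairs.append((word, starts, rc, idx[rc]))
    return pairs


print(same_strand_repeats("ACGTTTACGT", 4))
print(inverted_repeats("GATTACATTTTTGTAATC", 6))
```

Real repeat finders also tolerate mismatches and gaps, which is where the scoring questions come in.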
There are many different algorithms for finding repeats within a DNA (or
protein) sequence. They all try to identify segments more similar to each
other than would be expected by chance alone. The tricky part is in the
scoring and ranking of the similar segments. Is the exact matching
of five consecutive nucleotides good enough to be considered a repeat? And
is it better than 9 out of 10, or 123 out of 160? Which one do you want
reported first? How far down the list of possible repeats do you want to go?
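One simple way to compare such candidates (not necessarily what any particular program does) is to ask how likely each match would be by chance. The sketch below uses a binomial tail and assumes ungapped alignments, independent positions, and a 1-in-4 chance of a match at each position; the function name is hypothetical.

```python
from math import comb


def chance_of_match(matches: int, length: int, p: float = 0.25) -> float:
    """Probability of at least `matches` identical positions out of `length`
    aligned positions, if each position matches by chance with probability p."""
    return sum(
        comb(length, m) * p ** m * (1 - p) ** (length - m)
        for m in range(matches, length + 1)
    )


# The three candidates from the text, ranked from least to most likely by chance
candidates = [(5, 5), (9, 10), (123, 160)]
for matches, length in sorted(candidates, key=lambda c: chance_of_match(*c)):
    print(f"{matches}/{length}: {chance_of_match(matches, length):.3g}")
```

Under these assumptions, the long imperfect match (123 out of 160) is by far the least likely to arise by chance, and the perfect 5-nucleotide match is the most likely. The calculation ignores how many positions in the sequence were compared, so it is useful for ranking candidates, not for deciding significance.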
Finding repeats is a tricky business
Because there are no universal answers to the questions surrounding the
precise nature of repeats, repeat-finding programs ask you to fix thresholds
related to their scoring algorithms, repeat size, copy number, periodicity
(distance between repeats), and other parameters whose meaning isn’t always
obvious. This makes them difficult to use. The default settings they provide
may or may not work for your particular sequence.
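To see how much the answer depends on such thresholds, consider this toy finder for exact tandem repeats. It is a sketch only: the parameter names min_unit, max_unit, and min_copies are invented for illustration and do not correspond to the options of any real program.

```python
def find_tandem_repeats(seq, min_unit=2, max_unit=6, min_copies=3):
    """Report exact tandem repeats as (start, unit, copies) tuples."""
    hits = []
    i = 0
    while i < len(seq):
        best = None
        for unit_len in range(min_unit, max_unit + 1):
            unit = seq[i:i + unit_len]
            if len(unit) < unit_len:
                break  # not enough sequence left for this unit size
            copies = 1
            # count how many times the unit repeats back to back
            while seq[i + copies * unit_len:i + (copies + 1) * unit_len] == unit:
                copies += 1
            span = copies * unit_len
            if copies >= min_copies and (best is None or span > best[2] * len(best[1])):
                best = (i, unit, copies)
        if best:
            hits.append(best)
            i += best[2] * len(best[1])  # jump past the reported repeat
        else:
            i += 1
    return hits


seq = "TTGACACACACAGGT"
print(find_tandem_repeats(seq))                # default thresholds
print(find_tandem_repeats(seq, min_copies=5))  # stricter copy number
```

With the default settings the AC repeat (4 copies) is reported; requiring 5 copies makes it disappear, even though the sequence hasn’t changed.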
In that respect, our quick survey of Web-based repeat finders — using a
repeat-containing sequence we made just for this purpose — was amazingly