Masking protein sequences
Many proteins contain patches known as low-complexity (or low-entropy)
regions. For instance, these regions can be segments that contain many
prolines or many glutamic acid residues. If BLAST aligns two proline-rich
domains, the alignment gets a very good (that is, very low) E-value because
of the high number of identical amino acids it contains. Unfortunately, there
is a good chance that these two proline-rich domains are not related at all.
In fact, these one-amino-acid-rich domains are notorious for fooling BLAST.
To avoid this problem, BLAST filters out low-complexity regions when
analyzing proteins. To do that, it replaces those regions in the sequence
with Xs. If you’re specifically interested in the low-complexity regions and
don’t want them filtered out of your search, you must deselect the
corresponding Low Complexity check box next to Choose Filter in the Options
for Advanced Blasting section of the blastp search page, as shown in
Figure 7-11.
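To get a feel for what this kind of filtering does, here is a minimal Python
sketch that masks low-entropy windows with Xs. It is not the algorithm BLAST
itself runs; it is a crude sliding-window approximation, and the function
names, the window size of 12, and the entropy cutoff of 2.2 bits are all
assumptions chosen for illustration.

```python
import math
from collections import Counter


def shannon_entropy(window: str) -> float:
    """Shannon entropy (in bits) of the residue composition of a window."""
    counts = Counter(window)
    total = len(window)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def mask_low_complexity(seq: str, window: int = 12, cutoff: float = 2.2) -> str:
    """Replace every residue covered by a low-entropy window with 'X'.

    Illustrative only: window and cutoff are assumed values, and this is
    a simplification, not the filter that BLAST actually applies.
    """
    seq = seq.upper()
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) <= cutoff:
            for i in range(start, start + window):
                masked[i] = "X"
    return "".join(masked)


if __name__ == "__main__":
    # A made-up sequence with a proline run in the middle.
    example = "ACDEFGHIKLMN" + "P" * 12 + "QRSTVWYACDEF"
    print(mask_low_complexity(example))
```

Because every residue in a flagged window is masked, this sketch also swallows
a few residues on either side of the proline run. A real filter trims its
masked segments more carefully.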
Imagine that you have just cloned and sequenced a protein. In the sequence
of this protein, you find a stretch of 10 prolines: PPPPPPPPPP. At this
point, you’re probably wondering whether this is a sequencing error or a
mistake of some kind. You would feel much more confident if you had a valid
protein sequence that contains the same stretch of amino acids.
To find such a protein, write your own query sequence, PPPPPPPPPPPPPPPPPP,
and give it to blastp. Do not forget to deselect the Low Complexity check
box; otherwise, the filter is likely to replace most or all of your query
with Xs.
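If you prefer a command line to the Web form, the same search can be sketched
with the standalone BLAST+ blastp program, where the -seg option plays the
role of the Low Complexity check box. This is a swapped-in alternative to the
Web page described here, not the procedure shown in Figure 7-11. The helper
names, the file names, and the database name swissprot are assumptions; the
script only builds the command and does not run it.

```python
from pathlib import Path
from typing import List


def write_fasta(path: str, name: str, seq: str) -> None:
    """Write a single sequence to a FASTA file."""
    Path(path).write_text(f">{name}\n{seq}\n")


def blastp_command(query: str, db: str, out: str, seg_filter: bool) -> List[str]:
    """Build a BLAST+ blastp command line.

    seg_filter=False adds '-seg no', which turns off low-complexity
    filtering of the query.
    """
    return [
        "blastp",
        "-query", query,
        "-db", db,  # assumed: a protein database already installed locally
        "-out", out,
        "-seg", "yes" if seg_filter else "no",
    ]


if __name__ == "__main__":
    write_fasta("polyP.fasta", "polyP_query", "P" * 18)
    cmd = blastp_command("polyP.fasta", "swissprot", "polyP.out", seg_filter=False)
    print(" ".join(cmd))
    # To actually run it (requires BLAST+ and the database to be installed):
    # import subprocess; subprocess.run(cmd, check=True)
```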
Some domains are very common in protein sequences, such as tandem zinc
fingers or fibronectin domains. If your protein contains this type of
domain, BLAST reports many matches with proteins that contain the same
domains but are otherwise unrelated. To make your search more informative,
filtering out these domains is a good idea. Here’s how:
1. Use CD search, InterProScan, or Pfscan to find domains in your
protein.
See Chapter 6 for more information about using these searches.
2. Read the domain documentation to find out how widespread the
domains you found are.
3. Replace the sequences of less-informative domains with Xs, or rewrite
them in lowercase and then select the Mask Lower check box next to
Choose Filter in the BLAST form. (See Figure 7-11.)
4. Run a standard blastp as shown at the beginning of this chapter.
When your results are returned, interpret them as shown in the
“Understanding your BLAST output” section, earlier in this chapter.
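Step 3 is easy to get wrong by hand on a long sequence, so here is a small
Python sketch of it. It assumes you have already noted the domain boundaries
reported by your domain search as 1-based, inclusive (start, end) positions.
The function name mask_domains and the example sequence and coordinates are
made up for illustration.

```python
from typing import List, Tuple


def mask_domains(seq: str, domains: List[Tuple[int, int]],
                 lowercase: bool = False) -> str:
    """Mask the given domains in a protein sequence.

    domains holds 1-based, inclusive (start, end) positions. Masked
    residues become 'X', or lowercase letters if lowercase=True; all
    other residues are returned in uppercase.
    """
    residues = list(seq.upper())
    for start, end in domains:
        if not 1 <= start <= end <= len(residues):
            raise ValueError(f"domain ({start}, {end}) is outside the sequence")
        for i in range(start - 1, end):
            residues[i] = residues[i].lower() if lowercase else "X"
    return "".join(residues)


if __name__ == "__main__":
    # Hypothetical sequence and domain coordinates, for illustration only.
    protein = "MKTAYIAKQRQISFVKSHFSRQ"
    print(mask_domains(protein, [(5, 10)]))
    print(mask_domains(protein, [(5, 10)], lowercase=True))
```

The lowercase form keeps the original residues visible, which makes it easier
to check afterward that you masked the right stretch.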
Chapter 7: Similarity Searches on Sequence Databases