Retrieving a list of related protein sequences
Many questions in molecular biology (your dissertation topic, your own
research, or your personal interests) require downloading a large collection
of similar protein sequences, all related to the same function, rather than just
one sequence. These biological questions typically include the detection of
conserved functional motifs (segments of sequences that look the same in
proteins with the same function), the simultaneous alignment of multiple
sequences, the assessment of their variability, or phylogenetic studies (how
sequences relate to each other through evolution).
Part I: Getting Started in Bioinformatics
The FASTA (and RAW) format
FASTA is the name of a popular sequence-alignment-and-database-scanning program
created by W.R. Pearson and D.J. Lipman in 1988 (you can use your brand new
PubMed skills to find the original article). The sequences used by FASTA have to
obey the following format:
>My_Sequence_Name
ARCGTCRGCKINTANDRGCKINTAND
CKINTANDARCGTCRGCKINTANDRG
CKINTAND
The line starting with > (the definition line) contains a unique identifier
followed by an optional short definition. The lines that follow it contain the
DNA or protein sequence (in one-letter code) until the next > character in the
file indicates the beginning of a new sequence.
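
The format rules above are simple enough to sketch in a few lines of Python. The following is a minimal illustration only, not the parser used by any particular program; the function name parse_fasta is hypothetical, and the sketch assumes the whole file fits in memory as one string.

```python
def parse_fasta(text):
    """Return a list of (definition_line, sequence) pairs from FASTA text."""
    records = []
    header = None   # definition line of the record being read
    chunks = []     # sequence lines collected for that record
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new definition line closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


example = """>My_Sequence_Name
ARCGTCRGCKINTANDRGCKINTAND
CKINTANDARCGTCRGCKINTANDRG
CKINTAND
"""

records = parse_fasta(example)
for name, sequence in records:
    print(name, len(sequence))
```

Each sequence is rebuilt by joining its lines, so the line breaks inside a record carry no meaning; only the > character marks where one record ends and the next begins.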
Because FASTA is easy to parse, this format has become hugely popular and is now
the default input format for much sequence analysis software, including BLAST
and CLUSTALW. Be aware, though, that programs using FASTA-formatted sequences as
input are sometimes case-sensitive. Here are some pointers:
Always use CAPITAL letters for the one-letter codes.
When using FASTA-formatted sequences on a PC, always use the TEXT option of your preferred word-processing software (that is, skip the formatting and use nothing but ASCII characters).
When displaying these sequences as a word-processing document, use the Courier font for easy alignment.
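
The first two pointers can also be checked automatically. Here is a small sketch, assuming the FASTA text is already loaded as a Python string; normalize_fasta is a hypothetical helper name, not part of any standard tool.

```python
def normalize_fasta(text):
    """Upper-case the sequence lines and reject non-ASCII characters."""
    cleaned = []
    for number, line in enumerate(text.splitlines(), start=1):
        line = line.rstrip()
        if not line.isascii():
            # Typical culprits: curly quotes or long dashes added by a word processor.
            raise ValueError(f"non-ASCII character on line {number}")
        if line.startswith(">"):
            cleaned.append(line)          # definition line: keep as written
        elif line:
            cleaned.append(line.upper())  # sequence line: capital letters only
    return "\n".join(cleaned) + "\n"


fixed = normalize_fasta(">My_Sequence_Name\narcgtcrgckintand\n")
print(fixed, end="")
```

The definition line is left untouched on purpose, since identifiers may legitimately mix upper and lower case.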
Some programs analyzing one sequence at a time work with the RAW format. This is
simply the sequence part of the FASTA format, without the definition line.
Machines can be finicky, though: using the FASTA format when the RAW format is
required may cause an error, or some of the definition line may end up included
in the protein or DNA sequence (!).
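
Turning a FASTA entry into RAW format amounts to dropping the definition line and joining what is left. A minimal sketch follows, with the hypothetical helper fasta_to_raw and the assumption that the input holds a single sequence:

```python
def fasta_to_raw(text):
    """Strip the definition line from a one-sequence FASTA entry."""
    lines = [line.strip() for line in text.splitlines()]
    headers = [line for line in lines if line.startswith(">")]
    if len(headers) > 1:
        # RAW has no way to separate sequences, so refuse multi-record input.
        raise ValueError("RAW format holds a single sequence")
    return "".join(line for line in lines if line and not line.startswith(">"))


raw = fasta_to_raw(">My_Sequence_Name\nARCGTCRG\nCKINTAND\n")
print(raw)
```

Refusing input with several definition lines is a design choice in this sketch: silently concatenating unrelated sequences would be harder to notice than an error.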