292 APPENDIX C: BIOINFORMATICS GLOSSARY
Single-pass sequencing: rapid sequencing of large segments of the genome of
an organism by isolating as many expressed (cDNA) sequences as possible
and performing single sequencer runs on their 5′ or 3′ ends. Single-pass
sequencing typically results in individual, error-prone sequencing reads of
400 to 700 bases, depending on the type of sequencer used. However, if
many of these are generated from numerous clones from different tissues,
they may be overlapped and assembled to remove the errors and generate
a contiguous sequence for the entire expressed gene.
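The overlapping step described above can be illustrated with a minimal Python sketch. It assumes exact suffix-to-prefix matches between two reads; real assemblers tolerate sequencing errors and build a consensus from many reads. The function name `merge_reads`, the `min_overlap` parameter, and the example reads are hypothetical.

```python
def merge_reads(left, right, min_overlap=3):
    """Merge two reads if a suffix of `left` equals a prefix of `right`.

    Returns the merged sequence, or None when no exact overlap of at
    least `min_overlap` bases exists. Exact matching is an assumption
    made to keep the sketch short.
    """
    longest = min(len(left), len(right))
    # Try the longest possible overlap first, then shrink.
    for k in range(longest, min_overlap - 1, -1):
        if left.endswith(right[:k]):
            return left + right[k:]
    return None


# Two hypothetical reads that share the four bases "CTTA".
contig = merge_reads("ATGGCCTTA", "CTTAGGAC")
```

Repeating the merge across many reads extends the contig; positions covered by several reads are where errors can be voted out.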
Site(s): sites in sequences can be located either in DNA (e.g., binding sites,
cleavage sites) or in proteins. To identify a site in DNA, ambiguity symbols
are used to allow several different symbols at one position. Proteins need
a different mechanism, however (see Pattern). Restriction enzyme cleavage
sites, for example, have the following properties: limited length (typically,
fewer than 20 base pairs); definition of the cleavage site and its appearance
(3′ or 5′ overhang, or blunt); definition of the binding site.
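As an illustration of how ambiguity symbols allow several bases at one position, the following Python sketch expands the standard IUPAC nucleotide codes into a regular expression and scans one strand of a DNA string. The function name `find_sites`, the site `GTYRAC`, and the test sequence are chosen for illustration only; the reverse strand is not searched.

```python
import re

# IUPAC nucleotide ambiguity codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]",
    "B": "[CGT]", "D": "[AGT]", "H": "[ACT]", "V": "[ACG]",
    "N": "[ACGT]",
}


def find_sites(site, dna):
    """Return 0-based start positions where the ambiguous site matches."""
    pattern = "".join(IUPAC[symbol] for symbol in site.upper())
    # The lookahead makes overlapping occurrences visible as well.
    return [m.start() for m in re.finditer("(?=" + pattern + ")", dna.upper())]


# Y matches C or T, R matches A or G, so GTCGAC and GTTAAC both qualify.
positions = find_sites("GTYRAC", "AAGTCGACTTGTTAACGG")
```

Here `positions` holds the start of each match on the given strand, so one ambiguous pattern covers several concrete sequences.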
Splicing: the joining together of separate DNA or RNA component parts. For
example, RNA splicing in eukaryotes involves the removal of introns and
the stitching together of the exons from the pre-mRNA transcript before
maturation.
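A minimal Python sketch of the intron-removal idea, assuming the exon coordinates are already known (in the cell they are recognized from sequence signals by the splicing machinery). The function name `splice`, the coordinate convention, and the toy transcript are hypothetical.

```python
def splice(pre_mrna, exons):
    """Join exon segments of a pre-mRNA.

    `exons` is a list of (start, end) pairs, 0-based and end-exclusive.
    Everything outside these intervals is treated as intron and dropped.
    """
    return "".join(pre_mrna[start:end] for start, end in sorted(exons))


# Toy transcript: exon 1, a made-up intron (GU...AG), exon 2.
pre_mrna = "AUGGCC" + "GUAAGUUUUCAG" + "UUUUAA"
mature = splice(pre_mrna, [(0, 6), (18, 24)])
```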
Start codon: a triplet codon (usually AUG) at which both prokaryotic and
eukaryotic ribosomes begin to translate the mRNA.
Stop codon: one of three triplet codons (UGA, UAG, and UAA) that do
not instruct the ribosome to insert a specific amino acid and thereby cause
translation of an mRNA to stop. Instead, a release (termination) factor
typically binds, causing the ribosome to be disassembled and the completed
protein to be released.
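The roles of the start and stop codons can be sketched in Python as follows. The sketch assumes the standard genetic code, begins at the first AUG it finds, and reads triplets until a stop codon; the function name `translate_first_orf` and the example mRNA are hypothetical, and real translation initiation depends on more context than the first AUG.

```python
from itertools import product

# Standard genetic code, codons enumerated in U, C, A, G order; '*' marks stops.
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO)}


def translate_first_orf(mrna):
    """Translate from the first AUG up to (not including) the first stop."""
    start = mrna.find("AUG")
    if start == -1:
        return ""  # no start codon, nothing to translate
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":  # UAA, UAG, or UGA: translation stops
            break
        protein.append(amino_acid)
    return "".join(protein)


peptide = translate_first_orf("GGAUGGCCUUUUAAGG")
```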
Structural gene: a gene that encodes a structural protein.
Structure prediction: algorithms that predict the secondary, tertiary, and
sometimes even quaternary structure of proteins from their sequences.
Determining protein structure from a sequence has been dubbed “the
second half of the genetic code” since it is the higher-level folded structure
of a protein that governs how it functions as a gene product. As yet, most
structure prediction methods have been only partially successful and
typically work best for certain well-defined classes of proteins.
Substitution matrix: a table of scores for replacing one amino acid with
another, derived from a model of protein evolution at the sequence level.
Widely used families are the Dayhoff or MDM (mutation data matrix)
matrices, also called PAM (percent accepted mutation) matrices, and the
BLOSUM matrices. PAM matrices are derived from global alignments of
closely related sequences; matrices for greater evolutionary distances are
extrapolated from those for lesser distances. BLOSUM matrices are instead
derived from conserved local blocks of aligned sequences.
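The extrapolation to greater distances can be illustrated with a small Python sketch in the spirit of the Dayhoff approach, where a one-step mutation probability matrix is multiplied by itself n times. The three-letter alphabet and the values in `TOY_PAM1` are invented for illustration and are not real amino acid data; `mat_mul` and `mat_pow` are hypothetical helper names.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(matrix, power):
    """Raise a square matrix to a non-negative integer power."""
    size = len(matrix)
    result = [[1.0 if i == j else 0.0 for j in range(size)]
              for i in range(size)]  # identity matrix
    for _ in range(power):
        result = mat_mul(result, matrix)
    return result


# Invented one-step mutation probabilities for a 3-letter toy alphabet.
# Each row sums to 1: entry [i][j] is the chance that i is replaced by j.
TOY_PAM1 = [
    [0.990, 0.007, 0.003],
    [0.007, 0.990, 0.003],
    [0.003, 0.003, 0.994],
]

# Extrapolate to a distance of 50 steps.
toy_pam50 = mat_pow(TOY_PAM1, 50)
```

At the larger distance the diagonal entries shrink and the off-diagonal entries grow, reflecting more accumulated change. Scoring matrices are then obtained as log-odds values from such probabilities, a step not shown here.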
Substrate: a specialized type of ligand that binds specifically to an enzyme.
Tertiary structure: folding of a protein chain via interactions of its side
chains, including formation of disulfide bonds between cysteine
residues.