Let’s return to the sample program for working with sequence
assemblies that we looked at briefly earlier. First we will look at how we can
use data structure diagrams to describe the data that it will be dealing
with, and then we will examine how we can use process flow diagrams
to show what happens to that data.
Remember that there’s no point in writing code if you don’t know
how to prove it works, and equally there’s no point even reading in
data unless you know what data to expect.
So, let’s make sure we know what data will be involved in (for
example) reading in the details of an assembly of sequences into
contiguous sub-sections, and summarizing the information that we
discover into a report.
Each assembly will be made up of a number of groups of sequences
that have been assembled together. Each sequence is a run of
individual DNA bases, represented by a long string of letters, like
‘ACTTGGTCCAATTGGCACAC’.
These groups are known as contigs, because they are contiguous – each
sequence in a contig contains one or more sections where the
arrangement of its bases matches a section of another sequence.
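To make the idea of a matching section concrete, here is a minimal sketch in Python. It is an illustration only, not how a real assembler works: the function name `longest_shared_section` is hypothetical, and the two reads are simply overlapping slices of the example string above. It finds the longest run of bases that two sequences have in common.

```python
def longest_shared_section(a: str, b: str) -> str:
    """Return the longest run of bases that appears in both sequences."""
    best = ""
    # prev[j] is the length of the matching run ending at a[i-2] and b[j-1]
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the run that ended one base earlier in both sequences
                curr[j] = prev[j - 1] + 1
                if curr[j] > len(best):
                    best = a[i - curr[j]:i]
        prev = curr
    return best


# Two hypothetical reads cut from the example string in the text
read_1 = "ACTTGGTCCAATTGG"
read_2 = "CCAATTGGCACAC"
shared = longest_shared_section(read_1, read_2)
print(shared)
```

Here the shared section sits at the end of the first read and the start of the second, which is the kind of overlap that lets the two sequences be placed in the same contig.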
If you’re not a molecular biologist, think of a painting of some
flowers. It will be made up of lots of small dabs of paint, of various
colours, shades and thicknesses. If we then group a number of these
paint marks together, we may have a leaf, a shadow, or a vase. And if
we put all of these groupings together, we have a masterpiece,
especially if your own child created it.
The sequence data is like the individual paint marks. The contigs –
the groups of sequences that go together – are like the leaf, the shadow
and the vase. And the assembly is what we get when we put all these
parts of the picture together.
Let’s just slightly rephrase our definition of how our data is
structured.
• Each assembly is made up of contigs.
• Each contig is made up of sequences.
• Each sequence is made up of bases.
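The three statements above map naturally onto nested data types. The following is a rough sketch in Python, under the assumption that each level only needs a name and a list of its parts; the class names, field names and the `summarize` function are all hypothetical and are not taken from the sample program.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Sequence:
    name: str
    bases: str  # each sequence is made up of bases, e.g. "ACTTGG"


@dataclass
class Contig:
    name: str
    # each contig is made up of sequences
    sequences: List[Sequence] = field(default_factory=list)


@dataclass
class Assembly:
    name: str
    # each assembly is made up of contigs
    contigs: List[Contig] = field(default_factory=list)


def summarize(assembly: Assembly) -> str:
    """Build a short text report: one line per contig, with simple counts."""
    lines = [f"Assembly {assembly.name}: {len(assembly.contigs)} contig(s)"]
    for contig in assembly.contigs:
        total_bases = sum(len(seq.bases) for seq in contig.sequences)
        lines.append(
            f"  Contig {contig.name}: {len(contig.sequences)} sequence(s), "
            f"{total_bases} base(s)"
        )
    return "\n".join(lines)


# Made-up example data, purely for illustration
demo = Assembly(
    name="demo",
    contigs=[
        Contig(
            name="contig_1",
            sequences=[
                Sequence("seq_a", "ACTTGGTCCAATTGG"),
                Sequence("seq_b", "CCAATTGGCACAC"),
            ],
        )
    ],
)
report = summarize(demo)
print(report)
```

Writing the structure down this way forces the same question the diagrams do: what exactly does each level contain, and what would we expect a report on it to say?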
We can represent this graphically, by using a simple crow’s foot
symbol to represent ‘is made up of’. Take a look at Figure 8.5, which
shows an assembly of contiguous sequences.
This is a very simple representation of the data that we will be dealing
with – we’ve not even shown the bases. In a moment, we’ll see a more