Normalization of Gene-Expression Microarray Data
normalization, whose goal is to impose on each array the same
empirical distribution of intensities. The distribution of
within-gene averages is usually used as the target, or reference.
Mathematically, the procedure applies the transformation
$F^{-1}(G_i(y))$, where $G_i$ is the cumulative distribution of
intensities in array $i$, and $F$ is the reference distribution.
The algorithm
itself is very simple: intensities in each array are first ranked
in increasing order. Each quantile value is then replaced by the
corresponding quantile of the reference distribution. Finally,
values are returned to their original order. Because it uses only
the ranks of the observations, the algorithm can handle nonlinear
trends and runs quite fast. Where several replicate measurements
of the same gene are available (e.g., on Illumina and Affymetrix
platforms), the algorithm is usually run before summarization,
thus exploiting more information and possibly yielding a better
estimate of the true underlying distribution of gene intensities.
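As a rough illustration of the three steps just described (rank, substitute, restore the original order), the following Python sketch uses NumPy. The function name `quantile_normalize`, the genes-by-arrays matrix layout, the optional `reference` argument, and the way ties are broken are all assumptions made for this example and are not taken from the text. By default the target is the sorted vector of within-gene averages, following the description above; missing values are assumed to be absent.

```python
import numpy as np


def quantile_normalize(x, reference=None):
    """Quantile-normalize a genes-by-arrays intensity matrix (illustrative sketch).

    x         : 2-D array-like, rows = genes (or probes), columns = arrays.
    reference : optional 1-D array-like with one value per row; its sorted
                values act as the target quantiles. If omitted, the
                within-gene averages are used (an assumption of this sketch).
    """
    x = np.asarray(x, dtype=float)
    n_genes = x.shape[0]

    if reference is None:
        reference = x.mean(axis=1)  # within-gene averages
    ref_sorted = np.sort(np.asarray(reference, dtype=float))
    if ref_sorted.shape[0] != n_genes:
        raise ValueError("reference must have one value per row of x")

    # Step 1: rank the intensities within each array (0-based ranks).
    # Ties are broken by order of appearance, a simplification.
    order = np.argsort(x, axis=0, kind="stable")
    ranks = np.argsort(order, axis=0)

    # Steps 2 and 3: substitute each value by the reference quantile of the
    # same rank; indexing by rank leaves every value in its original row.
    return ref_sorted[ranks]


if __name__ == "__main__":
    # Hypothetical toy data: 4 genes measured on 3 arrays.
    toy = np.array([[5.0, 4.0, 3.0],
                    [2.0, 1.0, 4.0],
                    [3.0, 4.5, 6.0],
                    [4.0, 2.0, 8.0]])
    print(quantile_normalize(toy))
```

After the call, every column contains exactly the same set of values (the reference quantiles), while the within-array ordering of the genes is unchanged. Only ranks enter the computation, which is why the procedure is insensitive to any monotone nonlinear distortion of an array.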
3.5. Housekeeping-Gene Normalization
Most commonly used normalization procedures use the whole set of
genes, under the assumption that the great majority of genes are
fairly invariant across arrays. Nevertheless, this assumption is
often questionable, especially in experiments where a large
variation in expression profiles is expected. To overcome this
problem, the housekeeping-gene approach borrows an idea from
standard laboratory procedures (e.g., Northern blot or
quantitative RT-PCR), where an internal control is used for data
normalization. It assumes that some (not all) genes are similarly
expressed across arrays, so that they can be used as a reference
for the relative expression levels of other genes. For example,
Affymetrix platforms include a set of control probes for
housekeeping genes (e.g., β-actin, GAPDH, and others).
However, there is serious concern about the assumption of
invariant expression of the so-called housekeeping genes, as they
are often affected by various factors that are not controlled in
the experiment. Also, those genes are usually highly expressed and
thus do not represent genes of low intensity. Furthermore, they
usually make up a very small subset of the whole array, so their
intensities are strongly affected by random or systematic errors.
Any normalization based on such a limited number of internal
references would be unreliable. Therefore, normalization based on
housekeeping genes selected a priori is not recommended.
A possible variation of the same framework is to use spiked-in
control spots with genetic material from unrelated species. Again,
several problems arise with such an approach. First, spike-ins are
added to the sample at a different stage of cDNA preparation, so
that intensity levels of spike-ins are subject to less
experimental variation than the naturally expressed transcripts of
comparable abundance. Second, nonspecific hybridization cannot be
excluded, though it might be reduced with careful probe design.
Finally, a