13.2 KEGG
KEGG (Kyoto Encyclopedia of Genes and Genomes; http://www.genome.ad.jp/kegg/)
is a reference knowledgebase offering information about genes and proteins,
biochemical compounds, reactions, and pathways. The data are organized in three
parts: the gene universe (consisting of the GENES, SSDB, and KO databases), the
chemical universe (with the COMPOUND, GLYCAN, REACTION, and ENZYME databases,
which are merged as the LIGAND database), and the protein network consisting of
the PATHWAY database (Kanehisa et al. 2004). In addition, the KEGG database is
hierarchically classified into categories and subcategories at four levels. The
five topmost categories are metabolism, genetic information processing,
environmental information processing, cellular processes, and human diseases.
Subcategories of metabolism are, e.g., carbohydrate, energy, lipid, nucleotide,
or amino acid metabolism. These are subdivided into the different pathways, such
as glycolysis, citrate cycle, purine metabolism, etc. Finally, the fourth level
corresponds to the KO (KEGG Orthology) entries. A KO entry (internally identified
by a K number, e.g., K00001 for alcohol dehydrogenase) corresponds to a group of
orthologous genes that have identical functions.
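The four-level classification can be pictured as a nested mapping. The following is only a minimal sketch: the names `KEGG_HIERARCHY` and `find_ko_paths` are hypothetical, the excerpt covers just a few of the categories named above, and listing K00001 under glycolysis is an illustrative assumption rather than a statement about the actual database contents.

```python
# Toy excerpt of the four-level classification described in the text.
# Level 1: category, level 2: subcategory, level 3: pathway, level 4: KO entries.
# Assumption: K00001 is placed under glycolysis purely for illustration.
KEGG_HIERARCHY = {
    "metabolism": {
        "carbohydrate metabolism": {
            "glycolysis": ["K00001"],
            "citrate cycle": [],
        },
        "nucleotide metabolism": {
            "purine metabolism": [],
        },
    },
}


def find_ko_paths(hierarchy, k_number):
    """Return every (category, subcategory, pathway) triple that lists k_number."""
    paths = []
    for category, subcategories in hierarchy.items():
        for subcategory, pathways in subcategories.items():
            for pathway, ko_entries in pathways.items():
                if k_number in ko_entries:
                    paths.append((category, subcategory, pathway))
    return paths


print(find_ko_paths(KEGG_HIERARCHY, "K00001"))
```

A list of triples is returned rather than a single path because one KO entry may, in principle, take part in more than one pathway.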
The gene universe offers information about genes and proteins generated by
genome sequencing projects. Information about individual genes is stored in the
GENES database, which is semiautomatically generated from submissions to GenBank,
the NCBI RefSeq database, the EMBL database, and other publicly available
organism-specific databases. K numbers are further assigned to entries of the
GENES database. The SSDB database contains information about amino acid sequence
similarities between protein-coding genes, computationally generated from the
GENES database. This comparison is carried out for many complete genomes and
results in a huge graph depicting protein similarities with clusters of
orthologous and paralogous genes.
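To make the idea of a similarity graph with clusters concrete, the sketch below groups genes into connected components of a thresholded similarity graph. This is a deliberately simplified stand-in and not the procedure used to build SSDB: the gene names, scores, threshold, and the function name `similarity_clusters` are all made up for illustration.

```python
# Hypothetical pairwise similarity scores between protein-coding genes.
# Gene names, scores, and the threshold are assumptions for illustration only.
SIMILARITIES = [
    ("geneA", "geneB", 0.92),
    ("geneB", "geneC", 0.85),
    ("geneD", "geneE", 0.78),
    ("geneC", "geneD", 0.10),  # below the threshold, so no edge is created
]


def similarity_clusters(pairs, threshold=0.5):
    """Group genes into connected components of the thresholded similarity graph."""
    graph = {}
    for gene1, gene2, score in pairs:
        # Register both genes as nodes even if the pair is not similar enough.
        graph.setdefault(gene1, set())
        graph.setdefault(gene2, set())
        if score >= threshold:
            graph[gene1].add(gene2)
            graph[gene2].add(gene1)

    clusters, seen = [], set()
    for start in sorted(graph):
        if start in seen:
            continue
        # Depth-first traversal collecting everything reachable from start.
        component, stack = set(), [start]
        while stack:
            gene = stack.pop()
            if gene in component:
                continue
            component.add(gene)
            stack.extend(graph[gene] - component)
        seen |= component
        clusters.append(sorted(component))
    return clusters


print(similarity_clusters(SIMILARITIES))
```

Connected components are the simplest possible notion of a cluster; separating orthologs from paralogs would need additional criteria that this sketch does not attempt.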
The chemical universe offers information about chemical compounds and reactions
relevant to cellular processes. It includes more than 11,000 compounds
(internally represented by C numbers, e.g., C00001 denotes water), a separate
database for carbohydrates (nearly 11,000 entries; represented by a number
preceded by G, e.g., G10481 for cellulose), more than 6000 reactions (with R
numbers, e.g., R00275 for the conversion of the superoxide radical into hydrogen
peroxide), and more than 4000 enzymes (denoted by EC numbers as well as K numbers
for orthologous entries). All these data are merged as the LIGAND database (Goto
et al. 2002). Thus, the chemical universe offers comprehensive information about
metabolites with their respective chemical structures and biochemical reactions.
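The identifier conventions mentioned so far (C, G, R, and K numbers) lend themselves to a small lookup. The sketch below is one possible way to classify such identifiers. The helper name `classify_kegg_id` is hypothetical, and the five-digit width is inferred from the examples in the text rather than taken from a formal specification.

```python
import re

# Prefix letters as given in the text. The five-digit width is an assumption
# inferred from the examples C00001, G10481, R00275, and K00001.
PREFIX_TO_KIND = {
    "C": "compound",
    "G": "glycan",
    "R": "reaction",
    "K": "orthology (KO)",
}

ID_PATTERN = re.compile(r"([CGRK])(\d{5})")


def classify_kegg_id(identifier):
    """Return the entry kind for a KEGG-style identifier, or None if unrecognized."""
    match = ID_PATTERN.fullmatch(identifier)
    if match is None:
        return None
    return PREFIX_TO_KIND[match.group(1)]


for example in ("C00001", "G10481", "R00275", "K00001", "EC 1.1.1.1"):
    print(example, "->", classify_kegg_id(example))
```

EC numbers follow a different, dotted scheme, so this helper returns `None` for them instead of guessing.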
Fig. 13.1 (a) AmiGO, a Web-based GO browser
developed by the GO consortium. It allows
browsing the GO hierarchy and searching for
specific GO terms or gene products in different
databases. The numbers in brackets behind the
GO terms indicate how many gene products
have been annotated to this term in the selected
database. (b) GoFish, a Java applet, can also
connect to several databases and allows the user
to search for gene products using complex Boolean
expressions of GO terms.