What is the genetic codon diagram

There is a direct connection between the nucleotide sequences in deoxyribonucleic acid (DNA) and the amino acid sequences in proteins (polypeptides). The amino acid sequence of a protein is accordingly coded by a nucleotide sequence, or to put it the other way round: a nucleotide sequence carries information that determines (instructs) the formation of an amino acid sequence. The relationship between the sequence of nucleotides and amino acids is called the genetic code.

Nucleotide sequences in DNA are sequences of four different nucleotides (with the bases A, T, C, G), whereas amino acid sequences in proteins are sequences of 20 different amino acids. One can now ask: what does a code word (a codon) that encodes an amino acid look like, and how many nucleotides does it contain?

One nucleotide is obviously not enough, because it can specify only four amino acids unambiguously. Nucleotide pairs (AA, AT, AG, etc.) do not give us enough code words either: there are 4^2 = 16, but we need at least 20. What about a triplet? With AAA, AAT, AAG, ... there are 4^3 = 64 possibilities. That is enough, but at the same time it seems to be too many. The situation would be even more confusing if one were to consider quadruplets, with 4^4 = 256 possibilities.
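
The counting argument can be checked with a few lines of Python. This is only an illustrative sketch; the names `code_words` and `AMINO_ACIDS_NEEDED` are made up for this example.

```python
from itertools import product

BASES = "ATCG"            # the four DNA bases named in the text
AMINO_ACIDS_NEEDED = 20   # number of amino acids that must be encoded


def code_words(length):
    """Return every possible code word of the given length."""
    return ["".join(letters) for letters in product(BASES, repeat=length)]


for n in range(1, 5):
    count = len(code_words(n))
    verdict = "enough" if count >= AMINO_ACIDS_NEEDED else "too few"
    print(f"length {n}: {count} code words ({verdict})")
```

Only from length 3 onward does the number of code words reach the 20 that are needed.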

Genetic experiments and physical-chemical measurements ultimately tipped the scales in favor of a triplet code. This also presupposes that all code words have the same length. But how is the code organized? Are all 64 codons required? Is it overlapping or not? Three alternatives are conceivable:

  1. strongly overlapping
  2. slightly overlapping
  3. not overlapping

The question can be decided by a simple consideration. With an overlapping code, an amino acid in a protein would restrict the choice of the amino acids that follow it. With a strongly overlapping code, an amino acid encoded by AAA, for example, could be followed only by one encoded by AAX, which leaves just four options in total. With a slightly overlapping code, an amino acid encoded by AAA could be followed only by those encoded by AXY, so only a limited number of neighbors would be possible (at most 16 of the 20).
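
The restriction on neighbors can be made concrete with a small sketch. It assumes that the reading window simply advances by one, two or three bases for the three alternatives; the function name `successors` is hypothetical.

```python
from itertools import product

BASES = "ATCG"


def successors(codon, shift):
    """Codons that may follow `codon` when the reading window moves by `shift` bases.

    shift=1: strongly overlapping (two bases shared)
    shift=2: slightly overlapping (one base shared)
    shift=3: not overlapping (no base shared)
    """
    shared = codon[shift:]        # bases reused by the next codon
    free = 3 - len(shared)        # positions that can be chosen freely
    return [shared + "".join(rest) for rest in product(BASES, repeat=free)]


for shift in (1, 2, 3):
    print(shift, len(successors("AAA", shift)))
```

For AAA this gives 4, 16 and 64 possible following codons, which is why only a non-overlapping code leaves every amino acid free to follow every other.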

As early as 1957, enough experimentally determined amino acid sequences were available to evaluate neighborhood frequencies. S. BRENNER (Cambridge / England) did this and came to the conclusion that any overlapping code is excluded, because no amino acid in proteins was influenced by the one in front of it in the sequence.

Our language provides an analogous example. In words, with one exception (a q is always followed by a u), every letter can come after every other.

Another problem would be: What is the starting signal for reading the code? Finally: What were the approaches to solving the genetic code?

You can only solve a code if the opponent makes mistakes and if you notice which system these errors are based on. The genetic code is not free from errors of this kind either. We know them as mutations. In the simplest case, they become noticeable because a particular amino acid in an amino acid sequence is replaced by another. If the concept of the genetic code is correct, then one nucleotide should have been replaced by another in the corresponding nucleotide sequence.

Mutation-inducing substances (mutagens) are known which cause very specific, directed substitutions of bases in nucleic acids. These include nitrite ions (nitrous acid), which convert C to U or A to G through a deamination reaction.

The extent of conversion is important for biological experiments. The action of nitrous acid on DNA (or RNA) must not last too long and its concentration must not be too high, because only a small percentage of the C and A residues contained in the nucleic acids may be changed. The mutation-inducing process is of course a statistical process. We know that C or A are affected, but we never know in advance at which position a C and/or A will be changed. In many cases vital information is affected by such a modification (base substitution). The rule therefore applies that every mutagen has a strong inactivating effect and that a large number of mutants can be expected among the few survivors.
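
A minimal sketch of the two substitutions and of their statistical character follows. The function names, the `rate` parameter and the seed are assumptions made for illustration; they do not model real reaction kinetics.

```python
import random

# Base changes caused by nitrite, as stated in the text.
DEAMINATION = {"C": "U", "A": "G"}


def single_deaminations(rna):
    """Return every sequence that differs from `rna` by exactly one C>U or A>G change."""
    variants = []
    for i, base in enumerate(rna):
        if base in DEAMINATION:
            variants.append(rna[:i] + DEAMINATION[base] + rna[i + 1:])
    return variants


def nitrite_treat(rna, rate, seed=0):
    """Convert each C or A independently with probability `rate` (hypothetical model)."""
    rng = random.Random(seed)
    return "".join(
        DEAMINATION[b] if b in DEAMINATION and rng.random() < rate else b
        for b in rna
    )


print(single_deaminations("CCA"))
print(nitrite_treat("CCACAUGCCA", rate=0.1))
```

With a low `rate` most molecules carry no change or a single change, which corresponds to the mild treatment conditions described above.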

Are such base substitutions induced by a mutagen reflected in changes in amino acids (amino acid exchanges) in proteins? For this we need a suitable test object and for this purpose a plant virus, the tobacco mosaic virus (TMV), offered itself. The amino acid sequence of its coat protein had been known since 1959. It consists of a sequence of 158 amino acids (sequence analysis: G. SCHRAMM and colleagues in Tübingen, A. TSUGITA and H. FRAENKEL-CONRAT in Berkeley). H. G. WITTMANN in Tübingen, and A. TSUGITA and H. FRAENKEL-CONRAT in Berkeley, produced a large number of nitrite-induced mutants, isolated individual ones and determined the amino acid sequences of their coat proteins. It turned out that individual amino acids were changed compared to the original strain (the wild type). The results can be summarized as follows:

  1. They provide further evidence that the code does not overlap, because otherwise two (three) neighboring amino acids would have to have been changed in the mutants after changing one nucleotide. Such cases have never been found.

  2. A direction of the exchanges could be determined. The newly added amino acids are encoded by codons (triplets) that are richer in U or G than the original ones.

  3. The various exchanges could be arranged in a certain way, from which it emerged that there must be several codons for individual amino acids. This would give us a partial answer to the question of what happens to the 64 - 20 = 44 "superfluous" codons. They are also needed. One speaks of a "degenerate code" and means that there are several code words for some amino acids (degeneration: here redundancy).

    Approaches to deciphering the genetic code. The illustration contains exchanges that were achieved in the TMV after nitrite treatment (right diagram). The changes in the RNA that are possible through nitrite treatment if UUU = Phe are to result in the end are shown on the left. For more see text (based on H. G. WITTMANN, 1962, 1966)

But how can you decide in which position of a codon a C and at which a U is to be inserted? To do this, we have to look at a completely different approach that ultimately led to the elucidation of the genetic code. One had learned to synthesize nucleic acids from free nucleotides. A. KORNBERG (Stanford University) isolated an enzyme (a DNA polymerase) that could form the complementary strand on a single DNA strand. The single strand of DNA serves as a template. S. OCHOA (Rockefeller University, New York) isolated another enzyme (an RNA polymerase) that synthesized RNA from ribonucleotides without requiring a template. The triphosphate nucleotides offered were randomly polymerized into polynucleotide chains. If only UTP or only CTP were offered, homogeneous sequences UUUUU ... (= PolyU) or CCCC ... (= PolyC) were formed. If two nucleotides were offered at the same time (e.g. UTP and CTP), polymers were created that contained U and C in a random distribution.

What can you do with such synthetic polynucleotides? At first little, but very much if you have a system with which the information stored in them can be read. M. NIRENBERG and H. MATTHAEI (1961, at the National Institutes of Health, Bethesda) developed a cell-free (in vitro) system capable of protein biosynthesis. This requires RNA, ribosomes, a soluble supernatant from a bacterial extract (from Escherichia coli), amino acids, as well as ATP, CTP, GTP etc. Two aspects are important at the moment:

  1. Protein synthesis was detectable under in vitro conditions. The test for this was initially very simple: individual radioactively labeled amino acids were added, and it was checked whether the radioactivity could be precipitated by trichloroacetic acid (TCA) after a short incubation period. It is known that free amino acids cannot be precipitated by adding TCA, while proteins precipitate.

  2. Through the targeted addition of certain genetic information, e.g. through PolyU (UUUUU ...), only the amino acid Phe could be converted into a TCA-precipitable form. This unraveled the first code word: UUU = Phe.

We can now return to the findings on mutants of the tobacco mosaic virus and see that Leu and Ser are encoded by codons containing C (UUC, UCU or CUU) and that the C content of the codons of Pro, Ser and Leu must be even higher (CCU, CUC, UCC or CCC). By comparison with other results obtained in the cell-free system, it was also possible to determine the exact sequence of the nucleotide bases in each codon.

A final answer to outstanding questions and the complete elucidation of the genetic code succeeded after it was possible to test precisely defined (synthetic) polynucleotides with their base composition and sequence determined in the cell-free system outlined above. In 1963 the work was successfully completed.

The table shows the assignment of all 64 codons to the corresponding amino acids. The numbers 1, 2 and 3 refer to the position of the nucleotide in the codon, e.g. 1 = A, 2 = C, 3 = A: ACA = Thr. Three of the codons ("amber", "ochre" and "opal") represent signals for chain termination. Another, AUG, which normally codes for Met, can also mean chain start. In the upper left corner there are hydrophobic amino acids, in the lower right corner hydrophilic ones.
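
For readers without the table at hand, the standard assignment can be rebuilt compactly in Python. The sketch assumes the usual U, C, A, G ordering of the table; the helper name `lookup` is hypothetical.

```python
from itertools import product

BASES = "UCAG"  # conventional row/column order of the codon table
# One-letter amino acid codes for UUU, UUC, UUA, UUG, UCU, ... GGG ("*" = stop).
ONE_LETTER = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
THREE_LETTER = {
    "F": "Phe", "L": "Leu", "S": "Ser", "Y": "Tyr", "C": "Cys", "W": "Trp",
    "P": "Pro", "H": "His", "Q": "Gln", "R": "Arg", "I": "Ile", "M": "Met",
    "T": "Thr", "N": "Asn", "K": "Lys", "V": "Val", "A": "Ala", "D": "Asp",
    "E": "Glu", "G": "Gly", "*": "Stop",
}
CODON_TABLE = {
    "".join(codon): THREE_LETTER[aa]
    for codon, aa in zip(product(BASES, repeat=3), ONE_LETTER)
}
# Historical names of the three termination codons.
STOP_NAMES = {"UAG": "amber", "UAA": "ochre", "UGA": "opal"}


def lookup(first, second, third):
    """Look up a codon by the bases at positions 1, 2 and 3."""
    return CODON_TABLE[first + second + third]


print(lookup("A", "C", "A"))  # the example from the text
```

Calling `lookup("A", "C", "A")` reproduces the example ACA = Thr.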

A number of conclusions can be drawn from the results:

  1. All 64 codons are used. 61 can be assigned to certain amino acids, three serve as a stop signal, one (AUG) alternatively as an amino acid codon or start signal.

  2. The number of codons differs between the individual amino acids: for some, such as Met and Trp, there is only one; for many there are two or four; and for some (Ser, Arg) even six codons. There is a correlation between the frequency of codons and the frequency of the corresponding amino acids in proteins. The only exception is the amino acid Arg, for which there are six code words, but which is underrepresented in proteins in relation to this.

  3. The codons are not randomly assigned to the amino acids. The first two nucleotides of a codon have a higher information value than the third, e.g. GUU, GUC, GUA and GUG all stand for Val. UC-rich codons (triplets) code for hydrophobic, AG-rich for hydrophilic amino acids. The genetic code can therefore be classified as extremely conservative.
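
The first two conclusions, and the lower information value of the third position, can be verified by counting. This is a sketch based on the standard table in one-letter notation; all names are chosen for this example.

```python
from collections import Counter
from itertools import product

BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

# Number of codons per amino acid ("*" marks the stop signals).
codons_per_amino_acid = Counter(TABLE.values())
sense_codons = sum(n for aa, n in codons_per_amino_acid.items() if aa != "*")

print(codons_per_amino_acid)
print("sense codons:", sense_codons)

# Third-position check from the text: GUU, GUC, GUA and GUG all stand for Val.
print({TABLE["GU" + third] for third in BASES})
```

In the standard table Leu also has six codons, alongside Ser and Arg.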

Many (nearly 30%) of the base substitutions do not change the coding properties, e.g.

UUU > UUC: Phe > Phe
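
The share of silent substitutions can be estimated by enumerating every single-base change in every codon. This is a sketch under one particular counting convention (all 64 codons, stop codons included); other conventions give somewhat different percentages, so the result need not match the figure quoted above exactly.

```python
from itertools import product

BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def single_substitutions(codon):
    """Yield the nine codons that differ from `codon` at exactly one position."""
    for pos in range(3):
        for base in BASES:
            if base != codon[pos]:
                yield codon[:pos] + base + codon[pos + 1:]


def count_synonymous():
    """Return (silent substitutions, all substitutions) over the whole table."""
    same = total = 0
    for codon in TABLE:
        for mutant in single_substitutions(codon):
            total += 1
            same += TABLE[mutant] == TABLE[codon]
    return same, total


same, total = count_synonymous()
print(f"{same} of {total} substitutions are silent ({same / total:.1%})")
```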

Even if a base substitution causes an amino acid exchange, the chemical character of the side chain residue is preserved in most cases (conservative exchanges):

UUU > UUG: Phe > Leu

CUC > AUC: Leu > Ile

AAA > AGA: Lys+ > Arg+

AAA > GAA: Lys+ > Glu-

Of course there are exceptions (radical exchanges) such as:

GAG > GUG: Glu- > Val

GAA > GUA: Glu- > Val

The last category usually leads to functionless or poorly functioning proteins. Since they are subject to selection, such mutants have no or only a reduced chance of survival under natural conditions.

In the course of evolution, a genetic code has emerged which guarantees stability and which is designed in such a way that a number of changes never show up in the protein. The number of codons for individual amino acids is not left to chance either. It has already been said that amino acids that occur frequently in proteins are represented by more codons than the rarer ones. The only question is: what is cause and what is effect? Without being able to give a clear answer, at least one further correlation can be cited: there are the fewest codons for those amino acids whose biosynthesis is more complex, that is, for which more energy has to be invested than for the simpler (and therefore more common) ones. Here, too, arginine remains to be mentioned as an exception.

The findings on the tobacco mosaic virus, in comparison with those obtained from microorganisms and later from eukaryotic cells, make it clear that the genetic code is universal, i.e. the assignment of codons to amino acids listed in the table is the same for all organisms (microorganisms, animals, plants). One exception was eventually found: information stored in animal mitochondria is read differently in some cases because of a different reading mechanism:

AUU: instead of Ile: Met

AUA: instead of Ile: Met

UGA: instead of stop: Trp

AGA: instead of Arg: stop

AGG: instead of Arg: stop
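
The deviations listed above can be expressed as a small overlay on the standard table. The reassignments in `MITO_CHANGES` are copied from the list in the text; the dictionary and function names are hypothetical.

```python
from itertools import product

BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

# Reassignments for animal mitochondria as listed in the text ("*" = stop).
MITO_CHANGES = {"AUU": "M", "AUA": "M", "UGA": "W", "AGA": "*", "AGG": "*"}
MITOCHONDRIAL = {**STANDARD, **MITO_CHANGES}


def differences(table_a, table_b):
    """Return {codon: (meaning in a, meaning in b)} for codons that differ."""
    return {
        codon: (table_a[codon], table_b[codon])
        for codon in table_a
        if table_a[codon] != table_b[codon]
    }


for codon, (old, new) in differences(STANDARD, MITOCHONDRIAL).items():
    print(f"{codon}: {old} > {new}")
```

Keeping the deviations in a separate dictionary leaves the standard table untouched and makes further variants easy to add in the same way.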

Deviations were also found in plant mitochondria. Apparently there are even species-specific differences. In Oenothera, UGA (identified as TGA in the DNA) stands for termination and CGG for Trp (W. SCHUSTER, A. BRENNICKE, 1985).


We now have to ask how the genetic information is translated: what is the flow of information? Usually this is a two-step process:

  1. The information stored in the DNA is transcribed into RNA (= transcription). The transcription is not done in one piece; only part of the information is processed in each case.

  2. The (partial) information now contained in RNA is translated into protein in a complex process in which a large number of components participate (= translation, protein biosynthesis).
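
The two steps can be sketched in a few lines. The sketch is a strong simplification: it assumes the input is the coding (sense) strand, so that transcription reduces to replacing T with U, and it ignores ribosomes, tRNAs and start-site selection. The function names are hypothetical.

```python
from itertools import product

BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def transcribe(coding_dna):
    """Step 1: write the coding-strand DNA sequence as RNA (T becomes U)."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna, start=0):
    """Step 2: read codons from `start` until a stop codon or the end is reached."""
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = TABLE[mrna[i:i + 3]]
        if amino_acid == "*":      # chain termination
            break
        protein.append(amino_acid)
    return "".join(protein)


mrna = transcribe("ATGTTTTAA")
print(mrna, translate(mrna))
print(translate("UUUUUUUUU"))  # poly-U yields only Phe
```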

A number of efficient methods for sequencing genes and complete genomes have been developed over the past two decades. The nucleotide sequences are stored in databases (daily update) and are freely accessible there at any time. There are a number of programs with which the data can be processed and compared with one another. Sequence comparisons are possible, as is the assignment to genes or potential genes: open reading frames (ORF) beginning with a start codon and ending with a stop codon.
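
The definition of an open reading frame given here translates directly into a simple scan. This is a minimal sketch, not the method used by any particular program: it searches only the forward strand, uses ATG as the sole start codon, and the name `find_orfs` and the parameter `min_codons` are assumptions.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=2):
    """Return (start, end) index pairs of ORFs on the forward strand.

    Each ORF begins with ATG and ends with the first in-frame stop codon;
    `end` is exclusive, so dna[start:end] includes the stop codon.
    """
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return orfs


sequence = "CCATGAAATAGGG"
for start, end in find_orfs(sequence):
    print(start, end, sequence[start:end])
```

A complete search would also scan the reverse complement, since genes can lie on either strand.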

The central database in Europe is the EMBL Outstation, the European Bioinformatics Institute (EBI) in Hinxton near Cambridge, England. Nucleotide and amino acid sequences are best accessed using the Sequence Retrieval System. The EBI also serves as a mirror site for other large international databases of genetics and molecular biology.


The central database in the USA is GenBank, and in Japan DDBJ (DNA Data Bank of Japan). The Kyoto Encyclopedia of Genes and Genomes (KEGG) provides a good compilation of the data; its databases are drawn on at various points in Botany online. As examples, the complete genomes of the bacterium Escherichia coli and the baker's yeast Saccharomyces cerevisiae are presented here.
Be careful when using the following data: in all cases, these are complex Java scripts (20 MB RAM required) with links to the original URL in Kyoto. The whole Escherichia coli genome and the amino acid sequences of the encoded proteins are also shown in full in a single file (size: 10.9 MB). Details on usage are given in the manual. The names of the genes can be found in the genome catalogs.