Part II

 The GFP Gene and the Flow of Genetic Information

 In this exercise, we will examine the nucleotide sequence that encodes the GFP protein, which is deposited in GenBank, a database of all known DNA sequences. Click on the following URL to access GenBank via the Internet: (http://www.ncbi.nlm.nih.gov/Genbank/GenbankOverview.html).

At the top of the GenBank overview page, click on "Entrez". This will take you to a page which prompts you for your search.

  • In the "Search" drop-down field, select "nucleotides". This tells Entrez that you want the nucleotide sequence of a particular DNA. In the adjacent field, type in "Aequorea victoria green fluorescent protein" and click on "go".
  • This will take you to a page with ten entries. Finding exactly the DNA sequence or gene you are interested in can be challenging if you don't spell out exactly what you want (e.g. typing in "GFP" will call up over 400 entries). Only numbers 9 and 10 on this page of entries are the GFP gene that we are interested in; they represent data contained in the 1992 paper by Prasher et al., as cited near the top of the page you reach when you click on the link labeled M62654 or M62653 (these are called "accession numbers"). The others are mutant genes or synthetic constructs.
  • Number 9 is the complete gene and number 10 is the DNA corresponding to the actual coding sequence. This can be seen by scrolling down to "Features": for number 9, exons, introns, and cds (coding sequence) are given, followed by the complete nucleotide sequence. For number 10, you will notice that only the coding sequence appears.
  • Save both entries.

The first thing to notice is that there are 5170 nucleotides in the gene. In principle, this is enough for 1723 triplet codons, but GFP is known to contain only 238 amino acids. What is all the excess DNA doing? The answer, of course, is that it represents non-coding DNA: introns ("intervening sequences" that separate the protein-encoding "exons" of eukaryotic genes) and regulatory regions.
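The arithmetic behind this comparison can be checked in a few lines of Python. This is only a sketch: the two lengths are the figures quoted above, and the variable names are our own.

```python
# Figures quoted in the text above.
GENE_LENGTH_NT = 5170      # nucleotides in the complete gene entry
PROTEIN_LENGTH_AA = 238    # amino acids in GFP

# How many whole triplets fit in the gene if every base were coding.
max_codons = GENE_LENGTH_NT // 3

# Bases actually needed: three per amino acid, plus one stop codon.
coding_nt = PROTEIN_LENGTH_AA * 3 + 3

# Everything else (introns, regulatory and other untranslated DNA).
noncoding_nt = GENE_LENGTH_NT - coding_nt

print(f"possible codons: {max_codons}")
print(f"coding bases:    {coding_nt}")
print(f"non-coding:      {noncoding_nt} ({noncoding_nt / GENE_LENGTH_NT:.0%})")
```

Only about one base in seven of the deposited gene is needed to spell out the protein.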

We have used proprietary genetic analysis software to create an exercise in which students discover for themselves how a eukaryotic gene is constructed, transcribed and translated. To access this exercise, click here. Students will construct a paper model of the GFP gene by cutting out the nucleotide sequence (and the accompanying amino acid sequence) of the "Map", found at this link, and taping the pieces together to form one long, continuous "gene". The task of the student is to find the exons and highlight the corresponding amino acid sequence of the protein. We encourage having the student cut the nucleotide sequence, along with the corresponding amino acid sequence, into strips and tape the strips together to generate a long linear model of the gene. The exons and introns can be color-coded using different highlight marker pens.

GFP GENE ANALYSIS EXERCISE

The "map" of the GFP gene, which is printed out at the end of these instructions, shows the nucleotide sequence of BOTH strands of double-stranded DNA --- AND the three different amino acid sequences (using the one letter abbreviations) that could be encoded by the top strand of DNA.

Instructions: Construct a paper model of the GFP gene by cutting out the nucleotide sequence (and the accompanying amino acid sequence) of the "Map", found at the back of these instructions, and taping the pieces together to form one long, continuous "gene".

Analyze this gene to find the nucleotide sequence that encodes the amino acid sequence of GFP.

HOW TO ANALYZE THE GFP GENE

The nucleotide sequence of both strands of DNA is shown. Remember that the two strands of DNA in the Watson-Crick structure are anti-parallel. In this representation, the top strand is called the "non-coding" strand and the bottom one the "coding" strand. What this means is that when the gene is expressed, RNA polymerase uses the bottom strand as a template to make an RNA copy that corresponds to the sequence shown in the top strand, but with the T's replaced by U's. (Be aware that many textbooks apply these labels the other way around, calling the strand whose sequence matches the mRNA the "coding" or "sense" strand. What matters here is that the bottom strand is the template.)
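The relationship between the two strands and the RNA copy can be sketched in Python. The short top-strand sequence below is made up for illustration (it is not taken from the GFP map), and the function names are our own.

```python
# Base-pairing rules: DNA with DNA, and DNA template with new RNA.
DNA_PAIRING = str.maketrans("ACGT", "TGCA")
RNA_PAIRING = str.maketrans("ACGT", "UGCA")


def bottom_strand(top):
    """Return the bottom strand as printed under the top strand.

    Because the strands are anti-parallel, this string reads 3'->5'
    when the top strand reads 5'->3'.
    """
    return top.upper().translate(DNA_PAIRING)


def transcribe(template):
    """Build the RNA that pairs base-by-base with the template strand."""
    return template.upper().translate(RNA_PAIRING)


top = "AAGCTATGAGT"            # hypothetical top strand, 5'->3'
bottom = bottom_strand(top)    # template strand, 3'->5'
mrna = transcribe(bottom)      # RNA copy, 5'->3'

print("top   ", top)
print("bottom", bottom)
print("mRNA  ", mrna)
```

The printed mRNA matches the top strand letter for letter except that every T has become U, which is why the exercise can be done by reading the top strand directly.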

Below this are shown three possible amino acid sequences, using the one-letter abbreviation of the amino acids; the "*" refers to one of three possible stop codons. (See below for a list of the codons and the one-letter abbreviations.) There are three possible amino acid sequences because the triplet code can be read in three possible "reading frames". In the "a" reading frame, the first codon is aag, corresponding to the amino acid lysine (K). In the "b" reading frame, the first codon is agc, corresponding to serine (S); and in the "c" reading frame it is gct, corresponding to alanine (A).
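A three-frame translation like the one printed on the map can be reproduced with a small Python sketch. The codon table is the standard genetic code; the example sequence is hypothetical apart from its first five bases (aagct), which are implied by the three first codons named above. The function name and frame labels are our own.

```python
from itertools import product

# Standard genetic code, listed in T, C, A, G order for each codon position.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna, offset):
    """Translate dna starting at a 0-based offset (0, 1 or 2)."""
    dna = dna.upper()
    codons = (dna[i:i + 3] for i in range(offset, len(dna) - 2, 3))
    return "".join(CODON_TABLE[codon] for codon in codons)


sequence = "aagctatgagtaaagg"  # hypothetical; only "aagct" comes from the text

for offset, frame in enumerate("abc"):
    print(frame, translate(sequence, offset))
```

Each frame simply starts one base later than the previous one, so the same DNA yields three unrelated amino acid strings.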

(Stop codons: UAG, UGA, UAA)

Teaching Tips

 The task of the student is to find the 3 exons and highlight the corresponding amino acid sequence of the protein. We encourage having the student cut the nucleotide sequence, along with the corresponding amino acid sequence, into strips and tape the strips together to generate a long linear model of the gene. The exons and introns can be color-coded using different highlight marker pens.

Here are a few questions/directions to guide this inquiry.

1.) Which is the "sense" strand? This nomenclature is unfortunate and confusing. Moreover, from the printout, it is impossible to tell which is the "sense" and which is the "antisense" strand; indeed, it is impossible to tell which is the 5' end and which is the 3' end.

a) To help resolve this, look at the three possible amino acid sequences. The first one (reading frame a) has lysine as its first amino acid. The codons for this amino acid are AA followed by a purine (AAA or AAG). The codon AAG is found in the top DNA strand, which means that this strand must be identical to the mRNA (with T in place of U). This in turn means that the bottom strand is the template for transcription (the "sense" strand in this exercise's usage) and must run 3'→5', since RNA polymerase adds nucleotides in the 5'→3' direction during transcription.

2.) A question some students might ask is: does transcription ever switch strands? The answer is yes, but not within a gene. So you can be confident of finding the actual coding region for the gene by looking at the top (antisense) strand.

3.) Look at the three reading frames. The "*" symbol represents a stop codon. What you are looking for is called an "open reading frame", meaning that at least three criteria must be met: there is a start codon, there is a stop codon, and the codons that make up the protein lie between these two.

a) What is the average distance between stop codons? Since there are 64 possible codons and three are stop codons, we expect a frequency of 3/64, or about one stop codon every 20 - 22 codons (64/3 ≈ 21.3) in a random base sequence. So, examine the sequence and count the number of codons between the "*" symbols. It will come out pretty close to 22.
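The expected spacing, and the counting students do by hand, can be sketched in Python. The calculation assumes all 64 codons are equally likely, which real DNA only approximates; the sample string of amino acid letters is invented, and the function name is our own.

```python
from fractions import Fraction

# Chance that a randomly chosen codon is one of the three stop codons.
p_stop = Fraction(3, 64)

# On average one stop turns up every 1/p codons.
mean_spacing = 1 / p_stop
print(f"expected spacing: {float(mean_spacing):.1f} codons")


def stop_gaps(protein):
    """Count the codons between each pair of consecutive '*' symbols."""
    pieces = protein.split("*")
    # Drop the stretches before the first and after the last stop,
    # because they are not bounded by a stop codon on both sides.
    return [len(piece) for piece in pieces[1:-1]]


frame = "MK*AAAA*GG*L"     # invented translation, for illustration only
print(stop_gaps(frame))
```

Run on one full reading frame from the map, the list returned by `stop_gaps` can be averaged and compared with the expected value.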

4.) Now, with the amino acid sequence of the GFP in hand, find it in the reading frames given in the printout.

a) The first amino acid, for both prokaryotes and eukaryotes, is methionine (in prokaryotes, it is modified by adding a formyl group). Its codon is AUG (the start codon), which means the antisense (top) DNA strand must read ATG. This will be found at position 208. Right next to this codon is AGT; the mRNA sequence would be AGU, which codes for serine, agreeing with the GFP sequence. So it appears that you are on the right track if you use reading frame a. (Look at the other reading frames and note all the stop codons.)

b) And this works, but only for a while. Amino acid 69 is Q (glutamine), corresponding to the codon CAG starting at position number 412. The next symbol in the amino acid sequence is *, meaning a stop (chain-terminating) codon. (The DNA sequence is TAA, the RNA is UAA, one of three terminating codons.) Clearly, if the protein is to continue, another reading frame must be co-opted. There is no easy way to find it except to hunt, knowing the amino acid sequence. After what may seem like an inordinately long time, you should be able to find the next stretch of the GFP amino acid sequence in reading frame b, starting at nucleotide 1008.

c) Continue with this analysis until the entire GFP amino acid sequence has been found, ending with a stop codon.
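The hunt described in step 4 can be sketched in Python: translate all three frames and search each one for a known stretch of the protein. The toy gene below is invented (two short exons around a 14-base intron) and is not the GFP sequence; the function names are our own.

```python
from itertools import product

# Standard genetic code, listed in T, C, A, G order for each codon position.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna, offset):
    """Translate dna starting at a 0-based offset (0, 1 or 2)."""
    dna = dna.upper()
    codons = (dna[i:i + 3] for i in range(offset, len(dna) - 2, 3))
    return "".join(CODON_TABLE[codon] for codon in codons)


def locate(dna, peptide):
    """Return (frame, 1-based nucleotide position) for every match."""
    hits = []
    for offset, frame in enumerate("abc"):
        protein = translate(dna, offset)
        index = protein.find(peptide)
        while index != -1:
            hits.append((frame, 3 * index + offset + 1))
            index = protein.find(peptide, index + 1)
    return hits


# Invented toy gene: exon, intron, exon.
toy_gene = "ATGAGTAAA" + "GTAAGTTGATTCAG" + "GGAGAAGAATAA"

print(locate(toy_gene, "MSK"))   # start of the toy protein
print(locate(toy_gene, "GEE"))   # continuation after the toy intron
```

In this toy case the first stretch turns up in frame a and the continuation in frame c, the same kind of jump between frames that students meet on the GFP map.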

5.) Students will have discovered a fundamental feature of the structure of eukaryote genes: they are not continuous sequences of DNA but are split into regions ("exons") containing the coding information, separated by regions ("introns") that don't contain coding regions.

a) This would be an excellent place to have students speculate why this is so, when the prokaryotic arrangement, with the nucleotide sequence collinear with the amino acid sequence, seems to make so much sense. The bottom line is that no one really knows the answer. In an advanced class, one could begin to explore cutting-edge thinking in the area of bioinformatics, whereby split genes are one possible way to increase diversity without adding tremendously to the amount of DNA an organism has to protect and duplicate.

b) Another area of discussion is the notion of "junk DNA", which includes introns. Is this useless DNA? Or does it have some function? The answer is still unclear since in some cases, there does indeed appear to be useful information. One of the fascinating examples of this is to examine the spectrum of hemoglobinopathies (all the mutations that have been documented in both alpha and beta globin genes), and discover that some of these result from mutations in introns!

c) What can you say about the size of an intron? One thing is that it need not be a multiple of three nucleotides. This poses a fascinating question: how does the cell machinery splice the RNA so precisely when it has so much intervening sequence to read through to find the right place? If it makes a mistake, the resulting protein is gibberish, or even truncated if the faulty splice creates a stop codon.

6.) Have students examine the beginning and end of the intron. Is there a pattern? They should find that it starts with GT and ends with AG. Ask them whether that makes sense…does that mean whenever a GT or an AG is encountered, splicing starts? What about when that occurs within a coding sequence?

a) The point is that both the 5' GT start and 3' AG end are the invariant part (the so-called GT-AG rule) of a "consensus sequence" that looks like this (the number after each nucleotide is the percentage of cases in which that nucleotide has been found in that position):

5' ... A64 G73 G100 T100 A62 A68 G84 T63 ... (can be >10 kb long) ... 12 Py N C65 A100 G100 N ... 3'

b) Another point is that a splice site can split a codon, resulting in a reading frame shift.

c) The pre-mRNA (the unspliced transcript) containing these boundaries is somehow recognized by nuclear particles called "spliceosomes", which contain both protein and RNA. These spliceosomes are the same in all cells. The basis for proper recognition of the correct splice site pairs by spliceosomes is still unknown.

d) HIV is an excellent example of an organism that exploits this split gene architecture with amazing cleverness and efficiency since it can change the splice site depending on what it needs to do: i.e. if it needs to become integrated into the host chromosome, it uses one set of splice sites to generate the required proteins, but if it needs to become an infectious particle again, it uses a different set of splice sites to generate a new set of proteins. This means that the question "what is a gene" becomes ill-defined since it would appear that even the "junk DNA" isn't "junk", but rather the whole viral genome is a complete set of "nested genes".
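The GT-AG check in step 6, and the question of whether every GT or AG marks a splice site, can be explored with a short Python sketch. The toy gene is invented for illustration and is not the GFP sequence; the exon boundaries passed in are assumed to be known already from the reading-frame analysis, and the function name is our own.

```python
def describe_intron(dna, exon1_end, exon2_start):
    """Summarize the intron lying between two exons.

    Positions are 1-based, as on the printed map: exon1_end is the last
    base of the first exon and exon2_start the first base of the second.
    """
    intron = dna[exon1_end:exon2_start - 1].upper()
    return {
        "length": len(intron),
        "starts_with_GT": intron.startswith("GT"),
        "ends_with_AG": intron.endswith("AG"),
        # A non-zero remainder means the downstream exon is read in a
        # different frame of the genomic sequence than the upstream exon.
        "length_mod_3": len(intron) % 3,
    }


# Invented toy gene: exon (9 bases), intron (14 bases), exon (12 bases).
toy_gene = "ATGAGTAAA" + "GTAAGTTGATTCAG" + "GGAGAAGAATAA"

print(describe_intron(toy_gene, exon1_end=9, exon2_start=24))

# GT and AG also occur where no splicing happens, so the two
# dinucleotides alone cannot be the whole signal.
print("GT occurrences:", toy_gene.count("GT"))
print("AG occurrences:", toy_gene.count("AG"))

# Removing the intron joins the exons into one continuous coding sequence.
spliced = toy_gene[:9] + toy_gene[23:]
print(spliced)
```

Even in this 35-base toy gene, GT appears three times and AG six times, yet only one pair bounds the intron. That makes the point that the surrounding consensus sequence, and not the dinucleotides alone, must contribute to recognition.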

To view the DNA Strands you will need Acrobat Reader.
DNA Sequence

Milwaukee School of Engineering