A gene is a stretch of DNA that carries the instructions to build a functional product, usually a protein or an RNA molecule. In humans, the average gene spans about 28,000 base pairs, but only a fraction of that length actually codes for the final product. The rest consists of regulatory sequences, spacers, and signals that control when, where, and how much of that product gets made. Understanding a gene’s structure means understanding how all these parts fit together.
The Basic Layout
Picture a gene as a long ribbon of DNA with distinct zones. Reading from the upstream end to the downstream end, the major structural pieces are: a promoter region that signals where to start reading, a series of segments called exons that are kept in the final message, non-coding spacers called introns wedged between those exons, and a termination signal at the far end. Flanking the protein-coding sequence on either side are untranslated regions that never become protein but play important roles in controlling the message.
Exons are the segments that survive all the way into the final messenger RNA. They include both the protein-coding sequence and the untranslated regions at either end. Introns, by contrast, are transcribed into the initial RNA copy but then snipped out before the message leaves the nucleus. This cutting-and-rejoining process, called splicing, stitches the exons together into a continuous instruction set. Human genes typically contain multiple introns, and those introns often make up the bulk of a gene’s total length.
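As a rough illustration of splicing, the sketch below joins exon intervals and discards everything between them. The sequence, the coordinates, and the function name splice are made up for the example; real splicing is carried out by cellular machinery that recognizes sequence signals, not by a coordinate list.

```python
def splice(pre_mrna: str, exons: list[tuple[int, int]]) -> str:
    """Join exon intervals (0-based, end-exclusive) in order, dropping introns."""
    pieces = []
    previous_end = 0
    for start, end in sorted(exons):
        if start < previous_end:
            raise ValueError("exon intervals must not overlap")
        pieces.append(pre_mrna[start:end])
        previous_end = end
    return "".join(pieces)


# Toy transcript (hypothetical): uppercase = exon, lowercase = intron.
pre_mrna = "AUGGCC" + "guaaguuuucag" + "GAUUAA"
exons = [(0, 6), (18, 24)]

mature = splice(pre_mrna, exons)
intron_fraction = 1 - len(mature) / len(pre_mrna)

print(mature)           # AUGGCCGAUUAA
print(intron_fraction)  # 0.5
```

In this toy case half of the initial transcript is intron. In real human genes the intron share is often far larger.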
The Promoter: Where It All Starts
Upstream of the first exon sits the promoter, a stretch of DNA that tells the cell’s machinery exactly where to begin copying the gene into RNA. In many genes, the promoter contains a short sequence motif called the TATA box, located about 30 base pairs before the transcription start site. The TATA box acts as a landing pad for a protein that helps assemble the copying machinery.
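A simple way to picture this is to scan the DNA just upstream of a known start site for a TATA-like motif. The sketch below is illustrative only. The pattern TATA[AT]A[AT] is a commonly quoted consensus and is an assumption here, not something defined in this article, and the 20 to 40 base pair search window, the function name, and the test sequence are likewise hypothetical.

```python
import re

# Assumed TATA-box consensus (TATAWAW, where W is A or T).
TATA_PATTERN = re.compile(r"TATA[AT]A[AT]")


def find_tata_boxes(sequence: str, tss: int,
                    min_dist: int = 20, max_dist: int = 40) -> list[int]:
    """Return distances (in bp) upstream of the start site where a motif begins.

    `tss` is the 0-based index of the transcription start site.
    """
    hits = []
    for match in TATA_PATTERN.finditer(sequence[:tss]):
        distance = tss - match.start()
        if min_dist <= distance <= max_dist:
            hits.append(distance)
    return hits


# Hypothetical promoter: motif placed 30 bp before the start site.
sequence = "G" * 20 + "TATAAAA" + "C" * 23 + "ACGT"
tss = 50

print(find_tata_boxes(sequence, tss))  # [30]
```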
Not every gene has a TATA box, though. Many housekeeping genes, the ones your cells need running constantly, sit within regions called CpG islands and lack a recognizable TATA box entirely. These promoters tend to start transcription from multiple scattered points rather than one precise spot. Other short motifs in the promoter region fine-tune how efficiently the machinery assembles, but the core job is the same: mark the starting line.
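CpG islands are usually identified by simple sequence statistics. The sketch below computes two of them: the fraction of G and C letters, and the ratio of observed to expected CG dinucleotides. The thresholds (at least 200 base pairs, at least 50% GC, ratio of at least 0.6) are widely used rule-of-thumb values assumed here for illustration. They do not come from this article, and the function names are hypothetical.

```python
def cpg_stats(seq: str) -> tuple[float, float]:
    """Return (GC fraction, observed/expected CpG ratio) for a DNA string."""
    seq = seq.upper()
    n = len(seq)
    c, g = seq.count("C"), seq.count("G")
    cpg = seq.count("CG")
    gc_fraction = (c + g) / n if n else 0.0
    # Expected CpG count under independence is (C count * G count) / length.
    obs_exp = (cpg * n) / (c * g) if c and g else 0.0
    return gc_fraction, obs_exp


def looks_like_cpg_island(seq: str, min_len: int = 200,
                          min_gc: float = 0.5, min_ratio: float = 0.6) -> bool:
    """Apply the assumed rule-of-thumb thresholds to one candidate region."""
    gc_fraction, obs_exp = cpg_stats(seq)
    return len(seq) >= min_len and gc_fraction >= min_gc and obs_exp >= min_ratio


print(cpg_stats("CG" * 100))              # (1.0, 2.0)
print(looks_like_cpg_island("CG" * 100))  # True
print(looks_like_cpg_island("AT" * 100))  # False
```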
Enhancers, Silencers, and Long-Range Control
A gene’s behavior isn’t determined by the promoter alone. Scattered across the surrounding DNA, sometimes tens of thousands of base pairs away, are regulatory elements called enhancers and silencers. Enhancers boost a gene’s activity; silencers dampen it. The beta-globin gene, which encodes part of the hemoglobin molecule in red blood cells, offers a well-studied example: its main control region sits 40,000 to 60,000 base pairs upstream of the promoter.
How can something so far away influence a gene? The DNA fiber loops back on itself, physically bringing the distant enhancer into direct contact with the promoter. This looping model is now widely accepted. The intervening DNA simply bows out while the regulatory element and promoter meet at a shared hub of active transcription. Insulator sequences can block these interactions, creating boundaries that prevent an enhancer meant for one gene from accidentally switching on a neighbor.
Splice Sites: Where Introns Meet Exons
The boundaries between exons and introns are marked by short, highly conserved DNA sequences called splice sites. Nearly every intron in human genes begins with the letters GT (on the DNA sense strand) and ends with AG. The fuller consensus at the start of an intron reads roughly MAG|GTRAGT (where M is either A or C, and R is either A or G), while the end follows a CAG|G pattern. These nearly invariant signals are what the splicing machinery recognizes when it cuts out introns and joins exons together.
Even small mutations in these splice site sequences can cause the machinery to skip an exon or include part of an intron, potentially producing a defective protein. Many inherited diseases trace back to single-letter changes at splice sites.
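The consensus patterns above can be checked mechanically. The sketch below expands the ambiguity letters M and R, counts how many positions of a candidate site agree with the MAGGTRAGT donor consensus, and tests the GT...AG rule on a whole intron. The function names and example sequences are hypothetical, and a plain match count is a deliberate simplification of how splice sites are scored in practice.

```python
# Ambiguity codes used in the consensus: M = A or C, R = A or G.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "M": "AC", "R": "AG"}

DONOR = "MAGGTRAGT"  # the exon|intron boundary falls after the third letter


def consensus_matches(site: str, consensus: str) -> int:
    """Count positions where `site` agrees with an equal-length consensus."""
    if len(site) != len(consensus):
        raise ValueError("site and consensus must have the same length")
    return sum(base in IUPAC[code]
               for base, code in zip(site.upper(), consensus))


def has_gt_ag(intron: str) -> bool:
    """True if the intron starts with GT and ends with AG."""
    intron = intron.upper()
    return intron.startswith("GT") and intron.endswith("AG")


print(consensus_matches("CAGGTAAGT", DONOR))  # 9 (perfect match)
print(consensus_matches("CAGATAAGT", DONOR))  # 8 (first intron letter G changed to A)
print(has_gt_ag("GTAAGTTTTCAG"))              # True
print(has_gt_ag("ATAAGTTTTCAG"))              # False
```

The second example loses only one match, yet the change hits the nearly invariant GT, which is the kind of single-letter change that can disrupt splicing.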
Untranslated Regions
At the front and back of the mature messenger RNA sit the 5′ and 3′ untranslated regions (UTRs). These stretches are part of the exons, so they survive splicing, but ribosomes don’t translate them into protein. Instead, they serve as control panels for what happens to the RNA after it’s made.
The 5′ UTR influences how efficiently translation begins. If it folds into complex secondary structures, the ribosome has a harder time scanning through it, which slows protein production. The 3′ UTR, meanwhile, contains sequence motifs that regulate how long the RNA molecule survives in the cell. Short signals like AU-rich elements can trigger rapid degradation, while other motifs stabilize the message. MicroRNAs, tiny regulatory molecules, also bind primarily in the 3′ UTR to dial down protein output. A longer 3′ UTR generally means more binding sites for these regulators, which tends to lower protein levels, while a shorter 3′ UTR often means higher expression.
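To make the idea of counting regulatory sites concrete, the sketch below tallies two kinds of motif in a 3′ UTR: the AUUUA pentamer often used as a marker of AU-rich elements, and sites complementary to a microRNA "seed" (nucleotides 2 through 8 of the microRNA). Both the pentamer and the seed rule are standard simplifications assumed for illustration. The microRNA and UTR sequences are toy inputs, and the function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def count_overlapping(seq: str, motif: str) -> int:
    """Count occurrences of `motif` in `seq`, allowing overlaps."""
    return sum(seq.startswith(motif, i)
               for i in range(len(seq) - len(motif) + 1))


def seed_target(mirna: str) -> str:
    """Return the reverse complement of microRNA nucleotides 2 through 8."""
    seed = mirna[1:8]
    return seed.translate(COMPLEMENT)[::-1]


toy_mirna = "UGAGGUAGUAGGUUGUAUAGUU"                 # toy input for the example
toy_utr = "AUUUAUUUA" + "GGCUACCUCGG" + "CUACCUC"   # hypothetical 3' UTR

site = seed_target(toy_mirna)
print(site)                                 # CUACCUC
print(count_overlapping(toy_utr, "AUUUA"))  # 2
print(count_overlapping(toy_utr, site))     # 2
```

A longer UTR gives such a scan more room to find sites, which matches the tendency for longer 3′ UTRs to carry more regulatory binding sites.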
The Termination Signal
At the downstream end of the gene, a specific DNA sequence signals the cell to stop transcription and process the end of the RNA. In vertebrates, this polyadenylation signal consists of two parts: a nearly universal AAUAAA sequence (read in RNA letters) located roughly 10 to 30 nucleotides before the actual cut site, followed by a U-rich or GU-rich element just downstream of the cut site. Once the transcription machinery has transcribed this signal, the RNA is clipped and a long tail of adenine nucleotides is added to the cut end. This poly-A tail helps protect the RNA from degradation and assists in exporting it from the nucleus.
Mutating even a few letters in the AAUAAA hexamer can disable termination, causing the machinery to read past the gene’s intended endpoint. Some genes have multiple polyadenylation signals, letting the cell produce shorter or longer versions of the same RNA depending on which signal it uses.
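The effect of multiple polyadenylation signals on transcript length can be sketched in a few lines. The code below finds every AAUAAA in an RNA string and reports how long the transcript would be if it were cut a fixed distance after each one. The parameter cut_offset is a hypothetical stand-in chosen only for illustration, since real cut sites vary in position and depend on the downstream element as well. The function names and the test sequence are also made up.

```python
SIGNAL = "AAUAAA"


def polya_signal_positions(rna: str) -> list[int]:
    """Return the 0-based start index of every AAUAAA in the RNA."""
    positions = []
    index = rna.find(SIGNAL)
    while index != -1:
        positions.append(index)
        index = rna.find(SIGNAL, index + 1)
    return positions


def isoform_lengths(rna: str, cut_offset: int = 20) -> list[int]:
    """Transcript length if cut `cut_offset` nt past the end of each signal."""
    lengths = []
    for position in polya_signal_positions(rna):
        cut = position + len(SIGNAL) + cut_offset
        if cut <= len(rna):  # ignore signals too close to the end
            lengths.append(cut)
    return lengths


# Hypothetical transcript with two signals.
rna = "G" * 50 + SIGNAL + "C" * 40 + SIGNAL + "C" * 30

print(polya_signal_positions(rna))  # [50, 96]
print(isoform_lengths(rna))         # [76, 122]
```

Using the first signal gives the shorter version of the RNA, and using the second gives the longer one.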
How Bacterial Genes Differ
Everything described above applies to eukaryotic genes, the type found in animals, plants, and fungi. Bacterial genes are simpler in several important ways. They almost never contain introns, so there is no splicing step. Bacteria also organize related genes into clusters called operons, where multiple genes are transcribed as a single continuous RNA message. The classic example is the lac operon in E. coli, where three genes involved in importing and breaking down lactose are read as one unit and regulated by a shared promoter.
Eukaryotes occasionally cluster related genes too, but each gene in the cluster typically produces its own separate RNA. The structural simplicity of bacterial genes (no introns, no complex enhancer looping) reflects a genome under pressure to stay compact and replicate quickly.
By the Numbers
The human genome contains roughly 20,000 protein-coding genes, but current catalogs list even more RNA genes, those that produce functional RNA molecules rather than proteins. When you add long non-coding RNA genes, antisense RNA genes, and other categories, the total gene count climbs well above 40,000, with pseudogenes (broken remnants of former genes) adding another 15,000 or so on top of that.
Despite all those genes, protein-coding sequences account for less than 2% of the human genome’s three billion base pairs. The average gene spans about 28,000 base pairs, but much of that length is intron. A single gene can produce multiple different proteins through alternative splicing, where different combinations of exons are stitched together from the same initial transcript. This is one reason humans can get by with a gene count not much larger than a simple roundworm’s: the structural complexity within each gene multiplies the number of possible outputs far beyond a simple one-gene, one-protein model.
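The multiplying effect of alternative splicing is easy to see with a small enumeration. The sketch below assumes the simplest possible model, in which certain exons can each be independently kept or skipped while the rest are always kept. The exon names and the choice of which exons are optional are hypothetical, and real genes follow more constrained and more varied splicing patterns.

```python
from itertools import product


def isoforms(exons: list[str], optional: set[str]):
    """Yield each exon combination, in order, where optional exons may be skipped."""
    choices = [(True, False) if name in optional else (True,)
               for name in exons]
    for keep in product(*choices):
        yield [name for name, kept in zip(exons, keep) if kept]


exons = ["E1", "E2", "E3", "E4", "E5"]
variants = list(isoforms(exons, optional={"E2", "E4"}))

print(len(variants))  # 4
for variant in variants:
    print("-".join(variant))
```

Under this model, k optional exons give 2 to the power k possible messages, so two optional exons yield four variants and ten would yield 1,024.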

