How Are Splice Sites Recognized by the Spliceosome?

Splice sites are recognized through a combination of short sequence signals encoded in the DNA and molecular machinery that reads those signals in a specific order. The process starts with conserved nucleotide patterns at the boundaries of each intron, but those patterns alone aren’t enough. Proteins and small nuclear RNAs work together to verify each site, and regulatory elements scattered across the pre-mRNA fine-tune which sites get used. Getting this right matters: an estimated 38% to 50% of disease-causing mutations in humans disrupt splicing in some way.

The Sequence Signals That Mark Each Splice Site

Every intron in your pre-mRNA is bookended by short, predictable sequence patterns. At the upstream end (the 5′ splice site, or donor), the intron almost always begins with the nucleotides GT in the DNA sequence (GU in the RNA itself). At the downstream end (the 3′ splice site, or acceptor), it almost always ends with AG. This GT-AG rule holds for about 99.24% of human introns. The remaining fraction uses minor variants: roughly 0.69% use GC-AG pairs, 0.05% use AT-AC pairs, and a tiny 0.02% use other rare combinations.
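As a rough illustration of the GT-AG rule, the sketch below sorts an intron into one of the classes named above by looking only at its first and last two nucleotides. The function name `classify_intron` is hypothetical, and the check deliberately ignores everything else about the sequence.

```python
def classify_intron(intron: str) -> str:
    """Classify an intron by its terminal dinucleotides (DNA alphabet).

    Illustrative only: real splice-site annotation uses far more context
    than the first and last two nucleotides.
    """
    # Accept either RNA (U) or DNA (T) spelling and normalize to DNA.
    seq = intron.upper().replace("U", "T")
    if len(seq) < 4:
        raise ValueError("intron too short to carry both dinucleotides")
    pair = f"{seq[:2]}-{seq[-2:]}"
    # The three named classes from the text; everything else is "other".
    return pair if pair in {"GT-AG", "GC-AG", "AT-AC"} else "other"
```

Calling `classify_intron("GTAAGTCCCTTTTCAG")` returns `"GT-AG"`, the class that covers the overwhelming majority of human introns.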

These two-letter signals are part of slightly longer consensus sequences. The 5′ splice site extends across about nine nucleotides straddling the exon-intron boundary, following the pattern (C or A)AG|GTRAGT, where the vertical line marks the cut point and R stands for either purine (A or G). The 3′ splice site spans roughly fifteen nucleotides, with a stretch of pyrimidines (cytosine and uracil) followed by the AG dinucleotide right at the intron-exon junction.
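To make the nine-nucleotide 5′ consensus concrete, here is a minimal sketch that counts how many positions of a candidate site agree with (C or A)AG|GTRAGT. The names `DONOR_CONSENSUS` and `donor_matches` are hypothetical, and a plain match count is a simplification: practical splice-site scoring typically weights each position rather than treating them equally.

```python
# IUPAC codes used in the consensus: M = A or C, R = A or G.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "R": "AG", "M": "AC"}

# (C or A)AG|GTRAGT written without the cut mark:
# 3 exon nucleotides followed by 6 intron nucleotides.
DONOR_CONSENSUS = "MAGGTRAGT"

def donor_matches(site: str) -> int:
    """Count positions (0-9) where a 9-nt candidate agrees with the consensus."""
    seq = site.upper().replace("U", "T")  # accept RNA or DNA spelling
    if len(seq) != len(DONOR_CONSENSUS):
        raise ValueError("expected exactly 9 nucleotides")
    return sum(base in IUPAC[code] for base, code in zip(seq, DONOR_CONSENSUS))
```

A perfect site such as `CAGGTAAGT` scores 9, while a sequence that keeps the GT but little else scores much lower.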

A third critical signal sits inside the intron itself: the branch point sequence. This is a short motif containing a key adenosine residue, typically located 19 to 35 nucleotides upstream of the 3′ splice site. During splicing, this adenosine attacks the 5′ splice site and becomes joined to the 5′ end of the intron, creating a looped structure called a lariat. The spacing between the branch point and the 3′ AG matters: splicing factors physically block any AG dinucleotide that lies closer than about 12 to 18 nucleotides to the branch point, which helps ensure the correct 3′ site is chosen.
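The spacing rule can be pictured as a simple scan: starting from the branch point adenosine, skip any AG that falls inside the blocked zone and take the first one beyond it. The sketch below is a toy model under stated assumptions. The function name `first_usable_ag`, the `min_gap` parameter, and the way distance is measured (from the branch point A to the A of the AG) are all illustrative choices, not an established algorithm.

```python
from typing import Optional

def first_usable_ag(seq: str, branch_idx: int, min_gap: int = 12) -> Optional[int]:
    """Return the index of the first AG at least `min_gap` nt past the branch A.

    seq        -- intron sequence (or its 3' portion), DNA or RNA spelling
    branch_idx -- 0-based index of the branch point adenosine in `seq`
    min_gap    -- assumed size of the blocked zone; the text gives ~12-18 nt
    """
    s = seq.upper().replace("U", "T")
    if s[branch_idx] != "A":
        raise ValueError("branch_idx must point at an adenosine")
    # Any AG starting before this position is treated as blocked.
    pos = s.find("AG", branch_idx + min_gap)
    return pos if pos != -1 else None
```

For the toy sequence `"CTAAC" + "TAG" + "TTTCTTTTCTTTC" + "AG"` with the branch point at index 3, the AG three nucleotides downstream is skipped and the one at index 21 is returned.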

How U1 snRNP Finds the 5′ Splice Site

The first step in recognizing a splice site is the binding of a molecular complex called U1 snRNP to the 5′ end of the intron. U1 snRNP contains a small RNA molecule (U1 snRNA) and ten associated proteins. The 5′ tip of the U1 snRNA is complementary to the sequence at the 5′ splice site, and it can form up to 11 consecutive base pairs across the exon-intron boundary. This base pairing is what anchors the complex to the correct location.

The match doesn’t need to be perfect. One nucleotide, an adenosine at a specific position, is expected to bulge out of the RNA duplex rather than pair with the snRNA. A protein called U1-C binds reversibly to the U1 snRNP and helps stabilize the interaction, particularly at sites with this bulged nucleotide. This flexibility allows U1 to recognize a range of 5′ splice site sequences that aren’t identical but share enough similarity to the consensus.
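The base pairing between U1 snRNA and the 5′ splice site can be sketched as an antiparallel comparison of two 11-nucleotide strings. The code below assumes the 5′ end of human U1 snRNA reads AUACUUACCUG (with its pseudouridines written as plain U) and counts only Watson-Crick pairs. It does not model the bulged adenosine, wobble pairs, or the contribution of U1-C, so treat it as an illustration rather than a predictor. `U1_5P_END` and `u1_pairing` are hypothetical names.

```python
# Assumed 5' end of U1 snRNA, written 5'->3' with pseudouridine shown as U.
U1_5P_END = "AUACUUACCUG"

# Watson-Crick pairs only; G-U wobble pairs are left out for simplicity.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}

def u1_pairing(site: str) -> tuple[int, int]:
    """Return (paired positions, longest consecutive paired run).

    `site` is 11 nt of pre-mRNA spanning the exon-intron boundary, 5'->3'.
    """
    rna = site.upper().replace("T", "U")  # accept DNA spelling too
    if len(rna) != len(U1_5P_END):
        raise ValueError("expected exactly 11 nucleotides")
    paired = best = run = 0
    # Strands pair antiparallel, so read the snRNA 3'->5'.
    for s, u in zip(rna, reversed(U1_5P_END)):
        if (s, u) in PAIRS:
            paired += 1
            run += 1
            best = max(best, run)
        else:
            run = 0
    return paired, best
```

A fully complementary site such as `CAGGUAAGUAU` gives `(11, 11)`, matching the maximum of 11 consecutive base pairs mentioned above; a single mismatch in the middle leaves most positions paired but cuts the longest run roughly in half.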

Recognition of the 3′ Splice Site

The 3′ end of the intron is recognized through a different set of players. Between the branch point and the 3′ AG lies a stretch of RNA rich in pyrimidine nucleotides, called the polypyrimidine tract. This stretch is dominated by uridines, and it serves as a landing pad for a protein called U2AF65, the large subunit of the U2 Auxiliary Factor complex. U2AF65 has a strong preference for uridine-rich sequences. Chemically altering the uridine residues reduces its binding affinity by roughly 100-fold.
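One simple way to picture what U2AF65 is looking for is to measure how pyrimidine-rich, and specifically how uridine-rich, the stretch just upstream of the 3′ AG is. The sketch below does only that. The function name `tract_composition` and the `window` size are assumptions for illustration, and composition alone says nothing about actual binding affinity.

```python
def tract_composition(intron: str, window: int = 10) -> dict[str, float]:
    """Report pyrimidine and uridine fractions just upstream of the 3' AG.

    intron -- intron sequence ending in AG, DNA or RNA spelling
    window -- number of nucleotides before the AG to inspect (assumed value)
    """
    seq = intron.upper().replace("T", "U")  # work in the RNA alphabet
    if not seq.endswith("AG"):
        raise ValueError("intron is expected to end with AG")
    # Take up to `window` nucleotides immediately before the terminal AG.
    tract = seq[-(window + 2):-2]
    if not tract:
        raise ValueError("no sequence upstream of the AG")
    n = len(tract)
    return {
        "pyrimidine": sum(base in "CU" for base in tract) / n,
        "uridine": tract.count("U") / n,
    }
```

For an intron ending in `UUUUCUUUCUAG`, the ten nucleotides before the AG are all pyrimidines and eight of them are uridines, so the function returns a pyrimidine fraction of 1.0 and a uridine fraction of 0.8.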

While U2AF65 grips the polypyrimidine tract, its smaller partner, U2AF35, contacts the AG dinucleotide at the intron-exon boundary. A third factor, SF1, binds the branch point sequence. Together, these three proteins span the entire 3′ splice site region: branch point, polypyrimidine tract, and AG. This cooperative binding is what gives the cell enough specificity to distinguish real splice sites from the many random AG dinucleotides scattered throughout a long intron.

Building the Spliceosome Step by Step

Splice site recognition happens in stages as the spliceosome assembles. In the earliest stage, called the E complex, U1 snRNP binds the 5′ splice site, U2AF recognizes the 3′ splice site region, and U2 snRNP loosely associates with the complex. At this point, no energy has been spent. The interactions are reversible, giving the cell a chance to check whether the sites are legitimate before committing.

The transition to the next stage, the A complex, requires ATP. This energy input drives U2 snRNP into a tight, stable interaction with the branch point sequence, locking the branch point adenosine into position for the first chemical step of splicing. Once this happens, the spliceosome is committed. Later, U1 snRNA is displaced from the 5′ splice site by U6 snRNA, which takes over just before the spliceosome activates its catalytic core and carries out the actual cutting and joining reactions.

Regulatory Elements That Steer Splice Site Choice

Consensus sequences and core splicing factors handle the basics, but they aren’t the whole story. Human genes often contain multiple possible splice sites, and the cell uses regulatory sequences embedded in both exons and introns to decide which ones to use. These elements fall into four categories: exonic splicing enhancers, exonic splicing silencers, intronic splicing enhancers, and intronic splicing silencers.

Two major families of proteins read these regulatory signals. SR proteins generally bind enhancer elements and promote the use of nearby splice sites, often by helping recruit U1 snRNP or U2AF to weak sites that wouldn’t be recognized on their own. A different group, the hnRNP proteins, typically binds silencer elements and blocks or redirects the splicing machinery. Both families are modular, with separate domains for recognizing RNA sequences and for interacting with other proteins.

This regulatory layer is what makes alternative splicing possible. By changing the relative amounts of SR proteins and hnRNPs in different tissues or developmental stages, cells can include or skip specific exons, producing different protein versions from the same gene. Over evolutionary time, the expansion of these regulatory protein families appears to have relaxed the sequence requirements at splice junctions, allowing more flexibility in which sites get used and when.

Splicing Starts Before Transcription Finishes

Splice sites don’t just sit passively waiting for the spliceosome to find them. In living cells, splicing often begins while the RNA is still being copied from the DNA template. The enzyme that builds the RNA, RNA polymerase II, has a long tail called the C-terminal domain (CTD) that acts as a recruitment platform for splicing factors.

Chemical modifications to this tail control which factors get loaded onto the emerging RNA. Specifically, a modification called serine-2 phosphorylation on the CTD is required for recruiting U2AF65 and U2 snRNA to active transcription sites. Without this modification, U1 snRNP can still find the 5′ splice site, but the 3′ splice site recognition machinery fails to assemble properly. This means the polymerase itself helps orchestrate the order of splice site recognition: the 5′ site is marked first, and only with the right CTD signal does the 3′ site machinery follow.

Co-transcriptional splicing also creates a natural directionality. As the polymerase moves along the gene, splice sites emerge from the enzyme in the order they appear in the DNA. The 5′ splice site of an intron is transcribed before the branch point and 3′ splice site, giving U1 snRNP a head start. By the time the 3′ end of the intron appears, the early complex at the 5′ site is already in place and ready to pair with it. This built-in timing helps the cell match the correct pairs of splice sites across introns that can be tens of thousands of nucleotides long.

When Splice Site Recognition Goes Wrong

Because splice site recognition depends on relatively short and somewhat flexible sequence signals, it’s vulnerable to mutation. A single nucleotide change in the GT or AG dinucleotides can completely abolish recognition of that site. Mutations in the broader consensus, the branch point, or the polypyrimidine tract can weaken recognition enough to cause exon skipping, intron retention, or activation of nearby sequences that resemble splice sites but aren’t normally used (called cryptic splice sites).
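To see why position matters so much, the sketch below labels a single-nucleotide variant by which fixed-position signal it lands in: the two essential dinucleotides, or the broader 5′ consensus around the donor. The function name `splice_signal_hit` and the coordinate convention (0-based, inclusive, on the transcribed strand) are assumptions. The branch point and polypyrimidine tract are not modeled because their positions vary from intron to intron.

```python
def splice_signal_hit(pos: int, intron_start: int, intron_end: int) -> str:
    """Label a variant position relative to one intron's fixed splice signals.

    Coordinates are 0-based and inclusive; intron_start is the G of GT and
    intron_end is the G of AG (an assumed convention for this sketch).
    """
    if intron_end - intron_start + 1 < 4:
        raise ValueError("intron too short to carry both dinucleotides")
    # First two intron nucleotides: the essential donor dinucleotide.
    if intron_start <= pos <= intron_start + 1:
        return "donor GT dinucleotide"
    # Last two intron nucleotides: the essential acceptor dinucleotide.
    if intron_end - 1 <= pos <= intron_end:
        return "acceptor AG dinucleotide"
    # Rest of the nine-nucleotide donor consensus: 3 exon nt + 6 intron nt.
    if intron_start - 3 <= pos <= intron_start + 5:
        return "donor consensus"
    return "other"
```

With an intron spanning positions 100 to 199, a variant at 101 is flagged as hitting the donor dinucleotide, one at 105 falls in the broader donor consensus, and one at 150 is labeled `"other"`, which is where branch point, polypyrimidine tract, and regulatory-element effects would need separate analysis.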

Mutations in regulatory elements are harder to predict but can be equally damaging. A change that destroys an exonic splicing enhancer or creates a new silencer element can cause an exon to be skipped even though the splice sites themselves are intact. These effects help explain why such a large fraction of pathogenic variants, up to half in some disease cohorts, ultimately exert their harm through disrupted splicing rather than by directly altering protein-coding sequences.