How to Find an Open Reading Frame (Step by Step)

Finding an open reading frame (ORF) means scanning a DNA or RNA sequence for a stretch of codons (three-nucleotide “words”) that begins with a start codon and runs uninterrupted until the first in-frame stop codon. In the simplest case, you look for an ATG start codon, then read forward in triplets until you hit TAA, TAG, or TGA. Everything between those two signals is the ORF, the region that could potentially encode a protein. The real challenge is that even a modest sequence can contain dozens or hundreds of possible ORFs, and only some of them are biologically meaningful.

What Counts as an Open Reading Frame

The genetic code uses 64 possible three-letter codons. Of those, 61 code for amino acids and 3 are stop signals: TAA, TAG, and TGA (or UAA, UAG, and UGA in RNA). An ORF is any continuous run of amino acid-coding triplets, bookended by a start codon at the front and a stop codon at the end. The start codon is almost always ATG, which codes for the amino acid methionine and tells the cell’s machinery where to begin building a protein.
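
As a rough illustration of that definition, the sketch below finds the first ATG and reads forward in triplets until it meets a stop codon. The function name first_orf is made up for this example, and it only looks at the strand as given; it is a minimal sketch, not a complete ORF finder.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def first_orf(seq):
    """Return the first ATG-to-stop stretch (stop codon included), or None."""
    seq = seq.upper().replace("U", "T")  # accept RNA input as well
    start = seq.find("ATG")
    while start != -1:
        # Read forward in triplets from this start codon.
        for i in range(start, len(seq) - 2, 3):
            if seq[i:i + 3] in STOP_CODONS:
                return seq[start:i + 3]
        # No in-frame stop after this ATG, so try the next ATG.
        start = seq.find("ATG", start + 1)
    return None

print(first_orf("CCATGGCTTGGTAAGC"))
```

Note that a stop triplet only counts when it falls in the same frame as the ATG. In "ATGAAA" the letters TGA appear at an offset of one base, so they do not close the frame.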

Not every ATG-to-stop stretch actually produces a protein. Short ORFs appear by chance in random sequence, which is why researchers traditionally used a cutoff of about 100 codons (300 nucleotides) as the minimum length worth investigating. Anything shorter was considered likely to be noise. That threshold is now understood to be imperfect, since genuine small proteins and peptides encoded by shorter ORFs do exist, but 300 nucleotides remains a practical default for most analyses.
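
The reasoning behind the 100-codon cutoff can be checked with a little arithmetic. The sketch below assumes, purely for illustration, that every one of the 64 codons is equally likely and that codons are independent; real genomes have skewed base composition, so the numbers are only indicative.

```python
# Chance that one random codon is not a stop, assuming all 64 codons
# are equally likely (an idealization, not a property of real genomes).
P_SENSE = 61 / 64

def p_open(n_codons):
    """Probability that n_codons consecutive random codons contain no stop."""
    return P_SENSE ** n_codons

for n in (10, 50, 100):
    print(n, round(p_open(n), 4))
```

Under these assumptions the chance of 100 stop-free codons in a row comes out below 1%, which is why long ORFs stand out from background noise while short ones do not.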

Why There Are Six Reading Frames

A DNA sequence can be read in triplets starting at three different positions. If your sequence begins ATGCCTAG…, the first reading frame groups it as ATG-CCT-AG…, the second as A-TGC-CTA-G…, and the third as AT-GCC-TAG… Each grouping produces a completely different set of codons and therefore a different potential protein.
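
The three groupings can be reproduced with a short helper. The name codons_in_frame is hypothetical, and the sketch drops any leading or trailing bases that do not form a complete codon.

```python
def codons_in_frame(seq, frame):
    """Split seq into complete codons, starting at offset frame (0, 1, or 2)."""
    return [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]

seq = "ATGCCTAG"
for frame in range(3):
    print(frame + 1, codons_in_frame(seq, frame))
```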

Because DNA is double-stranded, the complementary strand can also be read in three frames running in the opposite direction. That gives you six total reading frames for any stretch of DNA: three forward, three reverse. A real gene uses exactly one of these six frames. To figure out which one, you look for the frame that contains a long, uninterrupted ORF starting with ATG and ending with a stop codon, without premature stops scattered throughout.
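
One way to lay out the six frames in code is to take the reverse complement and then offset each strand by 0, 1, and 2 bases. The labels +1 to +3 and -1 to -3 are a common convention, but the function names here are assumptions made for this sketch.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Complement each base, then reverse, so the result reads 5' to 3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]

def six_frames(seq):
    """Map frame labels (+1..+3, -1..-3) to the strand read from that offset."""
    frames = {}
    for sign, strand in (("+", seq.upper()), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{sign}{offset + 1}"] = strand[offset:]
    return frames

for label, text in six_frames("ATGCCTAG").items():
    print(label, text)
```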

Finding ORFs by Hand

For a short sequence, you can identify ORFs manually. Start by writing out all three forward reading frames. In each frame, locate every ATG and every stop codon (TAA, TAG, TGA). An ORF runs from any ATG to the next in-frame stop codon. Then do the same for the reverse complement of the sequence, giving you the other three frames.

The longest ORF in a given stretch of DNA is usually the best candidate for a real protein-coding gene, though this is a rule of thumb rather than a guarantee. If you find an ORF that is hundreds of codons long while the others in the same region are short and fragmented, that long ORF is very likely the correct reading frame.
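
The manual procedure translates fairly directly into code. The sketch below walks all six frames, records each ATG-to-stop stretch, and sorts the results longest first. It is a simplified illustration: it keeps only the first ATG after each stop (so ORFs nested inside a longer one are not listed), and the names find_orfs and orfs_in_strand are invented for this example.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def orfs_in_strand(strand):
    """Yield (frame, start, end) per ORF; end is exclusive and includes the stop."""
    for frame in range(3):
        start = None
        for i in range(frame, len(strand) - 2, 3):
            codon = strand[i:i + 3]
            if codon == "ATG" and start is None:
                start = i                  # first ATG since the last stop
            elif codon in STOPS and start is not None:
                yield frame, start, i + 3
                start = None

def find_orfs(seq):
    """Return (frame label, ORF sequence) pairs from all six frames, longest first."""
    seq = seq.upper()
    results = []
    for sign, strand in (("+", seq), ("-", seq.translate(COMPLEMENT)[::-1])):
        for frame, start, end in orfs_in_strand(strand):
            results.append((f"{sign}{frame + 1}", strand[start:end]))
    return sorted(results, key=lambda r: len(r[1]), reverse=True)

print(find_orfs("CATGAAACCCGGGTAGC"))
```

The first entry of the returned list is the longest ORF, which, as noted above, is a reasonable first candidate rather than a guaranteed gene.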

Using NCBI ORFfinder

For anything beyond a very short sequence, use a tool. NCBI’s ORFfinder is the most widely used free option. You paste in a nucleotide sequence in FASTA format (or enter an NCBI accession number), and it scans all six reading frames for ORFs. The tool handles sequences up to 50,000 bases long.

Several settings let you fine-tune the search:

Minimum ORF length: Options range from 30 to 600 nucleotides. A cutoff of 300 filters out most random matches, but you can lower it to 75 or 30 if you’re hunting for small peptide-coding regions.
Start codon selection: You can restrict the search to ATG-only start codons, include alternative initiation codons recognized by the chosen genetic code, or use any sense codon as a potential start (which finds every stop-to-stop stretch regardless of a start codon).
Genetic code: The standard code works for most nuclear genes, but mitochondrial and plastid genomes, along with certain organisms, use different codon tables. Picking the wrong one will cause the tool to misidentify stops and starts.
Nested ORFs: A checkbox lets you ignore ORFs that are entirely contained within a larger ORF in the same frame, reducing clutter in the results.
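
To make the effect of these settings concrete, here is a minimal sketch of a forward-strand scanner with comparable knobs. It is not ORFfinder’s code or API: the function scan and its parameters starts, min_len, and ignore_nested are hypothetical names, and counting the stop codon toward the length is an assumption of this example.

```python
STOPS = {"TAA", "TAG", "TGA"}

def scan(seq, starts=("ATG",), min_len=300, ignore_nested=True):
    """Return (frame, start, end) for forward-strand ORFs; end is exclusive."""
    seq = seq.upper()
    found = []
    for frame in range(3):
        open_starts = []                   # in-frame starts since the last stop
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon in STOPS:
                if ignore_nested:
                    open_starts = open_starts[:1]   # keep the outermost start only
                # Length here counts the stop codon (an assumption of this sketch).
                found += [(frame, s, i + 3) for s in open_starts if i + 3 - s >= min_len]
                open_starts = []
            elif codon in starts:
                open_starts.append(i)
    return found

seq = "ATGAAAATGCCCTAA"
print(scan(seq, min_len=9))                        # outermost ORF only
print(scan(seq, min_len=9, ignore_nested=False))   # also the ORF from the inner ATG
print(scan("GTGAAACCCTAA", starts=("ATG", "GTG", "TTG"), min_len=9))
```

Lowering min_len or widening starts increases the number of hits, which is the trade-off the settings above control: more sensitivity for small or unusual ORFs at the cost of more chance matches.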

The output displays each ORF as a colored bar on a six-frame map, ranked by length. You can click any ORF to get its nucleotide and amino acid sequence, then send it directly to BLAST to check whether it matches known proteins in other organisms. A strong BLAST hit is one of the best confirmations that your ORF encodes a real protein.

Alternative Start Codons

ATG is the dominant start codon, but it is not the only one. In bacteria, roughly 80% of annotated genes start with ATG, about 12% use GTG, and around 8% use TTG. Rarer bacterial start codons include ATT and ATC. In the lab, GTG and TTG initiate translation at roughly 12 to 15% of the efficiency of ATG in the same context, so they produce less protein but are still functional.

Eukaryotic genes occasionally use non-ATG starts as well. CTG (CUG in RNA) is the most common alternative, and codons like GTG, ACG, ATT, and ATC have all been documented. If you’re analyzing a bacterial genome or searching specifically for upstream ORFs in a eukaryotic transcript, selecting “ATG and alternative initiation codons” in your tool’s settings will catch these.

Choosing the Right Genetic Code

The standard genetic code applies to nuclear genes in most organisms, but mitochondrial genomes and certain lineages use variations that change which triplets are stops. In the vertebrate mitochondrial code, for instance, TGA does not signal a stop. Instead it codes for tryptophan, meaning a sequence that looks like it contains a premature stop under the standard code may actually be one continuous ORF in mitochondria. Similarly, in ciliates like Tetrahymena, TAA and TAG code for the amino acid glutamine rather than functioning as stops.

If you’re analyzing mitochondrial DNA, plastid DNA, or sequences from organisms with known codon reassignments, selecting the correct genetic code table is essential. Using the wrong table will split real ORFs into fragments or merge separate genes into one artificially long reading frame. NCBI’s ORFfinder offers more than 30 genetic code options covering vertebrate and invertebrate mitochondria, yeast mitochondria, bacterial and archaeal genomes, and various protist nuclear codes.
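
The effect of the table choice is easy to see by swapping the set of stop codons. The sketch below hard-codes the stops for three NCBI translation tables (1, 2, and 6); the dictionary and the helper first_stop are illustrative names, and only frame 0 is examined.

```python
# Stop codons under three NCBI translation tables.
STOPS_BY_TABLE = {
    1: {"TAA", "TAG", "TGA"},          # standard code
    2: {"TAA", "TAG", "AGA", "AGG"},   # vertebrate mitochondrial: TGA codes for Trp
    6: {"TGA"},                        # ciliate nuclear: TAA and TAG code for Gln
}

def first_stop(seq, table):
    """Index of the first stop codon in frame 0 under the given table, or None."""
    stops = STOPS_BY_TABLE[table]
    for i in range(0, len(seq) - 2, 3):
        if seq[i:i + 3] in stops:
            return i
    return None

seq = "ATGGCTTGACCATAA"    # ATG GCT TGA CCA TAA
for table in (1, 2, 6):
    print(table, first_stop(seq, table))
```

Under the standard code the frame closes at the TGA, while under the vertebrate mitochondrial code the same sequence stays open until the final TAA, which is the fragment-versus-continuous difference described above.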

The Eukaryotic Complication: Introns

In prokaryotes (bacteria and archaea), genes are typically continuous stretches of coding sequence, so finding the longest ORF in a region usually identifies the gene. Eukaryotic genes are more complex. Most are split into coding segments called exons, separated by non-coding stretches called introns. The cell transcribes the entire gene into RNA, then splices out the introns before translation.

This means that if you run a simple ORF search on raw eukaryotic genomic DNA, you will often fail to find the full coding sequence. Introns frequently contain in-frame stop codons and can shift the reading frame between exons, so what is actually a single gene breaks up into multiple small, seemingly unrelated ORFs. For eukaryotic genomes, the most reliable approach is to search for ORFs in mRNA or cDNA sequences, where introns have already been removed. If you only have genomic DNA, you need gene prediction software that models exon-intron boundaries rather than a simple ORF scanner.
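
A toy example shows the problem. The sequence and exon coordinates below are invented for illustration: two short exons flank a 12-base intron that happens to carry an in-frame TGA. Reading the genomic sequence directly stops inside the intron, while reading the spliced sequence recovers the full frame.

```python
STOPS = {"TAA", "TAG", "TGA"}

def orf_from(seq, start=0):
    """Read triplets from start; return the stretch through the first stop, or None."""
    for i in range(start, len(seq) - 2, 3):
        if seq[i:i + 3] in STOPS:
            return seq[start:i + 3]
    return None

def splice(genomic, exons):
    """Join exon slices given as (start, end) pairs, end exclusive."""
    return "".join(genomic[s:e] for s, e in exons)

# Made-up gene: exon 1, an intron (GT...AG) containing TGA, exon 2.
genomic = "ATGGCTGAA" + "GTATGACTTCAG" + "CCTTGGTAA"
exons = [(0, 9), (21, 30)]          # hypothetical exon coordinates

print(orf_from(genomic))                 # ends early, inside the intron
print(orf_from(splice(genomic, exons)))  # full coding sequence
```

In practice the exon coordinates are exactly what is unknown for raw genomic DNA, which is why the text recommends mRNA or cDNA input or a dedicated gene predictor.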

Gene Prediction Tools for Whole Genomes

When you’re annotating an entire genome rather than examining a single sequence, dedicated gene prediction programs are far more powerful than a basic ORF finder. For prokaryotic genomes, tools like Prodigal and Balrog identify protein-coding genes from statistical patterns: Prodigal learns them from the input genome itself, while Balrog relies on a model trained on a large collection of known bacterial genomes. Annotation pipelines such as Prokka and NCBI’s PGAP go further, combining ORF detection with functional annotation and comparison to known protein databases.

For eukaryotic transcripts, TransDecoder is designed to predict coding regions within assembled RNA sequences, making it a common choice for transcriptome analysis. These tools incorporate information beyond just start and stop codons: they evaluate codon usage patterns, the statistical likelihood that a stretch of sequence codes for protein, and homology to known genes.

Confirming That an ORF Is Real

Finding an ORF in a sequence does not prove a protein is actually made from it. Several lines of evidence strengthen the case. The most straightforward is a BLAST search: if your predicted protein closely matches a characterized protein in another species, it is very likely real. Codon usage analysis adds another layer. Real genes tend to use certain synonymous codons more frequently than others, matching patterns seen in highly expressed genes of the same organism. Random ORFs do not show this bias.
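
A first look at codon usage only requires counting codons. The sketch below tallies the codons in an ORF and reports what share of the leucine codons is contributed by one of them. The function names and the toy ORF are assumptions made for this example; a real analysis would compare the counts against a reference set of highly expressed genes from the same organism.

```python
from collections import Counter

def codon_usage(orf):
    """Count each codon in an ORF read from its first base."""
    return Counter(orf[i:i + 3] for i in range(0, len(orf) - 2, 3))

def synonymous_fraction(orf, codon, synonyms):
    """Share of the synonyms' combined count that comes from one codon."""
    counts = codon_usage(orf)
    total = sum(counts[c] for c in synonyms)
    return counts[codon] / total if total else 0.0

# The six leucine codons of the standard genetic code.
LEU = ("TTA", "TTG", "CTT", "CTC", "CTA", "CTG")

orf = "ATGCTGCTGTTACTGTAA"   # toy ORF: ATG CTG CTG TTA CTG TAA
print(codon_usage(orf))
print(synonymous_fraction(orf, "CTG", LEU))
```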

For small ORFs, where false positives are especially common, ribosome profiling data can confirm that ribosomes physically occupy the sequence during translation. Conservation across species is another strong signal. If the same ORF appears in the equivalent position across multiple related genomes, natural selection is likely preserving it because it produces a functional product.