How to Find a DNA Sequence: NCBI, BLAST, and Ensembl

Finding a specific DNA sequence typically means searching one of several large, free online databases that store genetic information from organisms across the tree of life. The largest of these, GenBank, holds nearly 50 trillion bases of sequence data across more than 6 billion records. Whether you need a gene sequence for a research project, a class assignment, or species identification, the process starts with knowing which database to use and how to search it effectively.

Start With the NCBI Nucleotide Database

The National Center for Biotechnology Information (NCBI) hosts the most widely used collection of DNA sequences. Its Nucleotide database is the go-to starting point for most searches. You can access it at ncbi.nlm.nih.gov/nucleotide and type in a gene name, organism, or accession number (a unique code assigned to each sequence record, like a catalog number).

For simple lookups, a basic search works fine. Type something like “BRCA1 Homo sapiens” and you’ll get a list of matching sequence records. But if you need more precision, the Advanced Search Builder lets you target specific fields within the database. Each record contains over 30 searchable fields, including gene name, organism, accession number, and associated journal references. You can combine search terms using Boolean operators: OR connects synonyms (like two names for the same gene), and AND connects separate concepts to narrow results (like a gene name AND a specific organism).
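The same field-and-operator logic can be written out as a plain query string. The sketch below is one possible way to assemble such a query in Python and turn it into a URL for NCBI's E-utilities ESearch service. The helper names (field_term, any_of, all_of, esearch_url) are hypothetical, the gene alias is only an illustrative synonym, and nothing is sent over the network.

```python
from urllib.parse import urlencode

# Public NCBI E-utilities search endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def field_term(value, field):
    """Format one term with a field tag, quoting multi-word values."""
    if " " in value:
        value = f'"{value}"'
    return f"{value}[{field}]"


def any_of(terms):
    """Join synonyms with OR, wrapped in parentheses."""
    return "(" + " OR ".join(terms) + ")"


def all_of(terms):
    """Join separate concepts with AND to narrow the results."""
    return " AND ".join(terms)


def esearch_url(query, retmax=20):
    """Build (but do not send) an ESearch URL for the Nucleotide database."""
    params = {"db": "nucleotide", "term": query, "retmax": retmax}
    return f"{ESEARCH_URL}?{urlencode(params)}"


# Illustrative values: two names for the same gene, restricted to one organism.
gene = any_of([field_term("BRCA1", "Gene Name"), field_term("RNF53", "Gene Name")])
query = all_of([gene, field_term("Homo sapiens", "Organism")])

print(query)
print(esearch_url(query))
```

The printed query can also be pasted straight into the Nucleotide search box, which is a quick way to check that the field tags behave as intended before scripting anything.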

A useful trick is truncation. Adding an asterisk to the end of a search term finds all variants of that word. Searching “insulin*” would return results for insulin, insulinase, insulin-like, and so on. This helps when you’re not sure of the exact naming conventions used in the records.

GenBank vs. RefSeq: Picking the Right Record

Once your search returns results, you’ll notice records from two overlapping but different sources: GenBank and RefSeq. Understanding the distinction saves you from downloading a messy or outdated sequence.

GenBank is an archival database. It stores every publicly submitted DNA sequence from individual labs and large-scale sequencing projects worldwide, and it shares data daily with partner databases in Europe and Japan. Because it’s an archive, GenBank can be very redundant for popular genes, with dozens of slightly different submissions for the same sequence. Each record belongs to the original submitter and can’t be edited by anyone else.

RefSeq, on the other hand, is curated by NCBI staff scientists. These records are derived from GenBank data but cleaned up, annotated, and maintained to reflect current knowledge. Think of a RefSeq entry as a “review article” version of a gene’s sequence. RefSeq records carry a status label (such as REVIEWED or VALIDATED) in their comments section, and NCBI can update them as new information emerges. If you want a single, reliable, well-annotated version of a gene sequence, look for the RefSeq record. You can spot them by their accession number prefix: NM_ for mRNA sequences, NR_ for non-coding RNA, and NC_ for complete genomic molecules such as chromosomes.
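If you are sorting through a long list of accession numbers, the prefix check is easy to automate. The following is a minimal sketch that covers only the three prefixes mentioned above. The function name is hypothetical, and the accession strings are sample inputs used only to exercise the check.

```python
# Prefixes named in the text above; RefSeq defines others that are omitted here.
REFSEQ_PREFIXES = {
    "NM_": "RefSeq mRNA",
    "NR_": "RefSeq non-coding RNA",
    "NC_": "RefSeq chromosome",
}


def describe_accession(accession):
    """Return a label for a listed RefSeq prefix, or a fallback otherwise."""
    prefix = accession.strip().upper()[:3]
    return REFSEQ_PREFIXES.get(prefix, "not one of the listed RefSeq prefixes")


# Sample accession strings, for illustration only.
for acc in ["NM_007294.4", "NC_000017.11", "U14680.1"]:
    print(acc, "->", describe_accession(acc))
```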

Using BLAST to Search by Sequence

Sometimes you already have a DNA sequence and need to figure out what it is, where it came from, or what’s similar to it. That’s the job of BLAST (Basic Local Alignment Search Tool), available at blast.ncbi.nlm.nih.gov.

BLAST takes a nucleotide or protein sequence you provide and compares it against entire databases to find regions of similarity. It then calculates the statistical significance of each match. You can use it to identify an unknown sequence, find related genes in other species, or confirm that a sequence you’ve isolated matches what you expected.

Two numbers in the results matter most. The E-value is the number of matches with an equally good score that you would expect to find by chance in a database of that size: smaller is better, and anything close to zero is a strong match. Percent identity tells you what fraction of positions match within the aligned region, not across your whole sequence, so read it alongside the query coverage. A 99% identity across nearly the full length of your query is essentially a confirmed match. A 70% identity might indicate a related gene in a different species. BLAST is particularly useful for inferring evolutionary relationships between organisms or identifying which gene family a sequence belongs to.
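When a search returns hundreds of hits, it helps to filter on those numbers programmatically. The sketch below assumes the results were saved in the 12-column tabular layout that command-line BLAST+ writes with -outfmt 6. The function names and the cutoffs (an E-value of 1e-10 and 95% identity) are arbitrary choices for illustration, and the sample rows are made up rather than taken from a real search.

```python
import csv
import io

# Default column order of BLAST+ tabular output (-outfmt 6).
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def parse_hits(text):
    """Parse tab-separated BLAST hits into dicts with numeric fields converted."""
    hits = []
    for row in csv.reader(io.StringIO(text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(COLUMNS, row))
        hit["pident"] = float(hit["pident"])
        hit["length"] = int(hit["length"])
        hit["evalue"] = float(hit["evalue"])
        hit["bitscore"] = float(hit["bitscore"])
        hits.append(hit)
    return hits


def strong_hits(hits, max_evalue=1e-10, min_identity=95.0):
    """Keep hits that pass both cutoffs (assumed thresholds, adjust as needed)."""
    return [h for h in hits
            if h["evalue"] <= max_evalue and h["pident"] >= min_identity]


# Made-up rows for illustration only; not real BLAST results.
SAMPLE = (
    "query1\tsubjA\t99.20\t500\t4\t0\t1\t500\t101\t600\t0.0\t900\n"
    "query1\tsubjB\t70.50\t480\t130\t6\t10\t489\t55\t530\t2e-40\t160\n"
    "query1\tsubjC\t100.00\t18\t0\t0\t3\t20\t7\t24\t4.1\t36.2\n"
)

for hit in strong_hits(parse_hits(SAMPLE)):
    print(hit["sseqid"], hit["pident"], hit["evalue"])
```

The third sample row shows why identity alone can mislead: an 18-base alignment can be 100% identical and still have an E-value that marks it as a likely chance match.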

Ensembl: A Visual Alternative

The Ensembl Genome Browser, maintained by the European Bioinformatics Institute, offers a more visual way to explore DNA sequences in the context of whole genomes. It’s especially useful for vertebrate species, and a sister site (Ensembl Genomes) covers plants, fungi, and microorganisms.

Ensembl lets you zoom from chromosome-level views down to individual base pairs. You can view how a gene’s upstream region aligns across multiple mammalian species, which is helpful for studying conserved regulatory regions. Sequences and data tables can be exported using the built-in BioMart tool, which lets you filter and download exactly the data you need rather than downloading entire genome files. If you’re working with a well-studied organism such as human, mouse, or zebrafish, Ensembl often provides richer context around a gene than a raw database record.

Understanding Sequence File Formats

When you download a DNA sequence, it will typically come in one of two formats.

FASTA is the simpler option. A FASTA file starts with a single header line beginning with a “>” symbol, followed by a unique sequence identifier (25 characters or fewer, no spaces). After the header, the DNA sequence itself appears on the following lines, using standard letter codes (A, T, G, C, and N for ambiguous positions). Each line of sequence is typically kept to 80 characters or fewer. FASTA is the format most analysis tools expect when you paste in a sequence.
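Because the layout is so simple, a FASTA file can be read with a few lines of code. The sketch below is a minimal reader and writer that follows the description above. The function names are hypothetical, and for real projects an established library is usually the safer choice.

```python
def parse_fasta(text):
    """Return a dict mapping each sequence identifier to its full sequence."""
    records = {}
    current_id = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The identifier is the first whitespace-delimited token after ">".
            parts = line[1:].split()
            current_id = parts[0] if parts else ""
            records[current_id] = []
        elif current_id is not None:
            records[current_id].append(line.upper())
    return {seq_id: "".join(chunks) for seq_id, chunks in records.items()}


def format_fasta(seq_id, sequence, width=80):
    """Write one record with sequence lines wrapped at `width` characters."""
    lines = [f">{seq_id}"]
    lines += [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join(lines) + "\n"


# A made-up record: header line, then sequence split across two lines.
example = ">seq1 demo record\nATGCATGC\nNNGTAC\n"
print(parse_fasta(example))
print(format_fasta("seq1", "ACGT" * 30), end="")
```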

GenBank flat file format is more detailed. It wraps the raw sequence in layers of annotation: the organism’s name, the gene’s function, the positions of coding regions, references to published papers, and more. If you just need the letters of a sequence, FASTA is easier to work with. If you need context about what those letters encode, GenBank format gives you the full picture. Most databases let you choose which format to download.

Specialized Databases for Specific Needs

Not every DNA sequence lives in GenBank. Depending on what you’re looking for, a specialized database may be more efficient.

BOLD Systems (barcodinglife.org) is built specifically for DNA barcoding, the practice of identifying species from short, standardized gene sequences. BOLD links every barcode sequence to its physical source specimen, creating a verified reference library. For animals, barcoding typically uses a region of the COI gene, and test studies show that more than 95% of species in varied animal groups carry distinctive COI sequences. If your goal is identifying a species from a tissue or environmental sample, BOLD is the place to search.

ORF Finder (ncbi.nlm.nih.gov/orffinder) helps when you have a raw DNA sequence and want to find potential protein-coding regions within it. You paste in your sequence, set a minimum length for open reading frames (options range from 30 to 600 nucleotides), choose the genetic code, and specify whether to look only for the standard ATG start codon or include alternatives. The tool returns each potential coding region along with its predicted protein translation, which you can then verify using BLAST.
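To make the idea of an open reading frame concrete, here is a simplified sketch of the kind of scan such a tool performs. It is not NCBI's implementation. It assumes the standard genetic code, ATG-only starts, and the three forward frames, ignoring the reverse strand, and the function name and test sequence are made up for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # stop codons of the standard genetic code


def find_orfs(sequence, min_length=30):
    """Find ATG-to-stop ORFs in the three forward reading frames.

    Returns (start, end, frame) tuples: start is 0-based, end is exclusive
    and includes the stop codon, frame is 1, 2, or 3.
    """
    seq = sequence.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon opens a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if i + 3 - start >= min_length:
                    orfs.append((start, i + 3, frame + 1))
                start = None  # look for the next ORF in this frame
    return sorted(orfs)


# Made-up sequence: 2 filler bases, ATG, nine GCT codons, TAA, 2 filler bases.
demo = "CC" + "ATG" + "GCT" * 9 + "TAA" + "GG"
for start, end, frame in find_orfs(demo):
    print(frame, start, end, demo[start:end])
```

Raising min_length removes short ORFs that are likely to occur by chance, which is the same trade-off the minimum-length setting controls in the web tool.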

If You Need to Generate New Sequence Data

When the sequence you need doesn’t exist in any database, it has to be determined through sequencing. Two broad technologies dominate.

Sanger sequencing is the older method, best suited for reading a single known gene or a short stretch of DNA. It produces highly accurate reads of up to roughly 1,000 bases but requires you to already know roughly where in the genome to look. It’s limited to detecting simple mutations like single-letter substitutions and small insertions or deletions.

Next-generation sequencing (NGS) works by breaking DNA into millions of small fragments and reading them all simultaneously. This parallel approach captures a far broader range of mutations, including rare variants present in only a small percentage of cells. Because each position in the genome gets read multiple times, you can increase accuracy simply by sequencing deeper. NGS is the standard for whole-genome projects, cancer genomics, and any situation where you don’t know exactly what you’re looking for. The cost of sequencing has dropped dramatically over the past two decades, making it feasible for individual labs and even direct-to-consumer services.
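A rough calculation shows why depth helps. The sketch below assumes a deliberately simplified model in which every read has the same error rate at a position, errors are independent, and the base is called by majority vote. Real variant callers also weigh base qualities, and systematic errors do not average out this way, so treat the numbers as intuition only. The function name and the 1% error rate are illustrative assumptions.

```python
from math import comb


def majority_error(per_read_error, depth):
    """Probability that more than half of `depth` independent reads are wrong."""
    return sum(
        comb(depth, k) * per_read_error**k * (1 - per_read_error)**(depth - k)
        for k in range(depth // 2 + 1, depth + 1)
    )


# Assumed 1% per-read error rate, evaluated at a few odd depths.
for depth in (1, 3, 11, 31):
    print(depth, majority_error(0.01, depth))
```

Under these assumptions, a single read is wrong 1% of the time, while three reads are outvoted by errors only about 0.03% of the time, and the figure keeps shrinking as depth grows.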

For most people searching for a known gene in a studied organism, though, the sequence is already sitting in a database waiting to be downloaded. Start with NCBI’s Nucleotide database, grab the RefSeq record if one exists, and export it in whatever format your downstream analysis requires.