How to Sequence a Protein: Methods and Steps

Protein sequencing determines the exact order of amino acids in a protein chain. The two primary methods used today are chemical degradation, which chips away one amino acid at a time from the end of the chain, and mass spectrometry, which breaks proteins into fragments and reads their masses to reconstruct the sequence. Most modern labs rely on mass spectrometry for its speed and sensitivity, but both approaches follow a logical workflow: prepare the protein, break it into manageable pieces, identify each piece, and assemble the full sequence.

Edman Degradation: The Classical Approach

Developed in the 1950s, Edman degradation was the original method for reading a protein’s sequence. It works by removing one amino acid at a time from the starting end (the N-terminus) of a protein or peptide chain. Each removed amino acid is then identified, giving you the sequence in order.

The chemistry happens in a repeating three-step cycle. First, a chemical called phenyl isothiocyanate (PITC) attaches to the exposed amino acid at the N-terminal end of the chain under mildly basic conditions, around pH 8.6. Second, a strong acid (trifluoroacetic acid) breaks the bond between that first amino acid and the rest of the chain, releasing it. Third, the released amino acid derivative is extracted with an organic solvent and converted into a stable form that can be identified using chromatography. With the first amino acid removed, the second amino acid is now exposed at the end, and the whole cycle repeats.
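
The repeating cycle can be pictured with a toy simulation. This is only a sketch of the logic, assuming an idealized run in which every cycle works perfectly; the function name edman_cycles is made up for illustration and no chemistry is modeled.

```python
def edman_cycles(peptide: str, max_cycles: int) -> list[str]:
    """Return residues in the order an idealized Edman run would release them."""
    released = []
    remaining = peptide
    for _ in range(max_cycles):
        if not remaining:
            break
        # Coupling and cleavage: the N-terminal residue is labeled and cut off.
        n_terminal, remaining = remaining[0], remaining[1:]
        # Conversion and identification: here we simply record the letter.
        released.append(n_terminal)
    return released


print("".join(edman_cycles("GAVLK", 3)))
```

Each pass through the loop exposes the next residue, which is why the sequence comes out strictly in N-terminal to C-terminal order.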

This method is reliable but has practical limits. Each cycle isn’t perfectly efficient, so after roughly 30 to 50 rounds, the signal becomes too noisy to read. For proteins longer than that, you need to first cut the protein into shorter peptide fragments, sequence each fragment separately, then figure out how they overlap to reconstruct the full sequence. Edman degradation also requires relatively pure, concentrated samples. It remains useful for confirming the identity of a protein’s starting end or verifying short sequences, but for large-scale work, mass spectrometry has largely taken over.
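
The read-length limit follows from simple compounding: if each cycle succeeds with some efficiency, the usable signal shrinks geometrically. The sketch below assumes a 95% per-cycle efficiency purely for illustration (that number is not from any measurement cited here), and the function name remaining_signal is hypothetical.

```python
def remaining_signal(cycle_efficiency: float, cycles: int) -> float:
    """Fraction of the original chains still in step after a number of cycles."""
    return cycle_efficiency ** cycles


# 0.95 is an assumed, illustrative per-cycle efficiency.
for n in (10, 30, 50):
    print(n, round(remaining_signal(0.95, n), 3))
```

Even with a seemingly high per-cycle efficiency, only a small fraction of chains remain in step after a few dozen rounds, while the out-of-step chains add background noise.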

Mass Spectrometry: The Modern Standard

Mass spectrometry-based sequencing, usually performed by tandem mass spectrometry (MS/MS), works on a fundamentally different principle. Instead of reading amino acids one at a time, it measures the precise mass of protein fragments and uses those masses to deduce the sequence. This approach can handle complex mixtures of proteins simultaneously and works with far smaller sample amounts, down to the attomole range (billionths of a billionth of a mole).
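
To get a feel for how little material an attomole is, the quick calculation below converts moles to a count of molecules. The helper name molecules is just for illustration.

```python
AVOGADRO = 6.02214076e23  # molecules per mole (exact by SI definition)
ATTOMOLE = 1e-18          # moles


def molecules(moles: float) -> float:
    """Convert an amount in moles to a number of molecules."""
    return moles * AVOGADRO


print(f"{molecules(ATTOMOLE):,.0f} molecules in one attomole")
```

One attomole works out to roughly 600,000 molecules.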

The most common workflow is called “bottom-up” proteomics, and it follows a clear sequence of steps.

Step 1: Digest the Protein Into Peptides

Proteins are too large for a mass spectrometer to sequence directly in this approach, so they’re first cut into shorter peptide fragments using enzymes. Trypsin is the most widely used enzyme because it cuts predictably, on the C-terminal side of the amino acids lysine and arginine (though generally not when the next amino acid is proline). This predictability makes it easier to reconstruct the original sequence later. For more complex proteins, labs may use additional enzymes with different cutting preferences, like chymotrypsin or elastase, to generate overlapping fragments that fill in sequence gaps.
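
Because trypsin’s cutting rule is so regular, a digest can be predicted on a computer. The sketch below is a minimal in-silico digest, assuming the simple rule "cut after K or R" plus the commonly cited exception that trypsin usually does not cut before proline. It ignores missed cleavages, and the function name tryptic_digest and the example sequence are made up.

```python
def tryptic_digest(protein: str) -> list[str]:
    """Cut after K or R, except when the next residue is P (a rule of thumb)."""
    peptides, start = [], 0
    for i, residue in enumerate(protein):
        next_residue = protein[i + 1] if i + 1 < len(protein) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        # Whatever follows the last cut site is the C-terminal peptide.
        peptides.append(protein[start:])
    return peptides


print(tryptic_digest("MKTAYRPGLKSE"))
```

In this made-up sequence the arginine is followed by proline, so the middle peptide stays in one piece.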

Antibody sequencing illustrates why multiple enzymes matter. One published method for sequencing monoclonal antibodies uses a panel of nine different enzymes with complementary cutting patterns, all run in parallel. Using all nine produced significantly better accuracy than using just one or even a smaller subset of four enzymes. The overlapping peptides from different digestions help confirm the complete sequence with high confidence.
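
The benefit of combining digests can be expressed as sequence coverage: the fraction of positions in the protein covered by at least one identified peptide. The sketch below uses a made-up sequence and hypothetical peptide lists (not data from the published nine-enzyme method) and assumes the protein sequence is already known, which is only true when checking coverage after the fact.

```python
def coverage(protein: str, peptides: list[str]) -> float:
    """Fraction of protein positions covered by at least one peptide."""
    covered = [False] * len(protein)
    for pep in peptides:
        start = protein.find(pep)
        while start != -1:
            for i in range(start, start + len(pep)):
                covered[i] = True
            start = protein.find(pep, start + 1)
    return sum(covered) / len(protein)


protein = "MKTAYRPGLKSE"            # made-up example sequence
digest_a = ["TAYRPGLK"]             # hypothetical peptides seen with enzyme A
digest_b = ["MKTAY", "GLKSE"]       # hypothetical peptides seen with enzyme B

print(coverage(protein, digest_a))
print(coverage(protein, digest_b))
print(coverage(protein, digest_a + digest_b))
```

Neither digest covers the whole sequence alone, but together they do, and the regions where the peptides overlap are what tie the pieces together.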

Step 2: Separate the Peptides

The enzyme digest produces a complex mixture of peptides. These are separated using liquid chromatography, which sorts them based on physical properties, most commonly hydrophobicity (how strongly they interact with water relative to the column material). Each fraction coming off the chromatography column may still contain 10 to 15 peptides, but that’s manageable for what comes next.

Step 3: Measure and Fragment in the Mass Spectrometer

The separated peptides flow directly into the mass spectrometer. In the first stage, the instrument measures the intact mass of each peptide. In the second stage (that’s the “tandem” part), individual peptides are selected and broken apart by collision with gas molecules. The peptide doesn’t shatter randomly. It tends to break at the bonds between amino acids, producing a ladder of fragment ions. Each fragment differs from the next by the mass of one amino acid. By reading the mass differences between these fragments, you can determine the amino acid sequence.
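
The fragment ladder can be computed directly from amino acid masses. The sketch below assumes singly charged fragments and the two most commonly discussed ion series: b ions (which contain the N-terminus) and y ions (which contain the C-terminus). The ion naming and the function names are not from the text above; the residue masses are standard monoisotopic values.

```python
# Monoisotopic residue masses in Daltons.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON = 1.007276
WATER = 18.010565


def b_ions(peptide: str) -> list[float]:
    """Singly charged N-terminal fragments, shortest first."""
    masses, total = [], 0.0
    for aa in peptide[:-1]:
        total += RESIDUE_MASS[aa]
        masses.append(total + PROTON)
    return masses


def y_ions(peptide: str) -> list[float]:
    """Singly charged C-terminal fragments, shortest first."""
    masses, total = [], 0.0
    for aa in reversed(peptide[1:]):
        total += RESIDUE_MASS[aa]
        masses.append(total + WATER + PROTON)
    return masses


b = b_ions("GAVK")
print([round(m, 4) for m in b])
# Neighboring rungs differ by exactly one residue mass.
print([round(hi - lo, 4) for lo, hi in zip(b, b[1:])])
```

The differences between neighboring rungs of the ladder are the residue masses themselves, which is what makes the sequence readable.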

Some instruments use multiple fragmentation methods on the same peptide to get more complete information. One approach combines two different fragmentation techniques: one that works well for standard peptides, and another that’s better at breaking apart peptides with higher electrical charges. Together, they produce complementary fragment patterns that make sequence determination more reliable.

Turning Raw Data Into a Sequence

A mass spectrometer produces thousands of spectra, each representing the fragmentation pattern of one peptide. Converting those spectra into amino acid sequences requires software, and there are two fundamentally different strategies.

Database searching is the more common approach. The software takes each experimental spectrum and compares it against theoretical spectra generated from every peptide in a known protein database. Programs like Mascot, X! Tandem, and MS-GF+ all work this way. If your protein is from a well-studied organism with a complete genome, this method is fast and accurate. The limitation is obvious: if the protein isn’t in the database, it can’t be found.
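
The core idea of a database search can be shown with a toy scorer. This is a deliberately simplified sketch, not how Mascot, X! Tandem, or MS-GF+ actually score spectra: it just counts how many theoretical fragments of each candidate peptide appear in the observed peak list. The four-residue mass table, the tolerance, the tiny "database", and all function names are assumptions for illustration.

```python
# Small subset of monoisotopic residue masses (Daltons), enough for the demo.
RESIDUE_MASS = {"G": 57.02146, "A": 71.03711, "V": 99.06841, "K": 128.09496}
PROTON, WATER = 1.007276, 18.010565


def theoretical_fragments(peptide: str) -> list[float]:
    """Singly charged b and y fragment masses for every backbone break."""
    frags = []
    for i in range(1, len(peptide)):
        b = sum(RESIDUE_MASS[aa] for aa in peptide[:i]) + PROTON
        y = sum(RESIDUE_MASS[aa] for aa in peptide[i:]) + WATER + PROTON
        frags.extend([b, y])
    return frags


def score(observed: list[float], peptide: str, tol: float = 0.01) -> int:
    """Count theoretical fragments that match some observed peak."""
    return sum(
        any(abs(obs - theo) <= tol for obs in observed)
        for theo in theoretical_fragments(peptide)
    )


def best_match(observed: list[float], database: list[str], tol: float = 0.01) -> str:
    """Return the database peptide with the highest score."""
    return max(database, key=lambda pep: score(observed, pep, tol))


observed = [58.0287, 129.0658, 147.1128, 218.1499]  # made-up peak list
print(best_match(observed, ["AGK", "GVK", "GAK"]))
print(best_match(observed, ["AGK", "GVK"]))  # the true peptide is missing here
```

The second call shows the limitation in miniature: when the right peptide is not among the candidates, the search can only return the best of the wrong answers.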

De novo sequencing takes the harder route, reading the amino acid sequence directly from the spectrum without any reference database. This is essential for novel proteins, engineered antibodies, or organisms without sequenced genomes. Older algorithms struggled with accuracy, but deep learning tools like DeepNovo, PointNovo, and newer models like PepNet have dramatically improved performance by training neural networks on millions of spectra to recognize fragmentation patterns associated with specific amino acid sequences.
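
Stripped to its simplest form, de novo sequencing means reading mass differences along a fragment ladder and looking each one up in a table of amino acid masses. The sketch below does exactly that and nothing more. It is not how DeepNovo, PointNovo, or PepNet work, and it assumes a clean, complete ladder, which real spectra rarely provide. The reduced mass table and the function name read_ladder are illustrative.

```python
# Subset of monoisotopic residue masses (Daltons). Leucine and isoleucine
# have identical masses, so a mass difference alone cannot tell them apart.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "V": 99.06841,
    "L/I": 113.08406, "Q": 128.05858, "K": 128.09496,
}


def read_ladder(ladder: list[float], tol: float = 0.01) -> list[str]:
    """Translate consecutive mass differences into residues ('?' if no match)."""
    residues = []
    for lighter, heavier in zip(ladder, ladder[1:]):
        delta = heavier - lighter
        matches = [aa for aa, m in RESIDUE_MASS.items() if abs(m - delta) <= tol]
        residues.append(matches[0] if matches else "?")
    return residues


# A made-up prefix ladder: a bare proton, then one rung per added residue.
ladder = [1.007276, 58.028736, 129.065846, 228.134256]
print(read_ladder(ladder))
```

Missing rungs, noise peaks, and near-identical masses such as Q and K are what make the real problem hard, and they are where learned models have helped most.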

Top-Down vs. Bottom-Up Approaches

The bottom-up workflow described above, where proteins are digested into peptides before analysis, is the most widely used. But it has a fundamental weakness called the “peptide-to-protein inference” problem. When you chop multiple related protein forms into peptides, many of those peptides are shared between different versions of the protein. You lose information about which modifications existed on which intact protein.
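
The inference problem is easy to see once peptides are mapped back to the protein forms they could have come from. The sketch below uses two hypothetical protein forms that differ in a single peptide; the names and sequences are invented for illustration.

```python
from collections import defaultdict


def map_peptides(forms: dict[str, list[str]]) -> dict[str, set[str]]:
    """Map each peptide to the set of protein forms that contain it."""
    owners = defaultdict(set)
    for name, peptides in forms.items():
        for pep in peptides:
            owners[pep].add(name)
    return dict(owners)


def shared_and_unique(forms: dict[str, list[str]]) -> tuple[set[str], set[str]]:
    """Split peptides into those shared between forms and those unique to one."""
    owners = map_peptides(forms)
    shared = {pep for pep, names in owners.items() if len(names) > 1}
    unique = {pep for pep, names in owners.items() if len(names) == 1}
    return shared, unique


# Hypothetical digests of two closely related protein forms.
forms = {
    "form_A": ["MK", "TAYR", "GLK"],
    "form_B": ["MK", "TAYR", "GLR"],
}
shared, unique = shared_and_unique(forms)
print("shared:", sorted(shared))
print("unique:", sorted(unique))
```

Detecting a shared peptide says nothing about which form was present; only the unique peptides discriminate, and if those go undetected the forms cannot be told apart.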

Top-down proteomics skips the digestion step entirely, feeding whole intact proteins into the mass spectrometer. This preserves the complete picture: you can see all the modifications on a single protein molecule and distinguish between closely related protein forms (called proteoforms) that would look identical after digestion. Top-down analysis is particularly valuable for identifying where chemical modifications sit on a protein and for quantifying how much of each protein variant is present in a sample. The tradeoff is that intact proteins are harder to separate, fragment, and analyze, so the technique requires more specialized instrumentation.

Detecting Chemical Modifications

Proteins in living cells are frequently modified after they’re built. Phosphorylation, where a phosphate group is added to an amino acid, is one of the most important modifications because it acts as an on/off switch for many cellular processes. Mass spectrometry detects phosphorylation by the characteristic mass increase of 80 Daltons it adds to a peptide. During fragmentation, phosphorylated peptides often lose the phosphate group (a loss of 80 or 98 mass units, depending on which amino acid was modified), which serves as a diagnostic signal.
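
Both signatures come down to looking for specific mass differences. The sketch below uses the monoisotopic masses of HPO3 (the roughly 80 Dalton addition) and H3PO4 (the roughly 98 Dalton loss), and assumes the precursor and fragment are compared as neutral masses. The tolerance, the example mass, and the function names are illustrative assumptions.

```python
HPO3 = 79.96633    # mass added by one phosphate group (Daltons)
H3PO4 = 97.97690   # phosphoric acid, i.e. HPO3 plus one water


def phosphorylated_mass(unmodified_mass: float, n_sites: int = 1) -> float:
    """Expected peptide mass after adding n phosphate groups."""
    return unmodified_mass + n_sites * HPO3


def classify_neutral_loss(precursor: float, fragment: float, tol: float = 0.01):
    """Name the phosphate-related neutral loss, or return None if neither fits."""
    loss = precursor - fragment
    if abs(loss - H3PO4) <= tol:
        return "loss of H3PO4 (about 98 Da)"
    if abs(loss - HPO3) <= tol:
        return "loss of HPO3 (about 80 Da)"
    return None


mass = phosphorylated_mass(1000.5)  # 1000.5 is a made-up unmodified mass
print(round(mass, 4))
print(classify_neutral_loss(mass, mass - 97.9769))
print(classify_neutral_loss(mass, mass - 79.9663))
print(classify_neutral_loss(mass, mass - 18.0106))
```

In practice, software runs this kind of check against every peak in a fragment spectrum rather than against a single pair of masses.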

Pinpointing exactly which amino acid carries the modification can be tricky, and so can identifying the modification itself. Sulfation, a different modification, adds nominally the same 80 Daltons (the two masses differ by only about 0.01 Dalton), making it easy to confuse with phosphorylation. Distinguishing between the two requires careful analysis of the fragmentation patterns and sometimes specialized fragmentation techniques.
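
How close the two modifications are becomes clear when their monoisotopic masses are compared: about 79.96633 Daltons for a phosphate group (HPO3) and about 79.95682 Daltons for a sulfate group (SO3). The sketch below estimates the mass accuracy, in parts per million, that an instrument would need to separate them on mass alone for a peptide of a given size. The function names and the 1,500 Dalton example are illustrative, and mass alone is often not the whole story.

```python
PHOSPHO = 79.96633  # HPO3, monoisotopic mass in Daltons
SULFO = 79.95682    # SO3, monoisotopic mass in Daltons


def ppm_needed(peptide_mass: float) -> float:
    """Mass difference between the two modifications, in ppm of the peptide mass."""
    return (PHOSPHO - SULFO) / peptide_mass * 1e6


def closest_modification(observed_shift: float) -> str:
    """Pick whichever modification mass is nearer to the observed shift."""
    candidates = {"phosphorylation": PHOSPHO, "sulfation": SULFO}
    return min(candidates, key=lambda name: abs(candidates[name] - observed_shift))


print(round(ppm_needed(1500.0), 2))
print(closest_modification(79.9660))
print(closest_modification(79.9570))
```

For a 1,500 Dalton peptide the gap is only a few parts per million, and it narrows further as peptides get larger, which is why fragmentation behavior is used as supporting evidence.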

Sample Preparation Matters

No sequencing method works well on a dirty sample. Before any sequencing can begin, the protein needs to be isolated and purified from whatever biological mixture it came from, whether that’s blood, cell culture, or tissue. There’s no single test that proves a protein sample is “pure.” Instead, purity is assessed indirectly by checking for specific types of contaminants and demonstrating their absence using techniques like gel electrophoresis or chromatography.

For mass spectrometry, contaminants like salts, detergents, and other proteins can suppress the signal or crowd out the peptides you’re trying to measure. Standard preparation includes unfolding the protein with a chemical denaturant, breaking any disulfide bonds that hold the protein’s shape together, and then digesting with trypsin or other enzymes. The reagents used in these steps must be cleaned out of the sample before it enters the mass spectrometer.

Nanopore Sequencing: A Technology in Progress

Nanopore sequencing revolutionized DNA analysis by reading single molecules in real time as they thread through a tiny pore. Adapting this technology for proteins is an active area of research, and several groups have demonstrated that nanopores can identify proteins by reading patterns as fragments or whole molecules pass through. However, true de novo protein sequencing, reconstructing a complete unknown sequence from scratch, remains an unsolved challenge. Proteins don’t have a uniform electrical charge like DNA, making it harder to control their movement through the pore at a steady, readable pace. Reconstructing full sequences from the signals nanopores produce still requires significant computational and experimental advances.