How to Identify a Protein: From Sample to Sequence

Identifying an unknown protein typically involves breaking it into smaller pieces, measuring those pieces with a mass spectrometer, and matching the results against a database of known protein sequences. This core workflow dominates modern protein identification, though several complementary techniques exist depending on what you already know about your sample and what equipment you have access to. The process combines wet-lab chemistry with computational analysis, and understanding both sides is essential.

Separating Proteins From a Complex Sample

Most biological samples contain thousands of different proteins mixed together. Before you can identify any single protein, you need to isolate it or at least reduce the complexity of the mixture. The most common first step is gel electrophoresis, a technique that separates proteins by size (and sometimes by charge) as they migrate through a gel under an electric field. Two-dimensional gel electrophoresis resolves proteins even further by first separating them by their isoelectric point, then by molecular weight, producing a grid of individual protein “spots” that can be cut out for analysis.

Once you’ve located a protein spot or band of interest on the gel, the next step is excising that section and extracting the protein. The basic workflow involves cutting the gel piece, removing the staining dye, eluting the protein from the gel matrix, and then cleaning up the sample. If a detergent was used during electrophoresis, it needs to be removed (acetone precipitation works well for this) before the protein can be analyzed further.

For high-throughput work, researchers often skip gel separation entirely. Instead, they digest the whole protein mixture in solution and separate the resulting peptides by liquid chromatography. This “shotgun” approach feeds directly into a mass spectrometer and can profile thousands of proteins in a single run.

Digesting Proteins Into Peptides

Whole proteins are too large for most identification instruments to analyze efficiently. Instead, they’re cut into shorter peptide fragments using an enzyme, almost always trypsin. Trypsin is the workhorse of proteomics because it cuts proteins at predictable locations, specifically after the amino acids lysine and arginine. This predictability is what makes database matching possible later: if you know a protein’s sequence, you can calculate exactly which peptide fragments trypsin would produce.
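The predictability of trypsin cleavage can be sketched in a few lines of Python. This is a minimal illustration, not a production digestion tool: the function name `tryptic_digest` and the example sequence are hypothetical, and the rule that trypsin does not cut when the next residue is proline is a widely used convention included here as an assumption.

```python
def tryptic_digest(sequence, missed_cleavages=0):
    """Return peptides from an in-silico trypsin digest of a protein sequence.

    Cleaves after lysine (K) or arginine (R). Assumption: no cleavage when
    the following residue is proline (P), a commonly applied convention.
    """
    seq = sequence.upper()
    pieces, start = [], 0
    for i, residue in enumerate(seq):
        next_residue = seq[i + 1] if i + 1 < len(seq) else ""
        if residue in "KR" and next_residue != "P":
            pieces.append(seq[start:i + 1])
            start = i + 1
    if start < len(seq):
        pieces.append(seq[start:])  # C-terminal peptide without a trailing K/R

    # Optionally join neighbouring pieces to model sites the enzyme missed.
    peptides = []
    for i in range(len(pieces)):
        last = min(i + missed_cleavages, len(pieces) - 1)
        for j in range(i, last + 1):
            peptides.append("".join(pieces[i:j + 1]))
    return peptides


# Hypothetical 8-residue sequence, used only for illustration.
print(tryptic_digest("AKRPGKTR"))
```

Allowing one or two missed cleavages is a common search setting because real digests are rarely complete.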

The digestion step sounds simple but has a major impact on results. The protein needs to be unfolded (denatured) so trypsin can access all cleavage sites, and the chemical environment matters. Certain detergents significantly improve efficiency. Sodium deoxycholate, for instance, enhances trypsin activity nearly fivefold and allows complete digestion in just five to seven hours. The buffer choice also matters if you plan to use chemical labeling for quantitative comparisons downstream.

Peptide Mass Fingerprinting

The simplest mass spectrometry approach to protein identification is peptide mass fingerprinting. After trypsin digestion, the resulting peptides are loaded into a mass spectrometer, most often a MALDI-TOF instrument, which measures the mass-to-charge ratio of each peptide with high precision. The collection of peptide masses forms a pattern, essentially a fingerprint, that is characteristic of the protein it came from.

That fingerprint is then compared against a database containing the theoretical peptide masses of every known protein. Software tools generate predicted trypsin fragments for each database entry and look for the best match. A protein can often be identified from as few as four or five matching peptide masses. The technique works best when the protein has already been isolated to reasonable purity, such as from a single gel spot, because a messy mixture of proteins produces overlapping fingerprints that are hard to untangle.
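The matching step can be sketched as follows. This is a simplified illustration, not how any particular search engine scores matches: it simply counts observed masses that fall within a tolerance of a theoretical peptide mass. The function names, the 50 ppm tolerance, and the two-protein database are hypothetical. The residue masses are standard monoisotopic values, and the observed masses are assumed to be already converted to neutral (uncharged) peptide masses.

```python
# Standard monoisotopic residue masses in daltons.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # added once per peptide for the free N- and C-termini


def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def count_matches(observed, theoretical, tol_ppm=50.0):
    """Count observed masses within tol_ppm of at least one theoretical mass."""
    hits = 0
    for obs in observed:
        if any(abs(obs - theo) / theo * 1e6 <= tol_ppm for theo in theoretical):
            hits += 1
    return hits


def rank_proteins(observed, database, tol_ppm=50.0):
    """Rank proteins by matched peptide masses, best first.

    `database` maps a protein name to its list of predicted tryptic peptides.
    """
    scores = {
        name: count_matches(observed, [peptide_mass(p) for p in peptides], tol_ppm)
        for name, peptides in database.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical database and measurements, for illustration only.
database = {"protA": ["AK", "RPGK", "TR"], "protB": ["GGK", "WWR"]}
observed = [peptide_mass("AK") + 0.001, peptide_mass("TR")]
print(rank_proteins(observed, database))
```

Real tools go further by weighing how likely each match is to occur by chance, which a raw hit count does not capture.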

Tandem Mass Spectrometry for Complex Mixtures

When you’re dealing with hundreds or thousands of proteins at once, peptide mass fingerprinting isn’t enough. Tandem mass spectrometry (MS/MS, usually coupled to liquid chromatography and then called LC-MS/MS) adds a second layer of analysis. In the first stage, the instrument measures peptide masses just like in fingerprinting. But then it selects individual peptides and fragments them further, producing a pattern that reveals the actual amino acid sequence of each peptide.

This sequence information makes identification far more specific. Even if two proteins produce peptides of similar mass, the sequences of those peptides will usually differ. Label-free shotgun proteomics, the most common version of this approach, can identify and quantify proteins across entire biological systems in a single experiment. Modern instruments have pushed detection limits into the attomole range, meaning they can identify proteins present in extraordinarily small quantities.
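To see why a fragmentation pattern encodes sequence, it helps to compute the fragment masses for a known peptide. The sketch below assumes the common case in which peptides break along the backbone into b-ions (N-terminal pieces) and y-ions (C-terminal pieces), each carrying a single charge. The article does not name these ion types, so treat that as an assumption, and the function name `fragment_ions` as hypothetical. Residue masses are standard monoisotopic values.

```python
# Standard monoisotopic residue masses in daltons.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.00728


def fragment_ions(peptide):
    """Return (b_ions, y_ions) as lists of singly charged m/z values.

    b_ions[k-1] contains the first k residues; y_ions[k-1] contains the
    last k residues plus one water. Both run from k = 1 to len(peptide) - 1.
    """
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b_ions = [sum(masses[:k]) + PROTON for k in range(1, len(masses))]
    y_ions = [sum(masses[-k:]) + WATER + PROTON for k in range(1, len(masses))]
    return b_ions, y_ions


# Hypothetical tripeptide, for illustration only.
b_ions, y_ions = fragment_ions("GAK")
# The gap between neighbouring b-ions equals one residue mass,
# which is how the sequence is read off the spectrum.
print(round(b_ions[1] - b_ions[0], 5))
```

The spacing between consecutive ions in a series equals the mass of one residue, so a ladder of peaks spells out the peptide one amino acid at a time. Leucine and isoleucine share the same mass and cannot be told apart this way.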

How Software Matches Spectra to Proteins

Raw mass spectrometry data is meaningless without software to interpret it. Two algorithms have dominated the field for decades. Sequest, developed in 1994, works by generating theoretical spectra from every protein sequence in a database, predicting where each protein’s peptides would produce signals, and finding the closest match to the experimental data. Mascot, introduced in 1999, takes a more statistical approach: instead of just finding the best match, it calculates the probability that each match occurred by chance, ranking results by how unlikely they are to be false positives.

Both tools search against large protein databases. UniProt is one of the most comprehensive, containing millions of sequences. Its UniRef datasets organize those sequences into clusters at different levels of similarity (100%, 90%, or 50% sequence identity). When searching these databases with a sequence or mass profile, results are ranked by an expectation value (E-value), which estimates how many matches of that quality would be expected by chance alone. An E-value below 0.1 generally indicates a significant match. Values between 0.1 and 10 are considered dubious, and anything above 10 is unlikely to be meaningful. Even strong statistical matches should be verified manually.
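The E-value thresholds above translate directly into a small triage helper. This is only a sketch of the rule of thumb stated in the text. The function name and category labels are hypothetical, and appropriate cutoffs can vary with the database and search tool.

```python
def classify_e_value(e_value):
    """Bucket a search hit using the rule-of-thumb cutoffs from the text."""
    if e_value < 0:
        raise ValueError("E-values cannot be negative")
    if e_value < 0.1:
        return "significant"
    if e_value <= 10:
        return "dubious"
    return "unlikely"


# Hypothetical hits, for illustration only.
for hit, e_value in [("hit_1", 3e-25), ("hit_2", 0.8), ("hit_3", 42.0)]:
    print(hit, classify_e_value(e_value))
```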

Sequencing a Protein Directly

Before mass spectrometry dominated the field, Edman degradation was the standard method for reading a protein’s amino acid sequence one residue at a time. The chemistry works by attaching a reactive molecule to the very first amino acid at one end of the protein chain (the N-terminus). When heated under acidic conditions, this tagged amino acid breaks away from the rest of the chain, and the process repeats with the next amino acid now exposed at the end.

Edman degradation is still used when you need a definitive sequence for a short stretch of protein, but it has significant limitations. Each cycle isn’t perfectly efficient, and errors accumulate, so it becomes unreliable after about 30 to 50 residues. If the chemical tag fails to complete its reaction in any given cycle, it produces a “skip” in the read. Newer approaches are exploring fluorescently labeled binding proteins that can read the N-terminal amino acid identity without relying on the same reactive chemistry, which could reduce certain types of errors.
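The way per-cycle losses compound can be illustrated with simple arithmetic. The sketch below assumes a constant efficiency per cycle, so the fraction of chains still in step after n cycles is the efficiency raised to the power n. The 98% figure and the 50% signal cutoff are illustrative assumptions, not values from the text, and the function names are hypothetical.

```python
import math


def remaining_signal(per_cycle_efficiency, cycles):
    """Fraction of chains still in phase after a number of cycles."""
    if not 0 < per_cycle_efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    return per_cycle_efficiency ** cycles


def max_reliable_cycles(per_cycle_efficiency, min_signal=0.5):
    """Largest cycle count that keeps the in-phase fraction >= min_signal."""
    if not 0 < per_cycle_efficiency < 1:
        raise ValueError("efficiency must be strictly between 0 and 1")
    return math.floor(math.log(min_signal) / math.log(per_cycle_efficiency))


# Assumed 98% efficiency per cycle, for illustration only.
for n in (10, 30, 50):
    print(n, round(remaining_signal(0.98, n), 3))
print(max_reliable_cycles(0.98))
```

With an assumed 98% efficiency, only about 36% of the starting material is still in step after 50 cycles, which shows why reads of this length become hard to interpret.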

Confirming Identity With Antibodies

When you have a specific protein in mind and want to confirm it’s present in your sample, antibody-based methods offer a targeted approach. Western blotting is the most common. Proteins are separated by size on a gel, transferred to a membrane, and then probed with an antibody designed to bind only your protein of interest. The result is a visible band at the expected molecular weight.

The strength of this method is its specificity: you’re asking “is this particular protein here?” rather than “what proteins are here?” But antibodies aren’t perfect. They sometimes bind to unintended targets, producing extra bands on the membrane. One way to verify you’re looking at the right band is to use genetic tools to suppress production of your target protein in a control sample. If a band disappears when the protein is knocked down, that confirms it was the correct target. Western blots are considered semiquantitative, meaning they can tell you roughly how much protein is present when you normalize against a stable reference protein that doesn’t change between experimental conditions.
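Normalizing against a reference protein is a simple ratio, sketched below. The band intensities are made-up numbers standing in for densitometry readings, and the function names are hypothetical. The sketch assumes both bands were measured within the linear range of detection.

```python
def normalized_abundance(target_intensity, reference_intensity):
    """Target band intensity relative to the reference (loading control) band."""
    if reference_intensity <= 0:
        raise ValueError("reference intensity must be positive")
    return target_intensity / reference_intensity


def fold_change(sample, control):
    """Relative abundance of the target in `sample` versus `control`.

    Each argument is a (target_intensity, reference_intensity) pair.
    """
    return normalized_abundance(*sample) / normalized_abundance(*control)


# Made-up intensities: a knockdown lane compared with an untreated lane.
knockdown = (20.0, 100.0)
untreated = (100.0, 100.0)
print(fold_change(knockdown, untreated))
```

A fold change well below 1 in the knockdown lane is the pattern you would expect if the band really is the target protein.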

Using Sequence Databases for Partial Information

You don’t always need a complete protein sequence to make an identification. If you have even partial sequence data, or just an amino acid composition, you can search against protein sequence databases to find matches. The BLAST algorithm, hosted by resources like UniProt and NCBI, compares your fragment against millions of known sequences and returns statistically ranked matches.

This is particularly powerful because protein families share conserved sequences across species. A partial sequence from an unknown protein might match a well-characterized protein in another organism, giving you a strong lead on its identity and function. The key is evaluating the statistical significance of the match. A low E-value combined with high sequence identity (the percentage of amino acids that are identical between your query and the database hit) gives confidence in the result. Searching against clustered databases at 90% or 50% identity can help identify more distant relatives when an exact match isn’t available.
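Sequence identity is straightforward to compute once two sequences have been aligned. The sketch below takes a pair of already aligned strings, with "-" marking gaps, and counts identical positions. Skipping gap columns in the denominator is an assumption made here for simplicity. Search tools may instead count gaps as part of the alignment length, so the numbers can differ slightly. The function name and sequences are hypothetical.

```python
def percent_identity(aligned_query, aligned_hit):
    """Percentage of identical residues over aligned, non-gap columns."""
    if len(aligned_query) != len(aligned_hit):
        raise ValueError("aligned sequences must have the same length")
    compared = identical = 0
    for q, h in zip(aligned_query.upper(), aligned_hit.upper()):
        if q == "-" or h == "-":
            continue  # assumption: gap columns are left out of the count
        compared += 1
        if q == h:
            identical += 1
    if compared == 0:
        raise ValueError("no aligned residue pairs to compare")
    return 100.0 * identical / compared


# Hypothetical aligned fragments, for illustration only.
print(round(percent_identity("ACDEFG", "ACDQFG"), 1))
```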

Choosing the Right Method

The best approach depends on your starting point. If you have a single purified protein band on a gel, peptide mass fingerprinting is fast and straightforward. If you’re profiling an entire cell or tissue, shotgun LC-MS/MS handles the complexity. If you already suspect which protein you’re looking at, a Western blot can confirm it with high specificity. And if you need the actual letter-by-letter sequence for a short stretch, Edman degradation or tandem mass spectrometry can provide it.

In practice, most identification projects combine multiple methods. A researcher might use gel electrophoresis to isolate a band, mass spectrometry to identify it, and a Western blot to confirm the result in follow-up experiments. Each technique answers a slightly different question, and together they build a convincing case for a protein’s identity.