What’s New in SignalP 6.0 for Signal Peptide Prediction?

Proteins perform their specific functions only after being directed to the correct location within a cell or secreted outside of it. To facilitate this movement, many proteins are tagged with short amino acid sequences that act like a cellular zip code. Identifying these tags is a fundamental step in understanding how cells operate, especially for proteins destined for the membrane or the external environment. Because these sorting signals are difficult to determine experimentally, computational prediction tools have been developed. SignalP is among the most widely used of these tools, and SignalP 6.0 offers improved accuracy and a broader scope of detection than its predecessors.

Understanding Signal Peptides and Protein Targeting

A signal peptide (SP) is a short sequence, usually located at the N-terminus of a newly synthesized protein, that acts as an instruction label for the cell’s transport machinery. The translocation machinery recognizes this sequence and guides the protein to its destination, such as the endoplasmic reticulum in eukaryotes or the cytoplasmic membrane in bacteria. Once the protein has been delivered, a dedicated enzyme called a signal peptidase often cleaves the signal peptide from the mature protein.

The canonical signal peptide has three distinct regions. The N-region is usually positively charged, which helps initiate interaction with the membrane. It is followed by the hydrophobic H-region, which inserts into the lipid bilayer during translocation. The C-region contains the recognition site for the signal peptidase, typically marked by small, neutral amino acids just upstream of the cleavage site. The precise location of the cleavage site is a primary prediction target for tools like SignalP, because it defines the start of the final, functional protein.
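The contrast between the three regions can be made concrete with a small sketch. It splits a signal peptide at two boundaries and reports a crude net charge and the mean Kyte-Doolittle hydropathy of each part. The example sequence and the boundary positions (`n_end`, `h_end`) are chosen by hand for illustration; this is not how SignalP locates the regions.

```python
# Kyte-Doolittle hydropathy scale (higher value = more hydrophobic).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def net_charge(seq):
    """Crude net charge: +1 per K/R, -1 per D/E (ignores His and termini)."""
    positive = sum(1 for aa in seq if aa in "KR")
    negative = sum(1 for aa in seq if aa in "DE")
    return positive - negative


def mean_hydropathy(seq):
    """Average Kyte-Doolittle hydropathy over a non-empty sequence."""
    return sum(KD[aa] for aa in seq) / len(seq)


def describe_regions(sp, n_end, h_end):
    """Split a signal peptide into n/h/c parts at hand-picked boundaries."""
    regions = {"n": sp[:n_end], "h": sp[n_end:h_end], "c": sp[h_end:]}
    return {
        name: {
            "seq": part,
            "charge": net_charge(part),
            "hydropathy": round(mean_hydropathy(part), 2),
        }
        for name, part in regions.items()
    }


# Illustrative signal peptide; the boundaries 3 and 16 are assumptions.
example_sp = "MKKTAIAIAVALAGFATVAQA"
summary = describe_regions(example_sp, n_end=3, h_end=16)

for name, info in summary.items():
    print(name, info["seq"], info["charge"], info["hydropathy"])
```

With these boundaries the n part carries the positive charge and the h part has the highest mean hydropathy, which is the pattern the three-region description predicts.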

The Computational Leap: Architecture of SignalP 6.0

The performance gain in SignalP 6.0 stems from a complete overhaul of its underlying machine learning architecture. Earlier versions relied on various methods, including Hidden Markov Models (HMMs) and recurrent neural networks (RNNs), which were sometimes limited in their ability to capture complex, long-range dependencies within a protein sequence. SignalP 6.0 instead builds on a Transformer-based protein language model, the same family of architecture that powers large language models for human text.

A protein language model is first pre-trained on a massive dataset consisting of hundreds of millions of unlabeled protein sequences. This process teaches the model the fundamental “grammar” and “semantics” of protein sequences, such as which amino acids typically appear next to each other and how distant residues might influence local structure. The pre-trained model is then fine-tuned on a much smaller, specialized dataset of known signal peptides to learn their specific characteristics.

The Transformer architecture is effective because it uses an attention mechanism, which allows the model to weigh the importance of every amino acid in the sequence when making a prediction for any single position. This ability to consider the entire sequence context simultaneously is a major advantage over older models that processed sequence information sequentially. The result is a model with superior generalization capability, meaning it performs better when analyzing protein sequences that are not closely related to those it was trained on.
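A minimal sketch of the attention idea, in plain Python: every position computes a weight for every other position and takes a weighted average of their vectors. This is single-head scaled dot-product attention with no learned projections, and the two-dimensional `toy_embedding` values are invented for illustration. It is far simpler than the model SignalP 6.0 actually uses.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def self_attention(vectors):
    """Scaled dot-product self-attention with Q = K = V = the input vectors."""
    dim = len(vectors[0])
    outputs, weights = [], []
    for query in vectors:
        # One score per position in the sequence, including the query itself.
        scores = [dot(query, key) / math.sqrt(dim) for key in vectors]
        row = softmax(scores)
        weights.append(row)
        # The output is a weighted average of all position vectors.
        outputs.append(
            [sum(w * vec[i] for w, vec in zip(row, vectors)) for i in range(dim)]
        )
    return outputs, weights


# Invented 2-dimensional embeddings; real models learn much larger ones.
toy_embedding = {
    "M": [0.5, 0.1],
    "K": [-1.0, 0.8],
    "L": [1.2, -0.3],
    "A": [0.7, 0.0],
}
sequence = "MKLLA"
outputs, weights = self_attention([toy_embedding[aa] for aa in sequence])

for aa, row in zip(sequence, weights):
    print(aa, [round(w, 2) for w in row])
```

Each printed row sums to 1 and spans the whole sequence, which is the sense in which every residue is weighed when predicting any single position.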

For the final prediction, the Transformer output is passed to a Conditional Random Field (CRF), which performs structured prediction. The CRF ensures that the sequence of per-residue predictions follows biologically plausible rules, for instance that the N-region transitions into the H-region and not the other way around. This hybrid approach improves both the detection of signal peptides and the localization of the cleavage site relative to SignalP 5.0, its predecessor.
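The effect of enforcing plausible label order can be sketched with Viterbi decoding over a hard transition constraint. The four-state label set (`n`, `h`, `c`, and `O` for the mature protein), the allowed transitions, and the emission scores below are simplifying assumptions; the CRF in SignalP 6.0 has more states and learned transition scores.

```python
import math

STATES = ["n", "h", "c", "O"]
# Assumed ordering rule: n -> h -> c -> O, with self-loops.
ALLOWED = {
    ("n", "n"), ("n", "h"), ("h", "h"), ("h", "c"),
    ("c", "c"), ("c", "O"), ("O", "O"),
}
START = {"n", "O"}  # a protein either starts a signal peptide or has none
NEG_INF = float("-inf")


def viterbi(emissions):
    """Best-scoring label path that respects ALLOWED.

    emissions: one dict per residue, mapping state -> log-score.
    """
    score = {s: (emissions[0][s] if s in START else NEG_INF) for s in STATES}
    backpointers = []
    for em in emissions[1:]:
        new_score, pointers = {}, {}
        for cur in STATES:
            best_prev, best = None, NEG_INF
            for prev in STATES:
                if (prev, cur) in ALLOWED and score[prev] > best:
                    best_prev, best = prev, score[prev]
            new_score[cur] = best + em[cur] if best_prev is not None else NEG_INF
            pointers[cur] = best_prev
        backpointers.append(pointers)
        score = new_score
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))


def emissions_from(labels, p_match=0.55, p_other=0.15):
    """Invented emissions: the given label is favoured at each position."""
    return [
        {s: math.log(p_match if s == lab else p_other) for s in STATES}
        for lab in labels
    ]


# Position 5 favours "n" after the h-region has started: an impossible order.
noisy = "nnhhnhhccOO"
emissions = emissions_from(noisy)
greedy = "".join(max(STATES, key=lambda s: em[s]) for em in emissions)
decoded = viterbi(emissions)
print(greedy)
print(decoded)
```

Picking the best label independently at each residue reproduces the impossible `n` inside the h-region, while the constrained decoder relabels that residue so the path stays in order.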

Expanded Prediction Capabilities

SignalP 6.0 is the first computational tool that can reliably differentiate and predict all five known types of signal peptides found across all domains of life. This expanded scope is especially valuable for researchers studying prokaryotes (Bacteria and Archaea), which use a wider variety of secretion pathways than eukaryotes. The model distinguishes between the following types:

Standard Sec-translocated signal peptides (Sec/SPI)
Lipoprotein signal peptides (Sec/SPII)
Tat-translocated signal peptides (Tat/SPI)
Tat lipoprotein signal peptides (Tat/SPII)
Pilin-like signal peptides (Sec/SPIII)

The ability to detect the two rarest types, Tat/SPII and Sec/SPIII, is a marked improvement: SignalP 5.0 predicted only Sec/SPI, Sec/SPII, and Tat/SPI, in part because too few annotated examples of the rarer types were available for training. The Tat pathway, for example, exports folded proteins and requires a distinct twin-arginine motif within its signal peptide. The model can now distinguish a standard Tat signal peptide (cleaved by Signal Peptidase I) from a Tat lipoprotein signal peptide (cleaved by Signal Peptidase II).
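To give a feel for the twin-arginine motif, here is a rough pattern search. It looks for the commonly cited consensus S/T-R-R-x-F-L-K near the N-terminus and falls back to a bare arginine pair. The 40-residue window and all three test sequences are assumptions made for this sketch; real Tat signal peptides often deviate from the consensus, which is one reason a learned model outperforms a fixed pattern.

```python
import re

# Commonly cited Tat consensus: S/T-R-R-x-F-L-K (x = any residue).
TAT_CONSENSUS = re.compile(r"[ST]RR.FLK")
# Much looser fallback: just two consecutive arginines.
TWIN_ARGININE = re.compile(r"RR")


def find_tat_motif(sequence, window=40):
    """Return (match kind, 0-based start) for the first hit, or None.

    Only the first `window` residues are searched; the window size is an
    assumption of this sketch, not a SignalP parameter.
    """
    n_terminal = sequence[:window].upper()
    strict = TAT_CONSENSUS.search(n_terminal)
    if strict:
        return ("consensus", strict.start())
    loose = TWIN_ARGININE.search(n_terminal)
    if loose:
        return ("twin-arginine only", loose.start())
    return None


# Invented sequences for illustration.
print(find_tat_motif("MSDKLTRRDFLKGAAALGAAAVAGSA"))
print(find_tat_motif("MKKRRLLAAVALA"))
print(find_tat_motif("MKKTAIAIAVALAGFATVAQA"))
```

A bare arginine pair is a weak signal on its own, so the second kind of hit is best read as a prompt for closer inspection and not as evidence of Tat export.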

Furthermore, the model introduces flexibility by no longer requiring the user to specify the organism type for prokaryotic sequences. While the user still selects between Eukarya and a generalized “Other” category (which encompasses all prokaryotes), the model can now accurately predict the various signal peptide types across Gram-positive bacteria, Gram-negative bacteria, and Archaea without prior taxonomic knowledge. This feature is particularly useful when analyzing metagenomic data, where the source organism for a given protein sequence is often unknown.

Beyond simple presence or absence, SignalP 6.0 predicts the positions of the biochemical sub-regions (N-, H-, and C-regions) for all predicted signal peptide types. This level of detail helps researchers understand the structural properties of the signal peptide being analyzed. Automatic region prediction also saves effort, since delineating these borders previously required expert manual inspection.
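Per-residue region labels are easier to read once they are collapsed into start and end positions. The sketch below does this for a label string; the single-letter labels and the example string are hypothetical and do not reproduce SignalP's actual output format.

```python
from itertools import groupby


def label_spans(labels):
    """Collapse a per-residue label string into (label, start, end) spans.

    Positions are 1-based and inclusive, as is usual for protein residues.
    """
    spans, position = [], 1
    for label, run in groupby(labels):
        length = len(list(run))
        spans.append((label, position, position + length - 1))
        position += length
    return spans


# Hypothetical labels: n/h/c regions, then "O" for the mature protein.
example_labels = "nnnhhhhhhhhhhhhhcccccOOOOOOOOO"
spans = label_spans(example_labels)

for label, start, end in spans:
    print(f"{label}-region: residues {start}-{end}")
```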

Practical Application and Interpretation of Results

To use SignalP 6.0, researchers typically submit their protein sequences in the standard FASTA format. Users select the appropriate organism group (Eukarya, or the generalized “Other” for all prokaryotes) and then choose between a “Fast” and a “Slow” prediction mode. The “Fast” mode uses a reduced-size version of the model to deliver accurate predictions quickly, which suits most large-scale analyses; the “Slow” mode runs the full model at a higher computational cost.
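For readers unfamiliar with FASTA, the format is simple enough to parse in a few lines: each record starts with a `>` header line, followed by one or more lines of sequence. The sketch below is a minimal reader for checking input before submission. The record names and sequences are invented, and taking the first word of the header as the identifier is a convention assumed here.

```python
def parse_fasta(text):
    """Yield (identifier, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            words = line[1:].split()
            # Assumed convention: the first word of the header is the ID.
            header, chunks = (words[0] if words else ""), []
        else:
            if header is None:
                raise ValueError("sequence data found before any '>' header")
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


# Invented records for illustration.
example_fasta = """>protein_1 hypothetical secreted protein
MKKTAIAIAVALAGFATVAQA
APKDNTWYT

>protein_2
msdkltrrdflkgaaalgaaavagsa
"""

records = list(parse_fasta(example_fasta))
for identifier, seq in records:
    print(identifier, len(seq))
```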

The output includes a summary table listing the protein identifier, the predicted signal peptide type, and the position of the cleavage site, which marks the boundary between the signal peptide and the mature protein. The tool also reports a probability for each possible signal peptide type. These probabilities let the user gauge the confidence of the primary prediction against the possibility of other secretion pathways.
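One simple way to weigh the primary prediction against the alternatives is to look at the gap between the top two probabilities. The sketch below does this for one protein. The probability values are invented, and the dictionary layout is an assumption for illustration, not SignalP's file format.

```python
def summarize(probabilities):
    """Rank class probabilities and report the top call and its margin.

    probabilities: dict mapping class name -> probability for one protein.
    """
    ranked = sorted(probabilities.items(), key=lambda item: item[1], reverse=True)
    (best, p_best), (runner_up, p_second) = ranked[0], ranked[1]
    return {
        "prediction": best,
        "probability": p_best,
        "runner_up": runner_up,
        "margin": round(p_best - p_second, 3),
    }


# Invented probabilities for a single protein; "Other" means no signal peptide.
example = {
    "Other": 0.02,
    "Sec/SPI": 0.08,
    "Sec/SPII": 0.85,
    "Tat/SPI": 0.01,
    "Tat/SPII": 0.03,
    "Sec/SPIII": 0.01,
}
result = summarize(example)
print(result)
```

A large margin suggests the alternatives can be set aside; a small one suggests the runner-up pathway deserves a second look.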

The most informative output for detailed analysis is the graphical representation, which plots the prediction scores across the length of the N-terminal sequence. This plot visualizes the three main scores: the signal peptide probability, the cleavage site probability, and the non-signal peptide probability. The signal peptide probability curve indicates the likelihood that a given amino acid belongs to a signal peptide.

The non-signal peptide probability curve shows the likelihood that the amino acid is part of the mature protein. The cleavage site probability is shown by a sharp peak, representing the most likely position where the signal peptidase will cut the protein. By examining where the signal peptide probability drops off and the cleavage site probability peaks, a user can confirm the prediction and accurately pinpoint the start of the functional protein.
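Reading the cleavage site off the plot amounts to finding the position where the cleavage site probability peaks. The sketch below does this with a list of per-residue probabilities and splits the sequence at the peak. The probability values are invented, and the convention that the peak marks the last residue of the signal peptide is an assumption of this example.

```python
def locate_cleavage(sequence, cs_prob):
    """Split a sequence at the peak of the cleavage-site probability.

    Assumption: cs_prob[i] is the probability that residue i (0-based)
    is the last residue of the signal peptide.
    Returns (signal peptide length, signal peptide, mature protein).
    """
    if len(sequence) != len(cs_prob):
        raise ValueError("need one probability per residue")
    last_sp_index = max(range(len(cs_prob)), key=lambda i: cs_prob[i])
    sp_length = last_sp_index + 1
    return sp_length, sequence[:sp_length], sequence[sp_length:]


# Illustrative sequence with invented probabilities: a minor bump at
# residue 19 and a sharp peak at residue 21 (1-based).
sequence = "MKKTAIAIAVALAGFATVAQAAPKDNTWYT"
cs_prob = [0.0] * len(sequence)
cs_prob[18] = 0.05
cs_prob[20] = 0.92

sp_length, signal_peptide, mature = locate_cleavage(sequence, cs_prob)
print(f"Cleavage between residues {sp_length} and {sp_length + 1}")
print("Signal peptide:", signal_peptide)
print("Mature protein:", mature)
```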