Self-attention is the core mechanism that allows transformer models (the architecture behind ChatGPT, Google Translate, and most modern AI) to understand language. It works by letting every word in a sentence look at every other word to figure out context and meaning. When you read “The bank was covered in grass,” your brain instantly knows “bank” means a riverbank, not a financial institution, because of the word “grass.” Self-attention gives AI models a similar ability, computing relationships between all words in a sequence simultaneously.
How Self-Attention Actually Works
Every word in a sentence gets converted into three vectors: a query, a key, and a value. Think of it like a search engine inside the model. The query is what a word is looking for. The key is what a word offers as context. The value is the actual information that gets passed along once a match is found.
For each word, the model compares its query against the keys of every other word in the sentence. Words with closely matching query-key pairs get high attention scores, meaning they’re considered highly relevant to each other. Those scores determine how much each word’s value contributes to the final representation. The result is a new, richer version of each word that now carries contextual information from the entire sentence.
The formula for this is called scaled dot-product attention. The model multiplies the query and key matrices together, scales the result down, passes it through a softmax function that converts the scores into probabilities, and then multiplies by the value matrix. The scaling step divides by the square root of the key vector’s dimension (typically 64 or 128 per head). Without it, the dot products between queries and keys grow large as dimensions increase, pushing the softmax into a saturated region where gradients become tiny and learning essentially stalls. Scaling keeps the math in a range where the model can train effectively.
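As a concrete illustration, here is a minimal NumPy sketch of scaled dot-product attention. The function and variable names are illustrative, not taken from any particular library, and the random matrices stand in for learned projections:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q, K: (seq_len, d_k); V: (seq_len, d_v)."""
    d_k = K.shape[-1]
    # Compare every query against every key, then scale by sqrt(d_k).
    scores = Q @ K.T / np.sqrt(d_k)
    # Softmax turns each row of scores into weights that sum to 1.
    weights = softmax(scores, axis=-1)
    # Each output row is a weighted average of the value vectors.
    return weights @ V, weights

# Toy example: a 4-"word" sequence with key dimension 8.
rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 8))
K = rng.standard_normal((4, 8))
V = rng.standard_normal((4, 8))
out, weights = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (4, 8): one context-enriched vector per word
```

Each row of `weights` shows how much one word attends to every word in the sequence, which is why the rows sum to one.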
Why It Replaced Recurrent Networks
Before transformers, the dominant approach for language tasks was recurrent neural networks (RNNs), which processed words one at a time, left to right. This created two problems. First, RNNs naturally weight recent words more heavily than distant ones, similar to a weighted moving average. Information from early in a long sentence gets diluted or “forgotten” by the time the model reaches the end. Second, strictly sequential processing meant each step had to wait for the one before it, so computation couldn’t be parallelized across the sequence, making training slow on modern hardware.
Self-attention solves both problems. It processes all words in parallel rather than sequentially, and every word can directly attend to every other word regardless of distance. A pronoun at the end of a paragraph can attend to its referent at the beginning just as easily as a neighboring word. This ability to capture long-range dependencies without information decay is a major reason transformers outperform RNNs on most language tasks.
Multiple Heads Capture Different Patterns
A single self-attention pass computes just one weighted average per position, but language has many simultaneous layers of structure. One word might need to attend to its grammatical subject, its syntactic modifier, and a semantically related word elsewhere in the sentence, all at the same time. A single set of attention weights can’t capture all of these relationships well.
Transformers solve this with multi-head attention. Instead of running one set of queries, keys, and values, the model splits them into multiple parallel “heads,” each operating independently. One head might learn to track grammatical relationships while another tracks semantic similarity or coreference. After each head computes its own attention pattern, the results are concatenated and combined through a final transformation. In practice, models like GPT use dozens of these heads per layer, stacked across many layers, giving them an enormous capacity to represent language structure.
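The splitting, per-head attention, and recombination steps can be sketched in NumPy. This assumes the model dimension divides evenly across heads, and the weight matrices here are random stand-ins for learned parameters:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, Wq, Wk, Wv, Wo, num_heads):
    """x: (seq_len, d_model); each W: (d_model, d_model)."""
    seq_len, d_model = x.shape
    d_head = d_model // num_heads  # assumes an even split

    # Project the input, then reshape so each head gets its own slice.
    def project(W):
        return (x @ W).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    Q, K, V = project(Wq), project(Wk), project(Wv)  # (heads, seq, d_head)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)
    weights = softmax(scores, axis=-1)   # one attention pattern per head
    heads = weights @ V                  # (heads, seq, d_head)
    # Concatenate the heads and mix them with a final transformation.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ Wo

rng = np.random.default_rng(0)
seq_len, d_model = 5, 16
x = rng.standard_normal((seq_len, d_model))
Ws = [rng.standard_normal((d_model, d_model)) for _ in range(4)]
out = multi_head_attention(x, *Ws, num_heads=4)
print(out.shape)  # (5, 16): same shape as the input, richer content
```

Because each head works in a smaller subspace, four heads of dimension 4 cost about the same as one head of dimension 16, which is why adding heads is nearly free.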
Why Position Information Must Be Added Separately
There’s a catch with self-attention: it has no built-in sense of word order. Because it computes attention scores between all pairs of words simultaneously, the output is the same regardless of how the words are arranged. “The dog chased the cat” and “The cat chased the dog” would produce identical representations without some additional signal telling the model which word came first.
RNNs encoded position naturally through their sequential processing. Transformers need it injected explicitly through positional encodings, numerical patterns added to each word’s initial representation that tell the model where it sits in the sequence. Without these encodings, a transformer would treat language as an unordered bag of words, losing the meaning that comes from syntax and word order.
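One common scheme, introduced in the original transformer paper, uses fixed sine and cosine waves at different frequencies; many later models learn the encodings instead. A NumPy sketch of the sinusoidal version, assuming an even model dimension:

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    """Fixed sin/cos positional encoding (assumes d_model is even)."""
    pos = np.arange(seq_len)[:, None]        # (seq_len, 1)
    i = np.arange(0, d_model, 2)[None, :]    # even dimension indices
    angles = pos / (10000 ** (i / d_model))  # lower dims oscillate faster
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions get sine
    pe[:, 1::2] = np.cos(angles)  # odd dimensions get cosine
    return pe

# The position pattern is simply added to each word's embedding
# before the first attention layer sees it.
seq_len, d_model = 6, 8
embeddings = np.zeros((seq_len, d_model))  # stand-in word embeddings
x = embeddings + sinusoidal_positions(seq_len, d_model)
print(x.shape)  # (6, 8)
```

Because each position produces a distinct pattern of values, two sentences with the same words in different orders now yield different inputs, restoring the order information attention alone discards.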
Masked Self-Attention in Text Generation
When a transformer generates text (like a chatbot writing a response), it produces one word at a time. During training, though, the model sees entire sentences at once for efficiency. This creates a problem: the model could “cheat” by looking at future words when predicting the current one.
Masked self-attention prevents this. It applies a triangular mask to the attention scores so that each word can attend only to itself and the words that came before it, never to words that come after. For the sentence “The cat sat on the mat,” the word “sat” can attend to “The,” “cat,” and itself, but not to “on,” “the,” or “mat.” The mask works by setting all future-word attention scores to negative infinity before the probability calculation, which effectively zeroes them out. This ensures the model learns to predict each word using only the preceding context, exactly matching the conditions it will face when generating text in real time.
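The masking step takes only a few lines of NumPy. The uniform raw scores below are a stand-in chosen to make the resulting weights easy to read:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def causal_attention_weights(scores):
    """Mask out future positions before the softmax."""
    seq_len = scores.shape[0]
    # True above the diagonal: those entries point at future words.
    mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    masked = np.where(mask, -np.inf, scores)  # exp(-inf) -> weight 0
    return softmax(masked, axis=-1)

scores = np.zeros((4, 4))  # uniform raw scores for illustration
w = causal_attention_weights(scores)
# Row i spreads its weight evenly over positions 0..i only:
#   row 0: [1, 0, 0, 0]
#   row 1: [0.5, 0.5, 0, 0]
#   row 2: [0.33, 0.33, 0.33, 0]
#   row 3: [0.25, 0.25, 0.25, 0.25]
print(np.round(w, 2))
```

The strictly triangular weight matrix is the “mask” in masked self-attention: every entry above the diagonal, and hence every look into the future, is forced to zero.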
Self-Attention vs. Cross-Attention
Self-attention computes relationships within a single sequence: a sentence attending to itself. Cross-attention is a related mechanism where one sequence attends to a different sequence. In machine translation, for example, the decoder (generating the output language) uses cross-attention to look back at the encoder’s representation of the input language. The decoder’s words provide the queries, while the encoder’s output provides the keys and values. This lets each generated word in the translation “consult” the most relevant parts of the original sentence.
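The only change from self-attention is where the queries, keys, and values come from. A NumPy sketch (shapes and weight matrices are illustrative; in a real model these states come from earlier layers):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(decoder_states, encoder_states, Wq, Wk, Wv):
    """Queries from the decoder; keys and values from the encoder."""
    Q = decoder_states @ Wq   # (dec_len, d)
    K = encoder_states @ Wk   # (enc_len, d)
    V = encoder_states @ Wv   # (enc_len, d)
    weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]), axis=-1)
    # Each decoder position receives a weighted mix of encoder values.
    return weights @ V

rng = np.random.default_rng(0)
d = 8
decoder_states = rng.standard_normal((3, d))  # 3 output words so far
encoder_states = rng.standard_normal((5, d))  # 5 input words
W = [rng.standard_normal((d, d)) for _ in range(3)]
out = cross_attention(decoder_states, encoder_states, *W)
print(out.shape)  # (3, 8): one context vector per decoder position
```

Note that the two sequences can have different lengths: a 3-word partial translation can attend over a 5-word source sentence without any special handling.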
The Quadratic Cost of Self-Attention
Self-attention’s biggest practical limitation is its computational cost. Because every word must be compared with every other word, both time and memory scale quadratically with sequence length. Double the input length and the cost roughly quadruples. For a 1,000-word input, the model computes 1 million attention scores per head per layer. For 10,000 words, that jumps to 100 million.
This quadratic scaling is why early transformer models had strict input limits (often 512 or 1,024 tokens) and why newer models with 100,000+ token context windows require significant engineering to make affordable. Researchers have developed approximations that reduce self-attention to linear time by using mathematical shortcuts, but these come with trade-offs in accuracy. The full quadratic version remains the standard in most production models because the quality difference matters for complex language understanding.
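The arithmetic above can be made concrete with a back-of-envelope calculation. The 16-bit float size is an assumption, and real systems use tricks (such as attention kernels that never materialize the full score matrix) to soften these numbers:

```python
# Back-of-envelope cost of the attention-score matrix
# for ONE head in ONE layer, assuming 16-bit floats.
BYTES_PER_SCORE = 2

for n in (1_000, 2_000, 10_000, 100_000):
    scores = n * n  # every token attends to every token
    mem_mb = scores * BYTES_PER_SCORE / 1e6
    print(f"{n:>7,} tokens: {scores:>14,} scores  (~{mem_mb:,.0f} MB)")

# Doubling the length (1,000 -> 2,000) quadruples the score count,
# and 100,000 tokens needs ~20 GB per head per layer if stored naively.
```

Multiply those figures by dozens of heads and dozens of layers, and the engineering effort behind long-context models becomes easy to appreciate.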

