Cross-attention is a mechanism in neural networks that lets one sequence of data “look at” a different sequence to find relevant information. It works by comparing elements from two distinct sources, like a translated sentence checking back against the original sentence to decide which words matter most for the next output. This distinguishes it from self-attention, where a single sequence only looks at itself.
How Cross-Attention Works
All attention mechanisms rely on three components: queries, keys, and values. Think of it like a library search. You have a question (the query), a catalog of book titles (the keys), and the actual books on the shelves (the values). You compare your question against every title in the catalog, figure out which titles are most relevant, then pull information from those corresponding books.
In self-attention, the queries, keys, and values all come from the same input. A sentence looks at itself to figure out which of its own words relate to each other. In cross-attention, the queries come from one source and the keys and values come from a different source. This is the core distinction: cross-attention connects two separate streams of information.
The math behind both is identical. Each query is compared against every key with a dot product, producing a raw relevance score. Those scores are divided by the square root of the key dimension to keep their magnitude in check; without scaling, large dot products push the softmax into saturated regions where gradients are near zero and learning stalls. A softmax then converts the scaled scores into weights between 0 and 1 that sum to 1 across the key sequence. Finally, those weights are applied to the values, producing a weighted blend of information from the second source, concentrated on whatever is most relevant to each query.
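The steps above translate almost line for line into code. This is a minimal NumPy sketch of scaled dot-product attention; the shapes and random inputs are illustrative, not from any particular model.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(Q, K, V):
    """Scaled dot-product attention.

    Q: (T, d_k)       queries from one sequence
    K: (T_prime, d_k) keys from the other sequence
    V: (T_prime, d_v) values from the other sequence
    Returns (T, d_v): one blended value vector per query position.
    """
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (T, T') raw relevance scores
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V                   # weighted blend of values

rng = np.random.default_rng(0)
T, T_prime, d_k, d_v = 4, 6, 8, 16
Q = rng.normal(size=(T, d_k))
K = rng.normal(size=(T_prime, d_k))
V = rng.normal(size=(T_prime, d_v))
out = cross_attention(Q, K, V)
print(out.shape)  # (4, 16)
```

Because `softmax` normalizes each row of the score matrix, every output vector is a convex combination of the value vectors.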
Where It Sits in a Transformer
The original Transformer architecture, designed for machine translation, has an encoder that processes the input sentence and a decoder that generates the output sentence one token at a time. Each decoder layer contains three sublayers in order: a masked self-attention layer, a cross-attention layer, and a feedforward network.
The masked self-attention lets the decoder look at what it has generated so far. The cross-attention layer is where the decoder checks back against the encoder’s output. Specifically, the decoder’s internal states become the queries, while the encoder’s output vectors serve as both keys and values. This means the decoder is asking: “Given what I’m about to generate next, which parts of the input sentence should I pay attention to?”
Because the encoder has already finished processing the entire input before the decoder starts, there’s no need for masking in the cross-attention layer. The decoder can freely attend to any position in the encoder’s output.
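The wiring described above can be sketched as follows. The encoder output, decoder states, and projection matrices here are random stand-ins for learned quantities; the point is only which tensor feeds which role.

```python
import numpy as np

rng = np.random.default_rng(1)
d_model = 32
src_len, tgt_len = 10, 7   # input sentence length, tokens generated so far

# Stand-ins for real activations: the encoder has finished its pass,
# the decoder is partway through generation.
encoder_output = rng.normal(size=(src_len, d_model))
decoder_states = rng.normal(size=(tgt_len, d_model))

# Learned projection matrices (randomly initialized for illustration).
W_q = rng.normal(size=(d_model, d_model))
W_k = rng.normal(size=(d_model, d_model))
W_v = rng.normal(size=(d_model, d_model))

Q = decoder_states @ W_q   # queries come from the decoder
K = encoder_output @ W_k   # keys come from the encoder
V = encoder_output @ W_v   # values come from the encoder

scores = Q @ K.T / np.sqrt(d_model)                     # (tgt_len, src_len)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)          # softmax, no mask

# Every decoder position may attend to every encoder position.
context = weights @ V                                   # (tgt_len, d_model)
```

Note the absence of any masking step: unlike the decoder's self-attention, the score matrix is used as-is.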
Dimension Requirements
The two input sequences in cross-attention don’t need to be the same length. The query sequence can have T tokens and the key/value sequence can have T’ tokens, and those numbers can differ. What does need to match is the dimension of the query and key vectors, since they’re compared through a dot product. If your queries have 64 dimensions, your keys must also have 64 dimensions. The value vectors can have a different dimension entirely, which determines the size of the output.
The resulting attention matrix has dimensions T by T’, meaning every position in the query sequence gets a set of weights over every position in the key/value sequence. The final output has one vector per query position, each being a weighted combination of the value vectors.
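A quick shape check makes these constraints concrete. The specific sizes below are arbitrary; only the relationships between them matter.

```python
import numpy as np

rng = np.random.default_rng(2)
T, T_prime = 5, 9     # the two sequence lengths may differ
d_k, d_v = 64, 32     # query/key dims must match; value dim is independent

Q = rng.normal(size=(T, d_k))
K = rng.normal(size=(T_prime, d_k))
V = rng.normal(size=(T_prime, d_v))

scores = Q @ K.T / np.sqrt(d_k)    # attention matrix: (T, T') = (5, 9)
w = np.exp(scores - scores.max(axis=-1, keepdims=True))
w /= w.sum(axis=-1, keepdims=True)
out = w @ V                        # (T, d_v) = (5, 32): one vector per query

assert scores.shape == (T, T_prime)
assert out.shape == (T, d_v)
```

If the key dimension did not match the query dimension, the `Q @ K.T` product would fail with a shape error, which is exactly the constraint described above.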
The Classic Example: Translation
Machine translation makes cross-attention intuitive. Suppose you’re translating English to French. The encoder reads the full English sentence and produces a rich representation of each English word in context. As the decoder generates each French word, cross-attention lets it focus on the most relevant English words. When generating a French verb, the mechanism might assign high weight to the corresponding English verb and its subject, while largely ignoring other parts of the sentence.
This selective focus is what makes cross-attention so powerful. Rather than compressing an entire input into a single fixed-size vector (which older sequence-to-sequence models did), cross-attention preserves access to every position in the input, letting the model dynamically retrieve what it needs at each step of generation.
Applications Beyond Text
Cross-attention has become a fundamental building block well beyond translation. Any task where two different types of information need to interact is a natural fit.
- Vision-language models use cross-attention to connect image features with text. When generating a caption for a photo, the text decoder attends to different regions of the image at each word. When answering a question about an image, the question tokens query against visual features to locate the relevant part of the scene.
- Text-to-image generation in diffusion models like Stable Diffusion uses cross-attention to let the image generation process attend to the text prompt. The noisy image features serve as queries, while the text embedding provides keys and values. This is how the model knows where to place objects described in the prompt.
- Medical imaging applies cross-attention to align images from different modalities. One published approach uses it to register MRI and ultrasound volumes for prostate cancer biopsy, capturing the correspondence between features extracted from each imaging type to align them accurately.
- Speech recognition uses cross-attention to let the text decoder attend to audio features, focusing on the relevant portion of the audio signal as it transcribes each word.
Cross-Attention vs. Self-Attention
Self-attention is about understanding relationships within a single input. When reading a paragraph, self-attention helps the model figure out that “it” in sentence three refers to “the car” in sentence one. Every token compares against every other token in the same sequence.
Cross-attention is about building a bridge between two inputs. The queries from one source search through the keys of another source to find what’s relevant, then pull corresponding values. The two sources can be different lengths, different modalities (text and image, audio and text), or different stages of processing (encoder output and decoder state).
In practice, many modern architectures use both. A decoder-only model like GPT uses only self-attention, since it processes a single stream of tokens. But any model that needs to combine information from two distinct sources, whether that’s an encoder-decoder Transformer, a multimodal system, or a retrieval-augmented model, relies on cross-attention to make that connection.
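Since the two mechanisms share their math, one attention function can serve both; only the call site changes. A toy sketch, with random features standing in for real text tokens and image patches:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(q_source, kv_source):
    # Identical math either way; only the inputs differ.
    scores = q_source @ kv_source.T / np.sqrt(q_source.shape[-1])
    return softmax(scores) @ kv_source

rng = np.random.default_rng(3)
text = rng.normal(size=(6, 16))    # stand-in for 6 token features
image = rng.normal(size=(20, 16))  # stand-in for 20 image-patch features

self_out = attention(text, text)    # self-attention: one sequence
cross_out = attention(text, image)  # cross-attention: two sequences
print(self_out.shape, cross_out.shape)  # (6, 16) (6, 16)
```

In both calls the output has one vector per query position; what changes is where that information is retrieved from.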

