CBOW, short for Continuous Bag of Words, is a neural network architecture that learns to represent words as numerical vectors by predicting a target word from the words surrounding it in a sentence. It was introduced in 2013 by Tomas Mikolov and colleagues at Google as one of two models inside the Word2Vec framework, the other being Skip-gram. CBOW remains one of the most widely used methods for generating word embeddings, the dense numerical representations that allow computers to work with language mathematically.
How CBOW Works
The core idea behind CBOW is simple: if you remove a word from a sentence and feed the surrounding words into a neural network, the network should be able to guess the missing word. Take the sentence “you should subscribe to my channel.” If the target word is “subscribe,” the model receives “you,” “should,” “to,” and “my” as input and tries to predict “subscribe” as the output.
The network itself has three layers. The input layer has one neuron for every word in the vocabulary, so each word can be fed in as a one-hot vector. The hidden layer is much smaller, with a size you choose (this becomes the dimension of your word vectors). The output layer matches the input layer in size, again one neuron per vocabulary word. During training, the surrounding context words are fed in, their hidden-layer representations are averaged together, and the network outputs a probability for every word in the vocabulary. The goal is to push the probability of the correct target word as high as possible.
The real product of CBOW isn’t the predictions themselves. It’s the weights between the input and hidden layers. Once training finishes, those weights become the word embeddings. Each word gets a vector of numbers that captures something about its meaning based on the contexts it appeared in. Words that show up in similar contexts end up with similar vectors, which is why word embeddings can capture relationships like “king is to queen as man is to woman.”
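The forward pass and the embedding extraction can be sketched in a few lines of numpy. This is a toy illustration with made-up dimensions and an untrained, randomly initialized network, not the real Word2Vec implementation; the vocabulary, weight scales, and variable names are all assumptions for the example.

```python
import numpy as np

# Toy CBOW forward pass: average the context vectors in the hidden
# layer, then score every vocabulary word. Weights are random, so the
# prediction is meaningless until trained; the point is the shapes.
rng = np.random.default_rng(0)

vocab = ["you", "should", "subscribe", "to", "my", "channel"]
word_to_id = {w: i for i, w in enumerate(vocab)}
V = len(vocab)   # vocabulary size (width of input and output layers)
D = 4            # hidden-layer size = embedding dimension

W_in = rng.normal(scale=0.1, size=(V, D))   # input -> hidden weights
W_out = rng.normal(scale=0.1, size=(D, V))  # hidden -> output weights

def cbow_forward(context_words):
    """Average the context word vectors, return a softmax over the vocab."""
    ids = [word_to_id[w] for w in context_words]
    h = W_in[ids].mean(axis=0)           # hidden layer: mean of context rows
    scores = h @ W_out                   # one score per vocabulary word
    exp = np.exp(scores - scores.max())  # numerically stable softmax
    return exp / exp.sum()

probs = cbow_forward(["you", "should", "to", "my"])

# After training, the rows of W_in are the word embeddings:
embedding_of_subscribe = W_in[word_to_id["subscribe"]]
```

The last line is the payoff the section describes: the trained input-to-hidden weight matrix, read row by row, is the embedding table.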
The “Bag of Words” Part
The name gives away an important limitation. “Bag of words” means the model treats context words as an unordered set. Feeding in “the cat sat on” produces the same result as “on sat cat the.” CBOW doesn’t care about word order, only which words are present in the context window.
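The order-blindness falls directly out of the averaging step: a mean over a set of vectors is the same no matter how the set is ordered. A tiny sketch with made-up two-dimensional vectors (the values are arbitrary, chosen only for illustration):

```python
import numpy as np

# The hidden layer is a mean over the context vectors, so any
# reordering of the context produces an identical hidden state.
vectors = {
    "the": np.array([1.0, 0.0]),
    "cat": np.array([0.0, 1.0]),
    "sat": np.array([1.0, 1.0]),
    "on":  np.array([0.5, 0.5]),
}

def hidden(context):
    return np.mean([vectors[w] for w in context], axis=0)

h1 = hidden(["the", "cat", "sat", "on"])
h2 = hidden(["on", "sat", "cat", "the"])
print(np.allclose(h1, h2))  # True: word order is invisible to CBOW
```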
This sounds like a fatal flaw, but in practice it works surprisingly well. Research from MIT found that CBOW encodings could reach 70% accuracy on a word order prediction task, 20 percentage points above baseline, despite never explicitly learning word order. The explanation is that natural language has strong statistical patterns in which words co-occur, and CBOW captures those patterns even without tracking position. Still, this is a genuine ceiling on what CBOW can represent, and it’s one reason newer models like transformers (which do encode word position) have taken over for tasks requiring deeper language understanding.
Context Window and Vector Size
Two settings have the biggest impact on the quality of CBOW embeddings: the context window size and the vector dimension.
The context window determines how many words on each side of the target word the model looks at. A window of 5 means the model uses the five words before and five words after the target. Research comparing different window sizes found that CBOW tends to perform best with a window of about 7, while Skip-gram peaks around 5. Larger windows capture broader topical relationships; smaller windows focus on tighter syntactic patterns.
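The way a window size turns a sentence into training examples can be sketched as follows. This helper is an illustration, not Word2Vec's actual preprocessing (which also applies subsampling and, in some implementations, randomly shrinks the window per example):

```python
# Sketch: generate (context, target) training pairs from a token list.
# `window` counts words on each side of the target; near the sentence
# edges the context is simply truncated.
def training_pairs(tokens, window=2):
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = tokens[lo:i] + tokens[i + 1:hi]
        pairs.append((context, target))
    return pairs

sentence = "you should subscribe to my channel".split()
pairs = training_pairs(sentence, window=2)
for context, target in pairs:
    print(f"{context} -> {target}")
```

With `window=2`, the target "subscribe" gets the context `["you", "should", "to", "my"]`, matching the example from the opening section.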
Vector dimensions between 50 and 500 are standard. Dimensions below 50 consistently produce poor-quality embeddings. Going above 150 often yields diminishing returns for classification tasks, and pushing past 400 can actually degrade quality as the model starts to overfit. A dimension of 150 is a practical starting point for most applications, though some specialized tasks (like classifying social media posts) have benefited from dimensions as high as 800 when paired with large training corpora.
Making Training Practical
The output layer creates a computational problem. For every training example, the network needs to compute a score, and then a normalized probability, for every word in the vocabulary. If your vocabulary has 100,000 words, that means 100,000 output calculations per update. Scale that to billions of training examples and training becomes impractical.
Two techniques solve this. Negative sampling replaces the full vocabulary calculation with a simpler task: instead of asking “which of all 100,000 words is correct?”, it asks “is this the real target word or a randomly chosen fake one?” The model only needs to compare the true word against a small handful of random “negative” samples, typically 5 to 20, reducing the work per update dramatically. Hierarchical softmax takes a different approach, organizing the vocabulary into a tree structure so the model only needs to make a series of binary choices rather than one massive comparison. Negative sampling became the default choice because it’s simpler to implement and produces high-quality embeddings when the vectors themselves are the goal.
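The negative-sampling objective can be sketched like this. It is a simplified illustration: real Word2Vec draws negatives from the unigram distribution raised to the 0.75 power rather than uniformly, and the dimensions and variable names here are assumptions for the example.

```python
import numpy as np

# Sketch of the negative-sampling loss: score the true target against
# k random "negative" words with independent binary (sigmoid) decisions,
# instead of a softmax over the full vocabulary.
rng = np.random.default_rng(0)
V, D, k = 10_000, 100, 5      # vocab size, embedding dim, negatives

W_in = rng.normal(scale=0.1, size=(V, D))   # input embeddings
W_out = rng.normal(scale=0.1, size=(V, D))  # output embeddings

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_loss(context_ids, target_id):
    h = W_in[context_ids].mean(axis=0)       # CBOW hidden layer
    negatives = rng.integers(0, V, size=k)   # uniform here; Word2Vec
                                             # samples by frequency**0.75
    pos = -np.log(sigmoid(W_out[target_id] @ h))       # "real word" term
    neg = -np.log(sigmoid(-(W_out[negatives] @ h))).sum()  # "fake" terms
    return pos + neg          # k + 1 dot products instead of V

loss = negative_sampling_loss(context_ids=[3, 7, 42, 99], target_id=57)
```

The key line is the last comment: each update touches k + 1 output vectors rather than all V of them, which is where the speedup comes from.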
CBOW vs. Skip-gram
CBOW and Skip-gram are mirror images of each other. CBOW predicts a target word from its context. Skip-gram predicts context words from a single target word. This structural difference creates practical tradeoffs.
CBOW’s biggest advantage is speed. In benchmarks using 48 processing threads, training Skip-gram for the same number of passes over the data took 3.31 times as long as training CBOW. This makes CBOW the better choice when you need embeddings quickly or when you’re working with very large datasets. Skip-gram has traditionally been considered better at capturing rare words, since it generates more training examples per word occurrence. However, a 2021 study published through the Association for Computational Linguistics found that a corrected implementation of CBOW performs as well as Skip-gram on standard benchmarks, suggesting the quality gap was partly due to implementation details rather than fundamental architectural differences.
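One way to see where the speed gap comes from: at each corpus position, CBOW makes a single prediction (the target from its averaged context), while Skip-gram makes one prediction per context word, up to 2c of them. A naive count, assuming full windows everywhere and ignoring sentence boundaries and subsampling (my simplifications, not the paper's benchmark setup):

```python
# Rough update-count comparison for a corpus of T tokens and a window
# of c words on each side. Assumes every position has a full window.
def cbow_updates(T, c):
    return T             # one target prediction per corpus position

def skipgram_updates(T, c):
    return T * 2 * c     # one prediction per (target, context word) pair

T, c = 1_000_000, 5
print(cbow_updates(T, c))      # 1000000
print(skipgram_updates(T, c))  # 10000000
```

The measured 3.31x gap is smaller than this naive 2c ratio because per-update cost is not identical between the two models and real windows are often truncated, but the direction of the tradeoff is the same.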
Where CBOW Fits Today
CBOW was a breakthrough when it launched in 2013, proving that simple neural networks trained on raw text could produce word vectors that captured meaningful semantic relationships. The embeddings it generates are still used as input features for text classification, sentiment analysis, information retrieval, and other tasks where you need a fixed-size numerical representation of a word.
Its main limitation is that each word gets exactly one vector regardless of context. The word “bank” gets the same embedding whether you’re talking about a river bank or a financial bank. Contextual models like BERT and GPT generate different representations for the same word depending on the sentence it appears in, which is why they’ve replaced Word2Vec for tasks requiring nuanced language understanding. But for applications where speed, simplicity, and interpretability matter, CBOW remains a practical and well-understood tool. Training takes minutes rather than days, the resulting vectors are compact and easy to store, and the math behind the model is transparent enough to debug when something goes wrong.

