What Is BM25? How the Best Matching Algorithm Works

BM25 (short for “Best Matching 25”) is a ranking algorithm that search engines use to score how relevant a document is to a given query. It looks at how often your search terms appear in a document, how rare those terms are across the entire collection, and how long the document is compared to average. Despite being decades old, BM25 remains the default ranking function in major search platforms like Elasticsearch and Apache Lucene, and it plays a central role in modern AI-powered search systems.

How BM25 Scores a Document

BM25 calculates a relevance score by combining three core ideas for each search term in your query:

  • Term frequency: How many times the search term appears in the document. More occurrences signal higher relevance, but with diminishing returns (more on that below).
  • Inverse document frequency (IDF): How rare the term is across all documents. A term that appears in only a handful of documents carries more weight than one that shows up everywhere. If you search “machine learning tutorial,” the word “tutorial” appears in millions of documents, so it contributes less to the score than a rarer term would.
  • Document length normalization: A long document naturally contains more words, so a term might appear frequently just by chance. BM25 adjusts for this by comparing each document’s length to the average length in the collection.

The final score for a document is the sum of these components across every term in the query. Documents with higher scores rank higher in your search results.
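The three components and their sum can be sketched in a few lines of Python. This is a minimal, unoptimized illustration (real engines precompute these statistics in an inverted index); it uses the Lucene-style non-negative IDF variant, and the function and variable names are my own:

```python
import math

def bm25_score(query_terms, doc, corpus, k1=1.2, b=0.75):
    """Score one document against a query with the BM25 formula.
    `doc` is a list of tokens; `corpus` is a list of such token lists."""
    n_docs = len(corpus)
    avg_len = sum(len(d) for d in corpus) / n_docs
    score = 0.0
    for term in query_terms:
        tf = doc.count(term)          # term frequency in this document
        if tf == 0:
            continue
        # Document frequency: how many documents contain the term
        df = sum(1 for d in corpus if term in d)
        # Lucene-style IDF: rare terms get large values, common terms near 0
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        # Length normalization scales the saturation denominator
        norm = k1 * (1 - b + b * len(doc) / avg_len)
        score += idf * tf * (k1 + 1) / (tf + norm)
    return score
```

A document that never mentions a query term contributes nothing for that term, and documents longer than average need more occurrences to reach the same score.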

Why Term Frequency Saturates

One of BM25’s key improvements over simpler algorithms like TF-IDF is how it handles term frequency. In a naive approach, a document mentioning “python” 20 times would score twice as high as one mentioning it 10 times. That’s usually not useful, because after a certain point, extra mentions don’t make a document more relevant.

BM25 solves this with a saturation curve. The first few occurrences of a term boost the score significantly, but each additional occurrence adds less and less. The score approaches a ceiling and never exceeds it, no matter how many times the term repeats. This prevents long, repetitive documents from dominating results.
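The saturation curve is easy to see in isolation. Stripping out IDF and length normalization, the term-frequency component is `tf * (k1 + 1) / (tf + k1)`, which climbs toward a ceiling of `k1 + 1` (2.2 at the default `k1 = 1.2`) but never reaches it:

```python
def tf_weight(tf, k1=1.2):
    # Saturating term-frequency component of BM25
    # (IDF and length normalization omitted for clarity)
    return tf * (k1 + 1) / (tf + k1)

for tf in (1, 2, 5, 20, 100):
    print(f"tf={tf:>3}  weight={tf_weight(tf):.3f}")
```

The jump from one occurrence to two is large; the jump from nineteen to twenty is barely visible.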

The Two Tuning Parameters: k1 and b

BM25’s behavior is controlled by two parameters that you can adjust depending on your dataset.

k1: Term Frequency Saturation

The k1 parameter controls how quickly that saturation ceiling kicks in. A low k1 (close to 0) means the score flattens almost immediately: it barely matters whether a term appears once or ten times. A high k1 (say, 5 or 10) lets additional occurrences keep contributing meaningfully for longer. The default in Elasticsearch is 1.2, which works well for most use cases. At k1 = 0, term frequency is ignored entirely and only the rarity of terms (IDF) matters.

b: Document Length Normalization

The b parameter controls how much document length affects scoring, on a scale from 0 to 1. At b = 1, BM25 fully penalizes longer documents, assuming their higher term counts are just a side effect of length. At b = 0, document length is ignored completely. The default of 0.75 provides moderate normalization. If your collection has documents of wildly different lengths (say, tweets mixed with full research papers), tuning b can improve results. But for most collections, the defaults of k1 = 1.2 and b = 0.75 perform well without adjustment.
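The normalization itself is a single factor, `1 - b + b * (doc_length / avg_length)`, which multiplies k1 in the scoring denominator. A factor of 1 means no penalty; values above 1 penalize documents longer than average. For a document twice the average length, b = 1.0 gives a factor of 2.0, b = 0 gives 1.0, and the default b = 0.75 gives 1.75:

```python
def length_norm(doc_len, avg_len, b=0.75):
    # BM25's length-normalization factor (scales k1 in the denominator):
    # 1.0 means no penalty; >1 penalizes longer-than-average documents
    return 1 - b + b * doc_len / avg_len

print(length_norm(200, 100, b=1.0))   # full penalty
print(length_norm(200, 100, b=0.0))   # length ignored
print(length_norm(200, 100, b=0.75))  # the default, moderate
```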

Where BM25 Came From

The “BM” stands for “Best Matching,” and 25 is simply the iteration number in a series of ranking functions developed as part of the Okapi information retrieval system, a research project from the 1990s at City, University of London. The algorithm built on probabilistic retrieval theory, which frames search as a statistical question: given a query, what’s the probability that this document is relevant? BM25 turned out to be a particularly effective formulation of that idea, and it stuck.

BM25 vs. TF-IDF

If you’ve encountered TF-IDF (term frequency-inverse document frequency), BM25 is its more sophisticated cousin. Both use term frequency and document rarity, but BM25 adds two critical improvements. First, the saturation curve prevents runaway scores from keyword-stuffed or repetitive documents. Second, the tunable length normalization handles mixed-length collections more gracefully. In practice, BM25 consistently outperforms raw TF-IDF on real-world search tasks, which is why it replaced TF-IDF as the default in most search engines.

How BM25 Handles Rare vs. Common Terms

The IDF component gives BM25 a built-in sense of which words matter. Common words like “the” or “is” appear in nearly every document, so their IDF score is close to zero. In fact, if a term appears in more than half the documents in a collection, the classic IDF formula produces a negative value. Implementations guard against this, either by flooring the IDF at zero or by reformulating it so it can never go negative (Lucene’s BM25 takes the latter approach); either way, ultra-common terms simply contribute nothing to the score rather than dragging it down. Stop words (extremely common words like “a,” “the,” “and”) are typically filtered out before scoring even begins.
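The classic (Robertson/Sparck Jones) IDF formula and a zero floor can be sketched like this. For a collection of 100 documents, a term appearing in 5 of them gets a strongly positive weight, while one appearing in 60 goes negative and is floored:

```python
import math

def idf_classic(n_docs, doc_freq):
    # Robertson/Sparck Jones IDF: goes negative once the term
    # appears in more than half of the collection
    return math.log((n_docs - doc_freq + 0.5) / (doc_freq + 0.5))

def idf_floored(n_docs, doc_freq):
    # Clamp to zero so ultra-common terms contribute nothing
    return max(0.0, idf_classic(n_docs, doc_freq))

print(idf_classic(100, 5))    # rare term: strongly positive
print(idf_classic(100, 60))   # ultra-common term: negative
print(idf_floored(100, 60))   # floored to 0.0
```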

BM25 in Modern Search Systems

BM25 is the default scoring algorithm in Elasticsearch, which powers search for companies like Wikipedia, GitHub, and thousands of e-commerce sites. Apache Lucene, the search library underneath Elasticsearch and Apache Solr, also uses BM25 as its default. When you type a query into any application built on these tools, BM25 is almost certainly involved in ranking the results.

More recently, BM25 has found a second life in AI-powered search. Systems built on large language models like GPT and Gemini often use a technique called retrieval-augmented generation (RAG): the system first retrieves relevant documents, then passes them to the model to generate an answer. Many RAG pipelines use hybrid search, which runs a query through both BM25 and a neural embedding model simultaneously. BM25 catches exact keyword matches, while the embedding model captures semantic meaning (understanding that “car” and “automobile” refer to the same thing). A fusion step then merges both ranked lists, often using a weighted formula that balances keyword precision against semantic understanding.

This hybrid approach works because BM25 and neural models have complementary strengths. BM25 excels at matching specific terms, product names, error codes, or technical jargon where exact wording matters. Neural models are better at understanding intent and handling paraphrases. Together, they cover each other’s blind spots. A typical setup might weight the neural retriever at 60% and BM25 at 40%, though the best balance depends on the data.

Strengths and Limitations

BM25’s biggest advantage is that it’s fast, interpretable, and requires no training data. You can index a million documents and start getting useful search results immediately, with no machine learning pipeline. The scoring is transparent: you can look at exactly why a document ranked where it did.

Its main limitation is that it’s purely lexical. BM25 has no understanding of meaning. If someone searches “how to fix a flat tire” and a document discusses “changing a punctured tire,” BM25 may miss the connection because the exact words don’t match. It also can’t interpret context, so a search for “apple” has no way to distinguish documents about the fruit from documents about the company. These are the gaps that neural search models fill, which is why hybrid approaches combining both have become the standard in modern information retrieval.