
BM25S Retriever

BM25 (Wikipedia), also known as Okapi BM25, is a ranking function used in information retrieval systems to estimate the relevance of documents to a given search query.

The BM25SRetriever uses the bm25s package, which leverages SciPy sparse matrices to store eagerly computed scores for all document tokens. This allows extremely fast scoring at query time, improving performance over popular libraries such as rank_bm25 by orders of magnitude.
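To make the idea of eagerly computed scores concrete, here is a minimal, standard-library sketch of the technique. It is not the bm25s implementation: the nested dictionary stands in for a sparse matrix, the whitespace tokenizer is a simplification, and the `k1` and `b` defaults and the Lucene-style IDF formula are assumptions chosen for illustration. The names `build_index` and `search` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def build_index(corpus, k1=1.5, b=0.75):
    """Precompute a BM25 score for every (token, document) pair."""
    docs = [text.lower().split() for text in corpus]
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs

    # Document frequency: number of documents containing each token.
    doc_freq = Counter(tok for d in docs for tok in set(d))

    # index[token][doc_id] -> precomputed score. Pairs that never occur are
    # simply absent, which is why a sparse structure fits this data well.
    index = defaultdict(dict)
    for doc_id, tokens in enumerate(docs):
        length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
        for tok, tf in Counter(tokens).items():
            df = doc_freq[tok]
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            index[tok][doc_id] = idf * tf * (k1 + 1) / (tf + length_norm)
    return index


def search(index, query, n_docs):
    """Score a query by summing the stored per-token scores."""
    scores = [0.0] * n_docs
    for tok in set(query.lower().split()):
        for doc_id, score in index.get(tok, {}).items():
            scores[doc_id] += score
    ranking = sorted(range(n_docs), key=lambda i: scores[i], reverse=True)
    return ranking, scores


corpus = [
    "a cat is a feline and likes to purr",
    "a dog is the human's best friend and loves to play",
    "a bird is a beautiful animal that can fly",
    "a fish is a creature that lives in water and swims",
]
index = build_index(corpus)
ranking, scores = search(index, "fish purr cat", len(corpus))
print(ranking[:2])
```

All of the term-frequency and length-normalization arithmetic happens once in `build_index`, so a query only needs lookups and additions. That is the trade-off the paragraph above describes: more work and storage at indexing time in exchange for cheaper scoring at query time.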

Setup

%pip install --upgrade --quiet  bm25s

from langchain_community.retrievers import BM25SRetriever

Instantiation

The retriever can be instantiated from a list of texts and (optionally) metadata or directly from a list of Documents. If a persist_directory is provided, the retriever will persist the index to that directory.

# Create your corpus here
corpus = [
    "a cat is a feline and likes to purr",
    "a dog is the human's best friend and loves to play",
    "a bird is a beautiful animal that can fly",
    "a fish is a creature that lives in water and swims",
]

metadata = [
    {"descr": "I am doc1"},
    {"descr": "I am doc2"},
    {"descr": "I am doc3"},
    {"descr": "I am doc4"},
]

retriever = BM25SRetriever.from_texts(
    corpus, metadata, persist_directory="animal_index_bm25"
)

Alternatively, you can instantiate the retriever from a persisted directory.

retriever_2 = BM25SRetriever.from_persisted_directory("animal_index_bm25")

Usage

query = "does the fish purr like a cat?"
retrieved_chunks = retriever.invoke(query)
retrieved_chunks
[Document(metadata={'descr': 'I am doc1'}, page_content='a cat is a feline and likes to purr'),
Document(metadata={'descr': 'I am doc4'}, page_content='a fish is a creature that lives in water and swims'),
Document(metadata={'descr': 'I am doc3'}, page_content='a bird is a beautiful animal that can fly'),
Document(metadata={'descr': 'I am doc2'}, page_content="a dog is the human's best friend and loves to play")]
retrieved_chunks_2 = retriever_2.invoke(query)
retrieved_chunks_2
[Document(metadata={'descr': 'I am doc1'}, page_content='a cat is a feline and likes to purr'),
Document(metadata={'descr': 'I am doc4'}, page_content='a fish is a creature that lives in water and swims'),
Document(metadata={'descr': 'I am doc3'}, page_content='a bird is a beautiful animal that can fly'),
Document(metadata={'descr': 'I am doc2'}, page_content="a dog is the human's best friend and loves to play")]
retrieved_chunks == retrieved_chunks_2
True
