BM25S Retriever

BM25, also known as Okapi BM25, is a ranking function used in information retrieval systems to estimate the relevance of documents to a given search query.
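For reference, the commonly cited Okapi form of the function is sketched below. This is the textbook formulation and not necessarily the exact variant a given library uses, since implementations differ mainly in how they define the IDF term. Here f(q_i, D) is the frequency of query term q_i in document D, |D| is the document length in tokens, avgdl is the average document length in the corpus, N is the number of documents, n(q_i) is the number of documents containing q_i, and k_1 and b are free parameters (values around 1.2 to 2.0 and 0.75 are typical).

```latex
\mathrm{score}(D, Q) = \sum_{i=1}^{n} \mathrm{IDF}(q_i) \cdot
  \frac{f(q_i, D)\,(k_1 + 1)}
       {f(q_i, D) + k_1 \left(1 - b + b\,\frac{|D|}{\mathrm{avgdl}}\right)},
\qquad
\mathrm{IDF}(q_i) = \ln\!\left(\frac{N - n(q_i) + 0.5}{n(q_i) + 0.5} + 1\right)
```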

BM25SRetriever uses the bm25s package, which leverages SciPy sparse matrices to store eagerly computed scores for all document tokens. This allows extremely fast scoring at query time, improving performance over popular libraries such as rank_bm25 by orders of magnitude.
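To make the eager-scoring idea concrete, here is a minimal sketch in plain Python. It is not the bm25s implementation: it uses nested dictionaries where bm25s uses SciPy sparse matrices, a whitespace tokenizer, and the textbook Okapi weighting. The names build_eager_index, score_query, and toy_corpus are illustrative assumptions, not part of any library.

```python
import math
from collections import Counter, defaultdict


def build_eager_index(corpus, k1=1.5, b=0.75):
    """Precompute the BM25 contribution of every (token, document) pair."""
    # Assumption: naive lowercase + whitespace tokenization, for illustration only.
    tokenized = [doc.lower().split() for doc in corpus]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs

    # Document frequency: number of documents containing each token.
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))

    # token -> {doc_id: precomputed score}. Only non-zero entries are stored,
    # which is the role a sparse matrix plays in a real implementation.
    index = defaultdict(dict)
    for doc_id, toks in enumerate(tokenized):
        length_norm = k1 * (1 - b + b * len(toks) / avg_len)
        for tok, tf in Counter(toks).items():
            df = doc_freq[tok]
            idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
            index[tok][doc_id] = idf * tf * (k1 + 1) / (tf + length_norm)
    return index, n_docs


def score_query(index, n_docs, query):
    """Score every document by summing precomputed entries for the query tokens."""
    scores = [0.0] * n_docs
    for tok in query.lower().split():
        for doc_id, value in index.get(tok, {}).items():
            scores[doc_id] += value
    return scores


toy_corpus = ["the cat purrs", "the dog barks", "the fish swims"]
index, n_docs = build_eager_index(toy_corpus)
scores = score_query(index, n_docs, "cat purrs")
best = max(range(n_docs), key=scores.__getitem__)  # index of the top-scoring document
```

All the logarithms and length normalization happen once at index time, so a query reduces to a handful of lookups and additions. That is the general reason eager scoring is fast at query time.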

Setup

%pip install --upgrade --quiet bm25s

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.2[0m[39;49m -> [0m[32;49m24.3.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.

from langchain_community.retrievers import BM25SRetriever

API Reference: BM25SRetriever

Instantiation

The retriever can be instantiated from a list of texts and, optionally, metadata, or directly from a list of Document objects. If a persist_directory is provided, the retriever saves the index to that directory.

# Create your corpus here
corpus = [
    "a cat is a feline and likes to purr",
    "a dog is the human's best friend and loves to play",
    "a bird is a beautiful animal that can fly",
    "a fish is a creature that lives in water and swims",
]

metadata = [
    {"descr": "I am doc1"},
    {"descr": "I am doc2"},
    {"descr": "I am doc3"},
    {"descr": "I am doc4"},
]

retriever = BM25SRetriever.from_texts(
    corpus, metadata, persist_directory="animal_index_bm25"
)

Alternatively, you can instantiate the retriever from a persisted directory.

retriever_2 = BM25SRetriever.from_persisted_directory("animal_index_bm25")

Usage

query = "does the fish purr like a cat?"
retrieved_chunks = retriever.invoke(query)
retrieved_chunks

[Document(metadata={'descr': 'I am doc1'}, page_content='a cat is a feline and likes to purr'),
 Document(metadata={'descr': 'I am doc4'}, page_content='a fish is a creature that lives in water and swims'),
 Document(metadata={'descr': 'I am doc3'}, page_content='a bird is a beautiful animal that can fly'),
 Document(metadata={'descr': 'I am doc2'}, page_content="a dog is the human's best friend and loves to play")]

retrieved_chunks_2 = retriever_2.invoke(query)
retrieved_chunks_2

[Document(metadata={'descr': 'I am doc1'}, page_content='a cat is a feline and likes to purr'),
 Document(metadata={'descr': 'I am doc4'}, page_content='a fish is a creature that lives in water and swims'),
 Document(metadata={'descr': 'I am doc3'}, page_content='a bird is a beautiful animal that can fly'),
 Document(metadata={'descr': 'I am doc2'}, page_content="a dog is the human's best friend and loves to play")]

retrieved_chunks == retrieved_chunks_2

True

Related

Retriever conceptual guide
Retriever how-to guides
