BM25S Retriever
BM25 (Wikipedia), also known as Okapi BM25, is a ranking function used in information retrieval systems to estimate the relevance of documents to a given search query.
BM25SRetriever uses the bm25s package, which leverages SciPy sparse matrices to store eagerly computed scores for all document tokens. This allows extremely fast scoring at query time, improving performance over popular libraries such as rank_bm25 by orders of magnitude.
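To make the idea of eagerly computed scores concrete, here is a minimal pure-Python sketch. It is not the bm25s implementation: it keeps scores in plain dictionaries rather than SciPy sparse matrices, and the tokenizer, the IDF variant, the default values `k1=1.5` and `b=0.75`, and the helper names `tokenize`, `build_index`, and `search` are all assumptions made for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    # Simplified tokenizer (an assumption): lowercase, split on non-word characters.
    return re.findall(r"\w+", text.lower())


def build_index(corpus, k1=1.5, b=0.75):
    """Eagerly compute a BM25 score for every (token, document) pair."""
    docs = [tokenize(text) for text in corpus]
    n_docs = len(docs)
    avg_len = sum(len(doc) for doc in docs) / n_docs
    # Number of documents in which each token appears.
    doc_freq = Counter(tok for doc in docs for tok in set(doc))

    index = defaultdict(dict)  # token -> {doc_id: precomputed score}
    for doc_id, doc in enumerate(docs):
        # Length normalisation term of Okapi BM25.
        length_norm = k1 * (1 - b + b * len(doc) / avg_len)
        for tok, freq in Counter(doc).items():
            # One common smoothed IDF variant (an assumption, not necessarily the library's).
            idf = math.log(1 + (n_docs - doc_freq[tok] + 0.5) / (doc_freq[tok] + 0.5))
            index[tok][doc_id] = idf * freq * (k1 + 1) / (freq + length_norm)
    return index


def search(index, query, n_docs, k=2):
    """Score a query by summing the precomputed scores of its distinct tokens."""
    scores = [0.0] * n_docs
    for tok in set(tokenize(query)):
        for doc_id, score in index.get(tok, {}).items():
            scores[doc_id] += score
    ranked = sorted(range(n_docs), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in ranked[:k]]


corpus = [
    "a cat is a feline and likes to purr",
    "a dog is the human's best friend and loves to play",
    "a bird is a beautiful animal that can fly",
    "a fish is a creature that lives in water and swims",
]
index = build_index(corpus)
top = search(index, "does the fish purr like a cat?", n_docs=len(corpus))
print(top)  # (document index, score) pairs, best match first
```

Because every per-token score is computed at indexing time, answering a query only requires looking up and summing stored values, which is the property the paragraph above attributes to bm25s.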
Setup
%pip install --upgrade --quiet bm25s
Note: you may need to restart the kernel to use updated packages.
from langchain_community.retrievers import BM25SRetriever
API Reference: BM25SRetriever
Instantiation
The retriever can be instantiated from a list of texts and (optionally) metadata, or directly from a list of Documents. If a persist_directory is provided, the retriever will persist the index to that directory.
# Create your corpus here
corpus = [
    "a cat is a feline and likes to purr",
    "a dog is the human's best friend and loves to play",
    "a bird is a beautiful animal that can fly",
    "a fish is a creature that lives in water and swims",
]
metadata = [
    {"descr": "I am doc1"},
    {"descr": "I am doc2"},
    {"descr": "I am doc3"},
    {"descr": "I am doc4"},
]
retriever = BM25SRetriever.from_texts(
    corpus, metadata, persist_directory="animal_index_bm25"
)
Alternatively, you can instantiate the retriever from a previously persisted directory.
retriever_2 = BM25SRetriever.from_persisted_directory("animal_index_bm25")
Usage
query = "does the fish purr like a cat?"
retrieved_chunks = retriever.invoke(query)
retrieved_chunks
[Document(metadata={'descr': 'I am doc1'}, page_content='a cat is a feline and likes to purr'),
Document(metadata={'descr': 'I am doc4'}, page_content='a fish is a creature that lives in water and swims'),
Document(metadata={'descr': 'I am doc3'}, page_content='a bird is a beautiful animal that can fly'),
Document(metadata={'descr': 'I am doc2'}, page_content="a dog is the human's best friend and loves to play")]
retrieved_chunks_2 = retriever_2.invoke(query)
retrieved_chunks_2
[Document(metadata={'descr': 'I am doc1'}, page_content='a cat is a feline and likes to purr'),
Document(metadata={'descr': 'I am doc4'}, page_content='a fish is a creature that lives in water and swims'),
Document(metadata={'descr': 'I am doc3'}, page_content='a bird is a beautiful animal that can fly'),
Document(metadata={'descr': 'I am doc2'}, page_content="a dog is the human's best friend and loves to play")]
retrieved_chunks == retrieved_chunks_2
True
Related
- Retriever conceptual guide
- Retriever how-to guides