arxiv:2407.03618

BM25S: Orders of magnitude faster lexical search via eager sparse scoring

Published on Jul 4
· Submitted by xhluca on Jul 10

Abstract

We introduce BM25S, an efficient Python-based implementation of BM25 that depends only on NumPy and SciPy. BM25S achieves up to a 500x speedup over the most popular Python-based framework by eagerly computing BM25 scores during indexing and storing them in sparse matrices. It also achieves considerable speedups over highly optimized Java-based implementations, which are used by popular commercial products. Finally, BM25S reproduces the exact implementation of five BM25 variants based on Kamphuis et al. (2020) by extending eager scoring to non-sparse variants using a novel score shifting method. The code can be found at https://github.com/xhluca/bm25s
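To make the idea of eager scoring concrete, here is a minimal sketch in plain Python. It is not the BM25S code: the function names `build_index` and `search` are hypothetical, a dictionary of dictionaries stands in for the SciPy sparse matrix, the scoring formula is the Lucene-style BM25 variant (chosen here because its score is zero whenever a term is absent, so only occurring term/document pairs need storing), and the defaults `k1=1.5`, `b=0.75` are assumed for illustration.

```python
import math
from collections import Counter, defaultdict


def build_index(corpus_tokens, k1=1.5, b=0.75):
    """Eagerly compute a BM25 score for every (term, document) pair that occurs.

    Returns (index, n_docs), where index maps term -> {doc_id: score}.
    The nested dict is a stand-in for a sparse matrix: absent pairs score 0.
    """
    n_docs = len(corpus_tokens)
    doc_lens = [len(doc) for doc in corpus_tokens]
    avg_len = sum(doc_lens) / n_docs

    # Document frequency: number of documents containing each term.
    df = Counter()
    for doc in corpus_tokens:
        df.update(set(doc))

    index = defaultdict(dict)
    for doc_id, doc in enumerate(corpus_tokens):
        # Length normalisation depends only on the document, not the term.
        norm = k1 * (1.0 - b + b * doc_lens[doc_id] / avg_len)
        for term, tf in Counter(doc).items():
            # Lucene-style IDF (always positive) times the saturating TF part.
            idf = math.log(1.0 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            index[term][doc_id] = idf * tf / (tf + norm)
    return index, n_docs


def search(index, n_docs, query_tokens, k=3):
    """Score a query by summing the precomputed entries of its terms."""
    scores = [0.0] * n_docs
    for term in query_tokens:
        for doc_id, score in index.get(term, {}).items():
            scores[doc_id] += score
    ranked = sorted(range(n_docs), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in ranked[:k]]


# Toy corpus, tokenised by whitespace (an assumption made for brevity).
corpus = [s.split() for s in [
    "the cat likes to purr",
    "the dog likes to bark",
    "birds can fly",
]]
index, n_docs = build_index(corpus)
results = search(index, n_docs, ["cat", "purr"], k=2)
```

In this sketch all BM25 arithmetic happens once, in `build_index`; `search` only looks up and adds stored numbers. That shift of work from query time to indexing time is the idea the abstract describes, though the actual speedups reported depend on the library's own sparse-matrix implementation.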

Community

Paper author Paper submitter

BM25S is designed to provide a fast, low-dependency, and low-memory implementation of BM25 algorithms in Python. It is built solely with NumPy and SciPy, with optional dependencies for stemming and selection, and it integrates with the Hugging Face Hub, allowing you to share and use other BM25 indices with ease.


@xhluca Congrats on your work 🔥 Amazing blog btw, thanks for sharing! 🤗 https://huggingface.co/blog/xhluca/bm25s

