Skip to main content

A Python library for efficient sparse retrieval using SPLADE models.

Project description

light-splade

light-splade provides a minimal yet extensible PyTorch implementation of SPLADE, a family of sparse neural retrievers that expand queries and documents into interpretable sparse representations.

Unlike dense retrievers, SPLADE produces sparse vectors in the vocabulary space, making it both efficient to index with standard IR engines (e.g., Lucene, Elasticsearch) and interpretable, while achieving strong retrieval effectiveness. It was first introduced in the paper “SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking”.

This repository is designed for

  • Practitioners wanting to train SPLADE models on custom corpora.
  • Developers experimenting with sparse lexical expansion at scale.
  • Researchers looking for a reference implementation.

We currently support SPLADE v2 and SPLADE++

Features

  • Training pipeline for SPLADE using PyTorch + HuggingFace Transformers.
  • Support for distillation training from dense retrievers (e.g., ColBERT, dense BERT).
  • Export trained models into sparse representations compatible with IR systems.
  • Simple, lightweight, and easy to extend for experiments.

Installation

pip install light-splade

Quickstart

The following code uses bizreach-inc/light-splade-japanese-28M, an open SPLADE model for Japanese.

  • Convert text to sparse vector with SPLADE model using this package
import torch
from light_splade import SpladeEncoder

# Initialize the encoder
encoder = SpladeEncoder(model_path="bizreach-inc/light-splade-japanese-28M")

# Tokenize input text
corpus = [
    "日本の首都は東京です。",
    "大阪万博は2025年に開催されます。"
]

# Generate sparse representation
with torch.inference_mode():
    embeddings = encoder.encode(corpus)
    sparse_vecs = encoder.to_sparse(embeddings)

print(sparse_vecs[0])
print(sparse_vecs[1])
  • Convert text to sparse vector with SPLADE model using transformers package

Install required packages

pip install fugashi torch transformers unidic-lite

Then execute the following Python code

import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM


def dense_to_sparse(dense: torch.tensor, idx2token: dict[int, str]) -> list[dict[str, float]]:
    rows, cols = dense.nonzero(as_tuple=True)
    rows = rows.tolist()
    cols = cols.tolist()
    weights = dense[rows, cols].tolist()

    sparse_vecs = [{} for _ in range(dense.size(0))]
    for row, col, weight in zip(rows, cols, weights):
        sparse_vecs[row][idx2token[col]] = round(weight, 2)

    for i in range(len(sparse_vecs)):
        sparse_vecs[i] = dict(sorted(sparse_vecs[i].items(), key=lambda x: x[1], reverse=True))
    return sparse_vecs


MODEL_PATH = "bizreach-inc/light-splade-japanese-28M"
device = "cuda" if torch.cuda.is_available() else "cpu"
transformer = AutoModelForMaskedLM.from_pretrained(MODEL_PATH).to(device)
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
idx2token = {idx: token for token, idx in tokenizer.get_vocab().items()}

corpus = [
    "日本の首都は東京です。",
    "大阪万博は2025年に開催されます。"
]
token_outputs = tokenizer(corpus, padding=True, return_tensors="pt")
attention_mask = token_outputs["attention_mask"].to(device)
token_outputs = {key: value.to(device) for key, value in token_outputs.items()}

with torch.inference_mode():
    outputs = transformer(**token_outputs)
    dense, _ = torch.max(
        torch.log(1 + torch.relu(outputs.logits)) * attention_mask.unsqueeze(-1),
        dim=1,
    )
sparse_vecs = dense_to_sparse(dense, idx2token)

print(sparse_vecs[0])
print(sparse_vecs[1])
  • Output
{'首都': 1.83, '日本': 1.82, '東京': 1.78, '中立': 0.73, '都会': 0.69, '駒': 0.68, '州都': 0.67, '首相': 0.64, '足立': 0.62, 'です': 0.61, '都市': 0.54, 'ユニ': 0.54, '京都': 0.52, '国': 0.51, '発表': 0.49, '成田': 0.48, '太陽': 0.45, '藤原': 0.45, '私立': 0.42, '王国': 0.4...}
{'202': 1.61, '開催': 1.49, '大阪': 1.34, '万博': 1.19, '東京': 1.15, '年': 1.1, 'いつ': 1.05, '##5': 1.03, '203': 0.86, '月': 0.8, '期間': 0.79, '高槻': 0.79, '京都': 0.7, '神戸': 0.62, '2024': 0.54, '夢': 0.52, '206': 0.52, '姫路': 0.51, '行わ': 0.49, 'こう': 0.49, '芸術': 0.48...}

Setup for fine-tuning a SPLADE model

  • Python 3.11+.
  • Recommended: use the uv tool to manage the virtual environment (see Getting started document).

Quick setup (recommended):

git clone https://github.com/bizreach-inc/light-splade.git
cd light-splade
# create and activate virtual env using uv
uv venv --seed .venv
source .venv/bin/activate
uv sync

For developer checks, run:

uv run pre-commit run --all-files
uv run pytest

Train SPLADE with toy dataset (triplet-based)

  • uv run examples/run_train_splade_triplet.py --config-name toy_splade_ja
  • To run on an environment without GPU, see this trouble shooting

For full run instructions using uv and Docker commands, see Getting started.

Input Data format

Detailed data format docs:

References

License

This project is licensed under the Apache License, Version 2.0 — see the LICENSE file for details.

Copyright 2025 BizReach, Inc.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

light_splade-0.1.2.tar.gz (222.5 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

light_splade-0.1.2-py3-none-any.whl (79.7 kB view details)

Uploaded Python 3

File details

Details for the file light_splade-0.1.2.tar.gz.

File metadata

  • Download URL: light_splade-0.1.2.tar.gz
  • Upload date:
  • Size: 222.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.9.13 {"installer":{"name":"uv","version":"0.9.13"},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"macOS","version":null,"id":null,"libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}

File hashes

Hashes for light_splade-0.1.2.tar.gz
Algorithm Hash digest
SHA256 06ca280abe107bae3049b8d6e970a3cf8ab1f7d862a6a8ed53551442ea21e255
MD5 56eba41c0bd5f72ffe591b334c0dd436
BLAKE2b-256 1c8733bd84dc1dcc9426efb0f41f66e097a65f6a3ca27823437c2892888c656e

See more details on using hashes here.

File details

Details for the file light_splade-0.1.2-py3-none-any.whl.

File metadata

  • Download URL: light_splade-0.1.2-py3-none-any.whl
  • Upload date:
  • Size: 79.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.9.13 {"installer":{"name":"uv","version":"0.9.13"},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"macOS","version":null,"id":null,"libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}

File hashes

Hashes for light_splade-0.1.2-py3-none-any.whl
Algorithm Hash digest
SHA256 0f420be7f1da32e351e05454464d0e99c348917a8caa37b71ecb38d60d010b8a
MD5 e093cec9620ecef7e46da1075c0a3232
BLAKE2b-256 dcb72f42aa12ac0fbc60e89729eb4e3365642c3447e7d7a503cd29a4590b77a1

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page