toolkit for creating and searching sparse representations

Maintainers

andrewyates

Project description

bsparse

bsparse is a toolkit for creating, indexing, and searching learned sparse representations.

Usage examples

# Recommended way to install requirements
# (plain pip also works, but uv is much faster):
pipx install uv
# Create virtual environment
uv venv venv
# Activate
source venv/bin/activate
# Install requirements
uv pip install -r requirements.txt

# Request access to splade-v3: https://huggingface.co/naver/splade-v3
# Get your huggingface API token and then:
export HF_TOKEN="the token"

# load Python virtual environment
source venv/bin/activate

# optional: spot check output from a model
python -m bsparse.cli check --text "tesla net worth"

# create query representations:
python -m bsparse.cli encode --out nfcorpus-queries.jsonl \
  --dataset irds --type query --name beir/nfcorpus  --batch-size 64

# create doc representations:
python -m bsparse.cli encode --out nfcorpus-docs.jsonl \
  --dataset irds --type doc --name beir/nfcorpus  --batch-size 64

# search and evaluate without building an index:
python -m bsparse.cli memsearch --out nfcorpus.run --docs nfcorpus-docs.jsonl --queries nfcorpus-queries.jsonl --qrels beir/nfcorpus/test


# alternatively, you can build an index and search it

# 1) setup: compile ScaledJsonVectorCollection.java and add it to a copy of anserini-1.0.0-fatjar.jar
wget -c https://repo1.maven.org/maven2/io/anserini/anserini/1.0.0/anserini-1.0.0-fatjar.jar
cd java
javac -cp ../anserini-1.0.0-fatjar.jar io/anserini/collection/*.java
cp ../anserini-1.0.0-fatjar.jar ../anserini-1.0.0-fatjar-bsparse.jar
jar uf ../anserini-1.0.0-fatjar-bsparse.jar io/anserini/collection/*.class
cd ..

# 2) build index (uses the patched jar created in step 1)
java -cp anserini-1.0.0-fatjar-bsparse.jar io.anserini.index.IndexCollection \
  -generator DefaultLuceneDocumentGenerator -impact -pretokenized \
  -threads 16 -collection ScaledJsonVectorCollection \
  -input /path/to/encoded-text -index /path/to/encoded-text-index

# 3) search index
# Set `$QUERY_VECTORS` to the encoded query file and `$INDEX` to the index built in step 2, then:
python -m bsparse.cli search --index $INDEX --queries $QUERY_VECTORS --out test.run --topk 1000
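
As a rough picture of what searching without an index involves, the following Python sketch scores documents against a query by the dot product of their term weights, which is the usual scoring function for learned sparse representations. It is an illustration only and not bsparse's memsearch implementation; the function names `build_postings` and `search` and the toy vectors are hypothetical.

```python
from collections import defaultdict


def build_postings(docs):
    """Invert {doc_id: {token: weight}} into {token: [(doc_id, weight), ...]}."""
    postings = defaultdict(list)
    for doc_id, vector in docs.items():
        for token, weight in vector.items():
            postings[token].append((doc_id, weight))
    return postings


def search(query, postings, topk=1000):
    """Rank documents by the dot product of query and document term weights."""
    scores = defaultdict(float)
    for token, q_weight in query.items():
        for doc_id, d_weight in postings.get(token, ()):
            scores[doc_id] += q_weight * d_weight
    # Highest score first; ties are broken by doc_id for a stable ordering.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:topk]


# Toy data (hypothetical): three documents and one query.
docs = {
    "d1": {"tesla": 2.0, "car": 1.0},
    "d2": {"tesla": 1.0, "worth": 3.0},
    "d3": {"banana": 1.5},
}
query = {"tesla": 1.0, "worth": 2.0}

results = search(query, build_postings(docs), topk=2)
print(results)
```

Documents that share no token with the query never receive a score, so they are absent from the ranking rather than listed with a score of zero.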

Seismic backend

Seismic is an alternative backend that indexes learned sparse representations through Python bindings, so no Java or JAR is required. The encoded JSONL files produced by `encode` are already in the format Seismic expects, so the same doc/query files work for both backends.
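
To inspect an encoded file before indexing it, a small reader along the following lines may help. This is a sketch under assumptions: the field names `id` and `vector` (a token-to-weight mapping) are assumed rather than taken from the bsparse documentation, so check a line of your own encode output first. The helper `iter_records` and the example file are hypothetical.

```python
import gzip
import json
import os
import tempfile


def iter_records(path):
    """Yield (id, vector) pairs from a .jsonl or .jsonl.gz file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            # Assumed field names; adjust to match the real encode output.
            yield record["id"], record["vector"]


# Write a tiny gzipped example so the sketch runs on its own.
tmp_dir = tempfile.mkdtemp()
example_path = os.path.join(tmp_dir, "example-docs.jsonl.gz")
with gzip.open(example_path, "wt", encoding="utf-8") as handle:
    handle.write(json.dumps({"id": "d1", "vector": {"tesla": 2.0, "car": 1.0}}) + "\n")
    handle.write(json.dumps({"id": "d2", "vector": {"worth": 3.0}}) + "\n")

records = list(iter_records(example_path))
print(records)
```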

# install the Seismic Python bindings (optional dependency; only needed for this backend)
uv pip install pyseismic-lsr
# for best performance, build against your CPU instead:
# RUSTFLAGS="-C target-cpu=native" uv pip install --no-binary :all: pyseismic-lsr

# 1) build a Seismic index from encoded docs
python -m bsparse.cli index --backend seismic --input nfcorpus-docs.jsonl --index $INDEX
# --input accepts multiple files, gzipped (.gz) input, and directories of .jsonl/.jsonl.gz files;
# if the in-memory API gives you trouble, --build-method file falls back to concatenating
# the inputs into a temporary uncompressed JSONL file and using Seismic's file-based build
#
# note: Seismic appends ".index.seismic" to the path, so the on-disk file is $INDEX.index.seismic;
# search --index accepts either the build-time path or the full on-disk filename
#
# indexing hyperparameters are flags with defaults, e.g.:
#   --n-postings 3000 --centroid-fraction 0.2 --summary-energy 0.5 --max-fraction 6 --min-cluster-size 2 --nknn 0
#
# use --variant large_vocab for collections with more than 65k unique tokens

# 2) search the index and evaluate
python -m bsparse.cli search --backend seismic --index $INDEX \
  --queries nfcorpus-queries.jsonl --out test.run --topk 1000 \
  --query-cut 10 --heap-factor 0.8 --qrels beir/nfcorpus/test

# query-time thread count is index-independent and set via the environment:
#   SEISMIC_THREADS=16 python -m bsparse.cli search --backend seismic ...
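
The --query-cut flag appears to correspond to Seismic's query_cut parameter, which restricts the search to the highest-weighted components of the query. Assuming that mapping holds, the effect on a single query vector can be sketched as follows; `cut_query` is a hypothetical helper and not part of bsparse or Seismic.

```python
def cut_query(vector, query_cut):
    """Keep only the query_cut highest-weight terms of a sparse query vector."""
    # Sort by descending weight; ties are broken by token for determinism.
    top = sorted(vector.items(), key=lambda kv: (-kv[1], kv[0]))[:query_cut]
    return dict(top)


# Toy query vector (hypothetical weights).
query = {"tesla": 1.8, "net": 0.4, "worth": 1.2, "value": 0.9}
trimmed = cut_query(query, 2)
print(trimmed)
```

A smaller cut touches fewer posting lists and is therefore usually faster, at the cost of ignoring low-weight query terms that may still contribute to the score.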

Release history

0.2.0 (this version): Jun 9, 2026
0.1.0: Oct 21, 2025

Download files

Source Distribution

bsparse-0.2.0.tar.gz (24.6 kB), uploaded Jun 9, 2026, Source

Built Distribution

bsparse-0.2.0-py3-none-any.whl (21.2 kB), uploaded Jun 9, 2026, Python 3

File details

Details for the file bsparse-0.2.0.tar.gz.

File metadata

Download URL: bsparse-0.2.0.tar.gz
Upload date: Jun 9, 2026
Size: 24.6 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for bsparse-0.2.0.tar.gz
Algorithm	Hash digest
SHA256	`3d2f56b750f8562c64400ba5719bce3dc4adfca362137d343154ed2e06714da6`
MD5	`bf215e7be8257f3475f372e0c36cf770`
BLAKE2b-256	`50ce8e94d3b481c69846fe416f3b6af8770f0581b85ad2764a34cb152f973273`
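
A downloaded file can be checked against the SHA256 digest listed above with Python's standard hashlib module. The sketch below is one way to do it; the helper names `sha256_of` and `verify` are hypothetical, and the demonstration runs on a throwaway file rather than the real archive.

```python
import hashlib
import os
import tempfile


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected_hex):
    """True if the file's SHA256 digest matches the expected hex string."""
    return sha256_of(path) == expected_hex.strip().lower()


# Demonstration on a temporary file with known contents.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello")
demo_path = tmp.name
demo_ok = verify(demo_path, hashlib.sha256(b"hello").hexdigest())
demo_bad = verify(demo_path, "0" * 64)
os.remove(demo_path)
print(demo_ok, demo_bad)
```

For the source distribution, the call would be `verify("bsparse-0.2.0.tar.gz", "3d2f56b750f8562c64400ba5719bce3dc4adfca362137d343154ed2e06714da6")`, assuming the archive sits in the current directory.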

Provenance

The following attestation bundles were made for bsparse-0.2.0.tar.gz:

Publisher: publish-release.yml on hltcoe/bsparse

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: bsparse-0.2.0.tar.gz
- Subject digest: 3d2f56b750f8562c64400ba5719bce3dc4adfca362137d343154ed2e06714da6
- Sigstore transparency entry: 1770839062
- Sigstore integration time: Jun 9, 2026
Source repository:
- Permalink: hltcoe/bsparse@6cccd3baadfea5a25ca68a828af0da748c5f0d42
- Branch / Tag: refs/tags/v0.2.0
- Owner: https://github.com/hltcoe
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish-release.yml@6cccd3baadfea5a25ca68a828af0da748c5f0d42
- Trigger Event: push

File details

Details for the file bsparse-0.2.0-py3-none-any.whl.

File metadata

Download URL: bsparse-0.2.0-py3-none-any.whl
Upload date: Jun 9, 2026
Size: 21.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for bsparse-0.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`5a7a4c60839ecc86f05618fa9c85e6778a5f609a875f154b4dd8b0cbdee3ae20`
MD5	`e2cb464932f7c617d1dd8c43d8b223ab`
BLAKE2b-256	`42af7fb00905ff63695f22fcaf84e0e4258fc8641353a7e4e2f6ceab3dcd5ba4`

Provenance

The following attestation bundles were made for bsparse-0.2.0-py3-none-any.whl:

Publisher: publish-release.yml on hltcoe/bsparse

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: bsparse-0.2.0-py3-none-any.whl
- Subject digest: 5a7a4c60839ecc86f05618fa9c85e6778a5f609a875f154b4dd8b0cbdee3ae20
- Sigstore transparency entry: 1770839355
- Sigstore integration time: Jun 9, 2026
Source repository:
- Permalink: hltcoe/bsparse@6cccd3baadfea5a25ca68a828af0da748c5f0d42
- Branch / Tag: refs/tags/v0.2.0
- Owner: https://github.com/hltcoe
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish-release.yml@6cccd3baadfea5a25ca68a828af0da748c5f0d42
- Trigger Event: push
