Seismic: A high-performance data structure for fast retrieval over learned sparse representations.

Seismic

Seismic is a fast and lightweight search engine for learned sparse embeddings, written in Rust with Python bindings. It indexes sparse vector collections and retrieves results in microseconds with near-exact accuracy.

Requirements

  • Python >= 3.8
  • Rust toolchain (only needed if installing from source for hardware-specific optimizations)

Installation

The easiest way to use Seismic is via its Python API, which can be installed in two ways:

  1. via pip, the simplest option:
pip install pyseismic-lsr
  2. by compiling from source, which enables hardware-specific optimizations (requires a working Rust toolchain, installable via rustup):
RUSTFLAGS="-C target-cpu=native" pip install --no-binary :all: pyseismic-lsr

Check docs/PythonUsage.md for more details.

Quick Start

Given a collection stored as a jsonl file, you can index it by running:

from seismic import SeismicIndex

json_input_file = ""  # Path to your jsonl data collection

index = SeismicIndex.build(json_input_file)
print("Number of documents:", index.len)
print("Avg number of non-zero components:", index.nnz / index.len)
print("Dimensionality of the vectors:", index.dim)

index.print_space_usage_byte()

You can then use the index to answer queries quickly:

import numpy as np

# Fixed-width unicode dtype for query tokens; NumPy truncates longer tokens
MAX_TOKEN_LEN = 30
string_type = f'U{MAX_TOKEN_LEN}'

query = {"a": 3.5, "certain": 3.5, "query": 0.4}
query_id = "0"
query_components = np.array(list(query.keys()), dtype=string_type)
query_values = np.array(list(query.values()), dtype=np.float32)

results = index.search(
    query_id=query_id,
    query_components=query_components,
    query_values=query_values,
    k=10,
    query_cut=3,
    heap_factor=0.8,
)
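
The query arrays above use a fixed-width string dtype, so NumPy would silently truncate any token longer than MAX_TOKEN_LEN characters. A minimal, dependency-free sketch of a pre-check is shown below. The helper name split_query is hypothetical and not part of Seismic; it only prepares plain Python lists.

```python
MAX_TOKEN_LEN = 30  # same limit as the 'U30' dtype used for query tokens


def split_query(query, max_token_len=MAX_TOKEN_LEN):
    """Split a token -> weight dict into parallel lists, rejecting long tokens.

    Hypothetical helper (not a Seismic API): the returned lists can be
    converted to NumPy arrays exactly as in the Quick Start snippet.
    """
    too_long = [t for t in query if len(t) > max_token_len]
    if too_long:
        raise ValueError(
            f"tokens longer than {max_token_len} characters: {too_long}"
        )
    components = list(query.keys())
    values = [float(v) for v in query.values()]
    return components, values


components, values = split_query({"a": 3.5, "certain": 3.5, "query": 0.4})
```

Raising an error is one possible design choice; raising MAX_TOKEN_LEN to fit the longest token in your vocabulary would be another.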

Each document in the jsonl file should be a JSON object with an id (integer), an optional content (string), and a vector (dictionary mapping tokens to scores, e.g., {"dog": 2.45}). See docs/RunExperiments.md for full format details.
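
As a rough illustration of the format described above, the following sketch writes a tiny collection to a jsonl file using only the Python standard library. The document contents, the token scores, and the file name toy_collection.jsonl are made up for the example; docs/RunExperiments.md remains the authoritative description of the format.

```python
import json
import os
import tempfile

# Hypothetical toy documents: an integer id, an optional content string,
# and a sparse vector stored as a token -> score dictionary.
documents = [
    {"id": 0, "content": "a dog in the park", "vector": {"dog": 2.45, "park": 1.1}},
    {"id": 1, "content": "a cat on the sofa", "vector": {"cat": 2.0, "sofa": 0.7}},
    {"id": 2, "vector": {"bird": 1.3}},  # content is optional
]


def write_jsonl(docs, path):
    """Write one JSON object per line, checking the basic field types first."""
    with open(path, "w", encoding="utf-8") as f:
        for doc in docs:
            if not isinstance(doc["id"], int):
                raise TypeError("id must be an integer")
            if not isinstance(doc["vector"], dict):
                raise TypeError("vector must map tokens to scores")
            f.write(json.dumps(doc) + "\n")


def read_jsonl(path):
    """Read the file back as a list of dictionaries (useful as a sanity check)."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Assumed file name, placed in a temporary directory for the example.
collection_path = os.path.join(tempfile.mkdtemp(), "toy_collection.jsonl")
write_jsonl(documents, collection_path)
```

The resulting path could then be passed as json_input_file to SeismicIndex.build, as in the Quick Start snippet.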

Features

  • Multiple index variants: Standard (SeismicIndex), compressed (SeismicIndexDotVByte), and large vocabulary (SeismicIndexLV) for collections with >65K unique tokens
  • RAG-ready: Build the index with load_content=True and retrieve document texts alongside scores
  • Python & Rust APIs: Use from Python via pyseismic-lsr or integrate directly in Rust via cargo add seismic
  • Parallel batch search: Multi-threaded query processing via batch_search

Examples

Interactive Jupyter notebooks are available in the examples/ folder.

Best Results

Seismic is an approximate algorithm designed for high-performance retrieval over learned sparse representations. We provide pre-optimized configurations for several common datasets, e.g., MsMarco. Check the detailed documentation in docs/BestResults.md and the available optimized configurations in experiments/best_configs.
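
Because retrieval is approximate, accuracy is commonly reported as recall against an exact top-k baseline. The sketch below shows one way to compute that on a toy collection, assuming inner product as the similarity. The functions, the data, and the approx list are all illustrative stand-ins, not Seismic APIs or results.

```python
import heapq


def dot(query, doc_vector):
    """Inner product between two sparse vectors stored as token -> score dicts."""
    return sum(w * doc_vector.get(t, 0.0) for t, w in query.items())


def exact_top_k(query, collection, k):
    """Brute-force top-k document ids by inner product (the exact baseline)."""
    scored = ((dot(query, vec), doc_id) for doc_id, vec in collection.items())
    return [doc_id for _, doc_id in heapq.nlargest(k, scored)]


def recall_at_k(approx_ids, exact_ids):
    """Fraction of the exact top-k that the approximate result recovered."""
    if not exact_ids:
        return 1.0
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)


# Made-up toy collection: document id -> sparse vector.
collection = {
    0: {"dog": 2.45, "park": 1.1},
    1: {"cat": 2.0, "sofa": 0.7},
    2: {"dog": 0.5, "cat": 0.5},
}
query = {"dog": 1.0, "cat": 0.2}

exact = exact_top_k(query, collection, k=2)
approx = [0, 1]  # made-up stand-in for ids returned by an approximate index
recall = recall_at_k(approx, exact)
print(f"recall@2 = {recall}")
```

In practice the approx list would come from the index's search results, and the brute-force baseline is only feasible on small samples.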

Resources

Check out our docs folder for detailed guides.

Bibliography

  1. Sebastian Bruch, Franco Maria Nardini, Cosimo Rulli, and Rossano Venturini. "Efficient Inverted Indexes for Approximate Retrieval over Learned Sparse Representations." Proc. ACM SIGIR. 2024.
  2. Sebastian Bruch, Franco Maria Nardini, Cosimo Rulli, and Rossano Venturini. "Pairing Clustered Inverted Indexes with κ-NN Graphs for Fast Approximate Retrieval over Learned Sparse Representations." Proc. ACM CIKM. 2024.
  3. Sebastian Bruch, Franco Maria Nardini, Cosimo Rulli, Rossano Venturini, and Leonardo Venuta. "Investigating the Scalability of Approximate Sparse Retrieval Algorithms to Massive Datasets." Proc. ECIR. 2025.
  4. Sebastian Bruch, Martino Fontana, Franco Maria Nardini, Cosimo Rulli, and Rossano Venturini. "Forward Index Compression for Learned Sparse Retrieval." Proc. ECIR. 2026 (to appear).

SIGIR 2024

@inproceedings{bruch2024seismic,
  author    = {Bruch, Sebastian and Nardini, Franco Maria and Rulli, Cosimo and Venturini, Rossano},
  title     = {Efficient Inverted Indexes for Approximate Retrieval over Learned Sparse Representations},
  booktitle = {Proceedings of the 47th International {ACM} {SIGIR} {C}onference on Research and Development in Information Retrieval ({SIGIR})},
  pages     = {152--162},
  publisher = {{ACM}},
  year      = {2024},
  url       = {https://doi.org/10.1145/3626772.3657769},
  doi       = {10.1145/3626772.3657769}
}

CIKM 2024

@inproceedings{bruch2024pairing,
  author    = {Bruch, Sebastian and Nardini, Franco Maria and Rulli, Cosimo and Venturini, Rossano},
  title     = {Pairing Clustered Inverted Indexes with $\kappa$-NN Graphs for Fast Approximate Retrieval over Learned Sparse Representations},
  booktitle = {Proceedings of the 33rd International {ACM} {C}onference on {I}nformation and {K}nowledge {M}anagement ({CIKM})},
  pages     = {3642--3646},
  publisher = {{ACM}},
  year      = {2024},
  url       = {https://doi.org/10.1145/3627673.3679977},
  doi       = {10.1145/3627673.3679977}
}

ECIR 2025

@inproceedings{bruch2025investigating,
  author    = {Bruch, Sebastian and Nardini, Franco Maria and Rulli, Cosimo and Venturini, Rossano and Venuta, Leonardo},
  title     = {Investigating the Scalability of Approximate Sparse Retrieval Algorithms to Massive Datasets},
  booktitle = {Advances in Information Retrieval},
  pages     = {437--445},
  publisher = {Springer Nature Switzerland},
  year      = {2025},
  url       = {https://doi.org/10.1007/978-3-031-88714-7_43},
  doi       = {10.1007/978-3-031-88714-7_43}
}

ECIR 2026 (Accepted, to appear)

@article{bruch2026forward,
  title={Forward Index Compression for Learned Sparse Retrieval},
  author={Bruch, Sebastian and Fontana, Martino and Nardini, Franco Maria and Rulli, Cosimo and Venturini, Rossano},
  journal={European Conference on Information Retrieval 2026 (to appear)},
  year={2026}
}

Journal of ACM (Under Review)

@article{bruch2025efficient,
  title={Efficient Sketching and Nearest Neighbor Search Algorithms for Sparse Vector Sets},
  author={Bruch, Sebastian and Nardini, Franco Maria and Rulli, Cosimo and Venturini, Rossano},
  journal={arXiv preprint arXiv:2509.24815},
  year={2025}
}

Citation License

The source code in this repository is subject to the following citation license:

By downloading and using this software, you agree to cite the papers listed in the Bibliography section above in any material you produce where it was used to conduct a search or experimentation, whether it be a research paper, dissertation, article, poster, presentation, or documentation. By using this software, you have agreed to the citation license.
