Seismic: A high-performance data structure for fast retrieval over learned sparse representations.

Seismic

Seismic is a highly efficient data structure, written in Rust 🦀, for fast retrieval over learned sparse embeddings. Designed with scalability and performance in mind, Seismic makes querying learned sparse representations seamless.

Details on how to use Seismic's core engine in Rust 🦀 can be found in docs/RustUsage.md.

The instructions below explain how to use Seismic through its Python API.

⚡ Installation

To install Seismic, run:

pip install pyseismic-lsr

Check out the detailed installation guide in docs/Installation.md for performance optimizations.

🚀 Quick Start

Given a collection stored as a jsonl file, you can quickly index it by running:

from seismic import SeismicIndex

json_input_file = "" # Your data collection

index = SeismicIndex.build(json_input_file)
print("Number of documents: ", index.len)
print("Avg number of non-zero components: ", index.nnz / index.len)
print("Dimensionality of the vectors: ", index.dim)

index.print_space_usage_byte()

You can then use Seismic to quickly retrieve the top results for your queries:

import numpy as np

MAX_TOKEN_LEN = 30

string_type  = f'U{MAX_TOKEN_LEN}'

query = {"a": 3.5, "certain": 3.5, "query": 0.4}
query_id = "0"
query_components = np.array(list(query.keys()), dtype=string_type)
query_values = np.array(list(query.values()), dtype=np.float32)

results = index.search(
    query_id=query_id,
    query_components=query_components,
    query_values=query_values,
    k=10, 
    query_cut=3, 
    heap_factor=0.8,
)
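When queries are stored one per line as JSON objects with `id` and `vector` fields (the layout described in the Data Format section), a small helper can split each line into the pieces that the search call above expects. The following is a minimal sketch under that assumption; `parse_query_line` is a hypothetical helper name, not part of the Seismic API.

```python
import json


def parse_query_line(line):
    """Split one JSON query line into (query_id, components, values).

    Hypothetical helper: assumes the line has an `id` field and a
    `vector` field mapping tokens to scores.
    """
    record = json.loads(line)
    vector = record["vector"]
    query_id = str(record["id"])  # the quick start passes the ID as a string
    components = list(vector.keys())
    values = [float(v) for v in vector.values()]
    return query_id, components, values


line = '{"id": 0, "vector": {"a": 3.5, "certain": 3.5, "query": 0.4}}'
query_id, components, values = parse_query_line(line)
print(query_id, components, values)
```

The returned lists can then be wrapped with `np.array(components, dtype=string_type)` and `np.array(values, dtype=np.float32)`, as in the snippet above, before being passed to `index.search`.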

📥 Download the Datasets

The embeddings in jsonl format for several encoders and several datasets can be downloaded from this HuggingFace repository, together with the query representations.

As an example, the Splade embeddings for MSMARCO can be downloaded and extracted by running the following commands.

wget "https://huggingface.co/datasets/tuskanny/seismic-msmarco-splade/resolve/main/documents.tar.gz?download=true" -O documents.tar.gz

tar -xvzf documents.tar.gz

Alternatively, you can use the HuggingFace dataset download tool.

📄 Data Format

Documents and queries should have the following format. Each line should be a JSON-formatted string with the following fields:

  • id: must represent the ID of the document as an integer.
  • content: the original content of the document, as a string. This field is optional.
  • vector: a dictionary where each key represents a token, and its corresponding value is the score, e.g., {"dog": 2.45}.

This is the standard output format of several libraries to train sparse models, such as learned-sparse-retrieval.
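As an illustration of the format above, the sketch below writes a tiny collection to a jsonl file and reads it back. The sample documents, the `validate` helper, and the output file name are made up for this example; only the three field names come from the description above.

```python
import json
import os
import tempfile

# Two made-up documents; `content` is optional, so the second omits it.
documents = [
    {"id": 0, "content": "a dog barks", "vector": {"dog": 2.45, "bark": 1.1}},
    {"id": 1, "vector": {"cat": 1.9}},
]


def validate(record):
    """Hypothetical check that a record matches the fields described above."""
    if not isinstance(record.get("id"), int):
        raise ValueError("id must be an integer")
    if "content" in record and not isinstance(record["content"], str):
        raise ValueError("content must be a string when present")
    vector = record.get("vector")
    if not isinstance(vector, dict):
        raise ValueError("vector must be a dictionary of token -> score")
    for token, score in vector.items():
        if not isinstance(token, str) or not isinstance(score, (int, float)):
            raise ValueError("vector entries must map a string to a number")


# Write one JSON object per line.
path = os.path.join(tempfile.mkdtemp(), "documents.jsonl")
with open(path, "w", encoding="utf-8") as f:
    for doc in documents:
        validate(doc)
        f.write(json.dumps(doc) + "\n")

# Read the file back, one record per line.
with open(path, encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]

print(len(loaded), "documents written to", path)
```

A file produced this way has the shape that `SeismicIndex.build` is given in the quick start, though a real collection would come from a sparse encoder rather than hand-written scores.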

The script convert_json_to_inner_format.py converts files in this format into Seismic's inner format.

python scripts/convert_json_to_inner_format.py --document-path /path/to/document.jsonl --queries-path /path/to/queries.jsonl --output-dir /path/to/output 

This will generate a data directory at the /path/to/output path, with documents.bin and queries.bin binary files inside.

If you download the NQ dataset from the HuggingFace repo, you need to specify --input-format nq as it uses a slightly different format.

Resources

Check out our docs folder for more detailed guides on how to use Seismic directly in Rust, replicate the results of our papers, or use Seismic with your custom collection.

📚 Bibliography

  1. Sebastian Bruch, Franco Maria Nardini, Cosimo Rulli, and Rossano Venturini. "Efficient Inverted Indexes for Approximate Retrieval over Learned Sparse Representations." Proc. ACM SIGIR. 2024.
  2. Sebastian Bruch, Franco Maria Nardini, Cosimo Rulli, and Rossano Venturini. "Pairing Clustered Inverted Indexes with κ-NN Graphs for Fast Approximate Retrieval over Learned Sparse Representations." Proc. ACM CIKM. 2024.
  3. Sebastian Bruch, Franco Maria Nardini, Cosimo Rulli, Rossano Venturini, and Leonardo Venuta. "Investigating the Scalability of Approximate Sparse Retrieval Algorithms to Massive Datasets." Proc. ECIR. 2025. To Appear.

Citation License

The source code in this repository is subject to the following citation license:

By downloading and using this software, you agree to cite the under-noted paper in any kind of material you produce where it was used to conduct a search or experimentation, whether be it a research paper, dissertation, article, poster, presentation, or documentation. By using this software, you have agreed to the citation license.

SIGIR 2024

@inproceedings{bruch2024seismic,
  author    = {Bruch, Sebastian and Nardini, Franco Maria and Rulli, Cosimo and Venturini, Rossano},
  title     = {Efficient Inverted Indexes for Approximate Retrieval over Learned Sparse Representations},
  booktitle = {Proceedings of the 47th International {ACM} {SIGIR} {C}onference on Research and Development in Information Retrieval ({SIGIR})},
  pages     = {152--162},
  publisher = {{ACM}},
  year      = {2024},
  url       = {https://doi.org/10.1145/3626772.3657769},
  doi       = {10.1145/3626772.3657769}
}

CIKM 2024

@inproceedings{bruch2024pairing,
  author    = {Bruch, Sebastian and Nardini, Franco Maria and Rulli, Cosimo and Venturini, Rossano and Venuta, Leonardo},
  title     = {Pairing Clustered Inverted Indexes with $\kappa$-NN Graphs for Fast Approximate Retrieval over Learned Sparse Representations},
  booktitle = {Proceedings of the 33rd International {ACM} {C}onference on {I}nformation and {K}nowledge {M}anagement ({CIKM})},
  pages     = {3642--3646},
  publisher = {{ACM}},
  year      = {2024},
  url       = {https://doi.org/10.1145/3627673.3679977},
  doi       = {10.1145/3627673.3679977}
}
