Impact Index for Information Retrieval
This package is a library that implements efficient algorithms for sparse representations from neural information retrieval systems. Unlike other libraries, it specifically targets neural IR models and does not assume any quantization. It also does not implement "standard" IR algorithms based on term frequencies.
Installation
pip install maturin
maturin develop --release
Python API
The library exposes the impact_index Python module for building and searching sparse indices.
Basic Usage
import numpy as np
from impact_index import IndexBuilder, Index
# Build an index
builder = IndexBuilder("/path/to/index")
# Add documents: docid, term_indices (numpy array), impact_values (numpy array)
builder.add(0, np.array([1, 5, 10]), np.array([0.5, 1.2, 0.8]))
builder.add(1, np.array([2, 5, 8]), np.array([0.3, 0.9, 1.1]))
index = builder.build(in_memory=True)
# Search
query = {5: 1.0, 10: 0.5} # {term_index: weight}
results = index.search_wand(query, top_k=10)
for doc in results:
    print(f"Document {doc.docid}: {doc.score}")
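To make the expected ranking concrete, the sketch below scores the same two documents exhaustively in plain Python. It assumes that the retrieval score is the dot product between query weights and document impacts, which is the usual definition for impact indices but is not stated explicitly above. The helper brute_force_search is hypothetical and is not part of impact_index.

```python
import heapq


def brute_force_search(documents, query, top_k):
    """Score every document exhaustively and return the top_k (docid, score) pairs.

    Assumption: score = sum over query terms of weight * impact.
    """
    scored = []
    for docid, impacts in documents.items():
        score = sum(weight * impacts.get(term, 0.0) for term, weight in query.items())
        if score > 0.0:
            scored.append((docid, score))
    # Highest scores first, truncated to top_k entries
    return heapq.nlargest(top_k, scored, key=lambda pair: pair[1])


# Same documents and query as in the example above, stored as plain dicts
documents = {
    0: {1: 0.5, 5: 1.2, 10: 0.8},
    1: {2: 0.3, 5: 0.9, 8: 1.1},
}
query = {5: 1.0, 10: 0.5}

for docid, score in brute_force_search(documents, query, top_k=10):
    print(f"Document {docid}: {score:.2f}")
```

Under this assumption document 0 ranks ahead of document 1, since it matches both query terms. An exact dynamic-pruning search should return the same ranking while skipping documents that cannot enter the top k.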
Classes
BuilderOptions
Options for index construction.
from impact_index import BuilderOptions
options = BuilderOptions()
options.checkpoint_frequency = 100000 # Checkpoint every N documents
options.in_memory_threshold = 1000000 # In-memory threshold
IndexBuilder
Builds a sparse index from documents.
IndexBuilder(folder: str, options: BuilderOptions = None)
Methods:
- add(docid: int, terms: np.ndarray[int], values: np.ndarray[float]) - Add a document with term indices and impact values
- get_checkpoint_doc_id() -> int | None - Get the last checkpointed document ID (useful for resuming indexing)
- build(in_memory: bool) -> Index - Finalize and return the index
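One way to use the checkpoint when resuming an interrupted run is sketched below. The helper resume_indexing is hypothetical. It assumes that documents are added in increasing docid order and that every document up to and including the checkpointed ID is already stored; neither assumption is confirmed by the description above, so check the library's behavior before relying on it.

```python
def resume_indexing(builder, documents):
    """Add only the documents that come after the builder's last checkpoint.

    `documents` yields (docid, terms, values) tuples in increasing docid order.
    Assumption: documents up to and including the checkpointed ID are already
    stored, so they are skipped here.
    """
    last_done = builder.get_checkpoint_doc_id()
    added = 0
    for docid, terms, values in documents:
        if last_done is not None and docid <= last_done:
            continue  # already covered by the checkpoint
        builder.add(docid, terms, values)
        added += 1
    return added
```

The function only relies on the add and get_checkpoint_doc_id methods listed above, so it accepts any object that provides them.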
Index
A sparse index supporting efficient top-k retrieval.
Index.load(folder: str, in_memory: bool) -> Index
Search Methods:
- search_wand(query: dict, top_k: int) -> list[ScoredDocument] - WAND algorithm
- search_maxscore(query: dict, top_k: int) -> list[ScoredDocument] - MaxScore algorithm
- aio_search_wand(query: dict, top_k: int) -> Coroutine - Async WAND search
- aio_search_maxscore(query: dict, top_k: int) -> Coroutine - Async MaxScore search
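The async variants can be combined with asyncio to run several queries concurrently. The following is a minimal sketch; search_many is a hypothetical helper, and it assumes only that aio_search_wand returns an awaitable, as the signature above suggests.

```python
import asyncio


async def search_many(index, queries, top_k):
    """Run several WAND searches concurrently and return results in query order."""
    tasks = [index.aio_search_wand(query, top_k) for query in queries]
    # gather preserves the order of the awaitables it is given
    return await asyncio.gather(*tasks)
```

A typical call would be asyncio.run(search_many(index, [query_a, query_b], top_k=10)), where query_a and query_b are {term_index: weight} dictionaries.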
Other Methods:
- postings(term: int) -> SparseIndexIterator - Get posting list iterator for a term
- num_postings() -> int - Total number of posting lists
- to_bmp(output: str, bsize: int, compress_range: bool) - Convert to BMP format
- to_bmp_streaming(output: str, bsize: int, compress_range: bool) - Memory-efficient BMP conversion
SparseIndexIterator
Iterator over a term's posting list.
iterator = index.postings(term_id)
print(f"Length: {iterator.length()}")
print(f"Max impact: {iterator.max_value()}")
print(f"Max doc ID: {iterator.max_doc_id()}")
for posting in iterator:
    print(f"Doc {posting.docid}: {posting.value}")
ScoredDocument
Search result with document ID and score.
- docid: int - Document identifier
- score: float - Retrieval score
Compression and Transforms
Apply compression to reduce index size:
from impact_index import (
    Index, CompressionTransform,
    EliasFanoCompressor, ImpactQuantizer, GlobalImpactQuantizer
)
index = Index.load("/path/to/raw_index", in_memory=True)
# Create compressors
docid_compressor = EliasFanoCompressor()
impact_compressor = ImpactQuantizer(nbits=8, min=0.0, max=10.0)
# Or use global quantization:
# impact_compressor = GlobalImpactQuantizer(nbits=8)
# Apply compression
transform = CompressionTransform(
    max_block_size=128,
    doc_ids_compressor=docid_compressor,
    impacts_compressor=impact_compressor
)
transform.process("/path/to/compressed_index", index)
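To give an idea of what quantizing impacts to nbits over a [min, max] range involves, here is a sketch of uniform scalar quantization in plain Python. It illustrates the general technique only: the functions quantize and dequantize are hypothetical, and the exact mapping used by ImpactQuantizer may differ.

```python
def quantize(value, nbits, lo, hi):
    """Map a float in [lo, hi] to an integer code in [0, 2**nbits - 1].

    Assumption: uniform (linear) quantization with clipping to the range.
    """
    levels = (1 << nbits) - 1
    clipped = min(max(value, lo), hi)
    return round((clipped - lo) / (hi - lo) * levels)


def dequantize(code, nbits, lo, hi):
    """Map an integer code back to the centre value it represents."""
    levels = (1 << nbits) - 1
    return lo + code / levels * (hi - lo)


# Same parameters as the ImpactQuantizer example above
code = quantize(1.2, nbits=8, lo=0.0, hi=10.0)
approx = dequantize(code, nbits=8, lo=0.0, hi=10.0)
print(code, round(approx, 4))
```

With 8 bits over [0, 10] the step between adjacent codes is 10 / 255, so under this scheme the reconstruction error for in-range values is at most half of that step. Values outside the range are clipped, which is why the choice of min and max matters.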
SplitIndexTransform
Split index by impact quantiles for tiered retrieval:
from impact_index import SplitIndexTransform, CompressionTransform
base_transform = CompressionTransform(128, docid_compressor, impact_compressor)
split_transform = SplitIndexTransform(
    quantiles=[0.5, 0.9],  # Split at 50th and 90th percentile
    sink=base_transform
)
split_transform.process("/path/to/split_index", index)
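The idea of splitting by impact quantiles can be illustrated on a single posting list. The sketch below is an assumption about the general technique, not a description of how SplitIndexTransform is implemented: split_by_quantiles is a hypothetical helper that computes thresholds from the sorted impact values and assigns each posting to a tier.

```python
def split_by_quantiles(postings, quantiles):
    """Partition (docid, impact) postings into len(quantiles) + 1 tiers.

    Tier 0 holds the lowest impacts, the last tier the highest.
    Assumption: thresholds are taken from the sorted impact values.
    """
    values = sorted(value for _, value in postings)
    last = len(values) - 1
    thresholds = [values[min(int(q * len(values)), last)] for q in quantiles]
    tiers = [[] for _ in range(len(quantiles) + 1)]
    for docid, value in postings:
        # The tier index is the number of thresholds the impact reaches
        tier = sum(value >= threshold for threshold in thresholds)
        tiers[tier].append((docid, value))
    return tiers


postings = [(docid, float(docid)) for docid in range(10)]
tiers = split_by_quantiles(postings, quantiles=[0.5, 0.9])
print([len(tier) for tier in tiers])
```

In a tiered setup, a search can start with the highest-impact tier and consult lower tiers only when needed.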
BMP (Block-Max Pruning) Search
Fast approximate search using the BMP algorithm from the BMP repository, which implements "Faster Learned Sparse Retrieval with Block-Max Pruning" (SIGIR 2024).
from impact_index import Index, BmpSearcher
# Convert an existing index to BMP format
index = Index.load("/path/to/index", in_memory=True)
index.to_bmp_streaming("/path/to/bmp_index.bin", bsize=64, compress_range=True)
# Load and search with BMP
searcher = BmpSearcher("/path/to/bmp_index.bin")
print(f"Documents: {searcher.num_documents()}")
# Query uses string term IDs
query = {"term1": 1.0, "term2": 0.5}
doc_ids, scores = searcher.search(query, k=10, alpha=1.0, beta=1.0)
for docid, score in zip(doc_ids, scores):
    print(f"{docid}: {score}")
BMP Conversion Methods:
- to_bmp_streaming(output, bsize, compress_range) - Recommended. Memory-efficient streaming conversion using O(num_terms × num_blocks) memory
- to_bmp(output, bsize, compress_range) - Legacy method using O(total_postings) memory
BMP Search Parameters:
- k - Number of results to return
- alpha - Controls early termination aggressiveness (default: 1.0)
- beta - Controls block skipping (default: 1.0)
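The block-max idea behind BMP can be sketched in a few lines of plain Python: documents are grouped into blocks of bsize consecutive IDs, each term stores its maximum impact per block, and a query-time upper bound per block tells the searcher which blocks are worth scoring. The helpers block_maxima and block_upper_bounds are hypothetical, assume non-negative impacts and weights, and do not model the alpha and beta parameters.

```python
def block_maxima(postings, bsize, num_docs):
    """Return the maximum impact of one term in each block of bsize documents."""
    num_blocks = (num_docs + bsize - 1) // bsize
    maxima = [0.0] * num_blocks
    for docid, value in postings:
        block = docid // bsize
        maxima[block] = max(maxima[block], value)
    return maxima


def block_upper_bounds(query, term_maxima, num_blocks):
    """Upper-bound the score of any document in each block for a query."""
    bounds = [0.0] * num_blocks
    for term, weight in query.items():
        for block, value in enumerate(term_maxima.get(term, [])):
            bounds[block] += weight * value
    return bounds


# Two terms over 8 documents, with blocks of 4 documents
term_maxima = {
    "a": block_maxima([(0, 0.5), (3, 1.5), (5, 0.2)], bsize=4, num_docs=8),
    "b": block_maxima([(6, 2.0)], bsize=4, num_docs=8),
}
bounds = block_upper_bounds({"a": 1.0, "b": 0.5}, term_maxima, num_blocks=2)
print(bounds)
```

A block whose upper bound is below the current k-th best score cannot contribute a result and can be skipped. Storing one maximum per term and block is also consistent with the O(num_terms × num_blocks) memory figure quoted for the streaming conversion.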