Impact Index for Information Retrieval

This library implements efficient algorithms for the sparse representations produced by neural information retrieval (IR) systems. Unlike other libraries, it specifically targets neural IR models and does not assume that impact values are quantized. It also does not implement "standard" IR algorithms based on term frequencies.

Installation

Build and install the package from a source checkout with maturin:

pip install maturin
maturin develop --release

Python API

The library exposes the impact_index Python module for building and searching sparse indices.

Basic Usage

import numpy as np
from impact_index import IndexBuilder, Index

# Build an index
builder = IndexBuilder("/path/to/index")
# Add documents: docid, term_indices (numpy array), impact_values (numpy array)
builder.add(0, np.array([1, 5, 10]), np.array([0.5, 1.2, 0.8]))
builder.add(1, np.array([2, 5, 8]), np.array([0.3, 0.9, 1.1]))
index = builder.build(in_memory=True)

# Search
query = {5: 1.0, 10: 0.5}  # {term_index: weight}
results = index.search_wand(query, top_k=10)
for doc in results:
    print(f"Document {doc.docid}: {doc.score}")
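
As a sanity check on the example above, the expected ranking can be worked out by hand, assuming a document's score is the inner product between the query weights and the document impacts (the usual convention for impact indices, but an assumption here). The helper `exhaustive_top_k` is hypothetical and not part of `impact_index`; it scores every document without any pruning:

```python
def exhaustive_top_k(documents, query, top_k):
    """Score every document by inner product and return the top_k (docid, score) pairs.

    documents: {docid: {term_index: impact}}; query: {term_index: weight}.
    """
    scored = []
    for docid, impacts in documents.items():
        shared = [term for term in query if term in impacts]
        if not shared:
            continue  # no query term occurs in this document
        score = sum(query[term] * impacts[term] for term in shared)
        scored.append((docid, score))
    # Highest score first; ties broken by smallest docid for determinism
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:top_k]


# Same two documents and query as in the basic usage example
documents = {
    0: {1: 0.5, 5: 1.2, 10: 0.8},
    1: {2: 0.3, 5: 0.9, 8: 1.1},
}
query = {5: 1.0, 10: 0.5}
results = exhaustive_top_k(documents, query, top_k=10)
```

Under this assumption, document 0 scores 1.2 × 1.0 + 0.8 × 0.5 = 1.6 and document 1 scores 0.9 × 1.0 = 0.9, so document 0 should be ranked first.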

Classes

BuilderOptions

Options for index construction.

from impact_index import BuilderOptions

options = BuilderOptions()
options.checkpoint_frequency = 100000  # Checkpoint every N documents
options.in_memory_threshold = 1000000  # In-memory threshold

IndexBuilder

Builds a sparse index from documents.

IndexBuilder(folder: str, options: BuilderOptions | None = None)

Methods:

  • add(docid: int, terms: np.ndarray[int], values: np.ndarray[float]) - Add a document with term indices and impact values
  • get_checkpoint_doc_id() -> int | None - Get the last checkpointed document ID (useful for resuming indexing)
  • build(in_memory: bool) -> Index - Finalize and return the index
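
A minimal sketch of how `get_checkpoint_doc_id()` might be used to resume an interrupted indexing run. It assumes that documents are fed in increasing `docid` order and that every document up to and including the checkpointed ID is already stored; neither assumption is confirmed by the description above. The function name `add_with_resume` is hypothetical, and `builder` is any object exposing the `add` and `get_checkpoint_doc_id` methods listed above:

```python
def add_with_resume(builder, documents):
    """Add (docid, terms, values) triples, skipping those at or before the checkpoint.

    `documents` is an iterable yielding docids in increasing order, with
    `terms` and `values` already given as numpy arrays.
    Returns the number of documents actually added.
    """
    last_done = builder.get_checkpoint_doc_id()  # None when there is no checkpoint
    added = 0
    for docid, terms, values in documents:
        if last_done is not None and docid <= last_done:
            continue  # assumed to be already indexed before the interruption
        builder.add(docid, terms, values)
        added += 1
    return added
```

After the loop, `builder.build(in_memory=...)` finalizes the index as in the basic usage example.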

Index

A sparse index supporting efficient top-k retrieval.

Index.load(folder: str, in_memory: bool) -> Index

Search Methods:

  • search_wand(query: dict, top_k: int) -> list[ScoredDocument] - WAND algorithm
  • search_maxscore(query: dict, top_k: int) -> list[ScoredDocument] - MaxScore algorithm
  • aio_search_wand(query: dict, top_k: int) -> Coroutine - Async WAND search
  • aio_search_maxscore(query: dict, top_k: int) -> Coroutine - Async MaxScore search
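
The async variants make it possible to run several queries concurrently from an event loop. The sketch below assumes that `aio_search_wand` returns an awaitable that can be passed to `asyncio.gather`, and that it accepts `query` and `top_k` positionally as in the signature above; `search_many` is a hypothetical helper name:

```python
import asyncio


async def search_many(index, queries, top_k=10):
    """Run one WAND search per query concurrently.

    Results are returned in the same order as `queries`.
    """
    pending = [index.aio_search_wand(query, top_k) for query in queries]
    return await asyncio.gather(*pending)
```

From synchronous code, this could be called as `asyncio.run(search_many(index, [query_a, query_b]))`.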

Other Methods:

  • postings(term: int) -> SparseIndexIterator - Get posting list iterator for a term
  • num_postings() -> int - Total number of posting lists
  • to_bmp(output: str, bsize: int, compress_range: bool) - Convert to BMP format
  • to_bmp_streaming(output: str, bsize: int, compress_range: bool) - Memory-efficient BMP conversion

SparseIndexIterator

Iterator over a term's posting list.

iterator = index.postings(term_id)
print(f"Length: {iterator.length()}")
print(f"Max impact: {iterator.max_value()}")
print(f"Max doc ID: {iterator.max_doc_id()}")

for posting in iterator:
    print(f"Doc {posting.docid}: {posting.value}")
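
Dynamic pruning algorithms such as WAND and MaxScore rely on per-term upper bounds to skip documents that cannot enter the top-k. The sketch below shows how `max_value()` could be used to bound the score of any document for a given query. It assumes inner-product scoring with non-negative weights and impacts, and that every query term has a posting list; `score_upper_bound` is a hypothetical helper, not a library function:

```python
def score_upper_bound(index, query):
    """Upper bound on the score any document can reach for `query`.

    Each term contributes at most weight * (largest impact in its posting list).
    """
    return sum(
        weight * index.postings(term).max_value()
        for term, weight in query.items()
    )
```

If the current k-th best score already exceeds this bound for a subset of terms, documents containing only those terms cannot be part of the results.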

ScoredDocument

Search result with document ID and score.

  • docid: int - Document identifier
  • score: float - Retrieval score

Compression and Transforms

Apply compression to reduce index size:

from impact_index import (
    Index, CompressionTransform,
    EliasFanoCompressor, ImpactQuantizer, GlobalImpactQuantizer
)

index = Index.load("/path/to/raw_index", in_memory=True)

# Create compressors
docid_compressor = EliasFanoCompressor()
impact_compressor = ImpactQuantizer(nbits=8, min=0.0, max=10.0)
# Or use global quantization:
# impact_compressor = GlobalImpactQuantizer(nbits=8)

# Apply compression
transform = CompressionTransform(
    max_block_size=128,
    doc_ids_compressor=docid_compressor,
    impacts_compressor=impact_compressor
)
transform.process("/path/to/compressed_index", index)
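
To give an intuition for what `nbits`, `min` and `max` control, here is a generic uniform scalar quantizer written with NumPy. It is only an illustration of the idea; the library's own quantization scheme may differ in its rounding or in how it treats the zero code. The function names are hypothetical:

```python
import numpy as np


def quantize_impacts(values, nbits, vmin, vmax):
    """Map floats in [vmin, vmax] to integer codes in [0, 2**nbits - 1]."""
    levels = (1 << nbits) - 1
    clipped = np.clip(values, vmin, vmax)  # out-of-range values saturate
    return np.rint((clipped - vmin) / (vmax - vmin) * levels).astype(np.uint32)


def dequantize_impacts(codes, nbits, vmin, vmax):
    """Map integer codes back to approximate float impacts."""
    levels = (1 << nbits) - 1
    return vmin + codes.astype(np.float64) / levels * (vmax - vmin)


values = np.array([0.5, 1.2, 0.8, 9.9])
codes = quantize_impacts(values, nbits=8, vmin=0.0, vmax=10.0)
restored = dequantize_impacts(codes, nbits=8, vmin=0.0, vmax=10.0)
```

With this scheme the reconstruction error of an in-range value is at most half a quantization step, that is (max - min) / (2 × (2^nbits - 1)).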

SplitIndexTransform

Split index by impact quantiles for tiered retrieval:

from impact_index import SplitIndexTransform, CompressionTransform

# Reuses index, docid_compressor and impact_compressor from the previous example
base_transform = CompressionTransform(128, docid_compressor, impact_compressor)
split_transform = SplitIndexTransform(
    quantiles=[0.5, 0.9],  # Split at 50th and 90th percentile
    sink=base_transform
)
split_transform.process("/path/to/split_index", index)
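
The following NumPy sketch illustrates what splitting by impact quantiles could look like for a single posting list. It assumes the quantiles are computed over the impact values of that list and that each posting goes to exactly one tier; how `SplitIndexTransform` actually computes and applies its thresholds is not specified above. `split_by_quantiles` is a hypothetical name:

```python
import numpy as np


def split_by_quantiles(doc_ids, impacts, quantiles):
    """Partition one posting list into len(quantiles) + 1 tiers.

    Tiers are returned from lowest to highest impact, each as (doc_ids, impacts).
    """
    doc_ids = np.asarray(doc_ids)
    impacts = np.asarray(impacts, dtype=float)
    thresholds = np.quantile(impacts, quantiles)
    # Tier of a posting = number of thresholds less than or equal to its impact
    tier = np.searchsorted(thresholds, impacts, side="right")
    return [
        (doc_ids[tier == t], impacts[tier == t])
        for t in range(len(quantiles) + 1)
    ]


doc_ids = np.arange(10)
impacts = np.arange(1, 11) / 10  # 0.1, 0.2, ..., 1.0
tiers = split_by_quantiles(doc_ids, impacts, [0.5, 0.9])
```

A tiered search would typically consult the highest-impact tier first and fall back to the lower tiers only when needed.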

BMP (Block-Max Pruning) Search

Fast approximate search using the BMP algorithm from the BMP repository, which implements "Faster Learned Sparse Retrieval with Block-Max Pruning" (SIGIR 2024).

from impact_index import Index, BmpSearcher

# Convert an existing index to BMP format
index = Index.load("/path/to/index", in_memory=True)
index.to_bmp_streaming("/path/to/bmp_index.bin", bsize=64, compress_range=True)

# Load and search with BMP
searcher = BmpSearcher("/path/to/bmp_index.bin")
print(f"Documents: {searcher.num_documents()}")

# Query uses string term IDs
query = {"term1": 1.0, "term2": 0.5}
doc_ids, scores = searcher.search(query, k=10, alpha=1.0, beta=1.0)
for docid, score in zip(doc_ids, scores):
    print(f"{docid}: {score}")

BMP Conversion Methods:

  • to_bmp_streaming(output, bsize, compress_range) - Recommended. Memory-efficient streaming conversion using O(num_terms × num_blocks) memory
  • to_bmp(output, bsize, compress_range) - Legacy method using O(total_postings) memory

BMP Search Parameters:

  • k - Number of results to return
  • alpha - Controls early termination aggressiveness (default: 1.0)
  • beta - Controls block skipping (default: 1.0)

Download files

Source Distribution

  • impact_index-0.30.0.tar.gz (1.1 MB) - Source

Built Distributions

  • impact_index-0.30.0-cp38-abi3-win_amd64.whl (1.4 MB) - CPython 3.8+, Windows x86-64
  • impact_index-0.30.0-cp38-abi3-manylinux_2_24_x86_64.whl (2.0 MB) - CPython 3.8+, manylinux: glibc 2.24+ x86-64
  • impact_index-0.30.0-cp38-abi3-macosx_11_0_arm64.whl (1.6 MB) - CPython 3.8+, macOS 11.0+ ARM64
  • impact_index-0.30.0-cp38-abi3-macosx_10_12_x86_64.whl (1.8 MB) - CPython 3.8+, macOS 10.12+ x86-64

File details

Details for the file impact_index-0.30.0.tar.gz.

File metadata

  • Size: 1.1 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for impact_index-0.30.0.tar.gz
Algorithm Hash digest
SHA256 cb945fea365e82d8952a39ab36e74fc7ceb34fbd68e8813f69c04b89a8488850
MD5 95682d130329468e15ebee7ec8a94cf9
BLAKE2b-256 ef28cdcccfa2a071c9e52ffb7edf421a5aae7eaa43604d4d76e0bc907f61530b

Provenance

The following attestation bundles were made for impact_index-0.30.0.tar.gz:

Publisher: release.yml on experimaestro/impact-index

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file impact_index-0.30.0-cp38-abi3-win_amd64.whl.

File hashes

Hashes for impact_index-0.30.0-cp38-abi3-win_amd64.whl
Algorithm Hash digest
SHA256 452eb7b82e7ff05b20b577f938c37105523fe4a69dfd388aee071f5d4b6b3868
MD5 1e495f4c0e46d5989151ce446eaf8041
BLAKE2b-256 fa1a3cff5481a9fc2c4430ead5cd897dc72619d406e2d40a455280ae2442d587

Provenance

The following attestation bundles were made for impact_index-0.30.0-cp38-abi3-win_amd64.whl:

Publisher: release.yml on experimaestro/impact-index

File details

Details for the file impact_index-0.30.0-cp38-abi3-manylinux_2_24_x86_64.whl.

File hashes

Hashes for impact_index-0.30.0-cp38-abi3-manylinux_2_24_x86_64.whl
Algorithm Hash digest
SHA256 cd705fc4fc264125e7a96541e972ba31e5d447d11cdfe62e7614c3cd6b34685e
MD5 9c189a527435b2eb72a2679d5f6a7218
BLAKE2b-256 f011eb6174533fc74fb01882e3896faa6d7da3647ef518dc6144de2fa7c1d985

Provenance

The following attestation bundles were made for impact_index-0.30.0-cp38-abi3-manylinux_2_24_x86_64.whl:

Publisher: release.yml on experimaestro/impact-index

File details

Details for the file impact_index-0.30.0-cp38-abi3-macosx_11_0_arm64.whl.

File hashes

Hashes for impact_index-0.30.0-cp38-abi3-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 621797dab633b03b70dfc6f58063399eb370e7fb623befe1f4e771ec5336145d
MD5 38278b0440d0f13c513a98b2c4e1155f
BLAKE2b-256 f4736131d206ba2ef6ba845ea1a022a3c7dcff13e8bb91b682505d83f95f8549

Provenance

The following attestation bundles were made for impact_index-0.30.0-cp38-abi3-macosx_11_0_arm64.whl:

Publisher: release.yml on experimaestro/impact-index

File details

Details for the file impact_index-0.30.0-cp38-abi3-macosx_10_12_x86_64.whl.

File hashes

Hashes for impact_index-0.30.0-cp38-abi3-macosx_10_12_x86_64.whl
Algorithm Hash digest
SHA256 a969764cbbf9dd99740cffef63e9fdc4a0974981c1c0f2a5cdaa1f12377759f2
MD5 956e30daec1d2ef22a10c87822f90374
BLAKE2b-256 6de2e5c747df65d5b3a2b9ef9ef6c9dc0d29f935fe5fe99bf36b40ac5cc94fd2

Provenance

The following attestation bundles were made for impact_index-0.30.0-cp38-abi3-macosx_10_12_x86_64.whl:

Publisher: release.yml on experimaestro/impact-index
