Impact Index for Information Retrieval

This package is a library that implements efficient algorithms for the sparse representations produced by neural information retrieval systems. Unlike other libraries, it specifically targets neural IR models and does not assume any quantization of the impact values. It also does not implement "standard" IR algorithms based on term frequencies.

Installation

# Build from a source checkout and install into the current virtual environment
pip install maturin
maturin develop --release

Python API

The library exposes the impact_index Python module for building and searching sparse indices.

Basic Usage

import numpy as np
from impact_index import IndexBuilder, Index

# Build an index
builder = IndexBuilder("/path/to/index")
# Add documents: docid, term_indices (numpy array), impact_values (numpy array)
builder.add(0, np.array([1, 5, 10]), np.array([0.5, 1.2, 0.8]))
builder.add(1, np.array([2, 5, 8]), np.array([0.3, 0.9, 1.1]))
index = builder.build(in_memory=True)

# Search
query = {5: 1.0, 10: 0.5}  # {term_index: weight}
results = index.search_wand(query, top_k=10)
for doc in results:
    print(f"Document {doc.docid}: {doc.score}")
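
To make the scores concrete, here is a minimal pure-Python sketch of what a top-k search over impacts is usually expected to return: for each document, the inner product between the query weights and the document's impact values. It does not use impact_index. The helper name brute_force_search is hypothetical, and the assumption that search_wand ranks documents by this inner product comes from the usual definition of impact-based retrieval, not from the library's source.

```python
def brute_force_search(documents, query, top_k):
    """Score every document with an inner product and keep the top_k best.

    documents: {docid: {term_index: impact}}, query: {term_index: weight}.
    Hypothetical helper for illustration only, not part of impact_index.
    """
    scored = []
    for docid, impacts in documents.items():
        score = sum(weight * impacts.get(term, 0.0) for term, weight in query.items())
        if score > 0.0:  # documents sharing no term with the query are ignored
            scored.append((docid, score))
    # Best score first; ties are broken by increasing docid.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:top_k]


# Same two documents and query as in the example above.
documents = {
    0: {1: 0.5, 5: 1.2, 10: 0.8},
    1: {2: 0.3, 5: 0.9, 8: 1.1},
}
results = brute_force_search(documents, {5: 1.0, 10: 0.5}, top_k=10)
for docid, score in results:
    print(f"Document {docid}: {score:.2f}")
```

Under this assumption, document 0 scores 1.2 × 1.0 + 0.8 × 0.5 = 1.6 and document 1 scores 0.9 × 1.0 = 0.9. WAND and MaxScore aim at the same ranking while avoiding the full scan.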

Classes

BuilderOptions

Options for index construction.

from impact_index import BuilderOptions

options = BuilderOptions()
options.checkpoint_frequency = 100000  # Checkpoint every N documents
options.in_memory_threshold = 1000000  # In-memory threshold

IndexBuilder

Builds a sparse index from documents.

IndexBuilder(folder: str, options: BuilderOptions = None)

Methods:

  • add(docid: int, terms: np.ndarray[int], values: np.ndarray[float]) - Add a document with term indices and impact values
  • get_checkpoint_doc_id() -> int | None - Get the last checkpointed document ID (useful for resuming indexing)
  • build(in_memory: bool) -> Index - Finalize and return the index
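
A possible way to use get_checkpoint_doc_id() when restarting an interrupted indexing job is sketched below. It relies only on the add and get_checkpoint_doc_id methods listed above. Two points are assumptions: documents are added in increasing docid order, and every docid up to the checkpointed one is already stored. FakeBuilder is a hypothetical stand-in so that the sketch runs without the library; with the real IndexBuilder, terms and values would be numpy arrays.

```python
def resume_indexing(builder, documents):
    """Add documents to builder, skipping those covered by the last checkpoint.

    documents yields (docid, terms, values) in increasing docid order.
    Assumption: every docid up to the checkpointed one is already stored.
    """
    last = builder.get_checkpoint_doc_id()
    first_missing = 0 if last is None else last + 1
    added = 0
    for docid, terms, values in documents:
        if docid < first_missing:
            continue  # already indexed before the interruption
        builder.add(docid, terms, values)
        added += 1
    return added


class FakeBuilder:
    """Stand-in exposing the two methods used above (not part of impact_index)."""

    def __init__(self, checkpoint=None):
        self.checkpoint = checkpoint
        self.added = []

    def get_checkpoint_doc_id(self):
        return self.checkpoint

    def add(self, docid, terms, values):
        self.added.append(docid)


docs = [(i, [1, 5], [0.5, 1.2]) for i in range(5)]
builder = FakeBuilder(checkpoint=2)
print(resume_indexing(builder, docs), builder.added)
```

With a checkpoint at document 2, only documents 3 and 4 are added again.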

Index

A sparse index supporting efficient top-k retrieval.

Index.load(folder: str, in_memory: bool) -> Index

Search Methods:

  • search_wand(query: dict, top_k: int) -> list[ScoredDocument] - WAND algorithm
  • search_maxscore(query: dict, top_k: int) -> list[ScoredDocument] - MaxScore algorithm
  • aio_search_wand(query: dict, top_k: int) -> Coroutine - Async WAND search
  • aio_search_maxscore(query: dict, top_k: int) -> Coroutine - Async MaxScore search
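
Both WAND and MaxScore prune documents using the maximum impact of each posting list. The following pure-Python sketch illustrates the MaxScore idea only; it is not the library's implementation. The function name maxscore_search and the dictionary-based posting lists are hypothetical, and impacts and query weights are assumed to be positive.

```python
import heapq


def maxscore_search(postings, query, top_k):
    """Top-k retrieval with MaxScore-style pruning (illustrative sketch).

    postings: {term: {docid: impact}}, query: {term: weight}, all positive.
    Returns [(docid, score)] sorted by decreasing score.
    """
    if top_k <= 0:
        return []
    terms = [t for t in query if postings.get(t)]
    # Upper bound of each term's contribution to any document score.
    bound = {t: query[t] * max(postings[t].values()) for t in terms}
    terms.sort(key=lambda t: bound[t])
    # prefix[i] is the total bound of the i weakest terms.
    prefix = [0.0]
    for t in terms:
        prefix.append(prefix[-1] + bound[t])

    heap = []  # min-heap of (score, docid) holding the current top_k
    threshold = float("-inf")  # score to beat once the heap is full
    first_essential = 0  # terms[:first_essential] are non-essential
    candidates = sorted(set().union(*(postings[t] for t in terms)))
    for docid in candidates:
        essential = terms[first_essential:]
        # A document found only in non-essential lists cannot beat the threshold.
        if not any(docid in postings[t] for t in essential):
            continue
        score = sum(query[t] * postings[t].get(docid, 0.0) for t in essential)
        # Add non-essential terms, strongest first, while the document can still win.
        for i in range(first_essential - 1, -1, -1):
            if score + prefix[i + 1] <= threshold:
                break
            score += query[terms[i]] * postings[terms[i]].get(docid, 0.0)
        if score > threshold:
            if len(heap) < top_k:
                heapq.heappush(heap, (score, docid))
            else:
                heapq.heapreplace(heap, (score, docid))
            if len(heap) == top_k:
                threshold = heap[0][0]
                # Terms whose cumulated bound is below the threshold become non-essential.
                while (first_essential < len(terms)
                       and prefix[first_essential + 1] <= threshold):
                    first_essential += 1
    return [(d, s) for s, d in sorted(heap, key=lambda e: (-e[0], e[1]))]


postings = {
    5: {0: 1.2, 1: 0.9, 3: 0.1},
    10: {0: 0.8, 2: 0.2, 3: 2.0},
}
for docid, score in maxscore_search(postings, {5: 1.0, 10: 0.5}, top_k=2):
    print(f"Document {docid}: {score:.2f}")
```

For brevity the sketch visits every candidate document in docid order; an actual implementation would instead skip forward in the posting lists, which is where the speed-up comes from.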

Other Methods:

  • postings(term: int) -> SparseIndexIterator - Get posting list iterator for a term
  • num_postings() -> int - Total number of posting lists
  • to_bmp(output: str, bsize: int, compress_range: bool) - Convert to BMP format
  • to_bmp_streaming(output: str, bsize: int, compress_range: bool) - Memory-efficient BMP conversion

SparseIndexIterator

Iterator over a term's posting list.

iterator = index.postings(term_id)
print(f"Length: {iterator.length()}")
print(f"Max impact: {iterator.max_value()}")
print(f"Max doc ID: {iterator.max_doc_id()}")

for posting in iterator:
    print(f"Doc {posting.docid}: {posting.value}")
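
The three statistics printed above can be read as follows; the maximum impact is the quantity that pruning algorithms such as WAND and MaxScore use as an upper bound on a term's contribution. The sketch below is a hypothetical in-memory stand-in that mirrors the method names of SparseIndexIterator, assuming a non-empty posting list ordered by document ID. It is not the library's class.

```python
from dataclasses import dataclass


@dataclass
class Posting:
    docid: int
    value: float


class ListPostingIterator:
    """Hypothetical stand-in mirroring the methods shown above."""

    def __init__(self, postings):
        # Posting lists are assumed to be ordered by document ID.
        self._postings = sorted(postings, key=lambda p: p.docid)

    def length(self):
        return len(self._postings)

    def max_value(self):
        return max(p.value for p in self._postings)

    def max_doc_id(self):
        return self._postings[-1].docid

    def __iter__(self):
        return iter(self._postings)


# Posting list of term 5 for the two documents of the Basic Usage example.
iterator = ListPostingIterator([Posting(1, 0.9), Posting(0, 1.2)])
print(iterator.length(), iterator.max_value(), iterator.max_doc_id())
for posting in iterator:
    print(f"Doc {posting.docid}: {posting.value}")
```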

ScoredDocument

Search result with document ID and score.

  • docid: int - Document identifier
  • score: float - Retrieval score

Compression and Transforms

Apply compression to reduce index size:

from impact_index import (
    Index, CompressionTransform,
    EliasFanoCompressor, ImpactQuantizer, GlobalImpactQuantizer
)

index = Index.load("/path/to/raw_index", in_memory=True)

# Create compressors
docid_compressor = EliasFanoCompressor()
impact_compressor = ImpactQuantizer(nbits=8, min=0.0, max=10.0)
# Or use global quantization:
# impact_compressor = GlobalImpactQuantizer(nbits=8)

# Apply compression
transform = CompressionTransform(
    max_block_size=128,
    doc_ids_compressor=docid_compressor,
    impacts_compressor=impact_compressor
)
transform.process("/path/to/compressed_index", index)
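
To give an intuition of what the two compressors store, the sketch below shows uniform quantization of impacts and a plain Elias-Fano encoding of sorted document IDs. Both are textbook versions written in pure Python. The function names are hypothetical, and the assumption that ImpactQuantizer uses a uniform grid between min and max is ours; the library's actual bit layout and rounding may differ.

```python
def quantize(value, nbits, lo, hi):
    """Map a value in [lo, hi] to an integer level in [0, 2**nbits - 1]."""
    levels = (1 << nbits) - 1
    clipped = min(max(value, lo), hi)
    return round((clipped - lo) / (hi - lo) * levels)


def dequantize(level, nbits, lo, hi):
    """Return the grid value represented by an integer level."""
    levels = (1 << nbits) - 1
    return lo + level / levels * (hi - lo)


def elias_fano_encode(docids, universe):
    """Encode a sorted list of integers taken from [0, universe)."""
    n = len(docids)
    # Number of low bits: floor(log2(universe / n)), at least 0.
    low_bits = max(0, (universe // n).bit_length() - 1) if n else 0
    lows = [d & ((1 << low_bits) - 1) for d in docids]
    highs = []  # unary code: one 0 per increment of the high part, one 1 per value
    previous = 0
    for d in docids:
        high = d >> low_bits
        highs.extend([0] * (high - previous))
        highs.append(1)
        previous = high
    return low_bits, lows, highs


def elias_fano_decode(low_bits, lows, highs):
    """Rebuild the sorted integers from their low and high parts."""
    docids, high, i = [], 0, 0
    for bit in highs:
        if bit:
            docids.append((high << low_bits) | lows[i])
            i += 1
        else:
            high += 1
    return docids


level = quantize(1.2, nbits=8, lo=0.0, hi=10.0)
print(level, dequantize(level, nbits=8, lo=0.0, hi=10.0))

docids = [3, 4, 7, 13, 14, 15, 21, 43]
low_bits, lows, highs = elias_fano_encode(docids, universe=64)
print(low_bits, len(highs), elias_fano_decode(low_bits, lows, highs) == docids)
```

With 8 bits over [0, 10], the reconstruction error of an impact is at most half a grid step, that is 10 / 255 / 2 ≈ 0.02.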

SplitIndexTransform

Split index by impact quantiles for tiered retrieval:

from impact_index import SplitIndexTransform, CompressionTransform

# docid_compressor, impact_compressor and index are defined as in the previous example
base_transform = CompressionTransform(128, docid_compressor, impact_compressor)
split_transform = SplitIndexTransform(
    quantiles=[0.5, 0.9],  # Split at 50th and 90th percentile
    sink=base_transform
)
split_transform.process("/path/to/split_index", index)
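
One way to picture the split is sketched below: the impact values of a posting list are cut at the requested quantiles, and each posting goes to the tier that matches its impact. This is only an illustration under assumptions. The function name split_by_quantiles, the nearest-rank quantile computation, and the choice of computing quantiles per posting list are all hypothetical; SplitIndexTransform may proceed differently.

```python
def split_by_quantiles(postings, quantiles):
    """Partition (docid, impact) pairs into len(quantiles) + 1 tiers.

    Tier 0 holds the lowest impacts and the last tier the highest ones.
    Hypothetical helper for illustration only, not part of impact_index.
    """
    values = sorted(value for _, value in postings)
    # Nearest-rank thresholds: the impact found at each quantile position.
    thresholds = [
        values[min(len(values) - 1, int(q * len(values)))] for q in quantiles
    ]
    tiers = [[] for _ in range(len(quantiles) + 1)]
    for docid, value in postings:
        tier = sum(value >= t for t in thresholds)  # number of thresholds reached
        tiers[tier].append((docid, value))
    return thresholds, tiers


# Ten postings with impacts 1.0, 2.0, ..., 10.0
postings = [(docid, float(docid + 1)) for docid in range(10)]
thresholds, tiers = split_by_quantiles(postings, [0.5, 0.9])
print(thresholds, [len(tier) for tier in tiers])
```

A tiered search can then start with the high-impact tier, which is small, and only visit the lower tiers when the top results are not yet settled.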

BMP (Block-Max Pruning) Search

Fast approximate search using the BMP algorithm from the BMP repository, which implements "Faster Learned Sparse Retrieval with Block-Max Pruning" (SIGIR 2024).

from impact_index import Index, BmpSearcher

# Convert an existing index to BMP format
index = Index.load("/path/to/index", in_memory=True)
index.to_bmp_streaming("/path/to/bmp_index.bin", bsize=64, compress_range=True)

# Load and search with BMP
searcher = BmpSearcher("/path/to/bmp_index.bin")
print(f"Documents: {searcher.num_documents()}")

# Query uses string term IDs
query = {"term1": 1.0, "term2": 0.5}
doc_ids, scores = searcher.search(query, k=10, alpha=1.0, beta=1.0)
for docid, score in zip(doc_ids, scores):
    print(f"{docid}: {score}")

BMP Conversion Methods:

  • to_bmp_streaming(output, bsize, compress_range) - Recommended. Memory-efficient streaming conversion using O(num_terms × num_blocks) memory
  • to_bmp(output, bsize, compress_range) - Legacy method using O(total_postings) memory

BMP Search Parameters:

  • k - Number of results to return
  • alpha - Controls early termination aggressiveness (default: 1.0)
  • beta - Controls block skipping (default: 1.0)
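
The principle behind block-max pruning can be sketched in a few lines of pure Python: documents are grouped into blocks of bsize consecutive IDs, each term stores its maximum impact per block, and blocks are visited by decreasing upper bound until no remaining block can improve the top k. The code below is an illustration and not the BMP implementation. All function names are hypothetical, impacts and weights are assumed positive, and the way alpha scales the block upper bound is an assumption about its role; beta is not modeled.

```python
import heapq
from collections import defaultdict


def build_block_max(docs, bsize):
    """docs: {docid: {term: impact}} -> {term: {block: max impact in block}}."""
    block_max = defaultdict(dict)
    for docid, vector in docs.items():
        block = docid // bsize
        for term, impact in vector.items():
            if impact > block_max[term].get(block, 0.0):
                block_max[term][block] = impact
    return block_max


def block_max_search(docs, block_max, bsize, query, k, alpha=1.0):
    """Visit blocks by decreasing upper bound; stop when none can enter the top k."""
    bounds = defaultdict(float)  # block -> upper bound of any score in the block
    for term, weight in query.items():
        for block, max_impact in block_max.get(term, {}).items():
            bounds[block] += weight * max_impact
    heap = []  # min-heap of (score, docid)
    for block, bound in sorted(bounds.items(), key=lambda item: -item[1]):
        # Assumption: alpha < 1 shrinks the bound, so the search stops earlier.
        if len(heap) == k and alpha * bound <= heap[0][0]:
            break
        for docid in range(block * bsize, (block + 1) * bsize):
            vector = docs.get(docid)
            if vector is None:
                continue
            score = sum(w * vector.get(t, 0.0) for t, w in query.items())
            if score <= 0.0:
                continue
            if len(heap) < k:
                heapq.heappush(heap, (score, docid))
            elif score > heap[0][0]:
                heapq.heapreplace(heap, (score, docid))
    ranked = sorted(heap, key=lambda entry: (-entry[0], entry[1]))
    return [d for _, d in ranked], [s for s, _ in ranked]


docs = {
    0: {1: 0.5, 5: 1.2, 10: 0.8},
    1: {2: 0.3, 5: 0.9, 8: 1.1},
    2: {10: 0.2},
    3: {5: 0.1, 10: 2.0},
    4: {8: 0.7},
    5: {5: 0.4},
}
block_max = build_block_max(docs, bsize=2)
doc_ids, scores = block_max_search(docs, block_max, 2, {5: 1.0, 10: 0.5}, k=2)
for docid, score in zip(doc_ids, scores):
    print(f"{docid}: {score:.2f}")
```

In this example the third block (documents 4 and 5) has an upper bound of 0.4, below the current second-best score, so it is never scored.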
