Impact Index for Information Retrieval

This package is a library that implements efficient algorithms for the sparse representations produced by neural information retrieval (IR) systems. Unlike other libraries, it specifically targets neural IR models and does not assume that impact values are quantized. It also does not implement "standard" IR algorithms based on term frequencies.

Installation

pip install maturin
maturin develop --release

Python API

The library exposes the impact_index Python module for building and searching sparse indices.

Basic Usage

import numpy as np
from impact_index import IndexBuilder, Index

# Build an index
builder = IndexBuilder("/path/to/index")
# Add documents: docid, term_indices (numpy array), impact_values (numpy array)
builder.add(0, np.array([1, 5, 10]), np.array([0.5, 1.2, 0.8]))
builder.add(1, np.array([2, 5, 8]), np.array([0.3, 0.9, 1.1]))
index = builder.build(in_memory=True)

# Search
query = {5: 1.0, 10: 0.5}  # {term_index: weight}
results = index.search_wand(query, top_k=10)
for doc in results:
    print(f"Document {doc.docid}: {doc.score}")
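
As a sanity check for the toy example above, the following brute-force sketch scores each document as the inner product between the query weights and the document's impact values. That scoring rule is an assumption here (it is the usual one for impact indices), and brute_force_search is a hypothetical helper that is not part of impact_index.

```python
import heapq

def brute_force_search(documents, query, top_k):
    """Score every document exhaustively and return the top_k (docid, score) pairs.

    documents: {docid: {term_index: impact_value}}
    query: {term_index: weight}
    """
    scored = []
    for docid, impacts in documents.items():
        # Inner product restricted to the query terms (assumed scoring rule)
        score = sum(weight * impacts.get(term, 0.0) for term, weight in query.items())
        if score > 0.0:
            scored.append((docid, score))
    return heapq.nlargest(top_k, scored, key=lambda pair: pair[1])

# Same two documents and query as in the example above
documents = {
    0: {1: 0.5, 5: 1.2, 10: 0.8},
    1: {2: 0.3, 5: 0.9, 8: 1.1},
}
query = {5: 1.0, 10: 0.5}
results = brute_force_search(documents, query, top_k=10)
for docid, score in results:
    print(f"Document {docid}: {score:.2f}")
```

Under that assumption, document 0 matches both query terms and should rank ahead of document 1, which matches only term 5.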

Classes

BuilderOptions

Options for index construction.

from impact_index import BuilderOptions

options = BuilderOptions()
options.checkpoint_frequency = 100000  # Checkpoint every N documents
options.in_memory_threshold = 1000000  # In-memory threshold

IndexBuilder

Builds a sparse index from documents.

IndexBuilder(folder: str, options: BuilderOptions = None)

Methods:

  • add(docid: int, terms: np.ndarray[int], values: np.ndarray[float]) - Add a document with term indices and impact values
  • get_checkpoint_doc_id() -> int | None - Get the last checkpointed document ID (useful for resuming indexing)
  • build(in_memory: bool) -> Index - Finalize and return the index
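
A possible way to use get_checkpoint_doc_id() when resuming an interrupted indexing run is sketched below. It assumes that documents are added in increasing docid order and that every docid up to and including the checkpoint is already stored; neither assumption is confirmed by the description above. add_remaining is a hypothetical helper, not a library function.

```python
def add_remaining(builder, documents):
    """Add only the documents that come after the builder's last checkpoint.

    documents: iterable of (docid, terms, values) in increasing docid order.
    Returns the number of documents actually added.
    """
    last_done = builder.get_checkpoint_doc_id()  # None when no checkpoint exists
    added = 0
    for docid, terms, values in documents:
        if last_done is not None and docid <= last_done:
            continue  # assumed to be covered by the checkpoint already
        builder.add(docid, terms, values)
        added += 1
    return added
```

The helper only relies on the add and get_checkpoint_doc_id methods listed above, so it can be exercised with any object that provides them.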

Index

A sparse index supporting efficient top-k retrieval.

Index.load(folder: str, in_memory: bool) -> Index

Search Methods:

  • search_wand(query: dict, top_k: int) -> list[ScoredDocument] - WAND algorithm
  • search_maxscore(query: dict, top_k: int) -> list[ScoredDocument] - MaxScore algorithm
  • aio_search_wand(query: dict, top_k: int) -> Coroutine - Async WAND search
  • aio_search_maxscore(query: dict, top_k: int) -> Coroutine - Async MaxScore search
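
The async variants make it possible to issue several queries concurrently. The sketch below assumes that the coroutine returned by aio_search_wand can be awaited from an asyncio event loop; search_many is a hypothetical helper name.

```python
import asyncio

async def search_many(index, queries, top_k=10):
    """Run several WAND searches concurrently and return results in query order."""
    tasks = [index.aio_search_wand(query, top_k) for query in queries]
    # gather preserves the order of its arguments in the returned list
    return await asyncio.gather(*tasks)
```

With a loaded index, this could be called as asyncio.run(search_many(index, [{5: 1.0}, {10: 0.5}])).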

Other Methods:

  • postings(term: int) -> SparseIndexIterator - Get posting list iterator for a term
  • num_postings() -> int - Total number of posting lists
  • to_bmp(output: str, bsize: int, compress_range: bool) - Convert to BMP format
  • to_bmp_streaming(output: str, bsize: int, compress_range: bool) - Memory-efficient BMP conversion

SparseIndexIterator

Iterator over a term's posting list.

iterator = index.postings(term_id)
print(f"Length: {iterator.length()}")
print(f"Max impact: {iterator.max_value()}")
print(f"Max doc ID: {iterator.max_doc_id()}")

for posting in iterator:
    print(f"Doc {posting.docid}: {posting.value}")
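
Per-term maxima such as max_value() are what dynamic pruning algorithms like WAND and MaxScore rely on. As an illustration only (this is not the library's implementation), the sketch below shows the MaxScore-style partition of query terms into non-essential and essential ones, assuming non-negative query weights and impacts. split_essential_terms is a hypothetical helper.

```python
def split_essential_terms(max_contributions, threshold):
    """Partition query terms MaxScore-style.

    max_contributions: {term: query_weight * max_impact_of_term}
    threshold: current score of the k-th best document.
    Returns (non_essential, essential). A document matching only non-essential
    terms cannot score above the threshold, so only the essential posting
    lists need to drive the traversal.
    """
    ordered = sorted(max_contributions, key=max_contributions.get)  # ascending bound
    non_essential = []
    running_bound = 0.0
    for term in ordered:
        if running_bound + max_contributions[term] > threshold:
            break  # adding this term could beat the threshold
        running_bound += max_contributions[term]
        non_essential.append(term)
    essential = ordered[len(non_essential):]
    return non_essential, essential

# Upper bounds for three query terms and a current threshold of 0.8
print(split_essential_terms({5: 1.2, 10: 0.4, 2: 0.3}, threshold=0.8))
```

As the threshold rises during a search, more terms become non-essential and fewer posting lists have to be traversed in full.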

ScoredDocument

Search result with document ID and score.

  • docid: int - Document identifier
  • score: float - Retrieval score

Compression and Transforms

Apply compression to reduce index size:

from impact_index import (
    Index, CompressionTransform,
    EliasFanoCompressor, ImpactQuantizer, GlobalImpactQuantizer
)

index = Index.load("/path/to/raw_index", in_memory=True)

# Create compressors
docid_compressor = EliasFanoCompressor()
impact_compressor = ImpactQuantizer(nbits=8, min=0.0, max=10.0)
# Or use global quantization:
# impact_compressor = GlobalImpactQuantizer(nbits=8)

# Apply compression
transform = CompressionTransform(
    max_block_size=128,
    doc_ids_compressor=docid_compressor,
    impacts_compressor=impact_compressor
)
transform.process("/path/to/compressed_index", index)
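
To give an intuition for what the nbits, min and max parameters control, here is a minimal sketch of uniform scalar quantization. It is an illustration under the assumption that impacts are mapped linearly onto integer codes; the scheme actually used by ImpactQuantizer may differ, and both functions are hypothetical.

```python
def quantize_impacts(values, nbits, vmin, vmax):
    """Map float impacts in [vmin, vmax] to integer codes in [0, 2**nbits - 1]."""
    levels = (1 << nbits) - 1
    scale = levels / (vmax - vmin)
    codes = []
    for value in values:
        clipped = min(max(value, vmin), vmax)  # out-of-range values saturate
        codes.append(round((clipped - vmin) * scale))
    return codes

def dequantize_impacts(codes, nbits, vmin, vmax):
    """Approximate inverse of quantize_impacts."""
    levels = (1 << nbits) - 1
    return [vmin + code / levels * (vmax - vmin) for code in codes]

codes = quantize_impacts([0.0, 0.5, 1.2, 10.0, 12.0], nbits=8, vmin=0.0, vmax=10.0)
approx = dequantize_impacts(codes, nbits=8, vmin=0.0, vmax=10.0)
```

With this scheme, the reconstruction error for in-range values is at most half a quantization step, (max - min) / (2 * (2**nbits - 1)), so fewer bits trade accuracy for space.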

SplitIndexTransform

Split index by impact quantiles for tiered retrieval:

from impact_index import SplitIndexTransform, CompressionTransform

base_transform = CompressionTransform(128, docid_compressor, impact_compressor)
split_transform = SplitIndexTransform(
    quantiles=[0.5, 0.9],  # Split at 50th and 90th percentile
    sink=base_transform
)
split_transform.process("/path/to/split_index", index)
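
The idea of splitting by impact quantiles can be sketched on a single posting list as follows. This is an assumption-laden illustration: the description above does not say whether quantiles are computed per term or globally, nor which quantile convention is used, so the sketch picks per-list nearest-rank thresholds. split_postings is a hypothetical helper.

```python
import bisect

def split_postings(postings, quantiles):
    """Partition one posting list into len(quantiles) + 1 tiers by impact.

    postings: list of (docid, impact); quantiles: increasing fractions in (0, 1).
    Tier 0 holds the lowest impacts, the last tier the highest.
    """
    tiers = [[] for _ in range(len(quantiles) + 1)]
    if not postings:
        return tiers
    impacts = sorted(impact for _, impact in postings)
    last = len(impacts) - 1
    # Nearest-rank threshold for each quantile (one simple convention among several)
    thresholds = [impacts[min(int(q * len(impacts)), last)] for q in quantiles]
    for docid, impact in postings:
        # Impacts equal to a threshold go to the higher tier
        tiers[bisect.bisect_right(thresholds, impact)].append((docid, impact))
    return tiers

postings = [(docid, docid / 10) for docid in range(10)]
tiers = split_postings(postings, quantiles=[0.5, 0.9])
```

A tiered search can then visit the highest-impact tier first and fall back to lower tiers only when needed.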

BMP (Block-Max Pruning) Search

Fast approximate search using the BMP algorithm from the BMP repository, which implements "Faster Learned Sparse Retrieval with Block-Max Pruning" (SIGIR 2024).

from impact_index import Index, BmpSearcher

# Convert an existing index to BMP format
index = Index.load("/path/to/index", in_memory=True)
index.to_bmp_streaming("/path/to/bmp_index.bin", bsize=64, compress_range=True)

# Load and search with BMP
searcher = BmpSearcher("/path/to/bmp_index.bin")
print(f"Documents: {searcher.num_documents()}")

# Query uses string term IDs
query = {"term1": 1.0, "term2": 0.5}
doc_ids, scores = searcher.search(query, k=10, alpha=1.0, beta=1.0)
for docid, score in zip(doc_ids, scores):
    print(f"{docid}: {score}")

BMP Conversion Methods:

  • to_bmp_streaming(output, bsize, compress_range) - Recommended. Memory-efficient streaming conversion using O(num_terms × num_blocks) memory
  • to_bmp(output, bsize, compress_range) - Legacy method using O(total_postings) memory
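
The O(num_terms × num_blocks) bound reflects the block-max structure: for each term, one maximum impact is kept per block of bsize consecutive document IDs. The sketch below illustrates that structure and how it yields a score upper bound per block. It is not the BMP file format or the library's code, and both function names are hypothetical.

```python
def block_max_table(postings_by_term, num_documents, bsize):
    """For each term, compute the maximum impact inside each block of bsize doc IDs.

    postings_by_term: {term: [(docid, impact), ...]}
    Returns {term: [max impact in block 0, block 1, ...]} (0.0 for empty blocks).
    """
    num_blocks = (num_documents + bsize - 1) // bsize  # ceiling division
    table = {}
    for term, postings in postings_by_term.items():
        maxima = [0.0] * num_blocks
        for docid, impact in postings:
            block = docid // bsize
            maxima[block] = max(maxima[block], impact)
        table[term] = maxima
    return table

def block_upper_bounds(table, query, num_blocks):
    """Upper bound on the query score of any document in each block."""
    bounds = [0.0] * num_blocks
    for term, weight in query.items():
        for block, block_max in enumerate(table.get(term, [])):
            bounds[block] += weight * block_max
    return bounds

postings_by_term = {
    5: [(0, 1.2), (1, 0.9), (5, 0.4)],
    10: [(0, 0.8), (6, 2.0)],
}
table = block_max_table(postings_by_term, num_documents=8, bsize=4)
bounds = block_upper_bounds(table, {5: 1.0, 10: 0.5}, num_blocks=2)
```

Blocks whose upper bound cannot beat the current top-k threshold can be skipped without scoring any of their documents.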

BMP Search Parameters:

  • k - Number of results to return
  • alpha - Controls early termination aggressiveness (default: 1.0)
  • beta - Controls block skipping (default: 1.0)
