Model-centric, config-driven, fail-stop-retry indexing toolkit for managing your index in vector databases.

idxr: Model-Centric Indexing Story

idxr exists for teams who want a dependable, repeatable way to turn any structured dataset into a searchable vector index.

📚 Documentation – Browse the full MkDocs site at https://getadriai.github.io/idxr/ (or build it locally with mkdocs serve).

Everything revolves around three pillars:

Model-centric – you describe your world as Pydantic models, and idxr keeps schemas, partitions, and manifests aligned with those models.
Config-driven – declarative JSON configs capture how each model should be prepared and indexed, so onboarding a new dataset is as easy as committing a config file.
Fail-stop-retry – every stage records checkpoints, row digests, and error payloads, so the pipeline halts loudly when something goes wrong and later resumes from where it stopped.
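To make the model-centric pillar concrete, here is a minimal sketch of what a registry module might contain. The Contract fields and the plain-dict shape of MODEL_REGISTRY are assumptions for illustration only; consult the idxr documentation for the registry format it actually expects.

```python
from typing import Optional

from pydantic import BaseModel


class Contract(BaseModel):
    """Hypothetical domain model with illustrative fields."""

    id: str
    title: str
    summary: Optional[str] = None


# Assumed shape: a mapping from model name to model class.
MODEL_REGISTRY = {"Contract": Contract}

contract = MODEL_REGISTRY["Contract"](id="C-1", title="Master agreement")
print(contract.title)
```

Keeping the models in one importable module means a dotted path such as my_project.registry:MODEL_REGISTRY can identify them from the command line.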

A Day in the Life of an Index

The timeline below is an example run that demonstrates how idxr accompanies a team from the first dataset drop through ongoing maintenance.

First launch (Create)
You register your domain models in a registry module and run:

export MODEL_REGISTRY="my_project.registry:MODEL_REGISTRY"
idxr prepare_datasets new-config foundation --model "$MODEL_REGISTRY"

idxr scaffolds a config like:

{
  "Contract": {
    "path": "datasets/contracts.csv",
    "columns": {
      "id": "CONTRACT_ID",
      "title": "CONTRACT_TITLE",
      "summary": "DESCRIPTION"
    },
    "delimiter": ",",
    "drop_na_columns": ["summary"]
  }
}

That config is committed, reviewed, and becomes the contract between data engineers and the index.
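One plausible reading of the config above is that each key under columns is a model field, each value is the source CSV header, and drop_na_columns lists fields whose empty values disqualify a row. That reading is an assumption, not confirmed idxr behaviour. The sketch below applies it with the standard library only; load_rows is a hypothetical helper and the CSV content is made up.

```python
import csv
import io
import json

CONFIG = json.loads("""
{
  "Contract": {
    "path": "datasets/contracts.csv",
    "columns": {"id": "CONTRACT_ID", "title": "CONTRACT_TITLE", "summary": "DESCRIPTION"},
    "delimiter": ",",
    "drop_na_columns": ["summary"]
  }
}
""")


def load_rows(source, model_config):
    """Yield dicts keyed by model field, skipping rows with empty listed fields."""
    reader = csv.DictReader(source, delimiter=model_config["delimiter"])
    for raw in reader:
        row = {
            field: (raw.get(header) or "").strip()
            for field, header in model_config["columns"].items()
        }
        if any(not row[field] for field in model_config["drop_na_columns"]):
            continue  # assumed meaning of drop_na_columns
        yield row


# In-memory stand-in for datasets/contracts.csv
sample = io.StringIO(
    "CONTRACT_ID,CONTRACT_TITLE,DESCRIPTION\n"
    "C-1,Master agreement,Covers all services\n"
    "C-2,Addendum,\n"
)
rows = list(load_rows(sample, CONFIG["Contract"]))
print(rows)
```

Only the first sample row survives, because the second has an empty DESCRIPTION.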

Daily growth (Add records)
New exports arrive. You rerun idxr prepare_datasets with the same config; idxr deduplicates rows using digests, appends fresh partitions, and bumps manifest timestamps. No manual cleanup, no double counting.
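Digest-based deduplication can be pictured as hashing a canonical form of each row and skipping digests that are already recorded. This is a sketch of the general technique, not idxr's implementation; row_digest and new_rows are hypothetical names, and SHA-256 over sorted-key JSON is an assumed choice.

```python
import hashlib
import json


def row_digest(row: dict) -> str:
    """Hash a row's canonical JSON form so key order does not matter."""
    canonical = json.dumps(row, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def new_rows(rows, seen_digests: set):
    """Return rows whose digest is not yet known, updating seen_digests in place."""
    fresh = []
    for row in rows:
        digest = row_digest(row)
        if digest not in seen_digests:
            seen_digests.add(digest)
            fresh.append(row)
    return fresh


seen = set()
first_export = [{"id": "C-1", "title": "Master agreement"}]
second_export = [
    {"title": "Master agreement", "id": "C-1"},  # same row, different key order
    {"id": "C-2", "title": "Addendum"},
]
print(len(new_rows(first_export, seen)))   # 1
print(len(new_rows(second_export, seen)))  # 1: only C-2 is new
```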

Domain expansion (Add models)
Product introduces a SupportTicket model. You add it to the registry, run idxr prepare_datasets new-config support --model "$MODEL_REGISTRY" --models SupportTicket, and drop the resulting JSON alongside the original config. idxr keeps each model’s partitions distinct but indexed in the same collection.

Schema shakeups (Update models)
If Contract gains a new field, the model registry changes first. idxr prepare_datasets notices, versions the schema, and marks older partitions as stale. When idxr vectorize runs next, it honours resume checkpoints, reindexes only what changed, and writes audit-friendly error reports for anything it had to skip.
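One way to detect a schema change is to fingerprint the model's fields and compare that fingerprint with the one stored alongside each partition. The manifest layout, the partition names, and both helper functions below are hypothetical; the sketch only illustrates the idea of marking older partitions as stale.

```python
import hashlib
import json


def schema_fingerprint(fields: dict) -> str:
    """Fingerprint a model's field names and type names."""
    canonical = json.dumps(sorted(fields.items()))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def stale_partitions(manifest: list, current_fingerprint: str) -> list:
    """Return names of partitions built against a different schema."""
    return [p["name"] for p in manifest if p["schema"] != current_fingerprint]


old_schema = schema_fingerprint({"id": "str", "title": "str"})
new_schema = schema_fingerprint({"id": "str", "title": "str", "region": "str"})

manifest = [
    {"name": "contract-0001", "schema": old_schema},
    {"name": "contract-0002", "schema": new_schema},
]
print(stale_partitions(manifest, new_schema))  # ['contract-0001']
```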

Operational guardrails
During indexing, any hard failure triggers a fail-stop. idxr writes a YAML report capturing offending rows and context so you can fix the source data, then rerun idxr vectorize --resume to continue exactly where it left off. Optional E2E sampling produces JSON snippets you can review with stakeholders before the big push.
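The fail-stop and resume cycle can be sketched with a checkpoint file and an error report. Everything here is illustrative: run and index_row are hypothetical names, the file layout is assumed, and the report is written as JSON rather than YAML so the example needs only the standard library.

```python
import json
import tempfile
from pathlib import Path


def run(rows, state_dir: Path, index_row) -> int:
    """Index rows in order; on failure, write a report and stop. Returns rows indexed this run."""
    checkpoint = state_dir / "checkpoint.json"
    start = json.loads(checkpoint.read_text())["next"] if checkpoint.exists() else 0
    done = 0
    for position in range(start, len(rows)):
        try:
            index_row(rows[position])
        except Exception as exc:
            report = {"position": position, "row": rows[position], "error": str(exc)}
            (state_dir / "error_report.json").write_text(json.dumps(report, indent=2))
            raise  # fail-stop: nothing after the bad row is processed
        checkpoint.write_text(json.dumps({"next": position + 1}))
        done += 1
    return done


def index_row(row):
    """Stand-in for the real indexing step; rejects rows without a title."""
    if not row["title"]:
        raise ValueError("empty title")


rows = [{"id": 1, "title": "a"}, {"id": 2, "title": ""}, {"id": 3, "title": "c"}]
state = Path(tempfile.mkdtemp())
try:
    run(rows, state, index_row)
except ValueError:
    rows[1]["title"] = "b"  # fix the source data
resumed = run(rows, state, index_row)
print(resumed)  # 2: the second and third rows; the first is not reprocessed
```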

Tools in the Box

idxr prepare_datasets – partitions CSV/JSONL sources, heals malformed rows, maintains a manifest with digests, and generates drop plans.
idxr vectorize – streams partitions into ChromaDB (local or cloud), enforces token budgets, compacts documents via OpenAI when needed, and exports structured error reports.
Shared libraries – offer manifest helpers, truncation strategies, drop orchestration, and CLI utilities to wire everything together.
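As a rough picture of what enforcing a token budget means, the sketch below truncates a document to a maximum number of whitespace-delimited tokens. This is a deliberately crude stand-in: real tokenizers count differently, truncate_to_budget is a hypothetical helper, and idxr's own truncation strategies are described in TRUNCATION_EXAMPLES.md.

```python
def truncate_to_budget(text: str, max_tokens: int) -> str:
    """Keep at most max_tokens whitespace-delimited tokens."""
    if max_tokens < 0:
        raise ValueError("max_tokens must be non-negative")
    tokens = text.split()
    if len(tokens) <= max_tokens:
        return text  # already within budget, leave untouched
    return " ".join(tokens[:max_tokens])


doc = "Master services agreement covering support, maintenance, and renewals"
print(truncate_to_budget(doc, 4))  # Master services agreement covering
```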

Why idxr?

🔁 Lifecycle clarity – creation, accumulation, model expansion, and schema updates follow the same playbook.
✍️ Single source of truth – configs live in version control, so reviews and rollbacks are trivial.
🛑 Predictable failure semantics – when something breaks, the pipeline stops before corrupting data and tells you exactly what needs attention.
🔌 Bring-your-own registry – ship configs with ECC exports today, swap to CRM logs tomorrow, all with the same toolkit.
📦 PyPI-ready – install via pip install idxr, call the CLIs, import the libraries, and compose your own orchestration scripts.

Querying Multi-Collection Indexes

When indexing large datasets (16M+ records), idxr distributes data across multiple ChromaDB collections using the PartitionCollectionStrategy. To query efficiently across these collections:

Generate query config after indexing completes:

idxr vectorize generate-query-config \
  --partition-out-dir build/vector \
  --output query_config.json \
  --model "$MODEL_REGISTRY"

Use the async query client in your application:

import asyncio
import os
from pathlib import Path

from idxr.vectorize_lib.query_client import AsyncMultiCollectionQueryClient


async def main() -> None:
    async with AsyncMultiCollectionQueryClient(
        config_path=Path("query_config.json"),
        client_type="cloud",
        cloud_api_key=os.getenv("CHROMA_API_TOKEN"),
    ) as client:
        # Query specific models; the client fans out to the relevant collections
        results = await client.query(
            query_texts=["SAP transaction tables"],
            n_results=10,
            models=["Table", "Field"],
        )
        print(results)


asyncio.run(main())

The client automatically:

Maps model names to their collections
Fans out queries in parallel using asyncio
Merges and ranks results by distance across collections
Handles partial failures gracefully
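The behaviours listed above can be illustrated with plain asyncio. This sketch does not use the idxr client; query_collection is a stand-in that returns made-up (distance, id) pairs, and the collection names are hypothetical. It shows concurrent fan-out, tolerance of a failing collection, and a merge ordered by distance.

```python
import asyncio
import heapq


async def query_collection(name: str, query: str):
    """Stand-in for a per-collection query; returns (distance, id) pairs."""
    fake_index = {
        "table_0": [(0.12, "T1"), (0.40, "T2")],
        "table_1": [(0.05, "T3")],
    }
    if name not in fake_index:
        raise ConnectionError(f"collection {name} unavailable")
    await asyncio.sleep(0)  # yield control, as a network call would
    return fake_index[name]


async def fan_out(collections, query: str, n_results: int):
    """Query all collections concurrently and merge the closest hits."""
    outcomes = await asyncio.gather(
        *(query_collection(c, query) for c in collections),
        return_exceptions=True,
    )
    hits, failed = [], []
    for name, outcome in zip(collections, outcomes):
        if isinstance(outcome, Exception):
            failed.append(name)  # partial failure: keep the other results
        else:
            hits.extend(outcome)
    return heapq.nsmallest(n_results, hits), failed


top, failed = asyncio.run(fan_out(["table_0", "table_1", "field_0"], "SAP tables", 2))
print(top)     # [(0.05, 'T3'), (0.12, 'T1')]
print(failed)  # ['field_0']
```

Passing return_exceptions=True to asyncio.gather is what lets one unavailable collection be reported without discarding results from the others.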

For complete documentation, see QUERYING.md and examples/query_example.py.

For deep dives and operational recipes, explore FAQ.md, DOC.md, TRUNCATION_EXAMPLES.md, ERROR_HANDLING.md, and QUERYING.md.

Release history

Version	Released
1.8.0	Nov 23, 2025
1.7.0	Nov 22, 2025
1.5.1	Nov 22, 2025
1.5.0 (this version)	Nov 22, 2025
0.2.0	Oct 31, 2025
0.1.0	Oct 31, 2025

Download files

Source Distribution

idxr-1.5.0.tar.gz (113.9 kB)

Uploaded Nov 22, 2025 Source

Built Distribution

idxr-1.5.0-py3-none-any.whl (122.2 kB)

Uploaded Nov 22, 2025 Python 3

File details

Details for the file idxr-1.5.0.tar.gz.

File metadata

Download URL: idxr-1.5.0.tar.gz
Upload date: Nov 22, 2025
Size: 113.9 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.10

File hashes

Hashes for idxr-1.5.0.tar.gz
Algorithm	Hash digest
SHA256	`a1ac40e884b38a79d63c79505f2c14987d84224f8ce52d9edf61ca99731d416f`
MD5	`92f9949eb4724cf3c5fd7a40070f54d6`
BLAKE2b-256	`0b4912826d71073258b2a341847823901c986f0009572e15e44dfda06016d9a3`

File details

Details for the file idxr-1.5.0-py3-none-any.whl.

File metadata

Download URL: idxr-1.5.0-py3-none-any.whl
Upload date: Nov 22, 2025
Size: 122.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.10

File hashes

Hashes for idxr-1.5.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`8c296c17dfc5924d62e3b5d58bf2bd8c3557d2daff2b18a11d234eda1831c590`
MD5	`7e831762f1dd9f7e7863981bdbb951a2`
BLAKE2b-256	`60031354c097fa9cece32e68433e0d1415205d0f250bd08dae01b961b0fa0c1d`
