Make directories AI-ready, not just files — turn a directory into a portable knowledge space.

Maintainers

indxjp

Project description

indx

Make directories AI-ready — not just files.

Point indx at a folder and get back a knowledge space: structure, folder lineage, file-to-file relationships, and semantic metadata that AI agents and RAG systems can actually reason over.

Documentation · Quickstart · Concepts · AI agents · Changelog

A parser turns one PDF into clean text. indx turns an entire folder into a knowledge space — and keeps the things parsers throw away: the folder a file lived in, the documents beside it, the report it continues, the contract it references, and what kind of document it is.

The thesis: most real knowledge doesn't live in a single file — it lives in the arrangement of files. indx keeps that map and hands it to your agent.

indx composes file parsers (Docling, Unstructured, LlamaParse, MarkItDown, …) rather than replacing them, then layers on what they discard. Every major component — parser, LLM, VLM, embedder, vector store, output — is a swappable, typed slot with a sensible default. Open-source · Python · CLI + SDK · Apache-2.0.

See it in one command: `indx demo`

No data to prepare, no extra dependencies, no API keys. `indx demo` builds, inspects, and queries a bundled sample corpus fully offline, so the whole flow runs from a single command:

pip install indx
indx demo

$ indx demo
indx demo — building a sample 'team handbook' knowledge space…

stage: walk
stage: parse
stage: chunk
stage: relate
stage: enrich
stage: embed-pack
✓ 7 docs · 7 chunks · 19 relations → /tmp/indx-demo-XXXX/demo (0.01s)
  components: parser=plaintext llm=none embedder=hash store=jsonl format=.indx

/tmp/indx-demo-XXXX/demo  schema=1 indx=0.0.1
  documents=7 chunks=7 relations=19 embeddings=7 embedding=hash/256
       Types                    Relations
  type       count        type         count
  markdown       6        references      14
  text           1        sibling          5

sample query (keyword/lexical, offline): how do I onboard?
  score  source                      text
  0.121  engineering/code-review.md  # Code Review  Code review keeps our codebase…
  0.098  people/remote-work.md       # Remote Work Policy  Acme Robotics is remote-…
  0.095  handbook/welcome.md         # Welcome to Acme Robotics  This is the Acme …

✓ that's the whole flow — built offline with keyword/lexical retrieval, no API key.
  run it on your own folder: indx ./your-docs --out ./ai-ready.indx --offline

(A trimmed, ANSI-stripped transcript of a real indx demo run.)

Now point it at your own folder:

indx ./docs --out ./ai-ready.indx --offline   # index a folder, fully offline, zero extra deps
indx inspect ./ai-ready.indx                   # structure, type histogram, relation sample
indx query   ./ai-ready.indx "how do I onboard?"
indx ask     ./ai-ready.indx "what is the leave policy?"   # answer with cited sources
indx app                                       # visual configure → build → inspect → query in the browser

A home knowledge base — one permanent DB, no path juggling

indx keeps a single persistent personal space under ~/.indx/ (override with $INDX_HOME). Build, add, query, and ask default to it when you pass no path — so it works like a personal notebook you can keep adding to:

indx add  ./notes/standup.md           # append a file to the home DB (no path = home)
indx add  ./reports                     # add a whole folder incrementally
indx query "what did we decide about retention?"   # query home, no archive argument
indx ask   "summarize this week's standups"        # answer with citations, offline-friendly
indx home stats                         # counts for the home space
indx home path                          # print the home dir

Incremental CRUD works on any .indx too — indx add <space> <path>, indx rm <space> <doc|path>, indx update <space> <path> mutate a sealed archive in place without a full rebuild.

Filter what gets indexed, run only the stages you need, compose spaces

# Conditional import: index only what matches (globs · extensions · size · depth · count)
indx ./repo --out ./code.indx --offline --ext md --ext py \
  --exclude '**/_drafts/**' --max-size 2mb --max-depth 3 --dry-run

# Granular stages: stop after chunking (no embeddings yet), or inspect one member
indx ./docs --out ./docs.indx --offline --through chunk
indx inspect ./docs.indx --part documents --json

# indx of indx: one parent space that federates query/inspect across child archives
indx compose ./all.indx --add ./eng.indx --add ./design.indx
indx query   ./all.indx "onboarding checklist"      # hits drawn from every child, globally ranked
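
The conditional-import flags in the first command above (extensions, exclude globs, size, depth) amount to a per-file keep-or-skip predicate. The sketch below illustrates that idea with the standard library only; it is not indx's implementation. `parse_size`, `keep`, and the sample file list are hypothetical, and it assumes that "2mb" means 2 × 1024 × 1024 bytes and that depth counts path components (a file in the root has depth 1).

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath


def parse_size(text: str) -> int:
    """Turn '2mb', '512kb' or '100' into a byte count (binary units assumed)."""
    units = {"kb": 1024, "mb": 1024 ** 2, "gb": 1024 ** 3}
    text = text.strip().lower()
    for suffix, factor in units.items():
        if text.endswith(suffix):
            return int(float(text[: -len(suffix)]) * factor)
    return int(text)


def keep(rel_path: str, size: int, *, exts, exclude, max_size, max_depth) -> bool:
    """Return True when a file passes every filter."""
    path = PurePosixPath(rel_path)
    if exts and path.suffix.lstrip(".") not in exts:
        return False                      # extension not on the allow-list
    if any(fnmatchcase(rel_path, pattern) for pattern in exclude):
        return False                      # matches an exclude glob
    if size > max_size:
        return False                      # too large
    return len(path.parts) <= max_depth   # too deep otherwise


# Hypothetical (relative path, size in bytes) pairs from a directory walk.
candidates = [
    ("README.md", 1_000),
    ("src/util.py", 2_048),
    ("docs/_drafts/idea.md", 500),
    ("src/app/main.py", 3_000_000),
    ("a/b/c/d/deep.md", 10),
    ("assets/logo.png", 10),
]
rules = dict(
    exts={"md", "py"},
    exclude=["**/_drafts/**"],
    max_size=parse_size("2mb"),
    max_depth=3,
)
kept = [path for path, size in candidates if keep(path, size, **rules)]
print(kept)
```

Note that `fnmatchcase` lets `*` cross directory separators, so `**/_drafts/**` here only matches a `_drafts` folder that has at least one parent directory; a real glob engine may treat `**` differently.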

Why it matters: a chunk that remembers everything

A flat parse → split → embed → store pipeline gives you orphaned text fragments. indx gives you chunks that carry their whole context with them. Here's a single chunk as it appears in the readable index.json:

{
  "id": "chunk_0481",
  "doc_id": "doc_0007",
  "position": 12,
  "text": "Enterprise data is retained for 90 days…",
  "prev_id": "chunk_0480",
  "next_id": "chunk_0482",
  "source":   { "path": "policies/data/retention.pdf", "folder": "policies/data", "type": "policy" },
  "metadata": { "topics": ["retention", "compliance"], "summary": "90-day retention rule…", "tags": ["data-retention", "gdpr"] },
  "relations": [ { "src": "chunk_0481", "dst": "legal/gdpr.md", "type": "references", "score": 1.0 } ]
}

It knows where it came from (source), what it's about (metadata), what sits next to it (prev_id / next_id), and what it points to (relations). An agent can filter by location or type, expand the context window around a hit, and follow knowledge instead of just matching it. Ids are deterministic, so a knowledge space is diffable and reproducible.
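
As a rough illustration of "expand the context window around a hit", the sketch below follows `prev_id` / `next_id` links over plain dictionaries shaped like the record above. It does not use the indx SDK. The `expand` helper and the two neighbouring chunks are hypothetical; only the text of `chunk_0481` comes from the example.

```python
# Hypothetical records shaped like the index.json chunk; the neighbours'
# text is a placeholder, not real data.
chunks = {
    "chunk_0480": {
        "id": "chunk_0480", "prev_id": None, "next_id": "chunk_0481",
        "text": "(preceding paragraph)",
        "source": {"folder": "policies/data", "type": "policy"},
    },
    "chunk_0481": {
        "id": "chunk_0481", "prev_id": "chunk_0480", "next_id": "chunk_0482",
        "text": "Enterprise data is retained for 90 days…",
        "source": {"folder": "policies/data", "type": "policy"},
    },
    "chunk_0482": {
        "id": "chunk_0482", "prev_id": "chunk_0481", "next_id": None,
        "text": "(following paragraph)",
        "source": {"folder": "policies/data", "type": "policy"},
    },
}


def expand(chunks: dict, hit_id: str, radius: int = 1) -> list[dict]:
    """Return the hit plus up to `radius` neighbours per side, in reading order."""
    hit = chunks[hit_id]
    before, after = [], []
    cur = hit
    for _ in range(radius):               # walk backwards via prev_id
        cur = chunks.get(cur["prev_id"])
        if cur is None:
            break
        before.append(cur)
    cur = hit
    for _ in range(radius):               # walk forwards via next_id
        cur = chunks.get(cur["next_id"])
        if cur is None:
            break
        after.append(cur)
    return before[::-1] + [hit] + after


window = expand(chunks, "chunk_0481", radius=1)
print([c["id"] for c in window])

# Filtering by location or type is a plain predicate over `source`.
policy_ids = [c["id"] for c in chunks.values() if c["source"]["type"] == "policy"]
```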

How it works

indx is a pipeline of six ordered, individually replaceable stages that share one mutable SpaceContext:

01 Walk → 02 Parse → 03 Chunk → 04 Relate → 05 Enrich → 06 Embed+Pack

The pipeline is a list you control: insert a stage (say, PII redaction before enrichment), swap one, or drop one (skip Enrich when no LLM is available) without touching its neighbors. The whole model is symmetric across the CLI and SDK:

from indx import DirectoryPipeline, KnowledgeSpace

# Build (default stack is cloud-backed; needs OPENAI_API_KEY)
space = DirectoryPipeline().run("./docs", "./ai-ready")

print(space.stats)                          # counts, timings, components used
for doc in space.documents(type="contract"):
    print(doc.path, doc.topics, doc.summary)

# Re-load the portable archive anywhere — no re-processing
space = KnowledgeSpace.load("./ai-ready/handbook.indx")
hits  = space.search("data retention", k=5)
answer = space.ask("how long is data retained?")     # extractive offline, or LLM-synthesized
print(answer.answer, answer.sources)

# Incremental CRUD — append / re-ingest / delete without a full rebuild, then reseal
space.add("./notes/new-policy.md")
space.update("./notes/new-policy.md")
space.remove("./notes/new-policy.md")
space.save("./ai-ready/handbook.indx")

# Selective load and filtered build
from indx import WalkFilter
docs   = KnowledgeSpace.load_part("./ai-ready/handbook.indx", "documents")  # one member only
filtered = DirectoryPipeline(filter=WalkFilter(ext=[".md"], max_size="2mb"))

The four core objects you need to know — KnowledgeSpace, Document, Chunk, Relation — are explained in Core concepts.
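
To make the "pipeline is a list you control" idea from this section concrete, here is a minimal, self-contained sketch of ordered stages sharing one mutable context. It is an illustration of the pattern, not indx's actual classes: the stage functions, the `run` helper, and the dict used as a context are all hypothetical stand-ins.

```python
import re
from typing import Callable

Context = dict                                   # stand-in for a shared mutable context
Stage = tuple[str, Callable[[Context], None]]    # (name, function that mutates ctx)


def parse(ctx: Context) -> None:
    ctx["text"] = ctx["raw"].strip()


def chunk(ctx: Context) -> None:
    ctx["chunks"] = [p for p in ctx["text"].split("\n\n") if p]


def redact(ctx: Context) -> None:
    # Toy PII redaction: mask anything that looks like an e-mail address.
    ctx["chunks"] = [re.sub(r"\S+@\S+", "[email]", c) for c in ctx["chunks"]]


def enrich(ctx: Context) -> None:
    ctx["summary"] = ctx["chunks"][0][:40]


def run(stages: list[Stage], ctx: Context) -> Context:
    """Run each stage in order against the same context."""
    for name, fn in stages:
        fn(ctx)
        ctx.setdefault("ran", []).append(name)
    return ctx


stages: list[Stage] = [("parse", parse), ("chunk", chunk), ("enrich", enrich)]

# Insert a stage: redaction goes immediately before enrichment.
position = [name for name, _ in stages].index("enrich")
stages.insert(position, ("redact", redact))

raw = "Contact ops@example.com for access.\n\nSecond paragraph."
ctx = run(stages, {"raw": raw})

# Drop a stage: skip enrichment, for example when no LLM is available.
offline_stages = [stage for stage in stages if stage[0] != "enrich"]
offline_ctx = run(offline_stages, {"raw": raw})
```

Because every stage reads and writes the same context, adding or removing one only changes the list, not the neighbouring stages.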

Bring your own stack — no lock-in

Every slot is a typed interface with a default and zero lock-in. Mix and match by name:

Slot	Default (cloud)	Offline core	Other built-ins
Parser	`docling`	`plaintext`	unstructured · llamaparse · markitdown · textract · docintel · docai
LLM	`openai:gpt-5-mini`	`none`	ollama · anthropic · litellm · vllm · azure · bedrock · vertex
VLM	`none`	`none`	gpt4o · qwen-vl · local · bedrock · azure · vertex
Embedder	`openai:text-embedding-3-small`	`hash`	bge-m3 · e5 · cohere · bedrock · azure · vertex · litellm
Store	`qdrant`	`jsonl` (no DB)	pgvector · chroma · lancedb · s3vectors · opensearch · azure-search · bigquery · vertex-vector
Output	`.indx` archive	`.indx` / `jsonl`	langchain · llamaindex

indx ./docs --out ./ai-ready --offline             # zero-dependency core: plaintext → hash → jsonl → .indx
indx ./docs --out ./ai-ready --store chroma        # override a single slot; everything else keeps its default

Three managed cloud profiles wire every slot to one vendor with a single install and a single flag:

pip install "indx[aws]"   && indx ./docs --out ./out --aws     # Textract → Bedrock → Titan → S3 Vectors
pip install "indx[azure]" && indx ./docs --out ./out --azure   # Doc Intelligence → Azure OpenAI → AI Search
pip install "indx[gcp]"   && indx ./docs --out ./out --gcp     # Document AI → Gemini → gemini-embedding → BigQuery

About the offline core: the hash embedder is a deterministic hashing trick, so offline query is keyword/lexical retrieval, not semantic vector search; true semantic search needs a real embedder extra (e.g. bge or openai). Likewise, the offline enrich step derives metadata (type, topics, tags, summary) locally, with no LLM call; LLM/VLM enrichment is opt-in via the cloud/local extras. The default stack (that is, without --offline) is cloud-backed: install it with pip install "indx[cloud]" and set OPENAI_API_KEY.
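
For readers unfamiliar with the hashing trick, the sketch below shows why such an embedder behaves lexically: each token is hashed into one of a fixed number of buckets, so two texts are only similar when they share tokens. This is a generic illustration, not indx's `hash` embedder. The function names, the tokenizer, and the choice of SHA-256 are assumptions; the 256 dimensions mirror the `embedding=hash/256` line in the demo transcript.

```python
import hashlib
import math
import re


def hash_embed(text: str, dim: int = 256) -> list[float]:
    """Map text to a unit-length vector by hashing each token into a bucket."""
    vec = [0.0] * dim
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        # SHA-256 is stable across runs, unlike Python's salted built-in hash().
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") % dim
        sign = 1.0 if digest[4] % 2 == 0 else -1.0   # signed buckets soften collisions
        vec[bucket] += sign
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity for vectors that are already unit length."""
    return sum(x * y for x, y in zip(a, b))


query = hash_embed("onboard")
same_word = cosine(query, hash_embed("onboard"))        # identical token
paraphrase = cosine(query, hash_embed("getting started"))  # no shared token
print(round(same_word, 3), round(paraphrase, 3))
```

A paraphrase with no tokens in common scores near zero unless two tokens happen to collide in a bucket, which is the practical meaning of "keyword/lexical, not semantic".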

Plug a knowledge space into an AI agent

A .indx archive is a portable knowledge space — carry it like a USB drive and plug it into any agent framework in one line:

from indx.agent import connect

kb = connect("ai-ready/handbook.indx")   # load the "USB drive"
tools = kb.openai()                       # OpenAI Agents SDK …or .langchain() / .pydantic_ai() / .claude()

Or serve it to any MCP client — Claude Desktop, Cursor, or the TypeScript Mastra framework — with no Python glue on the client side:

pip install "indx[agent]"            # all framework adapters + the MCP server
indx mcp ai-ready/handbook.indx      # serve indx_search / indx_overview / indx_get_document

Every connector exposes the same three read-only tools — search, overview, get-document — built on the same retrieval path as the CLI. See the AI agents guide.

Who it's for

RAG / agent engineers who want grounded context with relationships, not flat chunk soup.
Enterprise & air-gapped platform teams that need fully local, auditable, reproducible ingestion across large on-prem document estates — no byte leaves the network.
OSS developers & integrators who want a composable, no-lock-in library they can extend with their own parser, store, or output.
Researchers turning archives of papers, datasets, and notes into a navigable, citable, shareable knowledge graph.

indx is a build-time knowledge layer, not a runtime framework. It produces the portable archive that LangChain, LlamaIndex, agents, and vector DBs consume — use it with them, not instead of them. See the comparison.

Status

Alpha (0.0.1). The zero-dependency core path (plaintext → hash → jsonl → .indx) runs end to end and is fully air-gapped; reach it with indx demo or --offline. The optional cloud/local backends (docling, openai, ollama, bge-m3, qdrant, plus the managed AWS/Azure/GCP profiles, …) are implemented and selected through the registry. To switch a slot onto one of them, install the matching extra (e.g. pip install "indx[cloud]") and provide credentials. The .indx format is at schema_version "2"; readers still load "1" archives, since the children reference list is an additive change. Public APIs may still shift before 1.0; see the CHANGELOG and the docs.

Documentation

Full documentation — quickstart, concepts, the pipeline & stages, guides, and the complete CLI/SDK reference — lives at docs.indx.jp.

Development

python -m venv .venv && . .venv/bin/activate
pip install -e ".[dev]"
nox -s tests          # fast offline suite: unit + corpus
nox -l                # list every session (integration / docker / airgap / live / record-fixtures)

Contributions are welcome — see CONTRIBUTING.md, and Adding a backend to author a new slot implementation.

License

Apache-2.0.

Project details


Release history

0.0.6 (Jun 14, 2026)
0.0.5 (Jun 14, 2026)
0.0.4 (Jun 14, 2026), this version
0.0.3 (Jun 10, 2026)
0.0.2 (Jun 6, 2026)
0.0.1 (Jun 6, 2026)

Download files


Source Distribution

indx-0.0.4.tar.gz (671.4 kB)

Uploaded Jun 14, 2026 Source

Built Distribution


indx-0.0.4-py3-none-any.whl (565.3 kB)

Uploaded Jun 14, 2026 Python 3

File details

Details for the file indx-0.0.4.tar.gz.

File metadata

Download URL: indx-0.0.4.tar.gz
Upload date: Jun 14, 2026
Size: 671.4 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for indx-0.0.4.tar.gz
Algorithm	Hash digest
SHA256	`448e140b979bd0203e309685de514829d71bfdaf57868954aed482ca0b62b61d`
MD5	`8cfff3700cd85c58c7b026c13cedc3f7`
BLAKE2b-256	`d7a557360a8370f4ee2b04e918c0904d1dfc39774704f095118ddfa8ecd4ac23`
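
A downloaded file can be checked against the SHA256 digest listed above with the standard library alone. This is a minimal sketch: the helper names are illustrative, and the local path `indx-0.0.4.tar.gz` assumes the file was saved to the current directory.

```python
import hashlib
import hmac
from pathlib import Path

# SHA256 digest published for indx-0.0.4.tar.gz in the table above.
EXPECTED_SHA256 = "448e140b979bd0203e309685de514829d71bfdaf57868954aed482ca0b62b61d"


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in fixed-size blocks so large files never load fully into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for block in iter(lambda: fh.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def matches(path: Path, expected_hex: str) -> bool:
    """Compare the file's digest with the expected hex string."""
    return hmac.compare_digest(sha256_of(path), expected_hex.lower())


if __name__ == "__main__":
    sdist = Path("indx-0.0.4.tar.gz")   # assumed download location
    if sdist.exists():
        print("OK" if matches(sdist, EXPECTED_SHA256) else "MISMATCH")
```

pip can enforce the same check at install time through its hash-checking mode (`--require-hashes` with pinned hashes in a requirements file).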


Provenance

The following attestation bundles were made for indx-0.0.4.tar.gz:

Publisher: release.yml on indxjp/indx

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: indx-0.0.4.tar.gz
- Subject digest: 448e140b979bd0203e309685de514829d71bfdaf57868954aed482ca0b62b61d
- Sigstore transparency entry: 1814850482
- Sigstore integration time: Jun 14, 2026
Source repository:
- Permalink: indxjp/indx@ca7e7482114c2a86c1b27a2af6cf362caf440762
- Branch / Tag: refs/tags/v0.0.4
- Owner: https://github.com/indxjp
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@ca7e7482114c2a86c1b27a2af6cf362caf440762
- Trigger Event: push

File details

Details for the file indx-0.0.4-py3-none-any.whl.

File metadata

Download URL: indx-0.0.4-py3-none-any.whl
Upload date: Jun 14, 2026
Size: 565.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for indx-0.0.4-py3-none-any.whl
Algorithm	Hash digest
SHA256	`ca52adaa6e146c03082fbfca445f269d3072c6f86272a929ee730b5111be5212`
MD5	`a0082b6d72e0148b3f1f11dfecfed6ed`
BLAKE2b-256	`ca4c53c7296bc823b16ca001b318b7a3cf7a60ad697f2f0f8e6c55e75cebd5d1`


Provenance

The following attestation bundles were made for indx-0.0.4-py3-none-any.whl:

Publisher: release.yml on indxjp/indx

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: indx-0.0.4-py3-none-any.whl
- Subject digest: ca52adaa6e146c03082fbfca445f269d3072c6f86272a929ee730b5111be5212
- Sigstore transparency entry: 1814850564
- Sigstore integration time: Jun 14, 2026
Source repository:
- Permalink: indxjp/indx@ca7e7482114c2a86c1b27a2af6cf362caf440762
- Branch / Tag: refs/tags/v0.0.4
- Owner: https://github.com/indxjp
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@ca7e7482114c2a86c1b27a2af6cf362caf440762
- Trigger Event: push

indx 0.0.4
