Make directories AI-ready, not just files — turn a directory into a portable knowledge space.
indx
Make directories AI-ready — not just files.
Point indx at a folder and get back a knowledge space: structure, folder lineage, file-to-file relationships, and semantic metadata that AI agents and RAG systems can actually reason over.
A parser turns one PDF into clean text. indx turns an entire folder into a knowledge space — and keeps the things parsers throw away: the folder a file lived in, the documents beside it, the report it continues, the contract it references, and what kind of document it is.
The thesis: most real knowledge doesn't live in a single file — it lives in the arrangement of files. indx keeps that map and hands it to your agent.
indx composes file parsers (Docling, Unstructured, LlamaParse, MarkItDown, …) rather than replacing them, then layers on what they discard. Every major component — parser, LLM, VLM, embedder, vector store, output — is a swappable, typed slot with a sensible default. Open-source · Python · CLI + SDK · Apache-2.0.
See it in one command: indx demo
No data, no installs, no API keys. indx demo builds, inspects, and queries a bundled
sample corpus, fully offline — the whole flow in a single command:
pip install indx
indx demo
$ indx demo
indx demo — building a sample 'team handbook' knowledge space…
stage: walk
stage: parse
stage: chunk
stage: relate
stage: enrich
stage: embed-pack
✓ 7 docs · 7 chunks · 19 relations → /tmp/indx-demo-XXXX/demo (0.01s)
components: parser=plaintext llm=none embedder=hash store=jsonl format=.indx
/tmp/indx-demo-XXXX/demo schema=2 indx=0.0.4
documents=7 chunks=7 relations=19 embeddings=7 embedding=hash/256
Types Relations
type count type count
markdown 6 references 14
text 1 sibling 5
sample query (keyword/lexical, offline): how do I onboard?
score source text
0.121 engineering/code-review.md # Code Review Code review keeps our codebase…
0.098 people/remote-work.md # Remote Work Policy Acme Robotics is remote-…
0.095 handbook/welcome.md # Welcome to Acme Robotics This is the Acme …
✓ that's the whole flow — built offline with keyword/lexical retrieval, no API key.
run it on your own folder: indx ./your-docs --out ./ai-ready.indx --offline
(A trimmed, ANSI-stripped transcript of a real indx demo run.)
Now point it at your own folder:
indx ./docs --out ./ai-ready.indx --offline # index a folder, fully offline, zero extra deps
indx inspect ./ai-ready.indx # structure, type histogram, relation sample
indx query "how do I onboard?" ./ai-ready.indx
indx ask "what is the leave policy?" ./ai-ready.indx # answer with cited sources
indx app # visual configure → build → inspect → query in the browser
A home knowledge base — one permanent DB, no path juggling
indx keeps a single persistent personal space under ~/.indx/ (override with $INDX_HOME).
Build, add, query, and ask default to it when you pass no path — so it works like a personal
notebook you can keep adding to:
indx add ./notes/standup.md # append a file to the home DB (no path = home)
indx add ./reports # add a whole folder incrementally
indx query "what did we decide about retention?" # query home, no archive argument
indx ask "summarize this week's standups" # answer with citations, offline-friendly
indx home stats # counts for the home space
indx home path # print the home dir
Incremental CRUD works on any .indx too — indx add <path> <space>, indx rm <doc|path> <space>,
indx update <path> <space> mutate a sealed archive in place without a full rebuild.
Filter what gets indexed, run only the stages you need, compose spaces
# Conditional import: index only what matches (globs · extensions · size · depth · count)
indx ./repo --out ./code.indx --offline --ext md --ext py \
--exclude '**/_drafts/**' --max-size 2mb --max-depth 3 --dry-run
# Granular stages: stop after chunking (no embeddings yet), or inspect one member
indx ./docs --out ./docs.indx --offline --through chunk
indx inspect ./docs.indx --part documents --json
# indx of indx: build an empty parent first, then federate query/inspect across child archives
indx ./empty --out ./all.indx --offline # create the (empty) parent space
indx compose ./all.indx --add ./eng.indx --add ./design.indx
indx query "onboarding checklist" ./all.indx # value-first; hits drawn from every child, globally ranked
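The conditional-import flags above (extensions, exclude globs, size, depth) amount to a predicate applied to each file during the walk. The following is a minimal sketch of that idea in plain Python, not indx's implementation: `select_files`, its parameters, and the depth convention (a file at the root counts as depth 1) are all assumptions made for illustration.

```python
import fnmatch
import tempfile
from pathlib import Path

def select_files(root, ext=(), exclude=(), max_size=None, max_depth=None):
    """Yield paths (relative to root) that pass every filter. Hypothetical helper."""
    root = Path(root)
    wanted = {e.lower().lstrip(".") for e in ext}
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        rel = path.relative_to(root)
        # Assumed convention: depth = number of path components, root file = 1.
        if max_depth is not None and len(rel.parts) > max_depth:
            continue
        if wanted and path.suffix.lstrip(".").lower() not in wanted:
            continue
        # fnmatch's "*" also crosses "/", so "**/_drafts/**" only approximates a
        # recursive glob; a top-level "_drafts/" would need its own pattern.
        if any(fnmatch.fnmatch(rel.as_posix(), pat) for pat in exclude):
            continue
        if max_size is not None and path.stat().st_size > max_size:
            continue
        yield rel

# Build a throwaway tree and apply the same kinds of filters as the CLI example.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    sample = [("README.md", 10), ("src/app.py", 10), ("src/_drafts/wip.md", 10),
              ("assets/logo.png", 10), ("docs/big.md", 5000), ("a/b/c/deep.md", 10)]
    for name, size in sample:
        target = root / name
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(b"x" * size)
    selected = [p.as_posix() for p in select_files(
        root, ext=("md", "py"), exclude=("**/_drafts/**",),
        max_size=2000, max_depth=3)]

print(selected)
```

Each dropped file in the sample fails exactly one rule (wrong extension, excluded folder, too large, too deep), which makes a listing like this a cheap way to sanity-check a filter set before a real build, much as --dry-run does on the CLI.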
Why it matters: a chunk that remembers everything
A flat parse → split → embed → store pipeline gives you orphaned text fragments. indx
gives you chunks that carry their whole context with them. Here's a single chunk as it
appears in the readable index.json:
{
"id": "chunk_0481",
"doc_id": "doc_0007",
"position": 12,
"text": "Enterprise data is retained for 90 days…",
"prev_id": "chunk_0480",
"next_id": "chunk_0482",
"source": { "path": "policies/data/retention.pdf", "folder": "policies/data", "type": "policy" },
"metadata": { "topics": ["retention", "compliance"], "summary": "90-day retention rule…", "tags": ["data-retention", "gdpr"] },
"relations": [ { "src": "chunk_0481", "dst": "legal/gdpr.md", "type": "references", "score": 1.0 } ]
}
It knows where it came from (source), what it's about (metadata), what sits
next to it (prev_id / next_id), and what it points to (relations). An agent can
filter by location or type, expand the context window around a hit, and follow knowledge
instead of just matching it. Ids are deterministic, so a knowledge space is diffable and
reproducible.
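To make "expand the context window around a hit" concrete, here is a minimal sketch that works on plain dictionaries shaped like the chunk above. It does not use the indx SDK: `make_chunk`, `expand`, `in_folder`, and the in-memory `index` are hypothetical helpers, and only the field names (`prev_id`, `next_id`, `source.folder`) come from the example JSON.

```python
from typing import Dict, List

def make_chunk(n: int, folder: str) -> dict:
    """Build a toy chunk whose fields mirror the JSON example (hypothetical data)."""
    return {
        "id": f"chunk_{n:04d}",
        "text": f"text of chunk {n}",
        "prev_id": f"chunk_{n - 1:04d}",
        "next_id": f"chunk_{n + 1:04d}",
        "source": {"folder": folder},
    }

def expand(index: Dict[str, dict], hit_id: str, radius: int = 1) -> List[dict]:
    """Return the hit plus up to `radius` neighbours per side, in reading order."""
    hit = index[hit_id]
    before, after = [], []
    cur = hit
    for _ in range(radius):
        cur = index.get(cur["prev_id"])  # None once we run off the index
        if cur is None:
            break
        before.append(cur)
    cur = hit
    for _ in range(radius):
        cur = index.get(cur["next_id"])
        if cur is None:
            break
        after.append(cur)
    return before[::-1] + [hit] + after

def in_folder(index: Dict[str, dict], prefix: str) -> List[dict]:
    """Filter chunks by the folder they came from."""
    return [c for c in index.values() if c["source"]["folder"].startswith(prefix)]

chunks = [make_chunk(n, "policies/data") for n in (480, 481, 482)]
chunks.append(make_chunk(900, "legal"))
index = {c["id"]: c for c in chunks}

window = [c["id"] for c in expand(index, "chunk_0481", radius=2)]
print(window)
print([c["id"] for c in in_folder(index, "policies")])
```

Because each chunk carries its own neighbour ids and source folder, both operations are dictionary lookups; no second pass over the original files is needed.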
How it works
indx is a pipeline of six ordered, individually-replaceable stages that share one
mutable SpaceContext:
01 Walk → 02 Parse → 03 Chunk → 04 Relate → 05 Enrich → 06 Embed+Pack
The pipeline is a list you control: insert a stage (say, PII redaction before enrichment), swap one, or drop one (skip Enrich when no LLM is available) without touching its neighbors. The whole model is symmetric across the CLI and SDK:
from indx import DirectoryPipeline, KnowledgeSpace
# Build (default stack is cloud-backed; needs OPENAI_API_KEY)
space = DirectoryPipeline().run("./docs", "./ai-ready")
print(space.stats) # counts, timings, components used
for doc in space.documents(type="contract"):
    print(doc.path, doc.topics, doc.summary)
# Re-load the portable archive anywhere — no re-processing
space = KnowledgeSpace.load("./ai-ready/handbook.indx")
hits = space.search("data retention", k=5)
answer = space.ask("how long is data retained?") # extractive offline, or LLM-synthesized
print(answer.answer, answer.sources)
# Incremental CRUD — append / re-ingest / delete without a full rebuild, then reseal
space.add("./notes/new-policy.md")
space.update("./notes/new-policy.md")
space.remove("./notes/new-policy.md")
space.save("./ai-ready/handbook.indx")
# Selective load and filtered build
from indx import WalkFilter
docs = KnowledgeSpace.load_part("./ai-ready/handbook.indx", "documents") # one member only
filtered = DirectoryPipeline(filter=WalkFilter(ext=[".md"], max_size="2mb"))
The four core objects you need to know — KnowledgeSpace, Document, Chunk, Relation — are explained in Core concepts.
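The "pipeline is a list you control" idea from How it works can be illustrated without the SDK. The sketch below is a toy model, assuming each stage is a plain function that mutates a shared dictionary; the stage bodies, `redact_pii`, and `run` are hypothetical stand-ins, not indx's stage classes or its SpaceContext.

```python
import re
from typing import Callable, Dict, List, Optional

Stage = Callable[[Dict], None]  # each stage mutates the shared context in place

def walk(ctx): ctx["files"] = ["a.md", "b.md"]
def parse(ctx): ctx["docs"] = [f"contact bob@example.com about {p}" for p in ctx["files"]]
def chunk(ctx): ctx["chunks"] = list(ctx["docs"])
def relate(ctx): ctx["relations"] = []
def enrich(ctx): ctx["metadata"] = {"enriched": True}
def embed_pack(ctx): ctx["packed"] = len(ctx["chunks"])

def redact_pii(ctx):
    """Extra stage: mask anything that looks like an e-mail address."""
    ctx["chunks"] = [re.sub(r"\S+@\S+", "[redacted]", c) for c in ctx["chunks"]]

def run(stages: List[Stage], ctx: Optional[Dict] = None) -> Dict:
    """Run stages in order over one shared context, recording the order used."""
    ctx = {} if ctx is None else ctx
    for stage in stages:
        stage(ctx)
        ctx.setdefault("trace", []).append(stage.__name__)
    return ctx

default = [walk, parse, chunk, relate, enrich, embed_pack]

# Insert a stage before enrichment without touching its neighbours.
custom = list(default)
custom.insert(custom.index(enrich), redact_pii)

# Drop a stage, e.g. when no LLM is available.
no_llm = [s for s in default if s is not enrich]

custom_ctx = run(custom)
no_llm_ctx = run(no_llm)
print(custom_ctx["trace"])
print(custom_ctx["chunks"])
```

Keeping stages as entries in an ordinary list means inserting, swapping, and dropping are list operations, and the recorded trace shows exactly which stages produced a given context.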
Bring your own stack — no lock-in
Every slot is a typed interface with a default and zero lock-in. Mix and match by name:
| Slot | Default (cloud) | Offline core | Other built-ins |
|---|---|---|---|
| Parser | docling | plaintext | unstructured · llamaparse · markitdown · textract · docintel · docai |
| LLM | openai:gpt-5-mini | none | ollama · anthropic · litellm · vllm · azure · bedrock · vertex |
| VLM | none | none | gpt4o · qwen-vl · local · bedrock · azure · vertex |
| Embedder | openai:text-embedding-3-small | hash | bge-m3 · e5 · cohere · bedrock · azure · vertex · litellm |
| Store | qdrant | jsonl (no DB) | pgvector · chroma · lancedb · s3vectors · opensearch · azure-search · bigquery · vertex-vector |
| Output | .indx archive | .indx / jsonl | langchain · llamaindex |
indx ./docs --out ./ai-ready --offline # zero-dependency core: plaintext → hash → jsonl → .indx
indx ./docs --out ./ai-ready --store chroma # override a single slot; everything else keeps its default
Three managed cloud profiles wire every slot to one vendor with a single install and a single flag:
pip install "indx[aws]" && indx ./docs --out ./out --aws # Textract → Bedrock → Titan → S3 Vectors
pip install "indx[azure]" && indx ./docs --out ./out --azure # Doc Intelligence → Azure OpenAI → AI Search
pip install "indx[gcp]" && indx ./docs --out ./out --gcp # Document AI → Gemini → gemini-embedding → BigQuery
About the offline core: the hash embedder is a deterministic hashing trick, so offline query is keyword/lexical retrieval, not semantic vector search; true semantic search needs a real embedder extra (e.g. bge or openai). Likewise, the offline enrich step derives metadata (type, topics, tags, summary) locally, with no LLM call. Build-time enrich is local-only: LLM/VLM synthesis is applied at query time via ask (indx ask, or set a real --llm), not during the build. The default stack (without --offline) is cloud-backed: install it with pip install "indx[cloud]" and set OPENAI_API_KEY.
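To show why a hashing-trick embedder gives lexical rather than semantic retrieval, here is a minimal sketch of the general technique. It is not indx's hash embedder: the tokenizer, the SHA-256 bucketing, and the sign bit are assumptions, and only the 256-dimension size is taken from the demo transcript (embedding=hash/256).

```python
import hashlib
import math
import re
from typing import List

DIM = 256  # matches hash/256 in the demo output; everything else is assumed

def embed(text: str, dim: int = DIM) -> List[float]:
    """Map text to a unit vector by hashing each token into a signed bucket."""
    vec = [0.0] * dim
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") % dim
        sign = 1.0 if digest[4] % 2 == 0 else -1.0
        vec[bucket] += sign
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec

def cosine(a: List[float], b: List[float]) -> float:
    """Dot product; equals cosine similarity because vectors are unit length."""
    return sum(x * y for x, y in zip(a, b))

query = embed("data retention")
print(cosine(query, embed("retention data")))            # same tokens, other order
print(cosine(query, embed("how long we keep records")))  # same topic, no shared tokens
```

The vector depends only on which tokens occur, so word order is ignored and a paraphrase that shares no tokens with the query scores near zero (exactly zero unless two tokens happen to land in the same bucket). That is the behavior the paragraph above describes as keyword/lexical retrieval, and it is also why the output is fully deterministic.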
Plug a knowledge space into an AI agent
A .indx archive is a portable knowledge space — carry it like a USB drive and plug it
into any agent framework in one line:
from indx.agent import connect
kb = connect("ai-ready/handbook.indx") # load the "USB drive"
tools = kb.openai() # OpenAI Agents SDK …or .langchain() / .pydantic_ai() / .claude()
Or serve it to any MCP client — Claude Desktop, Cursor, or the TypeScript Mastra framework — with no Python glue on the client side:
pip install "indx[agent]" # all framework adapters + the MCP server
indx mcp ai-ready/handbook.indx # serve indx_search / indx_overview / indx_get_document
Every connector exposes the same three read-only tools — search, overview, get-document — built on the same retrieval path as the CLI. See the AI agents guide.
Who it's for
- RAG / agent engineers who want grounded context with relationships, not flat chunk soup.
- Enterprise & air-gapped platform teams that need fully local, auditable, reproducible ingestion across large on-prem document estates — no byte leaves the network.
- OSS developers & integrators who want a composable, no-lock-in library they can extend with their own parser, store, or output.
- Researchers turning archives of papers, datasets, and notes into a navigable, citable, shareable knowledge graph.
indx is a build-time knowledge layer, not a runtime framework. It produces the portable archive that LangChain, LlamaIndex, agents, and vector DBs consume — use it with them, not instead of them. See the comparison.
Status
Alpha (0.0.6). The zero-dependency core path (plaintext → hash → jsonl → .indx)
runs end to end and is fully air-gapped — reach it with indx demo or --offline. The
optional cloud/local backends (docling, openai, ollama, bge-m3, qdrant, plus the managed
AWS/Azure/GCP profiles, …) are implemented and selected through the registry: install the
matching extra (e.g. pip install "indx[cloud]") and provide credentials to switch a slot
onto it. The .indx format is at schema_version "2" (readers still load "1" archives — the
children reference list is additive); public APIs may still shift before 1.0 — see the
CHANGELOG and the docs.
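The compatibility note above (schema "2" readers still load "1" archives because the children list is additive) follows a common pattern: accept every known version and default the fields that older versions lack. The sketch below illustrates that pattern only; the manifest layout, the `children` key name, and `read_manifest` are hypothetical and are not the actual .indx format.

```python
import json

SUPPORTED_VERSIONS = {"1", "2"}

def read_manifest(raw: str) -> dict:
    """Parse a (hypothetical) manifest, tolerating older schema versions."""
    manifest = json.loads(raw)
    version = str(manifest.get("schema_version"))
    if version not in SUPPORTED_VERSIONS:
        raise ValueError(f"unsupported schema_version: {version!r}")
    # An additive field: absent in version 1, so default it to an empty list.
    manifest.setdefault("children", [])
    return manifest

v1 = read_manifest('{"schema_version": "1", "documents": 7}')
v2 = read_manifest('{"schema_version": "2", "documents": 7, "children": ["eng.indx"]}')
print(v1["children"], v2["children"])
```

Because the new field only adds information, downstream code can treat both versions uniformly and simply sees an empty child list for older archives.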
Documentation
Full documentation — quickstart, concepts, the pipeline & stages, guides, and the complete CLI/SDK reference — lives at docs.indx.jp.
Development
python -m venv .venv && . .venv/bin/activate
pip install -e ".[dev]"
nox -s tests # fast offline suite: unit + corpus
nox -l # list every session (integration / docker / airgap / live / record-fixtures)
Contributions are welcome — see CONTRIBUTING.md, and Adding a backend to author a new slot implementation.
License
Apache-2.0.