ContextIQ

Turn messy files into agent-ready context for RAG, search, and AI workflows.

ContextIQ is a local-first ingestion pipeline for developers building RAG systems, agent memory layers, document search, and eval datasets.

Point it at a folder of mixed files and it produces clean, traceable JSONL and Markdown outputs that AI systems can actually use.

Why ContextIQ

Most AI tooling starts after your data is already clean. Real projects usually break much earlier:

  • PDFs are noisy
  • Word docs lose structure
  • JSON and CSV need normalization
  • repos and notes mix formats
  • chunks become inconsistent
  • source traceability gets lost

ContextIQ focuses on the missing middle: ingestion, normalization, chunking, and export.

Installation

Install from PyPI:

pip install contextiq

Run the CLI:

contextiq ingest ./docs --out ./build/context

Or with module execution:

python -m contextiq ingest ./docs --out ./build/context

Quickstart

Use the built-in example content:

contextiq ingest ./examples --out ./build/context

PowerShell example:

contextiq ingest .\examples --out .\build\context

Generated output:

  • documents.jsonl - normalized source documents
  • chunks.jsonl - chunked outputs for RAG and agents
  • chunks.md - human-readable review output
  • manifest.json - run summary, warnings, and config
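
The record schema of the JSONL files is not spelled out here, so a cautious first step is to inspect an export before depending on specific fields. The following is a minimal sketch that reads any JSONL file and reports which top-level keys it contains. The helper names `read_jsonl` and `summarize_keys` are hypothetical and are not part of ContextIQ.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Yield one parsed JSON value per non-empty line of a JSONL file."""
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def summarize_keys(path):
    """Return (record_count, sorted top-level keys) without assuming a schema."""
    count = 0
    keys = set()
    for record in read_jsonl(path):
        count += 1
        if isinstance(record, dict):
            keys.update(record)
    return count, sorted(keys)
```

Calling `summarize_keys("build/context/chunks.jsonl")` after a run shows the field names actually present, which is safer than hard-coding them in downstream code.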

What It Supports

Built-in file types

  • .txt, .md, .rst
  • .json, .jsonl
  • .csv, .tsv
  • .html, .htm
  • optional .pdf via pypdf
  • optional .docx via python-docx

Output behavior

  • recursive directory ingestion
  • normalized plain-text extraction
  • document-aware chunking
  • source-preserving metadata
  • JSONL and Markdown export
  • manifest output for reproducibility

CLI

Basic usage

contextiq ingest <path> --out <directory>

Useful flags

  • --include-ext .md,.txt,.json
  • --exclude-glob "*.min.js,*.lock"
  • --chunk-size 1200
  • --chunk-overlap 150
  • --formats jsonl,markdown
  • --fail-on-warning

Example commands

contextiq ingest ./docs --out ./dist/context --chunk-size 900 --chunk-overlap 120
contextiq ingest ./knowledge-base --out ./build/export --include-ext .md,.txt,.json
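
When ingestion is one step of a larger Python build script, the same commands can be assembled as an argument list and passed to `subprocess`. This is a sketch only: it uses the flags listed above, assumes the `contextiq` executable is on `PATH`, and assumes the CLI exits with a non-zero status on failure. The functions `build_ingest_command` and `run_ingest` are hypothetical wrappers, not a ContextIQ API.

```python
import subprocess


def build_ingest_command(source, out, chunk_size=None, chunk_overlap=None,
                         include_ext=None, fail_on_warning=False):
    """Build the argv list for `contextiq ingest` from the documented flags."""
    cmd = ["contextiq", "ingest", str(source), "--out", str(out)]
    if chunk_size is not None:
        cmd += ["--chunk-size", str(chunk_size)]
    if chunk_overlap is not None:
        cmd += ["--chunk-overlap", str(chunk_overlap)]
    if include_ext:
        # The CLI takes a comma-separated list such as ".md,.txt,.json".
        cmd += ["--include-ext", ",".join(include_ext)]
    if fail_on_warning:
        cmd.append("--fail-on-warning")
    return cmd


def run_ingest(source, out, **options):
    """Run the CLI; raises CalledProcessError on a non-zero exit (assumed)."""
    return subprocess.run(build_ingest_command(source, out, **options), check=True)
```

Passing a list instead of a shell string avoids quoting problems with paths that contain spaces.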

How It Works

ContextIQ runs in four stages:

1. Discovery

Recursively finds supported files while skipping common noise such as virtualenvs, caches, and build directories.
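
The idea behind discovery can be sketched with `os.walk`, pruning unwanted directories before descending into them. This is an illustration, not the code in `discovery.py`: the function name `discover` and the exact contents of `SKIP_DIRS` are assumptions, and the extension set is copied from the built-in file types listed earlier.

```python
import os
from pathlib import Path

# Assumed examples of "common noise"; the real skip list may differ.
SKIP_DIRS = {".git", ".venv", "venv", "__pycache__", "node_modules", "build", "dist"}
# Built-in extensions from the supported file types list.
EXTENSIONS = {".txt", ".md", ".rst", ".json", ".jsonl", ".csv", ".tsv", ".html", ".htm"}


def discover(root, extensions=EXTENSIONS, skip_dirs=SKIP_DIRS):
    """Return supported files under root in a stable order, skipping noise dirs."""
    found = []
    for current, dirnames, filenames in os.walk(root):
        # Editing dirnames in place stops os.walk from entering skipped dirs.
        dirnames[:] = sorted(d for d in dirnames if d not in skip_dirs)
        for name in sorted(filenames):
            path = Path(current) / name
            if path.suffix.lower() in extensions:
                found.append(path)
    return found
```

Sorting both directory and file names keeps the result deterministic across runs and platforms, which matters if the output is meant to be reproducible.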

2. Loading and normalization

Converts each file into normalized plain text:

  • Markdown and text are read directly
  • JSON and JSONL are pretty-printed into readable text
  • CSV and TSV become row-based text
  • HTML is stripped to visible text
  • PDF and DOCX are supported through optional extras
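
To make the conversions above concrete, here is a standard-library sketch of three of them. It is one plausible reading of "pretty-printed", "row-based text", and "visible text", not the behavior of `loaders.py`. The function names and the `Row N: key: value` layout are assumptions made for illustration.

```python
import csv
import io
import json
from html.parser import HTMLParser


def normalize_json(text):
    """Pretty-print a JSON document as indented, readable text."""
    return json.dumps(json.loads(text), indent=2, ensure_ascii=False)


def normalize_delimited(text, delimiter=","):
    """Turn CSV/TSV into one line per row, pairing each value with its header."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    lines = []
    for index, row in enumerate(reader, start=1):
        fields = "; ".join(f"{key}: {value}" for key, value in row.items())
        lines.append(f"Row {index}: {fields}")
    return "\n".join(lines)


class _VisibleText(HTMLParser):
    """Collect text nodes, ignoring the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def normalize_html(text):
    """Strip markup and return the visible text joined by single spaces."""
    parser = _VisibleText()
    parser.feed(text)
    parser.close()
    return " ".join(parser.parts)
```

For TSV input, call `normalize_delimited(text, delimiter="\t")`.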

3. Chunking

Splits documents into retrieval-friendly chunks with:

  • target chunk size
  • overlap between chunks
  • paragraph and sentence-aware boundaries
  • source path and character ranges preserved
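
The four properties above can be illustrated with a small character-based chunker. This is a sketch under stated assumptions and not the algorithm in `chunking.py`: it measures size in characters, prefers a paragraph break (`"\n\n"`) and then a sentence end in the second half of each window, and falls back to a hard cut. `Chunk`, `chunk_text`, and `_find_break` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    source: str  # path of the originating file
    start: int   # inclusive character offset into the normalized text
    end: int     # exclusive character offset
    text: str


def _find_break(text, start, hard_end):
    """Pick a cut point, preferring paragraph ends, then sentence ends."""
    if hard_end >= len(text):
        return len(text)
    window = text[start:hard_end]
    minimum = len(window) // 2  # never cut in the first half of the window
    paragraph = window.rfind("\n\n")
    if paragraph >= minimum:
        return start + paragraph + 2
    sentence = max(window.rfind(mark) for mark in (". ", "! ", "? "))
    if sentence >= minimum:
        return start + sentence + 2
    return hard_end


def chunk_text(text, source, chunk_size=1200, overlap=150):
    """Split text into overlapping chunks that record their character ranges."""
    if not 0 <= overlap < chunk_size // 2:
        raise ValueError("overlap must be non-negative and below half of chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        end = _find_break(text, start, start + chunk_size)
        chunks.append(Chunk(source, start, end, text[start:end]))
        if end >= len(text):
            break
        start = end - overlap  # step back so neighbouring chunks share context
    return chunks
```

Because every chunk keeps `start` and `end`, the invariant `chunk.text == text[chunk.start:chunk.end]` always holds, which is what makes a retrieved chunk traceable back to its exact place in the source. Limiting the overlap to less than half the chunk size guarantees that each iteration moves forward.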

4. Export

Writes machine-friendly and human-readable outputs for downstream AI workflows.
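
As a rough picture of the export stage, the sketch below writes the same chunks twice: once as JSONL for machines and once as Markdown for human review. The file names `chunks.jsonl` and `chunks.md` come from the output list earlier in this document. The record fields (`source`, `start`, `end`, `text`), the Markdown heading layout, and the function `export_chunks` are assumptions for illustration and may not match what `exporters.py` produces.

```python
import json
from pathlib import Path


def export_chunks(chunks, out_dir):
    """Write chunk dicts to chunks.jsonl and a reviewable chunks.md."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    # One JSON object per line keeps the file streamable and append-friendly.
    with (out / "chunks.jsonl").open("w", encoding="utf-8") as handle:
        for chunk in chunks:
            handle.write(json.dumps(chunk, ensure_ascii=False) + "\n")

    # The Markdown view repeats the source path and character range per chunk.
    sections = []
    for index, chunk in enumerate(chunks, start=1):
        header = f"## Chunk {index}: {chunk['source']} [{chunk['start']}:{chunk['end']}]"
        sections.append(f"{header}\n\n{chunk['text']}\n")
    (out / "chunks.md").write_text("\n".join(sections), encoding="utf-8")
```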

Project Structure

src/contextiq/
|- cli.py
|- pipeline.py
|- loaders.py
|- chunking.py
|- exporters.py
|- discovery.py
|- models.py
`- utils.py

Use Cases

RAG ingestion

Prepare mixed files for vector indexing and retrieval pipelines.

Agent memory and context packing

Turn project docs into clean, bounded chunks for coding and research agents.

Search systems

Produce normalized text and chunk exports for semantic or hybrid retrieval.

Eval datasets

Create stable, traceable corpora for retrieval benchmarking and prompt evaluation.

Development

Install editable dependencies:

pip install -e ".[dev]"

Run tests:

pytest

Run the demo (PowerShell):

.\demo.ps1

Roadmap

  • embeddings plugin interface
  • vector database exporters
  • OCR pipeline
  • table extraction
  • citation-aware retrieval benchmarks

Contributing

Contributions are welcome.

  • improve loaders
  • add exporters
  • extend chunking strategies
  • improve docs and examples

Open an issue or submit a PR if you want to help shape ContextIQ.

License

MIT License - see LICENSE
