Local-first document corpus pipeline for grounded AI agents

Grounding

Grounding converts PDF, EPUB, DOCX, and Markdown documents into a structured, searchable corpus with per-agent embedding indexes. Drop documents into staging, get chunked Markdown with provenance hashing, FAISS vector indexes, and agent-filtered search -- all running locally, no cloud APIs required.

What It Does

Documents (PDF/EPUB/DOCX/MD)
    |
    v
[ Parse ] ── Unstructured / Marker
    |
    v
[ Chunk ] ── LangChain text splitters + YAML front matter
    |
    v
[ Hash ]  ── SHA-1 + SHA-256 + BLAKE3 provenance
    |
    v
[ Index ] ── FAISS embeddings, filtered per agent
    |
    v
[ Query ] ── Local RAG via Ollama with agentic tool calling

Key Features

Deterministic pipeline -- same inputs produce byte-identical outputs
Content provenance -- SHA-1, SHA-256, and BLAKE3 hashing on every document and chunk
Agent-based corpus partitioning -- YAML-defined agents filter the corpus by collection tags, each with their own FAISS embedding index
Persona system -- agents have configurable communication styles, expertise areas, and greeting messages
Staging watcher -- drop files into a folder, auto-ingest with embedding updates
Multi-machine ready -- optional Syncthing-based architecture for dedicated ingestion servers
Fully local -- no cloud APIs, no telemetry, your documents stay on your machine
Agentic RAG -- local LLMs autonomously decide when to search the corpus via tool calling
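
As a rough illustration of the content-provenance feature, the sketch below computes the three digests for a byte string. It is not Grounding's own code: the function name `provenance_hashes` is hypothetical, and BLAKE3 is not in the Python standard library, so the sketch assumes the third-party `blake3` package and simply omits that digest when the package is missing.

```python
import hashlib


def provenance_hashes(data: bytes) -> dict[str, str]:
    """Return hex digests for one document or chunk (hypothetical helper)."""
    digests = {
        "sha1": hashlib.sha1(data).hexdigest(),
        "sha256": hashlib.sha256(data).hexdigest(),
    }
    try:
        # Assumption: the optional third-party `blake3` package is installed.
        from blake3 import blake3

        digests["blake3"] = blake3(data).hexdigest()
    except ImportError:
        pass  # leave BLAKE3 out rather than fail when the package is absent
    return digests
```

Because every digest depends only on the input bytes, hashing the same chunk twice gives the same values, which is what makes a "same inputs, same outputs" check possible.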

Quick Start

Install from PyPI

python3 -m venv venv   # Python 3.10-3.13 supported
source venv/bin/activate
pip install grounding-ai

Then grab the example configs and agents from the repo:

curl -O https://raw.githubusercontent.com/andyliszewski/grounding-ai/main/config.example.yaml
curl -O https://raw.githubusercontent.com/andyliszewski/grounding-ai/main/.mcp.example.json
mkdir -p agents && cd agents && \
  curl -O https://raw.githubusercontent.com/andyliszewski/grounding-ai/main/agents/examples/scientist.yaml && \
  cd ..
cp config.example.yaml config.yaml
cp .mcp.example.json .mcp.json

Install from source (for development)

git clone https://github.com/andyliszewski/grounding-ai.git
cd grounding-ai
python3 -m venv venv
source venv/bin/activate
pip install -e .
cp config.example.yaml config.yaml
cp .mcp.example.json .mcp.json
cp agents/examples/*.yaml agents/

First run (end-to-end)

# 1. Ingest some documents
grounding ./my-documents ./corpus --collections science

# 2. Generate embeddings for the scientist agent
grounding embeddings --agent scientist --corpus ./corpus

# 3. Query with a local LLM (requires Ollama running)
python scripts/local_rag.py --agent scientist -A

A typical session looks like this:

$ python scripts/local_rag.py --agent scientist -A
🔬 Scientist agent ready (3,142 chunks indexed across 8 collections)

> What does Kuhn mean by a paradigm shift?

[searching corpus... 5 chunks retrieved]

A paradigm shift, in Kuhn's framing, is a discontinuous change in the
fundamental assumptions of a scientific community [Source: Kuhn, The
Structure of Scientific Revolutions, corpus]. It happens when accumulated
anomalies can no longer be explained within the existing paradigm and a
new framework displaces the old one — not through gradual refinement but
through a gestalt-like reorientation.

[Derived] The process is social as much as epistemic: Kuhn emphasizes that
competing paradigms are often incommensurable, meaning proponents of each
literally see the world differently.

Agent System

Agents are YAML files that define a persona and a corpus filter:

name: scientist
description: Scientific research agent

persona:
  icon: "🔬"
  style: |
    You communicate like a rigorous scientist: analytical,
    evidence-based, and methodical.
  expertise:
    - Scientific method and experimental design
    - Biology and biochemistry
    - Physics fundamentals
  greeting: |
    I'm your scientific advisor. What would you like to investigate?

corpus_filter:
  collections:
    - science
    - biology
    - chemistry
    - physics

Each agent gets its own FAISS embedding index containing only documents matching its collections. See agents/examples/ for starter templates.
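
The matching rule can be pictured as a set intersection: a document is visible to an agent when its collection tags overlap the agent's corpus_filter.collections. The sketch below is only an illustration under that reading; `visible_documents` and the document slugs are hypothetical, and the real filter may do more.

```python
def visible_documents(
    docs: dict[str, set[str]], agent_collections: set[str]
) -> list[str]:
    """Return slugs of documents whose tags overlap the agent's collections.

    `docs` maps a document slug to its collection tags (hypothetical shape).
    """
    return sorted(
        slug for slug, tags in docs.items() if tags & agent_collections
    )


# Collections taken from the scientist example above.
scientist = {"science", "biology", "chemistry", "physics"}

# Hypothetical corpus contents, for illustration only.
corpus = {
    "kuhn-structure": {"science", "philosophy"},
    "nash-games": {"game-theory"},
    "cell-biology": {"biology"},
}

selected = visible_documents(corpus, scientist)
```

Here `selected` contains `cell-biology` and `kuhn-structure`; `nash-games` is left out because none of its tags appear in the agent's list.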

Creating Your Own Agents

Define the agent. Create a YAML file in agents/:

# agents/my-agent.yaml
name: my-agent
description: What this agent knows about

persona:
  icon: "🎯"
  style: |
    How you want the agent to communicate.
  expertise:
    - Domain area 1
    - Domain area 2
  greeting: |
    Message shown when the agent activates.

corpus_filter:
  collections:
    - collection-tag-1
    - collection-tag-2

Ingest documents with matching collection tags. Collections are kebab-case labels you assign when ingesting:
```
grounding ./physics-textbooks ./corpus --collections physics
grounding ./biology-papers ./corpus --collections biology,science
```
A document can belong to multiple collections (comma-separated). An agent sees all documents whose collection tags overlap with its corpus_filter.collections list.

Generate embeddings for the agent:
```
grounding embeddings --agent my-agent --corpus ./corpus
```
This builds a FAISS index at embeddings/my-agent/ containing only chunks from documents matching the agent's collection filter.

Query the agent:

python scripts/local_rag.py --agent my-agent -A

Where Do My Agent Files Live?

By default, agents/*.yaml is gitignored (only agents/examples/ is tracked). This means YAMLs you create in agents/ won't show up in git status and won't be committed to your fork. There are three common workflows depending on how you want to manage your agents:

Workflow A: Local-only (simplest, no version control)

Just create YAMLs in agents/ and use them. Nothing extra to manage.

cp agents/examples/scientist.yaml agents/my-physicist.yaml
# Edit, then use immediately
grounding embeddings --agent my-physicist --corpus ./corpus

Good for: Trying things out, single machine, agents you don't need to back up.

Workflow B: Separate private agents repo (recommended for serious use)

Create your own private repo for agent definitions and point AGENTS_DIR at it. This is how the maintainer runs grounding -- agents are version-controlled and sync between machines via git.

# Create a private repo with this structure:
#   my-agents/
#   ├── agents/
#   │   ├── physicist.yaml
#   │   └── biologist.yaml
#   └── commands/         # Optional: Claude Code slash commands

# Clone it alongside grounding-ai
git clone git@github.com:youruser/my-agents.git ~/my-agents

# Point grounding at it
grounding embeddings --agent physicist --corpus ./corpus --agents-dir ~/my-agents/agents

# Or set the environment variable for the staging watcher
export AGENTS_DIR=~/my-agents/agents

Good for: Multi-machine setups, version-controlled agent definitions, keeping personal agents private while contributing back to grounding-ai.

Workflow C: Fork grounding-ai

Fork the repo and remove agents/*.yaml from .gitignore. Your agents become part of your fork.

# After forking
sed -i '' '/agents\/\*\.yaml/d' .gitignore   # remove the gitignore line (BSD/macOS sed; with GNU sed on Linux, drop the '' after -i)
git add agents/ .gitignore
git commit -m "track personal agent definitions"

Good for: Single-repo workflow, public agent libraries, contributing agent templates back upstream.

Organizing Collections

Collections are free-form tags -- there's no predefined list. Choose whatever makes sense for your domain:

staging/
├── physics/           # Collection: physics
├── biology/           # Collection: biology
├── game-theory/       # Collection: game-theory
└── machine-learning/  # Collection: machine-learning

One agent can span many collections (a "scientist" agent might include physics, biology, and chemistry). Multiple agents can share the same collections. The agent YAML is the only thing that defines which slices of the corpus each agent can search.

Project Structure

grounding-ai/
├── grounding/              # Python package (the pipeline)
├── scripts/                # Watcher, local RAG, utilities
├── mcp_servers/            # MCP corpus search server
├── agents/
│   └── examples/           # Starter agent definitions
├── tests/                  # Test suite
├── config.example.yaml     # Configuration template
├── .mcp.example.json       # MCP server config template
└── staging/                # Drop documents here for ingestion

Configuration

Copy config.example.yaml to config.yaml and adjust paths:

paths:
  corpus: ./corpus
  embeddings: ./embeddings
  staging: ./staging
  agents: ./agents
  originals: ./originals

Single machine (default): All paths are relative, everything runs locally.

Multi-machine: Point paths at Syncthing-shared directories. A dedicated server runs the staging watcher and generates embeddings; workstations sync the corpus and query it. See docs/multi-machine.md.

CLI Reference

# Ingest documents
grounding ./input-dir ./output-dir [options]
  --chunk-size 1200        # Characters per chunk (default: 1200)
  --chunk-overlap 150      # Overlap between chunks (default: 150)
  --parser marker          # Parser: unstructured or marker
  --ocr auto               # OCR: auto, on, or off
  --collections sci,math   # Collection tags (comma-separated)
  --dry-run                # Preview without writing
  --verbose                # Debug logging

# Agent management
grounding agents list --agents-dir ./agents
grounding agents show scientist --agents-dir ./agents

# Embedding generation
grounding embeddings --agent scientist --corpus ./corpus
grounding embeddings --agent scientist --corpus ./corpus --incremental
grounding embeddings --agent scientist --corpus ./corpus --check
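
To make --chunk-size and --chunk-overlap concrete, here is a minimal sliding-window sketch using the documented defaults (1200 characters, 150 overlap). It is a simplified stand-in, not the pipeline's splitter: Grounding uses LangChain text splitters, which also try to break on natural separators, and the function name `chunk_text` is hypothetical.

```python
def chunk_text(
    text: str, chunk_size: int = 1200, chunk_overlap: int = 150
) -> list[str]:
    """Split text into fixed-size character windows that overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks
```

With the defaults, each window starts 1050 characters after the previous one, so the last 150 characters of a chunk are repeated at the start of the next.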

Staging Watcher (Auto-Ingest)

For continuous ingestion, run the staging watcher: drop a document into your staging folder and it's automatically parsed, chunked, hashed, moved to originals/, and (optionally) added to affected agents' embedding indexes.

Single-machine setup

Requirements:

Linux: inotify-tools (sudo apt install inotify-tools)
macOS: the shipped script uses inotifywait, which is Linux-only, so either port the watcher to fswatch (brew install fswatch) or run it inside a Linux VM/container

Run manually:

export STAGING_DIR=./staging
export CORPUS_DIR=./corpus
export ORIGINALS_DIR=./originals
export SKIPPED_DIR=./skipped
export AGENTS_DIR=./agents
export EMBEDDINGS_DIR=./embeddings
export AUTO_EMBEDDINGS=true
export LOG_FILE=./watcher.log

./scripts/staging-watcher.sh

Then drop documents into a collection subfolder:

mkdir -p staging/science
cp ~/Downloads/paper.pdf staging/science/
# Watcher logs show: parsing → chunking → hashing → embeddings update

Processing rules:

| Source location | Collection tag | Destination after processing |
| --- | --- | --- |
| `staging/science/paper.pdf` | `science` | `corpus/<slug>/`, original → `originals/science/` |
| `staging/biology/book.epub` | `biology` | `corpus/<slug>/`, original → `originals/biology/` |
| Scanned PDF (no text yield) | n/a | moved to `skipped/<collection>/` |
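
The routing in the table can be sketched as a small path calculation. This is an illustration only, not the logic in staging-watcher.sh: `plan_destinations` is a hypothetical name, and treating the first folder under staging/ as the collection tag (even for deeper nesting) is an assumption.

```python
from pathlib import PurePosixPath


def plan_destinations(
    staged: str,
    staging_root: str = "staging",
    originals_root: str = "originals",
    skipped_root: str = "skipped",
) -> dict[str, str]:
    """Derive the collection tag and target folders for a staged file."""
    # relative_to raises ValueError when the file is not under staging_root.
    rel = PurePosixPath(staged).relative_to(staging_root)
    if len(rel.parts) < 2:
        raise ValueError("expected <staging>/<collection>/<file>")

    collection = rel.parts[0]  # assumption: first subfolder names the tag
    return {
        "collection": collection,
        "originals_dir": str(PurePosixPath(originals_root, collection)),
        "skipped_dir": str(PurePosixPath(skipped_root, collection)),
    }


plan = plan_destinations("staging/science/paper.pdf")
```

For the first table row, `plan` names the `science` collection, with `originals/science` as the destination for the original and `skipped/science` as the fallback if the file yields no text.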

When AUTO_EMBEDDINGS=true, every ingested document triggers incremental embedding updates for each agent whose corpus_filter.collections matches the document's collection.

Run as a systemd service (Linux)

# Copy the sample unit file and edit the paths
cp scripts/grounding-watcher.service.example ~/.config/systemd/user/grounding-watcher.service
# Edit Environment= lines to point at your directories

systemctl --user daemon-reload
systemctl --user enable --now grounding-watcher
journalctl --user -u grounding-watcher -f    # follow logs

Multi-machine setup

Point the watcher's paths at Syncthing-shared directories and run it on a dedicated ingestion server. Workstations sync corpus and embeddings and never run the watcher themselves. See docs/multi-machine.md.

Querying Your Corpus

Once you have an agent with embeddings, there are two ways to query its corpus. Both run entirely locally -- no cloud APIs, no telemetry.

Option 1: Local LLM via Ollama (recommended)

Ground a local LLM in your corpus using scripts/local_rag.py. The script loads your agent's persona, retrieves relevant chunks from its FAISS index, and feeds them to the LLM as context.

# Install Ollama and pull a model
brew install ollama          # macOS (or curl ... | sh on Linux)
ollama pull qwen2.5:14b      # Recommended for 32GB+ RAM
ollama serve

# Query your agent
python scripts/local_rag.py --agent scientist -A

The -A flag enables agentic mode, where the LLM autonomously decides when to search the corpus via tool calling. It can also perform multi-step reasoning, web searches, and file operations.

# Interactive REPL session
python scripts/local_rag.py --agent scientist -A

# Single query
python scripts/local_rag.py --agent scientist -A \
  --query "What are the thermodynamic limits of a Carnot engine?"

# With verbose output (see tool calls)
python scripts/local_rag.py --agent scientist -A -v

See how-to-local-agent.md for the full guide -- model recommendations, REPL commands, troubleshooting, and tool reference.

Option 2: MCP Server (for Claude Code, Cursor, etc.)

Grounding ships with an MCP (Model Context Protocol) server that exposes corpus search to any MCP-compatible client.

# 1. Install the MCP runtime into your venv (not a default dependency)
./venv/bin/pip install mcp

# 2. Copy the example and edit paths to match your setup
cp .mcp.example.json .mcp.json

Edit the three env vars in .mcp.json:

| Variable | Points at |
| --- | --- |
| `CORPUS_DIR` | Root containing `_index.json` and `<slug>/chunks/` (e.g. `./corpus` or `~/Documents/Corpora/corpus`) |
| `EMBEDDINGS_DIR` | Root containing `<agent>/_embeddings.faiss` and `_chunk_map.json` |
| `AGENTS_DIR` | Directory of agent YAML files. Can point outside the repo if you keep agents in a separate repo. |

CORPUS_DIR must be the same root used when embeddings were generated -- chunk paths in the FAISS chunk map are stored relative to it.
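
Before restarting your MCP client, a quick pre-flight check of those paths can save a debugging round. The sketch below is not part of Grounding; `missing_mcp_paths` is a hypothetical helper, and it assumes `_chunk_map.json` sits next to `_embeddings.faiss` inside the agent's folder.

```python
from pathlib import Path


def missing_mcp_paths(corpus_dir: str, embeddings_dir: str, agent: str) -> list[str]:
    """List the expected files that do not exist for one agent."""
    agent_dir = Path(embeddings_dir) / agent
    expected = [
        Path(corpus_dir) / "_index.json",
        agent_dir / "_embeddings.faiss",
        # Assumption: the chunk map lives in the same per-agent folder.
        agent_dir / "_chunk_map.json",
    ]
    return [str(path) for path in expected if not path.is_file()]
```

An empty list means the three files were found. Any entries returned point at the directory variable that probably needs fixing in .mcp.json.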

.mcp.json is gitignored by default so machine-specific paths don't leak into commits. Restart your MCP client (e.g., Claude Code) to load the server. Once configured, your client can call search_corpus to query any agent's corpus directly from inside the chat interface.

How Grounding Works

The "grounding" comes from three layers working together:

Persona -- the agent YAML's persona.style, expertise, and greeting shape how the LLM communicates
Corpus filter -- the agent's corpus_filter.collections restricts what documents the LLM can see
Retrieval -- relevant chunks from those documents are pulled in via FAISS similarity search and injected as context

The LLM never has access to your entire corpus at once. It sees the agent's persona prompt plus only the chunks most semantically relevant to the current question. This keeps responses focused, lets you scale to large corpora, and gives different agents different "knowledge" from the same underlying document set.
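
The retrieval layer can be pictured with a brute-force stand-in for the FAISS index. The sketch below ranks chunk vectors by cosine similarity in pure Python. It illustrates the idea only: the real pipeline uses FAISS, the similarity metric it is configured with is not stated here, and `top_k` and the toy vectors are hypothetical.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(
    query: list[float], chunk_vectors: dict[str, list[float]], k: int = 5
) -> list[str]:
    """Return the ids of the k chunks most similar to the query vector."""
    scored = sorted(
        ((cosine(query, vec), chunk_id) for chunk_id, vec in chunk_vectors.items()),
        reverse=True,
    )
    return [chunk_id for _, chunk_id in scored[:k]]


# Toy 2-dimensional "embeddings", for illustration only.
vectors = {"chunk-a": [1.0, 0.0], "chunk-b": [0.0, 1.0], "chunk-c": [0.9, 0.1]}
best = top_k([1.0, 0.0], vectors, k=2)
```

Only the ids in `best` would be loaded and injected as context, which is why the LLM sees a handful of chunks rather than the whole corpus.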

Quality

Retrieval changes are gated by an evaluation harness that scores each agent's FAISS index against a hand-curated fixture of query → expected-document pairs. A GitHub Actions workflow runs the harness on every PR that touches retrieval code and fails the check if any aggregate metric drops more than the configured threshold relative to a committed baseline.

See docs/eval/README.md for the fixture schema, CLI usage, CI gate details, and the baseline-refresh procedure.
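
The gate amounts to comparing current metrics against the baseline and failing on any large drop. The sketch below shows that comparison under stated assumptions: the metric names, the 0.02 threshold, and the use of an absolute (rather than relative) drop are all hypothetical, so check docs/eval/README.md for the real definitions.

```python
def regressions(
    baseline: dict[str, float], current: dict[str, float], max_drop: float = 0.02
) -> dict[str, float]:
    """Return metrics whose score fell by more than max_drop from baseline."""
    failed = {}
    for name, base_score in baseline.items():
        # A metric missing from the current run counts as a score of 0.0.
        drop = base_score - current.get(name, 0.0)
        if drop > max_drop:
            failed[name] = drop
    return failed


# Hypothetical metric names and scores.
baseline_scores = {"recall_at_5": 0.80, "mrr": 0.60}
current_scores = {"recall_at_5": 0.79, "mrr": 0.50}
failures = regressions(baseline_scores, current_scores)
```

Here only `mrr` is reported, because its drop exceeds the threshold while the drop in `recall_at_5` stays within it. A CI job would fail the check whenever `failures` is non-empty.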

Requirements

Python 3.10 - 3.13 (3.14+ not yet supported due to unstructured compatibility)
poppler-utils (system package for PDF text extraction)
Ollama (optional, for local RAG queries)

Troubleshooting

grounding: command not found — activate the venv (source venv/bin/activate) or invoke directly: ./venv/bin/grounding.

unstructured install fails on Python 3.14 — unstructured pins python<3.14. Use Python 3.13: python3.13 -m venv venv.

pdftotext: command not found — install poppler: sudo apt install poppler-utils (Linux) or brew install poppler (macOS).

PDFs are moved to skipped/ instead of corpus/ -- they're likely scanned (no extractable text). The watcher skips a PDF when extraction yields fewer than 1000 characters per MB of file size. To force OCR, run grounding directly with --ocr on.

Agent search returns zero results — check embeddings exist: grounding embeddings --agent <name> --corpus ./corpus --check. If stale, regenerate: grounding embeddings --agent <name> --corpus ./corpus --incremental.

Watcher doesn't pick up files on macOS — the shipped staging-watcher.sh uses inotifywait (Linux only). On macOS, run the watcher inside a Linux VM/container, or port it to fswatch.

Ollama queries hang or time out — confirm the model is pulled and ollama serve is running: ollama list and curl http://localhost:11434/api/tags.
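
For the skipped-PDF entry above, the yield threshold of 1000 chars per MB can be read as a simple ratio test. The sketch below is one interpretation, not the watcher's actual code: `should_skip` is a hypothetical name, and treating 1 MB as 1024 * 1024 bytes is an assumption.

```python
def should_skip(
    extracted_chars: int, file_size_bytes: int, min_chars_per_mb: float = 1000.0
) -> bool:
    """Return True when a PDF yields too little text for its size."""
    size_mb = file_size_bytes / (1024 * 1024)  # assumption: binary megabytes
    if size_mb == 0:
        return extracted_chars == 0  # an empty file has nothing to ingest
    return extracted_chars / size_mb < min_chars_per_mb
```

A 5 MB scan that yields 200 characters comes to 40 characters per MB and would be skipped, while a 2 MB text PDF yielding 50,000 characters would be ingested.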

License

MIT

Version 0.3.0, released Apr 17, 2026

