# docforge

*The self-hosted context engine for AI coding assistants.*
Point docforge at your Confluence spaces and local git repositories. It indexes, embeds, and serves them over MCP, so Claude Code, Cursor, Copilot, and any other assistant that speaks MCP can search your team's knowledge without your data leaving your infrastructure.

docforge doesn't replace your AI assistant. It feeds it, turning any MCP-capable assistant into a tool that actually knows your team's docs and code.
## Why docforge
| Tool | Self-hosted | Integration | Confluence + code | Footprint | Complements AI assistants? |
|---|---|---|---|---|---|
| docforge | ✓ | MCP server | ✓ (Confluence + local git) | Minimal (PG + 1 container) | ✓ (any MCP client) |
| Atlassian Rovo MCP | ✗ (Cloud-only) | MCP server | Confluence only (Cloud) | SaaS | ✓ |
| zilliztech/claude-context | ✓ | MCP server | Code only | Minimal | ✓ |
| Onyx | ✓ | MCP + chat UI | ✓ (50+ connectors) | Heavy (Standard) / Minimal (Lite) | ✓ (+ its own UI) |
| Cursor codebase index + @Docs | ✗ | Proprietary | Code + public web docs | SaaS | — (built into Cursor only) |
| Copilot Spaces | ✗ | Proprietary (MCP for actions) | Code + attachments | SaaS | — (built into Copilot only) |
| Sourcegraph Cody | ✓ (Enterprise) | OpenCtx / MCP | ✓ (via OpenCtx) | Heavy (Sourcegraph platform) | — (built into Cody only) |
| LangChain / LlamaIndex DIY | ✓ | Whatever you build | You wire it | Depends | Depends |
docforge is the narrow, focused option in this landscape: it has a minimal footprint, it is MCP-native so it works with every assistant, and it combines Confluence and code out of the box. It doesn't compete on connector count (Onyx wins there), visual UX (Cursor and Cody win), or SaaS convenience (Rovo). It competes on being small, legible, vendor-neutral, and self-hosted: four properties no commercial option offers together.
### ✅ When docforge fits
- You run Confluence Data Center/Server, or you want to self-host.
- Your team uses MCP-capable assistants (Claude Code, Cursor with MCP, Copilot with MCP, etc.).
- You want Confluence + git repos indexed together with one tool.
- Operational simplicity matters — one Postgres, one container, MIT-licensed code you can audit in an afternoon.
### ❌ When docforge is the wrong choice
- You need 50+ connectors (Slack, Jira, Gmail, Drive, Notion) → use Onyx or Glean.
- You need per-document ACLs enforced at query time → not yet supported; use Onyx.
- You need a chat UI for non-developers → docforge has no UI; use Onyx, Glean, or Cody.
- You're on Atlassian Cloud and happy with SaaS → Atlassian Rovo MCP is free and official.
- You need SSO / SCIM / RBAC → out of scope; docforge authenticates but doesn't authorize per-resource.
- Your corpus is very large (>100K pages/chunks) → dense-only retrieval without hybrid starts to degrade; on the roadmap.
- You need near-real-time updates → ingest is batch; no webhook-driven continuous sync yet.
- You need multilingual search evaluated → EmbeddingGemma is multilingual, but docforge has no eval coverage on non-English corpora yet.
## Quick Start
```bash
pip install docforge-cli
docforge init my-project
cd my-project

# Edit docforge.yml with your Confluence URL
# Edit sources.yml with your page IDs and local git repo paths
# Edit .env with your credentials

docker compose up -d db
docforge init-db
docforge ingest
docforge serve
```
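For local development, `.env` needs at least the two variables referenced later in this README. A minimal sketch follows; only `HF_TOKEN` and `DATABASE_URL` are confirmed by the FAQ below, while the Confluence variable names are illustrative placeholders, so check the template `docforge init` generates for the real keys:

```bash
# .env sketch; HF_TOKEN and DATABASE_URL are documented in the FAQ below,
# the Confluence variable names here are illustrative placeholders.
HF_TOKEN=hf_your_token_here
DATABASE_URL=postgresql://docforge:localdev@localhost:5432/docforge
CONFLUENCE_USER=you@example.com
CONFLUENCE_API_TOKEN=your-api-token
```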
> **Note:** The git crawler indexes local filesystem paths; docforge does not clone GitHub URLs. Clone first, then point docforge at the checkout path in `sources.yml`.
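For illustration, a `sources.yml` along these lines pairs Confluence page IDs with checked-out repo paths. The field names are assumptions rather than the documented schema; see `docs/` for the real reference:

```yaml
# sources.yml sketch; field names are assumptions, see docs/ for the real schema
confluence:
  pages:
    - "123456"                        # Confluence page ID
    - "789012"
git_repos:
  - path: ~/code/payments-service     # local checkout; docforge does not clone URLs
  - path: ~/code/platform-docs
```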
## How It Works
- **Configure** your Confluence URL, page IDs, and local git repo paths in `sources.yml`.
- **Ingest** crawls pages and files, chunks text (~500 tokens), and generates 768-dimensional vector embeddings.
- **Serve** exposes an MCP server that AI assistants query automatically.
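Retrieval itself is a dense vector search in Postgres via pgvector. A minimal sketch of that kind of query with asyncpg follows; the table and column names (`chunks`, `embedding`, `content`, `source`) are assumptions for illustration, not docforge's actual schema:

```python
import asyncpg

async def search_chunks(pool: asyncpg.Pool, query_embedding: list[float], k: int = 5):
    """Dense top-k search with pgvector's cosine-distance operator (<=>).

    Sketch only: table and column names are hypothetical, not docforge's schema.
    """
    # pgvector accepts a vector literal like '[0.1,0.2,...]'
    vec = "[" + ",".join(str(x) for x in query_embedding) + "]"
    return await pool.fetch(
        """
        SELECT content, source, embedding <=> $1::vector AS distance
        FROM chunks
        ORDER BY embedding <=> $1::vector
        LIMIT $2
        """,
        vec,
        k,
    )
```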
When an AI assistant needs cross-team context, it calls docforge's `search_documentation` MCP tool behind the scenes and gets relevant documentation chunks with source attribution.
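To wire an assistant up, register the server in the client's MCP configuration. The sketch below follows the common `mcpServers` JSON convention used by clients such as Claude Code and Cursor; confirm the exact file location and key names for your client:

```json
{
  "mcpServers": {
    "docforge": {
      "command": "docforge",
      "args": ["serve"]
    }
  }
}
```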
## Architecture

*Architecture diagram (not reproduced in this rendering).*
## Commands
| Command | Description |
|---|---|
| `docforge init <name>` | Scaffold a new project with config templates |
| `docforge init-db` | Initialize the PostgreSQL database schema |
| `docforge ingest` | Crawl all sources, embed, store in PostgreSQL |
| `docforge search "<query>"` | Test search from terminal |
| `docforge serve` | Run MCP server for AI assistants |
| `docforge serve --api` | Run FastAPI search API (for hosted deployment) |
| `docforge status` | Show index stats and health |
## Deploy to your infrastructure
For team-wide use, deploy the search API to Azure (~$35/month at default SKUs):
- PostgreSQL Flexible Server (Burstable B1ms, 32 GB) with pgvector.
- Container App running the FastAPI search API.
- Container Registry, Key Vault, Log Analytics, managed environment.
- Team members use a lightweight MCP client that calls the hosted API.
See `deploy/azure/` for Bicep templates and a full cost breakdown.
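For illustration, deploying those templates with the Azure CLI might look like the following. The resource-group name and the entry-point file name (`main.bicep`) are assumptions; check `deploy/azure/` for the actual template layout and any documented deploy script:

```bash
az group create --name docforge-rg --location westeurope
az deployment group create \
  --resource-group docforge-rg \
  --template-file deploy/azure/main.bicep   # entry template name is an assumption
```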
## Configuration

See `docs/` for the full configuration reference, including the `docforge.yml` and `sources.yml` schemas.
## Contributing
Contributions welcome. See CONTRIBUTING.md for development setup, branch conventions, and PR expectations. Bug reports and feature requests go through GitHub Issues; open-ended questions and ideas live in Discussions.
## Evaluation & retrieval quality
docforge ships with a retrieval-quality eval harness at `src/docforge/scripts/eval_search.py`. It measures recall@1, recall@k, and MRR against a ground-truth query set you maintain. The harness is designed for drift detection: run it after `sources.yml` changes, embedding-model updates, or ranking tweaks, and compare against your baseline. There is no absolute quality threshold; the metric magnitudes depend on how closely your ground-truth queries match source titles. See `src/docforge/scripts/README.md` for details.
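For intuition: recall@k asks whether the expected document appears in the top k results, and MRR averages the reciprocal rank of the first correct hit. The sketch below computes both from (expected_id, ranked_ids) pairs; it is illustrative, not the harness's actual code:

```python
def recall_at_k(results: list[tuple[str, list[str]]], k: int) -> float:
    """Fraction of queries whose expected doc id appears in the top-k ranked ids."""
    hits = sum(1 for expected, ranked in results if expected in ranked[:k])
    return hits / len(results)

def mrr(results: list[tuple[str, list[str]]]) -> float:
    """Mean reciprocal rank of the first correct hit (0 if it never appears)."""
    total = 0.0
    for expected, ranked in results:
        if expected in ranked:
            total += 1.0 / (ranked.index(expected) + 1)
    return total / len(results)

# Example: one query ranks the expected doc 2nd, the other ranks it 1st.
data = [("doc-a", ["doc-x", "doc-a", "doc-y"]), ("doc-b", ["doc-b", "doc-z"])]
print(recall_at_k(data, 1))  # 0.5
print(mrr(data))             # (1/2 + 1/1) / 2 = 0.75
```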
## FAQ
The three install-time issues new users hit most often are covered inline below. The full FAQ, including "no results found", "ingest skipped everything", removing sources, swapping embedding models, and where to file issues, lives on the microsite FAQ.
"HF_TOKEN required" or model download fails
The embedding model google/embeddinggemma-300m requires a Hugging Face token with access to the gated model. Create one at https://huggingface.co/settings/tokens, accept the model license at https://huggingface.co/google/embeddinggemma-300m, and set HF_TOKEN=hf_... in .env.
### First ingest / first container start is very slow

The first run downloads the 300M-parameter embedding model (~1.2 GB) from Hugging Face. Locally, the model is cached at `~/.cache/huggingface/`. In the Docker image, it is cached at `/app/.cache/huggingface/`; mount this as a volume so container restarts do not re-download: `docker run -v docforge-hf-cache:/app/.cache/huggingface ...`.
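If you run docforge via Docker Compose, the same cache can be declared as a named volume. This is a sketch only; the service and image names are placeholders, so adapt it to the compose file `docforge init` generates:

```yaml
# docker-compose.yml excerpt; sketch only, service and image names are placeholders
services:
  docforge:
    image: docforge:latest
    volumes:
      - docforge-hf-cache:/app/.cache/huggingface   # persist the embedding-model download
volumes:
  docforge-hf-cache:
```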
"Cannot connect to PostgreSQL"
Check that the database is running: docker compose up -d db. Verify DATABASE_URL in .env points to postgresql://docforge:localdev@localhost:5432/docforge (or your custom value).
## License

MIT. See `LICENSE`.
## Credits
docforge stands on open shoulders:
- EmbeddingGemma-300M — open-weights embedding model under the Gemma license.
- pgvector — vector similarity for Postgres.
- FastMCP — MCP server framework.
- FastAPI, Typer, asyncpg, sentence-transformers — core infrastructure.