
Forge searchable context from Confluence and git repos for AI coding assistants

Project description

docforge

The self-hosted context engine for AI coding assistants.

Point docforge at your Confluence spaces and local git repositories. It indexes, embeds, and serves them over MCP — so Claude Code, Cursor, Copilot, and any assistant that speaks MCP can search your team's knowledge without your data leaving your infrastructure.

docforge doesn't replace your AI assistant. It feeds it, turning any assistant that speaks MCP into one that actually knows your team's docs and code.

Badges: CI · PyPI · Python 3.12+ · License: MIT · Ruff

Why docforge

| Tool | Self-hosted | Integration | Confluence + code | Footprint | Complements AI assistants? |
| --- | --- | --- | --- | --- | --- |
| docforge | ✓ | MCP server | ✓ (Confluence + local git) | Minimal (PG + 1 container) | ✓ (any MCP client) |
| Atlassian Rovo MCP | ✗ (Cloud-only) | MCP server | Confluence only (Cloud) | SaaS | ✓ |
| zilliztech/claude-context | ✓ | MCP server | Code only | Minimal | ✓ |
| Onyx | ✓ | MCP + chat UI | ✓ (50+ connectors) | Heavy (Standard) / Minimal (Lite) | ✓ (+ its own UI) |
| Cursor codebase index + @Docs | ✗ | Proprietary | Code + public web docs | SaaS | — (built into Cursor only) |
| Copilot Spaces | ✗ | Proprietary (MCP for actions) | Code + attachments | SaaS | — (built into Copilot only) |
| Sourcegraph Cody | ✓ (Enterprise) | OpenCtx / MCP | ✓ (via OpenCtx) | Heavy (Sourcegraph platform) | — (built into Cody only) |
| LangChain / LlamaIndex DIY | ✓ | Whatever you build | You wire it | Depends | Depends |

docforge is the narrow, focused option in this landscape: minimal footprint, MCP-native so it works with every assistant, and Confluence + code combined out of the box. It doesn't compete on connector count (Onyx wins there), visual UX (Cursor and Cody win), or SaaS convenience (Rovo). It competes on being small, legible, vendor-neutral, and self-hosted — four properties no commercial option offers together.

✅ When docforge fits

  • You run Confluence Data Center/Server, or you want to self-host.
  • Your team uses MCP-capable assistants (Claude Code, Cursor with MCP, Copilot with MCP, etc.).
  • You want Confluence + git repos indexed together with one tool.
  • Operational simplicity matters — one Postgres, one container, MIT-licensed code you can audit in an afternoon.

❌ When docforge is the wrong choice

  • You need 50+ connectors (Slack, Jira, Gmail, Drive, Notion) → use Onyx or Glean.
  • You need per-document ACLs enforced at query time → not yet supported; use Onyx.
  • You need a chat UI for non-developers → docforge has no UI; use Onyx, Glean, or Cody.
  • You're on Atlassian Cloud and happy with SaaS → Atlassian Rovo MCP is free and official.
  • You need SSO / SCIM / RBAC → out of scope; docforge authenticates but doesn't authorize per-resource.
  • Your corpus is very large (>100K pages/chunks) → dense-only retrieval starts to degrade without hybrid search; hybrid retrieval is on the roadmap.
  • You need near-real-time updates → ingest is batch; no webhook-driven continuous sync yet.
  • You need multilingual search evaluated → Qwen3-Embedding-4B is multilingual, but docforge has no eval coverage on non-English corpora yet.

For the full trust model, accepted risks, and assumptions docforge makes about its operating environment, see docs/threat-model.md.

Quick Start

Prerequisites:

  • Python 3.12+
  • Docker (for the local Postgres + pgvector container)
  • A Hugging Face token (for private/gated models; not required for Qwen3-Embedding-4B which is Apache 2.0 and publicly accessible).
pip install docforge-cli
docforge init my-project
cd my-project
# Edit docforge.yml with your Confluence URL
# Edit sources.yml with your page IDs and local git repo paths
# Edit .env with your credentials (CONFLUENCE_API_TOKEN, HF_TOKEN, DATABASE_URL)
docker compose up -d db
docforge init-db
docforge ingest
docforge serve

Note: The git crawler indexes local filesystem paths — docforge does not clone GitHub URLs. Clone first, then point docforge at the checkout path in sources.yml.
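
A minimal sources.yml could look like the sketch below. The field names are illustrative assumptions, not the authoritative schema; use the template that docforge init scaffolds as the source of truth:

# Illustrative sketch only; field names are assumptions, check the scaffolded template.
confluence:
  pages:
    - "123456"              # Confluence page ID
    - "789012"
git_repos:
  - path: ~/src/backend     # local checkout; docforge does not clone URLs
  - path: ~/src/infra-docs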

How It Works

  1. Configure your Confluence URL, page IDs, and local git repo paths in sources.yml.
  2. Ingest crawls pages and files, chunks text into ~500-token segments, and generates 1024-dim vector embeddings.
  3. Serve exposes an MCP server that AI assistants query automatically.

When an AI assistant needs cross-team context, it calls docforge's search_documentation MCP tool behind the scenes and gets relevant documentation chunks with source attribution.
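
For illustration, the tool call and a returned chunk might be shaped roughly like this; the argument and result field names are assumptions, not the exact wire contract:

{ "name": "search_documentation",
  "arguments": { "query": "How do we rotate the Confluence API token?" } }

{ "results": [ { "text": "Rotate the token in the service-account settings ...",
                 "source": "Confluence: Platform Ops / Runbooks",
                 "score": 0.87 } ] }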

Architecture

Architecture diagram: Confluence and local git repos flow through docforge ingest into Postgres with pgvector; docforge serve then exposes an MCP server consumed by Claude Code, Cursor, and Copilot.

Commands

| Command | Description |
| --- | --- |
| docforge init <name> | Scaffold a new project with config templates |
| docforge init-db | Initialize the PostgreSQL database schema |
| docforge ingest | Crawl all sources, embed, store in PostgreSQL |
| docforge search "<query>" | Test search from the terminal |
| docforge serve | Run the MCP server for AI assistants |
| docforge serve --api | Run the FastAPI search API (for hosted deployment) |
| docforge status | Show index stats and health |

Deploy to your infrastructure

For team-wide use, deploy the search API to Azure (~€900/month at default SKUs with the Qwen3-Embedding-4B GPU embedder on a workload-profile environment):

  • PostgreSQL Flexible Server (Burstable B1ms, 32 GB) with pgvector.
  • Container App running the FastAPI search API.
  • Container App running the embedder service (Qwen3-Embedding-4B, model baked into the image) on a GPU workload profile (NC8as_T4).
  • Container Registry (Standard), Key Vault, Log Analytics, managed environment.
  • Team members use a lightweight MCP client that calls the hosted API.

See deploy/azure/ for Bicep templates and a full cost breakdown.
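
In practice this is usually a single resource-group deployment of the Bicep entry point. A sketch, assuming the template and parameter file names below (check deploy/azure/ for the actual ones):

# File names are assumptions; see deploy/azure/ for the real entry point.
az group create --name docforge-rg --location westeurope
az deployment group create \
  --resource-group docforge-rg \
  --template-file deploy/azure/main.bicep \
  --parameters @deploy/azure/main.parameters.json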

Use a hosted instance (no local DB required)

If your team already operates a docforge deployment and you only want to use it from your editor (Claude Code, etc.), you don't need to clone, ingest, or run Postgres locally:

# Generic (no auth)
pip install docforge-cli
claude mcp add -s user -e DOCFORGE_API_URL=https://docforge.example.com \
  docforge -- docforge serve --remote-api $DOCFORGE_API_URL

# Static Bearer token
pip install docforge-cli
claude mcp add -s user \
  -e DOCFORGE_API_URL=https://docforge.example.com \
  -e DOCFORGE_API_TOKEN=eyJ... \
  -e DOCFORGE_AUTH=bearer \
  docforge -- docforge serve --remote-api $DOCFORGE_API_URL --auth bearer

# Entra (Azure AD)
pip install docforge-cli[azure]
az login --tenant <your-tenant-id>
claude mcp add -s user \
  -e DOCFORGE_API_URL=https://docforge.example.com \
  -e DOCFORGE_AUDIENCE=api://<app-registration-uri> \
  -e DOCFORGE_AUTH=azure \
  -e DOCFORGE_TEAM=your-team \
  docforge -- docforge serve --remote-api $DOCFORGE_API_URL --auth azure

With --auth azure, user_name is bound to your Entra JWT subject — you can't (and don't need to) configure it.

DOCFORGE_TEAM is optional but recommended for team-tag relevance boosting in search results.

Self-hosting / forking

The embedder image bakes the Qwen3-Embedding-4B model at build time. The model is Apache 2.0 and publicly accessible — no Hugging Face gate. Forks and adopters need to:

  1. Optionally get an HF token at https://huggingface.co/settings/tokens (not required for Qwen3-Embedding-4B, but needed if you swap to a gated model).
  2. Add a repo secret HF_TOKEN under Settings → Secrets and variables → Actions if you use a gated model.

The CI workflow forwards the secret to BuildKit via --mount=type=secret,id=hf_token; the token never enters any image layer. If you fork this repo and run the CI workflow, it will build the embedder image automatically on commits to master and PRs (without pushing unless on master). To enable pushes to a registry, also add secrets ACR_LOGIN_SERVER, ACR_USERNAME, and ACR_PASSWORD.
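
The mechanism looks roughly like the following Dockerfile fragment. This is a sketch of the BuildKit secret pattern, not the actual contents of Dockerfile.embedder:

# Sketch of the BuildKit secret pattern; not the real Dockerfile.embedder.
ARG EMBEDDING_MODEL=Qwen/Qwen3-Embedding-4B
# The secret is mounted only for this RUN step and never written to a layer.
RUN --mount=type=secret,id=hf_token \
    HF_TOKEN="$(cat /run/secrets/hf_token 2>/dev/null || true)" \
    python -c "import os; from huggingface_hub import snapshot_download; \
snapshot_download(os.environ['EMBEDDING_MODEL'], cache_dir='/app/.cache/huggingface')"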

Upgrading the embedding model

The dimension-mismatch guard in RemoteEmbedder makes an embedder/search API mismatch loud (HTTP 503 with a clear log line) rather than silent. Upgrade procedure:

  1. Pick the new model. Note its output dimensionality D (e.g. 1024 for Qwen3-Embedding-4B, 768 for many older models).

  2. Update config. Set embedding_model: <new> and embedding_dimensions: D in the search API's deployment config (Bicep parameters + Key Vault, or docforge.yml for self-hosters).

  3. Build the embedder image with the new model:

    docker build \
      --build-arg EMBEDDING_MODEL=<new> \
      --secret id=hf_token,env=HF_TOKEN \
      -f Dockerfile.embedder \
      -t docforge-embedder:<tag> .
    
  4. Apply schema migration. Add a new vector column:

    ALTER TABLE chunks ADD COLUMN embedding_new vector(D);
    

    Re-ingest to populate the new column. Until backfill completes, the search API serves from the old column.

  5. Cut over. Deploy the new embedder image first, then the new search API. The dim-mismatch guard ensures search refuses to serve wrong-dim vectors.

  6. Drop the old column once you are confident the new embeddings are healthy (see the SQL sketch below).
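
The cleanup in step 6 might look like this, assuming the live column is named embedding and the search API has been switched to read the renamed column:

-- Sketch; verify column names against your actual schema first.
BEGIN;
ALTER TABLE chunks DROP COLUMN embedding;
ALTER TABLE chunks RENAME COLUMN embedding_new TO embedding;
COMMIT;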

Configuration

See docs/ for the full configuration reference, including docforge.yml and sources.yml schemas.

Contributing

Contributions welcome. See CONTRIBUTING.md for development setup, branch conventions, and PR expectations. Bug reports and feature requests go through GitHub Issues; open-ended questions and ideas live in Discussions.

Evaluation & retrieval quality

docforge ships with a retrieval-quality eval harness at src/docforge/scripts/eval_search.py. It measures recall@1, recall@k, and MRR against a ground-truth query set you maintain. The harness is designed for drift detection — run it after sources.yml changes, embedding-model updates, or ranking tweaks, and compare against your baseline. There is no absolute quality threshold; the metric magnitude depends on how closely your ground-truth queries match source titles. See src/docforge/scripts/README.md for details.
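
A ground-truth set is a list of query-to-expected-source pairs. A hypothetical example of what such a file could contain (the real format expected by eval_search.py is documented in the scripts README):

# Hypothetical format; see src/docforge/scripts/README.md for the real schema.
- query: "How do I rotate the Confluence API token?"
  expected_source: "Platform Ops / Runbooks / Token rotation"
- query: "payment service retry policy"
  expected_source: "payments-service/docs/retries.md"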

FAQ

The three install-time issues new users hit most often are inline below. The full FAQ — including "no results found", "ingest skipped everything", removing sources, swapping embedding models, and where to file issues — lives on the microsite FAQ.

"HF_TOKEN required" or model download fails

The default embedding model Qwen/Qwen3-Embedding-4B is Apache 2.0 and publicly accessible — no Hugging Face token required. If you have swapped to a gated model, create a token at https://huggingface.co/settings/tokens, accept the model license on the model page, and set HF_TOKEN=hf_... in .env.

First ingest / first container start is very slow

The first run downloads the Qwen3-Embedding-4B model (~10 GB) from Hugging Face. Locally, the model is cached at ~/.cache/huggingface/. In the Docker image it is cached at /app/.cache/huggingface/; mount this as a volume so container restarts do not re-download: docker run -v docforge-hf-cache:/app/.cache/huggingface .... In the GPU-backed hosted deployment the model loads into VRAM in 2-3 minutes; the API runs with minReplicas: 2, so there is no scale-to-zero cold start in normal operation.

"Cannot connect to PostgreSQL"

Check that the database is running: docker compose up -d db. Verify DATABASE_URL in .env points to postgresql://docforge:localdev@localhost:5432/docforge (or your custom value).

License

MIT. See LICENSE.

License compatibility

docforge is MIT-licensed; the default embedding model, Qwen3-Embedding-4B, is distributed under the Apache 2.0 license — fully permissive, no usage restrictions. Swap to a different model via embedding_model in docforge.yml if needed (see microsite FAQ — Can I use a different embedding model?).
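
Concretely, a model swap is a two-key config change plus re-ingest. A sketch using the keys named in the upgrade section above; the alternative model shown is only an example:

# Sketch; any model swap also requires re-ingesting so stored vectors match.
embedding_model: intfloat/multilingual-e5-large
embedding_dimensions: 1024   # must match the model's output dimensionality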

Credits

docforge stands on open shoulders.

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

docforge_cli-0.7.2.tar.gz (51.2 kB)


Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

docforge_cli-0.7.2-py3-none-any.whl (56.8 kB)


File details

Details for the file docforge_cli-0.7.2.tar.gz.

File metadata

  • Download URL: docforge_cli-0.7.2.tar.gz
  • Upload date:
  • Size: 51.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for docforge_cli-0.7.2.tar.gz
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | 13dae2a98a5e459da6d115badd14a34cacb9ee3feed0d0ec5f4767ab01486a0b |
| MD5 | df53afbe3df2b5d03f6f547805c2fe92 |
| BLAKE2b-256 | 18bd9225a7f4eb94de1aa017c3cd22c2e6c86c00fbd0a1b184fdc66ab59e210e |


Provenance

The following attestation bundles were made for docforge_cli-0.7.2.tar.gz:

Publisher: release.yml on GranatenUdo/docforge

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file docforge_cli-0.7.2-py3-none-any.whl.

File metadata

  • Download URL: docforge_cli-0.7.2-py3-none-any.whl
  • Upload date:
  • Size: 56.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for docforge_cli-0.7.2-py3-none-any.whl
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | e08c7cf7114158ff97a3c818bde8569c316a04da2a9bbef3e02839284e9149a0 |
| MD5 | e689cdc10d9c6a295cb0bfbf80165415 |
| BLAKE2b-256 | f79bbfdaa71c27e5f5ca51ac9be38b403971595813141d5edafac09e872935f6 |


Provenance

The following attestation bundles were made for docforge_cli-0.7.2-py3-none-any.whl:

Publisher: release.yml on GranatenUdo/docforge

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
