Semantic fact indexer over markdown knowledge bases — extracts atomic facts, entities, and authority scores, then exposes search via MCP
mcp-fact-finder
Semantic fact indexer over markdown knowledge bases.
Crawls markdown documents, extracts atomic facts using local NLP (free, no API key) or an optional LLM enrichment pass, stores facts in LanceDB with authority scoring and entity relationships, and exposes structured search via the Model Context Protocol.
Why local inference?
Pass 1 (rebel-large NLP) is completely free.
No API keys, no rate limits, no cost. Runs locally on CPU or Apple Silicon MPS (Metal Performance Shaders). A full 7,600-document knowledge base indexes in ~5 hours at $0.
Pass 2 (LLM enrichment) is optional. It adds canonical fact forms, tense classification, and better entity linking. Use a local Ollama instance (free) or OpenRouter (~$5–9 for 7,600 docs).
Recommendation: Run Pass 1 first. For most knowledge bases, Pass 1 alone is sufficient for useful search results.
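To size a run for a corpus other than the 7,600-document reference, the figures quoted above can be scaled. The sketch below assumes time and cost grow linearly with document count, which is an assumption and not a measured property of the tool. The function name `estimate` is hypothetical.

```python
# Reference figures quoted in this README for a 7,600-document corpus.
REFERENCE_DOCS = 7_600
PASS1_HOURS = 5.0             # "~5 hours" for Pass 1 (local rebel-large)
PASS2_COST_USD = (5.0, 9.0)   # "~$5-9" for Pass 2 via OpenRouter


def estimate(num_docs: int) -> dict[str, float]:
    """Scale the reference figures linearly to num_docs documents.

    Linear scaling is an assumption; real throughput depends on document
    length and hardware.
    """
    scale = num_docs / REFERENCE_DOCS
    return {
        "pass1_hours": PASS1_HOURS * scale,
        "pass2_usd_low": PASS2_COST_USD[0] * scale,
        "pass2_usd_high": PASS2_COST_USD[1] * scale,
    }


if __name__ == "__main__":
    for n in (100, 1_000, 7_600):
        e = estimate(n)
        print(
            f"{n:>6} docs: ~{e['pass1_hours']:.2f} h for Pass 1, "
            f"~${e['pass2_usd_low']:.2f}-{e['pass2_usd_high']:.2f} for Pass 2 (OpenRouter)"
        )
```

Treat the output as a rough planning number only; a `--limit 100` smoke test on your own documents gives a better rate.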
Status: v0.0.3
| Component | Status |
|---|---|
| Data models | ✅ Complete |
| Markdown crawler | ✅ Complete |
| Authority scoring | ✅ Complete |
| rebel-large extractor | ✅ Complete |
| Ollama extractor | ✅ Complete |
| OpenRouter/Bedrock | ✅ Complete |
| Embedder (bge-small) | ✅ Complete |
| LanceDB store | ✅ Complete |
| MCP server (6 tools) | ✅ Complete |
| Inconsistency checker | ✅ Complete |
| Indexing pipeline | ✅ Complete |
| Setup script | ✅ Complete |
Quick start
1. Install
pip install mcp-fact-finder
# or
uv add mcp-fact-finder
Or clone and develop locally:
git clone https://github.com/bobmatnyc/mcp-fact-finder
cd mcp-fact-finder
uv sync
2. Index your knowledge base (Pass 1 — free, no API key needed)
# Index a directory of markdown files
uv run scripts/index.py --path ~/my-docs/
# First 100 docs only (quick smoke test)
uv run scripts/index.py --path ~/my-docs/ --limit 100
# Show current stats
uv run scripts/index.py --path ~/my-docs/ --stats
The index is stored in {path}/.mcp-fact-finder/db — co-located with your docs,
trivial to delete and rebuild. Incremental re-runs only process new or changed files
(content-hash based — moving a file does not trigger reindexing).
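The content-hash behaviour can be illustrated with a minimal sketch. This is not the project's crawler: the function names, the choice of SHA-256, and the in-memory `seen_hashes` set are all assumptions made for illustration. The point is only that identity comes from file bytes, not from the path.

```python
import hashlib
from pathlib import Path


def content_hash(path: Path) -> str:
    """Identify a document by the SHA-256 of its bytes (hash choice is an assumption)."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def files_to_index(root: Path, seen_hashes: set[str]) -> list[Path]:
    """Return markdown files under root whose content has not been indexed yet."""
    pending = []
    for path in sorted(root.rglob("*.md")):
        # Skip anything inside the index directory itself.
        if ".mcp-fact-finder" in path.parts:
            continue
        if content_hash(path) not in seen_hashes:
            pending.append(path)
    return pending
```

Because the lookup key is the digest, renaming or moving a file leaves it out of the pending list, while editing a single character puts it back in.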
3. Set up Claude Code MCP integration
# In the project you want to query:
cd ~/my-project
uv run --project ~/Projects/mcp-fact-finder scripts/setup.py
The script creates .mcp.json (the Claude Code MCP config) and
.claude/skills/mcp-fact-finder.md (which explains when and how to use each tool),
then adds .mcp-fact-finder/ to .gitignore. Restart Claude Code to activate the integration.
4. Optional: LLM enrichment (Pass 2)
# Via local Ollama (free, no API key):
FACT_FINDER_INFERENCE_BACKEND=ollama \
uv run scripts/index.py --path ~/my-docs/ --pass 2 --force
# Via OpenRouter (~$5-9 for 7,600 docs):
FACT_FINDER_OPENROUTER_API_KEY=sk-... \
uv run scripts/index.py --path ~/my-docs/ --pass 2 --force --workers 150
MCP Tools
Six tools are exposed to Claude Code once the index is built and the MCP server is configured:
| Tool | Description |
|---|---|
| search_facts | Natural language semantic + keyword search, authority-ranked |
| get_document_facts | All facts extracted from one source document |
| compare_sources | How different docs describe the same topic |
| get_entity_facts | All facts about a named entity (person, system, concept) |
| check_inconsistencies | Contradictions and conflicts for an entity |
| get_conflict_report | Detailed explanation of a specific pair inconsistency |
Example queries
"What was decided about the Pricerator architecture?"
→ search_facts(query="Pricerator architecture", fact_type="decided", min_authority=0.8)
"What do all documents say about latency?"
→ compare_sources(topic="P90 latency")
"Everything known about Thomas Evans"
→ get_entity_facts(entity="Thomas Evans")
"Are there contradictions about the GCP migration?"
→ check_inconsistencies(entity="GCP migration")
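Claude Code issues these calls for you, but it can help to see what travels over the stdio transport. MCP tool invocations are JSON-RPC 2.0 requests with the method `tools/call`. The sketch below builds such a request for the first example; the helper name `tool_call_request` is hypothetical, and the argument names are taken from the example queries above rather than from a published schema.

```python
import json
from typing import Any


def tool_call_request(request_id: int, tool: str, **arguments: Any) -> str:
    """Serialize an MCP tools/call request as a JSON-RPC 2.0 message."""
    payload = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }
    return json.dumps(payload)


if __name__ == "__main__":
    # Mirrors: search_facts(query="Pricerator architecture", fact_type="decided", min_authority=0.8)
    print(
        tool_call_request(
            1,
            "search_facts",
            query="Pricerator architecture",
            fact_type="decided",
            min_authority=0.8,
        )
    )
```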
Authority scoring
Facts are ranked by source document type × recency × LLM confidence:
| Score | Document type |
|---|---|
| 1.0 | Technical specs, RFCs, ADRs, TDDs |
| 0.8 | Product requirements (PRDs) |
| 0.5 | Meeting notes, retros, sprint notes |
| 0.3 | Summaries, digests, inferred content |
Set min_authority to 0.8 to restrict results to high-authority sources.
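A minimal sketch of how such a ranking could be computed follows. Only the tier values and the "type × recency × confidence" product come from this README. The exponential recency decay, its one-year half-life, the tier key names, and the choice to apply `min_authority` to the document tier are all assumptions for illustration, not the project's implementation.

```python
from dataclasses import dataclass

# Tier values from the table above; the key names are hypothetical.
TIER_SCORES = {"spec": 1.0, "prd": 0.8, "meeting_notes": 0.5, "summary": 0.3}


@dataclass(frozen=True)
class ScoredFact:
    text: str
    doc_type: str      # one of the TIER_SCORES keys
    age_days: float    # age of the source document
    confidence: float  # extractor confidence in [0, 1]


def recency_weight(age_days: float, half_life_days: float = 365.0) -> float:
    """Exponential decay: the weight halves every half_life_days (assumed form)."""
    return 0.5 ** (age_days / half_life_days)


def rank_score(fact: ScoredFact) -> float:
    """Document tier x recency x confidence, as described above."""
    return TIER_SCORES[fact.doc_type] * recency_weight(fact.age_days) * fact.confidence


def search(facts: list[ScoredFact], min_authority: float = 0.0) -> list[ScoredFact]:
    """Drop facts from tiers below min_authority, then sort best-first."""
    eligible = [f for f in facts if TIER_SCORES[f.doc_type] >= min_authority]
    return sorted(eligible, key=rank_score, reverse=True)
```

Under these assumptions, `search(facts, min_authority=0.8)` keeps only facts from specs and PRDs, and a fresh PRD fact can outrank a stale spec fact.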
Fact types
| Type | Pattern | Example |
|---|---|---|
| is | X is / has Y | "Pricerator is the pricing engine" |
| said | X stated / believes | "Thomas said the migration was overdue" |
| happened | X occurred (past) | "GCP migration completed Q3 2024" |
| planned | X will / intends | "Team plans to deprecate the monolith" |
| decided | It was decided | "Team decided to use LanceDB" |
| metric | X = N (quantified) | "P90 latency is 340ms" |
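When building the `fact_type` argument in client code, a closed set of values is convenient. The sketch below models the six types from the table as an enum; the class and function names are hypothetical and are not taken from the project's data models.

```python
from enum import Enum


class FactType(str, Enum):
    """The six fact types listed in the table above."""

    IS = "is"
    SAID = "said"
    HAPPENED = "happened"
    PLANNED = "planned"
    DECIDED = "decided"
    METRIC = "metric"


def parse_fact_type(value: str) -> FactType:
    """Normalize user input to a FactType, with a readable error for unknown values."""
    try:
        return FactType(value.strip().lower())
    except ValueError:
        allowed = ", ".join(t.value for t in FactType)
        raise ValueError(
            f"unknown fact_type {value!r}; expected one of: {allowed}"
        ) from None
```

Validating on the client side gives a clear error before a query is sent, instead of an empty result set.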
Architecture
Markdown files
│
▼
Crawler (discover_documents)
• Confluence / YAML frontmatter parsing
• Authority tier from filename keywords
• Content-hash document identity (stable across moves)
│
▼
Pass 1: rebel-large NLP [FREE — local CPU/MPS]
• Relation extraction (SPO triples)
• ~20 docs/min on Apple M-series
│ (optional)
▼
Pass 2: LLM enrichment [Ollama local = free | OpenRouter ~$5–9]
• Canonical fact forms + tense + fact_type
• Better entity linking
│
▼
LanceDB (vector + FTS hybrid search)
• facts, entities, documents, inconsistencies tables
│
▼
MCP Server (stdio transport, Claude Code)
• 6 structured search tools
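The Pass 1 step produces subject-predicate-object (SPO) triples. As a rough illustration, the sketch below parses the linearized text that rebel-large generates, assuming the marker layout described on the model's public card: `<triplet> subject <subj> object <obj> relation`. The function name and the output keys are hypothetical, and this is not the project's extractor; check the format against the model card before relying on it.

```python
def parse_rebel_output(decoded: str) -> list[dict[str, str]]:
    """Turn REBEL's linearized output into a list of SPO triples."""
    triples: list[dict[str, str]] = []
    subject: list[str] = []
    obj: list[str] = []
    relation: list[str] = []
    current = None  # which field the following tokens belong to

    def flush() -> None:
        # Emit a triple only when all three parts have been collected.
        if subject and obj and relation:
            triples.append(
                {
                    "subject": " ".join(subject),
                    "predicate": " ".join(relation),
                    "object": " ".join(obj),
                }
            )

    cleaned = decoded.replace("<s>", "").replace("</s>", "").replace("<pad>", "")
    for token in cleaned.split():
        if token == "<triplet>":    # a new subject starts
            flush()
            subject, obj, relation = [], [], []
            current = subject
        elif token == "<subj>":     # an object for the current subject starts
            flush()
            obj, relation = [], []
            current = obj
        elif token == "<obj>":      # the relation label starts
            relation = []
            current = relation
        elif current is not None:
            current.append(token)
    flush()
    return triples
```

A subject followed by several `<subj> ... <obj> ...` groups yields one triple per group, which is how one sentence can contribute multiple facts about the same entity.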
Environment variables
| Variable | Default | Description |
|---|---|---|
| FACT_FINDER_DB_PATH | {corpus}/.mcp-fact-finder/db | LanceDB path |
| FACT_FINDER_INFERENCE_BACKEND | rebel | One of rebel, ollama, openrouter, bedrock |
| FACT_FINDER_OPENROUTER_API_KEY | (none) | OpenRouter key (Pass 2) |
| FACT_FINDER_OPENROUTER_MODEL_ID | google/gemini-2.0-flash-lite-001 | Pass 2 model |
| FACT_FINDER_BEDROCK_REGION | us-east-1 | AWS Bedrock region |
Copy .env.example to .env and fill in as needed.
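For scripts that wrap the indexer, the variables above can be resolved with their documented defaults as in the sketch below. The `Settings` class and `load_settings` function are hypothetical names, and the package's own configuration loader may behave differently (for example, in how it reads `.env`).

```python
import os
from dataclasses import dataclass
from pathlib import Path
from typing import Mapping, Optional

VALID_BACKENDS = ("rebel", "ollama", "openrouter", "bedrock")


@dataclass(frozen=True)
class Settings:
    db_path: Path
    inference_backend: str
    openrouter_api_key: Optional[str]
    openrouter_model_id: str
    bedrock_region: str


def load_settings(corpus: Path, env: Mapping[str, str] = os.environ) -> Settings:
    """Read FACT_FINDER_* variables, falling back to the defaults in the table."""
    backend = env.get("FACT_FINDER_INFERENCE_BACKEND", "rebel")
    if backend not in VALID_BACKENDS:
        raise ValueError(f"unsupported backend {backend!r}; expected one of {VALID_BACKENDS}")
    default_db = corpus / ".mcp-fact-finder" / "db"
    return Settings(
        db_path=Path(env.get("FACT_FINDER_DB_PATH", str(default_db))),
        inference_backend=backend,
        openrouter_api_key=env.get("FACT_FINDER_OPENROUTER_API_KEY"),
        openrouter_model_id=env.get(
            "FACT_FINDER_OPENROUTER_MODEL_ID", "google/gemini-2.0-flash-lite-001"
        ),
        bedrock_region=env.get("FACT_FINDER_BEDROCK_REGION", "us-east-1"),
    )
```

Passing `env` explicitly, as in `load_settings(Path("~/my-docs").expanduser(), {"FACT_FINDER_INFERENCE_BACKEND": "ollama"})`, keeps the function easy to test without touching the real environment.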
Development
uv run pytest # run tests
uv run ruff check src/ # lint
uv run mypy src/ # type check
make build # build wheel + sdist
make publish # bump patch, publish to PyPI, tag, push
make publish-minor # bump minor version
See docs/architecture.md for the full design document.
License
Proprietary. See LICENSE.