Semantic fact indexer over markdown knowledge bases — extracts atomic facts, entities, and authority scores, then exposes search via MCP

mcp-fact-finder

Semantic fact indexer over markdown knowledge bases.

Crawls markdown documents, extracts atomic facts using local NLP (free, no API key) or an optional LLM enrichment pass, stores facts in LanceDB with authority scoring and entity relationships, and exposes structured search via the Model Context Protocol.


Why local inference?

Pass 1 (rebel-large NLP) is completely free.

No API keys, no rate limits, no cost. Runs locally on CPU or Apple Silicon MPS (Metal Performance Shaders). A full 7,600-document knowledge base indexes in ~5 hours at $0.

Pass 2 (LLM enrichment) is optional. It adds canonical fact forms, tense classification, and better entity linking. Use a local Ollama instance (free) or OpenRouter (~$5–9 for 7,600 docs).

Recommendation: Run Pass 1 first. For most knowledge bases, Pass 1 alone is sufficient for useful search results.


Status: v0.0.3

| Component | Status |
|---|---|
| Data models | ✅ Complete |
| Markdown crawler | ✅ Complete |
| Authority scoring | ✅ Complete |
| rebel-large extractor | ✅ Complete |
| Ollama extractor | ✅ Complete |
| OpenRouter/Bedrock | ✅ Complete |
| Embedder (bge-small) | ✅ Complete |
| LanceDB store | ✅ Complete |
| MCP server (6 tools) | ✅ Complete |
| Inconsistency checker | ✅ Complete |
| Indexing pipeline | ✅ Complete |
| Setup script | ✅ Complete |

Quick start

1. Install

pip install mcp-fact-finder
# or
uv add mcp-fact-finder

Or clone and develop locally:

git clone https://github.com/bobmatnyc/mcp-fact-finder
cd mcp-fact-finder
uv sync

2. Index your knowledge base (Pass 1 — free, no API key needed)

# Index a directory of markdown files
uv run scripts/index.py --path ~/my-docs/

# First 100 docs only (quick smoke test)
uv run scripts/index.py --path ~/my-docs/ --limit 100

# Show current stats
uv run scripts/index.py --path ~/my-docs/ --stats

The index is stored in {path}/.mcp-fact-finder/db — co-located with your docs, trivial to delete and rebuild. Incremental re-runs only process new or changed files (content-hash based — moving a file does not trigger reindexing).
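
The content-hash behaviour described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the names `content_hash` and `plan_reindex` are hypothetical, and SHA-256 is an assumed choice of hash.

```python
import hashlib


def content_hash(text: str) -> str:
    """Return a hex SHA-256 digest of the document body (assumed hash function)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_reindex(docs: dict[str, str], seen_hashes: set[str]) -> list[str]:
    """Return the paths whose content hash is not already in the index.

    `docs` maps a file path to its text. The path is never hashed, so a
    moved file with unchanged content is skipped.
    """
    return [
        path
        for path, text in sorted(docs.items())
        if content_hash(text) not in seen_hashes
    ]


# One document was indexed earlier under some other path.
already_indexed = {content_hash("# Pricing\nPricerator is the pricing engine.")}

docs = {
    "moved/pricing.md": "# Pricing\nPricerator is the pricing engine.",  # moved only
    "notes/retro.md": "# Retro\nNew content.",                           # new file
}
pending = plan_reindex(docs, already_indexed)
print(pending)
```

Because identity comes from the content alone, renaming or reorganising directories costs nothing on the next run, while any edit to a file produces a new hash and queues it again.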

3. Set up Claude Code MCP integration

# In the project you want to query:
cd ~/my-project
uv run --project ~/Projects/mcp-fact-finder scripts/setup.py

The setup script creates .mcp.json (the Claude Code MCP config) and .claude/skills/mcp-fact-finder.md (which explains when and how to use each tool), then adds .mcp-fact-finder/ to .gitignore. Restart Claude Code to activate the integration.

4. Optional: LLM enrichment (Pass 2)

# Via local Ollama (free, no API key):
FACT_FINDER_INFERENCE_BACKEND=ollama \
uv run scripts/index.py --path ~/my-docs/ --pass 2 --force

# Via OpenRouter (~$5-9 for 7,600 docs):
FACT_FINDER_OPENROUTER_API_KEY=sk-... \
uv run scripts/index.py --path ~/my-docs/ --pass 2 --force --workers 150

MCP Tools

Six tools are exposed to Claude Code once the MCP server is running:

| Tool | Description |
|---|---|
| search_facts | Natural language semantic + keyword search, authority-ranked |
| get_document_facts | All facts extracted from one source document |
| compare_sources | How different docs describe the same topic |
| get_entity_facts | All facts about a named entity (person, system, concept) |
| check_inconsistencies | Contradictions and conflicts for an entity |
| get_conflict_report | Detailed explanation of the inconsistency between a specific pair of facts |

Example queries

"What was decided about the Pricerator architecture?"
→ search_facts(query="Pricerator architecture", fact_type="decided", min_authority=0.8)

"What do all documents say about latency?"
→ compare_sources(topic="P90 latency")

"Everything known about Thomas Evans"
→ get_entity_facts(entity="Thomas Evans")

"Are there contradictions about the GCP migration?"
→ check_inconsistencies(entity="GCP migration")
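
Claude Code issues these calls itself, but it can help to see what one looks like on the wire. MCP carries tool invocations as JSON-RPC 2.0 `tools/call` requests; the sketch below builds one for the first example. The helper name `tool_call_request` is hypothetical, and the argument names are taken from the example above rather than from the server's published schema, so treat them as assumptions.

```python
import json


def tool_call_request(request_id: int, tool: str, arguments: dict) -> str:
    """Serialize an MCP `tools/call` request as a JSON-RPC 2.0 message."""
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }
    return json.dumps(message)


# Argument names mirror the search_facts example above (assumed, not verified).
request = tool_call_request(
    1,
    "search_facts",
    {
        "query": "Pricerator architecture",
        "fact_type": "decided",
        "min_authority": 0.8,
    },
)
print(request)
```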

Authority scoring

Facts are ranked by source document type × recency × LLM confidence:

| Score | Document type |
|---|---|
| 1.0 | Technical specs, RFCs, ADRs, TDDs |
| 0.8 | Product requirements (PRDs) |
| 0.5 | Meeting notes, retros, sprint notes |
| 0.3 | Summaries, digests, inferred content |

Use min_authority=0.8 to restrict results to high-authority facts (0.8 is the base score for PRDs; specs score 1.0).
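
As a rough illustration of the ranking formula, the sketch below multiplies a base tier score by recency and confidence factors. The tier scores come from the table; everything else is assumed for the example: the keyword lists, the fallback tier, the function names, and the convention that recency and confidence are already normalised to the range 0 to 1.

```python
import re

# Base scores copied from the table above.
TIER_SCORES = {"spec": 1.0, "prd": 0.8, "meeting": 0.5, "summary": 0.3}

# Hypothetical filename keywords; the project's real keyword list may differ.
TIER_KEYWORDS = {
    "spec": {"spec", "rfc", "adr", "tdd"},
    "prd": {"prd", "requirements"},
    "meeting": {"meeting", "retro", "sprint"},
    "summary": {"summary", "digest"},
}


def tier_for(filename: str) -> str:
    """Pick a tier by matching whole filename tokens against the keyword sets."""
    tokens = set(re.split(r"[^a-z0-9]+", filename.lower()))
    for tier, keywords in TIER_KEYWORDS.items():
        if tokens & keywords:
            return tier
    return "summary"  # assumed fallback: unknown documents get the lowest tier


def authority(filename: str, recency: float, confidence: float) -> float:
    """Combine tier, recency, and confidence (both assumed to lie in [0, 1])."""
    return TIER_SCORES[tier_for(filename)] * recency * confidence


print(tier_for("adr-012-storage.md"), authority("adr-012-storage.md", 1.0, 1.0))
print(tier_for("2024-05-sprint-retro.md"), authority("rfc-7.md", 0.5, 0.5))
```

Matching on whole tokens rather than substrings avoids accidental hits such as "spec" inside "retrospective".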


Fact types

| Type | Pattern | Example |
|---|---|---|
| is | X is / has Y | "Pricerator is the pricing engine" |
| said | X stated / believes | "Thomas said the migration was overdue" |
| happened | X occurred (past) | "GCP migration completed Q3 2024" |
| planned | X will / intends | "Team plans to deprecate the monolith" |
| decided | It was decided | "Team decided to use LanceDB" |
| metric | X = N (quantified) | "P90 latency is 340ms" |
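
One way to picture these types in code is as an enum attached to each stored fact. The sketch below is illustrative only: the `FactType`, `Fact`, and `filter_facts` names and the field layout are assumptions, not the project's data models.

```python
from dataclasses import dataclass
from enum import Enum


class FactType(str, Enum):
    """The six fact types listed in the table above."""

    IS = "is"
    SAID = "said"
    HAPPENED = "happened"
    PLANNED = "planned"
    DECIDED = "decided"
    METRIC = "metric"


@dataclass(frozen=True)
class Fact:
    """Hypothetical minimal fact record."""

    text: str
    fact_type: FactType
    authority: float


def filter_facts(facts, fact_type, min_authority=0.0):
    """Keep facts of one type whose authority meets the threshold."""
    return [
        f for f in facts
        if f.fact_type is fact_type and f.authority >= min_authority
    ]


facts = [
    Fact("Team decided to use LanceDB", FactType.DECIDED, 1.0),
    Fact("Team decided to keep the monolith", FactType.DECIDED, 0.5),
    Fact("P90 latency is 340ms", FactType.METRIC, 0.8),
]
decisions = filter_facts(facts, FactType.DECIDED, min_authority=0.8)
print([f.text for f in decisions])
```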

Architecture

Markdown files
    │
    ▼
Crawler (discover_documents)
  • Confluence / YAML frontmatter parsing
  • Authority tier from filename keywords
  • Content-hash document identity (stable across moves)
    │
    ▼
Pass 1: rebel-large NLP  [FREE — local CPU/MPS]
  • Relation extraction (SPO triples)
  • ~20 docs/min on Apple M-series
    │   (optional)
    ▼
Pass 2: LLM enrichment  [Ollama local = free | OpenRouter ~$5–9]
  • Canonical fact forms + tense + fact_type
  • Better entity linking
    │
    ▼
LanceDB  (vector + FTS hybrid search)
  • facts, entities, documents, inconsistencies tables
    │
    ▼
MCP Server  (stdio transport, Claude Code)
  • 6 structured search tools
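
To make the Pass 1 step more concrete, here is a sketch of turning REBEL-style linearized output into subject-predicate-object triples. It assumes that "rebel-large" refers to the publicly available Babelscape REBEL model and that its decoded output uses the `<triplet> head <subj> tail <obj> relation` layout shown on that model's card. The function name is hypothetical, and the project's own extractor may work differently.

```python
def parse_rebel_output(text: str) -> list[dict[str, str]]:
    """Parse REBEL-style linearized text into SPO triples (assumed format)."""
    # Drop sequence markers, then make sure the structural markers are
    # whitespace-separated so a plain split() isolates them.
    for marker in ("<s>", "</s>", "<pad>"):
        text = text.replace(marker, " ")
    for marker in ("<triplet>", "<subj>", "<obj>"):
        text = text.replace(marker, f" {marker} ")

    triples: list[dict[str, str]] = []
    head: list[str] = []
    tail: list[str] = []
    relation: list[str] = []
    state = None

    def emit() -> None:
        # Only complete triples are recorded.
        if head and tail and relation:
            triples.append({
                "subject": " ".join(head),
                "relation": " ".join(relation),
                "object": " ".join(tail),
            })

    for token in text.split():
        if token == "<triplet>":      # a new head entity starts
            emit()
            head, tail, relation = [], [], []
            state = "head"
        elif token == "<subj>":       # a new tail for the current head
            emit()
            tail, relation = [], []
            state = "tail"
        elif token == "<obj>":        # the relation label follows
            relation = []
            state = "relation"
        elif state == "head":
            head.append(token)
        elif state == "tail":
            tail.append(token)
        elif state == "relation":
            relation.append(token)
    emit()
    return triples


sample = "<s><triplet> Pricerator <subj> pricing engine <obj> instance of</s>"
print(parse_rebel_output(sample))
```

Each triple can then be embedded and written to the facts table, with Pass 2 optionally rewriting it into a canonical sentence.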

Environment variables

| Variable | Default | Description |
|---|---|---|
| FACT_FINDER_DB_PATH | {corpus}/.mcp-fact-finder/db | LanceDB path |
| FACT_FINDER_INFERENCE_BACKEND | rebel | rebel \| ollama \| openrouter \| bedrock |
| FACT_FINDER_OPENROUTER_API_KEY | | OpenRouter key (Pass 2) |
| FACT_FINDER_OPENROUTER_MODEL_ID | google/gemini-2.0-flash-lite-001 | Pass 2 model |
| FACT_FINDER_BEDROCK_REGION | us-east-1 | AWS Bedrock region |

Copy .env.example to .env and fill in as needed.
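
For readers wiring these variables into their own tooling, the sketch below resolves them with the defaults from the table. The `Settings` class and `load_settings` function are hypothetical names used for illustration; the package's own configuration loader may behave differently, for example by reading .env directly.

```python
import os
from dataclasses import dataclass
from pathlib import Path
from typing import Mapping, Optional

VALID_BACKENDS = {"rebel", "ollama", "openrouter", "bedrock"}


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the variables in the table above."""

    db_path: Path
    inference_backend: str
    openrouter_api_key: Optional[str]
    openrouter_model_id: str
    bedrock_region: str


def load_settings(corpus: Path, env: Optional[Mapping[str, str]] = None) -> Settings:
    """Resolve settings from the environment, falling back to the documented defaults."""
    env = os.environ if env is None else env
    backend = env.get("FACT_FINDER_INFERENCE_BACKEND", "rebel")
    if backend not in VALID_BACKENDS:
        raise ValueError(f"unknown inference backend: {backend!r}")
    default_db = corpus / ".mcp-fact-finder" / "db"
    return Settings(
        db_path=Path(env.get("FACT_FINDER_DB_PATH", str(default_db))),
        inference_backend=backend,
        openrouter_api_key=env.get("FACT_FINDER_OPENROUTER_API_KEY"),
        openrouter_model_id=env.get(
            "FACT_FINDER_OPENROUTER_MODEL_ID", "google/gemini-2.0-flash-lite-001"
        ),
        bedrock_region=env.get("FACT_FINDER_BEDROCK_REGION", "us-east-1"),
    )


# An explicit mapping is passed here so the example does not depend on the shell.
defaults = load_settings(Path("/docs"), env={})
print(defaults.inference_backend, defaults.db_path)
```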


Development

uv run pytest          # run tests
uv run ruff check src/ # lint
uv run mypy src/       # type check

make build             # build wheel + sdist
make publish           # bump patch, publish to PyPI, tag, push
make publish-minor     # bump minor version

See docs/architecture.md for the full design document.


License

Proprietary. See LICENSE.
