Semantic fact indexer over markdown knowledge bases — extracts atomic facts, entities, and authority scores, then exposes search via MCP

mcp-fact-finder

Semantic fact indexer over markdown knowledge bases.

Crawls markdown documents, extracts atomic facts using local NLP (free, no API key) or an optional LLM enrichment pass, stores facts in LanceDB with authority scoring and entity relationships, and exposes structured search via the Model Context Protocol.


Why local inference?

Pass 1 (rebel-large NLP) is completely free.

No API keys, no rate limits, no cost. Runs locally on CPU or Apple Silicon MPS (Metal Performance Shaders). A full 7,600-document knowledge base indexes in ~5 hours at $0.

Pass 2 (LLM enrichment) is optional. It adds canonical fact forms, tense classification, and better entity linking. Use a local Ollama instance (free) or OpenRouter (~$5–9 for 7,600 docs).

Recommendation: Run Pass 1 first. For most knowledge bases, Pass 1 alone is sufficient for useful search results.


Status: v0.0.3

| Component | Status |
|---|---|
| Data models | ✅ Complete |
| Markdown crawler | ✅ Complete |
| Authority scoring | ✅ Complete |
| rebel-large extractor | ✅ Complete |
| Ollama extractor | ✅ Complete |
| OpenRouter/Bedrock | ✅ Complete |
| Embedder (bge-small) | ✅ Complete |
| LanceDB store | ✅ Complete |
| MCP server (6 tools) | ✅ Complete |
| Inconsistency checker | ✅ Complete |
| Indexing pipeline | ✅ Complete |
| Setup script | ✅ Complete |

Quick start

1. Install

pip install mcp-fact-finder
# or
uv add mcp-fact-finder

Or clone and develop locally:

git clone https://github.com/bobmatnyc/mcp-fact-finder
cd mcp-fact-finder
uv sync

2. Index your knowledge base (Pass 1 — free, no API key needed)

# Index a directory of markdown files
uv run scripts/index.py --path ~/my-docs/

# First 100 docs only (quick smoke test)
uv run scripts/index.py --path ~/my-docs/ --limit 100

# Show current stats
uv run scripts/index.py --path ~/my-docs/ --stats

The index is stored in {path}/.mcp-fact-finder/db — co-located with your docs, trivial to delete and rebuild. Incremental re-runs only process new or changed files (content-hash based — moving a file does not trigger reindexing).
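The content-hash check described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the choice of SHA-256 and the names `content_hash` and `files_to_index` are assumptions made for the example.

```python
import hashlib
from pathlib import Path


def content_hash(path: Path) -> str:
    """Return a SHA-256 hex digest of the file's bytes; the path plays no part."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def files_to_index(root: Path, seen_hashes: set[str]) -> list[Path]:
    """List markdown files under root whose content has not been indexed yet."""
    pending = []
    for md_file in sorted(root.rglob("*.md")):
        # A moved or renamed file keeps its hash, so it is skipped here.
        if content_hash(md_file) not in seen_hashes:
            pending.append(md_file)
    return pending
```

Because identity depends only on the bytes, renaming or moving a file leaves it out of the pending list, while editing it puts it back in.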

3. Set up Claude Code MCP integration

# In the project you want to query:
cd ~/my-project
uv run --project ~/Projects/mcp-fact-finder scripts/setup.py

The script creates .mcp.json (the Claude Code MCP config) and .claude/skills/mcp-fact-finder.md (which explains when and how to use each tool), then adds .mcp-fact-finder/ to .gitignore. Restart Claude Code to activate the server.
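For orientation, a project-scoped .mcp.json keeps its servers under a top-level `mcpServers` key. The sketch below shows one way a setup step could add an entry without disturbing servers that are already configured. The helper name `merge_mcp_server` is hypothetical, and the command and arguments in the usage line are placeholders, not the values the real setup script writes.

```python
import json


def merge_mcp_server(existing_json: str, name: str, command: str, args: list[str]) -> str:
    """Add or replace one server entry, leaving any other servers untouched."""
    config = json.loads(existing_json) if existing_json.strip() else {}
    servers = config.setdefault("mcpServers", {})
    servers[name] = {"command": command, "args": args}
    return json.dumps(config, indent=2)


# Placeholder command and args, for illustration only.
new_config = merge_mcp_server("", "mcp-fact-finder", "uv", ["run", "<server entry point>"])
```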

4. Optional: LLM enrichment (Pass 2)

# Via local Ollama (free, no API key):
FACT_FINDER_INFERENCE_BACKEND=ollama \
uv run scripts/index.py --path ~/my-docs/ --pass 2 --force

# Via OpenRouter (~$5-9 for 7,600 docs):
FACT_FINDER_OPENROUTER_API_KEY=sk-... \
uv run scripts/index.py --path ~/my-docs/ --pass 2 --force --workers 150

MCP Tools

Six tools are exposed to Claude Code once the MCP server is running:

| Tool | Description |
|---|---|
| search_facts | Natural-language semantic + keyword search, authority-ranked |
| get_document_facts | All facts extracted from one source document |
| compare_sources | How different docs describe the same topic |
| get_entity_facts | All facts about a named entity (person, system, concept) |
| check_inconsistencies | Contradictions and conflicts for an entity |
| get_conflict_report | Detailed explanation of a specific pairwise inconsistency |

Example queries

"What was decided about the Pricerator architecture?"
→ search_facts(query="Pricerator architecture", fact_type="decided", min_authority=0.8)

"What do all documents say about latency?"
→ compare_sources(topic="P90 latency")

"Everything known about Thomas Evans"
→ get_entity_facts(entity="Thomas Evans")

"Are there contradictions about the GCP migration?"
→ check_inconsistencies(entity="GCP migration")
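Claude Code issues these calls for you, but it can help to see what goes over the wire. MCP tool invocations are JSON-RPC 2.0 requests with the method `tools/call`. The sketch below builds such a message for the first example query; the helper name `tool_call_request` is hypothetical, and the argument names are taken from the examples above.

```python
import json
from itertools import count

_request_ids = count(1)  # JSON-RPC requests need a unique id per call


def tool_call_request(tool: str, **arguments) -> str:
    """Serialize an MCP tools/call request as a JSON-RPC 2.0 message."""
    message = {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }
    return json.dumps(message)


request = tool_call_request(
    "search_facts",
    query="Pricerator architecture",
    fact_type="decided",
    min_authority=0.8,
)
```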

Authority scoring

Facts are ranked by source document type × recency × LLM confidence:

| Score | Document type |
|---|---|
| 1.0 | Technical specs, RFCs, ADRs, TDDs |
| 0.8 | Product requirements (PRDs) |
| 0.5 | Meeting notes, retros, sprint notes |
| 0.3 | Summaries, digests, inferred content |

Use min_authority=0.8 to restrict results to high-authority sources.
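As a rough illustration of how the three factors could combine, the sketch below multiplies a document-type tier by a recency factor and a confidence value. Only the tier scores come from the table above. The filename keywords, the one-year half-life decay, the default tier for unmatched files, and all function names are assumptions made for the example.

```python
import re
from datetime import date

# Tier scores from the table; the keyword sets are illustrative guesses.
TIERS = [
    (1.0, {"spec", "rfc", "adr", "tdd"}),
    (0.8, {"prd", "requirements"}),
    (0.5, {"meeting", "retro", "sprint"}),
    (0.3, {"summary", "digest"}),
]
DEFAULT_TIER = 0.3  # assumption: unrecognized documents get the lowest tier


def document_tier(filename: str) -> float:
    """Pick a tier by matching whole filename tokens against the keyword sets."""
    tokens = set(re.split(r"[^a-z0-9]+", filename.lower()))
    for score, keywords in TIERS:
        if tokens & keywords:
            return score
    return DEFAULT_TIER


def recency_factor(modified: date, today: date, half_life_days: float = 365.0) -> float:
    """Assumed decay: the factor halves every half_life_days."""
    age_days = max((today - modified).days, 0)
    return 0.5 ** (age_days / half_life_days)


def authority(filename: str, modified: date, today: date, confidence: float = 1.0) -> float:
    """Document type x recency x confidence, as in the ranking formula above."""
    return document_tier(filename) * recency_factor(modified, today) * confidence
```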


Fact types

| Type | Pattern | Example |
|---|---|---|
| is | X is / has Y | "Pricerator is the pricing engine" |
| said | X stated / believes | "Thomas said the migration was overdue" |
| happened | X occurred (past) | "GCP migration completed Q3 2024" |
| planned | X will / intends | "Team plans to deprecate the monolith" |
| decided | It was decided | "Team decided to use LanceDB" |
| metric | X = N (quantified) | "P90 latency is 340ms" |
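One way to picture these types is as a small enum on a fact record. The sketch below is not the project's data model: the `Fact` fields and the `filter_facts` helper are hypothetical, and the authority values in the sample data are invented for illustration. Only the six type names and the example sentences come from the table.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class FactType(str, Enum):
    IS = "is"
    SAID = "said"
    HAPPENED = "happened"
    PLANNED = "planned"
    DECIDED = "decided"
    METRIC = "metric"


@dataclass(frozen=True)
class Fact:
    text: str
    fact_type: FactType
    authority: float  # illustrative values only


def filter_facts(facts: list[Fact], fact_type: Optional[str] = None,
                 min_authority: float = 0.0) -> list[Fact]:
    """Keep facts of the given type (if any) at or above min_authority."""
    wanted = FactType(fact_type) if fact_type is not None else None
    return [
        f for f in facts
        if (wanted is None or f.fact_type is wanted) and f.authority >= min_authority
    ]


facts = [
    Fact("Team decided to use LanceDB", FactType.DECIDED, 0.9),
    Fact("Team plans to deprecate the monolith", FactType.PLANNED, 0.5),
    Fact("P90 latency is 340ms", FactType.METRIC, 0.8),
]
decisions = filter_facts(facts, fact_type="decided", min_authority=0.8)
```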

Architecture

Markdown files
    │
    ▼
Crawler (discover_documents)
  • Confluence / YAML frontmatter parsing
  • Authority tier from filename keywords
  • Content-hash document identity (stable across moves)
    │
    ▼
Pass 1: rebel-large NLP  [FREE — local CPU/MPS]
  • Relation extraction (SPO triples)
  • ~20 docs/min on Apple M-series
    │   (optional)
    ▼
Pass 2: LLM enrichment  [Ollama local = free | OpenRouter ~$5–9]
  • Canonical fact forms + tense + fact_type
  • Better entity linking
    │
    ▼
LanceDB  (vector + FTS hybrid search)
  • facts, entities, documents, inconsistencies tables
    │
    ▼
MCP Server  (stdio transport, Claude Code)
  • 6 structured search tools
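To make the Pass 1 step more concrete: rebel-large generates a linearized string in which `<triplet>` introduces a subject, `<subj>` introduces an object, and `<obj>` introduces the relation label. The sketch below parses such a string into subject-predicate-object triples. It is a standalone illustration of that output format, written under the assumption that the decoded text keeps the special tokens; it is not the project's extractor, and the function name is hypothetical.

```python
SPECIAL_TOKENS = ("<s>", "</s>", "<pad>")


def parse_rebel_output(decoded: str) -> list[tuple[str, str, str]]:
    """Turn linearized REBEL output into (subject, relation, object) triples."""
    for token in SPECIAL_TOKENS:
        decoded = decoded.replace(token, " ")

    triples: list[tuple[str, str, str]] = []
    head: list[str] = []
    tail: list[str] = []
    relation: list[str] = []
    target = None  # the list that ordinary tokens are currently appended to

    def flush() -> None:
        # Emit a triple only when all three parts have been collected.
        if head and tail and relation:
            triples.append((" ".join(head), " ".join(relation), " ".join(tail)))

    for token in decoded.split():
        if token == "<triplet>":   # a new subject starts
            flush()
            head, tail, relation = [], [], []
            target = head
        elif token == "<subj>":    # a new object for the current subject
            flush()
            tail, relation = [], []
            target = tail
        elif token == "<obj>":     # the relation label follows
            relation = []
            target = relation
        elif target is not None:
            target.append(token)
    flush()
    return triples
```

A single `<triplet>` may be followed by several `<subj> … <obj> …` pairs, which is why the subject is kept until the next `<triplet>` marker.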

Environment variables

| Variable | Default | Description |
|---|---|---|
| FACT_FINDER_DB_PATH | {corpus}/.mcp-fact-finder/db | LanceDB path |
| FACT_FINDER_INFERENCE_BACKEND | rebel | One of rebel, ollama, openrouter, bedrock |
| FACT_FINDER_OPENROUTER_API_KEY | (none) | OpenRouter key (Pass 2) |
| FACT_FINDER_OPENROUTER_MODEL_ID | google/gemini-2.0-flash-lite-001 | Pass 2 model |
| FACT_FINDER_BEDROCK_REGION | us-east-1 | AWS Bedrock region |

Copy .env.example to .env and fill in as needed.
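The defaults above could be resolved along these lines. This is a sketch only: the `Settings` class and `load_settings` function are hypothetical names, and the real package may load its configuration differently (for example, by reading .env directly). The variable names and default values are the ones listed in the table.

```python
import os
from dataclasses import dataclass
from pathlib import Path
from typing import Mapping, Optional

VALID_BACKENDS = ("rebel", "ollama", "openrouter", "bedrock")


@dataclass(frozen=True)
class Settings:
    db_path: Path
    inference_backend: str
    openrouter_api_key: Optional[str]
    openrouter_model_id: str
    bedrock_region: str


def load_settings(corpus: Path, env: Optional[Mapping[str, str]] = None) -> Settings:
    """Read FACT_FINDER_* variables, falling back to the documented defaults."""
    env = os.environ if env is None else env
    backend = env.get("FACT_FINDER_INFERENCE_BACKEND", "rebel")
    if backend not in VALID_BACKENDS:
        raise ValueError(f"unknown inference backend: {backend!r}")
    default_db = corpus / ".mcp-fact-finder" / "db"
    return Settings(
        db_path=Path(env.get("FACT_FINDER_DB_PATH", str(default_db))),
        inference_backend=backend,
        openrouter_api_key=env.get("FACT_FINDER_OPENROUTER_API_KEY"),
        openrouter_model_id=env.get(
            "FACT_FINDER_OPENROUTER_MODEL_ID", "google/gemini-2.0-flash-lite-001"
        ),
        bedrock_region=env.get("FACT_FINDER_BEDROCK_REGION", "us-east-1"),
    )
```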


Development

uv run pytest          # run tests
uv run ruff check src/ # lint
uv run mypy src/       # type check

make build             # build wheel + sdist
make publish           # bump patch, publish to PyPI, tag, push
make publish-minor     # bump minor version

See docs/architecture.md for the full design document.


License

Proprietary. See LICENSE.
