
Medical RAG with Asset-Aware MCP - Precise PDF asset retrieval (tables, figures, sections) for AI Agents

asset-aware-mcp

🏥 Medical RAG with Asset-Aware MCP - Precise PDF asset retrieval (tables, figures, sections) and Knowledge Graph for AI Agents.

🌐 繁體中文 (Traditional Chinese) · Docs Site · GitHub Wiki

🎯 Why Asset-Aware MCP?

An AI agent cannot directly read image files on your computer. Assuming that handing it a file path is enough is a common misconception.

| Method | Can AI analyze image content? | Description |
| --- | --- | --- |
| ❌ Provide PNG path | No | AI cannot access the local file system |
| ✅ Asset-Aware MCP | Yes | Retrieves Base64 via MCP, allowing AI vision to understand the image directly |

Real-world Effect

# After retrieving the image via MCP, the AI can analyze it directly:

User: What is this figure about?

AI: This is the architecture diagram for Scaled Dot-Product Attention:
    1. Inputs: Q (Query), K (Key), V (Value)
    2. MatMul of Q and K
    3. Scale (1/√dₖ)
    4. Optional Mask (for decoder)
    5. SoftMax normalization
    6. Final MatMul with V to get the output

This is the value of Asset-Aware MCP - enabling AI Agents to truly "see" and understand charts and tables in your PDF literature.
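The Base64 hand-off described above can be sketched in a few lines. This is a minimal illustration, not the project's implementation: the function name `to_image_content` is hypothetical, and the dictionary shape (`type`, `data`, `mimeType`) follows the image content convention used by MCP.

```python
import base64


def to_image_content(png_bytes: bytes) -> dict:
    """Wrap raw PNG bytes as an MCP-style image content item (hypothetical helper)."""
    return {
        "type": "image",
        "data": base64.b64encode(png_bytes).decode("ascii"),
        "mimeType": "image/png",
    }


# The 8-byte PNG signature stands in for a real extracted figure.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
item = to_image_content(PNG_SIGNATURE)

# The receiving side can recover the original bytes exactly.
assert base64.b64decode(item["data"]) == PNG_SIGNATURE
```

Because the image travels as text inside the tool response, the agent never needs access to the local file system.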


✨ Features

  • 📄 Asset-Aware ETL - PDF → Markdown with a PyMuPDF-first parser and a retained Marker code path:
    • PyMuPDF (default) - Fast extraction (~50MB)
    • Marker (use_marker=True) - High-precision structured parsing. The code path is retained, but the packaged runtime remains on security hold in v0.7.0 until upstream marker-pdf supports a patched Pillow
  • 🧩 Unified Segmentation Export - Normalized segmentation.json merges manifest, blocks, reading order, and persisted markdown line spans for downstream tools and extensions.
  • 🛡️ PDF Safety/Structure/Coverage/Accessibility Audits - OpenDataloader-inspired artifact-only reports flag suspicious hidden/off-page/prompt-injection text, native structure signals, segmentation coverage gaps, and accessibility/readability readiness via the existing document facade. document(op="prepare_ai") and document(op="auto") expose agent-ready status and next actions without adding public tools.
  • 🧭 Structural Pointer Retrieval - Proxy-Pointer-inspired document(op="pointer_index"), document(op="structural_retrieve"), and document(op="compare") preserve section breadcrumbs, line/char/byte locators, source hashes, asset IDs, and evidence-span provenance without adding MCP tools.
  • 🖼️ Layout Overlay Debugging - Render page overlays from original.pdf to inspect bbox, segment type, and reading order visually.
  • 🔤 On-Demand OCR Preprocessing - Optional ocrmypdf preprocessing path for scanned PDFs before ETL.
  • 🧭 Section Navigation - Dynamic hierarchy section tree through the section facade: browse, search, detail, content reading, and block extraction for any depth of headings.
  • 🔄 Async Job Pipeline - Supports asynchronous ingest, Marker-required parse, OCR, and conversion jobs with progress tracking.
  • 🗺️ Document Manifest - Provides a structured "map" of the document for precise data access by Agents.
  • 🧠 LightRAG Integration - Knowledge Graph + Vector Index, supporting cross-document comparison and reasoning.
  • 🧾 Verified Citation Bundles - citation_bundle, Foam evidence packs, citation health checks, table/figure evidence notes, and claim promotion export citation-ready spans with locator, quote/hash, context, CRAAP scaffold, and verification status.
  • 📝 Docx Editing (DFM) - Edit .docx files in Markdown via Docx-Flavored Markdown format. Supports legacy .doc, .odt, and .ods ingest via LibreOffice auto-conversion. The balanced surface keeps 6 DOCX/DFM public entrypoints for ingest, read, save, validation, conversion, table edit planning, and Docx ↔ A2T bridges.
  • 🛡️ DFM Integrity Checker - Automatic validation and auto-repair at every pipeline stage (post-ingest, pre-save, post-save). Catches orphan markers, column mismatches, and format inconsistencies.
  • 📊 A2T (Anything to Table) - 7 operation-based tools for building professional tables from any source (PDF assets, Knowledge Graph, URLs, user input). Features: stable row IDs, row search/filter/paging, citation coverage, artifact-only large-table render, skipped-large-table UX, Citations (AssetRef), Audit Trail, Schema Evolution, Templates, Drafting, and Token-efficient resumption.
  • 🖥️ VS Code Management Extension - Graphical interface for monitoring server status, ingested documents, document artifacts, citation spans, and A2T tables/drafts with one-click Excel export.
  • 🔌 MCP Server - Exposes tools and resources to Copilot/Claude via FastMCP.
  • 🏥 Medical Research Focus - Optimized for medical literature, supporting Base64 image transmission for Vision AI analysis.
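To make the "line/char/byte locators" and "quote/hash" ideas in the list above concrete, here is a small sketch of how an evidence span could be recorded and later re-verified. The field names and both helper functions are hypothetical and chosen for illustration; they are not the project's actual schema.

```python
import hashlib


def make_span(text: str, start: int, end: int) -> dict:
    """Describe text[start:end] with char, byte, and line locators plus a hash."""
    quote = text[start:end]
    byte_start = len(text[:start].encode("utf-8"))
    byte_end = byte_start + len(quote.encode("utf-8"))
    line_start = text.count("\n", 0, start) + 1  # 1-based line numbers
    line_end = text.count("\n", 0, end) + 1
    return {
        "quote": quote,
        "char_range": (start, end),
        "byte_range": (byte_start, byte_end),
        "line_range": (line_start, line_end),
        "sha256": hashlib.sha256(quote.encode("utf-8")).hexdigest(),
    }


def verify_span(text: str, span: dict) -> bool:
    """Re-check a stored span against the current text."""
    start, end = span["char_range"]
    digest = hashlib.sha256(text[start:end].encode("utf-8")).hexdigest()
    return digest == span["sha256"]


doc = "# Methods\nDose: 5 µg/kg\nn = 42\n"
span = make_span(doc, doc.index("Dose"), doc.index("\nn ="))
assert verify_span(doc, span)
```

Char and byte ranges differ as soon as the text contains non-ASCII characters (here "µ"), which is one reason to store both alongside the hash.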

🏗️ Architecture

Asset-Aware MCP Architecture

┌─────────────────────────────────────────────────────────┐
│                    AI Agent (Copilot)                   │
└─────────────────────┬───────────────────────────────────┘
                      │ MCP Protocol (Tools & Resources)
┌─────────────────────▼───────────────────────────────────┐
│            MCP Server (Modular Presentation)            │
│  ┌───────────────────────────────────────────────────┐  │
│  │ tools/: 30 public tools (balanced surface)        │  │
│  │   17 facade tools + 13 high-frequency shortcuts   │  │
│  │   compact=17 │ legacy/direct compatibility=63     │  │
│  └───────────────────────────────────────────────────┘  │
│  ┌───────────────────────────────────────────────────┐  │
│  │ resources/: 13 resources in 2 modules             │  │
│  └───────────────────────────────────────────────────┘  │
└─────────────────────┬───────────────────────────────────┘
                      │
┌─────────────────────▼───────────────────────────────────┐
│                  ETL Pipeline (DDD)                     │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐               │
│  │ PyMuPDF  │  │  Asset   │  │ LightRAG │               │
│  │ Adapter  │→ │  Parser  │→ │  Index   │               │
│  └──────────┘  └──────────┘  └──────────┘               │
└─────────────────────┬───────────────────────────────────┘
                      │
┌─────────────────────▼───────────────────────────────────┐
│                   Local Storage                         │
│  ./data/                                                │
│  ├── {doc_id}/        # PDF document artifacts          │
│  ├── docx_{id}/       # Docx IR + DFM + Assets          │
│  ├── tables/          # A2T Tables (JSON/MD/XLSX)       │
│  │   └── drafts/      # Table Drafts (Persistence)      │
│  └── lightrag_db/     # Knowledge Graph                 │
└─────────────────────────────────────────────────────────┘
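The Local Storage layout at the bottom of the diagram can be expressed as a small path helper. This is only a sketch: the function `storage_paths` and its dictionary keys are hypothetical, while the directory names are taken from the diagram.

```python
from pathlib import Path


def storage_paths(data_dir: str, doc_id: str, docx_id: str) -> dict:
    """Map the directories shown in the diagram onto concrete paths."""
    root = Path(data_dir)
    return {
        "pdf_artifacts": root / doc_id,              # {doc_id}/
        "docx_artifacts": root / f"docx_{docx_id}",  # docx_{id}/
        "tables": root / "tables",                   # A2T tables
        "table_drafts": root / "tables" / "drafts",  # persisted drafts
        "knowledge_graph": root / "lightrag_db",     # LightRAG store
    }


paths = storage_paths("./data", "doc_example", "example")
```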

📁 Project Structure (DDD)

asset-aware-mcp/
├── src/
│   ├── domain/              # 🔵 Domain: Entities, Value Objects, Interfaces
│   ├── application/         # 🟢 Application: Doc Service, Table Service (A2T), Asset Service
│   ├── infrastructure/      # 🟠 Infrastructure: PyMuPDF, LightRAG, Excel Renderer
│   └── presentation/        # 🔴 Presentation: MCP Server (FastMCP)
├── data/                    # Document and Asset Storage
├── docs/
│   └── spec.md              # Technical Specification
├── tests/                   # Unit and Integration Tests
├── vscode-extension/        # VS Code Management Extension
└── pyproject.toml           # uv Project Config

📐 Architecture Diagrams

Visual overview for the project. All diagrams use consistent GitHub README style.

| Diagram | Description |
| --- | --- |
| 01 - System Architecture | Full stack: Telegram → Gateway → MCP Adapter → 3 MCP servers → Ollama |
| 02 - Data Layout | 30 balanced public tools + 13 resources; legacy direct tool compatibility remains available |
| 03 - PDF Ingestion Pipeline | 7-stage flow from PDF upload to knowledge graph |
| 04 - DOCX Bidirectional Edit | DOCX ingest → TableContext edit → round-trip save workflow |
| 05 - Knowledge Graph Search | Cross-document search with 3 parallel query paths |
| 06 - Installation Steps | 7-step installation from clone to verification |
| 07 - PDF ETL Pipeline | PyMuPDF default path + Marker security-hold diagnostics |
| 08 - KG Architecture | lightrag-hku 3-layer KG architecture |
| 09 - Agent Harness Concept | Assistant harness model for stateless agents |

💡 All generation prompts are saved in docs/diagrams/ALL-PROMPTS.md for style consistency and regeneration.

🚀 Quick Start

# Install dependencies (using uv); the default install skips Marker/torch
uv sync

# v0.7.0: Marker extra is temporarily empty because marker-pdf pins
# Pillow<11 while the secure runtime requires Pillow>=12.2.0.
# Use the default PyMuPDF backend until upstream marker-pdf supports patched Pillow.

# Run MCP Server
uv run python -m src.presentation.server

# Or use the VS Code extension for graphical management

Runtime note: The VS Code extension prefers a managed Python 3.11 runtime when launching the MCP server via version-pinned uv tool run, with Python 3.10 fallback for older machines. This avoids native package builds on end-user machines, especially macOS systems without Xcode Command Line Tools, while keeping the project itself compatible with newer Python versions.

Installation scope note:

  • The VS Code extension installs once per user (global). MCP launch env defaults DATA_DIR to workspace ./data and UV_CACHE_DIR to DATA_DIR/.uv-cache; Prepare Server Runtime warms a workspace .uv-cache, falling back to extension global storage only when no workspace is open.
  • Runtime data stays with your repo: .env and assetAwareMcp.dataDir default to ./data, so ingested assets and the uv cache used by the launched server remain scoped to the current workspace.
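The defaulting rules in the scope note can be sketched as follows. This is an assumption-laden illustration rather than the extension's code: `build_launch_env` is a hypothetical helper, and the fallback to extension global storage when no workspace is open is not modeled.

```python
from pathlib import Path
from typing import Optional


def build_launch_env(workspace: str, overrides: Optional[dict] = None) -> dict:
    """Apply the documented defaults unless the caller already set a value."""
    env = dict(overrides or {})
    # DATA_DIR defaults to <workspace>/data.
    data_dir = env.setdefault("DATA_DIR", str(Path(workspace) / "data"))
    # UV_CACHE_DIR defaults to DATA_DIR/.uv-cache, following any DATA_DIR override.
    env.setdefault("UV_CACHE_DIR", str(Path(data_dir) / ".uv-cache"))
    return env


default_env = build_launch_env("/repo")
custom_env = build_launch_env("/repo", {"DATA_DIR": "/srv/assets"})
```

Deriving the cache location from DATA_DIR keeps both the ingested assets and the uv cache inside the same workspace-scoped directory.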

Marker note: Since v0.6.28 the packaged Marker extra has intentionally stayed on security hold: upstream marker-pdf 1.10.2 requires Pillow<11, while this release pins Pillow>=12.2.0 for patched image-processing security. Default installs use the PyMuPDF backend only. use_marker=True / parse_pdf_structure will report that Marker is unavailable until upstream Marker supports a patched Pillow range.
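The reason for the hold is that the two Pillow constraints cannot be satisfied at the same time. The sketch below checks a few illustrative version strings against both ranges using plain tuple comparison. The candidate versions are examples only, the helper names are hypothetical, and pre-release handling is ignored.

```python
def parse_version(version: str) -> tuple:
    """Turn '12.2.0' into (12, 2, 0) for simple numeric comparison."""
    return tuple(int(part) for part in version.split("."))


def allowed_by_marker(version: str) -> bool:
    # marker-pdf 1.10.2 requires Pillow<11 (from the note above).
    return parse_version(version) < (11,)


def allowed_by_release(version: str) -> bool:
    # This release pins Pillow>=12.2.0 (from the note above).
    return parse_version(version) >= (12, 2, 0)


# Illustrative candidates, not a list of actual Pillow releases.
candidates = ["10.4.0", "11.0.0", "12.2.0", "12.3.1"]
usable = [v for v in candidates if allowed_by_marker(v) and allowed_by_release(v)]

# No version can be both below 11 and at least 12.2.0.
assert usable == []
```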

🔌 MCP Tools

The default runtime surface is balanced: 30 public tools that keep the full document workflow available without overwhelming agents. It is made of 17 operation-based facade tools plus 13 high-frequency shortcuts. Set ASSET_AWARE_MCP_TOOL_SURFACE=compact for the 17 facade-only surface, or ASSET_AWARE_MCP_TOOL_SURFACE=legacy / ASSET_AWARE_MCP_ENABLE_LEGACY_TOOLS=true for the full 63-tool compatibility inventory.

| Area | Balanced public tools |
| --- | --- |
| Documents, assets, evidence, conversion | document, document_asset, evidence, convert_document, ingest_documents, list_documents, parse_pdf_structure, fetch_document_asset, find_evidence_spans, verify_citation_ref, citation_bundle |
| DOCX / DFM | docx, docx_table, ingest_docx, get_docx_content, save_docx, docx_table_edit_plan |
| Sections, jobs, KG, ETL profiles | section, job, get_job_status, list_jobs, knowledge, etl_profile |
| A2T tables | plan_table, table_manage, table_data, table_cite, table_history, table_draft, discover_sources |

See MCP Tools and Tool Consolidation for operation details, shortcut rationale, and legacy direct-tool mapping.
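The surface selection described in this section can be sketched as a small resolver. The environment variable names and tool counts come from the text above; the function itself, the precedence between the two variables, and the fallback for unknown values are assumptions made for illustration.

```python
SURFACE_TOOL_COUNTS = {"balanced": 30, "compact": 17, "legacy": 63}


def resolve_tool_surface(env: dict) -> str:
    """Pick a surface name from environment-style settings (hypothetical helper)."""
    # Assumption: the legacy flag wins when both variables are set.
    if env.get("ASSET_AWARE_MCP_ENABLE_LEGACY_TOOLS", "").strip().lower() == "true":
        return "legacy"
    surface = env.get("ASSET_AWARE_MCP_TOOL_SURFACE", "balanced").strip().lower()
    # Assumption: unknown values fall back to the default balanced surface.
    return surface if surface in SURFACE_TOOL_COUNTS else "balanced"


surface = resolve_tool_surface({"ASSET_AWARE_MCP_TOOL_SURFACE": "compact"})
tool_count = SURFACE_TOOL_COUNTS[surface]
```

In practice you would pass `os.environ` instead of a literal dictionary.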

Agent handoff note:

  • Use document(op="auto", file_paths=[...]) for new PDFs, and document(op="auto", doc_id="...") or document(op="prepare_ai", doc_id="...") for existing documents.
  • document(op="prepare_ai", output_format="json") returns the v2 readiness contract with status, blockers, warnings, capabilities, artifacts, missing_audits, invalid_audits, audit_artifacts, and next_actions.
  • document(op="audit", doc_id="...") reuses current audit artifacts only when they are present and valid; pass refresh=true to rebuild the safety, native-structure, coverage, and accessibility reports.
  • Use document(op="pointer_index"), document(op="structural_retrieve", query="..."), and document(op="compare", doc_b_id="...", criteria="...") when an agent needs section-level structural retrieval or comparison without new public tools.
  • Readiness and job-status artifact discovery are read-only, so status checks do not create document directories.
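An agent consuming the readiness contract might reduce it to a simple plan, as in the sketch below. Only the key names come from the note above. The payload values, the assumption that these fields are lists, and the helper `plan_from_readiness` are all hypothetical.

```python
import json

# Hypothetical payload: keys come from the note above, values are placeholders.
raw = json.dumps({
    "status": "example-status",
    "blockers": ["example blocker"],
    "warnings": [],
    "missing_audits": ["safety", "coverage"],
    "invalid_audits": [],
    "next_actions": ["refresh audits"],
})


def plan_from_readiness(payload: str) -> dict:
    """Summarize a readiness report into what the agent should do next."""
    report = json.loads(payload)
    # Assumption: missing_audits and invalid_audits are lists of audit names.
    to_refresh = report.get("missing_audits", []) + report.get("invalid_audits", [])
    return {
        "can_proceed": not report.get("blockers"),
        "audits_to_refresh": to_refresh,
        "next_actions": report.get("next_actions", []),
    }


plan = plan_from_readiness(raw)
```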

PDF audit caveat: The audit reports are inspired by OpenDataloader-style artifact workflows, but they are not a sanitizer, a PDF/UA certification, or an OpenDataloader compatibility layer. They preserve source artifacts and report conservative diagnostics for review.

🔧 Tech Stack

| Category | Technology |
| --- | --- |
| Language | Python 3.10+ |
| Package Manager | uv (all pip/setup-python removed) |
| ETL | PyMuPDF (fitz); Marker is temporarily on security hold |
| RAG | LightRAG (lightrag-hku) |
| MCP | FastMCP |
| Storage | Local filesystem (JSON/Markdown/PNG) |

📋 Documentation

Installation guidance:

  • Default install: uv sync (slim ~227 MB; no LightRAG/KG dependencies).

  • LightRAG / Knowledge Graph backend (optional, since v0.6.34): uv tool install --upgrade --python 3.11 'asset-aware-mcp[lightrag]' for uvx/published users, or uv sync --extra lightrag for local source checkouts. Required before setting ENABLE_LIGHTRAG=true.

  • VS Code extension: run the command Asset-Aware MCP: Install LightRAG Backend from the Command Palette; it auto-detects source vs published mode and emits the matching install command.

  • OpenRouter optional preset (since v0.6.35): set LLM_BACKEND=openrouter, OPENROUTER_API_KEY=..., and optionally OPENROUTER_MODEL=liquid/lfm-2.5-1.2b-instruct:free for fast low-cost summaries and draft RAG answers. LightRAG retrieval still uses the configured embedding backend.

  • Marker backend: temporarily disabled in v0.7.0 because marker-pdf pins vulnerable Pillow<11; the marker / pdf extras are compatibility placeholders until upstream supports patched Pillow.

  • VS Code extension: assetAwareMcp.enableMarkerBackend is retained as a setting, but the launcher will not install marker-pdf while the security hold is active.

  • Technical Spec - Detailed technical specification

  • Architecture - System architecture

  • Constitution - Project principles

  • Competitive Analysis - MCP + DOCX ecosystem landscape

📄 License

Apache License 2.0
