Medical RAG with Asset-Aware MCP - Precise PDF asset retrieval (tables, figures, sections) for AI Agents

asset-aware-mcp

๐Ÿฅ Medical RAG with Asset-Aware MCP - Precise PDF asset retrieval (tables, figures, sections) and Knowledge Graph for AI Agents.

🌐 繁體中文 (Traditional Chinese)

🎯 Why Asset-Aware MCP?

A common misconception is that an AI can read image files on your computer when given a file path. It cannot: the image content has to be delivered to the model itself.

| Method | Can AI analyze image content? | Description |
|--------|-------------------------------|-------------|
| ❌ Provide PNG path | No | AI cannot access the local file system |
| ✅ Asset-Aware MCP | Yes | Retrieves Base64 via MCP, allowing AI vision to understand directly |

Real-world Effect

# After retrieving the image via MCP, the AI can analyze it directly:

User: What is this figure about?

AI: This is the architecture diagram for Scaled Dot-Product Attention:
    1. Inputs: Q (Query), K (Key), V (Value)
    2. MatMul of Q and K
    3. Scale (1/√dₖ)
    4. Optional Mask (for decoder)
    5. SoftMax normalization
    6. Final MatMul with V to get the output

This is the value of Asset-Aware MCP - enabling AI Agents to truly "see" and understand charts and tables in your PDF literature.
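
To make the Base64 path concrete, here is a minimal sketch using only the Python standard library. The helper names are illustrative, and the payload layout is an assumption modeled on MCP image content rather than this server's confirmed response schema. The point is only that image bytes travel as Base64 text and decode back to the identical bytes on the receiving side.

```python
import base64

# First 8 bytes of every PNG file (the PNG signature).
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def encode_image(raw: bytes) -> dict:
    """Wrap raw image bytes in a Base64 payload (assumed layout)."""
    return {
        "type": "image",
        "mimeType": "image/png",
        "data": base64.b64encode(raw).decode("ascii"),
    }


def decode_image(payload: dict) -> bytes:
    """Recover the original bytes from the Base64 text."""
    return base64.b64decode(payload["data"])


# Placeholder bytes stand in for a real figure extracted from a PDF.
fake_png = PNG_SIGNATURE + b"placeholder pixel data"
payload = encode_image(fake_png)
restored = decode_image(payload)
```

Because the payload is plain text, it can be carried inside a JSON message without any access to the local file system.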


✨ Features

  • 📄 Asset-Aware ETL - PDF → Markdown with dual-engine PDF parsing:
    • PyMuPDF (default) - Fast extraction (~50MB)
    • Marker (optional, use_marker=True) - High-precision structured parsing with blocks.json (bbox/coordinates)
  • 🧩 Unified Segmentation Export - Normalized segmentation.json merges manifest, blocks, reading order, and persisted markdown line spans for downstream tools and extensions.
  • 🖼️ Layout Overlay Debugging - Render page overlays from original.pdf to inspect bbox, segment type, and reading order visually.
  • 🔤 On-Demand OCR Preprocessing - Optional ocrmypdf preprocessing path for scanned PDFs before ETL.
  • 🧭 Section Navigation - Dynamic hierarchy section tree with 5 tools: browse, search, detail, content reading, and block extraction for any depth of headings.
  • 🔄 Async Job Pipeline - Supports asynchronous task processing and progress tracking for large documents.
  • 🗺️ Document Manifest - Provides a structured "map" of the document for precise data access by Agents.
  • 🧠 LightRAG Integration - Knowledge Graph + Vector Index, supporting cross-document comparison and reasoning.
  • 🧾 Citation-Aware KG Output - consult_knowledge_graph now supports structured answer/reference payloads for downstream agent workflows.
  • 📝 Docx Editing (DFM) - Edit .docx files in Markdown via Docx-Flavored Markdown format. Supports legacy .doc, .odt, and .ods ingest via LibreOffice auto-conversion. 14 tools: ingest, read, save, list, delete, export, strict round-trip validation, DOCX→PDF, DOCX→DOC, DOCX→ODT, and Docx ↔ A2T bridges.
  • 🛡️ DFM Integrity Checker - Automatic validation and auto-repair at every pipeline stage (post-ingest, pre-save, post-save). Catches orphan markers, column mismatches, and format inconsistencies.
  • 📊 A2T (Anything to Table) - 7 operation-based tools for building professional tables from any source (PDF assets, Knowledge Graph, URLs, user input). Features: Citations (AssetRef), Audit Trail, Schema Evolution, Templates, Drafting, and Token-efficient resumption.
  • 🖥️ VS Code Management Extension - Graphical interface for monitoring server status, ingested documents, and A2T tables/drafts with one-click Excel export.
  • 🔌 MCP Server - Exposes tools and resources to Copilot/Claude via FastMCP.
  • 🏥 Medical Research Focus - Optimized for medical literature, supporting Base64 image transmission for Vision AI analysis.

๐Ÿ—๏ธ Architecture

Asset-Aware MCP Architecture

┌─────────────────────────────────────────────────────────┐
│                    AI Agent (Copilot)                   │
└─────────────────────┬───────────────────────────────────┘
                      │ MCP Protocol (Tools & Resources)
┌─────────────────────▼───────────────────────────────────┐
│            MCP Server (Modular Presentation)            │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ tools/: 48 tools in 7 modules                       │ │
│ │   document (11) │ docx (14) │ section (5)           │ │
│ │   job (3) │ knowledge (2) │ table (7) │ profile (5) │ │
│ └─────────────────────────────────────────────────────┘ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ resources/: 13 resources in 2 modules               │ │
│ └─────────────────────────────────────────────────────┘ │
└─────────────────────┬───────────────────────────────────┘
                      │
┌─────────────────────▼───────────────────────────────────┐
│                  ETL Pipeline (DDD)                     │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐               │
│  │ PyMuPDF  │  │  Asset   │  │ LightRAG │               │
│  │ Adapter  │→ │  Parser  │→ │  Index   │               │
│  └──────────┘  └──────────┘  └──────────┘               │
└─────────────────────┬───────────────────────────────────┘
                      │
┌─────────────────────▼───────────────────────────────────┐
│                   Local Storage                         │
│  ./data/                                                │
│  ├── doc_{id}/        # Document Assets                 │
│  ├── docx_{id}/       # Docx IR + DFM + Assets          │
│  ├── tables/          # A2T Tables (JSON/MD/XLSX)       │
│  │   └── drafts/      # Table Drafts (Persistence)      │
│  └── lightrag_db/     # Knowledge Graph                 │
└─────────────────────────────────────────────────────────┘
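
The Local Storage layer of the diagram can be expressed as a small path helper. This is a sketch under assumptions: the function name `storage_paths` is hypothetical and not part of the project; only the directory names come from the diagram.

```python
from pathlib import Path


def storage_paths(data_dir: str, doc_id: str) -> dict[str, Path]:
    """Map one document ID to the directories shown in the storage diagram."""
    root = Path(data_dir)
    return {
        "document": root / f"doc_{doc_id}",      # Document assets
        "docx": root / f"docx_{doc_id}",         # Docx IR + DFM + assets
        "tables": root / "tables",               # A2T tables
        "drafts": root / "tables" / "drafts",    # Persisted table drafts
        "knowledge_graph": root / "lightrag_db", # LightRAG knowledge graph
    }


paths = storage_paths("data", "abc123")
```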

📁 Project Structure (DDD)

asset-aware-mcp/
├── src/
│   ├── domain/              # 🔵 Domain: Entities, Value Objects, Interfaces
│   ├── application/         # 🟢 Application: Doc Service, Table Service (A2T), Asset Service
│   ├── infrastructure/      # 🟠 Infrastructure: PyMuPDF, LightRAG, Excel Renderer
│   └── presentation/        # 🔴 Presentation: MCP Server (FastMCP)
├── data/                    # Document and Asset Storage
├── docs/
│   └── spec.md              # Technical Specification
├── tests/                   # Unit and Integration Tests
├── vscode-extension/        # VS Code Management Extension
└── pyproject.toml           # uv Project Config

📐 Architecture Diagrams

Visual overview for the project. All diagrams use consistent GitHub README style.

| Diagram | Description |
|---------|-------------|
| 01 - System Architecture | Full stack: Telegram → Gateway → MCP Adapter → 3 MCP servers → Ollama |
| 02 - Data Layout | 48 tools organized in 7 categories with asset-aware data tree |
| 03 - PDF Ingestion Pipeline | 7-stage flow from PDF upload to knowledge graph |
| 04 - DOCX Bidirectional Edit | DOCX ingest → TableContext edit → round-trip save workflow |
| 05 - Knowledge Graph Search | Cross-document search with 3 parallel query paths |
| 06 - Installation Steps | 7-step installation from clone to verification |
| 07 - PDF ETL Pipeline | Dual-engine parsing: PyMuPDF + Marker |
| 08 - KG Architecture | lightrag-hku 3-layer KG architecture |

💡 All generation prompts are saved in docs/diagrams/prompts/README.md for style consistency and regeneration.

🚀 Quick Start

# Install dependencies (using uv); the default install skips Marker/torch
uv sync

# Optional: install Marker backend only if you need structured parsing
uv sync --extra marker

# Run MCP Server
uv run python -m src.presentation.server

# Or use the VS Code extension for graphical management

Runtime note: The VS Code extension prefers a managed Python 3.11 runtime when launching the MCP server via uv or uvx. This avoids native package builds on end-user machines, especially macOS systems without Xcode Command Line Tools, while keeping the project itself compatible with newer Python versions.

Installation scope note:

  • The VS Code extension installs once per user (global). The MCP server launched through uvx asset-aware-mcp reuses the user uv cache rather than reinstalling per workspace.
  • Runtime data stays with your repo: .env and assetAwareMcp.dataDir default to ./data, so ingested assets remain scoped to the current workspace.

Marker note: marker-pdf is now an optional dependency because it may pull in torch, surya, and platform-specific ML wheels. Default installs use the PyMuPDF backend only. Enable Marker only when you need use_marker=True or parse_pdf_structure.
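
Since Marker is optional, calling code may want to check for it before requesting the Marker backend. The sketch below is an illustration, not the project's own logic: it assumes marker-pdf installs a top-level module named `marker`, and the function names and the fallback behavior are hypothetical.

```python
import importlib.util


def marker_available() -> bool:
    """Return True if a top-level module named `marker` is importable."""
    return importlib.util.find_spec("marker") is not None


def choose_backend(use_marker: bool = False) -> str:
    """Pick a parser backend name, falling back to PyMuPDF without Marker."""
    if use_marker and marker_available():
        return "marker"
    return "pymupdf"


backend = choose_backend(use_marker=True)
```

With a default `uv sync` install this should resolve to "pymupdf"; after `uv sync --extra marker` it should resolve to "marker".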

🔌 MCP Tools

Document & Asset Tools

| Tool | Purpose |
|------|---------|
| ingest_documents | Process PDF files with optional Marker backend (use_marker=True for blocks.json) |
| list_documents | List all ingested documents and their asset counts |
| delete_document | Delete an ingested PDF, its local artifacts, and LightRAG index entries when enabled |
| convert_pdf_to_docx | Reconstruct a readable DOCX from extracted PDF content |
| convert_pdf_to_pptx | Rebuild editable PPTX slides from extracted PDF markdown and figures |
| inspect_document_manifest | Inspect document structure before fetching specific assets |
| fetch_document_asset | Precisely retrieve tables (MD) / figures (B64) / sections |
| parse_pdf_structure | Run high-precision Marker parsing and emit structured blocks |
| search_source_location | Search exact source locations with page + bbox for verification |
| export_document_segmentation | Export normalized segmentation.json with reading order + line ranges |
| visualize_document_layout | Render page overlay images for bbox / type / reading-order inspection |
| ocr_pdf_document | Run OCR preprocessing and generate a cleaned PDF for later ETL |
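
MCP clients invoke tools like these through a JSON-RPC 2.0 `tools/call` request. The following sketch builds such a message by hand with the standard library. The tool name and `use_marker` come from the table above, while the `file_paths` argument name and the helper `tool_call` are assumptions for illustration; check the server's tool schema for the real parameter names.

```python
import itertools
import json

# Monotonic request IDs, as JSON-RPC expects one ID per request.
_ids = itertools.count(1)


def tool_call(name: str, arguments: dict) -> str:
    """Serialize an MCP `tools/call` request as a JSON-RPC 2.0 message."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": next(_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    })


# `file_paths` is a hypothetical argument name used only for this example.
request = tool_call(
    "ingest_documents",
    {"file_paths": ["paper.pdf"], "use_marker": True},
)
```

In practice an MCP client library or the agent host builds this message for you; the sketch only shows what crosses the wire.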

Job Management Tools

| Tool | Purpose |
|------|---------|
| get_job_status | Get async ingestion job progress and final result |
| list_jobs | List active or historical ETL jobs |
| cancel_job | Cancel a running ETL job |
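
A typical way to use get_job_status is to poll until the job reaches a terminal state. The sketch below shows that loop with a stand-in callable instead of a real MCP call. The `status` and `progress` field names and the state names are assumptions, not the documented response format.

```python
import time
from typing import Callable


def wait_for_job(
    get_status: Callable[[str], dict],
    job_id: str,
    interval: float = 1.0,
    max_polls: int = 60,
) -> dict:
    """Poll until the job reports a terminal state or polls run out."""
    terminal = {"completed", "failed", "cancelled"}  # assumed state names
    for _ in range(max_polls):
        status = get_status(job_id)
        if status.get("status") in terminal:
            return status
        time.sleep(interval)
    raise TimeoutError(f"job {job_id} did not finish after {max_polls} polls")


# Stand-in for a real get_job_status call, for demonstration only.
_replies = iter([
    {"status": "running", "progress": 40},
    {"status": "completed", "progress": 100},
])
result = wait_for_job(lambda job_id: next(_replies), "job-1", interval=0.0)
```

Bounding the number of polls keeps an agent from waiting forever on a stuck job; cancel_job is the natural follow-up when the timeout fires.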

Knowledge Graph Tools

| Tool | Purpose |
|------|---------|
| consult_knowledge_graph | Citation-aware knowledge graph query with structured, data, and text response modes |
| export_knowledge_graph | Export graph summary / JSON / Mermaid for inspection |

Knowledge graph note:

  • consult_knowledge_graph defaults to response_mode="structured" and can return answer, references, metadata, retrieval, and counts for agent-side citation workflows.
  • Use response_mode="data" when you want retrieval payloads without final answer synthesis, or response_mode="text" for legacy plain-text behavior.
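
As an illustration of an agent-side citation workflow, the sketch below turns a structured payload into an answer with a numbered reference list. The top-level keys `answer` and `references` are named in the note above; the per-reference fields `doc_id` and `page`, the helper name, and the sample values are made up for this example.

```python
def render_answer(payload: dict) -> str:
    """Format an answer followed by a numbered reference list."""
    lines = [payload.get("answer", "")]
    for index, ref in enumerate(payload.get("references", []), start=1):
        # `doc_id` and `page` are hypothetical reference fields.
        lines.append(f"[{index}] {ref.get('doc_id', '?')} p.{ref.get('page', '?')}")
    return "\n".join(lines)


# Illustrative payload only; real responses may be shaped differently.
sample = {
    "answer": "Both documents describe the same primary endpoint.",
    "references": [
        {"doc_id": "doc_a", "page": 3},
        {"doc_id": "doc_b", "page": 7},
    ],
}
text = render_answer(sample)
```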

Section Navigation Tools (Dynamic Hierarchy)

| Tool | Purpose |
|------|---------|
| list_section_tree | Display complete section hierarchy tree (supports any depth) |
| get_section_detail | Get detailed info for a specific section |
| get_section_blocks | Extract all blocks from a section with page + bbox |
| search_sections | Search section titles |
| get_section_content | Read section content via asset service |
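
Supporting headings of any depth usually comes down to a recursive tree walk. The sketch below shows a title search over such a tree. The `Section` class and `find_sections` function are hypothetical stand-ins, not the project's data model.

```python
from collections.abc import Iterator
from dataclasses import dataclass, field


@dataclass
class Section:
    """Hypothetical section node: a title plus nested subsections."""
    title: str
    children: list["Section"] = field(default_factory=list)


def find_sections(
    node: Section, query: str, path: tuple[str, ...] = ()
) -> Iterator[tuple[str, ...]]:
    """Yield the title path of every section whose title contains `query`."""
    here = path + (node.title,)
    if query.lower() in node.title.lower():
        yield here
    for child in node.children:
        yield from find_sections(child, query, here)


tree = Section("Paper", [
    Section("Methods", [Section("Statistical Methods")]),
    Section("Results"),
])
matches = list(find_sections(tree, "methods"))
```

Returning the full title path rather than the bare title lets a caller tell apart identically named subsections under different parents.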

Docx Editing Tools (DFM: Docx-Flavored Markdown)

Edit .docx files as Markdown. Formatting, tables, and media are preserved on round-trip.

| Tool | Purpose |
|------|---------|
| ingest_docx | Import .docx and decompose into DFM blocks |
| get_docx_content | Read DFM content of specific blocks |
| save_docx | Write DFM edits back to .docx |
| list_docx_blocks | List document block structure |
| list_docx_documents | List all ingested DOCX/DFM documents |
| delete_docx | Delete an ingested DOCX/DFM document and its local artifacts |
| convert_docx_to_pdf | Export the current DOCX/DFM state to PDF in fidelity mode |
| convert_docx_to_doc | Export the current DOCX/DFM state to DOC in fidelity mode |
| docx_validate_roundtrip | 6-dimension round-trip fidelity validation + file-level comparison (SHA-256, ZIP diff) |
| docx_table_to_context | Bridge: Docx table → A2T context |
| docx_table_from_context | Bridge: A2T table → Docx table |
| docx_chart_data | Extract chart data from Docx |
| export_markdown | Export Markdown to .docx/.pdf/.doc |
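
The file-level comparison mentioned for docx_validate_roundtrip (SHA-256, ZIP diff) can be pictured as follows. A .docx file is a ZIP archive, so two versions can be compared member by member. This is a standalone sketch of the general technique with hypothetical helper names, not the tool's actual implementation, and it does not cover the 6-dimension fidelity checks.

```python
import hashlib
import io
import zipfile


def sha256_hex(data: bytes) -> str:
    """Hex SHA-256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


def zip_diff(before: bytes, after: bytes) -> dict[str, list[str]]:
    """Compare two ZIP archives member by member using SHA-256."""
    def digests(blob: bytes) -> dict[str, str]:
        with zipfile.ZipFile(io.BytesIO(blob)) as archive:
            return {n: sha256_hex(archive.read(n)) for n in archive.namelist()}

    old, new = digests(before), digests(after)
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "changed": sorted(n for n in old.keys() & new.keys() if old[n] != new[n]),
    }


def make_zip(members: dict[str, str]) -> bytes:
    """Build a small in-memory ZIP, standing in for a real .docx file."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, text in members.items():
            archive.writestr(name, text)
    return buffer.getvalue()


original = make_zip({"word/document.xml": "<w:p>old</w:p>", "word/styles.xml": "<s/>"})
edited = make_zip({"word/document.xml": "<w:p>new</w:p>", "word/styles.xml": "<s/>"})
report = zip_diff(original, edited)
```

Comparing per-member digests rather than the whole-file hash shows which part of the package changed, which is more useful than a single pass/fail when a round-trip is not byte-identical.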

A2T (Anything to Table) Tools: 7 Operation-Based Tools

Agent-friendly design: each tool handles multiple operations via an operation parameter. Tables accept any source: PDF assets, KG entities, external URLs, or user input.

| Tool | Operations | Purpose |
|------|------------|---------|
| plan_table | schema / templates / from_template | Schema planning, browse 4 built-in templates, create from template |
| table_manage | create / delete / list / preview / resume / render / add_column / remove_column / rename_column | Table lifecycle + Schema evolution |
| table_data | add_rows / get_row / update_row / delete_row / get_cell / update_cell / clear_cell | Row & cell CRUD |
| table_cite | add / get / remove / cell_history | Citation management with AssetRef (7 source types) |
| table_history | changes / tokens | Audit trail & token estimation |
| table_draft | create / update / add_rows / resume / commit / list / delete | Draft workflow with persistence |
| discover_sources | - | Cross-document source discovery (sections, tables, figures, KG) |
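
One consequence of the operation-based design is that a caller can validate a request before sending it. The sketch below encodes the operations listed above and builds the arguments for a call. The `build_call` helper and the `table_id` and `column` argument names are hypothetical; only the tool and operation names come from this section.

```python
# Tool name -> allowed values of the `operation` parameter (from the table).
OPERATIONS = {
    "plan_table": {"schema", "templates", "from_template"},
    "table_manage": {
        "create", "delete", "list", "preview", "resume", "render",
        "add_column", "remove_column", "rename_column",
    },
    "table_data": {
        "add_rows", "get_row", "update_row", "delete_row",
        "get_cell", "update_cell", "clear_cell",
    },
    "table_cite": {"add", "get", "remove", "cell_history"},
    "table_history": {"changes", "tokens"},
    "table_draft": {
        "create", "update", "add_rows", "resume", "commit", "list", "delete",
    },
}


def build_call(tool: str, operation: str, **arguments) -> dict:
    """Validate the operation name and return tool-call parameters."""
    allowed = OPERATIONS.get(tool)
    if allowed is None:
        raise ValueError(f"unknown tool: {tool}")
    if operation not in allowed:
        raise ValueError(f"{tool} does not support operation {operation!r}")
    return {"name": tool, "arguments": {"operation": operation, **arguments}}


# `table_id` and `column` are illustrative argument names.
call = build_call("table_manage", "add_column", table_id="t1", column="dose")
```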

ETL Profile Tools

Different journals/formats need different extraction settings. Use these tools to switch profiles.

| Tool | Purpose |
|------|---------|
| list_etl_profiles | List all available profiles (default, arxiv, nature, ieee, elsevier) |
| get_etl_profile | Get detailed configuration of a specific profile |
| get_current_etl_profile | Show currently active profile |
| set_etl_profile | Switch profile for subsequent document ingestion |
| load_etl_profile_from_json | Load custom profile from JSON file |

🔧 Tech Stack

| Category | Technology |
|----------|------------|
| Language | Python 3.10+ |
| Package Manager | uv (all pip/setup-python removed) |
| ETL | PyMuPDF (fitz) + Marker (optional, high-precision) |
| RAG | LightRAG (lightrag-hku) |
| MCP | FastMCP |
| Storage | Local filesystem (JSON/Markdown/PNG) |

📋 Documentation

Installation guidance:

  • Default install: uv sync

  • Install Marker backend only when needed: uv sync --extra marker

  • Safer extension Marker setup: enable Marker backend in settings and keep torchBackend=cpu unless you explicitly need GPU wheels

  • Technical Spec - Detailed technical specification

  • Architecture - System architecture

  • Constitution - Project principles

  • Competitive Analysis - MCP + DOCX ecosystem landscape

📄 License

Apache License 2.0
