asset-aware-mcp

🏥 Medical RAG with Asset-Aware MCP - Precise PDF asset retrieval (tables, figures, sections) and Knowledge Graph for AI Agents.

🎯 Why Asset-Aware MCP?

A common misconception is that an AI model can read image files on your computer directly. It cannot: given only a file path, the model has no access to the local file system.

| Method | Can AI analyze image content? | Description |
|---|---|---|
| ❌ Provide PNG path | No | AI cannot access the local file system |
| ✅ Asset-Aware MCP | Yes | Retrieves Base64 via MCP, allowing AI vision to understand directly |

Real-world Effect

# After retrieving the image via MCP, the AI can analyze it directly:

User: What is this figure about?

AI: This is the architecture diagram for Scaled Dot-Product Attention:
    1. Inputs: Q (Query), K (Key), V (Value)
    2. MatMul of Q and K
    3. Scale (1/√dₖ)
    4. Optional Mask (for decoder)
    5. SoftMax normalization
    6. Final MatMul with V to get the output

This is the value of Asset-Aware MCP - enabling AI Agents to truly "see" and understand charts and tables in your PDF literature.
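
As a rough illustration of the Base64 hand-off described above, the sketch below wraps raw PNG bytes in the image content shape used by the MCP specification (`type`, `data`, `mimeType`). The helper name `to_mcp_image_content` is hypothetical and is not taken from this project's code; the 8-byte PNG signature only stands in for a real extracted figure.

```python
import base64


def to_mcp_image_content(png_bytes: bytes) -> dict:
    """Wrap raw PNG bytes as an MCP-style image content item (hypothetical helper)."""
    return {
        "type": "image",
        # MCP carries binary image data as a Base64 string.
        "data": base64.b64encode(png_bytes).decode("ascii"),
        "mimeType": "image/png",
    }


# The PNG file signature stands in for a figure extracted from a PDF.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
content = to_mcp_image_content(PNG_SIGNATURE)
print(content["type"], content["mimeType"], len(content["data"]))
```

Because the payload travels inside the tool result, the model never needs access to the file path itself.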


✨ Features

  • 📄 Asset-Aware ETL - PDF → Markdown with dual-engine PDF parsing:
    • PyMuPDF (default) - Fast extraction (~50MB)
    • Marker (optional, use_marker=True) - High-precision structured parsing with blocks.json (bbox/coordinates)
  • 🧭 Section Navigation - Dynamic hierarchy section tree with 5 tools: browse, search, detail, content reading, and block extraction for any depth of headings.
  • 🔄 Async Job Pipeline - Supports asynchronous task processing and progress tracking for large documents.
  • 🗺️ Document Manifest - Provides a structured "map" of the document for precise data access by Agents.
  • 🧠 LightRAG Integration - Knowledge Graph + Vector Index, supporting cross-document comparison and reasoning.
  • 📝 Docx Editing (DFM) - Edit .docx files in Markdown via Docx-Flavored Markdown format. Supports legacy .doc files (auto-converts via LibreOffice). 12 tools: ingest, read, save, list, delete, strict round-trip validation, DOCX→PDF, DOCX→DOC, and Docx ↔ A2T bridges.
  • 🛡️ DFM Integrity Checker - Automatic validation and auto-repair at every pipeline stage (post-ingest, pre-save, post-save). Catches orphan markers, column mismatches, and format inconsistencies.
  • 📊 A2T (Anything to Table) - 7 operation-based tools for building professional tables from any source (PDF assets, Knowledge Graph, URLs, user input). Features: Citations (AssetRef), Audit Trail, Schema Evolution, Templates, Drafting, and Token-efficient resumption.
  • 🖥️ VS Code Management Extension - Graphical interface for monitoring server status, ingested documents, and A2T tables/drafts with one-click Excel export.
  • 🔌 MCP Server - Exposes tools and resources to Copilot/Claude via FastMCP.
  • 🏥 Medical Research Focus - Optimized for medical literature, supporting Base64 image transmission for Vision AI analysis.

🏗️ Architecture

┌─────────────────────────────────────────────────────────┐
│                    AI Agent (Copilot)                   │
└─────────────────────┬───────────────────────────────────┘
                      │ MCP Protocol (Tools & Resources)
┌─────────────────────▼───────────────────────────────────┐
│            MCP Server (Modular Presentation)            │
│  ┌──────────────────────────────────────────────────┐   │
│  │ tools/: 42 tools in 7 modules                    │   │
│  │   document (8) │ docx (12) │ section (5)         │   │
│  │   job (3) │ knowledge (2) │ table (7)            │   │
│  │   profile (5)                                    │   │
│  └──────────────────────────────────────────────────┘   │
│  ┌──────────────────────────────────────────────────┐   │
│  │ resources/: 12 resources in 2 modules            │   │
│  └──────────────────────────────────────────────────┘   │
└─────────────────────┬───────────────────────────────────┘
                      │
┌─────────────────────▼───────────────────────────────────┐
│                  ETL Pipeline (DDD)                     │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐               │
│  │ PyMuPDF  │  │  Asset   │  │ LightRAG │               │
│  │ Adapter  │→ │  Parser  │→ │  Index   │               │
│  └──────────┘  └──────────┘  └──────────┘               │
└─────────────────────┬───────────────────────────────────┘
                      │
┌─────────────────────▼───────────────────────────────────┐
│                   Local Storage                         │
│  ./data/                                                │
│  ├── doc_{id}/        # Document Assets                 │
│  ├── docx_{id}/       # Docx IR + DFM + Assets          │
│  ├── tables/          # A2T Tables (JSON/MD/XLSX)       │
│  │   └── drafts/      # Table Drafts (Persistence)      │
│  └── lightrag_db/     # Knowledge Graph                 │
└─────────────────────────────────────────────────────────┘
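
The Local Storage layer in the diagram can be read as a simple path convention. The sketch below mirrors that layout with `pathlib`; the function name `storage_paths` and the dictionary keys are illustrative assumptions, not the project's actual API.

```python
from pathlib import Path

# Root directory as shown in the architecture diagram.
DATA_ROOT = Path("./data")


def storage_paths(doc_id: str, root: Path = DATA_ROOT) -> dict[str, Path]:
    """Map a document id onto the directories named in the diagram (hypothetical helper)."""
    return {
        "pdf_assets": root / f"doc_{doc_id}",        # Document Assets
        "docx": root / f"docx_{doc_id}",             # Docx IR + DFM + Assets
        "tables": root / "tables",                   # A2T Tables (JSON/MD/XLSX)
        "table_drafts": root / "tables" / "drafts",  # Table Drafts (Persistence)
        "knowledge_graph": root / "lightrag_db",     # Knowledge Graph
    }


for label, path in storage_paths("example").items():
    print(f"{label:16} {path}")
```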

📁 Project Structure (DDD)

asset-aware-mcp/
├── src/
│   ├── domain/              # 🔵 Domain: Entities, Value Objects, Interfaces
│   ├── application/         # 🟢 Application: Doc Service, Table Service (A2T), Asset Service
│   ├── infrastructure/      # 🟠 Infrastructure: PyMuPDF, LightRAG, Excel Renderer
│   └── presentation/        # 🔴 Presentation: MCP Server (FastMCP)
├── data/                    # Document and Asset Storage
├── docs/
│   └── spec.md              # Technical Specification
├── tests/                   # Unit and Integration Tests
├── vscode-extension/        # VS Code Management Extension
└── pyproject.toml           # uv Project Config

🚀 Quick Start

# Install dependencies (using uv)
uv sync

# Run MCP Server
uv run python -m src.presentation.server

# Or use the VS Code extension for graphical management
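
To register the server with an MCP client by hand, a configuration entry usually wraps the same launch command. The sketch below generates an entry in the shape VS Code uses for `.vscode/mcp.json`. It assumes the server speaks the stdio transport and that the client starts it from the project root; the server label `asset-aware` is a made-up name.

```python
import json


def build_mcp_config() -> dict:
    """Build a VS Code style MCP server entry (assumes stdio transport)."""
    return {
        "servers": {
            "asset-aware": {  # hypothetical label; any unique name should work
                "type": "stdio",
                "command": "uv",
                # Same command as in Quick Start, split into arguments.
                "args": ["run", "python", "-m", "src.presentation.server"],
            }
        }
    }


config = build_mcp_config()
print(json.dumps(config, indent=2))
```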

🔌 MCP Tools

Document & Asset Tools

| Tool | Purpose |
|---|---|
| ingest_documents | Process PDF files with optional Marker backend (use_marker=True for blocks.json) |
| list_documents | List all ingested documents and their asset counts |
| inspect_document_manifest | Inspect document structure before fetching specific assets |
| fetch_document_asset | Precisely retrieve tables (MD) / figures (B64) / sections |
| parse_pdf_structure | Run high-precision Marker parsing and emit structured blocks |
| search_source_location | Search exact source locations with page + bbox for verification |
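
A typical agent flow with these tools is ingest, then inspect the manifest, then fetch a specific asset. The sketch below builds the corresponding MCP `tools/call` JSON-RPC requests. Only the tool names and `use_marker` come from this README; every other argument name (`file_paths`, `doc_id`, `asset_type`, `asset_id`) and every value is an assumption made for illustration.

```python
import json
from itertools import count

_request_ids = count(1)  # JSON-RPC requests need unique ids


def tool_call(name: str, arguments: dict) -> dict:
    """Build an MCP tools/call request envelope."""
    return {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }


# Argument names below are hypothetical, except use_marker.
workflow = [
    tool_call("ingest_documents", {"file_paths": ["paper.pdf"], "use_marker": True}),
    tool_call("inspect_document_manifest", {"doc_id": "doc_example"}),
    tool_call(
        "fetch_document_asset",
        {"doc_id": "doc_example", "asset_type": "figure", "asset_id": "fig_1"},
    ),
]

for request in workflow:
    print(json.dumps(request))
```

Inspecting the manifest before fetching keeps the agent from pulling large Base64 payloads it does not need.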

Job Management Tools

| Tool | Purpose |
|---|---|
| get_job_status | Get async ingestion job progress and final result |
| list_jobs | List active or historical ETL jobs |
| cancel_job | Cancel a running ETL job |
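
Because ingestion runs asynchronously, a client typically polls `get_job_status` until the job reaches a terminal state. The sketch below shows one such loop against a stubbed status source. The status payload fields (`status`, `progress`) and the set of terminal states are assumptions, not the server's documented schema.

```python
import time
from typing import Callable

# Hypothetical terminal states; the real status vocabulary may differ.
TERMINAL_STATES = {"completed", "failed", "cancelled"}


def wait_for_job(
    get_status: Callable[[str], dict],
    job_id: str,
    interval: float = 1.0,
    timeout: float = 600.0,
) -> dict:
    """Poll get_status(job_id) until a terminal state or the timeout is reached."""
    deadline = time.monotonic() + timeout
    while True:
        status = get_status(job_id)
        if status.get("status") in TERMINAL_STATES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} still running after {timeout}s")
        time.sleep(interval)


def fake_status_source() -> Callable[[str], dict]:
    """Stand-in for a real get_job_status call; replies are made up."""
    replies = iter(
        [
            {"status": "running", "progress": 0.3},
            {"status": "running", "progress": 0.8},
            {"status": "completed", "progress": 1.0},
        ]
    )
    return lambda job_id: next(replies)


result = wait_for_job(fake_status_source(), "job_example", interval=0.0)
print(result)
```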

Knowledge Graph Tools

| Tool | Purpose |
|---|---|
| consult_knowledge_graph | Knowledge graph query, cross-document comparison |
| export_knowledge_graph | Export graph summary / JSON / Mermaid for inspection |
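
To give a feel for what a Mermaid export can look like, the sketch below renders a list of (source, relation, target) triples as a Mermaid flowchart. It is an independent illustration, not the output format of `export_knowledge_graph`; the function name and the sample triples are made up, and labels containing double quotes are not escaped.

```python
def to_mermaid(edges: list[tuple[str, str, str]]) -> str:
    """Render (source, relation, target) triples as Mermaid flowchart text."""
    node_ids: dict[str, str] = {}

    def node_id(label: str) -> str:
        # Assign short stable ids (n0, n1, ...) in order of first appearance.
        return node_ids.setdefault(label, f"n{len(node_ids)}")

    lines = ["graph TD"]
    for source, relation, target in edges:
        s, t = node_id(source), node_id(target)
        lines.append(f'    {s}["{source}"] -->|{relation}| {t}["{target}"]')
    return "\n".join(lines)


# Sample triples loosely based on the attention figure discussed earlier.
sample_edges = [
    ("Q", "input_to", "MatMul"),
    ("K", "input_to", "MatMul"),
    ("MatMul", "feeds", "Scale"),
]
print(to_mermaid(sample_edges))
```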

Section Navigation Tools (Dynamic Hierarchy)

| Tool | Purpose |
|---|---|
| list_section_tree | Display complete section hierarchy tree (supports any depth) |
| get_section_detail | Get detailed info for a specific section |
| get_section_blocks | Extract all blocks from a section with page + bbox |
| search_sections | Search section titles |
| get_section_content | Read section content via asset service |
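
A section tree of arbitrary depth can be built from a flat list of (level, title) headings with a stack. The sketch below shows that general technique together with a recursive title search. It does not reproduce this project's internal data model; the `Section` class and both function names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Section:
    title: str
    level: int
    children: list["Section"] = field(default_factory=list)


def build_tree(headings: list[tuple[int, str]]) -> Section:
    """Nest (level, title) headings (level >= 1) under a synthetic root."""
    root = Section("(root)", 0)
    stack = [root]
    for level, title in headings:
        # Pop until the top of the stack is a shallower heading: the parent.
        while stack[-1].level >= level:
            stack.pop()
        node = Section(title, level)
        stack[-1].children.append(node)
        stack.append(node)
    return root


def search_titles(node: Section, keyword: str) -> list[str]:
    """Case-insensitive depth-first search over section titles."""
    hits: list[str] = []
    for child in node.children:
        if keyword.lower() in child.title.lower():
            hits.append(child.title)
        hits.extend(search_titles(child, keyword))
    return hits


headings = [
    (1, "Methods"),
    (2, "Study Design"),
    (3, "Participants"),
    (2, "Statistics"),
    (1, "Results"),
]
tree = build_tree(headings)
print([child.title for child in tree.children])
print(search_titles(tree, "stat"))
```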

Docx Editing Tools (DFM: Docx-Flavored Markdown)

Edit .docx files as Markdown. Formatting, tables, and media are preserved on round-trip.

| Tool | Purpose |
|---|---|
| ingest_docx | Import .docx and decompose into DFM blocks |
| get_docx_content | Read DFM content of specific blocks |
| save_docx | Write DFM edits back to .docx |
| list_docx_blocks | List document block structure |
| docx_validate_roundtrip | 6-dimension round-trip fidelity validation + file-level comparison (SHA-256, ZIP diff) |
| docx_table_to_context | Bridge: Docx table → A2T context |
| docx_table_from_context | Bridge: A2T table → Docx table |
| docx_chart_data | Extract chart data from Docx |
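
The file-level comparison mentioned for `docx_validate_roundtrip` (SHA-256, ZIP diff) can be pictured as follows: a .docx file is a ZIP archive, so hashing each member before and after a round trip shows which parts changed. This is a generic sketch using only the standard library, not the tool's actual implementation, and it does not cover the 6-dimension fidelity checks. The helper names and the in-memory sample archives are made up.

```python
import hashlib
import io
import zipfile


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def zip_member_diff(original: bytes, roundtrip: bytes) -> dict[str, list[str]]:
    """Compare two ZIP archives member by member using SHA-256 digests."""

    def member_hashes(blob: bytes) -> dict[str, str]:
        with zipfile.ZipFile(io.BytesIO(blob)) as zf:
            return {name: sha256_hex(zf.read(name)) for name in zf.namelist()}

    before, after = member_hashes(original), member_hashes(roundtrip)
    return {
        "added": sorted(after.keys() - before.keys()),
        "removed": sorted(before.keys() - after.keys()),
        "changed": sorted(
            name for name in before.keys() & after.keys() if before[name] != after[name]
        ),
    }


def make_zip(members: dict[str, bytes]) -> bytes:
    """Build a small in-memory ZIP that stands in for a .docx file."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as zf:
        for name, data in members.items():
            zf.writestr(name, data)
    return buffer.getvalue()


original = make_zip({"word/document.xml": b"<a/>", "word/styles.xml": b"<s/>"})
roundtrip = make_zip(
    {
        "word/document.xml": b"<b/>",
        "word/styles.xml": b"<s/>",
        "word/media/image1.png": b"x",
    }
)
diff = zip_member_diff(original, roundtrip)
print(diff)
```

Comparing members rather than whole-file hashes avoids false alarms from ZIP metadata such as timestamps.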

A2T (Anything to Table) Tools: 7 Operation-Based Tools

Agent-friendly design: each tool handles multiple operations via an operation parameter. Tables accept any source: PDF assets, KG entities, external URLs, or user input.

| Tool | Operations | Purpose |
|---|---|---|
| plan_table | schema / templates / from_template | Schema planning, browse 4 built-in templates, create from template |
| table_manage | create / delete / list / preview / resume / render / add_column / remove_column / rename_column | Table lifecycle + Schema evolution |
| table_data | add_rows / get_row / update_row / delete_row / get_cell / update_cell / clear_cell | Row & cell CRUD |
| table_cite | add / get / remove / cell_history | Citation management with AssetRef (7 source types) |
| table_history | changes / tokens | Audit trail & token estimation |
| table_draft | create / update / add_rows / resume / commit / list / delete | Draft workflow with persistence |
| discover_sources | - | Cross-document source discovery (sections, tables, figures, KG) |
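
The operation-based design can be sketched as a single entry point that dispatches on an `operation` string. The example below mimics three `table_data` operations against an in-memory store. The parameter names (`table_id`, `rows`, `index`, `column`, `value`), the return shapes, and the sample rows are assumptions for illustration and may not match the real tool.

```python
from typing import Any, Callable

# In-memory stand-in for table storage: table_id -> list of row dicts.
_tables: dict[str, list[dict[str, Any]]] = {}


def _add_rows(table_id: str, rows: list[dict[str, Any]]) -> dict[str, Any]:
    _tables.setdefault(table_id, []).extend(rows)
    return {"row_count": len(_tables[table_id])}


def _get_row(table_id: str, index: int) -> dict[str, Any]:
    return _tables[table_id][index]


def _update_cell(table_id: str, index: int, column: str, value: Any) -> dict[str, Any]:
    _tables[table_id][index][column] = value
    return _tables[table_id][index]


OPERATIONS: dict[str, Callable[..., dict[str, Any]]] = {
    "add_rows": _add_rows,
    "get_row": _get_row,
    "update_cell": _update_cell,
}


def table_data(operation: str, **params: Any) -> dict[str, Any]:
    """Single tool entry point that routes to the requested operation."""
    handler = OPERATIONS.get(operation)
    if handler is None:
        raise ValueError(
            f"Unknown operation {operation!r}; expected one of {sorted(OPERATIONS)}"
        )
    return handler(**params)


print(table_data("add_rows", table_id="demo", rows=[{"item": "A", "value": 1}]))
print(table_data("update_cell", table_id="demo", index=0, column="value", value=2))
```

Grouping operations under one tool keeps the tool list short for the agent while still exposing fine-grained actions.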

ETL Profile Tools

Different journals/formats need different extraction settings. Use these tools to switch profiles.

| Tool | Purpose |
|---|---|
| list_etl_profiles | List all available profiles (default, arxiv, nature, ieee, elsevier) |
| get_etl_profile | Get detailed configuration of a specific profile |
| get_current_etl_profile | Show currently active profile |
| set_etl_profile | Switch profile for subsequent document ingestion |
| load_etl_profile_from_json | Load custom profile from JSON file |
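
Loading a custom profile from JSON might look roughly like the sketch below. The profile schema is not documented in this README, so every field apart from `use_marker` (`name`, `min_figure_size_px`, `caption_patterns`) is a hypothetical placeholder; check the project's own profile files for the real keys.

```python
import json
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class EtlProfile:
    # All fields except use_marker are hypothetical placeholders.
    name: str
    use_marker: bool = False
    min_figure_size_px: int = 100
    caption_patterns: tuple[str, ...] = ("Figure", "Table")


def load_profile(json_text: str) -> EtlProfile:
    """Parse a profile from JSON text, rejecting unknown keys early."""
    raw = json.loads(json_text)
    known = {f.name for f in fields(EtlProfile)}
    unknown = set(raw) - known
    if unknown:
        raise ValueError(f"Unknown profile keys: {sorted(unknown)}")
    if "caption_patterns" in raw:
        # JSON arrays arrive as lists; store them as an immutable tuple.
        raw["caption_patterns"] = tuple(raw["caption_patterns"])
    return EtlProfile(**raw)


custom = load_profile('{"name": "my_journal", "use_marker": true, "caption_patterns": ["Fig."]}')
print(custom)
```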

🔧 Tech Stack

| Category | Technology |
|---|---|
| Language | Python 3.10+ |
| Package Manager | uv (all pip/setup-python removed) |
| ETL | PyMuPDF (fitz) + Marker (optional, high-precision) |
| RAG | LightRAG (lightrag-hku) |
| MCP | FastMCP |
| Storage | Local filesystem (JSON/Markdown/PNG) |

📄 License

Apache License 2.0
