
Transform PDF documents into structured knowledge graphs with citation provenance


MalimGraph

███╗   ███╗ █████╗ ██╗     ██╗███╗   ███╗ ██████╗ ██████╗  █████╗ ██████╗ ██╗  ██╗
████╗ ████║██╔══██╗██║     ██║████╗ ████║██╔════╝ ██╔══██╗██╔══██╗██╔══██╗██║  ██║
██╔████╔██║███████║██║     ██║██╔████╔██║██║  ███╗██████╔╝███████║██████╔╝███████║
██║╚██╔╝██║██╔══██║██║     ██║██║╚██╔╝██║██║   ██║██╔══██╗██╔══██║██╔═══╝ ██╔══██║
██║ ╚═╝ ██║██║  ██║███████╗██║██║ ╚═╝ ██║╚██████╔╝██║  ██║██║  ██║██║     ██║  ██║
╚═╝     ╚═╝╚═╝  ╚═╝╚══════╝╚═╝╚═╝     ╚═╝ ╚═════╝ ╚═╝  ╚═╝╚═╝  ╚═╝╚═╝     ╚═╝  ╚═╝

PyPI version · License: MIT · Python 3.10+ · MCP Compatible · CI

From documents to knowledge graphs.

Transform PDF documents into structured knowledge graphs with full citation provenance. Every entity and relationship traces back to the exact PDF page and verbatim text that supports it.


Features

| Tool | Description |
|------|-------------|
| extract_knowledge_graph | Hybrid rule + LLM extraction → entities, relationships, citations |
| chunk_document | Token-aware overlapping chunks with heading context for RAG |
| render_document_html | Structured HTML with page anchors, entity annotations, TOC + search |
| manage_graph_db | Load, query, and manage graphs in Neo4j or PostgreSQL (Apache AGE) |
| embed_and_store_chunks | Embed chunks into PostgreSQL pgvector (OpenAI / Voyage / local) |

Three ways to use:

  • MCP Server: connect to Claude Desktop, Claude Code, or claude.ai
  • CLI: malimgraph extract, chunk, render, db, vector
  • Claude Skills: 5 installable .skill packages for claude.ai

Quick Start

Claude Code / Claude Desktop (no API key needed)

pip install malimgraph
claude mcp add malimgraph -- malimgraph-plugin

Restart Claude Code, then just ask:

"Extract a knowledge graph from report.pdf and save to ./output/"

Claude reads the PDF, performs the entity extraction itself, and saves the graph. No ANTHROPIC_API_KEY is required.

CLI (standalone, requires API key)

pip install malimgraph
export ANTHROPIC_API_KEY=sk-ant-...

malimgraph extract --input report.pdf --output ./output/ --format all
malimgraph chunk --input report.pdf --output ./chunks/
malimgraph render --input report.pdf --output document.html

How It Works

PDF
 │
 ▼
pdf_reader.py ──────────────────────────────────────────────┐
 │  (PyMuPDF: text, headings, tables, page structure)       │
 ├──────────────────────────────────┐                        │
 ▼                                  ▼                        ▼
rule_extractor.py              llm_extractor.py          chunker.py
 │ (regex: dates, amounts,      │ (Anthropic API:         │ (sliding window
 │  emails, legal refs,         │  semantic entities,     │  with heading
 │  section numbers)            │  relationships,         │  context)
 │                              │  source_text required)  │
 └──────────────┬───────────────┘                         │
                ▼                                          ▼
          graph_builder.py                          embedder.py
           │ (merge + dedup:                        │ (OpenAI / Voyage /
           │  hybrid method,                        │  local sentence-
           │  citation accumulation,                │  transformers)
           │  stable IDs)                           │
           ▼                                        ▼
     knowledge_graph.json                    vector_client.py
           │                                 (pgvector: HNSW index,
     ┌─────┴──────┐                           cosine similarity search)
     ▼             ▼
 cypher.py     age_sql.py
 (.cypher)      (.sql)
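
The merge step shown for graph_builder.py (merge + dedup, hybrid method, citation accumulation) can be pictured roughly as follows. This is a minimal sketch, not the package's actual code: the merge_entities name, the input dict layout ('page', 'source_text'), and the case-insensitive dedup key are assumptions made for illustration. Only the output field names and the rule / llm / hybrid labels come from this README.

```python
def merge_entities(rule_entities, llm_entities):
    """Merge two entity lists, deduplicating on (type, label).

    Hypothetical helper: each input entity is a dict with 'type',
    'label', 'page', and 'source_text' keys (an assumed layout).
    """
    merged = {}
    for method, entities in (("rule", rule_entities), ("llm", llm_entities)):
        for ent in entities:
            # Assumed normalization: trim whitespace, ignore case.
            key = (ent["type"], ent["label"].strip().lower())
            node = merged.setdefault(key, {
                "type": ent["type"],
                "label": ent["label"].strip(),
                "extraction_method": method,
                "source_pages": [],
                "citations": [],
            })
            # Seen by both extractors: mark the entity as hybrid.
            if node["extraction_method"] != method:
                node["extraction_method"] = "hybrid"
            # Accumulate every supporting quote instead of overwriting.
            citation = {"page": ent["page"], "text": ent["source_text"]}
            if citation not in node["citations"]:
                node["citations"].append(citation)
            if ent["page"] not in node["source_pages"]:
                node["source_pages"].append(ent["page"])
    for node in merged.values():
        node["source_pages"].sort()
        node["citation_count"] = len(node["citations"])
    return list(merged.values())


rule = [
    {"type": "Date", "label": "2024-03-01", "page": 2, "source_text": "dated 2024-03-01"},
]
llm = [
    {"type": "Date", "label": "2024-03-01", "page": 5, "source_text": "effective 2024-03-01"},
    {"type": "Organization", "label": "Acme Corp", "page": 1, "source_text": "Acme Corp agrees"},
]
graph_nodes = merge_entities(rule, llm)
```

In this toy input the date is found by both extractors, so it ends up as one hybrid node citing pages 2 and 5, while the organization stays an llm-only node.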

Three Ways to Use

Claude Code Plugin (recommended — no API key)

claude mcp add malimgraph -- malimgraph-plugin

Claude Desktop (claude_desktop_config.json):

{
  "mcpServers": {
    "malimgraph": {
      "command": "malimgraph-plugin"
    }
  }
}

Entity extraction runs through Claude itself under your existing Claude subscription, so no ANTHROPIC_API_KEY is needed. See docs/claude-code-plugin.md for full details.
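
If claude_desktop_config.json already lists other MCP servers, the malimgraph entry has to be merged in rather than pasted over the file. A small sketch of that merge, using only the standard library. The add_mcp_server helper is hypothetical, and the config path is passed in by the caller because its location depends on the platform and is not stated here.

```python
import json
from pathlib import Path


def add_mcp_server(config_path, name="malimgraph", command="malimgraph-plugin"):
    """Add or update one entry under "mcpServers", keeping all other entries."""
    path = Path(config_path)
    text = path.read_text(encoding="utf-8") if path.exists() else ""
    # Treat a missing or empty file as an empty config.
    config = json.loads(text) if text.strip() else {}
    config.setdefault("mcpServers", {})[name] = {"command": command}
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config
```

Calling add_mcp_server("path/to/claude_desktop_config.json") leaves existing servers untouched and writes the same "malimgraph" block as shown above.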

MCP Server (standalone / HTTP)

# stdio — Claude Desktop / Claude Code (with API key for LLM extraction)
malimgraph serve

# HTTP — remote connections / claude.ai
malimgraph serve --transport http --port 8080

CLI

# Extract knowledge graph from PDF
malimgraph extract \
  --input report.pdf \
  --output ./output/ \
  --entity-types auto \
  --format all \
  --graph-name my_graph

# Chunk for embeddings
malimgraph chunk \
  --input report.pdf \
  --output ./chunks/ \
  --chunk-size 512 \
  --overlap 64 \
  --format json

# Embed chunks into PostgreSQL pgvector
malimgraph vector load \
  --input ./chunks/chunks.json \
  --uri "postgresql://user:pass@localhost:5432/mydb" \
  --provider openai \
  --table document_chunks

# Semantic search over embedded chunks
malimgraph vector search \
  --query "What are the financial risks?" \
  --uri "postgresql://user:pass@localhost:5432/mydb" \
  --top-k 5

# Render as browsable HTML
malimgraph render \
  --input report.pdf \
  --output document.html \
  --knowledge-graph ./output/knowledge_graph.json

# Load into Neo4j
malimgraph db load \
  --input ./output/knowledge_graph.json \
  --target neo4j \
  --uri bolt://localhost:7687 \
  --user neo4j \
  --password secret

# Query the graph
malimgraph db query \
  --target neo4j \
  --uri bolt://localhost:7687 \
  --query "MATCH (n:Organization) RETURN n.label, n.source_pages LIMIT 10"
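
The --chunk-size 512 and --overlap 64 options in the chunk command above describe a sliding window: each chunk starts chunk_size - overlap tokens after the previous one, so neighbouring chunks share 64 tokens. The sketch below illustrates only that arithmetic. It assumes a pre-tokenized list (the tokenizer malimgraph actually uses is not specified here), and the function name and output keys other than heading_context are hypothetical.

```python
def sliding_window_chunks(tokens, chunk_size=512, overlap=64, heading=""):
    """Split a token list into overlapping windows (illustrative only)."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        chunks.append({
            "chunk_index": len(chunks),
            "heading_context": heading,  # heading the chunk falls under
            "token_start": start,
            "token_count": len(window),
            "text": " ".join(window),
        })
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks


tokens = [f"w{i}" for i in range(1000)]
chunks = sliding_window_chunks(tokens, chunk_size=512, overlap=64, heading="1. Risk Factors")
```

For 1000 tokens this yields three chunks starting at tokens 0, 448, and 896, with the last one shorter than the rest.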

Claude Skills

Download .skill files from GitHub Releases and install in claude.ai → Settings → Skills.

| Skill | Trigger phrases |
|-------|-----------------|
| pdf-to-knowledge-graph | "knowledge graph", "extract entities", "PDF to Cypher" |
| pdf-to-chunks | "chunk document", "split for embeddings", "RAG chunks" |
| document-to-html | "convert PDF to HTML", "render document", "make PDF browsable" |
| graph-db-admin | "load into Neo4j", "Cypher query", "graph statistics" |
| chunks-to-pgvector | "store in pgvector", "embed into PostgreSQL", "semantic search", "RAG with PostgreSQL" |

Installation

# Core (knowledge graph + chunking + HTML)
pip install malimgraph

# With Neo4j support
pip install "malimgraph[neo4j]"

# With Apache AGE support
pip install "malimgraph[age]"

# With pgvector + OpenAI embeddings
pip install "malimgraph[pgvector,openai]"

# With pgvector + Voyage AI embeddings
pip install "malimgraph[pgvector,voyage]"

# With local embeddings (no API key needed)
pip install "malimgraph[pgvector,local]"

# Everything
pip install "malimgraph[all]"

Environment Variables

ANTHROPIC_API_KEY=sk-ant-...      # Required for LLM extraction
OPENAI_API_KEY=sk-...             # Required for OpenAI embeddings
VOYAGE_API_KEY=pa-...             # Required for Voyage AI embeddings
PGVECTOR_URI=postgresql://...     # PostgreSQL connection for pgvector
NEO4J_URI=bolt://localhost:7687   # Neo4j connection
NEO4J_USER=neo4j
NEO4J_PASSWORD=yourpassword
AGE_CONNECTION_URI=host=...       # Apache AGE connection
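
A pre-flight check can report a missing key before a long extraction or embedding run starts. The sketch below covers only the three variables marked "Required for ..." above. The feature labels and the missing_keys helper are hypothetical names chosen for this example, not part of malimgraph.

```python
import os

# Variable names are taken from the "Required for ..." comments above;
# the feature labels on the left are made up for this sketch.
REQUIRED_FOR = {
    "llm_extraction": "ANTHROPIC_API_KEY",
    "openai_embeddings": "OPENAI_API_KEY",
    "voyage_embeddings": "VOYAGE_API_KEY",
}


def missing_keys(features, env=None):
    """Return the required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [REQUIRED_FOR[f] for f in features if not env.get(REQUIRED_FOR[f])]
```

For example, missing_keys(["llm_extraction", "openai_embeddings"]) returns an empty list only when both ANTHROPIC_API_KEY and OPENAI_API_KEY are set in the current environment.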

Output Schema — knowledge_graph.json

Every entity and relationship carries full citation provenance:

| Field | Type | Description |
|-------|------|-------------|
| id | string | Stable hash ID: e_ + MD5(type:label)[:8] |
| label | string | Canonical entity name |
| type | string | Entity type (Organization, Person, Date, …) |
| extraction_method | enum | rule / llm / hybrid |
| confidence | enum | high / medium / low |
| source_pages | int[] | PDF page numbers where found |
| source_text | string | Primary verbatim supporting quote |
| source_chunk_id | string | Processing chunk ID |
| citations[] | object[] | All supporting quotes with page refs |
| citation_count | int | Number of citations, stored as a property in graph DBs |
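
The id formula can be read as "the first eight hex characters of the MD5 digest of type:label, prefixed with e_". The sketch below follows that reading. Whether malimgraph normalizes case or whitespace before hashing is not stated here, so the strings are hashed as given, and stable_entity_id is a hypothetical name.

```python
import hashlib


def stable_entity_id(entity_type, label):
    """Build an ID of the form e_ + MD5("type:label")[:8] (assumed reading)."""
    digest = hashlib.md5(f"{entity_type}:{label}".encode("utf-8")).hexdigest()
    return "e_" + digest[:8]


org_id = stable_entity_id("Organization", "Acme Corp")
```

Because the ID depends only on type and label, extracting the same entity again produces the same node ID, which is what lets repeated extractions accumulate citations on one node instead of creating duplicates.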

pgvector — Semantic Search Schema

Chunks are stored with embeddings in PostgreSQL, enabling semantic search:

-- Find chunks most similar to a query
SELECT chunk_text, source_file, page_numbers, heading_context,
       1 - (embedding <=> '[...]'::vector) AS score
FROM document_chunks
ORDER BY embedding <=> '[...]'::vector
LIMIT 10;

-- Filter by document
SELECT * FROM document_chunks
WHERE document_id = 'annual_report_2024'
ORDER BY embedding <=> '[...]'::vector LIMIT 5;
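
In pgvector, <=> is the cosine distance operator, so 1 - (embedding <=> query) in the first query is the cosine similarity, and ordering by the distance ascending puts the best matches first. The pure-Python sketch below reproduces that arithmetic on toy 2-d vectors. It does not talk to PostgreSQL, and the function names are illustrative.

```python
import math


def cosine_distance(a, b):
    """Same quantity as pgvector's <=> operator (vectors must be non-zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def rank_chunks(query_vec, chunks, top_k=10):
    """Return (score, text) pairs, highest cosine similarity first."""
    scored = [(1.0 - cosine_distance(query_vec, vec), text) for text, vec in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


chunks = [
    ("liquidity risk", [1.0, 0.0]),
    ("office locations", [0.0, 1.0]),
    ("credit risk", [1.0, 1.0]),
]
results = rank_chunks([1.0, 0.0], chunks, top_k=2)
```

Here the chunk pointing in the same direction as the query scores 1.0, the orthogonal one scores 0.0 and is cut by top_k, and the diagonal one lands in between.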

Supported embedding providers:

| Provider | Default model | Dimensions | API key |
|----------|---------------|------------|---------|
| openai | text-embedding-3-small | 1536 | OPENAI_API_KEY |
| voyage | voyage-3-large | 1024 | VOYAGE_API_KEY |
| local | all-MiniLM-L6-v2 | 384 | none (CPU) |

Database Setup

Neo4j

docker run -p 7474:7474 -p 7687:7687 \
  -e NEO4J_AUTH=neo4j/yourpassword neo4j:latest

Apache AGE (PostgreSQL)

docker run -p 5432:5432 -e POSTGRES_PASSWORD=secret apache/age:latest

pgvector (PostgreSQL)

docker run -p 5432:5432 -e POSTGRES_PASSWORD=secret pgvector/pgvector:pg17

See docs/database-setup.md for full guides.


Contributing

  1. Fork the repo
  2. Create a feature branch: git checkout -b feature/my-feature
  3. Install dev deps: pip install -e ".[dev]"
  4. Run tests: make test
  5. Lint: make lint
  6. Submit a PR

Credits

Built by Malim AI Labs — AI-powered knowledge infrastructure for Southeast Asia.

Malim AI Labs Social Enterprise (003827047-U) · Kuala Lumpur, Malaysia


License

MIT — see LICENSE
