Transform PDF documents into structured knowledge graphs with citation provenance
MalimGraph
From documents to knowledge graphs.
Transform PDF documents into structured knowledge graphs with full citation provenance. Every entity and relationship traces back to the exact PDF page and verbatim text that supports it.
Features
| Tool | Description |
|---|---|
| `extract_knowledge_graph` | Hybrid rule + LLM extraction → entities, relationships, citations |
| `chunk_document` | Token-aware overlapping chunks with heading context for RAG |
| `render_document_html` | Structured HTML with page anchors, entity annotations, TOC + search |
| `manage_graph_db` | Load, query, and manage graphs in Neo4j or PostgreSQL (Apache AGE) |
| `embed_and_store_chunks` | Embed chunks into PostgreSQL pgvector (OpenAI / Voyage / local) |
Three ways to use:
- MCP Server: connect to Claude Desktop, Claude Code, or claude.ai
- CLI: `malimgraph extract`, `chunk`, `render`, `db`, `vector`
- Claude Skills: 5 installable `.skill` packages for claude.ai
Quick Start
Claude Code / Claude Desktop (no API key needed)
pip install malimgraph
claude mcp add malimgraph -- malimgraph-plugin
Restart Claude Code, then just ask:
"Extract a knowledge graph from report.pdf and save to ./output/"
Claude reads the PDF, extracts entities using its own intelligence, and saves the graph. No ANTHROPIC_API_KEY required.
CLI (standalone, requires API key)
pip install malimgraph
export ANTHROPIC_API_KEY=sk-ant-...
malimgraph extract --input report.pdf --output ./output/ --format all
malimgraph chunk --input report.pdf --output ./chunks/
malimgraph render --input report.pdf --output document.html
How It Works
PDF
 │
 ▼
pdf_reader.py ──────────────────────────────────────────────┐
 │  (PyMuPDF: text, headings, tables, page structure)       │
 ├──────────────────────────────┐                           │
 ▼                              ▼                           ▼
rule_extractor.py              llm_extractor.py            chunker.py
 │  (regex: dates, amounts,     │  (Anthropic API:          │  (sliding window
 │   emails, legal refs,        │   semantic entities,      │   with heading
 │   section numbers)           │   relationships,          │   context)
 │                              │   source_text required)   │
 └──────────────┬───────────────┘                           │
                ▼                                           ▼
               graph_builder.py                            embedder.py
                │  (merge + dedup:                          │  (OpenAI / Voyage /
                │   hybrid method,                          │   local sentence-
                │   citation accumulation,                  │   transformers)
                │   stable IDs)                             │
                ▼                                           ▼
               knowledge_graph.json                        vector_client.py
                │                                              (pgvector: HNSW index,
          ┌─────┴──────┐                                        cosine similarity search)
          ▼            ▼
         cypher.py    age_sql.py
         (.cypher)    (.sql)
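The rule-based branch of the pipeline can be pictured with a small sketch. This is not the code in `rule_extractor.py`; the patterns, the `RuleHit` class, and the `extract_rules` function are hypothetical names chosen for illustration. It only shows the idea that every regex match keeps the page number and the verbatim line it came from.

```python
import re
from dataclasses import dataclass

# Hypothetical patterns for illustration; the package's real rule set is not shown here.
PATTERNS = {
    "Email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "Date": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    "Amount": re.compile(r"\$\d[\d,]*(?:\.\d+)?"),
}


@dataclass
class RuleHit:
    type: str
    label: str
    source_page: int
    source_text: str  # verbatim line that contains the match


def extract_rules(pages: dict[int, str]) -> list[RuleHit]:
    """Scan each page's text and record page number and supporting line per match."""
    hits = []
    for page_no, text in sorted(pages.items()):
        for line in text.splitlines():
            for type_name, pattern in PATTERNS.items():
                for match in pattern.finditer(line):
                    hits.append(RuleHit(type_name, match.group(0), page_no, line.strip()))
    return hits


if __name__ == "__main__":
    pages = {1: "Contact: ops@example.com\nSigned on 2024-03-15.", 2: "Total due: $1,250.00"}
    for hit in extract_rules(pages):
        print(hit)
```

Keeping the supporting line next to each match is what lets a later merge step attach citations without re-reading the PDF.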
Three Ways to Use
Claude Code Plugin (recommended — no API key)
claude mcp add malimgraph -- malimgraph-plugin
Claude Desktop (claude_desktop_config.json):
{
"mcpServers": {
"malimgraph": {
"command": "malimgraph-plugin"
}
}
}
Claude uses its own subscription to extract entities — no ANTHROPIC_API_KEY needed.
See docs/claude-code-plugin.md for full details.
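If `claude_desktop_config.json` already lists other MCP servers, the entry above has to be merged in rather than pasted over the file. The following is a minimal sketch of that merge; `add_malimgraph_server` is a hypothetical helper, not part of the package, and the config file location is left to the caller because it differs by operating system.

```python
import json
from pathlib import Path


def add_malimgraph_server(config_path: Path) -> dict:
    """Add the malimgraph entry to mcpServers, preserving any existing servers."""
    text = config_path.read_text() if config_path.exists() else ""
    config = json.loads(text) if text.strip() else {}
    servers = config.setdefault("mcpServers", {})
    # Same entry as the JSON snippet above.
    servers["malimgraph"] = {"command": "malimgraph-plugin"}
    config_path.write_text(json.dumps(config, indent=2) + "\n")
    return config
```

Restart Claude Desktop after editing the file so the new server is picked up.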
MCP Server (standalone / HTTP)
# stdio — Claude Desktop / Claude Code (with API key for LLM extraction)
malimgraph serve
# HTTP — remote connections / claude.ai
malimgraph serve --transport http --port 8080
CLI
# Extract knowledge graph from PDF
malimgraph extract \
--input report.pdf \
--output ./output/ \
--entity-types auto \
--format all \
--graph-name my_graph
# Chunk for embeddings
malimgraph chunk \
--input report.pdf \
--output ./chunks/ \
--chunk-size 512 \
--overlap 64 \
--format json
# Embed chunks into PostgreSQL pgvector
malimgraph vector load \
--input ./chunks/chunks.json \
--uri "postgresql://user:pass@localhost:5432/mydb" \
--provider openai \
--table document_chunks
# Semantic search over embedded chunks
malimgraph vector search \
--query "What are the financial risks?" \
--uri "postgresql://user:pass@localhost:5432/mydb" \
--top-k 5
# Render as browsable HTML
malimgraph render \
--input report.pdf \
--output document.html \
--knowledge-graph ./output/knowledge_graph.json
# Load into Neo4j
malimgraph db load \
--input ./output/knowledge_graph.json \
--target neo4j \
--uri bolt://localhost:7687 \
--user neo4j \
--password secret
# Query the graph
malimgraph db query \
--target neo4j \
--uri bolt://localhost:7687 \
--query "MATCH (n:Organization) RETURN n.label, n.source_pages LIMIT 10"
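The `--chunk-size 512` and `--overlap 64` flags in the chunk command describe a sliding window: each chunk starts `chunk_size - overlap` tokens after the previous one. The sketch below illustrates that arithmetic only. It assumes a pre-tokenized list of strings as a stand-in for a real tokenizer, and the function name and output keys other than `heading_context` are hypothetical rather than the package's actual chunk format.

```python
def sliding_window_chunks(
    tokens: list[str], chunk_size: int = 512, overlap: int = 64, heading: str = ""
) -> list[dict]:
    """Split tokens into overlapping windows that each carry their heading context."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # 448 with the defaults
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        chunks.append({
            "chunk_id": f"chunk_{len(chunks):04d}",  # hypothetical ID format
            "heading_context": heading,
            "start_token": start,
            "text": " ".join(window),
        })
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the document
    return chunks


if __name__ == "__main__":
    tokens = [f"t{i}" for i in range(1000)]
    for chunk in sliding_window_chunks(tokens, heading="1. Introduction"):
        print(chunk["chunk_id"], chunk["start_token"], len(chunk["text"].split()))
```

With 1000 tokens and the default settings this yields windows starting at tokens 0, 448, and 896, so consecutive chunks share 64 tokens.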
Claude Skills
Download .skill files from GitHub Releases and install in claude.ai → Settings → Skills.
| Skill | Trigger phrases |
|---|---|
pdf-to-knowledge-graph |
"knowledge graph", "extract entities", "PDF to Cypher" |
pdf-to-chunks |
"chunk document", "split for embeddings", "RAG chunks" |
document-to-html |
"convert PDF to HTML", "render document", "make PDF browsable" |
graph-db-admin |
"load into Neo4j", "Cypher query", "graph statistics" |
chunks-to-pgvector |
"store in pgvector", "embed into PostgreSQL", "semantic search", "RAG with PostgreSQL" |
Installation
# Core (knowledge graph + chunking + HTML)
pip install malimgraph
# With Neo4j support
pip install "malimgraph[neo4j]"
# With Apache AGE support
pip install "malimgraph[age]"
# With pgvector + OpenAI embeddings
pip install "malimgraph[pgvector,openai]"
# With pgvector + Voyage AI embeddings
pip install "malimgraph[pgvector,voyage]"
# With local embeddings (no API key needed)
pip install "malimgraph[pgvector,local]"
# Everything
pip install "malimgraph[all]"
Environment Variables
ANTHROPIC_API_KEY=sk-ant-... # Required for LLM extraction
OPENAI_API_KEY=sk-... # Required for OpenAI embeddings
VOYAGE_API_KEY=pa-... # Required for Voyage AI embeddings
PGVECTOR_URI=postgresql://... # PostgreSQL connection for pgvector
NEO4J_URI=bolt://localhost:7687 # Neo4j connection
NEO4J_USER=neo4j
NEO4J_PASSWORD=yourpassword
AGE_CONNECTION_URI=host=... # Apache AGE connection
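Which variables are required depends on the path you take: LLM extraction through the CLI needs `ANTHROPIC_API_KEY`, and each embedding provider needs its own key except `local`. A small pre-flight check along these lines can surface a missing key before a long run. `missing_variables` is a hypothetical helper, not something the package is known to ship.

```python
import os

# Mapping taken from the variable list above; the "local" provider needs no key.
PROVIDER_KEYS = {"openai": "OPENAI_API_KEY", "voyage": "VOYAGE_API_KEY", "local": None}


def missing_variables(provider: str, use_llm: bool = True, env=None) -> list[str]:
    """Return the names of required environment variables that are unset or empty."""
    env = os.environ if env is None else env
    if provider not in PROVIDER_KEYS:
        raise ValueError(f"unknown provider: {provider}")
    required = ["ANTHROPIC_API_KEY"] if use_llm else []
    provider_key = PROVIDER_KEYS[provider]
    if provider_key:
        required.append(provider_key)
    return [name for name in required if not env.get(name)]


if __name__ == "__main__":
    print(missing_variables("openai"))
```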
Output Schema — knowledge_graph.json
Every entity and relationship carries full citation provenance:
| Field | Type | Description |
|---|---|---|
| `id` | string | Stable hash ID: `e_` + `MD5(type:label)[:8]` |
| `label` | string | Canonical entity name |
| `type` | string | Entity type (Organization, Person, Date, …) |
| `extraction_method` | enum | `rule` / `llm` / `hybrid` |
| `confidence` | enum | `high` / `medium` / `low` |
| `source_pages` | int[] | PDF page numbers where found |
| `source_text` | string | Primary verbatim supporting quote |
| `source_chunk_id` | string | Processing chunk ID |
| `citations[]` | object[] | All supporting quotes with page refs |
| `citation_count` | int | Stored as property in graph DBs |
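The `id` formula and the `hybrid` extraction method can be illustrated with a short sketch. It assumes the ID is the first 8 hex characters of the MD5 digest of the literal string `type:label`. Whether the package normalizes case or whitespace before hashing is not stated here, so treat that as an open detail. The citation keys `page` and `text` and both function names are hypothetical.

```python
import hashlib


def stable_entity_id(entity_type: str, label: str) -> str:
    """e_ + first 8 hex chars of MD5("type:label"), following the id field description."""
    digest = hashlib.md5(f"{entity_type}:{label}".encode("utf-8")).hexdigest()
    return "e_" + digest[:8]


def merge_entities(a: dict, b: dict) -> dict:
    """Merge two records of the same entity, accumulating de-duplicated citations."""
    if a["id"] != b["id"]:
        raise ValueError("records describe different entities")
    merged = dict(a)
    seen, citations = set(), []
    for citation in a["citations"] + b["citations"]:
        key = (citation["page"], citation["text"])  # assumed citation shape
        if key not in seen:
            seen.add(key)
            citations.append(citation)
    merged["citations"] = citations
    merged["citation_count"] = len(citations)
    merged["source_pages"] = sorted({c["page"] for c in citations})
    # An entity found by both the rule and the LLM extractor is marked hybrid.
    if a["extraction_method"] != b["extraction_method"]:
        merged["extraction_method"] = "hybrid"
    return merged
```

Because the ID depends only on type and label, re-running extraction on the same document yields the same node IDs, which keeps repeated loads into a graph database from creating duplicates.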
pgvector — Semantic Search Schema
Chunks are stored with embeddings in PostgreSQL, enabling semantic search:
-- Find chunks most similar to a query
SELECT chunk_text, source_file, page_numbers, heading_context,
1 - (embedding <=> '[...]'::vector) AS score
FROM document_chunks
ORDER BY embedding <=> '[...]'::vector
LIMIT 10;
-- Filter by document
SELECT * FROM document_chunks
WHERE document_id = 'annual_report_2024'
ORDER BY embedding <=> '[...]'::vector LIMIT 5;
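In these queries `<=>` is pgvector's cosine distance operator, so `1 - (embedding <=> query)` is the cosine similarity and ordering by ascending distance returns the most similar chunks first. The pure-Python sketch below reproduces that scoring on in-memory vectors for illustration only. `rank_chunks` and the dictionary layout are hypothetical, and a real deployment would leave this work to PostgreSQL and the HNSW index.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """Cosine distance, the quantity pgvector's <=> operator computes."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_a * norm_b)


def rank_chunks(query_vec: list[float], chunks: list[dict], top_k: int = 10) -> list[tuple]:
    """Return (score, chunk_text) pairs, highest cosine similarity first."""
    scored = [
        (1.0 - cosine_distance(query_vec, chunk["embedding"]), chunk["chunk_text"])
        for chunk in chunks
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


if __name__ == "__main__":
    chunks = [
        {"chunk_text": "risk factors", "embedding": [1.0, 0.0]},
        {"chunk_text": "board members", "embedding": [0.0, 1.0]},
    ]
    print(rank_chunks([1.0, 0.0], chunks, top_k=1))
```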
Supported embedding providers:
| Provider | Default model | Dimension | API key |
|---|---|---|---|
| `openai` | `text-embedding-3-small` | 1536-d | `OPENAI_API_KEY` |
| `voyage` | `voyage-3-large` | 1024-d | `VOYAGE_API_KEY` |
| `local` | `all-MiniLM-L6-v2` | 384-d | none (CPU) |
Database Setup
Neo4j
docker run -p 7474:7474 -p 7687:7687 \
-e NEO4J_AUTH=neo4j/yourpassword neo4j:latest
Apache AGE (PostgreSQL)
docker run -p 5432:5432 -e POSTGRES_PASSWORD=secret apache/age:latest
pgvector (PostgreSQL)
docker run -p 5432:5432 -e POSTGRES_PASSWORD=secret pgvector/pgvector:pg17
See docs/database-setup.md for full guides.
Contributing
- Fork the repo
- Create a feature branch: `git checkout -b feature/my-feature`
- Install dev deps: `pip install -e ".[dev]"`
- Run tests: `make test`
- Lint: `make lint`
- Submit a PR
Credits
Built by Malim AI Labs — AI-powered knowledge infrastructure for Southeast Asia.
Malim AI Labs Social Enterprise (003827047-U) · Kuala Lumpur, Malaysia
License
MIT — see LICENSE