
Transform PDF documents into structured knowledge graphs with citation provenance


MalimGraph

███╗   ███╗ █████╗ ██╗     ██╗███╗   ███╗ ██████╗ ██████╗  █████╗ ██████╗ ██╗  ██╗
████╗ ████║██╔══██╗██║     ██║████╗ ████║██╔════╝ ██╔══██╗██╔══██╗██╔══██╗██║  ██║
██╔████╔██║███████║██║     ██║██╔████╔██║██║  ███╗██████╔╝███████║██████╔╝███████║
██║╚██╔╝██║██╔══██║██║     ██║██║╚██╔╝██║██║   ██║██╔══██╗██╔══██║██╔═══╝ ██╔══██║
██║ ╚═╝ ██║██║  ██║███████╗██║██║ ╚═╝ ██║╚██████╔╝██║  ██║██║  ██║██║     ██║  ██║
╚═╝     ╚═╝╚═╝  ╚═╝╚══════╝╚═╝╚═╝     ╚═╝ ╚═════╝ ╚═╝  ╚═╝╚═╝  ╚═╝╚═╝     ╚═╝  ╚═╝

License: MIT · Python 3.10+ · MCP Compatible

From documents to knowledge graphs.

Transform PDF documents into structured knowledge graphs with full citation provenance. Every entity and relationship traces back to the exact PDF page and verbatim text that supports it.


Features

Tool                     Description
extract_knowledge_graph  Hybrid rule + LLM extraction → entities, relationships, citations
chunk_document           Token-aware overlapping chunks with heading context for RAG
render_document_html     Structured HTML with page anchors, entity annotations, TOC + search
manage_graph_db          Load, query, and manage graphs in Neo4j or PostgreSQL (Apache AGE)
embed_and_store_chunks   Embed chunks into PostgreSQL pgvector (OpenAI / Voyage / local)

Three ways to use:

  • MCP Server: connect to Claude Desktop, Claude Code, or claude.ai
  • CLI: malimgraph extract, chunk, render, db, vector
  • Claude Skills: 5 installable .skill packages for claude.ai

Quick Start

pip install malimgraph
export ANTHROPIC_API_KEY=sk-ant-...

# Extract knowledge graph
malimgraph extract --input report.pdf --output ./output/ --format all

# Chunk for RAG
malimgraph chunk --input report.pdf --output ./chunks/

# Embed chunks into pgvector
export PGVECTOR_URI="postgresql://user:pass@localhost:5432/mydb"
export OPENAI_API_KEY=sk-...
malimgraph vector load --input ./chunks/chunks.json

# Render as browsable HTML
malimgraph render --input report.pdf --output document.html

How It Works

PDF
 │
 ▼
pdf_reader.py ──────────────────────────────────────────────┐
 │  (PyMuPDF: text, headings, tables, page structure)       │
 ├──────────────────────────────────┐                        │
 ▼                                  ▼                        ▼
rule_extractor.py              llm_extractor.py          chunker.py
 │ (regex: dates, amounts,      │ (Anthropic API:         │ (sliding window
 │  emails, legal refs,         │  semantic entities,     │  with heading
 │  section numbers)            │  relationships,         │  context)
 │                              │  source_text required)  │
 └──────────────┬───────────────┘                         │
                ▼                                          ▼
          graph_builder.py                          embedder.py
           │ (merge + dedup:                        │ (OpenAI / Voyage /
           │  hybrid method,                        │  local sentence-
           │  citation accumulation,                │  transformers)
           │  stable IDs)                           │
           ▼                                        ▼
     knowledge_graph.json                    vector_client.py
           │                                 (pgvector: HNSW index,
     ┌─────┴──────┐                           cosine similarity search)
     ▼             ▼
 cypher.py     age_sql.py
 (.cypher)      (.sql)
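
The merge step in the diagram (dedup, hybrid method, citation accumulation) can be illustrated with a small sketch. This is not the code in graph_builder.py; the function name `merge_entities`, the input dict shape, and the choice to deduplicate on a lowercased label are all assumptions made for illustration.

```python
def merge_entities(rule_entities, llm_entities):
    """Merge two entity lists, deduplicating on (type, lowercased label).

    Hypothetical input shape: each entity is a dict with the keys
    "type", "label", "page" and "source_text".
    """
    merged = {}  # dicts keep insertion order, so output order is stable
    for method, entities in (("rule", rule_entities), ("llm", llm_entities)):
        for ent in entities:
            key = (ent["type"], ent["label"].strip().lower())
            citation = {"page": ent["page"], "text": ent["source_text"]}
            if key not in merged:
                merged[key] = {
                    "type": ent["type"],
                    "label": ent["label"].strip(),
                    "extraction_method": method,
                    "citations": [citation],
                }
                continue
            node = merged[key]
            # Found by both extractors: mark the entity as hybrid.
            if node["extraction_method"] != method:
                node["extraction_method"] = "hybrid"
            # Accumulate citations, skipping exact repeats.
            if citation not in node["citations"]:
                node["citations"].append(citation)
    for node in merged.values():
        node["source_pages"] = sorted({c["page"] for c in node["citations"]})
        node["citation_count"] = len(node["citations"])
    return list(merged.values())


rule_hits = [{"type": "Organization", "label": "Acme Corp", "page": 2,
              "source_text": "Acme Corp was founded in 1999."}]
llm_hits = [{"type": "Organization", "label": "acme corp", "page": 5,
             "source_text": "acme corp acquired the subsidiary."}]
graph_nodes = merge_entities(rule_hits, llm_hits)
```

Here the two hits collapse into one node that keeps both quotes and both page numbers, which is the property the provenance fields in the output schema depend on.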

Three Ways to Use

MCP Server

# stdio (for Claude Desktop / Claude Code)
malimgraph serve

# HTTP (for remote connections / claude.ai)
malimgraph serve --transport http --port 8080

Claude Desktop config (claude_desktop_config.json):

{
  "mcpServers": {
    "malimgraph": {
      "command": "malimgraph",
      "args": ["serve"],
      "env": { "ANTHROPIC_API_KEY": "sk-ant-..." }
    }
  }
}

Claude Code:

claude mcp add malimgraph -- malimgraph serve

CLI

# Extract knowledge graph from PDF
malimgraph extract \
  --input report.pdf \
  --output ./output/ \
  --entity-types auto \
  --format all \
  --graph-name my_graph

# Chunk for embeddings
malimgraph chunk \
  --input report.pdf \
  --output ./chunks/ \
  --chunk-size 512 \
  --overlap 64 \
  --format json

# Embed chunks into PostgreSQL pgvector
malimgraph vector load \
  --input ./chunks/chunks.json \
  --uri "postgresql://user:pass@localhost:5432/mydb" \
  --provider openai \
  --table document_chunks

# Semantic search over embedded chunks
malimgraph vector search \
  --query "What are the financial risks?" \
  --uri "postgresql://user:pass@localhost:5432/mydb" \
  --top-k 5

# Render as browsable HTML
malimgraph render \
  --input report.pdf \
  --output document.html \
  --knowledge-graph ./output/knowledge_graph.json

# Load into Neo4j
malimgraph db load \
  --input ./output/knowledge_graph.json \
  --target neo4j \
  --uri bolt://localhost:7687 \
  --user neo4j \
  --password secret

# Query the graph
malimgraph db query \
  --target neo4j \
  --uri bolt://localhost:7687 \
  --query "MATCH (n:Organization) RETURN n.label, n.source_pages LIMIT 10"
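
To make the --chunk-size and --overlap flags used above concrete, here is a minimal sliding-window sketch. It assumes the input is already a list of tokens; the function name `chunk_tokens` is hypothetical, and the tokenizer and heading-context handling that malimgraph actually uses are not shown in this README and are not modeled here.

```python
def chunk_tokens(tokens, chunk_size=512, overlap=64):
    """Split a token list into windows of chunk_size that share overlap tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a window has reached the end of the document.
        if start + chunk_size >= len(tokens):
            break
    return chunks


# Stand-in tokens; a real pipeline would use a model tokenizer.
sample_tokens = [f"t{i}" for i in range(1000)]
sample_chunks = chunk_tokens(sample_tokens, chunk_size=512, overlap=64)
```

With these settings each window starts 448 tokens after the previous one, so the last 64 tokens of one chunk reappear as the first 64 tokens of the next.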

Claude Skills

Download .skill files from GitHub Releases and install in claude.ai → Settings → Skills.

Skill                   Trigger phrases
pdf-to-knowledge-graph  "knowledge graph", "extract entities", "PDF to Cypher"
pdf-to-chunks           "chunk document", "split for embeddings", "RAG chunks"
document-to-html        "convert PDF to HTML", "render document", "make PDF browsable"
graph-db-admin          "load into Neo4j", "Cypher query", "graph statistics"
chunks-to-pgvector      "store in pgvector", "embed into PostgreSQL", "semantic search", "RAG with PostgreSQL"

Installation

# Core (knowledge graph + chunking + HTML)
pip install malimgraph

# With Neo4j support
pip install "malimgraph[neo4j]"

# With Apache AGE support
pip install "malimgraph[age]"

# With pgvector + OpenAI embeddings
pip install "malimgraph[pgvector,openai]"

# With pgvector + Voyage AI embeddings
pip install "malimgraph[pgvector,voyage]"

# With local embeddings (no API key needed)
pip install "malimgraph[pgvector,local]"

# Everything
pip install "malimgraph[all]"

Environment Variables

ANTHROPIC_API_KEY=sk-ant-...      # Required for LLM extraction
OPENAI_API_KEY=sk-...             # Required for OpenAI embeddings
VOYAGE_API_KEY=pa-...             # Required for Voyage AI embeddings
PGVECTOR_URI=postgresql://...     # PostgreSQL connection for pgvector
NEO4J_URI=bolt://localhost:7687   # Neo4j connection
NEO4J_USER=neo4j
NEO4J_PASSWORD=yourpassword
AGE_CONNECTION_URI=host=...       # Apache AGE connection
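
A small pre-flight check can report which of the variables above are unset before a long run starts. This is a sketch only: the feature names and the grouping in `REQUIRED_BY_FEATURE` are assumptions derived from the comments above, not something malimgraph exposes.

```python
import os

# Hypothetical grouping of the documented variables by feature.
REQUIRED_BY_FEATURE = {
    "llm_extraction": ["ANTHROPIC_API_KEY"],
    "openai_embeddings": ["OPENAI_API_KEY"],
    "voyage_embeddings": ["VOYAGE_API_KEY"],
    "pgvector": ["PGVECTOR_URI"],
    "neo4j": ["NEO4J_URI", "NEO4J_USER", "NEO4J_PASSWORD"],
    "age": ["AGE_CONNECTION_URI"],
}


def missing_vars(feature, env=None):
    """Return the variables for a feature that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_BY_FEATURE[feature] if not env.get(name)]


# Passing a dict instead of os.environ keeps the check easy to test.
example_missing = missing_vars("neo4j", env={"NEO4J_URI": "bolt://localhost:7687"})
```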

Output Schema — knowledge_graph.json

Every entity and relationship carries full citation provenance:

Field              Type      Description
id                 string    Stable hash ID: e_ + MD5(type:label)[:8]
label              string    Canonical entity name
type               string    Entity type (Organization, Person, Date, …)
extraction_method  enum      rule / llm / hybrid
confidence         enum      high / medium / low
source_pages       int[]     PDF page numbers where found
source_text        string    Primary verbatim supporting quote
source_chunk_id    string    Processing chunk ID
citations[]        object[]  All supporting quotes with page refs
citation_count     int       Stored as property in graph DBs
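
The id formula in the table, e_ + MD5(type:label)[:8], can be read as the sketch below. It assumes the hash input is the UTF-8 string "type:label" and that the first 8 characters of the hex digest are kept; whether malimgraph normalizes case or whitespace in the label first is not stated here, so this sketch does no normalization. The function name `stable_entity_id` is hypothetical.

```python
import hashlib


def stable_entity_id(entity_type, label):
    """Build an ID of the form e_ + first 8 hex chars of MD5("type:label")."""
    payload = f"{entity_type}:{label}".encode("utf-8")
    digest = hashlib.md5(payload).hexdigest()
    return "e_" + digest[:8]


first_id = stable_entity_id("Organization", "Acme Corp")
second_id = stable_entity_id("Organization", "Acme Corp")
other_id = stable_entity_id("Person", "Acme Corp")
```

Because the ID depends only on type and label, re-running extraction on the same document yields the same node IDs, which is what lets repeated loads update nodes instead of duplicating them.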

pgvector — Semantic Search Schema

Chunks are stored with embeddings in PostgreSQL, enabling semantic search:

-- Find chunks most similar to a query
SELECT chunk_text, source_file, page_numbers, heading_context,
       1 - (embedding <=> '[...]'::vector) AS score
FROM document_chunks
ORDER BY embedding <=> '[...]'::vector
LIMIT 10;

-- Filter by document
SELECT * FROM document_chunks
WHERE document_id = 'annual_report_2024'
ORDER BY embedding <=> '[...]'::vector LIMIT 5;
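
In pgvector, <=> is the cosine distance operator, so the score column in the first query is cosine similarity (1 minus distance). The pure-Python sketch below reproduces that arithmetic on tiny vectors to show how the ranking works; the helper names and sample rows are illustrative only and do not touch a database.

```python
import math


def cosine_distance(a, b):
    """Cosine distance between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_a * norm_b)


def rank_chunks(query_vec, rows, top_k=10):
    """Score (chunk_text, embedding) rows as 1 - distance, best first."""
    scored = [(1.0 - cosine_distance(query_vec, emb), text) for text, emb in rows]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Made-up 2-d embeddings; real ones have hundreds of dimensions.
sample_rows = [
    ("unrelated chunk", [0.0, 1.0]),
    ("matching chunk", [2.0, 0.0]),
]
ranked = rank_chunks([1.0, 0.0], sample_rows, top_k=2)
```

Note that the score ignores vector length: the "matching chunk" embedding is twice as long as the query but still scores as a perfect match because it points in the same direction.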

Supported embedding providers:

Provider  Default model           Dimension  API key
openai    text-embedding-3-small  1536-d     OPENAI_API_KEY
voyage    voyage-3-large          1024-d     VOYAGE_API_KEY
local     all-MiniLM-L6-v2        384-d      none (CPU)
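
A pgvector column declared as vector(N) only accepts vectors of exactly N dimensions, so a table built with one provider cannot take embeddings from another without being recreated. A guard along these lines can catch a mismatch before the insert; the dimensions come from the table above, while `EMBEDDING_DIMS` and `check_dimension` are hypothetical names.

```python
# Default dimensions per provider, as listed in the table above.
EMBEDDING_DIMS = {"openai": 1536, "voyage": 1024, "local": 384}


def check_dimension(provider, embedding):
    """Raise if an embedding does not match the provider's default dimension."""
    expected = EMBEDDING_DIMS[provider]
    if len(embedding) != expected:
        raise ValueError(
            f"{provider} embeddings should have {expected} dimensions, "
            f"got {len(embedding)}"
        )
    return expected


local_dim = check_dimension("local", [0.0] * 384)
```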

Database Setup

Neo4j

docker run -p 7474:7474 -p 7687:7687 \
  -e NEO4J_AUTH=neo4j/yourpassword neo4j:latest

Apache AGE (PostgreSQL)

docker run -p 5432:5432 -e POSTGRES_PASSWORD=secret apache/age:latest

pgvector (PostgreSQL)

docker run -p 5432:5432 -e POSTGRES_PASSWORD=secret pgvector/pgvector:pg17

See docs/database-setup.md for full guides.


Contributing

  1. Fork the repo
  2. Create a feature branch: git checkout -b feature/my-feature
  3. Install dev deps: pip install -e ".[dev]"
  4. Run tests: make test
  5. Lint: make lint
  6. Submit a PR

Credits

Built by Malim AI Labs — AI-powered knowledge infrastructure for Southeast Asia.

Malim AI Labs Social Enterprise (003827047-U) · Kuala Lumpur, Malaysia


License

MIT — see LICENSE
