Transform PDF documents into structured knowledge graphs with citation provenance

MalimGraph

███╗   ███╗ █████╗ ██╗     ██╗███╗   ███╗ ██████╗ ██████╗  █████╗ ██████╗ ██╗  ██╗
████╗ ████║██╔══██╗██║     ██║████╗ ████║██╔════╝ ██╔══██╗██╔══██╗██╔══██╗██║  ██║
██╔████╔██║███████║██║     ██║██╔████╔██║██║  ███╗██████╔╝███████║██████╔╝███████║
██║╚██╔╝██║██╔══██║██║     ██║██║╚██╔╝██║██║   ██║██╔══██╗██╔══██║██╔═══╝ ██╔══██║
██║ ╚═╝ ██║██║  ██║███████╗██║██║ ╚═╝ ██║╚██████╔╝██║  ██║██║  ██║██║     ██║  ██║
╚═╝     ╚═╝╚═╝  ╚═╝╚══════╝╚═╝╚═╝     ╚═╝ ╚═════╝ ╚═╝  ╚═╝╚═╝  ╚═╝╚═╝     ╚═╝  ╚═╝

From documents to knowledge graphs.

Transform PDF documents into structured knowledge graphs with full citation provenance. Every entity and relationship traces back to the exact PDF page and verbatim text that supports it.

Features

Tool	Description
`extract_knowledge_graph`	Hybrid rule + LLM extraction → entities, relationships, citations
`chunk_document`	Token-aware overlapping chunks with heading context for RAG
`render_document_html`	Structured HTML with page anchors, entity annotations, TOC + search
`manage_graph_db`	Load, query, and manage graphs in Neo4j or PostgreSQL (Apache AGE)
`embed_and_store_chunks`	Embed chunks into PostgreSQL pgvector (OpenAI / Voyage / local)

Three ways to use:

MCP Server — connect to Claude Desktop, Claude Code, or claude.ai
CLI — malimgraph extract, chunk, render, db, vector
Claude Skills — 5 installable .skill packages for claude.ai

Quick Start

Claude Code / Claude Desktop (no API key needed)

pip install malimgraph
claude mcp add malimgraph -- malimgraph-plugin

Restart Claude Code, then just ask:

"Extract a knowledge graph from report.pdf and save to ./output/"

Claude reads the PDF, performs the entity extraction itself, and saves the graph, so no ANTHROPIC_API_KEY is required.

CLI (standalone, requires API key)

pip install malimgraph
export ANTHROPIC_API_KEY=sk-ant-...

malimgraph extract --input report.pdf --output ./output/ --format all
malimgraph chunk --input report.pdf --output ./chunks/
malimgraph render --input report.pdf --output document.html

How It Works

PDF
 │
 ▼
pdf_reader.py ──────────────────────────────────────────────┐
 │  (PyMuPDF: text, headings, tables, page structure)       │
 ├──────────────────────────────────┐                        │
 ▼                                  ▼                        ▼
rule_extractor.py              llm_extractor.py          chunker.py
 │ (regex: dates, amounts,      │ (Anthropic API:         │ (sliding window
 │  emails, legal refs,         │  semantic entities,     │  with heading
 │  section numbers)            │  relationships,         │  context)
 │                              │  source_text required)  │
 └──────────────┬───────────────┘                         │
                ▼                                          ▼
          graph_builder.py                          embedder.py
           │ (merge + dedup:                        │ (OpenAI / Voyage /
           │  hybrid method,                        │  local sentence-
           │  citation accumulation,                │  transformers)
           │  stable IDs)                           │
           ▼                                        ▼
     knowledge_graph.json                    vector_client.py
           │                                 (pgvector: HNSW index,
     ┌─────┴──────┐                           cosine similarity search)
     ▼             ▼
 cypher.py     age_sql.py
 (.cypher)      (.sql)
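
The chunker step in the diagram is described as a sliding window with heading context. The following is a minimal sketch of that idea, not MalimGraph's actual implementation. Whitespace-separated words stand in for tokenizer tokens, the `Chunk` class, the `sliding_window_chunks` function, and the `c_0000`-style IDs are hypothetical, and the 512/64 sizes are borrowed from the CLI example.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    chunk_id: str       # hypothetical ID format, for illustration only
    heading: str        # nearest heading, carried along as retrieval context
    tokens: list[str]


def sliding_window_chunks(tokens, heading="", chunk_size=512, overlap=64):
    """Split a token list into overlapping windows of at most chunk_size tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # each window starts this many tokens after the last
    chunks = []
    for index, start in enumerate(range(0, len(tokens), step)):
        window = tokens[start:start + chunk_size]
        chunks.append(Chunk(f"c_{index:04d}", heading, window))
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the text
    return chunks


# Assumption: simple placeholder words instead of real tokenizer output.
words = [f"w{i}" for i in range(1000)]
chunks = sliding_window_chunks(words, heading="1. Introduction")
print(len(chunks), [len(c.tokens) for c in chunks])  # 3 [512, 512, 104]
```

Consecutive chunks share their last and first 64 tokens, so a sentence that straddles a boundary still appears whole in at least one chunk.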

Three Ways to Use

Claude Code Plugin (recommended — no API key)

claude mcp add malimgraph -- malimgraph-plugin

Claude Desktop (claude_desktop_config.json):

{
  "mcpServers": {
    "malimgraph": {
      "command": "malimgraph-plugin"
    }
  }
}

Claude uses its own subscription to extract entities — no ANTHROPIC_API_KEY needed. See docs/claude-code-plugin.md for full details.

MCP Server (standalone / HTTP)

# stdio — Claude Desktop / Claude Code (with API key for LLM extraction)
malimgraph serve

# HTTP — remote connections / claude.ai
malimgraph serve --transport http --port 8080

CLI

# Extract knowledge graph from PDF
malimgraph extract \
  --input report.pdf \
  --output ./output/ \
  --entity-types auto \
  --format all \
  --graph-name my_graph

# Chunk for embeddings
malimgraph chunk \
  --input report.pdf \
  --output ./chunks/ \
  --chunk-size 512 \
  --overlap 64 \
  --format json

# Embed chunks into PostgreSQL pgvector
malimgraph vector load \
  --input ./chunks/chunks.json \
  --uri "postgresql://user:pass@localhost:5432/mydb" \
  --provider openai \
  --table document_chunks

# Semantic search over embedded chunks
malimgraph vector search \
  --query "What are the financial risks?" \
  --uri "postgresql://user:pass@localhost:5432/mydb" \
  --top-k 5

# Render as browsable HTML
malimgraph render \
  --input report.pdf \
  --output document.html \
  --knowledge-graph ./output/knowledge_graph.json

# Load into Neo4j
malimgraph db load \
  --input ./output/knowledge_graph.json \
  --target neo4j \
  --uri bolt://localhost:7687 \
  --user neo4j \
  --password secret

# Query the graph
malimgraph db query \
  --target neo4j \
  --uri bolt://localhost:7687 \
  --query "MATCH (n:Organization) RETURN n.label, n.source_pages LIMIT 10"

Claude Skills

Download .skill files from GitHub Releases and install in claude.ai → Settings → Skills.

Skill	Trigger phrases
`pdf-to-knowledge-graph`	"knowledge graph", "extract entities", "PDF to Cypher"
`pdf-to-chunks`	"chunk document", "split for embeddings", "RAG chunks"
`document-to-html`	"convert PDF to HTML", "render document", "make PDF browsable"
`graph-db-admin`	"load into Neo4j", "Cypher query", "graph statistics"
`chunks-to-pgvector`	"store in pgvector", "embed into PostgreSQL", "semantic search", "RAG with PostgreSQL"

Installation

# Core (knowledge graph + chunking + HTML)
pip install malimgraph

# With Neo4j support
pip install "malimgraph[neo4j]"

# With Apache AGE support
pip install "malimgraph[age]"

# With pgvector + OpenAI embeddings
pip install "malimgraph[pgvector,openai]"

# With pgvector + Voyage AI embeddings
pip install "malimgraph[pgvector,voyage]"

# With local embeddings (no API key needed)
pip install "malimgraph[pgvector,local]"

# Everything
pip install "malimgraph[all]"

Environment Variables

ANTHROPIC_API_KEY=sk-ant-...      # Required for LLM extraction via the CLI or standalone server (not the Claude Code plugin)
OPENAI_API_KEY=sk-...             # Required for OpenAI embeddings
VOYAGE_API_KEY=pa-...             # Required for Voyage AI embeddings
PGVECTOR_URI=postgresql://...     # PostgreSQL connection for pgvector
NEO4J_URI=bolt://localhost:7687   # Neo4j connection
NEO4J_USER=neo4j
NEO4J_PASSWORD=yourpassword
AGE_CONNECTION_URI=host=...       # Apache AGE connection
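
A small preflight check can report which of these variables are missing before a long job starts. This is a sketch only: the grouping of variables by feature and the names `REQUIRED_BY_FEATURE` and `missing_variables` are assumptions made for illustration and are not part of MalimGraph.

```python
import os

# Assumed grouping of the variables listed above; adjust to your own setup.
REQUIRED_BY_FEATURE = {
    "llm_extraction": ["ANTHROPIC_API_KEY"],
    "openai_embeddings": ["OPENAI_API_KEY", "PGVECTOR_URI"],
    "voyage_embeddings": ["VOYAGE_API_KEY", "PGVECTOR_URI"],
    "neo4j": ["NEO4J_URI", "NEO4J_USER", "NEO4J_PASSWORD"],
    "age": ["AGE_CONNECTION_URI"],
}


def missing_variables(feature, env=None):
    """Return the variables for a feature that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_BY_FEATURE[feature] if not env.get(name)]


# Example with an explicit mapping instead of the real environment.
example_env = {"NEO4J_URI": "bolt://localhost:7687", "NEO4J_USER": "neo4j"}
print(missing_variables("neo4j", example_env))  # ['NEO4J_PASSWORD']
```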

Output Schema — `knowledge_graph.json`

Every entity and relationship carries full citation provenance:

Field	Type	Description
`id`	string	Stable hash ID: `e_` + MD5(type:label)[:8]
`label`	string	Canonical entity name
`type`	string	Entity type (Organization, Person, Date, …)
`extraction_method`	enum	`rule` / `llm` / `hybrid`
`confidence`	enum	`high` / `medium` / `low`
`source_pages`	int[]	PDF page numbers where found
`source_text`	string	Primary verbatim supporting quote
`source_chunk_id`	string	Processing chunk ID
`citations[]`	object[]	All supporting quotes with page refs
`citation_count`	int	Number of entries in `citations[]`; stored as a property in graph DBs
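
The `id` rule and the citation fields can be illustrated with a short sketch. Everything beyond the `e_` + MD5(type:label)[:8] formula is an assumption: that the digest is hexadecimal and the key is UTF-8 encoded without further normalization, that citation objects use `page` and `text` keys, and that an entity found by both extractors is marked `hybrid`. The helper names and the sample entity are hypothetical.

```python
import hashlib


def stable_entity_id(entity_type: str, label: str) -> str:
    """'e_' plus the first 8 hex characters of MD5('type:label')."""
    key = f"{entity_type}:{label}"
    return "e_" + hashlib.md5(key.encode("utf-8")).hexdigest()[:8]


def merge_entity(graph, entity_type, label, method, page, quote):
    """Add an entity, or accumulate a citation on the existing one with the same ID."""
    entity_id = stable_entity_id(entity_type, label)
    citation = {"page": page, "text": quote}  # assumed citation shape
    node = graph.get(entity_id)
    if node is None:
        node = graph[entity_id] = {
            "id": entity_id, "label": label, "type": entity_type,
            "extraction_method": method,
            "source_pages": [], "source_text": quote, "citations": [],
        }
    elif node["extraction_method"] != method:
        node["extraction_method"] = "hybrid"  # seen by both rule and LLM passes
    if citation not in node["citations"]:
        node["citations"].append(citation)
    if page not in node["source_pages"]:
        node["source_pages"] = sorted(node["source_pages"] + [page])
    node["citation_count"] = len(node["citations"])
    return node


graph = {}
merge_entity(graph, "Organization", "Acme Corp", "rule", 3, "Acme Corp was founded in 1999.")
node = merge_entity(graph, "Organization", "Acme Corp", "llm", 7, "Acme Corp acquired a supplier.")
print(node["id"], node["extraction_method"], node["source_pages"], node["citation_count"])
```

Because the ID depends only on type and label, the same entity extracted from different pages or by different methods lands on one node, and its citations accumulate instead of creating duplicates.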

pgvector — Semantic Search Schema

Chunks are stored with embeddings in PostgreSQL, enabling semantic search:

-- Find chunks most similar to a query
SELECT chunk_text, source_file, page_numbers, heading_context,
       1 - (embedding <=> '[...]'::vector) AS score
FROM document_chunks
ORDER BY embedding <=> '[...]'::vector
LIMIT 10;

-- Filter by document
SELECT * FROM document_chunks
WHERE document_id = 'annual_report_2024'
ORDER BY embedding <=> '[...]'::vector LIMIT 5;
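
In these queries `<=>` is pgvector's cosine distance operator, so `1 - (embedding <=> query)` is the cosine similarity, and ordering by ascending distance is the same as ordering by descending score. The pure-Python sketch below reproduces that ranking on toy 3-dimensional vectors; the function names and sample rows are made up for illustration and no database is involved.

```python
import math


def cosine_distance(a, b):
    """Same quantity pgvector's <=> operator returns: 1 - cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def top_k(query, rows, k=10):
    """Rank (chunk_text, embedding) rows by score = 1 - distance, best first."""
    scored = [(1.0 - cosine_distance(query, emb), text) for text, emb in rows]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


# Toy data: real embeddings have hundreds of dimensions.
rows = [
    ("office locations", [0.0, 1.0, 0.0]),
    ("revenue risk", [1.0, 0.0, 0.0]),
    ("currency exposure", [1.0, 1.0, 0.0]),
]
results = top_k([1.0, 0.0, 0.0], rows, k=2)
for score, text in results:
    print(f"{score:.3f}  {text}")
```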

Supported embedding providers:

Provider	Default model	Dimension	API key
`openai`	`text-embedding-3-small`	1536-d	`OPENAI_API_KEY`
`voyage`	`voyage-3-large`	1024-d	`VOYAGE_API_KEY`
`local`	`all-MiniLM-L6-v2`	384-d	none (CPU)
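
A pgvector column declared as `vector(n)` only accepts vectors of exactly `n` dimensions, so embeddings from different providers in this table cannot share one column. A minimal sketch of a guard for that, using the dimensions listed above; the helper names are hypothetical and not part of MalimGraph.

```python
# Dimensions of the default models from the table above.
EMBEDDING_DIMENSIONS = {"openai": 1536, "voyage": 1024, "local": 384}


def column_type(provider):
    """pgvector column type matching a provider's default model."""
    return f"vector({EMBEDDING_DIMENSIONS[provider]})"


def check_dimension(provider, vector):
    """Raise before insertion if a vector does not match the provider's dimension."""
    expected = EMBEDDING_DIMENSIONS[provider]
    if len(vector) != expected:
        raise ValueError(
            f"{provider} embeddings should have {expected} dimensions, got {len(vector)}"
        )
    return expected


print(column_type("local"))                   # vector(384)
print(check_dimension("local", [0.0] * 384))  # 384
```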

Database Setup

Neo4j

docker run -p 7474:7474 -p 7687:7687 \
  -e NEO4J_AUTH=neo4j/yourpassword neo4j:latest

Apache AGE (PostgreSQL)

docker run -p 5432:5432 -e POSTGRES_PASSWORD=secret apache/age:latest

pgvector (PostgreSQL)

docker run -p 5432:5432 -e POSTGRES_PASSWORD=secret pgvector/pgvector:pg17

See docs/database-setup.md for full guides.

Contributing

Fork the repo
Create a feature branch: git checkout -b feature/my-feature
Install dev deps: pip install -e ".[dev]"
Run tests: make test
Lint: make lint
Submit a PR

Credits

Built by Malim AI Labs — AI-powered knowledge infrastructure for Southeast Asia.

Malim AI Labs Social Enterprise (003827047-U) · Kuala Lumpur, Malaysia

License

MIT — see LICENSE

Release history

Version	Released
0.2.0	May 8, 2026
0.1.6	May 7, 2026
0.1.5	May 7, 2026
0.1.4	May 7, 2026
0.1.2 (this version)	May 7, 2026
0.1.1	May 7, 2026

Download files

Source Distribution

malimgraph-0.1.2.tar.gz (72.1 kB)

Uploaded May 7, 2026 Source

Built Distribution

malimgraph-0.1.2-py3-none-any.whl (48.2 kB)

Uploaded May 7, 2026 Python 3

File details

Details for the file malimgraph-0.1.2.tar.gz.

File metadata

Download URL: malimgraph-0.1.2.tar.gz
Upload date: May 7, 2026
Size: 72.1 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for malimgraph-0.1.2.tar.gz
Algorithm	Hash digest
SHA256	`fb9ff5563b2041b10abc4c223b6573238f30a92d7a31f87029d96dfb3869465c`
MD5	`5e98b8f9f9e0ad654a409b130c3c306c`
BLAKE2b-256	`041bd760daf3d91abf15994ad3837d1d7f3d49223791fa71ec2168bad5506daa`

Provenance

The following attestation bundles were made for malimgraph-0.1.2.tar.gz:

Publisher: publish.yml on malim-ai-labs/malim-graph-plugin

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: malimgraph-0.1.2.tar.gz
- Subject digest: fb9ff5563b2041b10abc4c223b6573238f30a92d7a31f87029d96dfb3869465c
- Sigstore transparency entry: 1458739204
- Sigstore integration time: May 7, 2026
Source repository:
- Permalink: malim-ai-labs/malim-graph-plugin@99cac9de55bb1403b0cf6bf0df1a1ac74b3a729d
- Branch / Tag: refs/tags/v0.1.2
- Owner: https://github.com/malim-ai-labs
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@99cac9de55bb1403b0cf6bf0df1a1ac74b3a729d
- Trigger Event: push

File details

Details for the file malimgraph-0.1.2-py3-none-any.whl.

File metadata

Download URL: malimgraph-0.1.2-py3-none-any.whl
Upload date: May 7, 2026
Size: 48.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for malimgraph-0.1.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`54af2bcfc69177ec8a4a9591f7dbc5c92358ab9e49b4dc0f57e7ef50cd5d909b`
MD5	`973a2e037bc2e81c4074f05431975170`
BLAKE2b-256	`d89dfa68eb9134d8b4a0b31b7f4534b9814c57af012c67e7e6c5ef8006df38a2`

Provenance

The following attestation bundles were made for malimgraph-0.1.2-py3-none-any.whl:

Publisher: publish.yml on malim-ai-labs/malim-graph-plugin

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: malimgraph-0.1.2-py3-none-any.whl
- Subject digest: 54af2bcfc69177ec8a4a9591f7dbc5c92358ab9e49b4dc0f57e7ef50cd5d909b
- Sigstore transparency entry: 1458739340
- Sigstore integration time: May 7, 2026
Source repository:
- Permalink: malim-ai-labs/malim-graph-plugin@99cac9de55bb1403b0cf6bf0df1a1ac74b3a729d
- Branch / Tag: refs/tags/v0.1.2
- Owner: https://github.com/malim-ai-labs
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@99cac9de55bb1403b0cf6bf0df1a1ac74b3a729d
- Trigger Event: push
