A tool to convert documents into knowledge graphs using Docling.
Docling Graph
Docling-Graph converts documents into validated Pydantic objects and then into a directed knowledge graph, with exports to CSV or Cypher and both static and interactive visualizations.
This turns unstructured documents into validated knowledge graphs with precise semantic relationships. That precision is essential in complex domains such as chemistry, finance, and law, where AI systems must understand exact entity connections (e.g., chemical compounds and their reactions, financial instruments and their dependencies, physical properties and their measurements) rather than approximate text vectors. The result is explainable reasoning over technical document collections.
The toolkit supports two extraction families: local VLM extraction via Docling, and LLM-based extraction via local runtimes (vLLM, Ollama) or API providers (Mistral, OpenAI, Gemini, IBM WatsonX). Both are orchestrated by a flexible, config-driven pipeline.
Key Capabilities
- 🧠 Extraction:
  - Local VLM (Docling's information extraction pipeline, ideal for small documents with a key-value focus)
  - LLM (local via vLLM/Ollama or remote via Mistral/OpenAI/Gemini/IBM WatsonX API)
  - Hybrid Chunking: leverages Docling's segmentation with semantic LLM chunking for more context-aware extraction
  - Page-wise or whole-document conversion strategies for flexible processing
- 🔨 Graph Construction:
  - Markdown to Graph: Convert validated Pydantic instances to a NetworkX DiGraph with rich edge metadata and stable node IDs
  - Smart Merge: Combine multi-page documents into a single Pydantic instance for unified processing
  - Modular graph module with enhanced type safety and configuration
- 📦 Export:
  - Docling Document exports (JSON format with full document structure)
  - Markdown exports (full document and per-page options)
  - CSV compatible with Neo4j admin import
  - Cypher script generation for bulk ingestion
  - JSON export for general-purpose graph data
- 📊 Visualization:
  - Interactive HTML visualization in full-page browser view with enhanced node/edge exploration
  - Detailed Markdown report with graph node content and edges
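To make the Cypher export idea concrete, here is a minimal sketch of generating a bulk-ingestion script from a graph. It is not Docling-Graph's exporter: the function names `to_cypher` and `_props` and the plain-dict node/edge layout are hypothetical, and JSON string quoting is used as an approximation of Cypher literal escaping.

```python
import json


def _props(d):
    """Render a dict as a Cypher map literal, e.g. {name: "Acme"}.

    Assumes keys are valid Cypher identifiers and values are JSON-serializable scalars.
    """
    return "{" + ", ".join(f"{k}: {json.dumps(v)}" for k, v in sorted(d.items())) + "}"


def to_cypher(nodes, edges):
    """Build one MERGE statement per node, then one per edge.

    nodes: list of {"id": str, "label": str, "props": dict}   (hypothetical layout)
    edges: list of {"source": str, "target": str, "label": str}
    """
    lines = []
    for node in nodes:
        node_id = json.dumps(node["id"])
        lines.append(
            f"MERGE (n:{node['label']} {{id: {node_id}}}) SET n += {_props(node['props'])};"
        )
    for edge in edges:
        src = json.dumps(edge["source"])
        dst = json.dumps(edge["target"])
        lines.append(
            f"MATCH (a {{id: {src}}}), (b {{id: {dst}}}) "
            f"MERGE (a)-[:{edge['label']}]->(b);"
        )
    return "\n".join(lines)
```

Using `MERGE` rather than `CREATE` keeps the script idempotent, so running it twice should not duplicate nodes that share an `id`.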
Coming Soon
- 🪜 Multi-Stage Extraction: Define extraction_stage in templates to control multi-pass extraction.
- 🧩 Interactive Template Builder: Guided workflows for building Pydantic templates.
- 🧬 Ontology-Based Templates: Match content to the best Pydantic template using semantic similarity.
- ✍🏻 Flexible Inputs: Accepts text, markdown, and DoclingDocument directly.
- Batch Optimization: Faster GPU inference with better memory handling.
- 💾 Graph Database Integration: Export data straight into Neo4j, ArangoDB, and similar databases.
Quick Start
Requirements
- Python 3.10 or higher
- uv package manager
Installation
# Clone the repository
git clone https://github.com/IBM/docling-graph
cd docling-graph
# Install with uv (choose your option)
uv sync # Minimal: Core + VLM only
uv sync --extra all # Full: All features
uv sync --extra local # Local LLM (vLLM, Ollama)
uv sync --extra remote # Remote APIs (Mistral, OpenAI, Gemini)
uv sync --extra watsonx # IBM WatsonX support
For detailed installation instructions, see Installation Guide.
API Key Setup (Remote Inference)
export OPENAI_API_KEY="..." # OpenAI
export MISTRAL_API_KEY="..." # Mistral
export GEMINI_API_KEY="..." # Google Gemini
# IBM WatsonX
export WATSONX_API_KEY="..." # IBM WatsonX API Key
export WATSONX_PROJECT_ID="..." # IBM WatsonX Project ID
export WATSONX_URL="..." # IBM WatsonX URL (optional)
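Before starting a remote run, it can help to confirm that the relevant variables are actually set. The following is a small sketch, not part of Docling-Graph: the helper `missing_keys` and the `REQUIRED_KEYS` mapping are hypothetical, and treating `WATSONX_PROJECT_ID` as required is an assumption drawn from the fact that only `WATSONX_URL` is marked optional above.

```python
import os

# Variable names come from the export commands above.
# The grouping by provider name is illustrative, not a library constant.
REQUIRED_KEYS = {
    "openai": ["OPENAI_API_KEY"],
    "mistral": ["MISTRAL_API_KEY"],
    "gemini": ["GEMINI_API_KEY"],
    "watsonx": ["WATSONX_API_KEY", "WATSONX_PROJECT_ID"],  # WATSONX_URL is optional
}


def missing_keys(provider, env=None):
    """Return the required variable names that are unset or empty for a provider."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_KEYS[provider] if not env.get(name)]


if __name__ == "__main__":
    for provider in REQUIRED_KEYS:
        missing = missing_keys(provider)
        status = "ok" if not missing else "missing " + ", ".join(missing)
        print(f"{provider}: {status}")
```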
Basic Usage
Python API
from docling_graph import PipelineConfig
from docs.examples.templates.rheology_research import Research
# Create configuration
config = PipelineConfig(
    source="docs/examples/data/research_paper/rheology.pdf",
    template=Research,
    backend="llm",
    inference="remote",
    processing_mode="many-to-one",
    provider_override="mistral",
    model_override="mistral-medium-latest",
    use_chunking=True,
    output_dir="outputs/research"
)
# Run pipeline
config.run()
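The example above sets processing_mode="many-to-one", and the capability list mentions a Smart Merge step that combines multi-page documents into a single Pydantic instance. How the library performs that merge is not described here. The sketch below only illustrates the general idea on plain dicts, under two assumptions: scalar fields keep the first non-empty value, and list fields are concatenated without duplicates. The function name `merge_pages` is hypothetical.

```python
def merge_pages(pages):
    """Merge per-page extraction dicts into one dict.

    Scalars: the first non-empty value wins.
    Lists: items are concatenated in page order, skipping duplicates.
    Assumes each field has the same type (scalar or list) on every page.
    """
    merged = {}
    for page in pages:
        for key, value in page.items():
            if isinstance(value, list):
                bucket = merged.setdefault(key, [])
                for item in value:
                    if item not in bucket:
                        bucket.append(item)
            elif merged.get(key) in (None, ""):
                merged[key] = value
    return merged


pages = [
    {"title": "Rheology study", "authors": ["A. Smith"]},
    {"title": "", "authors": ["A. Smith", "B. Jones"]},
]
print(merge_pages(pages))
```

A real merge would also need a policy for conflicting non-empty scalars; this sketch simply keeps the earliest one.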
CLI
# Initialize configuration
uv run docling-graph init
# Convert document
uv run docling-graph convert "document.pdf" \
--template "templates.MyTemplate" \
--output-dir "outputs/my_graph"
# Visualize results
uv run docling-graph inspect outputs/my_graph
For more examples, see Examples.
Pydantic Templates
Templates define both the extraction schema and the resulting graph structure.
from pydantic import BaseModel, Field
from docling_graph.utils import edge


class Person(BaseModel):
    """Person entity with stable ID."""

    model_config = {
        'is_entity': True,
        'graph_id_fields': ['last_name', 'date_of_birth']
    }

    first_name: str = Field(description="Person's first name")
    last_name: str = Field(description="Person's last name")
    date_of_birth: str = Field(description="Date of birth (YYYY-MM-DD)")


class Organization(BaseModel):
    """Organization entity."""

    model_config = {'is_entity': True}

    name: str = Field(description="Organization name")
    employees: list[Person] = edge("EMPLOYS", description="List of employees")
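To show how a template like this could map onto nodes and edges, here is a conceptual sketch using only the standard library. It is not Docling-Graph's implementation: dataclasses stand in for the Pydantic models, the helpers `stable_id` and `to_graph` are hypothetical, hashing the identifying fields is just one plausible way to obtain stable node IDs, and using `name` as the Organization's identifying field is an assumption.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Person:
    first_name: str
    last_name: str
    date_of_birth: str


@dataclass
class Organization:
    name: str
    employees: list = field(default_factory=list)


def stable_id(label, *values):
    """Hash the label and identifying values so the same entity always maps to the same ID."""
    digest = hashlib.sha256("|".join((label, *values)).encode("utf-8")).hexdigest()
    return f"{label}_{digest[:12]}"


def to_graph(org):
    """Return (nodes, edges): nodes keyed by stable ID, edges as (source, target, label)."""
    org_id = stable_id("Organization", org.name)  # assumption: name identifies the organization
    nodes = {org_id: {"label": "Organization", "name": org.name}}
    edges = []
    for person in org.employees:
        # Mirrors graph_id_fields = ['last_name', 'date_of_birth'] from the template above.
        person_id = stable_id("Person", person.last_name, person.date_of_birth)
        nodes[person_id] = {
            "label": "Person",
            "first_name": person.first_name,
            "last_name": person.last_name,
            "date_of_birth": person.date_of_birth,
        }
        edges.append((org_id, person_id, "EMPLOYS"))
    return nodes, edges


org = Organization("Acme", [Person("Alice", "Doe", "1990-01-01")])
nodes, edges = to_graph(org)
print(len(nodes), "nodes;", len(edges), "edge")
```

Because the ID depends only on the identifying fields, the same person extracted from two different pages resolves to one node instead of two.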
For complete guidance, see the Schema Definition section of the documentation.
Documentation
Comprehensive documentation can be found on Docling Graph's Page.
Documentation Structure
The documentation follows the docling-graph pipeline stages:
- Introduction - Overview and core concepts
- Installation - Setup and environment configuration
- Schema Definition - Creating Pydantic templates
- Pipeline Configuration - Configuring the extraction pipeline
- Extraction Process - Document conversion and extraction
- Graph Management - Exporting and visualizing graphs
- CLI Reference - Command-line interface guide
- Python API - Programmatic usage
- Examples - Working code examples
- Advanced Topics - Performance, testing, error handling
- API Reference - Detailed API documentation
- Development - Contributing and development guide
Examples
Explore working examples in docs/examples/:
- VLM Extraction: Image | PDF
- LLM Extraction: Remote API | Local Ollama
- Advanced: Consolidation | One-to-One
- CLI Recipes: Common Workflows
Example Templates
- Invoice - Financial document extraction
- ID Card - Identity document parsing
- Insurance - Insurance policy extraction
- Research Paper - Scientific document analysis
Contributing
We welcome contributions! Please see:
- Contributing Guidelines - How to contribute
- Development Guide - Development setup
- GitHub Workflow - Branch strategy and CI/CD
Development Setup
# Clone and setup
git clone https://github.com/IBM/docling-graph
cd docling-graph
# Install with dev dependencies
uv sync --extra all --extra dev
# Run pre-commit checks
uv run pre-commit run --all-files
License
MIT License - see LICENSE for details.
Acknowledgments
- Powered by Docling for advanced document processing
- Uses Pydantic for data validation
- Graph generation powered by NetworkX
- Visualizations powered by Cytoscape.js
- CLI powered by Typer and Rich
IBM ❤️ Open Source AI
Docling Graph has been brought to you by IBM.