A tool to convert documents into knowledge graphs using Docling.
Docling Graph
Docling Graph turns documents into validated Pydantic objects, then builds a directed knowledge graph with explicit semantic relationships.
This transformation enables high-precision use cases in chemistry, finance, and legal domains, where AI must capture exact entity connections (compounds and reactions, instruments and dependencies, properties and measurements) rather than rely on approximate text embeddings.
This toolkit supports two extraction paths: local VLM extraction via Docling, and LLM-based extraction routed through LiteLLM for local runtimes (vLLM, Ollama) and API providers (Mistral, OpenAI, Gemini, IBM WatsonX), all orchestrated through a flexible, config-driven pipeline.
Key Capabilities
- ✍🏻 Input formats: Everything Docling supports, including PDF, images, Markdown, Office, HTML, and more.
- 🧠 Extraction: LLM or VLM backends, with chunking and processing modes.
- 💎 Graphs: Pydantic → NetworkX directed graphs with stable IDs and edge metadata.
- 🔍 Visualization: Interactive HTML and Markdown reports.
Latest Changes
- 🪜 Multi-pass extraction: Delta and staged contracts (experimental).
- 📐 Structured extraction: LLM output is schema-enforced by default; see CLI and API to disable.
- ✨ LiteLLM: Single interface for vLLM, OpenAI, Mistral, WatsonX, and more.
- 🐛 Trace capture: Debug exports for extraction and fallback diagnostics.
Coming Soon
- 🧩 Interactive Template Builder: Guided workflows for building Pydantic templates.
- 🧲 Ontology-Based Templates: Match content to the best Pydantic template using semantic similarity.
- 💾 Graph Database Integration: Export data straight into Neo4j, ArangoDB, and similar databases.
Quick Start
Requirements
- Python 3.10 or higher
Installation
pip install docling-graph
This installs the core package with VLM support and LiteLLM for LLM providers. For detailed installation instructions (including optional extras and GPU setup), see the Installation Guide.
API Key Setup (Remote Inference)
export OPENAI_API_KEY="..." # OpenAI
export MISTRAL_API_KEY="..." # Mistral
export GEMINI_API_KEY="..." # Google Gemini
# IBM WatsonX
export WATSONX_API_KEY="..." # IBM WatsonX API Key
export WATSONX_PROJECT_ID="..." # IBM WatsonX Project ID
export WATSONX_URL="..." # IBM WatsonX URL (optional)
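Before a remote run, it can help to confirm that the variables for the chosen provider are actually set. The following is a minimal sketch, not part of Docling Graph: the variable names come from the export lines above, while the provider keys (`"openai"`, `"watsonx"`, and so on) and the `missing_env` helper are hypothetical names introduced for illustration.

```python
import os

# Variable names are taken from the export lines above; the provider keys
# and this helper are illustrative assumptions, not a Docling Graph API.
REQUIRED_ENV = {
    "openai": ["OPENAI_API_KEY"],
    "mistral": ["MISTRAL_API_KEY"],
    "gemini": ["GEMINI_API_KEY"],
    # WATSONX_URL is optional, so it is not listed as required.
    "watsonx": ["WATSONX_API_KEY", "WATSONX_PROJECT_ID"],
}


def missing_env(provider, env=None):
    """Return the required variables that are unset or empty for a provider."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_ENV[provider] if not env.get(name)]


if __name__ == "__main__":
    # Report the status of every provider in the current shell environment.
    for provider in REQUIRED_ENV:
        missing = missing_env(provider)
        status = "ready" if not missing else "missing " + ", ".join(missing)
        print(f"{provider}: {status}")
```

Passing a plain dictionary as `env` makes the check easy to test without touching the real environment.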
Basic Usage
CLI
# Initialize configuration
docling-graph init
# Convert document from URL (each line except the last must end with \)
docling-graph convert "https://arxiv.org/pdf/2207.02720" \
    --template "docs.examples.templates.rheology_research.ScholarlyRheologyPaper" \
    --processing-mode "many-to-one" \
    --extraction-contract "staged" \
    --debug
# Visualize results
docling-graph inspect outputs
Python API - Default Behavior
from docling_graph import run_pipeline, PipelineContext
from docs.examples.templates.rheology_research import ScholarlyRheologyPaper
# Create configuration
config = {
    "source": "https://arxiv.org/pdf/2207.02720",
    "template": ScholarlyRheologyPaper,
    "backend": "llm",
    "inference": "remote",
    "processing_mode": "many-to-one",
    "extraction_contract": "staged",  # robust for smaller models
    "provider_override": "mistral",
    "model_override": "mistral-medium-latest",
    "structured_output": True,  # default
    "use_chunking": True,
}
# Run pipeline - returns data directly, no files written to disk
context: PipelineContext = run_pipeline(config)
# Access results
graph = context.knowledge_graph
models = context.extracted_models
metadata = context.graph_metadata
print(f"Extracted {len(models)} model(s)")
print(f"Graph: {graph.number_of_nodes()} nodes, {graph.number_of_edges()} edges")
For debugging, use --debug with the CLI to save intermediate artifacts to disk; see Trace Data & Debugging. For more examples, see Examples.
Pydantic Templates
Templates define both the extraction schema and the resulting graph structure.
from pydantic import BaseModel, Field
from docling_graph.utils import edge
class Person(BaseModel):
    """Person entity with stable ID."""
    model_config = {
        'is_entity': True,
        'graph_id_fields': ['last_name', 'date_of_birth']
    }
    first_name: str = Field(description="Person's first name")
    last_name: str = Field(description="Person's last name")
    date_of_birth: str = Field(description="Date of birth (YYYY-MM-DD)")

class Organization(BaseModel):
    """Organization entity."""
    model_config = {'is_entity': True}
    name: str = Field(description="Organization name")
    employees: list[Person] = edge("EMPLOYS", description="List of employees")
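To make the idea of stable IDs and labeled edges concrete, here is a standard-library sketch of how identity fields such as `graph_id_fields` could map to deterministic node IDs, and how an `EMPLOYS` relationship could become directed edges. This is an illustration under stated assumptions, not Docling Graph's actual algorithm: the hashing scheme, the `stable_node_id` and `organization_to_graph` helpers, and the use of `name` as the identity field for an organization are all hypothetical.

```python
import hashlib
import json


def stable_node_id(label, data, id_fields):
    """Hash the label plus the chosen identity fields into a repeatable ID."""
    key = json.dumps([label, [str(data[f]) for f in id_fields]])
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()[:12]
    return f"{label}_{digest}"


def organization_to_graph(org):
    """Turn one organization dict into (nodes, edges) with EMPLOYS edges."""
    # Assumption: an organization is identified by its name.
    org_id = stable_node_id("Organization", org, ["name"])
    nodes = {org_id: {"label": "Organization", "name": org["name"]}}
    edges = set()  # (source, target, edge label) triples, deduplicated
    for person in org["employees"]:
        pid = stable_node_id("Person", person, ["last_name", "date_of_birth"])
        nodes[pid] = {"label": "Person", **person}
        edges.add((org_id, pid, "EMPLOYS"))
    return nodes, edges


org = {
    "name": "Acme",
    "employees": [
        {"first_name": "Ada", "last_name": "Lovelace", "date_of_birth": "1815-12-10"},
        {"first_name": "A.", "last_name": "Lovelace", "date_of_birth": "1815-12-10"},
    ],
}
nodes, edges = organization_to_graph(org)
print(len(nodes), "nodes,", len(edges), "edges")
```

Because both employee records share the same `last_name` and `date_of_birth`, they hash to the same ID and collapse into a single Person node with one `EMPLOYS` edge. This is the practical reason to choose identity fields carefully: fields that vary between mentions of the same entity (such as an abbreviated first name) would otherwise produce duplicate nodes.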
For complete guidance, see the Schema Definition section of the documentation.
Documentation
Comprehensive documentation can be found on Docling Graph's Page.
Documentation Structure
The documentation follows the docling-graph pipeline stages:
- Introduction - Overview and core concepts
- Installation - Setup and environment configuration
- Schema Definition - Creating Pydantic templates
- Pipeline Configuration - Configuring the extraction pipeline
- Extraction Process - Document conversion and extraction
- Graph Management - Exporting and visualizing graphs
- CLI Reference - Command-line interface guide
- Python API - Programmatic usage
- Examples - Working code examples
- Advanced Topics - Performance, testing, error handling
- API Reference - Detailed API documentation
- Community - Contributing and development guide
Contributing
We welcome contributions! Please see:
- Contributing Guidelines - How to contribute
- Development Guide - Development setup
Development Setup
# Clone and setup
git clone https://github.com/docling-project/docling-graph
cd docling-graph
# Install with dev dependencies
uv sync --extra dev
# Run pre-commit checks
uv run pre-commit run --all-files
License
MIT License - see LICENSE for details.
Acknowledgments
Docling Graph builds on outstanding open-source projects:
- Docling - document conversion and VLM extraction
- Pydantic - schema definition and validation
- NetworkX - graph construction and analysis
- LiteLLM - unified LLM provider interface
- SpaCy - semantic entity resolution in delta extraction
- Cytoscape - interactive graph visualization
IBM ❤️ Open Source AI
Docling Graph has been brought to you by IBM.