A tool to convert documents into knowledge graphs using Docling.

Docling Graph

Docling-Graph converts documents into validated Pydantic objects and then into a directed knowledge graph, with exports to CSV or Cypher and both static and interactive visualizations.

Transforming unstructured documents into validated knowledge graphs with precise semantic relationships is essential for complex domains such as chemistry, finance, and law, where AI systems must understand exact entity connections (e.g., chemical compounds and their reactions, financial instruments and their dependencies, physical properties and their measurements) rather than rely on approximate text vectors. Such graphs enable explainable reasoning over technical document collections.

The toolkit supports two extraction families: a local VLM via Docling, and LLM-based extraction via local runtimes (vLLM, Ollama) or API providers (Mistral, OpenAI, Gemini, IBM WatsonX). Both are orchestrated by a flexible, config-driven pipeline.

Key Capabilities

  • 🧠 Extraction:
    • Local VLM (Docling's information extraction pipeline - ideal for small documents with key-value focus)
    • LLM (local via vLLM/Ollama or remote via Mistral/OpenAI/Gemini/IBM WatsonX API)
    • Hybrid Chunking: combines Docling's segmentation with semantic LLM chunking for more context-aware extraction
    • Page-wise or whole-document conversion strategies for flexible processing
  • 🔨 Graph Construction:
    • Markdown to Graph: Convert validated Pydantic instances to a NetworkX DiGraph with rich edge metadata and stable node IDs
    • Smart Merge: Combine multi-page documents into a single Pydantic instance for unified processing
    • Modular graph module with enhanced type safety and configuration
  • 📦 Export:
    • Docling Document exports (JSON format with full document structure)
    • Markdown exports (full document and per-page options)
    • CSV compatible with Neo4j admin import
    • Cypher script generation for bulk ingestion
    • JSON export for general-purpose graph data
  • 📊 Visualization:
    • Interactive HTML visualization in full-page browser view with enhanced node/edge exploration
    • Detailed Markdown report with graph node content and edges
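
To make the Cypher export idea concrete, here is a minimal sketch of how nodes and edges could be turned into a bulk-ingestion script. It is not Docling-Graph's exporter: the function names, the `id` property, and the node/edge dictionary layout are assumptions made for illustration, and values are escaped with JSON string rules as a simplification.

```python
import json


def cypher_literal(value):
    """Render a Python value as a Cypher literal (JSON escaping is a simplification)."""
    return json.dumps(value)


def node_to_cypher(node_id, label, props):
    """Build one MERGE statement keyed on a hypothetical `id` property."""
    stmt = f"MERGE (n:{label} {{id: {cypher_literal(node_id)}}})"
    assignments = ", ".join(
        f"n.{key} = {cypher_literal(value)}" for key, value in sorted(props.items())
    )
    if assignments:
        stmt += f" SET {assignments}"
    return stmt + ";"


def edge_to_cypher(source_id, target_id, rel_type):
    """Match both endpoints by id, then MERGE the directed relationship."""
    return (
        f"MATCH (a {{id: {cypher_literal(source_id)}}}), "
        f"(b {{id: {cypher_literal(target_id)}}}) "
        f"MERGE (a)-[:{rel_type}]->(b);"
    )


def graph_to_cypher(nodes, edges):
    """Emit node statements first so every edge can find its endpoints."""
    lines = [node_to_cypher(n["id"], n["label"], n["props"]) for n in nodes]
    lines += [edge_to_cypher(src, dst, rel) for src, dst, rel in edges]
    return "\n".join(lines)
```

Using `MERGE` rather than `CREATE` keeps the script idempotent, so running it twice against the same database does not duplicate nodes or relationships.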

Coming Soon

  • 🪜 Multi-Stage Extraction: Define extraction_stage in templates to control multi-pass extraction.
  • 🧩 Interactive Template Builder: Guided workflows for building Pydantic templates.
  • 🧬 Ontology-Based Templates: Match content to the best Pydantic template using semantic similarity.
  • ✍🏻 Flexible Inputs: Accepts text, Markdown, and DoclingDocument directly.
  • Batch Optimization: Faster GPU inference with better memory handling.
  • 💾 Graph Database Integration: Export data straight into Neo4j, ArangoDB, and similar databases.

Quick Start

Requirements

  • Python 3.10 or higher
  • uv package manager

Installation

# Clone the repository
git clone https://github.com/IBM/docling-graph
cd docling-graph

# Install with uv (choose your option)
uv sync                    # Minimal: Core + VLM only
uv sync --extra all        # Full: All features
uv sync --extra local      # Local LLM (vLLM, Ollama)
uv sync --extra remote     # Remote APIs (Mistral, OpenAI, Gemini)
uv sync --extra watsonx    # IBM WatsonX support

For detailed installation instructions, see Installation Guide.

API Key Setup (Remote Inference)

export OPENAI_API_KEY="..."        # OpenAI
export MISTRAL_API_KEY="..."       # Mistral
export GEMINI_API_KEY="..."        # Google Gemini

# IBM WatsonX
export WATSONX_API_KEY="..."       # IBM WatsonX API Key
export WATSONX_PROJECT_ID="..."    # IBM WatsonX Project ID
export WATSONX_URL="..."           # IBM WatsonX URL (optional)
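
Before starting a remote run, it can help to check that the variables above are actually set. The sketch below is a standalone helper, not part of Docling-Graph; the provider labels used as dictionary keys and the function name `missing_keys` are assumptions for illustration, while the variable names come from the list above.

```python
import os

# Environment variables listed above, grouped under hypothetical provider labels.
REQUIRED_KEYS = {
    "openai": ["OPENAI_API_KEY"],
    "mistral": ["MISTRAL_API_KEY"],
    "gemini": ["GEMINI_API_KEY"],
    "watsonx": ["WATSONX_API_KEY", "WATSONX_PROJECT_ID"],  # WATSONX_URL is optional
}


def missing_keys(provider, env=None):
    """Return the required variables that are unset or empty for a provider."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_KEYS[provider] if not env.get(name)]
```

Calling `missing_keys("mistral")` before running the pipeline returns an empty list when the key is present, and otherwise names the variables to export.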

Basic Usage

Python API

from docling_graph import PipelineConfig
from docs.examples.templates.rheology_research import Research

# Create configuration
config = PipelineConfig(
    source="docs/examples/data/research_paper/rheology.pdf",
    template=Research,
    backend="llm",
    inference="remote",
    processing_mode="many-to-one",
    provider_override="mistral",
    model_override="mistral-medium-latest",
    use_chunking=True,
    output_dir="outputs/research"
)

# Run pipeline
config.run()

CLI

# Initialize configuration
uv run docling-graph init

# Convert document
uv run docling-graph convert "document.pdf" \
    --template "templates.MyTemplate" \
    --output-dir "outputs/my_graph"

# Visualize results
uv run docling-graph inspect outputs/my_graph
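
To convert several documents with the same template, the `convert` command can be assembled programmatically. The following is a rough sketch that only builds argument lists, using the flags shown above; the helper name `build_convert_commands` and the one-output-folder-per-document layout are assumptions, not behavior of the CLI itself.

```python
from pathlib import Path


def build_convert_commands(pdf_paths, template, output_root):
    """Build one `docling-graph convert` argument list per input document."""
    commands = []
    for pdf in pdf_paths:
        # Assumed layout: each document gets its own folder named after the file stem.
        output_dir = (Path(output_root) / Path(pdf).stem).as_posix()
        commands.append([
            "uv", "run", "docling-graph", "convert", str(pdf),
            "--template", template,
            "--output-dir", output_dir,
        ])
    return commands
```

Each returned list can be passed to `subprocess.run(cmd, check=True)`, which stops the batch at the first failing conversion.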

For more examples, see Examples.

Pydantic Templates

Templates define both the extraction schema and the resulting graph structure.

from pydantic import BaseModel, Field
from docling_graph.utils import edge

class Person(BaseModel):
    """Person entity with stable ID."""
    model_config = {
        'is_entity': True,
        'graph_id_fields': ['last_name', 'date_of_birth']
    }
    
    first_name: str = Field(description="Person's first name")
    last_name: str = Field(description="Person's last name")
    date_of_birth: str = Field(description="Date of birth (YYYY-MM-DD)")

class Organization(BaseModel):
    """Organization entity."""
    model_config = {'is_entity': True}
    
    name: str = Field(description="Organization name")
    employees: list[Person] = edge("EMPLOYS", description="List of employees")
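
The role of `graph_id_fields` and edge fields can be illustrated with a standard-library sketch. This is not the library's implementation: dataclasses stand in for Pydantic models, the hashing scheme in `stable_id`, the `edge_labels` mapping, and the choice of `name` as the ID field for `Organization` are all assumptions made for illustration.

```python
import hashlib
from dataclasses import dataclass, field, fields


@dataclass
class Person:
    first_name: str
    last_name: str
    date_of_birth: str
    graph_id_fields = ("last_name", "date_of_birth")  # unannotated, so not a dataclass field


@dataclass
class Organization:
    name: str
    employees: list = field(default_factory=list)
    graph_id_fields = ("name",)               # assumed ID field for this sketch
    edge_labels = {"employees": "EMPLOYS"}    # hypothetical stand-in for edge(...)


def stable_id(entity):
    """Derive a deterministic node ID from the entity's ID fields only."""
    key = "|".join(str(getattr(entity, name)) for name in entity.graph_id_fields)
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()[:12]
    return f"{type(entity).__name__}_{digest}"


def build_graph(root):
    """Walk an entity tree and return (nodes, edges) as plain Python structures."""
    nodes, edges = {}, []

    def visit(entity):
        node_id = stable_id(entity)
        if node_id in nodes:  # same ID fields means same node
            return node_id
        props = nodes[node_id] = {}
        labels = getattr(entity, "edge_labels", {})
        for f in fields(entity):
            value = getattr(entity, f.name)
            if f.name in labels:
                # Edge field: each child becomes a node linked by a labeled edge.
                for child in value:
                    edges.append((node_id, visit(child), labels[f.name]))
            else:
                props[f.name] = value
        return node_id

    visit(root)
    return nodes, edges
```

Because the ID depends only on `last_name` and `date_of_birth`, two extractions of the same person that disagree on `first_name` still resolve to a single node, which is the point of declaring stable ID fields in a template.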

For complete guidance, see the Schema Definition section of the documentation.

Documentation

Comprehensive documentation can be found on Docling Graph's Page.

Documentation Structure

The documentation follows the docling-graph pipeline stages:

  1. Introduction - Overview and core concepts
  2. Installation - Setup and environment configuration
  3. Schema Definition - Creating Pydantic templates
  4. Pipeline Configuration - Configuring the extraction pipeline
  5. Extraction Process - Document conversion and extraction
  6. Graph Management - Exporting and visualizing graphs
  7. CLI Reference - Command-line interface guide
  8. Python API - Programmatic usage
  9. Examples - Working code examples
  10. Advanced Topics - Performance, testing, error handling
  11. API Reference - Detailed API documentation
  12. Development - Contributing and development guide

Examples

Explore working examples, including example templates, in docs/examples/.

Contributing

We welcome contributions! Please see the Development section of the documentation.

Development Setup

# Clone and setup
git clone https://github.com/IBM/docling-graph
cd docling-graph

# Install with dev dependencies
uv sync --extra all --extra dev

# Run pre-commit checks
uv run pre-commit run --all-files

License

MIT License - see LICENSE for details.

Acknowledgments

IBM ❤️ Open Source AI

Docling Graph has been brought to you by IBM.
