
A tool to convert documents into knowledge graphs using Docling.

Project description


Docling Graph


Docling-Graph converts documents into validated Pydantic objects and then into a directed knowledge graph, with exports to CSV or Cypher and both static and interactive visualizations.

This transformation turns unstructured documents into validated knowledge graphs with precise semantic relationships. That precision is essential in complex domains like chemistry, finance, and physics, where AI systems must understand exact entity connections (e.g., chemical compounds and their reactions, financial instruments and their dependencies, physical properties and their measurements) rather than approximate text vectors, enabling explainable reasoning over technical document collections.

The toolkit supports two extraction families: local VLM via Docling and LLM-based extraction via local (vLLM, Ollama) or API providers (Mistral, OpenAI, Gemini, IBM WatsonX), all orchestrated by a flexible, config-driven pipeline.

Key Capabilities

  • 🧠 Extraction:
    • Local VLM (Docling's information extraction pipeline - ideal for small documents with key-value focus)
    • LLM (local via vLLM/Ollama or remote via Mistral/OpenAI/Gemini/IBM WatsonX API)
    • Hybrid Chunking: leverages Docling's segmentation together with semantic LLM chunking for more context-aware extraction
    • Page-wise or whole-document conversion strategies for flexible processing
  • 🔨 Graph Construction:
    • Markdown to Graph: Convert validated Pydantic instances to a NetworkX DiGraph with rich edge metadata and stable node IDs
    • Smart Merge: Combine multi-page documents into a single Pydantic instance for unified processing
    • Modular graph module with enhanced type safety and configuration
  • 📦 Export:
    • Docling Document exports (JSON format with full document structure)
    • Markdown exports (full document and per-page options)
    • CSV compatible with Neo4j admin import
    • Cypher script generation for bulk ingestion
    • JSON export for general-purpose graph data
  • 📊 Visualization:
    • Interactive HTML visualization in full-page browser view with enhanced node/edge exploration
    • Detailed Markdown report with graph node contents and edges
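
To make the "CSV compatible with Neo4j admin import" bullet concrete, the sketch below writes node and relationship CSVs in the header format that neo4j-admin database import expects (:ID, :LABEL, :START_ID, :END_ID, :TYPE columns). The helper functions and sample rows are illustrative, not docling-graph's actual export code:

```python
# Illustrative sketch (not docling-graph's writer): emit node and
# relationship CSVs in the header format neo4j-admin import expects.
import csv
import io

nodes = [
    {"id": "Person_Lovelace_1815-12-10", "label": "Person", "first_name": "Ada"},
]
edges = [
    {"start": "Person_Lovelace_1815-12-10", "end": "Doc_rheology", "type": "AUTHOR_OF"},
]

def nodes_csv(rows):
    """Write nodes with Neo4j's :ID / :LABEL header columns."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow([":ID", ":LABEL", "first_name"])
    for r in rows:
        writer.writerow([r["id"], r["label"], r.get("first_name", "")])
    return buf.getvalue()

def edges_csv(rows):
    """Write relationships with :START_ID / :END_ID / :TYPE columns."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow([":START_ID", ":END_ID", ":TYPE"])
    for r in rows:
        writer.writerow([r["start"], r["end"], r["type"]])
    return buf.getvalue()

print(nodes_csv(nodes).splitlines()[0])  # :ID,:LABEL,first_name
```

Stable node IDs (natural keys, such as the Person ID above) let re-runs over the same corpus produce mergeable rows rather than duplicate nodes.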

Coming Soon

  • 🪜 Multi-Stage Extraction: Define extraction_stage in templates to control multi-pass extraction.
  • 🧩 Interactive Template Builder: Guided workflows for building Pydantic templates.
  • 🧬 Ontology-Based Templates: Match content to the best Pydantic template using semantic similarity.
  • ✍🏻 Flexible Inputs: Accepts text, markdown, and DoclingDocument directly.
  • Batch Optimization: Faster GPU inference with better memory handling.
  • 💾 Graph Database Integration: Export data straight into Neo4j, ArangoDB, and similar databases.

Initial Setup

Requirements

  • Python 3.10 or higher
  • UV package manager

Installation

1. Clone the Repository

git clone https://github.com/IBM/docling-graph
cd docling-graph

2. Install Dependencies

Choose the installation option that matches your use case:

  • Minimal (uv sync): includes core VLM features (Docling), no LLM inference
  • Full (uv sync --extra all): includes all features, VLM, and all local/remote LLM providers
  • Local LLM (uv sync --extra local): adds support for vLLM and Ollama (vLLM requires a GPU)
  • Remote API (uv sync --extra remote): adds support for the Mistral, OpenAI, Gemini, and IBM WatsonX APIs
  • WatsonX (uv sync --extra watsonx): adds support for IBM WatsonX foundation models (Granite, Llama, Mixtral)

3. OPTIONAL - GPU Support (PyTorch)

Follow the steps in this guide to install PyTorch with NVIDIA GPU (CUDA) support.

API Key Setup (for Remote Inference)

If you're using remote/cloud inference, set your API keys for the providers you plan to use:

export OPENAI_API_KEY="..."        # OpenAI
export MISTRAL_API_KEY="..."       # Mistral
export GEMINI_API_KEY="..."        # Google Gemini
export WATSONX_API_KEY="..."       # IBM WatsonX
export WATSONX_PROJECT_ID="..."    # IBM WatsonX Project ID
export WATSONX_URL="..."           # IBM WatsonX URL (optional, defaults to US South)

On Windows, replace export with set in Command Prompt, or use $env:NAME="..." in PowerShell.

Alternatively, add them to your .env file.

Note: For IBM WatsonX setup and available models, see the WatsonX Integration Guide.
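
Before launching a remote run, it can save a failed pipeline call to confirm the relevant variables are set. The pre-flight helper below is an illustrative sketch, not part of docling-graph; the provider-to-variable mapping simply mirrors the export lines above:

```python
import os

# Environment variables each remote provider needs (mirroring the export
# lines above). This helper is an illustrative pre-flight check, not a
# docling-graph API.
REQUIRED_KEYS = {
    "openai": ["OPENAI_API_KEY"],
    "mistral": ["MISTRAL_API_KEY"],
    "gemini": ["GEMINI_API_KEY"],
    "watsonx": ["WATSONX_API_KEY", "WATSONX_PROJECT_ID"],  # WATSONX_URL is optional
}

def missing_keys(provider: str) -> list[str]:
    """Return the required variables that are unset or empty for a provider."""
    return [k for k in REQUIRED_KEYS[provider] if not os.environ.get(k)]

if missing := missing_keys("mistral"):
    print(f"Set these before running remote inference: {missing}")
```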

Getting Started

Docling Graph is primarily driven by its CLI, but you can easily integrate the core pipeline into Python scripts.

1. Python Example

To run a conversion programmatically, you define a configuration dictionary and pass it to the run_pipeline function. This example uses a remote LLM API in a many-to-one mode for a single multi-page document:

from docling_graph import run_pipeline, PipelineConfig
from docs.examples.templates.rheology_research import Research  # Pydantic model to use as an extraction template

# Create typed config
config = PipelineConfig(
    source="docs/examples/data/research_paper/rheology.pdf",
    template=Research,
    backend="llm",
    inference="remote",
    processing_mode="many-to-one",
    provider_override="mistral",              # Specify your preferred provider and ensure its API key is set
    model_override="mistral-medium-latest",   # Specify your preferred LLM model
    use_chunking=True,                        # Enable docling's hybrid chunker
    llm_consolidation=False,                  # If False, programmatically merge batch-extracted dictionaries
    output_dir="outputs/battery_research"
)

try:
    run_pipeline(config)
    print(f"\nExtraction complete! Graph data saved to: {config.output_dir}")
except Exception as e:
    print(f"An error occurred: {e}")

2. CLI Example

Use the command-line interface for quick conversions and inspections. The following command runs the conversion using the local VLM backend and outputs a graph ready for Neo4j import:

2.1. Initialize Configuration

A wizard will walk you through setting up the right config for your use case.

uv run docling-graph init

Note: This command may take a little longer to start on the first run, as it checks for installed dependencies.

2.2. Run Conversion

Run docling-graph convert --help to see the full list of available options and usage details.

# uv run docling-graph convert <SOURCE_FILE_PATH> --template "<TEMPLATE_DOTTED_PATH>" [OPTIONS]

uv run docling-graph convert "docs/examples/data/research_paper/rheology.pdf" \
    --template "docs.examples.templates.rheology_research.Research" \
    --output-dir "outputs/battery_research"  \
    --processing-mode "many-to-one" \
    --use-chunking \
    --no-llm-consolidation 

2.3. Inspect Output

# uv run docling-graph inspect <CONVERT_OUTPUT_PATH> [OPTIONS]

uv run docling-graph inspect outputs/battery_research

Pydantic Templates

Templates are the foundation of Docling Graph, defining both the extraction schema and the resulting graph structure.

  • Use is_entity=True in model_config to explicitly mark a class as a graph node.
  • Leverage model_config.graph_id_fields to create stable, readable node IDs (natural keys).
  • Use the Edge() helper to define explicit relationships between entities.

Example:

from pydantic import BaseModel, Field
from typing import Optional

class Person(BaseModel):
    """Person entity with stable ID based on name and DOB."""
    model_config = {
        'is_entity': True,
        'graph_id_fields': ['last_name', 'date_of_birth']
    }
    
    first_name: str = Field(description="Person's first name")
    last_name: str = Field(description="Person's last name")
    date_of_birth: str = Field(description="Date of birth (YYYY-MM-DD)")
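
To see the natural key in action, here is the same Person model with a hand-rolled ID join. The underscore-joined format is a sketch of the idea; docling-graph's actual node ID scheme may differ:

```python
from pydantic import BaseModel, Field

class Person(BaseModel):
    """Person entity keyed by last_name and date_of_birth."""
    model_config = {
        "is_entity": True,
        "graph_id_fields": ["last_name", "date_of_birth"],
    }
    first_name: str = Field(description="Person's first name")
    last_name: str = Field(description="Person's last name")
    date_of_birth: str = Field(description="Date of birth (YYYY-MM-DD)")

# Sketch: derive a natural-key node ID from graph_id_fields by hand.
# The underscore-joined format is illustrative only.
ada = Person(first_name="Ada", last_name="Lovelace", date_of_birth="1815-12-10")
id_fields = Person.model_config["graph_id_fields"]
node_id = "_".join(str(getattr(ada, f)) for f in id_fields)
print(node_id)  # Lovelace_1815-12-10
```

Because the ID is built from field values rather than object identity, the same person extracted from different pages collapses into one node.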

Reference Pydantic templates are available to help you get started quickly.

For complete guidance, see: Pydantic Templates for Knowledge Graph Extraction

Documentation

  • Work In Progress...

Examples

Get hands-on with Docling Graph examples to convert documents into knowledge graphs through VLM or LLM-based processing.

License

MIT License - see LICENSE for details.

Acknowledgments

IBM ❤️ Open Source AI

Docling Graph has been brought to you by IBM.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

docling_graph-0.2.5.tar.gz (76.8 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

docling_graph-0.2.5-py3-none-any.whl (97.2 kB view details)

Uploaded Python 3

File details

Details for the file docling_graph-0.2.5.tar.gz.

File metadata

  • Download URL: docling_graph-0.2.5.tar.gz
  • Upload date:
  • Size: 76.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for docling_graph-0.2.5.tar.gz
Algorithm Hash digest
SHA256 83477398f0de2be74006a9e19357a153857829885d65de7d4c4ba5500dbb3043
MD5 8766bd091ce89673129cbc88b8fb36a7
BLAKE2b-256 025d1e5e0531229a07eeac20e35aa44916364eb97353230537e95e332b65a170

See more details on using hashes here.

Provenance

The following attestation bundles were made for docling_graph-0.2.5.tar.gz:

Publisher: release.yml on IBM/docling-graph

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file docling_graph-0.2.5-py3-none-any.whl.

File metadata

  • Download URL: docling_graph-0.2.5-py3-none-any.whl
  • Upload date:
  • Size: 97.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for docling_graph-0.2.5-py3-none-any.whl
Algorithm Hash digest
SHA256 8c470d61cc57fe597adf5b8f6ca8cccbf2d9335014ed2277a9b511fe3ec18923
MD5 8b2d56774ef91e52ea76d948f38f2c42
BLAKE2b-256 e3f1771a3d6ce3a93f1618510068eea9868dd0782c83c2114f2b7cb637f2ffe0

See more details on using hashes here.

Provenance

The following attestation bundles were made for docling_graph-0.2.5-py3-none-any.whl:

Publisher: release.yml on IBM/docling-graph

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
