Docling SDG

Docling for Synthetic Data Generation (SDG) provides a set of tools to create artificial data from documents, leveraging generative AI and Docling's parsing capabilities.

Features

🧬 Generation of question-answering pairs from passages of multiple document formats, including PDF, HTML, or DOCX, leveraging Docling's parsing capabilities
⚖️ LLM as a judge for high-quality question-answering pairs
💻 Simple and convenient CLI

Coming soon

📝 Integrations with Llama Stack and vLLM
📝 SDG on tabular data
📝 Documentation

Installation

To use Docling SDG, simply install docling-sdg from your package manager, e.g., pip:

pip install docling-sdg

Alternatively, you can clone this repository and use uv for creating a virtual environment, installing the packages, and running the project commands.

git clone git@github.com:docling-project/docling-sdg.git
cd docling-sdg
uv sync
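
To confirm which version ended up in the active environment, the Python standard library can query the installed distribution metadata. The sketch below is a generic check and not part of Docling SDG; the helper name `installed_version` is made up for illustration.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# Prints the installed version string, or None when the package is missing.
print(installed_version("docling-sdg"))
```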

Getting started

You can create synthetically generated questions and answers from relevant parts of one or several documents. These question-answer pairs may be used in AI applications, such as evaluating a RAG application or generating ground truth to train a language model.

Sample

Generating and judging data with LLMs may be computationally intensive. Since document collections may be large, you may want to chunk the documents into passages, filter them based on length and content criteria, and sample a subset of them to obtain a manageable dataset.

from docling_sdg.qa.sample import PassageSampler

source = "https://en.wikipedia.org/wiki/Duck"
passage_sampler = PassageSampler()
print(passage_sampler.sample(source))

By default, the results will be exported to the file docling_sdg_sample.jsonl. Every line represents a document passage.
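
Since the export is a JSON Lines file, it can be inspected with the standard library alone. The following is a minimal sketch: `read_jsonl` is a hypothetical helper, not a Docling SDG function, and it prints the field names of the first record instead of assuming a particular schema.

```python
import json
from pathlib import Path
from typing import Any, Dict, Iterator


def read_jsonl(path: Path) -> Iterator[Dict[str, Any]]:
    """Yield one parsed JSON object per non-empty line of a JSON Lines file."""
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)


sample_path = Path("docling_sdg_sample.jsonl")
if sample_path.exists():
    passages = list(read_jsonl(sample_path))
    print(f"{len(passages)} passages")
    if passages:
        # List the keys of the first record to discover the actual schema.
        print(sorted(passages[0]))
```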

Generate

For each passage created in the previous step, we can prompt an LLM and generate 3 different questions of the following types: simple fact, summary, and reasoning.

The GenerateOptions class controls which model provider is used for Q&A generation by setting the provider attribute, as shown below. Three options are available:

LlmProvider.WATSONX for watsonx.ai; you will need to provide a watsonx.ai instance ID and an API key
LlmProvider.OPENAI for OpenAI; you will need to provide an OpenAI API key
LlmProvider.OPENAI_LIKE for any model provider with an OpenAI-compatible API; if no API key is needed (such as when running against Ollama locally), set api_key to any string, e.g. "fake"

import os
from docling_sdg.qa.base import GenerateOptions, LlmProvider
from docling_sdg.qa.generate import Generator
from pathlib import Path

options = GenerateOptions(
    provider=LlmProvider.WATSONX,
    project_id=os.environ.get("WATSONX_PROJECT_ID"),
    api_key=os.environ.get("WATSONX_APIKEY"),
    url=os.environ.get("WATSONX_URL"),
)

generator = Generator(generate_options=options)
print(generator.generate_from_sample(Path("docling_sdg_sample.jsonl")))

By default, the results will be exported to the file docling_sdg_generated_qac.jsonl. Every line represents a generated question-answer-context item with additional information like the question type.
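
The example above reads its credentials with `os.environ.get`, which silently returns `None` when a variable is unset. One way to fail early with a readable message is sketched below; `require_env` is a hypothetical helper and not part of Docling SDG, while the variable names are the ones used in the watsonx.ai example.

```python
import os
from typing import Dict


def require_env(*names: str) -> Dict[str, str]:
    """Return the named environment variables, raising if any is unset or empty."""
    missing = [name for name in names if not os.environ.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {name: os.environ[name] for name in names}


try:
    credentials = require_env("WATSONX_PROJECT_ID", "WATSONX_APIKEY", "WATSONX_URL")
except RuntimeError as error:
    # Report the missing variables before any LLM call is attempted.
    print(error)
```

The returned values can then be passed to the options object in place of the individual `os.environ.get` calls.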

Critique

Some applications require a minimum level of quality in the generated data. The last step consists of using an LLM to judge the generated data and provide both qualitative and quantitative evaluations of the question-answer-context items. Using those evaluations, we can filter the generated dataset to the required quality level.

import os
from docling_sdg.qa.base import CritiqueOptions, LlmProvider
from docling_sdg.qa.critique import Judge
from pathlib import Path

options = CritiqueOptions(
    provider=LlmProvider.WATSONX,
    project_id=os.environ.get("WATSONX_PROJECT_ID"),
    api_key=os.environ.get("WATSONX_APIKEY"),
    url=os.environ.get("WATSONX_URL"),
)

judge = Judge(critique_options=options)
print(judge.critique(Path("docling_sdg_generated_qac.jsonl")))

By default, the results will be exported to the file docling_sdg_critiqued_qac.jsonl. Its content is similar to the file created in the Generate step, but it additionally contains the critique evaluation on several dimensions, such as question-to-context groundedness, question feasibility, or context usefulness.
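
Filtering the critiqued file down to a required quality level could look like the sketch below. It is an illustration under stated assumptions: `filter_jsonl` and `is_grounded` are hypothetical helpers, the output file name is made up, and the record keys used in the predicate ("critiques", "groundedness", "rating") are placeholders, not the documented schema. Inspect a real record and adapt the predicate before relying on it.

```python
import json
from pathlib import Path
from typing import Any, Callable, Dict


def filter_jsonl(
    source: Path,
    target: Path,
    keep: Callable[[Dict[str, Any]], bool],
) -> int:
    """Copy the records of `source` accepted by `keep` into `target`.

    Returns the number of records kept. `source` and `target` must differ.
    """
    kept = 0
    with source.open(encoding="utf-8") as reader, \
            target.open("w", encoding="utf-8") as writer:
        for line in reader:
            if not line.strip():
                continue
            record = json.loads(line)
            if keep(record):
                writer.write(json.dumps(record, ensure_ascii=False) + "\n")
                kept += 1
    return kept


def is_grounded(record: Dict[str, Any], minimum: int = 4) -> bool:
    """HYPOTHETICAL predicate: the keys below are placeholders, not the real schema."""
    rating = record.get("critiques", {}).get("groundedness", {}).get("rating")
    return isinstance(rating, (int, float)) and rating >= minimum


critiqued = Path("docling_sdg_critiqued_qac.jsonl")
if critiqued.exists():
    # The output file name is an arbitrary choice for this example.
    kept = filter_jsonl(critiqued, Path("docling_sdg_filtered_qac.jsonl"), is_grounded)
    print(f"kept {kept} items")
```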

CLI

Docling SDG has a built-in CLI to run the 3 steps of the question-answering data generation.

docling-sdg qa sample https://en.wikipedia.org/wiki/Duck
docling-sdg qa generate docling_sdg_sample.jsonl
docling-sdg qa critique docling_sdg_generated_qac.jsonl

Find out more about optional parameters with the help argument. For instance:

docling-sdg qa generate --help
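
The three CLI steps can also be chained from a script so that a failing step stops the run. The sketch below only uses the subcommands shown in this section and the default file names stated in the Sample and Generate sections; `pipeline_commands` and `run_pipeline` are hypothetical helpers, and the sketch assumes that any provider credentials the CLI needs are already configured.

```python
import subprocess
from typing import List


def pipeline_commands(source: str) -> List[List[str]]:
    """Build the argument lists for the sample, generate, and critique steps."""
    return [
        ["docling-sdg", "qa", "sample", source],
        ["docling-sdg", "qa", "generate", "docling_sdg_sample.jsonl"],
        ["docling-sdg", "qa", "critique", "docling_sdg_generated_qac.jsonl"],
    ]


def run_pipeline(source: str) -> None:
    """Run the three steps in order, stopping at the first failure."""
    for command in pipeline_commands(source):
        # check=True raises CalledProcessError on a non-zero exit code.
        subprocess.run(command, check=True)


# Example call (not executed here because it invokes the CLI and an LLM):
# run_pipeline("https://en.wikipedia.org/wiki/Duck")
```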

Get help and support

Please feel free to connect with us using the discussion section.

Technical report

For more details on Docling SDG's inner workings, check out the paper Know Your RAG: Dataset Taxonomy and Generation Strategies for Evaluating RAG Systems, as well as the Docling Technical Report.

Contributing

Please read Contributing to Docling SDG for details.

References

If you use Docling SDG in your projects, please consider citing the following:

@inproceedings{teixeira-de-lima-etal-2025-know,
    title={Know Your RAG: Dataset Taxonomy and Generation Strategies for Evaluating RAG Systems}, 
    author={Rafael Teixeira de Lima and Shubham Gupta and Cesar Berrospi and Lokesh Mishra and Michele Dolfi and Peter Staar and Panagiotis Vagenas},
    year={2025},
    month={jan},
    booktitle={Proceedings of the 31st International Conference on Computational Linguistics: Industry Track},
    publisher={Association for Computational Linguistics},
    url={https://aclanthology.org/2025.coling-industry.4/}
}

License

The Docling SDG codebase is under the MIT license. For individual model usage, please refer to the model licenses found in the original packages.

LF AI & Data

Docling is hosted as a project in the LF AI & Data Foundation.

IBM ❤️ Open Source AI

The project was started by the AI for knowledge team at IBM Research Zurich.

Project details

Maintainers: ibm-deepsearch-core

Release history

0.4.0 (this version): Aug 15, 2025
0.2.2: May 8, 2025
0.2.1: Apr 24, 2025
0.2.0: Apr 23, 2025
0.1.3: Mar 27, 2025
0.1.2: Mar 26, 2025

Download files

Source Distribution

docling_sdg-0.4.0.tar.gz (29.6 kB), uploaded Aug 15, 2025, tag: Source

Built Distribution

docling_sdg-0.4.0-py3-none-any.whl (32.9 kB), uploaded Aug 15, 2025, tag: Python 3

File details

Details for the file docling_sdg-0.4.0.tar.gz.

File metadata

Download URL: docling_sdg-0.4.0.tar.gz
Upload date: Aug 15, 2025
Size: 29.6 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for docling_sdg-0.4.0.tar.gz
Algorithm	Hash digest
SHA256	`bfce5d69e7115f4481f87524da33e806ed0e60e505518ce654810897b74fea92`
MD5	`76e9c4e96cae1bbc2be775b25a386263`
BLAKE2b-256	`1a224686a98d4074c49873a9bc1338922d96f74e09344ac493566271721678e4`
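
A downloaded file can be checked against the published SHA256 digest with the standard library. This is a minimal sketch: `sha256_of` is a hypothetical helper, and the script assumes the archive was saved under its original name in the current directory.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Digest copied from the table above.
EXPECTED = "bfce5d69e7115f4481f87524da33e806ed0e60e505518ce654810897b74fea92"

archive = Path("docling_sdg-0.4.0.tar.gz")
if archive.exists():
    print("match" if sha256_of(archive) == EXPECTED else "MISMATCH")
```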

Provenance

The following attestation bundles were made for docling_sdg-0.4.0.tar.gz:

Publisher: pypi.yml on docling-project/docling-sdg

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: docling_sdg-0.4.0.tar.gz
- Subject digest: bfce5d69e7115f4481f87524da33e806ed0e60e505518ce654810897b74fea92
- Sigstore transparency entry: 397607993
- Sigstore integration time: Aug 15, 2025
Source repository:
- Permalink: docling-project/docling-sdg@6b9b7dab2a2b70c573801149aa77e1c37d718f4e
- Branch / Tag: refs/tags/v0.4.0
- Owner: https://github.com/docling-project
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: pypi.yml@6b9b7dab2a2b70c573801149aa77e1c37d718f4e
- Trigger Event: release

File details

Details for the file docling_sdg-0.4.0-py3-none-any.whl.

File metadata

Download URL: docling_sdg-0.4.0-py3-none-any.whl
Upload date: Aug 15, 2025
Size: 32.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for docling_sdg-0.4.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`f815d66e42c6f7f8d3600b3e03f0439c1cfb796a7dad7a8e0dcb61be7cdadc8b`
MD5	`92ddf8e4a314079f1f7e5e4a053075cf`
BLAKE2b-256	`34f4aab8e66ec8cd3fa995b6f701befb3ac7d396f1d9d9ea65a203384f4268be`

Provenance

The following attestation bundles were made for docling_sdg-0.4.0-py3-none-any.whl:

Publisher: pypi.yml on docling-project/docling-sdg

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: docling_sdg-0.4.0-py3-none-any.whl
- Subject digest: f815d66e42c6f7f8d3600b3e03f0439c1cfb796a7dad7a8e0dcb61be7cdadc8b
- Sigstore transparency entry: 397608005
- Sigstore integration time: Aug 15, 2025
Source repository:
- Permalink: docling-project/docling-sdg@6b9b7dab2a2b70c573801149aa77e1c37d718f4e
- Branch / Tag: refs/tags/v0.4.0
- Owner: https://github.com/docling-project
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: pypi.yml@6b9b7dab2a2b70c573801149aa77e1c37d718f4e
- Trigger Event: release
