Running a distributed job processing documents with Docling.
Docling Jobkit
How to use it
Kubeflow pipeline with Docling Jobkit
Using Kubeflow pipeline web dashboard UI
- From the main page, open the "Pipelines" section on the left.
- Click the "Upload pipeline" button at the top right.
- Give the pipeline a name and, in the "Upload a file" menu, point to the location of the docling-jobkit/docling_jobkit/kfp_pipeline/docling-s3in-s3out.yaml file.
- Press the "Create run" button at the top right to create an instance of the pipeline.
- Customize the required inputs according to the provided examples and press "Start" to start the pipeline run.
Using OpenShift AI web dashboard UI
- From the main page of Red Hat OpenShift AI, open the "Data Science Pipelines -> Pipelines" section on the left side.
- Switch "Project" to the namespace where you plan to run pipelines.
- Click "Import Pipeline", provide a name, and upload the docling-jobkit/docling_jobkit/kfp_pipeline/docling-s3in-s3out.yaml file.
- From the selected/created pipeline interface, start a new run by pressing "Actions -> Create Run".
- Customize the required inputs according to the provided examples and press "Start" to start the pipeline run.
Customizing pipeline to specifics of your infrastructure
Some customizations, such as the parallelism level, node selector, or tolerations, require changing the source script and compiling a new YAML manifest.
The source script is located at docling-jobkit/docling_jobkit/kfp_pipeline/docling-s3in-s3out.py.
If you use the web UI to run pipelines, the Python script needs to be compiled into YAML and the new version of the YAML uploaded to the pipeline. For example, you can use uv to handle the Python environment and run the following command:
uv run python docling-s3in-s3out.py
The YAML file will be generated in the local folder from which you execute the command. In the web UI, you can then open the existing pipeline and upload the new version of the manifest using "Upload version" at the top right.
By default, parallelism is set to 20 instances. This can be changed in the source script docling-jobkit/docling_jobkit/kfp_pipeline/docling-s3in-s3out.py; look for this line:
with dsl.ParallelFor(batches.outputs["batch_indices"], parallelism=20) as subbatch:
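To get a feel for what the parallelism argument controls, the sketch below mimics the behaviour in plain Python: a list of batch indices is handed to a worker pool whose size caps how many batches are in flight at once. It is only an analogy and not how Kubeflow schedules tasks; run_batches and process_batch are hypothetical names, and the doubling stands in for the real conversion work.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def run_batches(batch_indices, parallelism=20):
    """Process every batch index with at most `parallelism` running at once.

    Returns (results, peak), where peak is the highest number of batches
    observed running at the same time.
    """
    lock = threading.Lock()
    running = 0
    peak = 0

    def process_batch(index):
        nonlocal running, peak
        with lock:
            running += 1
            peak = max(peak, running)
        try:
            # Placeholder for the real work (document conversion in the pipeline).
            return index * 2
        finally:
            with lock:
                running -= 1

    # The pool size plays the role of the parallelism cap.
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        results = list(pool.map(process_batch, batch_indices))
    return results, peak
```

Lowering the cap trades throughput for a smaller footprint on the cluster, which is the same trade-off the pipeline setting exposes.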
By default, the resource requests/limits for the document conversion component are set as follows:
converter.set_memory_request("1G")
converter.set_memory_limit("7G")
converter.set_cpu_request("200m")
converter.set_cpu_limit("1")
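When changing these values, it can help to check that each request does not exceed its limit. The following sketch, which is not part of docling-jobkit, converts Kubernetes-style quantity strings into numbers so they can be compared. It assumes only the common suffixes listed in the table (no exponent notation), and parse_quantity and request_within_limit are hypothetical helper names.

```python
from decimal import Decimal

# Common decimal (power-of-ten) and binary (power-of-two) quantity suffixes.
_SUFFIXES = {
    "m": Decimal("0.001"),
    "k": Decimal(10) ** 3,
    "M": Decimal(10) ** 6,
    "G": Decimal(10) ** 9,
    "T": Decimal(10) ** 12,
    "Ki": Decimal(2) ** 10,
    "Mi": Decimal(2) ** 20,
    "Gi": Decimal(2) ** 30,
    "Ti": Decimal(2) ** 40,
}


def parse_quantity(text):
    """Convert a quantity such as "200m" or "7G" into a Decimal base value."""
    # Try longer suffixes first so a two-letter suffix is never split.
    for suffix in sorted(_SUFFIXES, key=len, reverse=True):
        if text.endswith(suffix):
            return Decimal(text[: -len(suffix)]) * _SUFFIXES[suffix]
    return Decimal(text)


def request_within_limit(request, limit):
    """Return True when the requested amount does not exceed the limit."""
    return parse_quantity(request) <= parse_quantity(limit)
```

With the defaults above, request_within_limit("1G", "7G") and request_within_limit("200m", "1") both hold.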
By default, resource requests/limits are not set for nodes with GPUs. You can uncomment the following lines in the inputs_s3in_s3out pipeline function to enable them:
converter.set_accelerator_type("nvidia.com/gpu")
converter.set_accelerator_limit("1")
The node selector and tolerations can be enabled with the following commands; customize the actual values to your infrastructure:
from kfp import kubernetes
kubernetes.add_node_selector(
task=converter,
label_key="nvidia.com/gpu.product",
label_value="NVIDIA-A10",
)
kubernetes.add_toleration(
task=converter,
key="gpu_compute",
operator="Equal",
value="true",
effect="NoSchedule",
)
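For the toleration to have any effect, it has to match a taint that is actually set on the GPU nodes. The sketch below restates the Kubernetes matching rules in simplified form (it ignores tolerationSeconds) so the values can be sanity-checked before recompiling. The function name toleration_matches and the dictionary layout are assumptions made for illustration, not an API of kfp or Kubernetes.

```python
def toleration_matches(toleration, taint):
    """Return True if the toleration tolerates the taint (simplified rules)."""
    # An empty effect on the toleration matches every taint effect.
    effect = toleration.get("effect", "")
    if effect and effect != taint.get("effect"):
        return False

    key = toleration.get("key", "")
    operator = toleration.get("operator", "Equal")

    # An empty key is only valid with "Exists" and then matches every taint key.
    if key == "":
        return operator == "Exists"
    if key != taint.get("key"):
        return False
    if operator == "Exists":
        return True
    # "Equal" additionally requires the values to be identical.
    return operator == "Equal" and toleration.get("value", "") == taint.get("value", "")


# The toleration from the example above, written as a plain dictionary.
example_toleration = {
    "key": "gpu_compute",
    "operator": "Equal",
    "value": "true",
    "effect": "NoSchedule",
}
```

A node tainted with gpu_compute=true:NoSchedule is tolerated by this example, while the same key with a different value or effect is not.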
Running pipeline programmatically
At the end of the script file you can find example code for submitting a pipeline run programmatically.
You can provide your custom values as environment variables in an .env file and bind it during execution:
uv run --env-file .env python docling-s3in-s3out.py
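Once uv has loaded the .env file, the values are visible to the script as ordinary environment variables. As a rough sketch of how such values might be collected and validated before submitting a run, the snippet below reads a few settings and fails early when one is missing. The variable names are hypothetical placeholders; check the script for the variables it actually reads.

```python
import os

# Hypothetical variable names, used here for illustration only.
REQUIRED_VARS = ("EXAMPLE_S3_ENDPOINT", "EXAMPLE_S3_ACCESS_KEY", "EXAMPLE_S3_SECRET_KEY")


def load_settings(environ=None):
    """Collect the required settings from the environment, failing early if any is unset."""
    environ = os.environ if environ is None else environ
    missing = [name for name in REQUIRED_VARS if not environ.get(name)]
    if missing:
        raise ValueError("missing environment variables: " + ", ".join(missing))
    return {name: environ[name] for name in REQUIRED_VARS}
```

Passing a plain dictionary instead of os.environ makes the function easy to try out without touching the real environment.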
Ray runtime with Docling Jobkit
Make sure your Ray cluster has docling-jobkit installed, then submit the job.
ray job submit --no-wait --working-dir . --runtime-env runtime_env.yml -- docling-ray-job
Custom runtime environment
- Create a file runtime_env.yml:
# Expected environment if clean ray image is used. Take into account that ray worker can timeout before it finishes installing modules.
pip:
  - docling-jobkit
- Submit the job using the custom runtime env:
ray job submit --no-wait --runtime-env runtime_env.yml -- docling-ray-job
More examples and customization are provided in docs/ray-job/.
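If the submission is driven from a script rather than typed by hand, the command line can be assembled programmatically. The following sketch builds the same ray job submit invocations shown above, using only the flags that appear in this document; build_submit_command is a hypothetical helper and not part of docling-jobkit or Ray.

```python
import shlex


def build_submit_command(entrypoint="docling-ray-job", runtime_env="runtime_env.yml",
                         working_dir=None, no_wait=True):
    """Assemble the argument list for `ray job submit`."""
    args = ["ray", "job", "submit"]
    if no_wait:
        args.append("--no-wait")
    if working_dir is not None:
        args += ["--working-dir", working_dir]
    if runtime_env is not None:
        args += ["--runtime-env", runtime_env]
    # Everything after "--" is the entrypoint executed on the cluster.
    args += ["--", entrypoint]
    return args


if __name__ == "__main__":
    # Print the command instead of running it, so it can be reviewed first.
    print(shlex.join(build_submit_command(working_dir=".")))
```

The returned list can be passed to subprocess.run(args, check=True) on a machine where the ray CLI is installed and pointed at your cluster.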
Custom image with all dependencies
Coming soon. Initial instruction from OpenShift AI docs.
Get help and support
Please feel free to connect with us using the discussion section of the main Docling repository.
Contributing
Please read Contributing to Docling Serve for details.
References
If you use Docling in your projects, please consider citing the following:
@techreport{Docling,
author = {Deep Search Team},
month = {1},
title = {Docling: An Efficient Open-Source Toolkit for AI-driven Document Conversion},
url = {https://arxiv.org/abs/2501.17887},
eprint = {2501.17887},
doi = {10.48550/arXiv.2501.17887},
version = {2.0.0},
year = {2025}
}
License
The Docling Serve codebase is under MIT license.
LF AI & Data
Docling is hosted as a project in the LF AI & Data Foundation.
IBM ❤️ Open Source AI
The project was started by the AI for Knowledge team at IBM Research Zurich.