Running a distributed job processing documents with Docling.

These details have been verified by PyPI

Project links

GitHub Statistics

Maintainers

ibm-deepsearch-core

These details have not been verified by PyPI

Project description

Docling Jobkit

Running a distributed job processing documents with Docling.

How to use it

Kubeflow pipeline with Docling Jobkit

Using Kubeflow pipeline web dashboard UI

From the main page, open "Pipelines" section on the left
Press on "Upload pipeline" button at top-right
Give pipeline a name and in "Upload a file" menu point to location of docling-jobkit/docling_jobkit/kfp_pipeline/docling-s3in-s3out.yaml file
Now you can press "Create run" button at the top-right to create an instance of the pipeline
Customize required inputs according to provided examples and press "Start" to start pipeline run

Using OpenshiftAI web dashboard UI

From the main page of Red Hat Openshift AI open "Data Science Pipelines -> Pipelines" section on the left side
Switch "Project" to namespace where you plan to run pipelines
Press on "Import Pipeline", provide a name and upload the docling-jobkit/docling_jobkit/kfp_pipeline/docling-s3in-s3out.yaml file
From the selected/created pipeline interface, you can start new run by pressing "Actions -> Create Run"
Customize required inputs according to provided examples and press "Start" to start pipeline run

Customizing pipeline to specifics of your infrastructure

Some customizations, such as paralelism level, node selector or tollerations, require changing source script and compiling new yaml manifest. Source script is located at docling-jobkit/docling_jobkit/kfp_pipeline/docling-s3in-s3out.py.

If you use web UI to run pipelines, then python script need to be compiled into yaml and new version of yaml uploaded to pipeline. For example, you can use poetry to handle python environment and run following command:

uv run python semantic-ingest-batches.py

The yaml file will be generated in the local folder from where you execute command. Now in the web UI, you can open existing pipeline and upload new version of the script using "Upload version" at top-right.

By defaul, paralelism is set to 20 instances, this can be change in the source docling-jobkit/docling_jobkit/kfp_pipeline/docling-s3in-s3out.py script, look for this line with dsl.ParallelFor(batches.outputs["batch_indices"], parallelism=20) as subbatch:.

By default, the resources requests/limits for the document convertion component are set to following:

converter.set_memory_request("1G")
converter.set_memory_limit("7G")
converter.set_cpu_request("200m")
converter.set_cpu_limit("1")

By default, the resource request/limit are not set for the nodes with GPU, you can uncomment following lines in the inputs_s3in_s3out pipeline function to enable it:

converter.set_accelerator_type("nvidia.com/gpu")
converter.set_accelerator_limit("1")

The node selector and tollerations can be enabled with following commands, customize actual values to your infrastructure:

from kfp import kubernetes

kubernetes.add_node_selector(
  task=converter,
  label_key="nvidia.com/gpu.product",
  label_value="NVIDIA-A10",
)

kubernetes.add_toleration(
  task=converter,
  key="gpu_compute",
  operator="Equal",
  value="true",
  effect="NoSchedule",
)

Running pipeline programatically

At the end of the script file you can find an example code for submitting pipeline run programatically. You can provide your custom values as environment variables in an .env file and bind it during execution:

uv run --env-file .env python docling-s3in-s3out.py

Ray runtime with Docling Jobkit

Make sure your Ray cluster has docling-jobkit installed, then submit the job.

ray job submit --no-wait --working-dir . --runtime-env runtime_env.yml -- docling-ray-job

Custom runtime environment

Create a file runtime_env.yml:

# Expected environment if clean ray image is used. Take into account that ray worker can timeout before it finishes installing modules.
pip:
- docling-jobkit

Submit the job using the custom runtime env:

ray job submit --no-wait --runtime-env runtime_env.yml -- docling-ray-job

More examples and customization are provided in docs/ray-job/.

Custom image with all dependencies

Coming soon. Initial instruction from OpenShift AI docs.

Get help and support

Please feel free to connect with us using the discussion section of the main Docling repository.

Contributing

Please read Contributing to Docling Serve for details.

References

If you use Docling in your projects, please consider citing the following:

@techreport{Docling,
  author = {Deep Search Team},
  month = {1},
  title = {Docling: An Efficient Open-Source Toolkit for AI-driven Document Conversion},
  url = {https://arxiv.org/abs/2501.17887},
  eprint = {2501.17887},
  doi = {10.48550/arXiv.2501.17887},
  version = {2.0.0},
  year = {2025}
}

License

The Docling Serve codebase is under MIT license.

LF AI & Data

Docling is hosted as a project in the LF AI & Data Foundation.

IBM ❤️ Open Source AI

The project was started by the AI for Knowledge team at IBM Research Zurich.

Project details

These details have been verified by PyPI

Project links

GitHub Statistics

Maintainers

ibm-deepsearch-core

These details have not been verified by PyPI

Release history Release notifications | RSS feed

This version

1.4.1

Aug 19, 2025

1.4.0

Aug 13, 2025

1.3.1

Aug 12, 2025

1.3.0

Aug 6, 2025

1.2.0

Jul 24, 2025

1.1.1

Jul 18, 2025

1.1.0

Jul 14, 2025

1.0.0

Jul 7, 2025

0.2.0

Jun 25, 2025

0.1.0

May 13, 2025

0.0.2

Apr 16, 2025

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

docling_jobkit-1.4.1.tar.gz (40.7 kB view details)

Uploaded Aug 19, 2025 Source

Built Distribution

docling_jobkit-1.4.1-py3-none-any.whl (51.3 kB view details)

Uploaded Aug 19, 2025 Python 3

File details

Details for the file docling_jobkit-1.4.1.tar.gz.

File metadata

Download URL: docling_jobkit-1.4.1.tar.gz
Upload date: Aug 19, 2025
Size: 40.7 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for docling_jobkit-1.4.1.tar.gz
Algorithm	Hash digest
SHA256	`2030666b3d25b60fc508a64560844dac673cd6a669755b62849f06beeb66467b`
MD5	`9f78ea1fdcd191581264bc5d863824b9`
BLAKE2b-256	`1d52f86812cdaaf9e70187646a7e87483f75cc5027a9ac6ef05f304f0fdfc871`

See more details on using hashes here.

Provenance

The following attestation bundles were made for docling_jobkit-1.4.1.tar.gz:

Publisher: pypi.yml on docling-project/docling-jobkit

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: docling_jobkit-1.4.1.tar.gz
- Subject digest: 2030666b3d25b60fc508a64560844dac673cd6a669755b62849f06beeb66467b
- Sigstore transparency entry: 409441185
- Sigstore integration time: Aug 19, 2025
Source repository:
- Permalink: docling-project/docling-jobkit@11edd0543a088f915e959c34514013ce5342b0fc
- Branch / Tag: refs/tags/v1.4.1
- Owner: https://github.com/docling-project
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: pypi.yml@11edd0543a088f915e959c34514013ce5342b0fc
- Trigger Event: release

File details

Details for the file docling_jobkit-1.4.1-py3-none-any.whl.

File metadata

Download URL: docling_jobkit-1.4.1-py3-none-any.whl
Upload date: Aug 19, 2025
Size: 51.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for docling_jobkit-1.4.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`b55fdcc0b0cbd8bc8ba848df9c0a0135b5c1c3c26aa3e5545e21cacef51897b1`
MD5	`6a07945abdd268571e298edf92d95c9b`
BLAKE2b-256	`6f8769c48f743f09321df7d86270d89ed432ee0d0974f22141d0d644969764fa`

See more details on using hashes here.

Provenance

The following attestation bundles were made for docling_jobkit-1.4.1-py3-none-any.whl:

Publisher: pypi.yml on docling-project/docling-jobkit

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: docling_jobkit-1.4.1-py3-none-any.whl
- Subject digest: b55fdcc0b0cbd8bc8ba848df9c0a0135b5c1c3c26aa3e5545e21cacef51897b1
- Sigstore transparency entry: 409441187
- Sigstore integration time: Aug 19, 2025
Source repository:
- Permalink: docling-project/docling-jobkit@11edd0543a088f915e959c34514013ce5342b0fc
- Branch / Tag: refs/tags/v1.4.1
- Owner: https://github.com/docling-project
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: pypi.yml@11edd0543a088f915e959c34514013ce5342b0fc
- Trigger Event: release

docling-jobkit 1.4.1

Navigation

Verified details

Project links

GitHub Statistics

Maintainers

Unverified details

Meta

Classifiers

Project description

Docling Jobkit

How to use it

Kubeflow pipeline with Docling Jobkit

Using Kubeflow pipeline web dashboard UI

Using OpenshiftAI web dashboard UI

Customizing pipeline to specifics of your infrastructure

Running pipeline programatically

Ray runtime with Docling Jobkit

Custom runtime environment

Custom image with all dependencies

Get help and support

Contributing

References

License

LF AI & Data

IBM ❤️ Open Source AI

Project details

Verified details

Project links

GitHub Statistics

Maintainers

Unverified details

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

Provenance

File details

File metadata

File hashes

Provenance