
Package to perform document conversion using Docling

Sinapsis Docling

Templates for simple and custom document conversion using Docling

🐍 Installation • 🚀 Features • 📚 Usage example • 📙 Documentation • 🔍 License

This Sinapsis Docling package provides templates for integrating, configuring, and running document conversion workflows powered by Docling.

🐍 Installation

[!IMPORTANT] The Sinapsis project requires Python 3.10 or higher.

Install using your favourite package manager. We strongly encourage the use of uv, although any other package manager should work too. If you need to install uv, please see the official documentation.

Example with uv:

  uv pip install sinapsis-docling --extra-index-url https://pypi.sinapsis.tech

or with raw pip:

  pip install sinapsis-docling --extra-index-url https://pypi.sinapsis.tech

🚀 Features

Templates Supported

  • DoclingSimpleConversion: Template for simple document conversions using the Docling framework.

    Attributes
    • convert_options(Optional): Configuration for document conversion, such as error handling, page range, and file size limits (default: {}).
    • export_format(Optional): Format for document export (default: export_to_markdown). Options: export_to_dict, export_to_doctags, export_to_element_tree, export_to_html, export_to_markdown, export_to_text.
    • image_mode(Optional): Image handling mode (default: placeholder). Options: placeholder, embedded, referenced.
    • output_dir(Optional): Directory for saving the converted document(s) (default: SINAPSIS_CACHE_DIR/docling/documents).
    • save_in_container(Optional): Whether to store the converted document(s) in the container (default: True).
    • save_locally(Optional): Whether to save the converted document(s) locally (default: False).
    • save_format(Optional): Format for saving the document(s) (default: save_as_markdown). Options: save_as_doctags, save_as_html, save_as_json, save_as_markdown, save_as_yaml.
    • path_to_doc(Required): The source document(s) to convert. This can be a file path, a URL, or a list of file paths or URLs (default: None).
  • DoclingCustomConversion: Template for advanced document conversions using the Docling framework.

    Attributes
    • accelerator_options(Optional): Options for the accelerator, including num_threads, device, cuda_use_flash_attention2 (default: {}).
    • convert_options(Optional): Configuration for document conversion, such as error handling, page range, and file size limits (default: {}).
    • export_format(Optional): Format for document export (default: export_to_markdown). Options: export_to_dict, export_to_doctags, export_to_element_tree, export_to_html, export_to_markdown, export_to_text.
    • image_mode(Optional): Image handling mode (default: placeholder). Options: placeholder, embedded, referenced.
    • ocr_engine(Optional): OCR engine to use (default: easyocr). Options: easyocr, ocrmac, rapidocr, tesserocr, tesseract.
    • ocr_options(Optional): OCR engine configuration options (default: {}).
    • output_dir(Optional): Directory for saving the converted document(s) (default: SINAPSIS_CACHE_DIR/docling/documents).
    • pipeline_options(Optional): Conversion pipeline options (default: {}).
    • save_in_container(Optional): Whether to store the converted document(s) in the container (default: True).
    • save_locally(Optional): Whether to save the converted document(s) locally (default: False).
    • save_format(Optional): Format for saving the document(s) (default: save_as_markdown). Options: save_as_doctags, save_as_html, save_as_json, save_as_markdown, save_as_yaml.
    • path_to_doc(Required): The source document(s) to convert. This can be a file path, a URL, or a list of file paths or URLs (default: None).

    For detailed documentation on setting accelerator, OCR, and pipeline options, refer to the Docling reference.
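
As a minimal sketch of the simpler template, the config below connects DoclingSimpleConversion to an InputTemplate, following the agent layout used in the configs further down this README. The agent name and the document path `report.pdf` are hypothetical placeholders. Every attribute shown comes from the list above, and anything omitted is assumed to fall back to its documented default.

```yaml
agent:
  name: simple_conversion_agent  # hypothetical agent name

templates:
- template_name: InputTemplate
  class_name: InputTemplate
  attributes: {}

- template_name: DoclingSimpleConversion
  class_name: DoclingSimpleConversion
  template_input: InputTemplate
  attributes:
    # Hypothetical local file; a URL or a list of paths/URLs is also accepted.
    path_to_doc: 'report.pdf'
    # Export as HTML instead of the default Markdown.
    export_format: export_to_html
    # Also write the converted document to disk (off by default).
    save_locally: true
    save_format: save_as_html
```

With `output_dir` left unset, the saved file should end up under the default location, SINAPSIS_CACHE_DIR/docling/documents.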

[!TIP] Use the CLI command sinapsis info --example-template-config TEMPLATE_NAME to produce an example Agent config for the Template specified in TEMPLATE_NAME.

For example, for DoclingCustomConversion use sinapsis info --example-template-config DoclingCustomConversion to produce an example config like:

Config
agent:
  name: my_test_agent
templates:
- template_name: InputTemplate
  class_name: InputTemplate
  attributes: {}
- template_name: DoclingCustomConversion
  class_name: DoclingCustomConversion
  template_input: InputTemplate
  attributes:
    convert_options:
      headers: null
      raises_on_error: true
      max_num_pages: 90
      max_file_size: 1000
      page_range:
      - 1
      - 90
    export_format: export_to_markdown
    image_mode: placeholder
    output_dir: ~/.cache/sinapsis/docling/documents
    save_in_container: true
    save_locally: false
    save_format: save_as_markdown
    path_to_doc: 'document.pdf'
    accelerator_options:
      num_threads: 4
      device: auto
      cuda_use_flash_attention2: false
    ocr_engine: easyocr
    ocr_options: 
    pipeline_options:
      create_legacy_output: true
      document_timeout: null
      accelerator_options:
        num_threads: 4
        device: auto
        cuda_use_flash_attention2: false
      enable_remote_services: false
      allow_external_plugins: false
      artifacts_path: null
      images_scale: 1.0
      generate_page_images: false
      generate_picture_images: false
      do_table_structure: true
      do_ocr: true
      do_code_enrichment: false
      do_formula_enrichment: false
      do_picture_classification: false
      do_picture_description: false
      force_backend_text: false
      table_structure_options:
        do_cell_matching: true
        mode: accurate
      ocr_options:
        lang: '`replace_me:typing.List[str]`'
        force_full_page_ocr: false
        bitmap_area_threshold: 0.05
      picture_description_options:
        batch_size: 8
        scale: 2
        picture_area_threshold: 0.05
      generate_table_images: false
      generate_parsed_pages: false

📚 Usage example

This example shows how to use the DoclingCustomConversion template to export and save PDF files as Markdown.

Config
agent:
  name: document_conversion
  description: document conversion agent using docling

templates:

- template_name: InputTemplate
  class_name: InputTemplate
  attributes: {}

- template_name: DoclingCustomConversion
  class_name: DoclingCustomConversion
  template_input: InputTemplate
  attributes:
    export_format: export_to_markdown
    save_locally: True
    save_format: save_as_markdown
    path_to_doc: ["https://arxiv.org/pdf/2408.09869", "https://arxiv.org/pdf/2206.01062"]
    pipeline_options:
      do_ocr: True
      do_table_structure: True
      force_full_page_ocr: True
      table_structure_options:
        do_cell_matching: False
      accelerator_options:
        num_threads: 8
    ocr_options:
      lang: ["es"]

This configuration defines an agent and a sequence of templates that convert the listed documents with Docling.

To run the config, use the CLI:

sinapsis run name_of_config.yml
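
The same template can be narrowed to a page subset or pointed at a different OCR engine. The following is a sketch under stated assumptions, not a verified config: the agent name and `output_dir` value are hypothetical, `device: cpu` is assumed to be a value accepted by Docling's accelerator options, and the `convert_options` keys are the ones that appear in the generated example config earlier in this README.

```yaml
agent:
  name: ocr_page_range_agent  # hypothetical agent name

templates:
- template_name: InputTemplate
  class_name: InputTemplate
  attributes: {}

- template_name: DoclingCustomConversion
  class_name: DoclingCustomConversion
  template_input: InputTemplate
  attributes:
    path_to_doc: "https://arxiv.org/pdf/2408.09869"
    convert_options:
      # Convert only pages 1 through 5 of the document.
      page_range: [1, 5]
      # Do not raise when a document fails to convert.
      raises_on_error: false
    # One of the documented engines: easyocr, ocrmac, rapidocr, tesserocr, tesseract.
    ocr_engine: rapidocr
    accelerator_options:
      device: cpu        # assumed valid value; the generated example uses "auto"
      num_threads: 4
    save_locally: true
    save_format: save_as_json
    output_dir: ./converted  # hypothetical output directory
```

Saved under a file name of your choice, this config runs through the same sinapsis run command shown above.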

📙 Documentation

Documentation is available on the Sinapsis website.

Tutorials for different projects within Sinapsis are available on the Sinapsis tutorials page.

🔍 License

This project is licensed under the AGPLv3 license, which encourages open collaboration and sharing. For more details, please refer to the LICENSE file.

For commercial use, please refer to our official Sinapsis website for information on obtaining a commercial license.
