Official Python SDK for the DocDigitizer document processing API

DocDigitizer Python SDK

Official Python client for the DocDigitizer document processing API.

Upload PDF documents and get structured data back. Supported document types include invoices, receipts, contracts, CVs, ID documents, and bank statements.

Installation

pip install docdigitizer

Quick Start

from docdigitizer import DocDigitizer

dd = DocDigitizer(api_key="your-api-key")

result = dd.process("invoice.pdf")

if result.is_completed:
    for extraction in result.extractions:
        print(f"Type: {extraction.document_type}")
        print(f"Confidence: {extraction.confidence}")
        print(f"Country: {extraction.country_code}")
        print(f"Data: {extraction.data}")
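
When a PDF yields several extractions, calling code may want to send low-confidence ones to manual review. The following is a minimal sketch, assuming confidence is a float between 0 and 1 (the sample value 0.95 suggests this, but the range is not stated here). The helper split_by_confidence and the 0.8 threshold are hypothetical, and SimpleNamespace objects stand in for real extractions so the snippet runs without an API key.

```python
from types import SimpleNamespace


def split_by_confidence(extractions, threshold=0.8):
    """Return (accepted, needs_review) lists split at the given threshold."""
    accepted, needs_review = [], []
    for extraction in extractions:
        if extraction.confidence >= threshold:
            accepted.append(extraction)
        else:
            needs_review.append(extraction)
    return accepted, needs_review


# Stand-in objects using the same attribute names as the SDK's extractions.
sample = [
    SimpleNamespace(document_type="Invoice", confidence=0.95),
    SimpleNamespace(document_type="Receipt", confidence=0.42),
]

accepted, needs_review = split_by_confidence(sample)
print([e.document_type for e in accepted])      # ['Invoice']
print([e.document_type for e in needs_review])  # ['Receipt']
```

With a real client, the same helper could be called as split_by_confidence(result.extractions).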

Async Usage

import asyncio
from docdigitizer import AsyncDocDigitizer

async def main():
    async with AsyncDocDigitizer(api_key="your-api-key") as dd:
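        result = await dd.process("invoice.pdf")
        # Guard against failed or empty results before indexing.
        if result.is_completed and result.extractions:
            print(result.extractions[0].data)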

asyncio.run(main())
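
The async client lends itself to processing several files at once. The sketch below shows one way to bound concurrency with asyncio.Semaphore and asyncio.gather. It assumes only that the client exposes an awaitable process(path) method, as shown above. process_many, max_concurrency, and FakeClient are hypothetical names used for illustration, and FakeClient replaces AsyncDocDigitizer so the example runs offline.

```python
import asyncio


async def process_many(client, paths, max_concurrency=4):
    """Process several files concurrently, at most max_concurrency at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def process_one(path):
        async with semaphore:
            return await client.process(path)

    # gather() returns results in the same order as the input paths.
    return await asyncio.gather(*(process_one(p) for p in paths))


class FakeClient:
    """Hypothetical stand-in for AsyncDocDigitizer, used only for this demo."""

    async def process(self, path):
        await asyncio.sleep(0)  # yield control, as a real network call would
        return f"processed {path}"


results = asyncio.run(process_many(FakeClient(), ["a.pdf", "b.pdf", "c.pdf"]))
print(results)  # ['processed a.pdf', 'processed b.pdf', 'processed c.pdf']
```

Keeping max_concurrency small is a conservative choice while the API's rate limits are unknown.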

File Input Options

The process() method accepts multiple file input types:

from pathlib import Path

# From file path (string or Path)
result = dd.process("path/to/invoice.pdf")
result = dd.process(Path("path/to/invoice.pdf"))

# From bytes
with open("invoice.pdf", "rb") as f:
    result = dd.process(f.read(), filename="invoice.pdf")

# From file-like object
with open("invoice.pdf", "rb") as f:
    result = dd.process(f)
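
Calling code that receives documents from mixed sources can reduce all three input kinds to bytes plus a filename before calling process(). This is a sketch of one possible approach, not the SDK's internal implementation. The helper to_bytes_and_name and the fallback name "document.pdf" are assumptions made for illustration.

```python
import io
import os
from pathlib import Path


def to_bytes_and_name(source, filename=None):
    """Normalize a path, bytes, or binary file-like object to (bytes, filename)."""
    if isinstance(source, (str, os.PathLike)):
        path = Path(source)
        return path.read_bytes(), filename or path.name
    if isinstance(source, (bytes, bytearray)):
        return bytes(source), filename or "document.pdf"
    # Otherwise treat the source as a binary file-like object.
    data = source.read()
    raw_name = getattr(source, "name", None)
    inferred = os.path.basename(raw_name) if isinstance(raw_name, str) else ""
    return data, filename or inferred or "document.pdf"


data, name = to_bytes_and_name(io.BytesIO(b"%PDF-1.4 demo"))
print(name, len(data))  # document.pdf 13
```

The result could then be passed on as dd.process(data, filename=name), matching the bytes form shown above.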

Configuration

dd = DocDigitizer(
    api_key="your-api-key",       # or set DOCDIGITIZER_API_KEY env var
    base_url="https://...",        # or set DOCDIGITIZER_BASE_URL env var
    timeout=300,                   # request timeout in seconds (default: 300)
    max_retries=3,                 # retries on 5xx/429 (default: 3)
)

Environment Variables

Variable                 Description
DOCDIGITIZER_API_KEY     API key (used if api_key arg not provided)
DOCDIGITIZER_BASE_URL    Base URL override
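
The precedence described above (an explicit argument first, then the environment variable) can be illustrated with a small sketch. This is not the SDK's code. resolve_settings is a hypothetical helper, and returning None for an unset base URL is an assumption meaning "use the client's default".

```python
import os


def resolve_settings(api_key=None, base_url=None, env=None):
    """Resolve settings: explicit arguments win over environment variables."""
    env = os.environ if env is None else env
    resolved_key = api_key or env.get("DOCDIGITIZER_API_KEY")
    if not resolved_key:
        raise ValueError("No API key: pass api_key or set DOCDIGITIZER_API_KEY")
    return {
        "api_key": resolved_key,
        # None here is taken to mean "use the client's default base URL".
        "base_url": base_url or env.get("DOCDIGITIZER_BASE_URL"),
    }


# A plain dict stands in for os.environ so the demo is reproducible.
fake_env = {"DOCDIGITIZER_API_KEY": "env-key"}
print(resolve_settings(env=fake_env))
# {'api_key': 'env-key', 'base_url': None}
print(resolve_settings(api_key="arg-key", env=fake_env)["api_key"])  # arg-key
```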

Processing Options

result = dd.process(
    "invoice.pdf",
    pipeline="MainPipelineWithOCR",  # or MainPipelineWithFile, SingleDocPipelineWithOCR
    id="custom-uuid",                # document ID (auto-generated if omitted)
    context_id="batch-uuid",         # grouping ID (auto-generated if omitted)
    request_token="ABC1234",         # trace token, max 7 chars
)
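
When submitting a batch, it can help to generate the identifiers up front so they can be logged alongside each file. The sketch below assumes that id and context_id accept UUID strings (the placeholder values above suggest this) and that request_token accepts uppercase letters and digits, as in the sample "ABC1234". Only the 7-character limit is stated in this document. new_request_token and batch_options are hypothetical helpers.

```python
import secrets
import string
import uuid

# Assumed alphabet, modeled on the sample token "ABC1234".
TOKEN_ALPHABET = string.ascii_uppercase + string.digits


def new_request_token(length=7):
    """Return a random trace token of at most 7 characters."""
    if not 1 <= length <= 7:
        raise ValueError("request_token must be 1 to 7 characters long")
    return "".join(secrets.choice(TOKEN_ALPHABET) for _ in range(length))


def batch_options(count):
    """Build per-document keyword arguments that share one context_id."""
    context_id = str(uuid.uuid4())
    return [
        {
            "id": str(uuid.uuid4()),
            "context_id": context_id,
            "request_token": new_request_token(),
        }
        for _ in range(count)
    ]


options = batch_options(3)
print(len(options), len({o["context_id"] for o in options}))  # 3 1
```

Each dictionary could then be unpacked into a call such as dd.process("invoice.pdf", **options[0]).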

Response Models

result = dd.process("invoice.pdf")

result.state           # "COMPLETED", "PROCESSING", or "ERROR"
result.trace_id        # "ABC1234" — unique request identifier
result.pipeline        # "MainPipelineWithOCR"
result.num_pages       # 2
result.is_completed    # True
result.is_error        # False
result.messages        # ["Document processed successfully"]
result.timers          # {"DocIngester": {"total": 2345.67}}

# Extractions
for ext in result.extractions:
    ext.document_type  # "Invoice"
    ext.confidence     # 0.95
    ext.country_code   # "PT"
    ext.page_range     # PageRange(start=1, end=2)
    ext.data           # {"invoiceNumber": "INV-001", "totalAmount": 1250.00, ...}
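
The extraction fields above can be combined into a one-line summary for logs. This sketch assumes page_range is inclusive on both ends, which PageRange(start=1, end=2) for a two-page document suggests but this document does not state. summarize is a hypothetical helper, and the namedtuple types are stand-ins for the SDK's models.

```python
from collections import namedtuple

# Stand-ins that mirror the attribute names shown above.
PageRange = namedtuple("PageRange", ["start", "end"])
Extraction = namedtuple(
    "Extraction", ["document_type", "confidence", "country_code", "page_range"]
)


def summarize(extraction):
    """Format one extraction as a short, log-friendly line."""
    pages = extraction.page_range
    page_count = pages.end - pages.start + 1  # assumes an inclusive range
    return (
        f"{extraction.document_type} ({extraction.country_code}), "
        f"pages {pages.start}-{pages.end} ({page_count} total), "
        f"confidence {extraction.confidence:.0%}"
    )


demo = Extraction("Invoice", 0.95, "PT", PageRange(start=1, end=2))
print(summarize(demo))
# Invoice (PT), pages 1-2 (2 total), confidence 95%
```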

Error Handling

from docdigitizer import DocDigitizer
from docdigitizer.exceptions import (
    AuthenticationError,
    ValidationError,
    ServerError,
    TimeoutError,  # note: shadows the built-in TimeoutError in this module
    ServiceUnavailableError,
    RateLimitError,
)

dd = DocDigitizer(api_key="your-api-key")

try:
    result = dd.process("invoice.pdf")
except AuthenticationError:
    print("Invalid API key")
except ValidationError as e:
    print(f"Bad request: {e.messages}")
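except TimeoutError:
    print("Processing took too long")
except RateLimitError:
    print("Rate limit exceeded; retry later")
except ServiceUnavailableError:
    print("Service temporarily unavailable")
except ServerError as e:
    print(f"Server error (trace: {e.trace_id})")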

All exceptions inherit from DocDigitizerError and carry:

  • status_code — HTTP status code
  • trace_id — request trace ID (for support)
  • messages — error detail messages
  • timers — processing time metrics
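
Because every exception carries the same four attributes, a single formatter can produce a support-friendly line for any of them. The sketch below is illustrative only. describe_error is a hypothetical helper, and FakeSdkError is a stand-in class so the example runs without the SDK installed. getattr with a default keeps the helper safe for exceptions that lack these attributes.

```python
class FakeSdkError(Exception):
    """Stand-in exposing the attributes listed above (not the SDK's class)."""

    def __init__(self, message, status_code=None, trace_id=None, messages=None):
        super().__init__(message)
        self.status_code = status_code
        self.trace_id = trace_id
        self.messages = messages or []


def describe_error(exc):
    """Build a one-line description suitable for logs or support tickets."""
    parts = [type(exc).__name__]
    status = getattr(exc, "status_code", None)
    trace = getattr(exc, "trace_id", None)
    messages = getattr(exc, "messages", None) or []
    if status is not None:
        parts.append(f"status={status}")
    if trace:
        parts.append(f"trace={trace}")
    if messages:
        parts.append("messages=" + "; ".join(str(m) for m in messages))
    return " | ".join(parts)


error = FakeSdkError(
    "boom", status_code=500, trace_id="ABC1234", messages=["Internal error"]
)
print(describe_error(error))
# FakeSdkError | status=500 | trace=ABC1234 | messages=Internal error
```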

Health Check

status = dd.health_check()  # "I am alive"
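
A startup script may want to wait until the API is reachable before submitting documents. The sketch below polls any zero-argument callable, such as dd.health_check. It assumes that a healthy response is exactly the string "I am alive", as the comment above suggests, and it treats any exception as "not ready yet", which is a simplification. wait_until_healthy and its parameters are hypothetical.

```python
import time


def wait_until_healthy(check, attempts=5, delay=1.0, sleep=time.sleep):
    """Call check() until it returns "I am alive" or attempts run out."""
    for attempt in range(attempts):
        try:
            if check() == "I am alive":
                return True
        except Exception:
            # Simplification: any failure counts as "not ready yet".
            pass
        if attempt < attempts - 1:
            sleep(delay)
    return False


# Fake check that fails, then reports an unexpected status, then succeeds.
responses = iter([ConnectionError("down"), "starting", "I am alive"])


def fake_check():
    item = next(responses)
    if isinstance(item, Exception):
        raise item
    return item


print(wait_until_healthy(fake_check, sleep=lambda _seconds: None))  # True
```

With a real client, the call would look like wait_until_healthy(dd.health_check).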

Supported Operations

This SDK supports the following API operations (defined in sdk-manifest.yaml):

Method               API Operation      Description
dd.process()         processDocument    Upload and process a PDF
dd.health_check()    checkHealth        Check API availability

Development

cd sdks/python
pip install -e ".[dev]"

# Run tests
pytest tests/ -m "not integration" -v

# Run with live API
DD_API_KEY=your-key pytest tests/test_integration.py -v

# Lint
ruff check src/ tests/
ruff format src/ tests/

Requirements

  • Python >= 3.8
  • httpx >= 0.24.0
