Official Python SDK for the DocDigitizer document processing API

DocDigitizer Python SDK

Official Python client for the DocDigitizer document processing API.

Upload PDF documents (invoices, receipts, contracts, CVs, ID documents, and bank statements) and get structured data back.

Installation

pip install docdigitizer

Quick Start

from docdigitizer import DocDigitizer

dd = DocDigitizer(api_key="your-api-key")

result = dd.process("invoice.pdf")

if result.is_completed:
    for extraction in result.extractions:
        print(f"Type: {extraction.document_type}")
        print(f"Confidence: {extraction.confidence}")
        print(f"Country: {extraction.country_code}")
        print(f"Data: {extraction.data}")

Async Usage

import asyncio
from docdigitizer import AsyncDocDigitizer

async def main():
    async with AsyncDocDigitizer(api_key="your-api-key") as dd:
        result = await dd.process("invoice.pdf")
        print(result.extractions[0].data)

asyncio.run(main())
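When several files need processing, the async client can be driven concurrently. The following is a minimal sketch, not part of the SDK: `process_many` and its `limit` parameter are hypothetical names, and the only thing assumed about `dd` is that it exposes an awaitable `process(path)` method, as `AsyncDocDigitizer` does above.

```python
import asyncio


async def process_many(dd, paths, limit=4):
    """Process several files concurrently, at most `limit` at a time.

    `dd` is any object with an awaitable process(path) method,
    such as an AsyncDocDigitizer instance.
    """
    semaphore = asyncio.Semaphore(limit)

    async def process_one(path):
        # The semaphore caps how many uploads are in flight at once.
        async with semaphore:
            return await dd.process(path)

    # return_exceptions=True returns failures in place instead of
    # raising the first one, so results line up with `paths`.
    return await asyncio.gather(
        *(process_one(p) for p in paths), return_exceptions=True
    )
```

Inside `async with AsyncDocDigitizer(...) as dd:` this could be called as `results = await process_many(dd, ["a.pdf", "b.pdf"])`; each entry is either a result or the exception raised for that file. A suitable `limit` depends on your account's rate limits, which this README does not state.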

File Input Options

The process() method accepts multiple file input types:

from pathlib import Path

# From file path (string or Path)
result = dd.process("path/to/invoice.pdf")
result = dd.process(Path("path/to/invoice.pdf"))

# From bytes
with open("invoice.pdf", "rb") as f:
    result = dd.process(f.read(), filename="invoice.pdf")

# From file-like object
with open("invoice.pdf", "rb") as f:
    result = dd.process(f)
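Because the API processes PDFs, it can be useful to reject obviously wrong inputs before uploading. The sketch below is a hypothetical pre-flight check, not an SDK function: `looks_like_pdf` tests for the `%PDF-` header that PDF files start with. It is deliberately strict, so a file with stray bytes before the header would be rejected even though some readers tolerate that.

```python
from pathlib import Path


def looks_like_pdf(source) -> bool:
    """Return True if the bytes, or the file at `source`, begin with the PDF header."""
    if isinstance(source, (bytes, bytearray)):
        head = bytes(source[:5])
    else:
        # Treat anything else as a filesystem path (str or Path).
        with open(Path(source), "rb") as f:
            head = f.read(5)
    return head == b"%PDF-"
```

A caller might guard an upload with `if looks_like_pdf("invoice.pdf"): result = dd.process("invoice.pdf")`.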

Configuration

dd = DocDigitizer(
    api_key="your-api-key",       # or set DOCDIGITIZER_API_KEY env var
    base_url="https://...",        # or set DOCDIGITIZER_BASE_URL env var
    timeout=300,                   # request timeout in seconds (default: 300)
    max_retries=3,                 # retries on 5xx/429 (default: 3)
)

Environment Variables

Variable               Description
DOCDIGITIZER_API_KEY   API key (used if api_key arg not provided)
DOCDIGITIZER_BASE_URL  Base URL override
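The lookup order described above (an explicit argument first, then the environment variable) can be sketched as follows. `resolve_config` is a hypothetical helper, not the SDK's own implementation; it only mirrors the documented behaviour so that a missing key fails early with a clear message.

```python
import os
from typing import Mapping, Optional


def resolve_config(
    api_key: Optional[str] = None,
    base_url: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> dict:
    """Hypothetical helper: prefer explicit arguments, then environment variables."""
    env = os.environ if env is None else env
    key = api_key or env.get("DOCDIGITIZER_API_KEY")
    if not key:
        raise ValueError("No API key: pass api_key or set DOCDIGITIZER_API_KEY")
    config = {"api_key": key}
    url = base_url or env.get("DOCDIGITIZER_BASE_URL")
    if url:
        # Include base_url only when one was given, so the SDK default applies otherwise.
        config["base_url"] = url
    return config
```

The result can be unpacked into the constructor, for example `dd = DocDigitizer(**resolve_config())`.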

Processing Options

result = dd.process(
    "invoice.pdf",
    pipeline="MainPipelineWithOCR",  # or MainPipelineWithFile, SingleDocPipelineWithOCR
    id="custom-uuid",                # document ID (auto-generated if omitted)
    context_id="batch-uuid",         # grouping ID (auto-generated if omitted)
    request_token="ABC1234",         # trace token, max 7 chars
)
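If you prefer to supply your own identifiers rather than rely on auto-generation, they can be built with the standard library. This is a sketch under assumptions: `new_request_ids` is a hypothetical helper, the IDs are assumed to be UUID strings (the placeholders above suggest this but do not confirm it), and the uppercase-alphanumeric token alphabet is a guess based on the "ABC1234" example. Only the 7-character limit comes from this README.

```python
import secrets
import string
import uuid


def new_request_ids(context_id=None):
    """Hypothetical helper: build id, context_id, and request_token keyword arguments."""
    alphabet = string.ascii_uppercase + string.digits  # assumed character set
    return {
        "id": str(uuid.uuid4()),
        # Reuse the caller's context_id so related documents share one group.
        "context_id": context_id or str(uuid.uuid4()),
        # 7 characters, the documented maximum for request_token.
        "request_token": "".join(secrets.choice(alphabet) for _ in range(7)),
    }
```

To group a batch, generate one context ID and pass it for every file, for example `dd.process("a.pdf", **new_request_ids(batch_id))`.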

Response Models

result = dd.process("invoice.pdf")

result.state           # "COMPLETED", "PROCESSING", or "ERROR"
result.trace_id        # "ABC1234" — unique request identifier
result.pipeline        # "MainPipelineWithOCR"
result.num_pages       # 2
result.is_completed    # True
result.is_error        # False
result.messages        # ["Document processed successfully"]
result.timers          # {"DocIngester": {"total": 2345.67}}

# Extractions
for ext in result.extractions:
    ext.document_type  # "Invoice"
    ext.confidence     # 0.95
    ext.country_code   # "PT"
    ext.page_range     # PageRange(start=1, end=2)
    ext.data           # {"invoiceNumber": "INV-001", "totalAmount": 1250.00, ...}
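A common follow-up step is to keep only extractions the service is confident about. The sketch below assumes `confidence` is a float between 0 and 1, as the 0.95 example suggests. `confident_extractions` and the 0.8 threshold are illustrative choices, not SDK features, and the `SimpleNamespace` objects stand in for real extractions.

```python
from types import SimpleNamespace


def confident_extractions(extractions, min_confidence=0.8, document_type=None):
    """Return extractions at or above min_confidence, optionally of one document type."""
    selected = []
    for ext in extractions:
        if ext.confidence < min_confidence:
            continue
        if document_type is not None and ext.document_type != document_type:
            continue
        selected.append(ext)
    return selected


# Stand-in objects with the attributes listed above;
# real code would pass result.extractions instead.
sample = [
    SimpleNamespace(document_type="Invoice", confidence=0.95, data={"invoiceNumber": "INV-001"}),
    SimpleNamespace(document_type="Receipt", confidence=0.42, data={}),
]
kept = confident_extractions(sample, min_confidence=0.8)
print([ext.document_type for ext in kept])  # ['Invoice']
```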

Error Handling

from docdigitizer import DocDigitizer
from docdigitizer.exceptions import (
    AuthenticationError,
    ValidationError,
    ServerError,
    TimeoutError,  # note: this import shadows Python's built-in TimeoutError in this module
    ServiceUnavailableError,
    RateLimitError,
)

dd = DocDigitizer(api_key="your-api-key")

try:
    result = dd.process("invoice.pdf")
except AuthenticationError:
    print("Invalid API key")
except ValidationError as e:
    print(f"Bad request: {e.messages}")
except TimeoutError:
    print("Processing took too long")
except ServerError as e:
    print(f"Server error (trace: {e.trace_id})")

All exceptions inherit from DocDigitizerError and carry:

  • status_code — HTTP status code
  • trace_id — request trace ID (for support)
  • messages — error detail messages
  • timers — processing time metrics
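The attributes listed above can be folded into a single log line, which is convenient when quoting a trace ID in a support request. `describe_error` is a hypothetical helper, not part of the SDK; it reads the attributes with `getattr` so it also tolerates exceptions that lack them.

```python
def describe_error(exc):
    """Build a one-line summary from status_code, trace_id, and messages.

    getattr with a default keeps this safe for exceptions that lack a field.
    """
    status = getattr(exc, "status_code", None)
    trace = getattr(exc, "trace_id", None)
    messages = getattr(exc, "messages", None) or []
    parts = [type(exc).__name__]
    if status is not None:
        parts.append(f"status={status}")
    if trace:
        parts.append(f"trace={trace}")
    if messages:
        parts.append("; ".join(str(m) for m in messages))
    return " | ".join(parts)
```

In practice this would sit in a handler such as `except DocDigitizerError as e: log.error(describe_error(e))`, assuming the base class is importable from `docdigitizer.exceptions` like the subclasses shown earlier.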

Health Check

status = dd.health_check()  # "I am alive"
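The health check can also gate a batch job until the API is reachable. The following is a sketch under one stated assumption: that `health_check()` raises an exception when the service is unavailable, which this README does not confirm. `wait_until_alive`, `attempts`, and `delay` are hypothetical names.

```python
import time


def wait_until_alive(dd, attempts=5, delay=1.0):
    """Call dd.health_check() until it succeeds or the attempts run out."""
    for attempt in range(attempts):
        try:
            dd.health_check()
            return True
        except Exception:
            # Any failure counts as "not ready" here; narrow this to the
            # SDK's exception types in real code.
            if attempt < attempts - 1:
                time.sleep(delay)
    return False
```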

Supported Operations

This SDK supports the following API operations (defined in sdk-manifest.yaml):

Method             API Operation    Description
dd.process()       processDocument  Upload and process a PDF
dd.health_check()  checkHealth      Check API availability

Development

cd sdks/python
pip install -e ".[dev]"

# Run tests
pytest tests/ -m "not integration" -v

# Run with live API
DD_API_KEY=your-key pytest tests/test_integration.py -v

# Lint
ruff check src/ tests/
ruff format src/ tests/

Requirements

  • Python >= 3.8
  • httpx >= 0.24.0

Download files

Source Distribution

docdigitizer-0.2.0.tar.gz (13.7 kB view details)

Uploaded Source

Built Distribution

docdigitizer-0.2.0-py3-none-any.whl (12.3 kB view details)

Uploaded Python 3

File details

Details for the file docdigitizer-0.2.0.tar.gz.

File metadata

  • Download URL: docdigitizer-0.2.0.tar.gz
  • Upload date:
  • Size: 13.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for docdigitizer-0.2.0.tar.gz
Algorithm    Hash digest
SHA256       ef1ac2ebdfb71539825746461c936ceac77a49afda775cbc8d103c55f5b64532
MD5          726517a2fd8af6a17468a5e211540dce
BLAKE2b-256  43e4fd632b7c862ad1230b27df0f6a72ca72568e97b043e2b37cad51faa09752

Provenance

The following attestation bundles were made for docdigitizer-0.2.0.tar.gz:

Publisher: publish-python-sdk.yml on DocDigitizer/dd-v3-integrations

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file docdigitizer-0.2.0-py3-none-any.whl.

File metadata

  • Download URL: docdigitizer-0.2.0-py3-none-any.whl
  • Upload date:
  • Size: 12.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for docdigitizer-0.2.0-py3-none-any.whl
Algorithm    Hash digest
SHA256       56b48dd3ed3d531735043f1bc484a7ed8cb53000d584bd506f5dbd52a39040bc
MD5          c88f6e90a322f3d88231335549fa1694
BLAKE2b-256  12c76985bab95fcfed167ca63623e3759057c91cdef630e495178153f465bd30

Provenance

The following attestation bundles were made for docdigitizer-0.2.0-py3-none-any.whl:

Publisher: publish-python-sdk.yml on DocDigitizer/dd-v3-integrations

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
