Official Python SDK for the DocDigitizer document processing API
DocDigitizer Python SDK
Official Python client for the DocDigitizer document processing API.
Upload PDF documents and get structured data back — invoices, receipts, contracts, CVs, ID documents, and bank statements.
Installation
pip install docdigitizer
Quick Start
from docdigitizer import DocDigitizer
dd = DocDigitizer(api_key="your-api-key")
result = dd.process("invoice.pdf")
if result.is_completed:
    for extraction in result.extractions:
        print(f"Type: {extraction.document_type}")
        print(f"Confidence: {extraction.confidence}")
        print(f"Country: {extraction.country_code}")
        print(f"Data: {extraction.data}")
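Downstream code often wants to act only on extractions it can trust. The following is a minimal sketch, assuming `confidence` is a float between 0 and 1 (as the sample value 0.95 elsewhere in this README suggests). The helper name `confident_extractions` and the 0.8 threshold are hypothetical and not part of the SDK; a stand-in object replaces a real API result so the sketch runs offline.

```python
from types import SimpleNamespace


def confident_extractions(result, min_confidence=0.8):
    """Return extractions from a completed result at or above min_confidence.

    Hypothetical helper: it relies only on the attributes used in the
    Quick Start (is_completed, extractions, confidence).
    """
    if not result.is_completed:
        return []
    return [e for e in result.extractions if e.confidence >= min_confidence]


# Stand-in for a real result so the sketch runs without calling the API.
fake_result = SimpleNamespace(
    is_completed=True,
    extractions=[
        SimpleNamespace(document_type="Invoice", confidence=0.95),
        SimpleNamespace(document_type="Receipt", confidence=0.40),
    ],
)
kept = confident_extractions(fake_result)
print([e.document_type for e in kept])
```

With a real client, the same helper would be called as `confident_extractions(dd.process("invoice.pdf"))`.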
Async Usage
import asyncio
from docdigitizer import AsyncDocDigitizer
async def main():
    async with AsyncDocDigitizer(api_key="your-api-key") as dd:
        result = await dd.process("invoice.pdf")
        print(result.extractions[0].data)

asyncio.run(main())
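The async client is most useful when several documents are processed at once. The sketch below bounds the number of requests in flight with a semaphore. `process_many`, `max_concurrency`, and `FakeClient` are hypothetical names introduced for illustration; the only thing assumed about the real client is the awaitable `process(path)` call shown above.

```python
import asyncio


async def process_many(client, paths, max_concurrency=4):
    """Process several files concurrently with a bounded number in flight.

    `client` is anything with an awaitable process(path) method, such as
    the AsyncDocDigitizer instance from the example above.
    """
    semaphore = asyncio.Semaphore(max_concurrency)

    async def process_one(path):
        async with semaphore:
            return await client.process(path)

    # gather returns results in the same order as the input paths.
    return await asyncio.gather(*(process_one(p) for p in paths))


class FakeClient:
    """Stand-in for the real client so the sketch runs offline."""

    async def process(self, path):
        await asyncio.sleep(0)
        return f"processed:{path}"


results = asyncio.run(process_many(FakeClient(), ["a.pdf", "b.pdf"]))
print(results)
```

A suitable `max_concurrency` depends on the API's rate limits, which this README does not state.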
File Input Options
The process() method accepts multiple file input types:
from pathlib import Path

# From file path (string or Path)
result = dd.process("path/to/invoice.pdf")
result = dd.process(Path("path/to/invoice.pdf"))
# From bytes
with open("invoice.pdf", "rb") as f:
    result = dd.process(f.read(), filename="invoice.pdf")
# From file-like object
with open("invoice.pdf", "rb") as f:
    result = dd.process(f)
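Since the API processes PDF documents, it can be worth rejecting obviously wrong input before uploading. This is a sketch of a strict local check on the `%PDF-` header that PDF files start with. Both helpers are hypothetical, and whether the SDK performs a similar check itself is not stated here.

```python
import io


def looks_like_pdf(data):
    """Return True if the bytes start with the PDF header '%PDF-'."""
    return data[:5] == b"%PDF-"


def read_pdf_bytes(source):
    """Read all bytes from a binary file-like object and verify the header."""
    data = source.read()
    if not looks_like_pdf(data):
        raise ValueError("input does not start with a PDF header")
    return data


# In-memory stand-in for open("invoice.pdf", "rb").
sample = io.BytesIO(b"%PDF-1.7\n...rest of file...")
payload = read_pdf_bytes(sample)
print(len(payload))
```

The returned bytes could then be passed on as in the bytes example above: `dd.process(payload, filename="invoice.pdf")`.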
Configuration
dd = DocDigitizer(
    api_key="your-api-key",   # or set DOCDIGITIZER_API_KEY env var
    base_url="https://...",   # or set DOCDIGITIZER_BASE_URL env var
    timeout=300,              # request timeout in seconds (default: 300)
    max_retries=3,            # retries on 5xx/429 (default: 3)
)
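To make the `max_retries` comment concrete, here is an illustrative sketch of a retry policy for 429 and 5xx responses. The exponential schedule and the 0.5 second base delay are assumptions chosen for illustration; this README does not document the backoff the SDK actually uses, and both function names are hypothetical.

```python
def should_retry(status_code):
    """Retry on rate limiting (429) and on server errors (5xx)."""
    return status_code == 429 or 500 <= status_code <= 599


def backoff_delays(max_retries=3, base_seconds=0.5):
    """Delays before each retry: base, 2 * base, 4 * base, and so on."""
    return [base_seconds * (2 ** attempt) for attempt in range(max_retries)]


print(should_retry(503), should_retry(404))
print(backoff_delays())
```

Client errors such as 400 or 401 are not retried in this sketch, because repeating the same request would fail the same way.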
Environment Variables
| Variable | Description |
|---|---|
| DOCDIGITIZER_API_KEY | API key (used if api_key arg not provided) |
| DOCDIGITIZER_BASE_URL | Base URL override |
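The precedence described in the table (explicit argument first, environment variable as fallback) can be sketched as follows. `resolve_setting` is a hypothetical helper that illustrates the lookup order only; it is not the SDK's internal code.

```python
import os


def resolve_setting(explicit, env_name, environ=None):
    """Return the explicit argument if given, else the environment variable.

    Returns None when neither source provides a value.
    """
    environ = os.environ if environ is None else environ
    if explicit is not None:
        return explicit
    return environ.get(env_name)


# A plain dict stands in for os.environ so the sketch is reproducible.
env = {"DOCDIGITIZER_API_KEY": "key-from-env"}
print(resolve_setting(None, "DOCDIGITIZER_API_KEY", env))
print(resolve_setting("key-from-arg", "DOCDIGITIZER_API_KEY", env))
```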
Processing Options
result = dd.process(
    "invoice.pdf",
    pipeline="MainPipelineWithOCR",  # or MainPipelineWithFile, SingleDocPipelineWithOCR
    id="custom-uuid",                # document ID (auto-generated if omitted)
    context_id="batch-uuid",         # grouping ID (auto-generated if omitted)
    request_token="ABC1234",         # trace token, max 7 chars
)
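When supplying these identifiers yourself, one approach is to generate one `context_id` per batch and a fresh `id` per document. The sketch below assumes UUID strings are acceptable for both (the placeholder values above hint at this, but it is not confirmed here) and that uppercase letters and digits are valid in `request_token`; the allowed character set is not documented. Both helper names are hypothetical.

```python
import secrets
import string
import uuid

# Assumption: the permitted token characters are not documented.
TOKEN_ALPHABET = string.ascii_uppercase + string.digits


def new_request_token(length=7):
    """Build a random trace token within the 7-character limit."""
    if not 1 <= length <= 7:
        raise ValueError("request_token may be at most 7 characters")
    return "".join(secrets.choice(TOKEN_ALPHABET) for _ in range(length))


def new_batch_ids(count):
    """Return one shared context_id and a distinct document id per file."""
    context_id = str(uuid.uuid4())
    return context_id, [str(uuid.uuid4()) for _ in range(count)]


token = new_request_token()
context_id, document_ids = new_batch_ids(3)
print(len(token), len(document_ids))
```

Each document in the batch would then be sent with `id=document_ids[i]`, `context_id=context_id`, and its own `request_token`.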
Response Models
result = dd.process("invoice.pdf")
result.state # "COMPLETED", "PROCESSING", or "ERROR"
result.trace_id # "ABC1234" — unique request identifier
result.pipeline # "MainPipelineWithOCR"
result.num_pages # 2
result.is_completed # True
result.is_error # False
result.messages # ["Document processed successfully"]
result.timers # {"DocIngester": {"total": 2345.67}}
# Extractions
for ext in result.extractions:
    ext.document_type   # "Invoice"
    ext.confidence      # 0.95
    ext.country_code    # "PT"
    ext.page_range      # PageRange(start=1, end=2)
    ext.data            # {"invoiceNumber": "INV-001", "totalAmount": 1250.00, ...}
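For logging or storage, it can help to flatten a result into plain dictionaries. This is a minimal sketch using only the attributes listed above. It assumes `page_range` exposes `start` and `end` and that `ext.data` is a JSON-serializable dict; `result_to_dict` is a hypothetical helper, and the stand-in object mirrors the sample values rather than real API output.

```python
import json
from types import SimpleNamespace


def result_to_dict(result):
    """Convert a result-like object into JSON-friendly plain data."""
    return {
        "state": result.state,
        "trace_id": result.trace_id,
        "num_pages": result.num_pages,
        "extractions": [
            {
                "document_type": ext.document_type,
                "confidence": ext.confidence,
                "country_code": ext.country_code,
                "pages": [ext.page_range.start, ext.page_range.end],
                "data": ext.data,
            }
            for ext in result.extractions
        ],
    }


# Stand-in built from the sample values shown above.
fake = SimpleNamespace(
    state="COMPLETED",
    trace_id="ABC1234",
    num_pages=2,
    extractions=[
        SimpleNamespace(
            document_type="Invoice",
            confidence=0.95,
            country_code="PT",
            page_range=SimpleNamespace(start=1, end=2),
            data={"invoiceNumber": "INV-001", "totalAmount": 1250.00},
        )
    ],
)
summary = result_to_dict(fake)
print(json.dumps(summary, indent=2))
```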
Error Handling
from docdigitizer import DocDigitizer
from docdigitizer.exceptions import (
    AuthenticationError,
    ValidationError,
    ServerError,
    TimeoutError,
    ServiceUnavailableError,
    RateLimitError,
)
dd = DocDigitizer(api_key="your-api-key")
try:
    result = dd.process("invoice.pdf")
except AuthenticationError:
    print("Invalid API key")
except ValidationError as e:
    print(f"Bad request: {e.messages}")
except TimeoutError:
    print("Processing took too long")
except ServerError as e:
    print(f"Server error (trace: {e.trace_id})")
All exceptions inherit from DocDigitizerError and carry:
- status_code: HTTP status code
- trace_id: request trace ID (for support)
- messages: error detail messages
- timers: processing time metrics
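Because every exception carries the same fields, a single helper can turn any of them into a one-line message for logs or support tickets. The sketch below reads only the attributes listed above and tolerates missing ones. `describe_error` and `FakeServerError` are hypothetical; the latter stands in for an SDK exception so the example runs without the package installed.

```python
def describe_error(exc):
    """Build a one-line summary from the shared exception attributes."""
    parts = [type(exc).__name__]
    status = getattr(exc, "status_code", None)
    if status is not None:
        parts.append(f"status={status}")
    trace = getattr(exc, "trace_id", None)
    if trace:
        parts.append(f"trace={trace}")
    messages = getattr(exc, "messages", None) or []
    if messages:
        parts.append("messages=" + "; ".join(messages))
    return " ".join(parts)


class FakeServerError(Exception):
    """Stand-in carrying the same attributes as the SDK exceptions."""

    def __init__(self, status_code, trace_id, messages):
        super().__init__(*messages)
        self.status_code = status_code
        self.trace_id = trace_id
        self.messages = messages


line = describe_error(FakeServerError(503, "ABC1234", ["upstream unavailable"]))
print(line)
```

In real code the helper would sit in an `except DocDigitizerError as e:` branch, assuming the base class is importable from `docdigitizer.exceptions` like the subclasses above.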
Health Check
status = dd.health_check() # "I am alive"
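A health check is often polled at startup before any documents are sent. The sketch below retries a callable until it succeeds. It assumes `health_check()` raises an exception when the API is unreachable, which this README does not state. `wait_until_healthy` is a hypothetical helper, and `flaky_check` simulates a service that comes up on the third call.

```python
import time


def wait_until_healthy(check, attempts=5, delay_seconds=1.0, sleep=time.sleep):
    """Call check() until it succeeds; re-raise the last error if it never does."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return check()
        except Exception as exc:
            last_error = exc
            # Pause between attempts, but not after the final one.
            if attempt < attempts - 1:
                sleep(delay_seconds)
    raise last_error


calls = {"count": 0}


def flaky_check():
    """Simulated health check that fails twice before succeeding."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("not ready yet")
    return "I am alive"


# The sleep function is replaced with a no-op so the demo finishes instantly.
status = wait_until_healthy(flaky_check, sleep=lambda seconds: None)
print(status, calls["count"])
```

With a real client the call would be `wait_until_healthy(dd.health_check)`.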
Supported Operations
This SDK supports the following API operations (defined in sdk-manifest.yaml):
| Method | API Operation | Description |
|---|---|---|
| dd.process() | processDocument | Upload and process a PDF |
| dd.health_check() | checkHealth | Check API availability |
Development
cd sdks/python
pip install -e ".[dev]"
# Run tests
pytest tests/ -m "not integration" -v
# Run with live API
DD_API_KEY=your-key pytest tests/test_integration.py -v
# Lint
ruff check src/ tests/
ruff format src/ tests/
Requirements
- Python >= 3.8
- httpx >= 0.24.0