llama-index-readers-docdigitizer

LlamaIndex reader for the DocDigitizer document processing API.

v0.1.x is deprecated. Upgrade to v0.2.0+ for the new API endpoint. The previous endpoint (https://apix.docdigitizer.com/sync) will be removed in a future release.
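To find out whether an environment still runs a deprecated release, the installed version can be compared against 0.2.0. The following is a minimal sketch using only the standard library; `is_deprecated` and `check_installed` are hypothetical helper names, not part of the package.

```python
from importlib.metadata import PackageNotFoundError, version

PACKAGE = "llama-index-readers-docdigitizer"


def is_deprecated(version_string: str) -> bool:
    """Return True for versions below 0.2, which the notice above marks as deprecated."""
    # Only the major and minor components matter for this check.
    major, minor = (int(part) for part in version_string.split(".")[:2])
    return (major, minor) < (0, 2)


def check_installed() -> str:
    """Hypothetical helper: describe the state of the installed package."""
    try:
        installed = version(PACKAGE)
    except PackageNotFoundError:
        return f"{PACKAGE} is not installed"
    if is_deprecated(installed):
        return f"{PACKAGE} {installed} is deprecated; upgrade to 0.2.0 or later"
    return f"{PACKAGE} {installed} is not in the deprecated 0.1.x series"


if __name__ == "__main__":
    print(check_installed())
```

Running `pip install --upgrade llama-index-readers-docdigitizer` moves a deprecated install to the latest release.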

Installation

pip install llama-index-readers-docdigitizer

Usage

from llama_index.readers.docdigitizer import DocDigitizerReader

# Load a single PDF
reader = DocDigitizerReader(api_key="dd_live_...")
documents = reader.load_data(file_path="invoice.pdf")

print(documents[0].text)          # JSON with extracted fields
print(documents[0].metadata)      # document_type, confidence, etc.

# Load all PDFs from a directory
documents = reader.load_data(file_path="invoices/")

# Use in a RAG pipeline
from llama_index.core import VectorStoreIndex

index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()
response = query_engine.query("What is the invoice total?")

Configuration

Parameter	Environment Variable	Default
`api_key`	`DOCDIGITIZER_API_KEY`	—
`base_url`	`DOCDIGITIZER_BASE_URL`	`https://api.docdigitizer.com/v3/docingester`
`timeout`	`DOCDIGITIZER_TIMEOUT`	`300`
`max_retries`	—	`3`
`pipeline`	—	`None`
`content_format`	—	`"json"`
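The table lists environment variables and defaults but does not say which source wins when several are set. The sketch below assumes the common order of explicit argument first, then environment variable, then default; `resolve_settings` is a hypothetical helper and the reader's actual resolution logic may differ.

```python
import os
from typing import Mapping, Optional

# Defaults copied from the configuration table above.
DEFAULT_BASE_URL = "https://api.docdigitizer.com/v3/docingester"
DEFAULT_TIMEOUT = 300


def resolve_settings(
    api_key: Optional[str] = None,
    base_url: Optional[str] = None,
    timeout: Optional[int] = None,
    env: Optional[Mapping[str, str]] = None,
) -> dict:
    """Hypothetical helper: explicit argument, then environment, then default."""
    env = os.environ if env is None else env

    key = api_key or env.get("DOCDIGITIZER_API_KEY")
    if not key:
        # api_key has no default in the table, so it must come from somewhere.
        raise ValueError("set api_key or DOCDIGITIZER_API_KEY")

    if timeout is None:
        timeout = env.get("DOCDIGITIZER_TIMEOUT", DEFAULT_TIMEOUT)

    return {
        "api_key": key,
        "base_url": base_url or env.get("DOCDIGITIZER_BASE_URL", DEFAULT_BASE_URL),
        "timeout": int(timeout),  # environment values arrive as strings
    }
```

Assuming the reader's constructor accepts these names as keyword arguments, as the Parameter column suggests, the result could be passed as `DocDigitizerReader(**resolve_settings())`.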

Content Formats

"json" (default): the document text is a JSON string of the extracted fields
"text": one `key: value` pair per line
"kv": one `key=value` pair per line
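Downstream code often needs the extracted fields back as a dictionary. The sketch below parses each of the three formats described above; `parse_document_text` is a hypothetical helper, the sample field names are invented, and it assumes keys contain no separator characters and values contain no newlines.

```python
import json


def parse_document_text(text: str, content_format: str = "json") -> dict:
    """Hypothetical helper: turn a Document's text back into a dict of fields."""
    if content_format == "json":
        return json.loads(text)

    separators = {"text": ":", "kv": "="}
    if content_format not in separators:
        raise ValueError(f"unknown content_format: {content_format!r}")
    separator = separators[content_format]

    fields = {}
    for line in text.splitlines():
        # Split on the first separator only, so values may contain it.
        key, found, value = line.partition(separator)
        if found:
            fields[key.strip()] = value.strip()
    return fields


# Invented sample payloads, one per format.
print(parse_document_text('{"invoice_number": "INV-001", "total": 120.5}'))
print(parse_document_text("invoice_number: INV-001\ntotal: 120.5", "text"))
print(parse_document_text("invoice_number=INV-001\ntotal=120.5", "kv"))
```

Only the "json" format preserves value types; with "text" and "kv" every value comes back as a string.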

Document Metadata

Each LlamaIndex Document includes metadata:

Field	Type	Description
`source`	`str`	File path of the processed PDF
`document_type`	`str`	Detected document type (e.g., "Invoice")
`confidence`	`float`	Classification confidence (0-1)
`country_code`	`str`	Detected country code (e.g., "PT")
`pages`	`list[int]`	Page numbers where the document was found
`trace_id`	`str`	Unique trace identifier
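These metadata fields can be used to filter documents before indexing, for example to keep only confidently classified invoices. The sketch below is one way to do that; `select_documents` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins with invented values so the example runs without an API key. Any object with a `metadata` dictionary, such as a LlamaIndex Document, should work the same way.

```python
from types import SimpleNamespace
from typing import Iterable, List, Optional


def select_documents(
    documents: Iterable,
    document_type: Optional[str] = None,
    min_confidence: float = 0.0,
) -> List:
    """Hypothetical helper: keep documents matching a type and confidence floor."""
    selected = []
    for doc in documents:
        meta = doc.metadata
        if document_type is not None and meta.get("document_type") != document_type:
            continue
        if float(meta.get("confidence", 0.0)) < min_confidence:
            continue
        selected.append(doc)
    return selected


# Stand-in documents with invented metadata values.
docs = [
    SimpleNamespace(text="{}", metadata={"source": "a.pdf", "document_type": "Invoice", "confidence": 0.97}),
    SimpleNamespace(text="{}", metadata={"source": "b.pdf", "document_type": "Receipt", "confidence": 0.91}),
    SimpleNamespace(text="{}", metadata={"source": "c.pdf", "document_type": "Invoice", "confidence": 0.42}),
]

invoices = select_documents(docs, document_type="Invoice", min_confidence=0.8)
print([doc.metadata["source"] for doc in invoices])
```

The 0.8 threshold is arbitrary; a suitable floor depends on how costly a misclassified document is for the application.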

License

MIT

Release history

0.2.0 (this version): Mar 3, 2026
0.1.0: Feb 28, 2026

Download files

Source Distribution

llama_index_readers_docdigitizer-0.2.0.tar.gz (5.3 kB)

Uploaded Mar 3, 2026 Source

Built Distribution

llama_index_readers_docdigitizer-0.2.0-py3-none-any.whl (4.3 kB)

Uploaded Mar 3, 2026 Python 3

File details

Details for the file llama_index_readers_docdigitizer-0.2.0.tar.gz.

File metadata

Download URL: llama_index_readers_docdigitizer-0.2.0.tar.gz
Upload date: Mar 3, 2026
Size: 5.3 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for llama_index_readers_docdigitizer-0.2.0.tar.gz
Algorithm	Hash digest
SHA256	`55fbed725e673d244ff7d1d2c96c7ed76e84e570625f420602277f1a3d804442`
MD5	`b8ed3bded66be158944873e49d57e416`
BLAKE2b-256	`9bc2e791c5022bd3296d020a486db4f601c9cd6b5ce2f8706e3647441c1cbe17`
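A downloaded file can be checked against the published SHA256 digest before it is installed. The following is a minimal sketch using `hashlib` from the standard library; `sha256_of` and `matches_published_digest` are hypothetical helper names, and the local file path is an assumption about where the download was saved.

```python
import hashlib
from pathlib import Path

# SHA256 digest published above for the 0.2.0 source distribution.
EXPECTED_SHA256 = "55fbed725e673d244ff7d1d2c96c7ed76e84e570625f420602277f1a3d804442"


def sha256_of(path, chunk_size: int = 65536) -> str:
    """Hash a file in chunks so large downloads need not fit in memory."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_digest(path, expected: str = EXPECTED_SHA256) -> bool:
    """Return True when the file's SHA256 equals the expected hex digest."""
    return sha256_of(path) == expected.lower()


if __name__ == "__main__":
    # Assumed download location; adjust to wherever the file was saved.
    archive = Path("llama_index_readers_docdigitizer-0.2.0.tar.gz")
    if archive.exists():
        print("digest ok" if matches_published_digest(archive) else "digest mismatch")
```

The same check applies to the wheel by passing its published SHA256 digest as `expected`. pip can also enforce digests at install time when a requirements file carries `--hash` entries and is installed with `--require-hashes`.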

Provenance

The following attestation bundles were made for llama_index_readers_docdigitizer-0.2.0.tar.gz:

Publisher: publish-llamaindex.yml on DocDigitizer/dd-v3-integrations

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: llama_index_readers_docdigitizer-0.2.0.tar.gz
- Subject digest: 55fbed725e673d244ff7d1d2c96c7ed76e84e570625f420602277f1a3d804442
- Sigstore transparency entry: 1019095893
- Sigstore integration time: Mar 3, 2026
Source repository:
- Permalink: DocDigitizer/dd-v3-integrations@8a69332042c192053d9f5cdd71ba70d60813077f
- Branch / Tag: refs/tags/llamaindex-v0.2.0
- Owner: https://github.com/DocDigitizer
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish-llamaindex.yml@8a69332042c192053d9f5cdd71ba70d60813077f
- Trigger Event: push

File details

Details for the file llama_index_readers_docdigitizer-0.2.0-py3-none-any.whl.

File metadata

Download URL: llama_index_readers_docdigitizer-0.2.0-py3-none-any.whl
Upload date: Mar 3, 2026
Size: 4.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for llama_index_readers_docdigitizer-0.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`e561d7298399f4ec5631e9587a5c03c34aae9a16bda0a9dfef2bf05d760f53bf`
MD5	`d24a7f67a5f1d83383b329b58137871b`
BLAKE2b-256	`b7a5c0e1c165abe42f111a4b16ee5bd7c98505df9c51cedb31143958f7d8b6cf`

Provenance

The following attestation bundles were made for llama_index_readers_docdigitizer-0.2.0-py3-none-any.whl:

Publisher: publish-llamaindex.yml on DocDigitizer/dd-v3-integrations

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: llama_index_readers_docdigitizer-0.2.0-py3-none-any.whl
- Subject digest: e561d7298399f4ec5631e9587a5c03c34aae9a16bda0a9dfef2bf05d760f53bf
- Sigstore transparency entry: 1019095898
- Sigstore integration time: Mar 3, 2026
Source repository:
- Permalink: DocDigitizer/dd-v3-integrations@8a69332042c192053d9f5cdd71ba70d60813077f
- Branch / Tag: refs/tags/llamaindex-v0.2.0
- Owner: https://github.com/DocDigitizer
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish-llamaindex.yml@8a69332042c192053d9f5cdd71ba70d60813077f
- Trigger Event: push
