llama-index-readers-docdigitizer

LlamaIndex reader for the DocDigitizer document processing API.

Installation

pip install llama-index-readers-docdigitizer

Usage

from llama_index.readers.docdigitizer import DocDigitizerReader

# Load a single PDF
reader = DocDigitizerReader(api_key="dd_live_...")
documents = reader.load_data(file_path="invoice.pdf")

print(documents[0].text)          # JSON with extracted fields
print(documents[0].metadata)      # document_type, confidence, etc.

# Load all PDFs from a directory
documents = reader.load_data(file_path="invoices/")

# Use in a RAG pipeline
from llama_index.core import VectorStoreIndex

index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()
response = query_engine.query("What is the invoice total?")

Configuration

| Parameter | Environment Variable | Default |
| --- | --- | --- |
| api_key | DOCDIGITIZER_API_KEY | |
| base_url | DOCDIGITIZER_BASE_URL | https://apix.docdigitizer.com/sync |
| timeout | DOCDIGITIZER_TIMEOUT | 300 |
| max_retries | | 3 |
| pipeline | | None |
| content_format | | "json" |
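
One way to read the configuration parameters above is as a lookup: an explicitly passed value first, then the environment variable (where one exists), then the default. The sketch below illustrates that reading only. `resolve_setting` and `SETTINGS` are hypothetical names that are not part of the package, and the precedence order is an assumption, not documented behavior.

```python
import os

# Environment variable names and defaults copied from the configuration table.
# Parameters without an environment variable (or default) map to None.
SETTINGS = {
    "api_key": ("DOCDIGITIZER_API_KEY", None),
    "base_url": ("DOCDIGITIZER_BASE_URL", "https://apix.docdigitizer.com/sync"),
    "timeout": ("DOCDIGITIZER_TIMEOUT", 300),
    "max_retries": (None, 3),
    "pipeline": (None, None),
    "content_format": (None, "json"),
}


def resolve_setting(name, explicit=None, environ=None):
    """Hypothetical helper: explicit value, then env var, then default."""
    environ = os.environ if environ is None else environ
    env_var, default = SETTINGS[name]
    if explicit is not None:
        return explicit
    if env_var is not None and env_var in environ:
        raw = environ[env_var]
        # Environment values are strings; cast when the default is an integer.
        return int(raw) if isinstance(default, int) else raw
    return default
```

For example, `resolve_setting("timeout", environ={"DOCDIGITIZER_TIMEOUT": "60"})` returns `60` under this assumed precedence, while an empty environment yields the default of `300`.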

Content Formats

  • "json" (default): Document text is a JSON string of extracted fields
  • "text": Key-value pairs separated by newlines (key: value)
  • "kv": key=value pairs separated by newlines
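
To make the difference between the three formats concrete, here is a minimal sketch that renders the same set of extracted fields in each one. `render_fields` is a hypothetical function written for illustration, the sample fields are made up, and the exact spacing and ordering produced by the real reader may differ.

```python
import json


def render_fields(fields, content_format="json"):
    """Illustrative rendering of extracted fields in each content format."""
    if content_format == "json":
        return json.dumps(fields)
    if content_format == "text":
        # One "key: value" pair per line.
        return "\n".join(f"{key}: {value}" for key, value in fields.items())
    if content_format == "kv":
        # One "key=value" pair per line.
        return "\n".join(f"{key}={value}" for key, value in fields.items())
    raise ValueError(f"unknown content_format: {content_format!r}")


# Made-up sample fields, not real API output.
sample = {"invoice_number": "INV-001", "total": "120.50"}
for fmt in ("json", "text", "kv"):
    print(f"--- {fmt} ---")
    print(render_fields(sample, fmt))
```

The "json" form is convenient when downstream code parses the text back with `json.loads`, while the line-oriented forms may read more naturally when the text is embedded and passed to a language model.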

Document Metadata

Each LlamaIndex Document includes metadata:

| Field | Type | Description |
| --- | --- | --- |
| source | str | File path of the processed PDF |
| document_type | str | Detected document type (e.g., "Invoice") |
| confidence | float | Classification confidence (0-1) |
| country_code | str | Detected country code (e.g., "PT") |
| pages | list[int] | Page numbers where document was found |
| trace_id | str | Unique trace identifier |
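
These metadata fields can be used to filter documents before indexing, for instance keeping only confidently classified invoices. The following is a sketch under stated assumptions: `filter_documents` is a hypothetical helper, the 0.8 threshold is arbitrary, and `SimpleNamespace` objects with made-up values stand in for real LlamaIndex `Document` instances so the example runs without an API key.

```python
from types import SimpleNamespace


def filter_documents(documents, document_type, min_confidence=0.8):
    """Keep documents of one detected type at or above a confidence threshold."""
    return [
        doc
        for doc in documents
        if doc.metadata.get("document_type") == document_type
        and doc.metadata.get("confidence", 0.0) >= min_confidence
    ]


# Stand-ins for LlamaIndex Document objects; all values are made up.
docs = [
    SimpleNamespace(
        text="{}",
        metadata={"source": "a.pdf", "document_type": "Invoice", "confidence": 0.97},
    ),
    SimpleNamespace(
        text="{}",
        metadata={"source": "b.pdf", "document_type": "Invoice", "confidence": 0.41},
    ),
    SimpleNamespace(
        text="{}",
        metadata={"source": "c.pdf", "document_type": "Receipt", "confidence": 0.99},
    ),
]

invoices = filter_documents(docs, "Invoice")
print([doc.metadata["source"] for doc in invoices])
```

The same function should work on the list returned by `reader.load_data(...)`, since it only relies on each document exposing a `metadata` dictionary.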

License

MIT
