llama-index-readers-docdigitizer

LlamaIndex reader for the DocDigitizer document processing API.

v0.1.x is deprecated. Upgrade to v0.2.0+ for the new API endpoint. The previous endpoint (https://apix.docdigitizer.com/sync) will be removed in a future release.

Installation

pip install llama-index-readers-docdigitizer

Usage

from llama_index.readers.docdigitizer import DocDigitizerReader

# Load a single PDF
reader = DocDigitizerReader(api_key="dd_live_...")
documents = reader.load_data(file_path="invoice.pdf")

print(documents[0].text)          # JSON with extracted fields
print(documents[0].metadata)      # document_type, confidence, etc.

# Load all PDFs from a directory
documents = reader.load_data(file_path="invoices/")

# Use in a RAG pipeline
from llama_index.core import VectorStoreIndex

index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()
response = query_engine.query("What is the invoice total?")
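
With the default content format, each document's text is a JSON string, so individual fields can be read with the standard json module. The following is a minimal sketch that works on a stand-in string rather than a live API response. The helper name get_field and the field names "total" and "currency" are hypothetical; the real field names depend on the document type the API detects.

```python
import json


def get_field(document_text: str, field: str, default=None):
    """Return one extracted field from a JSON-formatted document text.

    Falls back to `default` when the text is not a JSON object or the
    field is absent.
    """
    try:
        fields = json.loads(document_text)
    except json.JSONDecodeError:
        return default
    if not isinstance(fields, dict):
        return default
    return fields.get(field, default)


# Stand-in for documents[0].text; the field names are made up for illustration.
sample_text = '{"total": "1250.00", "currency": "EUR"}'
print(get_field(sample_text, "total"))
```

In a real run, the same helper would be called as get_field(documents[0].text, "...") with a field name taken from the actual response.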

Configuration

Parameter       Environment Variable     Default
api_key         DOCDIGITIZER_API_KEY     -
base_url        DOCDIGITIZER_BASE_URL    https://api.docdigitizer.com/v3/docingester
timeout         DOCDIGITIZER_TIMEOUT     300
max_retries     -                        3
pipeline        -                        None
content_format  -                        "json"
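
The sketch below shows one way the table's settings could be gathered into a single dictionary before constructing the reader. It does not call the library. The helper resolve_config is hypothetical, and two things are assumptions rather than documented behavior: the precedence order (explicit argument, then environment variable, then default) and the treatment of api_key as required because the table lists no default for it.

```python
import os

# Defaults copied from the configuration table.
DEFAULTS = {
    "base_url": "https://api.docdigitizer.com/v3/docingester",
    "timeout": 300,
    "max_retries": 3,
    "pipeline": None,
    "content_format": "json",
}

# Parameters the table lists with an environment variable.
ENV_VARS = {
    "api_key": "DOCDIGITIZER_API_KEY",
    "base_url": "DOCDIGITIZER_BASE_URL",
    "timeout": "DOCDIGITIZER_TIMEOUT",
}


def resolve_config(env=None, **overrides):
    """Merge defaults, environment variables, and explicit arguments.

    Assumed precedence (not confirmed by the README): explicit argument,
    then environment variable, then default.
    """
    env = os.environ if env is None else env
    config = dict(DEFAULTS)
    for name, var in ENV_VARS.items():
        if var in env:
            config[name] = env[var]
    # Explicit arguments win; None means "not provided".
    config.update({k: v for k, v in overrides.items() if v is not None})
    if "api_key" not in config:
        raise ValueError("api_key is required: pass it or set DOCDIGITIZER_API_KEY")
    config["timeout"] = float(config["timeout"])  # env values arrive as strings
    return config


kwargs = resolve_config(
    env={"DOCDIGITIZER_API_KEY": "dd_live_...", "DOCDIGITIZER_TIMEOUT": "60"}
)
print(kwargs["timeout"], kwargs["base_url"])
```

If the reader's constructor accepts the table's parameter names as keyword arguments (only api_key is confirmed by the usage example), the result could be passed on as DocDigitizerReader(**kwargs).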

Content Formats

  • "json" (default): Document text is a JSON string of the extracted fields
  • "text": One "key: value" pair per line
  • "kv": One "key=value" pair per line
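
To make the difference between the three formats concrete, the sketch below renders the same set of fields in each one, following the descriptions in the list above. The function render_fields and the sample field names are hypothetical; the library's actual output may differ in details such as key order, whitespace, or the handling of nested values.

```python
import json


def render_fields(fields: dict, content_format: str = "json") -> str:
    """Render extracted fields as described for each content format."""
    if content_format == "json":
        return json.dumps(fields)
    if content_format == "text":
        return "\n".join(f"{key}: {value}" for key, value in fields.items())
    if content_format == "kv":
        return "\n".join(f"{key}={value}" for key, value in fields.items())
    raise ValueError(f"unknown content_format: {content_format!r}")


# Hypothetical extracted fields, for illustration only.
fields = {"invoice_number": "INV-001", "total": "1250.00"}
for fmt in ("json", "text", "kv"):
    print(f"--- {fmt} ---")
    print(render_fields(fields, fmt))
```

The "json" format is the easiest to parse back into structured data, while "text" and "kv" produce shorter plain strings that may suit embedding or keyword search better.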

Document Metadata

Each LlamaIndex Document includes metadata:

Field          Type       Description
source         str        File path of the processed PDF
document_type  str        Detected document type (e.g., "Invoice")
confidence     float      Classification confidence (0-1)
country_code   str        Detected country code (e.g., "PT")
pages          list[int]  Page numbers where document was found
trace_id       str        Unique trace identifier
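
These metadata fields can be used to screen documents before indexing, for example to keep only confidently classified invoices. The sketch below is illustrative: filter_documents is a hypothetical helper, SimpleNamespace objects stand in for LlamaIndex Document instances so the example runs without the library, and the sample values (including the "Receipt" type) are invented.

```python
from types import SimpleNamespace


def filter_documents(documents, document_type=None, min_confidence=0.0):
    """Keep documents matching a type and a minimum classification confidence."""
    kept = []
    for doc in documents:
        meta = doc.metadata
        if document_type is not None and meta.get("document_type") != document_type:
            continue
        if meta.get("confidence", 0.0) < min_confidence:
            continue
        kept.append(doc)
    return kept


# Stand-ins for LlamaIndex Document objects; all values are invented.
docs = [
    SimpleNamespace(text="{}", metadata={"source": "a.pdf", "document_type": "Invoice", "confidence": 0.97}),
    SimpleNamespace(text="{}", metadata={"source": "b.pdf", "document_type": "Invoice", "confidence": 0.41}),
    SimpleNamespace(text="{}", metadata={"source": "c.pdf", "document_type": "Receipt", "confidence": 0.88}),
]
confident_invoices = filter_documents(docs, document_type="Invoice", min_confidence=0.8)
print([d.metadata["source"] for d in confident_invoices])
```

The helper only reads the metadata attribute, so it should apply unchanged to the list returned by load_data.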

License

MIT
