Document extraction pipeline — extract structured data from PDF/DOCX/PPTX/XLSX/JSON/CSV/MD and fill PDFs via the mapper module

pdf-autofillr-doc-upload

Extract structured data from any supported document format using an LLM, then optionally fill a blank PDF via the mapper module.

Supported document formats

| Format | Extension |
| --- | --- |
| PDF | .pdf |
| Word | .docx |
| PowerPoint | .pptx |
| Excel | .xlsx, .xls |
| CSV | .csv |
| JSON | .json |
| Markdown | .md, .markdown |
| Plain text | .txt |
| HTML | .html, .htm |
| XML | .xml |
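
As an illustration of the format list above, a pre-flight check could map a file extension to its format before handing the file to the extractor. This is only a sketch: the names SUPPORTED_EXTENSIONS and detect_format are hypothetical and are not part of the package's API.

```python
from pathlib import Path

# Extension-to-format mapping copied from the list of supported formats.
# The dictionary and function names are illustrative, not the package's API.
SUPPORTED_EXTENSIONS = {
    ".pdf": "PDF",
    ".docx": "Word",
    ".pptx": "PowerPoint",
    ".xlsx": "Excel",
    ".xls": "Excel",
    ".csv": "CSV",
    ".json": "JSON",
    ".md": "Markdown",
    ".markdown": "Markdown",
    ".txt": "Plain text",
    ".html": "HTML",
    ".htm": "HTML",
    ".xml": "XML",
}


def detect_format(path):
    """Return the format name for a path, or None if the extension is not listed."""
    return SUPPORTED_EXTENSIONS.get(Path(path).suffix.lower())
```

Lower-casing the suffix means investor.PDF and investor.pdf are treated the same way; whether the package itself is case-insensitive is not stated in this README.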

Supported LLM providers (via LiteLLM)

Any model LiteLLM supports — OpenAI, Anthropic, Groq, Ollama, AWS Bedrock, Azure OpenAI, Google Vertex AI, and more.


Installation

# Step 1 — create venv
python -m venv venv
.\venv\Scripts\activate   # Windows
# source venv/bin/activate  # Linux/Mac

# Step 2 — install litellm (pinned)
pip install "litellm==1.59.12" --no-cache-dir

# Step 3 — install mapper (from sibling modules3/mapper/)
pip install -e ../mapper --no-cache-dir --no-deps
pip install PyMuPDF tiktoken pydantic pydantic-settings python-dotenv tenacity requests aiohttp httpx numpy tqdm python-json-logger --no-cache-dir

# Step 4 — install extractor
pip install -e . --no-cache-dir --no-deps
pip install python-docx python-pptx openpyxl --no-cache-dir

# Verify
python -c "import pdf_autofillr_doc_upload; print('extractor ok')"
python -c "import pdf_autofillr_mapper; print('mapper ok')"

Setup

1. Copy sample configs

python -c "import pdf_autofillr_doc_upload; pdf_autofillr_doc_upload.copy_sample_configs('.')"

2. Create .env

cp .env.example .env
# Edit .env — set DOC_UPLOAD_LLM_MODEL and DOC_UPLOAD_LLM_API_KEY at minimum

Minimal .env:

DOC_UPLOAD_LLM_MODEL=openai/gpt-4.1-mini
DOC_UPLOAD_LLM_API_KEY=sk-...
DOC_UPLOAD_STORAGE=local
DOC_UPLOAD_DATA_PATH=./extractor_data
DOC_UPLOAD_CONFIG_PATH=./configs
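
Before starting a run, it can help to confirm that the two settings described as the minimum are actually present. The helper below is a hypothetical sketch (missing_settings is not a function of this package). It only inspects a mapping such as os.environ and assumes the .env file has already been loaded into the process environment.

```python
import os

# The two variables the setup step names as the minimum.
REQUIRED_KEYS = ("DOC_UPLOAD_LLM_MODEL", "DOC_UPLOAD_LLM_API_KEY")


def missing_settings(env=None):
    """Return the required keys that are absent or blank in the given mapping."""
    env = os.environ if env is None else env
    return [key for key in REQUIRED_KEYS if not env.get(key, "").strip()]


if __name__ == "__main__":
    missing = missing_settings()
    if missing:
        print("Missing settings:", ", ".join(missing))
    else:
        print("Minimal configuration looks complete")
```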

Running

Interactive local runner

python -m entrypoints.local

Non-interactive (single document)

python -m entrypoints.local --document investor.pdf --schema configs/form_keys.json --output output/filled.json

CLI

doc-upload-cli --document investor.pdf --schema configs/form_keys.json --output filled.json --report

FastAPI server

doc-upload-server
# or
uvicorn entrypoints.fastapi_app:app --reload --port 8001

Then POST to http://localhost:8001/extract:

{
  "document_path": "/path/to/investor_profile.pdf",
  "schema_path": "configs/form_keys.json"
}
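
The request above can also be sent from Python using only the standard library. This is a minimal sketch: the helper names are hypothetical, and it assumes the endpoint replies with a JSON body, which this README does not specify.

```python
import json
import urllib.request


def build_extract_request(document_path, schema_path, base_url="http://localhost:8001"):
    """Build a POST request for /extract carrying the JSON body shown above."""
    payload = {"document_path": document_path, "schema_path": schema_path}
    return urllib.request.Request(
        f"{base_url}/extract",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post_extract(request, timeout=120):
    """Send the request and decode the reply (assumed here to be JSON)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Requires the FastAPI server to be running on port 8001.
    request = build_extract_request(
        "/path/to/investor_profile.pdf", "configs/form_keys.json"
    )
    print(post_extract(request))
```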

Storage backends

| Value | Description |
| --- | --- |
| local | Local filesystem (default, for dev) |
| s3 | AWS S3 |
| gcp | Google Cloud Storage |
| azure | Azure Blob Storage |

To use a cloud backend, set DOC_UPLOAD_STORAGE to its value (for example, DOC_UPLOAD_STORAGE=s3) and set the matching bucket env vars.
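
The module structure lists an abstract storage interface and a factory. One way such a selection could work is sketched below. Every class and function name here is an assumption made for illustration, only the local backend is implemented, and the real interface in storage/base.py may look quite different.

```python
import os
from abc import ABC, abstractmethod
from pathlib import Path


class Storage(ABC):
    """Hypothetical minimal interface; the real one may expose other methods."""

    @abstractmethod
    def write_text(self, key, text):
        ...

    @abstractmethod
    def read_text(self, key):
        ...


class LocalStorage(Storage):
    """Stores each key as a file under a root directory."""

    def __init__(self, root):
        self.root = Path(root)

    def write_text(self, key, text):
        target = self.root / key
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(text, encoding="utf-8")

    def read_text(self, key):
        return (self.root / key).read_text(encoding="utf-8")


def make_storage(env=None):
    """Pick a backend from DOC_UPLOAD_STORAGE, defaulting to local."""
    env = os.environ if env is None else env
    backend = env.get("DOC_UPLOAD_STORAGE", "local")
    if backend == "local":
        return LocalStorage(env.get("DOC_UPLOAD_DATA_PATH", "./extractor_data"))
    if backend in {"s3", "gcp", "azure"}:
        # Cloud backends need their own SDKs and are left out of this sketch.
        raise NotImplementedError(f"{backend} backend is not part of this sketch")
    raise ValueError(f"unknown DOC_UPLOAD_STORAGE value: {backend!r}")
```

Keeping the backend choice in one factory function means the rest of the pipeline can depend on the interface alone.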


PDF Filling (mapper integration)

Set DOC_UPLOAD_PDF_FILLER=mapper and the Lambda URL:

DOC_UPLOAD_PDF_FILLER=mapper
DOC_UPLOAD_FILL_PDF_LAMBDA_URL=https://xyz.lambda-url.us-east-1.on.aws
DOC_UPLOAD_PDF_API_KEY=my-api-key

The client runs extraction and embed-file preparation in parallel, then calls fill_pdf once both have completed. This is the same sequence as the Lambda main.py pipeline.
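
The "two steps in parallel, then one fill call" pattern can be sketched with a thread pool. The three step functions below are hypothetical stand-ins with made-up return values; they only show the ordering and are not the client's actual implementation.

```python
from concurrent.futures import ThreadPoolExecutor


def extract_fields(document_path):
    """Hypothetical stand-in for the LLM extraction step."""
    return {"investor.name": f"extracted from {document_path}"}


def prepare_embed_file(pdf_doc_id):
    """Hypothetical stand-in for embed-file preparation."""
    return f"embed-{pdf_doc_id}"


def fill_pdf(fields, embed_file):
    """Hypothetical stand-in for the single fill_pdf call."""
    return {"embed_file": embed_file, "filled_fields": sorted(fields)}


def run_pipeline(document_path, pdf_doc_id):
    """Run both preparation steps concurrently, then fill once both are done."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        extraction = pool.submit(extract_fields, document_path)
        embed = pool.submit(prepare_embed_file, pdf_doc_id)
        # result() blocks, so fill_pdf runs only after both futures finish.
        return fill_pdf(extraction.result(), embed.result())


if __name__ == "__main__":
    print(run_pipeline("investor.pdf", "99"))
```

Threads suit this pattern when both steps are I/O-bound (an LLM call and a remote preparation call), which is an assumption about the workload rather than something this README states.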


Telemetry

| Value | Description |
| --- | --- |
| off | Disabled (default, zero overhead) |
| local | Append events to ./extractor_telemetry/events.jsonl |
| managed | (stub) HTTP POST to DOC_UPLOAD_TELEMETRY_ENDPOINT |

Field values are never included in telemetry. Only metadata (counts, latencies, file extensions) is logged. Job IDs are one-way SHA-256 hashed.
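
To make the privacy guarantee concrete, here is a sketch of what a metadata-only event with a hashed job ID could look like. The event keys and function names are assumptions made for illustration; only the use of SHA-256 and the kinds of metadata (counts, latencies, file extensions) come from the description above.

```python
import hashlib
import json


def hash_job_id(job_id):
    """One-way SHA-256 digest of a job ID, as a 64-character hex string."""
    return hashlib.sha256(job_id.encode("utf-8")).hexdigest()


def build_event(job_id, file_extension, field_count, latency_ms):
    """Assemble an event holding metadata only; no extracted values are included."""
    return {
        "job": hash_job_id(job_id),
        "file_extension": file_extension,
        "field_count": field_count,
        "latency_ms": latency_ms,
    }


def to_jsonl_line(event):
    """Serialize one event as a single line suitable for appending to a .jsonl file."""
    return json.dumps(event, sort_keys=True) + "\n"
```

Because the digest is deterministic, events from the same job can still be correlated without the original ID ever being written to disk.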


Entrypoints

| File | Use case |
| --- | --- |
| entrypoints/local.py | Interactive development REPL |
| entrypoints/cli.py | doc-upload-cli command |
| entrypoints/server.py | doc-upload-server (uvicorn) |
| entrypoints/fastapi_app.py | FastAPI app (mount or standalone) |
| entrypoints/aws_lambda.py | AWS Lambda handler |
| entrypoints/gcp_function.py | GCP Cloud Functions handler |
| entrypoints/azure_function.py | Azure Functions handler |

Programmatic API

from pdf_autofillr_doc_upload import DocUploadClient

client = DocUploadClient()

# Extract only
result = client.run(
    document_path="investor_profile.pdf",
    schema_path="configs/form_keys.json",
    output_path="output/filled.json",
)

print(result["output_flat"])   # flat dot-notation dict
print(result["output_nested"]) # nested dict matching schema

# Extract + fill PDF
result = client.run(
    document_path="investor_profile.pdf",
    schema_path="configs/form_keys.json",
    user_id="42",
    pdf_doc_id="99",
    session_id="sess_abc",
    investor_type="Individual",
)
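
The comments on output_flat and output_nested describe two views of the same data: a flat dict with dot-notation keys and a nested dict matching the schema. The relationship between the two can be sketched as follows. These helpers and the sample field names are illustrative only and are not taken from the package.

```python
def flatten(nested, prefix=""):
    """Turn a nested dict into a flat dict with dot-separated keys."""
    flat = {}
    for key, value in nested.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def unflatten(flat):
    """Rebuild the nested dict from dot-separated keys."""
    nested = {}
    for path, value in flat.items():
        *parents, leaf = path.split(".")
        node = nested
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value
    return nested


if __name__ == "__main__":
    # Hypothetical field names, chosen only for the example.
    sample = {"investor": {"name": "Ada", "address": {"city": "London"}}, "amount": 1000}
    print(flatten(sample))
```

The flat form is convenient for mapping values onto PDF form fields by key, while the nested form mirrors the schema file.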

Module structure

extractor/
├── pyproject.toml
├── .env.example
├── README.md
├── config_samples/
│   └── form_keys.json
├── entrypoints/
│   ├── local.py             ← interactive REPL
│   ├── cli.py               ← doc-upload-cli
│   ├── server.py            ← doc-upload-server
│   ├── fastapi_app.py       ← FastAPI app
│   ├── aws_lambda.py        ← AWS Lambda
│   ├── gcp_function.py      ← GCP Cloud Functions
│   └── azure_function.py    ← Azure Functions
└── src/pdf_autofillr_doc_upload/
    ├── __init__.py
    ├── client.py            ← DocUploadClient (main API)
    ├── config/
    │   └── settings.py      ← all env var config
    ├── storage/
    │   ├── base.py          ← abstract interface
    │   ├── local_storage.py
    │   ├── s3_storage.py
    │   ├── gcp_storage.py
    │   ├── azure_storage.py
    │   └── factory.py
    ├── extraction/
    │   ├── document_reader.py  ← PDF/DOCX/PPTX/XLSX/CSV/JSON/MD/HTML/XML
    │   ├── llm_client.py       ← LiteLLM wrapper
    │   └── extractor.py        ← full pipeline
    ├── pdf/
    │   ├── interface.py
    │   ├── api_handler.py      ← HTTP client for Lambda
    │   └── mapper_filler.py    ← mapper integration
    ├── logging/
    │   └── logger.py           ← ExecutionLogger
    ├── telemetry/
    │   ├── collector.py
    │   └── config.py
    └── managed/
        └── __init__.py         ← stub for future managed service
