Document extraction pipeline — extract structured data from PDF/DOCX/PPTX/XLSX/JSON/CSV/MD and fill PDFs via the mapper module

pdf-autofillr-doc-upload

Extract structured data from the document formats listed below using an LLM, then optionally fill a blank PDF via the mapper module.

Supported document formats

Format	Extension
PDF	`.pdf`
Word	`.docx`
PowerPoint	`.pptx`
Excel	`.xlsx`, `.xls`
CSV	`.csv`
JSON	`.json`
Markdown	`.md`, `.markdown`
Plain text	`.txt`
HTML	`.html`, `.htm`
XML	`.xml`
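
To reject unsupported files before any LLM call is made, a caller could check the extension first. The sketch below is an illustration only: `SUPPORTED_EXTENSIONS` and `is_supported` are hypothetical names, and the set is copied from the table above rather than from the package's own reader.

```python
from pathlib import Path

# Extensions copied from the table above (assumption: the package's
# document reader may accept a slightly different set).
SUPPORTED_EXTENSIONS = {
    ".pdf", ".docx", ".pptx", ".xlsx", ".xls", ".csv", ".json",
    ".md", ".markdown", ".txt", ".html", ".htm", ".xml",
}


def is_supported(path: str) -> bool:
    """Return True if the file extension is in the table (case-insensitive)."""
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS
```

The check looks only at the file name, so it does not confirm that the content matches the extension.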

Supported LLM providers (via LiteLLM)

Any model LiteLLM supports — OpenAI, Anthropic, Groq, Ollama, AWS Bedrock, Azure OpenAI, Google Vertex AI, and more.

Installation

# Step 1 — create venv
python -m venv venv
.\venv\Scripts\activate   # Windows
# source venv/bin/activate  # Linux/Mac

# Step 2 — install litellm (pinned)
pip install "litellm==1.59.12" --no-cache-dir

# Step 3 — install mapper (from sibling modules3/mapper/)
pip install -e ../mapper --no-cache-dir --no-deps
pip install PyMuPDF tiktoken pydantic pydantic-settings python-dotenv tenacity requests aiohttp httpx numpy tqdm python-json-logger --no-cache-dir

# Step 4 — install extractor
pip install -e . --no-cache-dir --no-deps
pip install python-docx python-pptx openpyxl --no-cache-dir

# Verify
python -c "import pdf_autofillr_doc_upload; print('extractor ok')"
python -c "import pdf_autofillr_mapper; print('mapper ok')"

Setup

1. Copy sample configs

python -c "import pdf_autofillr_doc_upload; pdf_autofillr_doc_upload.copy_sample_configs('.')"

2. Create `.env`

cp .env.example .env        # Linux/Mac
# copy .env.example .env    # Windows
# Edit .env and set at least DOC_UPLOAD_LLM_MODEL and DOC_UPLOAD_LLM_API_KEY

Minimal .env:

DOC_UPLOAD_LLM_MODEL=openai/gpt-4.1-mini
DOC_UPLOAD_LLM_API_KEY=sk-...
DOC_UPLOAD_STORAGE=local
DOC_UPLOAD_DATA_PATH=./extractor_data
DOC_UPLOAD_CONFIG_PATH=./configs
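
Before starting the pipeline it can help to confirm that the two required variables are actually set. The following is a minimal sketch using only the standard library. `parse_env` and `missing_keys` are hypothetical helpers, not part of the package, and the parser handles only simple `KEY=value` lines.

```python
# The two variables this README names as the minimum.
REQUIRED_KEYS = ("DOC_UPLOAD_LLM_MODEL", "DOC_UPLOAD_LLM_API_KEY")


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=value lines, skipping blanks and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(text: str) -> list[str]:
    """Return the required keys that are absent or empty."""
    values = parse_env(text)
    return [key for key in REQUIRED_KEYS if not values.get(key)]
```

For example, `missing_keys(open(".env").read())` returns an empty list when both variables are present and non-empty.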

Running

Interactive local runner

python -m entrypoints.local

Non-interactive (single document)

python -m entrypoints.local --document investor.pdf --schema configs/form_keys.json --output output/filled.json

CLI

doc-upload-cli --document investor.pdf --schema configs/form_keys.json --output filled.json --report

FastAPI server

doc-upload-server
# or
uvicorn entrypoints.fastapi_app:app --reload --port 8001

Then send a POST request with a JSON body to http://localhost:8001/extract:

{
  "document_path": "/path/to/investor_profile.pdf",
  "schema_path": "configs/form_keys.json"
}
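
The same request can be sent from Python. The sketch below uses only the standard library and assumes the server is reachable at the URL above and replies with JSON. The response shape is not documented here, so `post_extract` simply returns whatever JSON comes back. Both function names are hypothetical.

```python
import json
import urllib.request


def build_extract_request(document_path, schema_path,
                          base_url="http://localhost:8001"):
    """Build the POST request for /extract without sending it."""
    payload = json.dumps({
        "document_path": document_path,
        "schema_path": schema_path,
    }).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/extract",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post_extract(document_path, schema_path,
                 base_url="http://localhost:8001", timeout=120):
    """Send the request and decode the reply (assumed to be JSON)."""
    request = build_extract_request(document_path, schema_path, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

The timeout is an arbitrary illustrative value; LLM-backed extraction may need a longer one.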

Storage backends

Value	Description
`local`	Local filesystem (default, for dev)
`s3`	AWS S3
`gcp`	Google Cloud Storage
`azure`	Azure Blob Storage

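To use a cloud backend, set DOC_UPLOAD_STORAGE to the matching value (for example, DOC_UPLOAD_STORAGE=s3) and set the bucket environment variables for that backend.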

PDF Filling (mapper integration)

Set DOC_UPLOAD_PDF_FILLER=mapper and the Lambda URL:

DOC_UPLOAD_PDF_FILLER=mapper
DOC_UPLOAD_FILL_PDF_LAMBDA_URL=https://xyz.lambda-url.us-east-1.on.aws
DOC_UPLOAD_PDF_API_KEY=my-api-key

The client runs extraction and embed-file preparation in parallel, then calls fill_pdf once both complete. This is the same sequence the Lambda main.py pipeline follows.
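
The ordering described above (two independent steps in parallel, then one dependent step) can be sketched with a thread pool. This is an illustration of the pattern only: the three callables are hypothetical stand-ins, and the client may use a different concurrency mechanism internally.

```python
from concurrent.futures import ThreadPoolExecutor


def extract_and_fill(extract, prepare_embed_file, fill_pdf):
    """Run two independent steps concurrently, then the step that needs both.

    All three arguments are caller-supplied callables (hypothetical
    stand-ins for the real pipeline stages).
    """
    with ThreadPoolExecutor(max_workers=2) as pool:
        extraction_future = pool.submit(extract)
        embed_future = pool.submit(prepare_embed_file)
        # result() blocks until the step finishes and re-raises its exception.
        extracted = extraction_future.result()
        embed_file = embed_future.result()
    # Both inputs are ready, so the fill step runs exactly once.
    return fill_pdf(extracted, embed_file)
```

If either parallel step raises, the exception propagates and the fill step is never called.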

Telemetry

Value	Description
`off`	Disabled (default, zero overhead)
`local`	Append events to `./extractor_telemetry/events.jsonl`
`managed`	(stub) HTTP POST to `DOC_UPLOAD_TELEMETRY_ENDPOINT`

Field values are never included in telemetry. Only metadata (counts, latencies, file extensions) is logged. Job IDs are hashed with SHA-256 (one-way) before they are logged.
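
To make the privacy rule concrete, the sketch below builds a metadata-only event and appends it to a JSONL file. It is not the package's collector: the event keys (`job`, `extension`, `field_count`, `latency_ms`, `timestamp`) and both function names are assumptions chosen for illustration.

```python
import hashlib
import json
import time
from pathlib import Path


def hash_job_id(job_id: str) -> str:
    """Return the SHA-256 hex digest of the job ID (one-way)."""
    return hashlib.sha256(job_id.encode("utf-8")).hexdigest()


def append_event(events_path, job_id, document_path, field_count, latency_ms):
    """Append one metadata-only event as a single JSON line."""
    event = {
        "job": hash_job_id(job_id),                        # never the raw ID
        "extension": Path(document_path).suffix.lower(),   # not the file name
        "field_count": field_count,                        # a count, not values
        "latency_ms": latency_ms,
        "timestamp": time.time(),
    }
    path = Path(events_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(event) + "\n")
    return event
```

Writing one JSON object per line keeps the file appendable and lets each event be parsed independently.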

Entrypoints

File	Use case
`entrypoints/local.py`	Interactive development REPL
`entrypoints/cli.py`	`doc-upload-cli` command
`entrypoints/server.py`	`doc-upload-server` (uvicorn)
`entrypoints/fastapi_app.py`	FastAPI app (mount or standalone)
`entrypoints/aws_lambda.py`	AWS Lambda handler
`entrypoints/gcp_function.py`	GCP Cloud Functions handler
`entrypoints/azure_function.py`	Azure Functions handler

Programmatic API

from pdf_autofillr_doc_upload import DocUploadClient

client = DocUploadClient()

# Extract only
result = client.run(
    document_path="investor_profile.pdf",
    schema_path="configs/form_keys.json",
    output_path="output/filled.json",
)

print(result["output_flat"])   # flat dot-notation dict
print(result["output_nested"]) # nested dict matching schema

# Extract + fill PDF
result = client.run(
    document_path="investor_profile.pdf",
    schema_path="configs/form_keys.json",
    user_id="42",
    pdf_doc_id="99",
    session_id="sess_abc",
    investor_type="Individual",
)
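
When several documents share one schema, the extract-only call can be wrapped in a loop. The helper below is a hypothetical sketch: `run_batch` is not part of the package, and it relies only on the `run(document_path=..., schema_path=..., output_path=...)` call and the `output_flat` key shown above.

```python
from pathlib import Path


def run_batch(client, document_paths, schema_path, output_dir):
    """Extract each document and collect the flat results by input path.

    `client` is any object with a run() method like DocUploadClient's.
    """
    results = {}
    for document_path in document_paths:
        # One JSON output per document, named after the input file's stem.
        output_path = Path(output_dir) / (Path(document_path).stem + ".json")
        result = client.run(
            document_path=str(document_path),
            schema_path=schema_path,
            output_path=str(output_path),
        )
        results[str(document_path)] = result["output_flat"]
    return results
```

A call such as `run_batch(DocUploadClient(), ["a.pdf", "b.docx"], "configs/form_keys.json", "output")` would write `output/a.json` and `output/b.json`, assuming the client writes to `output_path` as in the example above. Inputs that share a file stem would overwrite each other's output.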

Module structure

extractor/
├── pyproject.toml
├── .env.example
├── README.md
├── config_samples/
│   └── form_keys.json
├── entrypoints/
│   ├── local.py             ← interactive REPL
│   ├── cli.py               ← doc-upload-cli
│   ├── server.py            ← doc-upload-server
│   ├── fastapi_app.py       ← FastAPI app
│   ├── aws_lambda.py        ← AWS Lambda
│   ├── gcp_function.py      ← GCP Cloud Functions
│   └── azure_function.py    ← Azure Functions
└── src/pdf_autofillr_doc_upload/
    ├── __init__.py
    ├── client.py            ← DocUploadClient (main API)
    ├── config/
    │   └── settings.py      ← all env var config
    ├── storage/
    │   ├── base.py          ← abstract interface
    │   ├── local_storage.py
    │   ├── s3_storage.py
    │   ├── gcp_storage.py
    │   ├── azure_storage.py
    │   └── factory.py
    ├── extraction/
    │   ├── document_reader.py  ← PDF/DOCX/PPTX/XLSX/CSV/JSON/MD/HTML/XML
    │   ├── llm_client.py       ← LiteLLM wrapper
    │   └── extractor.py        ← full pipeline
    ├── pdf/
    │   ├── interface.py
    │   ├── api_handler.py      ← HTTP client for Lambda
    │   └── mapper_filler.py    ← mapper integration
    ├── logging/
    │   └── logger.py           ← ExecutionLogger
    ├── telemetry/
    │   ├── collector.py
    │   └── config.py
    └── managed/
        └── __init__.py         ← stub for future managed service

Release history

Version	Released
0.1.5	May 16, 2026
0.1.4	Apr 28, 2026
0.1.3	Apr 3, 2026
0.1.2 (this version)	Apr 2, 2026
0.1.1	Apr 2, 2026
0.1.0	Apr 2, 2026
