Document extraction pipeline — extract structured data from PDF/DOCX/PPTX/XLSX/JSON/CSV/MD and fill PDFs via the mapper module

pdf-autofillr-doc-upload

Extract structured data from any supported document format using an LLM, then optionally fill a blank PDF via the mapper module.

Supported document formats

Format       Extension
PDF          .pdf
Word         .docx
PowerPoint   .pptx
Excel        .xlsx, .xls
CSV          .csv
JSON         .json
Markdown     .md, .markdown
Plain text   .txt
HTML         .html, .htm
XML          .xml
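
The format list above can be expressed as a lookup table for checking a file before it is sent to the pipeline. This is a minimal sketch, not the package's own reader: `SUPPORTED_FORMATS` and `detect_format` are hypothetical names, and only the extensions themselves come from the list.

```python
from pathlib import Path

# Extension-to-format table copied from the supported formats list.
SUPPORTED_FORMATS = {
    ".pdf": "PDF",
    ".docx": "Word",
    ".pptx": "PowerPoint",
    ".xlsx": "Excel",
    ".xls": "Excel",
    ".csv": "CSV",
    ".json": "JSON",
    ".md": "Markdown",
    ".markdown": "Markdown",
    ".txt": "Plain text",
    ".html": "HTML",
    ".htm": "HTML",
    ".xml": "XML",
}


def detect_format(path: str) -> str:
    """Return the format name for a path, or raise ValueError if unsupported."""
    suffix = Path(path).suffix.lower()  # case-insensitive match on the extension
    try:
        return SUPPORTED_FORMATS[suffix]
    except KeyError:
        raise ValueError(
            f"Unsupported document extension: {suffix or '(none)'}"
        ) from None
```

Matching on the lowercased suffix means `investor.PDF` and `investor.pdf` are treated the same way.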

Supported LLM providers (via LiteLLM)

Any model LiteLLM supports — OpenAI, Anthropic, Groq, Ollama, AWS Bedrock, Azure OpenAI, Google Vertex AI, and more.


Installation

# Step 1 — create venv
python -m venv venv
.\venv\Scripts\activate   # Windows
# source venv/bin/activate  # Linux/Mac

# Step 2 — install litellm (pinned)
pip install "litellm==1.59.12" --no-cache-dir

# Step 3 — install mapper (from sibling modules3/mapper/)
pip install -e ../mapper --no-cache-dir --no-deps
pip install PyMuPDF tiktoken pydantic pydantic-settings python-dotenv tenacity requests aiohttp httpx numpy tqdm python-json-logger --no-cache-dir

# Step 4 — install extractor
pip install -e . --no-cache-dir --no-deps
pip install python-docx python-pptx openpyxl --no-cache-dir

# Verify
python -c "import pdf_autofillr_doc_upload; print('extractor ok')"
python -c "import pdf_autofillr_mapper; print('mapper ok')"

Setup

1. Copy sample configs

python -c "import pdf_autofillr_doc_upload; pdf_autofillr_doc_upload.copy_sample_configs('.')"

2. Create .env

cp .env.example .env
# Edit .env — set DOC_UPLOAD_LLM_MODEL and DOC_UPLOAD_LLM_API_KEY at minimum

Minimal .env:

DOC_UPLOAD_LLM_MODEL=openai/gpt-4.1-mini
DOC_UPLOAD_LLM_API_KEY=sk-...
DOC_UPLOAD_STORAGE=local
DOC_UPLOAD_DATA_PATH=./extractor_data
DOC_UPLOAD_CONFIG_PATH=./configs
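
A quick check that the two required variables are present can save a failed first run. The sketch below is an illustration only: `REQUIRED_VARS` and `missing_settings` are hypothetical helpers, not part of the package, and the package's own validation in settings.py may behave differently.

```python
import os
from typing import Mapping, Optional

# The two variables the setup step says must be set at minimum.
REQUIRED_VARS = ("DOC_UPLOAD_LLM_MODEL", "DOC_UPLOAD_LLM_API_KEY")


def missing_settings(env: Optional[Mapping[str, str]] = None) -> list:
    """Return the required variable names that are unset or blank."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]
```

Called with no arguments, it inspects the current process environment, so load the .env file first (for example with `load_dotenv()` from python-dotenv, which is among the packages installed earlier). An empty list means both variables are set.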

Running

Interactive local runner

python -m entrypoints.local

Non-interactive (single document)

python -m entrypoints.local --document investor.pdf --schema configs/form_keys.json --output output/filled.json

CLI

doc-upload-cli --document investor.pdf --schema configs/form_keys.json --output filled.json --report

FastAPI server

doc-upload-server
# or
uvicorn entrypoints.fastapi_app:app --reload --port 8001

Then POST a JSON body to http://localhost:8001/extract:

{
  "document_path": "/path/to/investor_profile.pdf",
  "schema_path": "configs/form_keys.json"
}
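
The request above can be sent from Python with the standard library alone. This is a sketch under assumptions: `build_extract_request` and `post_extract` are hypothetical helper names, and the response is assumed to be JSON, which the README does not state.

```python
import json
import urllib.request


def build_extract_request(document_path, schema_path, base_url="http://localhost:8001"):
    """Build (but do not send) a JSON POST request for the /extract endpoint."""
    body = json.dumps(
        {"document_path": document_path, "schema_path": schema_path}
    ).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/extract",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post_extract(document_path, schema_path, base_url="http://localhost:8001"):
    """Send the request and decode the reply (assumes the server returns JSON)."""
    request = build_extract_request(document_path, schema_path, base_url)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, `post_extract("/path/to/investor_profile.pdf", "configs/form_keys.json")` sends the same body shown above.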

Storage backends

Value   Description
local   Local filesystem (default, for dev)
s3      AWS S3
gcp     Google Cloud Storage
azure   Azure Blob Storage

To use a cloud backend, set DOC_UPLOAD_STORAGE to its value (for example, DOC_UPLOAD_STORAGE=s3) and set the matching bucket environment variables.


PDF Filling (mapper integration)

Set DOC_UPLOAD_PDF_FILLER=mapper and the Lambda URL:

DOC_UPLOAD_PDF_FILLER=mapper
DOC_UPLOAD_FILL_PDF_LAMBDA_URL=https://xyz.lambda-url.us-east-1.on.aws
DOC_UPLOAD_PDF_API_KEY=my-api-key

The client runs extraction and embed-file preparation in parallel, then calls fill_pdf once both complete — identical to the Lambda main.py pipeline.
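
The run-two-steps-in-parallel-then-fill flow described above can be sketched with a thread pool. This is an illustration of the pattern, not the client's actual code: `run_pipeline` and its three callables (`extract`, `prepare_embed_file`, `fill_pdf`) are hypothetical stand-ins for the real steps.

```python
from concurrent.futures import ThreadPoolExecutor


def run_pipeline(extract, prepare_embed_file, fill_pdf):
    """Run the two independent steps concurrently, then fill once both finish."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        extraction_future = pool.submit(extract)
        embed_future = pool.submit(prepare_embed_file)
        # .result() blocks until the step is done and re-raises its exception, if any.
        extracted = extraction_future.result()
        embed_file = embed_future.result()
    # Both inputs are ready here, so the fill step runs exactly once.
    return fill_pdf(extracted, embed_file)
```

Threads suit this case because both steps are likely dominated by network or file I/O; if either step fails, its exception surfaces before `fill_pdf` is called.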


Telemetry

Value     Description
off       Disabled (default, zero overhead)
local     Append events to ./extractor_telemetry/events.jsonl
managed   (stub) HTTP POST to DOC_UPLOAD_TELEMETRY_ENDPOINT

Field values are never included in telemetry. Only metadata (counts, latencies, file extensions) is logged. Job IDs are one-way SHA-256 hashed.
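
To make the privacy claim concrete, here is a sketch of what a metadata-only event with a hashed job ID could look like when appended to a JSONL file. The event keys and the helper names `hash_job_id` and `append_event` are assumptions for illustration; the real collector's event schema is not documented here.

```python
import hashlib
import json
import time
from pathlib import Path


def hash_job_id(job_id: str) -> str:
    """One-way SHA-256 digest of a job ID, as a 64-character hex string."""
    return hashlib.sha256(job_id.encode("utf-8")).hexdigest()


def append_event(path, job_id, file_extension, field_count, latency_ms):
    """Append one metadata-only event as a JSON line; no field values are stored."""
    event = {
        "job": hash_job_id(job_id),  # the raw job ID never reaches the file
        "file_extension": file_extension,
        "field_count": field_count,
        "latency_ms": latency_ms,
        "timestamp": time.time(),
    }
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    with target.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(event) + "\n")
    return event
```

Because each event is one line, the file can be tailed or processed line by line without parsing it as a whole.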


Entrypoints

File                            Use case
entrypoints/local.py            Interactive development REPL
entrypoints/cli.py              doc-upload-cli command
entrypoints/server.py           doc-upload-server (uvicorn)
entrypoints/fastapi_app.py      FastAPI app (mount or standalone)
entrypoints/aws_lambda.py       AWS Lambda handler
entrypoints/gcp_function.py     GCP Cloud Functions handler
entrypoints/azure_function.py   Azure Functions handler

Programmatic API

from pdf_autofillr_doc_upload import DocUploadClient

client = DocUploadClient()

# Extract only
result = client.run(
    document_path="investor_profile.pdf",
    schema_path="configs/form_keys.json",
    output_path="output/filled.json",
)

print(result["output_flat"])   # flat dot-notation dict
print(result["output_nested"]) # nested dict matching schema

# Extract + fill PDF (needs DOC_UPLOAD_PDF_FILLER=mapper, see "PDF Filling" above)
result = client.run(
    document_path="investor_profile.pdf",
    schema_path="configs/form_keys.json",
    user_id="42",
    pdf_doc_id="99",
    session_id="sess_abc",
    investor_type="Individual",
)
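
The example above returns the same data in two shapes: a flat dict with dot-notation keys and a nested dict. The sketch below shows how the two shapes relate, using made-up field names. `flatten` and `unflatten` are hypothetical helpers; the package's own conversion (for instance, how it treats lists or keys containing dots) may differ.

```python
def flatten(nested, prefix=""):
    """Flatten a nested dict into a single-level dict with dot-separated keys."""
    flat = {}
    for key, value in nested.items():
        path = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, dict) and value:
            flat.update(flatten(value, path))  # recurse into non-empty sub-dicts
        else:
            flat[path] = value
    return flat


def unflatten(flat):
    """Rebuild the nested dict from dot-separated keys."""
    nested = {}
    for path, value in flat.items():
        parts = path.split(".")
        node = nested
        for part in parts[:-1]:
            node = node.setdefault(part, {})
        node[parts[-1]] = value
    return nested
```

For example, `flatten({"investor": {"name": "Jane Doe"}})` gives `{"investor.name": "Jane Doe"}`, and `unflatten` reverses it.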

Module structure

extractor/
├── pyproject.toml
├── .env.example
├── README.md
├── config_samples/
│   └── form_keys.json
├── entrypoints/
│   ├── local.py             ← interactive REPL
│   ├── cli.py               ← doc-upload-cli
│   ├── server.py            ← doc-upload-server
│   ├── fastapi_app.py       ← FastAPI app
│   ├── aws_lambda.py        ← AWS Lambda
│   ├── gcp_function.py      ← GCP Cloud Functions
│   └── azure_function.py    ← Azure Functions
└── src/pdf_autofillr_doc_upload/
    ├── __init__.py
    ├── client.py            ← DocUploadClient (main API)
    ├── config/
    │   └── settings.py      ← all env var config
    ├── storage/
    │   ├── base.py          ← abstract interface
    │   ├── local_storage.py
    │   ├── s3_storage.py
    │   ├── gcp_storage.py
    │   ├── azure_storage.py
    │   └── factory.py
    ├── extraction/
    │   ├── document_reader.py  ← PDF/DOCX/PPTX/XLSX/CSV/JSON/MD/HTML/XML
    │   ├── llm_client.py       ← LiteLLM wrapper
    │   └── extractor.py        ← full pipeline
    ├── pdf/
    │   ├── interface.py
    │   ├── api_handler.py      ← HTTP client for Lambda
    │   └── mapper_filler.py    ← mapper integration
    ├── logging/
    │   └── logger.py           ← ExecutionLogger
    ├── telemetry/
    │   ├── collector.py
    │   └── config.py
    └── managed/
        └── __init__.py         ← stub for future managed service
