Document extraction pipeline — extract structured data from PDF/DOCX/PPTX/XLSX/JSON/CSV/MD and fill PDFs via the mapper module

pdf-autofillr-doc-upload

Extract structured data from any supported document format using an LLM, then optionally fill a blank PDF via the mapper module.

Supported document formats

Format        Extension
PDF           .pdf
Word          .docx
PowerPoint    .pptx
Excel         .xlsx, .xls
CSV           .csv
JSON          .json
Markdown      .md, .markdown
Plain text    .txt
HTML          .html, .htm
XML           .xml
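
As a rough illustration of the table above, a caller could check a file's extension against the supported set before handing it to the pipeline. The `SUPPORTED_EXTENSIONS` mapping and the `detect_format` helper are hypothetical names used only for this sketch; they are not part of the package's API, and the package's own reader may detect formats differently.

```python
from pathlib import Path

# Extension -> format name, copied from the table above.
# Illustrative only: the package's reader may organize this differently.
SUPPORTED_EXTENSIONS = {
    ".pdf": "PDF",
    ".docx": "Word",
    ".pptx": "PowerPoint",
    ".xlsx": "Excel",
    ".xls": "Excel",
    ".csv": "CSV",
    ".json": "JSON",
    ".md": "Markdown",
    ".markdown": "Markdown",
    ".txt": "Plain text",
    ".html": "HTML",
    ".htm": "HTML",
    ".xml": "XML",
}


def detect_format(path: str) -> str:
    """Return the format name for a path, or raise ValueError if unsupported."""
    ext = Path(path).suffix.lower()  # case-insensitive match on the extension
    try:
        return SUPPORTED_EXTENSIONS[ext]
    except KeyError:
        raise ValueError(f"Unsupported document extension: {ext!r}") from None
```

Calling `detect_format("investor.pdf")` before a run gives an early, readable error for unsupported files instead of a failure deeper in the pipeline.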

Supported LLM providers (via LiteLLM)

Any model LiteLLM supports — OpenAI, Anthropic, Groq, Ollama, AWS Bedrock, Azure OpenAI, Google Vertex AI, and more.


Installation

# Step 1 — create venv
python -m venv venv
.\venv\Scripts\activate   # Windows
# source venv/bin/activate  # Linux/Mac

# Step 2 — install litellm (pinned)
pip install "litellm==1.59.12" --no-cache-dir

# Step 3 — install mapper (from sibling modules3/mapper/)
pip install -e ../mapper --no-cache-dir --no-deps
pip install PyMuPDF tiktoken pydantic pydantic-settings python-dotenv tenacity requests aiohttp httpx numpy tqdm python-json-logger --no-cache-dir

# Step 4 — install extractor
pip install -e . --no-cache-dir --no-deps
pip install python-docx python-pptx openpyxl --no-cache-dir

# Verify
python -c "import pdf_autofillr_doc_upload; print('extractor ok')"
python -c "import pdf_autofillr_mapper; print('mapper ok')"

Setup

1. Copy sample configs

python -c "import pdf_autofillr_doc_upload; pdf_autofillr_doc_upload.copy_sample_configs('.')"

2. Create .env

cp .env.example .env
# Edit .env — set DOC_UPLOAD_LLM_MODEL and DOC_UPLOAD_LLM_API_KEY at minimum

Minimal .env:

DOC_UPLOAD_LLM_MODEL=openai/gpt-4.1-mini
DOC_UPLOAD_LLM_API_KEY=sk-...
DOC_UPLOAD_STORAGE=local
DOC_UPLOAD_DATA_PATH=./extractor_data
DOC_UPLOAD_CONFIG_PATH=./configs
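
To make the role of these variables concrete, here is a minimal sketch of how they might be read and validated. It is not the package's `settings.py`: the `load_settings` function is hypothetical, the two required names come from the comment above, and the fallback values for the data and config paths are assumptions taken from the sample `.env` (only `local` is documented as a default).

```python
import os

# The two variables the README says must be set at minimum.
REQUIRED = ("DOC_UPLOAD_LLM_MODEL", "DOC_UPLOAD_LLM_API_KEY")

# Fallbacks are assumptions for this sketch; only "local" is a documented default.
DEFAULTS = {
    "DOC_UPLOAD_STORAGE": "local",
    "DOC_UPLOAD_DATA_PATH": "./extractor_data",
    "DOC_UPLOAD_CONFIG_PATH": "./configs",
}


def load_settings(env=None) -> dict:
    """Collect DOC_UPLOAD_* settings from a mapping (os.environ by default)."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError("Missing required settings: " + ", ".join(missing))
    settings = {name: env[name] for name in REQUIRED}
    for name, default in DEFAULTS.items():
        settings[name] = env.get(name, default)
    return settings
```

Passing a plain dict instead of `os.environ` makes the check easy to exercise in tests without touching the real environment.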

Running

Interactive local runner

python -m entrypoints.local

Non-interactive (single document)

python -m entrypoints.local --document investor.pdf --schema configs/form_keys.json --output output/filled.json

CLI

doc-upload-cli --document investor.pdf --schema configs/form_keys.json --output filled.json --report

FastAPI server

doc-upload-server
# or
uvicorn entrypoints.fastapi_app:app --reload --port 8001

Then POST to http://localhost:8001/extract:

{
  "document_path": "/path/to/investor_profile.pdf",
  "schema_path": "configs/form_keys.json"
}
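
The same request can be sent from Python. The sketch below uses only the standard library and assumes the server is running locally on port 8001 as shown above. The helper names are hypothetical, and the assumption that the endpoint replies with a JSON body is not confirmed by this README.

```python
import json
import urllib.request


def build_extract_request(document_path, schema_path, base_url="http://localhost:8001"):
    """Build a POST request for /extract with the JSON body shown above."""
    payload = {"document_path": document_path, "schema_path": schema_path}
    return urllib.request.Request(
        url=f"{base_url}/extract",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post_extract(document_path, schema_path, timeout=120.0):
    """Send the request and decode the reply (assumed here to be JSON)."""
    request = build_extract_request(document_path, schema_path)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server up, `post_extract("/path/to/investor_profile.pdf", "configs/form_keys.json")` performs the call. The generous timeout is a guess, chosen because LLM extraction can be slow.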

Storage backends

Value    Description
local    Local filesystem (default, for dev)
s3       AWS S3
gcp      Google Cloud Storage
azure    Azure Blob Storage

To use a cloud backend, set DOC_UPLOAD_STORAGE to its value (for example, DOC_UPLOAD_STORAGE=s3) and set the matching bucket env vars.
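
The module layout lists an abstract storage interface and a factory. The following is a minimal sketch of that pattern, not the package's code: the `Storage` class, its two methods, and `make_storage` are hypothetical names, and the cloud backends are deliberately left out.

```python
from abc import ABC, abstractmethod
from pathlib import Path


class Storage(ABC):
    """Hypothetical minimal interface shared by all backends."""

    @abstractmethod
    def write_bytes(self, key: str, data: bytes) -> None: ...

    @abstractmethod
    def read_bytes(self, key: str) -> bytes: ...


class LocalStorage(Storage):
    """Stores each key as a file under a root directory."""

    def __init__(self, root: str) -> None:
        self.root = Path(root)

    def write_bytes(self, key: str, data: bytes) -> None:
        target = self.root / key
        target.parent.mkdir(parents=True, exist_ok=True)  # create nested folders
        target.write_bytes(data)

    def read_bytes(self, key: str) -> bytes:
        return (self.root / key).read_bytes()


def make_storage(kind: str, data_path: str) -> Storage:
    """Pick a backend from the DOC_UPLOAD_STORAGE value."""
    if kind == "local":
        return LocalStorage(data_path)
    if kind in {"s3", "gcp", "azure"}:
        raise NotImplementedError(f"{kind!r} backend is omitted from this sketch")
    raise ValueError(f"Unknown DOC_UPLOAD_STORAGE value: {kind!r}")
```

Keeping callers on the abstract interface means switching from `local` to a cloud backend is a configuration change rather than a code change.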


PDF Filling (mapper integration)

Set DOC_UPLOAD_PDF_FILLER=mapper and the Lambda URL:

DOC_UPLOAD_PDF_FILLER=mapper
DOC_UPLOAD_FILL_PDF_LAMBDA_URL=https://xyz.lambda-url.us-east-1.on.aws
DOC_UPLOAD_PDF_API_KEY=my-api-key

The client runs extraction and embed-file preparation in parallel, then calls fill_pdf once both have completed. This is the same sequence as the Lambda main.py pipeline.
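
The run-two-steps-in-parallel-then-join flow described above can be sketched with a thread pool. The three step functions are placeholders with invented return values, and the use of threads is an assumption; the actual client may use asyncio or another mechanism.

```python
from concurrent.futures import ThreadPoolExecutor


def extract_fields(document_path: str) -> dict:
    """Placeholder for the LLM extraction step."""
    return {"source": document_path, "fields": {"investor.name": "example"}}


def prepare_embed_file(pdf_doc_id: str) -> str:
    """Placeholder for the embed-file preparation step."""
    return f"embed-{pdf_doc_id}"


def fill_pdf(extracted: dict, embed_file: str) -> dict:
    """Placeholder for the fill step; needs the results of both earlier steps."""
    return {"embed_file": embed_file, "filled_fields": len(extracted["fields"])}


def run_pipeline(document_path: str, pdf_doc_id: str) -> dict:
    with ThreadPoolExecutor(max_workers=2) as pool:
        # Both independent steps start immediately and run concurrently.
        extraction = pool.submit(extract_fields, document_path)
        embed = pool.submit(prepare_embed_file, pdf_doc_id)
        # result() blocks until the task finishes and re-raises its exception, if any.
        extracted = extraction.result()
        embed_file = embed.result()
    # fill_pdf runs only after both results are available.
    return fill_pdf(extracted, embed_file)
```

Because both steps are largely I/O bound (an LLM call and file preparation), running them concurrently should make the total wait closer to the slower step than to the sum of both.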


Telemetry

Value      Description
off        Disabled (default, zero overhead)
local      Append events to ./extractor_telemetry/events.jsonl
managed    HTTP POST to DOC_UPLOAD_TELEMETRY_ENDPOINT (stub)

Field values are never included in telemetry; only metadata (counts, latencies, file extensions) is logged. Job IDs are hashed with SHA-256 (one-way) before they are recorded.
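
A minimal sketch of what a metadata-only event could look like in local mode. The event keys and the function names are assumptions made for illustration; the README specifies only that job IDs are SHA-256 hashed, that field values are excluded, and that events are appended to a JSONL file.

```python
import hashlib
import json
from pathlib import Path


def hash_job_id(job_id: str) -> str:
    """One-way SHA-256 digest of a job ID, as a 64-character hex string."""
    return hashlib.sha256(job_id.encode("utf-8")).hexdigest()


def build_event(job_id: str, document_path: str, field_count: int, latency_ms: float) -> dict:
    """Build an event with metadata only: no field values, no raw job ID, no file name."""
    return {
        "job_id_hash": hash_job_id(job_id),
        "extension": Path(document_path).suffix.lower(),
        "field_count": field_count,
        "latency_ms": latency_ms,
    }


def append_event(events_file: str, event: dict) -> None:
    """Append one event as a single JSON line (the JSONL convention)."""
    path = Path(events_file)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(event) + "\n")
```

Hashing lets events from the same job be correlated without storing the original identifier.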


Entrypoints

File                             Use case
entrypoints/local.py             Interactive development REPL
entrypoints/cli.py               doc-upload-cli command
entrypoints/server.py            doc-upload-server (uvicorn)
entrypoints/fastapi_app.py       FastAPI app (mount or standalone)
entrypoints/aws_lambda.py        AWS Lambda handler
entrypoints/gcp_function.py      GCP Cloud Functions handler
entrypoints/azure_function.py    Azure Functions handler

Programmatic API

from pdf_autofillr_doc_upload import DocUploadClient

client = DocUploadClient()

# Extract only
result = client.run(
    document_path="investor_profile.pdf",
    schema_path="configs/form_keys.json",
    output_path="output/filled.json",
)

print(result["output_flat"])   # flat dot-notation dict
print(result["output_nested"]) # nested dict matching schema

# Extract + fill PDF
result = client.run(
    document_path="investor_profile.pdf",
    schema_path="configs/form_keys.json",
    user_id="42",
    pdf_doc_id="99",
    session_id="sess_abc",
    investor_type="Individual",
)
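
The two result shapes mentioned in the comments above, a flat dot-notation dict and a nested dict, carry the same data. The sketch below shows one way to convert between them. The helpers and the field names are hypothetical, and the package's exact conventions (for example, how lists are keyed) are not stated in this README.

```python
def flatten(nested: dict, prefix: str = "") -> dict:
    """Turn {"a": {"b": 1}} into {"a.b": 1}."""
    flat = {}
    for key, value in nested.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))  # recurse into sub-objects
        else:
            flat[path] = value
    return flat


def unflatten(flat: dict) -> dict:
    """Turn {"a.b": 1} back into {"a": {"b": 1}}."""
    nested: dict = {}
    for path, value in flat.items():
        *parents, leaf = path.split(".")
        node = nested
        for part in parents:
            node = node.setdefault(part, {})  # create intermediate levels on demand
        node[leaf] = value
    return nested


# Hypothetical example data, not taken from a real schema.
example_nested = {"investor": {"name": "Jane Doe", "address": {"city": "Austin"}}}
example_flat = flatten(example_nested)
```

The flat form is convenient for mapping values onto PDF form fields one key at a time, while the nested form mirrors the schema file.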

Module structure

extractor/
├── pyproject.toml
├── .env.example
├── README.md
├── config_samples/
│   └── form_keys.json
├── entrypoints/
│   ├── local.py             ← interactive REPL
│   ├── cli.py               ← doc-upload-cli
│   ├── server.py            ← doc-upload-server
│   ├── fastapi_app.py       ← FastAPI app
│   ├── aws_lambda.py        ← AWS Lambda
│   ├── gcp_function.py      ← GCP Cloud Functions
│   └── azure_function.py    ← Azure Functions
└── src/pdf_autofillr_doc_upload/
    ├── __init__.py
    ├── client.py            ← DocUploadClient (main API)
    ├── config/
    │   └── settings.py      ← all env var config
    ├── storage/
    │   ├── base.py          ← abstract interface
    │   ├── local_storage.py
    │   ├── s3_storage.py
    │   ├── gcp_storage.py
    │   ├── azure_storage.py
    │   └── factory.py
    ├── extraction/
    │   ├── document_reader.py  ← PDF/DOCX/PPTX/XLSX/CSV/JSON/MD/HTML/XML
    │   ├── llm_client.py       ← LiteLLM wrapper
    │   └── extractor.py        ← full pipeline
    ├── pdf/
    │   ├── interface.py
    │   ├── api_handler.py      ← HTTP client for Lambda
    │   └── mapper_filler.py    ← mapper integration
    ├── logging/
    │   └── logger.py           ← ExecutionLogger
    ├── telemetry/
    │   ├── collector.py
    │   └── config.py
    └── managed/
        └── __init__.py         ← stub for future managed service
