Document extraction pipeline — extract structured data from PDF/DOCX/PPTX/XLSX/JSON/CSV/MD and fill PDFs via the mapper module

pdf-autofillr-doc-upload

Extract structured data from any supported document format using an LLM, then optionally fill a blank PDF via the mapper module.

Supported document formats

Format       Extension
PDF          .pdf
Word         .docx
PowerPoint   .pptx
Excel        .xlsx, .xls
CSV          .csv
JSON         .json
Markdown     .md, .markdown
Plain text   .txt
HTML         .html, .htm
XML          .xml
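
The table above can be read as a simple extension lookup. The following is a minimal sketch, not the package's own reader: `SUPPORTED_FORMATS` and `detect_format` are hypothetical names introduced here, and the real dispatch in `document_reader.py` may work differently.

```python
from pathlib import Path

# Extension-to-format table copied from the list above.
SUPPORTED_FORMATS = {
    ".pdf": "PDF",
    ".docx": "Word",
    ".pptx": "PowerPoint",
    ".xlsx": "Excel",
    ".xls": "Excel",
    ".csv": "CSV",
    ".json": "JSON",
    ".md": "Markdown",
    ".markdown": "Markdown",
    ".txt": "Plain text",
    ".html": "HTML",
    ".htm": "HTML",
    ".xml": "XML",
}


def detect_format(path: str) -> str:
    """Hypothetical helper: map a file path to a format name from the table."""
    suffix = Path(path).suffix.lower()  # compare case-insensitively
    try:
        return SUPPORTED_FORMATS[suffix]
    except KeyError:
        raise ValueError(f"unsupported document extension: {suffix!r}") from None
```

Checking the extension before calling the extractor gives a clear error for unsupported files without spending an LLM call.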

Supported LLM providers (via LiteLLM)

Any model LiteLLM supports — OpenAI, Anthropic, Groq, Ollama, AWS Bedrock, Azure OpenAI, Google Vertex AI, and more.


Installation

# Step 1 — create venv
python -m venv venv
.\venv\Scripts\activate   # Windows
# source venv/bin/activate  # Linux/Mac

# Step 2 — install litellm (pinned)
pip install "litellm==1.59.12" --no-cache-dir

# Step 3 — install mapper (from sibling modules3/mapper/)
pip install -e ../mapper --no-cache-dir --no-deps
pip install PyMuPDF tiktoken pydantic pydantic-settings python-dotenv tenacity requests aiohttp httpx numpy tqdm python-json-logger --no-cache-dir

# Step 4 — install extractor
pip install -e . --no-cache-dir --no-deps
pip install python-docx python-pptx openpyxl --no-cache-dir

# Verify
python -c "import pdf_autofillr_doc_upload; print('extractor ok')"
python -c "import pdf_autofillr_mapper; print('mapper ok')"

Setup

1. Copy sample configs

python -c "import pdf_autofillr_doc_upload; pdf_autofillr_doc_upload.copy_sample_configs('.')"

2. Create .env

cp .env.example .env
# Edit .env — set DOC_UPLOAD_LLM_MODEL and DOC_UPLOAD_LLM_API_KEY at minimum

Minimal .env:

DOC_UPLOAD_LLM_MODEL=openai/gpt-4.1-mini
DOC_UPLOAD_LLM_API_KEY=sk-...
DOC_UPLOAD_STORAGE=local
DOC_UPLOAD_DATA_PATH=./extractor_data
DOC_UPLOAD_CONFIG_PATH=./configs
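
Since the two LLM variables are the stated minimum, a quick pre-flight check can catch a missing value before the first run. This is a sketch under assumptions: `REQUIRED_KEYS` and `missing_settings` are hypothetical helpers, and the package's own validation in `settings.py` may behave differently.

```python
import os

# The two variables the setup step names as the minimum.
REQUIRED_KEYS = ("DOC_UPLOAD_LLM_MODEL", "DOC_UPLOAD_LLM_API_KEY")


def missing_settings(env=None):
    """Hypothetical helper: list required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [key for key in REQUIRED_KEYS if not env.get(key)]


if __name__ == "__main__":
    missing = missing_settings()
    if missing:
        raise SystemExit("Missing settings: " + ", ".join(missing))
    print("LLM settings present")
```

The helper reads the process environment, so values that exist only in `.env` must be loaded first (for example with python-dotenv, which the installation step already installs).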

Running

Interactive local runner

python -m entrypoints.local

Non-interactive (single document)

python -m entrypoints.local --document investor.pdf --schema configs/form_keys.json --output output/filled.json

CLI

doc-upload-cli --document investor.pdf --schema configs/form_keys.json --output filled.json --report

FastAPI server

doc-upload-server
# or
uvicorn entrypoints.fastapi_app:app --reload --port 8001

Then POST to http://localhost:8001/extract:

{
  "document_path": "/path/to/investor_profile.pdf",
  "schema_path": "configs/form_keys.json"
}
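
The request above can be sent from Python with the standard library alone. This is a minimal sketch: `build_extract_request` and `post_extract` are hypothetical helper names, the base URL assumes the server was started on port 8001 as shown, and the response shape is not documented here, so it is returned as parsed JSON without further interpretation.

```python
import json
import urllib.request


def build_extract_request(document_path, schema_path, base_url="http://localhost:8001"):
    """Build (but do not send) the POST /extract request."""
    payload = {"document_path": document_path, "schema_path": schema_path}
    return urllib.request.Request(
        f"{base_url}/extract",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post_extract(document_path, schema_path, timeout=120):
    """Send the request and return the decoded JSON body."""
    request = build_extract_request(document_path, schema_path)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Requires a running server; see the commands above.
    print(post_extract("/path/to/investor_profile.pdf", "configs/form_keys.json"))
```

The generous timeout is an assumption: extraction involves an LLM call, so responses may take longer than a typical API request.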

Storage backends

Value   Description
local   Local filesystem (default, for dev)
s3      AWS S3
gcp     Google Cloud Storage
azure   Azure Blob Storage

To use a cloud backend, set DOC_UPLOAD_STORAGE to s3, gcp, or azure and provide the matching bucket env vars (see .env.example).
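
Backend selection by environment variable is commonly implemented as a small factory. The sketch below illustrates the pattern only: `Storage`, `LocalStorage`, and `make_storage` are hypothetical names and do not describe the actual classes in `storage/`; only the local backend is implemented.

```python
import os
from abc import ABC, abstractmethod
from pathlib import Path

KNOWN_BACKENDS = ("local", "s3", "gcp", "azure")  # values from the table above


class Storage(ABC):
    """Hypothetical abstract interface shared by all backends."""

    @abstractmethod
    def write_text(self, key: str, text: str) -> None: ...

    @abstractmethod
    def read_text(self, key: str) -> str: ...


class LocalStorage(Storage):
    """Stores each key as a file under a root directory."""

    def __init__(self, root):
        self.root = Path(root)

    def write_text(self, key, text):
        target = self.root / key
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(text, encoding="utf-8")

    def read_text(self, key):
        return (self.root / key).read_text(encoding="utf-8")


def make_storage(env=None) -> Storage:
    """Pick a backend from DOC_UPLOAD_STORAGE, defaulting to local."""
    env = os.environ if env is None else env
    backend = env.get("DOC_UPLOAD_STORAGE", "local")
    if backend not in KNOWN_BACKENDS:
        raise ValueError(f"unknown storage backend: {backend!r}")
    if backend == "local":
        return LocalStorage(env.get("DOC_UPLOAD_DATA_PATH", "./extractor_data"))
    raise NotImplementedError(f"{backend} is outside the scope of this sketch")
```

Keeping callers on the abstract interface means switching from local to a cloud bucket is a configuration change rather than a code change.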


PDF Filling (mapper integration)

Set DOC_UPLOAD_PDF_FILLER=mapper and point DOC_UPLOAD_FILL_PDF_LAMBDA_URL at the fill-PDF Lambda:

DOC_UPLOAD_PDF_FILLER=mapper
DOC_UPLOAD_FILL_PDF_LAMBDA_URL=https://xyz.lambda-url.us-east-1.on.aws
DOC_UPLOAD_PDF_API_KEY=my-api-key

The client runs extraction and embed-file preparation in parallel, then calls fill_pdf once both complete — identical to the Lambda main.py pipeline.
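
The "run two steps in parallel, then fill" flow can be sketched with a thread pool. This is an illustration under assumptions: the three functions are stand-in stubs with made-up return values, and the real client may use asyncio or another mechanism instead of threads.

```python
from concurrent.futures import ThreadPoolExecutor


def extract_fields(document_path):
    """Stub for the LLM extraction step (placeholder result)."""
    return {"investor.name": "placeholder"}


def prepare_embed_file(pdf_doc_id):
    """Stub for embed-file preparation (placeholder result)."""
    return f"embed-{pdf_doc_id}"


def fill_pdf(fields, embed_file):
    """Stub for the fill step; it needs the output of both steps above."""
    return {"embed_file": embed_file, "filled_fields": len(fields)}


def run_pipeline(document_path, pdf_doc_id):
    # Start both independent steps at once.
    with ThreadPoolExecutor(max_workers=2) as pool:
        extraction = pool.submit(extract_fields, document_path)
        embed = pool.submit(prepare_embed_file, pdf_doc_id)
        # result() blocks, so fill_pdf runs only after both have completed.
        fields = extraction.result()
        embed_file = embed.result()
    return fill_pdf(fields, embed_file)


if __name__ == "__main__":
    print(run_pipeline("investor_profile.pdf", "99"))
```

Because both steps are I/O-bound (an LLM call and file preparation), running them concurrently brings the wait time down to roughly that of the slower step.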


Telemetry

Value     Description
off       Disabled (default, zero overhead)
local     Append events to ./extractor_telemetry/events.jsonl
managed   (stub) HTTP POST to DOC_UPLOAD_TELEMETRY_ENDPOINT

Field values are never included in telemetry. Only metadata (counts, latencies, file extensions) is logged. Job IDs are one-way SHA-256 hashed.
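
To make the privacy guarantee concrete, here is a sketch of what a metadata-only event with a hashed job ID could look like in local mode. The event keys and the `hash_job_id` and `record_event` helpers are assumptions for illustration; the actual schema written by `collector.py` is not documented here, and whether the real hash is salted is unknown.

```python
import hashlib
import json
import time
from pathlib import Path


def hash_job_id(job_id: str) -> str:
    """One-way SHA-256 digest of the job ID (unsalted in this sketch)."""
    return hashlib.sha256(job_id.encode("utf-8")).hexdigest()


def record_event(job_id, document_path, field_count, latency_ms,
                 events_file="./extractor_telemetry/events.jsonl"):
    """Append one metadata-only event as a JSON line and return it."""
    event = {
        "job": hash_job_id(job_id),                        # never the raw ID
        "extension": Path(document_path).suffix.lower(),   # not the file name
        "field_count": field_count,                        # a count, not values
        "latency_ms": latency_ms,
        "timestamp": time.time(),
    }
    path = Path(events_file)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(event) + "\n")
    return event
```

JSON Lines suits append-only logging: each event is one self-contained line, so the file can be tailed or parsed incrementally.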


Entrypoints

File                             Use case
entrypoints/local.py             Interactive development REPL
entrypoints/cli.py               doc-upload-cli command
entrypoints/server.py            doc-upload-server (uvicorn)
entrypoints/fastapi_app.py       FastAPI app (mount or standalone)
entrypoints/aws_lambda.py        AWS Lambda handler
entrypoints/gcp_function.py      GCP Cloud Functions handler
entrypoints/azure_function.py    Azure Functions handler

Programmatic API

from pdf_autofillr_doc_upload import DocUploadClient

client = DocUploadClient()

# Extract only
result = client.run(
    document_path="investor_profile.pdf",
    schema_path="configs/form_keys.json",
    output_path="output/filled.json",
)

print(result["output_flat"])   # flat dot-notation dict
print(result["output_nested"]) # nested dict matching schema

# Extract + fill PDF
result = client.run(
    document_path="investor_profile.pdf",
    schema_path="configs/form_keys.json",
    user_id="42",
    pdf_doc_id="99",
    session_id="sess_abc",
    investor_type="Individual",
)
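
The two result views, `output_flat` (dot-notation keys) and `output_nested` (nested dicts), carry the same data in different shapes. The sketch below shows how such views typically relate; `nest` and `flatten` are hypothetical helpers, the field names are invented, and the client's own conversion may treat lists or edge cases differently.

```python
def nest(flat: dict) -> dict:
    """Turn {"a.b": 1} into {"a": {"b": 1}}."""
    nested = {}
    for dotted_key, value in flat.items():
        *parents, leaf = dotted_key.split(".")
        node = nested
        for part in parents:
            node = node.setdefault(part, {})  # create intermediate levels
        node[leaf] = value
    return nested


def flatten(nested: dict, prefix: str = "") -> dict:
    """Inverse of nest(): join nested keys with dots."""
    flat = {}
    for key, value in nested.items():
        dotted = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, dotted))
        else:
            flat[dotted] = value
    return flat


if __name__ == "__main__":
    # Invented field names, for illustration only.
    flat = {"investor.name": "Ada", "investor.address.city": "London"}
    print(nest(flat))
```

The flat form is convenient for mapping onto PDF form fields one key at a time, while the nested form mirrors the schema file.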

Module structure

extractor/
├── pyproject.toml
├── .env.example
├── README.md
├── config_samples/
│   └── form_keys.json
├── entrypoints/
│   ├── local.py             ← interactive REPL
│   ├── cli.py               ← doc-upload-cli
│   ├── server.py            ← doc-upload-server
│   ├── fastapi_app.py       ← FastAPI app
│   ├── aws_lambda.py        ← AWS Lambda
│   ├── gcp_function.py      ← GCP Cloud Functions
│   └── azure_function.py    ← Azure Functions
└── src/pdf_autofillr_doc_upload/
    ├── __init__.py
    ├── client.py            ← DocUploadClient (main API)
    ├── config/
    │   └── settings.py      ← all env var config
    ├── storage/
    │   ├── base.py          ← abstract interface
    │   ├── local_storage.py
    │   ├── s3_storage.py
    │   ├── gcp_storage.py
    │   ├── azure_storage.py
    │   └── factory.py
    ├── extraction/
    │   ├── document_reader.py  ← PDF/DOCX/PPTX/XLSX/CSV/JSON/MD/TXT/HTML/XML
    │   ├── llm_client.py       ← LiteLLM wrapper
    │   └── extractor.py        ← full pipeline
    ├── pdf/
    │   ├── interface.py
    │   ├── api_handler.py      ← HTTP client for Lambda
    │   └── mapper_filler.py    ← mapper integration
    ├── logging/
    │   └── logger.py           ← ExecutionLogger
    ├── telemetry/
    │   ├── collector.py
    │   └── config.py
    └── managed/
        └── __init__.py         ← stub for future managed service
