Document extraction pipeline — extract structured data from PDF/DOCX/PPTX/XLSX/JSON/CSV/MD and fill PDFs via the mapper module
pdf-autofillr-doc-upload
Extract structured data from any supported document format using an LLM, then optionally fill a blank PDF via the mapper module.
Supported document formats
| Format | Extension |
|---|---|
| PDF | .pdf |
| Word | .docx |
| PowerPoint | .pptx |
| Excel | .xlsx, .xls |
| CSV | .csv |
| JSON | .json |
| Markdown | .md, .markdown |
| Plain text | .txt |
| HTML | .html, .htm |
| XML | .xml |
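A pre-flight check against the table above can avoid sending an unsupported file to the LLM. The sketch below is not part of the package: `SUPPORTED_EXTENSIONS` and `is_supported` are hypothetical names, and the set is simply copied from the table.

```python
from pathlib import Path

# Extensions copied from the table above. This set is illustrative and is
# not read from the package itself.
SUPPORTED_EXTENSIONS = {
    ".pdf", ".docx", ".pptx", ".xlsx", ".xls", ".csv", ".json",
    ".md", ".markdown", ".txt", ".html", ".htm", ".xml",
}


def is_supported(path: str) -> bool:
    """Return True if the file extension appears in the table (case-insensitive)."""
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS
```

The check looks only at the file name, so it says nothing about whether the file contents are valid for that format.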
Supported LLM providers (via LiteLLM)
Any model LiteLLM supports — OpenAI, Anthropic, Groq, Ollama, AWS Bedrock, Azure OpenAI, Google Vertex AI, and more.
Installation
# Step 1 — create venv
python -m venv venv
.\venv\Scripts\activate # Windows
# source venv/bin/activate # Linux/Mac
# Step 2 — install litellm (pinned)
pip install "litellm==1.59.12" --no-cache-dir
# Step 3 — install mapper (from sibling modules3/mapper/)
pip install -e ../mapper --no-cache-dir --no-deps
pip install PyMuPDF tiktoken pydantic pydantic-settings python-dotenv tenacity requests aiohttp httpx numpy tqdm python-json-logger --no-cache-dir
# Step 4 — install extractor
pip install -e . --no-cache-dir --no-deps
pip install python-docx python-pptx openpyxl --no-cache-dir
# Verify
python -c "import pdf_autofillr_doc_upload; print('extractor ok')"
python -c "import pdf_autofillr_mapper; print('mapper ok')"
Setup
1. Copy sample configs
python -c "import pdf_autofillr_doc_upload; pdf_autofillr_doc_upload.copy_sample_configs('.')"
2. Create .env
cp .env.example .env
# Edit .env — set DOC_UPLOAD_LLM_MODEL and DOC_UPLOAD_LLM_API_KEY at minimum
Minimal .env:
DOC_UPLOAD_LLM_MODEL=openai/gpt-4.1-mini
DOC_UPLOAD_LLM_API_KEY=sk-...
DOC_UPLOAD_STORAGE=local
DOC_UPLOAD_DATA_PATH=./extractor_data
DOC_UPLOAD_CONFIG_PATH=./configs
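Before starting a run it can help to confirm that the two variables named as the minimum are actually set. This is a small sketch using only the standard library; `REQUIRED_VARS` and `missing_env_vars` are hypothetical names, and treating the remaining variables as optional is an assumption, since the package's own validation lives in its settings module.

```python
import os

# The two variables the setup step calls the minimum. Treating the other
# entries of the minimal .env as optional is an assumption of this sketch.
REQUIRED_VARS = ("DOC_UPLOAD_LLM_MODEL", "DOC_UPLOAD_LLM_API_KEY")


def missing_env_vars(env=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

Calling `missing_env_vars()` with no argument inspects the current process environment. If the values exist only in the .env file, load them into the process first, for example with `load_dotenv()` from python-dotenv.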
Running
Interactive local runner
python -m entrypoints.local
Non-interactive (single document)
python -m entrypoints.local --document investor.pdf --schema configs/form_keys.json --output output/filled.json
CLI
doc-upload-cli --document investor.pdf --schema configs/form_keys.json --output filled.json --report
FastAPI server
doc-upload-server
# or
uvicorn entrypoints.fastapi_app:app --reload --port 8001
Then POST to http://localhost:8001/extract:
{
"document_path": "/path/to/investor_profile.pdf",
"schema_path": "configs/form_keys.json"
}
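The same request can be sent from Python with the standard library. This is a sketch under stated assumptions: the helper names `build_extract_request` and `post_extract` are hypothetical, the timeout is arbitrary, and the response is assumed to be a JSON body, since its shape is not documented here.

```python
import json
import urllib.request


def build_extract_request(document_path, schema_path,
                          base_url="http://localhost:8001"):
    """Build (but do not send) a JSON POST request for the /extract endpoint."""
    payload = {"document_path": document_path, "schema_path": schema_path}
    return urllib.request.Request(
        f"{base_url}/extract",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post_extract(document_path, schema_path, timeout=120):
    """Send the request and return the decoded response.

    Assumes the server replies with a JSON body.
    """
    request = build_extract_request(document_path, schema_path)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, `post_extract("/path/to/investor_profile.pdf", "configs/form_keys.json")` sends the payload shown above.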
Storage backends
| Value | Description |
|---|---|
| local | Local filesystem (default, for dev) |
| s3 | AWS S3 |
| gcp | Google Cloud Storage |
| azure | Azure Blob Storage |
Set DOC_UPLOAD_STORAGE to the backend you want (for example, DOC_UPLOAD_STORAGE=s3) and set the matching bucket environment variables.
PDF Filling (mapper integration)
Set DOC_UPLOAD_PDF_FILLER=mapper and the Lambda URL:
DOC_UPLOAD_PDF_FILLER=mapper
DOC_UPLOAD_FILL_PDF_LAMBDA_URL=https://xyz.lambda-url.us-east-1.on.aws
DOC_UPLOAD_PDF_API_KEY=my-api-key
The client runs extraction and embed-file preparation in parallel, then calls fill_pdf once both have completed. This is the same sequence as the Lambda main.py pipeline.
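The "two steps in parallel, then one fill call" shape can be illustrated with a thread pool. This is only a sketch of the pattern, not the client's code: `run_pipeline` is a hypothetical name, the three callables are placeholders supplied by the caller, and the real client may use a different concurrency mechanism.

```python
from concurrent.futures import ThreadPoolExecutor


def run_pipeline(extract, prepare_embed_file, fill_pdf):
    """Run two independent steps concurrently, then call fill_pdf once.

    All three callables are placeholders standing in for the real
    extraction, embed-file preparation, and fill steps.
    """
    with ThreadPoolExecutor(max_workers=2) as pool:
        extraction_future = pool.submit(extract)
        embed_future = pool.submit(prepare_embed_file)
        # result() blocks until the step finishes and re-raises its exception.
        extracted = extraction_future.result()
        embed_info = embed_future.result()
    # Reached only after both steps have completed successfully.
    return fill_pdf(extracted, embed_info)
```

Because `fill_pdf` is called only after both futures resolve, a failure in either step prevents a fill request from being sent with partial inputs.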
Telemetry
| Value | Description |
|---|---|
| off | Disabled (default, zero overhead) |
| local | Append events to ./extractor_telemetry/events.jsonl |
| managed | (stub) HTTP POST to DOC_UPLOAD_TELEMETRY_ENDPOINT |
Field values are never included in telemetry. Only metadata (counts, latencies, file extensions) is logged. Job IDs are one-way SHA-256 hashed.
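To make the privacy rule concrete, here is a sketch of what a metadata-only event with a hashed job ID could look like. The function names and the event keys (`job`, `ext`, `field_count`, `latency_ms`) are assumptions for illustration; the actual event schema written by the collector is not documented here.

```python
import hashlib
import json


def hash_job_id(job_id: str) -> str:
    """Return the hex SHA-256 digest of a job ID."""
    return hashlib.sha256(job_id.encode("utf-8")).hexdigest()


def build_event(job_id, file_extension, field_count, latency_ms):
    """Build a metadata-only event; no extracted field values are included."""
    return {
        "job": hash_job_id(job_id),
        "ext": file_extension,
        "field_count": field_count,
        "latency_ms": latency_ms,
    }


def append_event(path, event):
    """Append one event to a JSON Lines file, one JSON object per line."""
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(json.dumps(event) + "\n")
```

The event carries a count of fields rather than the fields themselves, which matches the rule that only counts, latencies, and file extensions are logged.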
Entrypoints
| File | Use case |
|---|---|
| entrypoints/local.py | Interactive development REPL |
| entrypoints/cli.py | doc-upload-cli command |
| entrypoints/server.py | doc-upload-server (uvicorn) |
| entrypoints/fastapi_app.py | FastAPI app (mount or standalone) |
| entrypoints/aws_lambda.py | AWS Lambda handler |
| entrypoints/gcp_function.py | GCP Cloud Functions handler |
| entrypoints/azure_function.py | Azure Functions handler |
Programmatic API
from pdf_autofillr_doc_upload import DocUploadClient
client = DocUploadClient()
# Extract only
result = client.run(
document_path="investor_profile.pdf",
schema_path="configs/form_keys.json",
output_path="output/filled.json",
)
print(result["output_flat"]) # flat dot-notation dict
print(result["output_nested"]) # nested dict matching schema
# Extract + fill PDF
result = client.run(
document_path="investor_profile.pdf",
schema_path="configs/form_keys.json",
user_id="42",
pdf_doc_id="99",
session_id="sess_abc",
investor_type="Individual",
)
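The relationship between the flat dot-notation output and the nested output can be sketched as a pair of helpers. This is an illustration of the two shapes only: `flatten` and `unflatten` are hypothetical names, the example keys are invented, and how the package itself treats lists or keys that contain dots is not specified here.

```python
def flatten(nested, prefix=""):
    """Flatten a nested dict into a single-level dict with dot-notation keys."""
    flat = {}
    for key, value in nested.items():
        path = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, dict) and value:
            # Recurse into non-empty dicts; lists and scalars are kept as leaves.
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def unflatten(flat):
    """Rebuild the nested dict from dot-notation keys."""
    nested = {}
    for path, value in flat.items():
        *parents, leaf = path.split(".")
        node = nested
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value
    return nested
```

For example, a nested value such as `{"investor": {"name": "Ada"}}` corresponds to the flat entry `"investor.name": "Ada"`.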
Module structure
extractor/
├── pyproject.toml
├── .env.example
├── README.md
├── config_samples/
│ └── form_keys.json
├── entrypoints/
│ ├── local.py ← interactive REPL
│ ├── cli.py ← doc-upload-cli
│ ├── server.py ← doc-upload-server
│ ├── fastapi_app.py ← FastAPI app
│ ├── aws_lambda.py ← AWS Lambda
│ ├── gcp_function.py ← GCP Cloud Functions
│ └── azure_function.py ← Azure Functions
└── src/pdf_autofillr_doc_upload/
├── __init__.py
├── client.py ← DocUploadClient (main API)
├── config/
│ └── settings.py ← all env var config
├── storage/
│ ├── base.py ← abstract interface
│ ├── local_storage.py
│ ├── s3_storage.py
│ ├── gcp_storage.py
│ ├── azure_storage.py
│ └── factory.py
├── extraction/
│ ├── document_reader.py ← PDF/DOCX/PPTX/XLSX/CSV/JSON/MD/HTML/XML
│ ├── llm_client.py ← LiteLLM wrapper
│ └── extractor.py ← full pipeline
├── pdf/
│ ├── interface.py
│ ├── api_handler.py ← HTTP client for Lambda
│ └── mapper_filler.py ← mapper integration
├── logging/
│ └── logger.py ← ExecutionLogger
├── telemetry/
│ ├── collector.py
│ └── config.py
└── managed/
└── __init__.py ← stub for future managed service