PDF field extraction, semantic mapping, embedding and filling engine
PDF Mapper Module
The core PDF field extraction, mapping, embedding, and filling engine.
Quick Start
1. Configure the Module
IMPORTANT: Configure the module before running it.
# Copy configuration templates
cp .env.example .env
cp config.ini.example config.ini
# Edit .env - Add your API keys
nano .env
# Edit config.ini - Set your storage paths
nano config.ini
See SETUP_GUIDE.md for detailed configuration instructions.
2. Install Dependencies
# Core dependencies
pip install -r requirements.txt
# For API server
pip install -r requirements-api.txt
3. Run API Server
python api_server.py
Server will be available at: http://localhost:8000
Interactive docs at: http://localhost:8000/docs
File Structure
modules/mapper/
├── .env.example           ← Copy to .env (add your API keys)
├── config.ini.example     ← Copy to config.ini (configure paths)
├── config.ini             ← Active configuration (DO NOT COMMIT)
├── .env                   ← Active environment (DO NOT COMMIT)
├── SETUP_GUIDE.md         ← Detailed setup instructions
├── API_SERVER.md          ← API server documentation
├── api_server.py          ← FastAPI server (run this!)
├── requirements.txt       ← Python dependencies
├── requirements-api.txt   ← API server dependencies
├── setup.py               ← Package setup
└── src/                   ← Core source code
    ├── orchestrator.py    ← Main orchestration logic
    ├── extractors/        ← PDF field extraction
    ├── mappers/           ← Field mapping (LLM)
    ├── embedders/         ← Metadata embedding
    ├── fillers/           ← PDF filling
    ├── chunkers/          ← Document chunking
    ├── groupers/          ← Field grouping
    ├── headers/           ← Header detection
    ├── validators/        ← Field validation
    └── core/              ← Configuration & logging
What This Module Does
- Extract - Extracts form fields from PDF files
- Map - Maps extracted fields to your data schema using an LLM
- Embed - Embeds mapping metadata into PDF for reuse
- Fill - Fills embedded PDFs with actual data
Operations
| Operation | Input | Output | Use Case |
|---|---|---|---|
| extract | PDF file | Field list JSON | Discover what fields exist |
| map | Extracted fields | Mapping JSON | Create field-to-schema mapping |
| embed | PDF + mapping | Embedded PDF | Prepare PDF for filling |
| fill | Embedded PDF + data | Filled PDF | Generate completed forms |
| make-embed | PDF file | Embedded PDF | One-step: extract+map+embed |
| run-all | PDF + data | Filled PDF | Complete pipeline |
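The two composite operations in the table can be read as chains of the four basic ones. The sketch below only models that relationship as data; `PIPELINES` and `expand` are hypothetical names, not part of the module, and treating run-all as make-embed followed by fill is an assumption based on the "Complete pipeline" description.

```python
# Illustrative only: models the operations table as data.
# PIPELINES and expand() are hypothetical names, not part of the module.

BASIC_STEPS = ("extract", "map", "embed", "fill")

# Composite operations expressed in terms of other operations.
# make-embed = extract + map + embed comes from the table; treating
# run-all as make-embed followed by fill is an assumption.
PIPELINES = {
    "make-embed": ["extract", "map", "embed"],
    "run-all": ["make-embed", "fill"],
}


def expand(operation: str) -> list[str]:
    """Return the basic steps that an operation runs, in order."""
    if operation in BASIC_STEPS:
        return [operation]
    if operation not in PIPELINES:
        raise ValueError(f"unknown operation: {operation}")
    steps: list[str] = []
    for part in PIPELINES[operation]:
        steps.extend(expand(part))
    return steps


for name in ("make-embed", "run-all"):
    print(name, "->", " + ".join(expand(name)))
```

Reading the table this way makes it easier to pick an entry point: if an embedded PDF already exists, only fill is needed.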
Configuration Overview
Required Configuration
In .env:
# Choose one
CLOUD_PROVIDER=local # For local development
# CLOUD_PROVIDER=aws # For AWS deployment
# CLOUD_PROVIDER=azure # For Azure deployment
# Add your LLM API key
OPENAI_API_KEY=sk-your-key-here
In config.ini:
[general]
source_type = local
[mapping]
llm_model = gpt-4o
use_second_mapper = false
[local]
cache_registry_path = /path/to/cache/hash_registry.json
output_base_path = /path/to/output
See SETUP_GUIDE.md for complete details.
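Before starting the server, it can help to confirm that config.ini parses and that the keys shown above are set. The following is a minimal sketch using the standard-library `configparser`; the `EXPECTED` table and `missing_keys` helper are hypothetical, and whether the module treats every one of these keys as mandatory is an assumption.

```python
import configparser

# Sample text mirroring the config.ini fragment above; paths are placeholders.
SAMPLE = """
[general]
source_type = local

[mapping]
llm_model = gpt-4o
use_second_mapper = false

[local]
cache_registry_path = /path/to/cache/hash_registry.json
output_base_path = /path/to/output
"""

# Keys shown in the fragment above. Whether the module treats all of them
# as mandatory is an assumption.
EXPECTED = {
    "general": ["source_type"],
    "mapping": ["llm_model", "use_second_mapper"],
    "local": ["cache_registry_path", "output_base_path"],
}


def missing_keys(parser: configparser.ConfigParser) -> list[str]:
    """List 'section.key' entries that are absent or empty."""
    missing = []
    for section, keys in EXPECTED.items():
        for key in keys:
            if not parser.get(section, key, fallback="").strip():
                missing.append(f"{section}.{key}")
    return missing


parser = configparser.ConfigParser()
parser.read_string(SAMPLE)  # use parser.read("config.ini") for a real file

print(missing_keys(parser))  # empty list when every key is set
print(parser.getboolean("mapping", "use_second_mapper"))
```

Note that `getboolean` accepts values such as `true`/`false` and `yes`/`no`, so `use_second_mapper = false` is read as a boolean rather than a string.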
Running as API Server
# Start server
python api_server.py
# In another terminal, test it
curl http://localhost:8000/health
Available endpoints:
- GET / - API info
- GET /health - Health check
- POST /mapper/extract - Extract fields
- POST /mapper/map - Map fields
- POST /mapper/embed - Embed metadata
- POST /mapper/fill - Fill PDF
- POST /mapper/make-embed - Extract+Map+Embed
- POST /mapper/fill-pdf - Fill embedded PDF
- POST /mapper/check-embed-file - Check if PDF has embeddings
- POST /mapper/run-all - Complete pipeline
See API_SERVER.md for API documentation.
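The curl health check above can also be done from Python with only the standard library. This is a sketch under assumptions: `endpoint_url` and `check_health` are hypothetical helpers, and because the response format of /health is not described here, the body is returned as parsed JSON when possible and as raw text otherwise. Request bodies for the POST endpoints are documented in API_SERVER.md and are not guessed at here.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # default address from this README


def endpoint_url(path: str, base_url: str = BASE_URL) -> str:
    """Join the server address and an endpoint path with one slash."""
    return base_url.rstrip("/") + "/" + path.lstrip("/")


def check_health(base_url: str = BASE_URL, timeout: float = 5.0):
    """GET /health and return (status_code, body).

    The body is parsed as JSON when possible; the response format is
    not documented here, so raw text is returned otherwise.
    """
    url = endpoint_url("/health", base_url)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        text = resp.read().decode("utf-8")
        try:
            return resp.status, json.loads(text)
        except json.JSONDecodeError:
            return resp.status, text


# Usage, with the server running (python api_server.py):
#   status, body = check_health()
```

If the server is not running, `urlopen` raises `urllib.error.URLError`, which is a quick way to tell a connection problem from an application error.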
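Using as Python Module
from src.orchestrator import run_extraction, run_mapping, run_embedding, run_filling
# Placeholder inputs for illustration only; substitute your own values
pdf_path = "input.pdf"
user_id, pdf_doc_id = 1, 100
embedded_pdf_path = "input_embedded.pdf"  # placeholder; use the path produced by the embed step
input_data = {}  # data to fill into the form; its expected shape is not specified here
# Extract fields
extracted = run_extraction(pdf_path, user_id, pdf_doc_id)
# Map fields
mapped = run_mapping(user_id, pdf_doc_id)
# Embed metadata
embedded = run_embedding(pdf_path, user_id, pdf_doc_id)
# Fill PDF
filled = run_filling(embedded_pdf_path, user_id, pdf_doc_id, input_data)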
Testing
# Run all tests
pytest
# Run specific test
pytest tests/test_extract.py
# With coverage
pytest --cov=src
Deployment Options
Local Development
python api_server.py
Docker
docker build -t pdf-mapper .
docker run -p 8000:8000 --env-file .env pdf-mapper
AWS Lambda
See deployment/aws/ for Lambda deployment scripts.
Azure Functions
See deployment/azure/ for Azure deployment scripts.
GCP Cloud Functions
See deployment/gcp/ for GCP deployment scripts.
Integration with SDK
Once the API server is running, install the SDK:
cd ../../sdks/python
pip install -e .
# Use CLI
pdf-autofiller --api-url http://localhost:8000 extract input.pdf
# Or use it from Python
from pdf_autofiller import PDFMapperClient
client = PDFMapperClient("http://localhost:8000")
# The two numeric arguments are presumably user_id and pdf_doc_id
result = client.extract("input.pdf", 1, 100)
Documentation
- SETUP_GUIDE.md - Configuration setup
- API_SERVER.md - API documentation
- INSTALLATION_GUIDE.md - Installation details
- ../../docs/ - Complete project documentation
Troubleshooting
Module 'boto3' not found
# In config.ini, set:
[general]
rag_api_url =
# Leave empty to disable RAG
API key errors
# Make sure .env has your key:
OPENAI_API_KEY=sk-your-actual-key-here
Import errors
# Install dependencies:
pip install -r requirements.txt
Server won't start
# Install API dependencies:
pip install -r requirements-api.txt
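Several of the problems above come down to a missing package. A quick way to see which ones are absent, without triggering an import error, is to ask the import system whether it can find them. This is a small sketch; `is_installed` is a hypothetical helper, and the module list is only an example (boto3 is named in the first error above, fastapi is the framework behind api_server.py).

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(module_name) is not None


# Example list only; extend it with whatever your setup needs.
for name in ("boto3", "fastapi"):
    status = "ok" if is_installed(name) else "MISSING"
    print(f"{name}: {status}")
```

Run it with the same interpreter that starts api_server.py, since a package installed in a different virtual environment will still show as missing.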
Environment Variables
Key environment variables (set in .env):
| Variable | Required | Description |
|---|---|---|
| CLOUD_PROVIDER | Yes | local, aws, azure, or gcp |
| OPENAI_API_KEY | Yes | OpenAI API key |
| ANTHROPIC_API_KEY | Conditional | Claude/Anthropic key (if using Claude) |
| AWS_ACCESS_KEY_ID | Conditional | AWS credentials (if using AWS) |
| AWS_SECRET_ACCESS_KEY | Conditional | AWS credentials (if using AWS) |
| AZURE_STORAGE_CONNECTION_STRING | Conditional | Azure credentials (if using Azure) |
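The table can be turned into a small pre-flight check. The sketch below is an assumption-laden illustration, not the module's own validation: `missing_env_vars` is a hypothetical helper, it only covers the provider-specific rows listed in the table, and it does not model ANTHROPIC_API_KEY because the table does not say how the choice of Claude is configured.

```python
import os

# Variables the table marks as always required.
ALWAYS_REQUIRED = ["CLOUD_PROVIDER", "OPENAI_API_KEY"]

# Variables required only for a given provider, per the table.
PROVIDER_REQUIRED = {
    "aws": ["AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"],
    "azure": ["AZURE_STORAGE_CONNECTION_STRING"],
}


def missing_env_vars(env=None) -> list[str]:
    """Return names of variables that are unset or empty."""
    env = os.environ if env is None else env
    needed = list(ALWAYS_REQUIRED)
    provider = env.get("CLOUD_PROVIDER", "").strip().lower()
    needed += PROVIDER_REQUIRED.get(provider, [])
    return [name for name in needed if not env.get(name, "").strip()]


# Example with an explicit mapping instead of the real environment.
example = {"CLOUD_PROVIDER": "aws", "OPENAI_API_KEY": "sk-your-key-here"}
print(missing_env_vars(example))
```

Calling `missing_env_vars()` with no argument checks the real process environment, which is useful after the .env file has been loaded.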
License
See LICENSE file in project root.
Contributing
See CONTRIBUTING.md for guidelines.
Quick Command Reference
# Setup
cp .env.example .env
cp config.ini.example config.ini
pip install -r requirements.txt -r requirements-api.txt
# Run server
python api_server.py
# Test
curl http://localhost:8000/health
pytest
# Install SDK (for client usage)
cd ../../sdks/python && pip install -e .
File details
Details for the file pdf_autofillr_mapper-1.0.3.tar.gz.
File metadata
- Download URL: pdf_autofillr_mapper-1.0.3.tar.gz
- Upload date:
- Size: 15.5 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.5
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | b149f403b98bb0f3e6d3d0fe58590794f96042c6d1e347e1dc682111a5bf28a8 |
| MD5 | bc4f1592749396331a2d8b8d76643372 |
| BLAKE2b-256 | e4e2e7cf155d4a66a963945483874b933b4dce0b32529eb32af00b392cfb4cbf |
File details
Details for the file pdf_autofillr_mapper-1.0.3-py3-none-any.whl.
File metadata
- Download URL: pdf_autofillr_mapper-1.0.3-py3-none-any.whl
- Upload date:
- Size: 15.5 MB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.5
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 2e12aea654341b86a5dfbca08b82618098ef59afa5b46f2297fae4e503fbb0c9 |
| MD5 | 7256b7ed10d78ed181e6f4ac391b7dc1 |
| BLAKE2b-256 | 71d82caa71178fc3e9b1fa64a8b9059af7edc2ba9113a36614b45fcbf4c5f500 |