PDF field extraction, semantic mapping, embedding and filling engine

PDF Mapper Module

The core PDF field extraction, mapping, embedding, and filling engine.

🚀 Quick Start

1. Configure the Module

IMPORTANT: You must configure the module before running it!

# Copy configuration templates
cp .env.example .env
cp config.ini.example config.ini

# Edit .env - Add your API keys
nano .env

# Edit config.ini - Set your storage paths
nano config.ini

See SETUP_GUIDE.md for detailed configuration instructions.
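
The copy step above can also be scripted so that an existing .env or config.ini is never overwritten. This is a minimal sketch, not part of the module: the function name copy_templates is hypothetical, and only the four file names come from the steps above.

```python
import shutil
from pathlib import Path

# Template -> active file pairs, taken from the Quick Start steps.
TEMPLATES = {".env.example": ".env", "config.ini.example": "config.ini"}


def copy_templates(module_dir: str) -> list[str]:
    """Copy each template to its active name unless the target already exists.

    Returns the names of the files that were created.
    (Hypothetical helper; the module itself does not ship this function.)
    """
    base = Path(module_dir)
    created = []
    for template, target in TEMPLATES.items():
        src, dst = base / template, base / target
        # Skip missing templates and never clobber an edited config file.
        if src.exists() and not dst.exists():
            shutil.copyfile(src, dst)
            created.append(target)
    return created
```

Calling copy_templates(".") from modules/mapper/ a second time returns an empty list, so local edits to .env and config.ini are preserved.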

2. Install Dependencies

# Core dependencies
pip install -r requirements.txt

# For API server
pip install -r requirements-api.txt

3. Run API Server

python api_server.py

The server will be available at http://localhost:8000

Interactive docs are available at http://localhost:8000/docs

📁 File Structure

modules/mapper/
├── .env.example            ← Copy to .env (add your API keys)
├── config.ini.example      ← Copy to config.ini (configure paths)
├── config.ini              ← Active configuration (DO NOT COMMIT)
├── .env                    ← Active environment (DO NOT COMMIT)
├── SETUP_GUIDE.md          ← Detailed setup instructions
├── API_SERVER.md           ← API server documentation
├── api_server.py           ← FastAPI server (run this!)
├── requirements.txt        ← Python dependencies
├── requirements-api.txt    ← API server dependencies
├── setup.py                ← Package setup
└── src/                    ← Core source code
    ├── orchestrator.py     ← Main orchestration logic
    ├── extractors/         ← PDF field extraction
    ├── mappers/            ← Field mapping (LLM)
    ├── embedders/          ← Metadata embedding
    ├── fillers/            ← PDF filling
    ├── chunkers/           ← Document chunking
    ├── groupers/           ← Field grouping
    ├── headers/            ← Header detection
    ├── validators/         ← Field validation
    └── core/               ← Configuration & logging

🎯 What This Module Does

Extract - Extracts form fields from PDF files
Map - Maps the extracted fields to your data schema using an LLM
Embed - Embeds the mapping metadata into the PDF for reuse
Fill - Fills embedded PDFs with actual data

Operations

Operation	Input	Output	Use Case
extract	PDF file	Field list JSON	Discover what fields exist
map	Extracted fields	Mapping JSON	Create field-to-schema mapping
embed	PDF + mapping	Embedded PDF	Prepare PDF for filling
fill	Embedded PDF + data	Filled PDF	Generate completed forms
make-embed	PDF file	Embedded PDF	One-step: extract+map+embed
run-all	PDF + data	Filled PDF	Complete pipeline
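
The table implies that make-embed and run-all are bundles of the four atomic operations. The sketch below only illustrates that relationship; the names PIPELINE, COMPOSITES, and expand are hypothetical, and the assumption that run-all covers all four steps is read from its "Complete pipeline" description rather than confirmed by the source code.

```python
# Atomic steps in pipeline order, as listed in the Operations table.
PIPELINE = ["extract", "map", "embed", "fill"]

# Composite operations expressed as the atomic steps they bundle.
# ASSUMPTION: "run-all" runs all four steps ("Complete pipeline").
COMPOSITES = {
    "make-embed": ["extract", "map", "embed"],
    "run-all": list(PIPELINE),
}


def expand(operation: str) -> list[str]:
    """Return the atomic steps that an operation name stands for."""
    if operation in COMPOSITES:
        return list(COMPOSITES[operation])
    if operation in PIPELINE:
        return [operation]
    raise ValueError(f"unknown operation: {operation!r}")
```

For example, expand("make-embed") yields the extract, map, and embed steps in order, which is why its output is an embedded PDF that still needs a separate fill.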

🔧 Configuration Overview

Required Configuration

In .env:

# Choose one
CLOUD_PROVIDER=local          # For local development
# CLOUD_PROVIDER=aws          # For AWS deployment
# CLOUD_PROVIDER=azure        # For Azure deployment
# CLOUD_PROVIDER=gcp          # For GCP deployment

# Add your LLM API key
OPENAI_API_KEY=sk-your-key-here

In config.ini:

[general]
source_type = local

[mapping]
llm_model = gpt-4o
use_second_mapper = false

[local]
cache_registry_path = /path/to/cache/hash_registry.json
output_base_path = /path/to/output

See SETUP_GUIDE.md for complete details.
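
A quick way to sanity-check config.ini before starting the server is to parse it with the standard-library configparser. This is a sketch under assumptions: the REQUIRED mapping simply mirrors the keys shown above, and the helper names load_config and missing_keys are hypothetical. SETUP_GUIDE.md remains the authority on which keys are actually mandatory.

```python
import configparser

# The sample from the Required Configuration section, used here as test input.
SAMPLE = """
[general]
source_type = local

[mapping]
llm_model = gpt-4o
use_second_mapper = false

[local]
cache_registry_path = /path/to/cache/hash_registry.json
output_base_path = /path/to/output
"""

# ASSUMPTION: every key shown in the sample is treated as required.
REQUIRED = {
    "general": ["source_type"],
    "mapping": ["llm_model", "use_second_mapper"],
    "local": ["cache_registry_path", "output_base_path"],
}


def load_config(text: str) -> configparser.ConfigParser:
    """Parse INI text into a ConfigParser instance."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return parser


def missing_keys(parser: configparser.ConfigParser) -> list[str]:
    """List 'section.key' names that are absent from the parsed config."""
    return [
        f"{section}.{key}"
        for section, keys in REQUIRED.items()
        for key in keys
        if not parser.has_option(section, key)
    ]
```

To check a real file, replace parser.read_string(text) with parser.read("config.ini"). Note that getboolean("mapping", "use_second_mapper") converts the string false to the Python value False.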

🌐 Running as API Server

# Start server
python api_server.py

# In another terminal, test it
curl http://localhost:8000/health

Available endpoints:

GET / - API info
GET /health - Health check
POST /mapper/extract - Extract fields
POST /mapper/map - Map fields
POST /mapper/embed - Embed metadata
POST /mapper/fill - Fill PDF
POST /mapper/make-embed - Extract+Map+Embed
POST /mapper/fill-pdf - Fill embedded PDF
POST /mapper/check-embed-file - Check if PDF has embeddings
POST /mapper/run-all - Complete pipeline

See API_SERVER.md for API documentation.
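
The curl health check can be reproduced from Python with the standard library alone. This is a minimal sketch: the paths and methods are copied from the endpoint list above, while the dictionary keys and the function names endpoint_url and is_healthy are hypothetical. Request bodies for the POST endpoints are not shown because this README does not document them; see API_SERVER.md.

```python
import urllib.request

# Label -> (HTTP method, path). Paths come from the endpoint list;
# the labels on the left are illustrative names, not part of the API.
ENDPOINTS = {
    "info": ("GET", "/"),
    "health": ("GET", "/health"),
    "extract": ("POST", "/mapper/extract"),
    "map": ("POST", "/mapper/map"),
    "embed": ("POST", "/mapper/embed"),
    "fill": ("POST", "/mapper/fill"),
    "make-embed": ("POST", "/mapper/make-embed"),
    "fill-pdf": ("POST", "/mapper/fill-pdf"),
    "check-embed-file": ("POST", "/mapper/check-embed-file"),
    "run-all": ("POST", "/mapper/run-all"),
}


def endpoint_url(base_url: str, name: str) -> tuple[str, str]:
    """Return (method, full URL) for a named endpoint."""
    method, path = ENDPOINTS[name]
    return method, base_url.rstrip("/") + path


def is_healthy(base_url: str = "http://localhost:8000", timeout: float = 5.0) -> bool:
    """Return True if GET /health answers with HTTP 200, False otherwise."""
    _, url = endpoint_url(base_url, "health")
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # Covers connection refused, timeouts, and HTTP error responses.
        return False
```

is_healthy() returns False rather than raising when the server is not running, which makes it convenient in startup scripts that wait for the API to come up.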

📦 Using as Python Module

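from src.orchestrator import run_extraction, run_mapping, run_embedding, run_filling

# pdf_path, embedded_pdf_path, user_id, pdf_doc_id, and input_data are
# placeholders; define them with your own values before running this.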

# Extract fields
extracted = run_extraction(pdf_path, user_id, pdf_doc_id)

# Map fields
mapped = run_mapping(user_id, pdf_doc_id)

# Embed metadata
embedded = run_embedding(pdf_path, user_id, pdf_doc_id)

# Fill PDF
filled = run_filling(embedded_pdf_path, user_id, pdf_doc_id, input_data)

🧪 Testing

# Run all tests
pytest

# Run specific test
pytest tests/test_extract.py

# With coverage
pytest --cov=src

🐳 Deployment Options

Local Development

python api_server.py

Docker

docker build -t pdf-mapper .
docker run -p 8000:8000 --env-file .env pdf-mapper

AWS Lambda

See deployment/aws/ for Lambda deployment scripts.

Azure Functions

See deployment/azure/ for Azure deployment scripts.

GCP Cloud Functions

See deployment/gcp/ for GCP deployment scripts.

🔗 Integration with SDK

Once the API server is running, install the SDK:

cd ../../sdks/python
pip install -e .

# Use the CLI
pdf-autofiller --api-url http://localhost:8000 extract input.pdf

# Or use the Python client
from pdf_autofiller import PDFMapperClient
client = PDFMapperClient("http://localhost:8000")
result = client.extract("input.pdf", 1, 100)

📚 Documentation

SETUP_GUIDE.md - Configuration setup
API_SERVER.md - API documentation
INSTALLATION_GUIDE.md - Installation details
../../docs/ - Complete project documentation

🔍 Troubleshooting

Module 'boto3' not found

# In config.ini, set:
[general]
rag_api_url = 
# Leave empty to disable RAG

API key errors

# Make sure .env has your key:
OPENAI_API_KEY=sk-your-actual-key-here

Import errors

# Install dependencies:
pip install -r requirements.txt

Server won't start

# Install API dependencies:
pip install -r requirements-api.txt
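
Before working through the fixes above, it can help to find out which packages are actually missing. The sketch below uses importlib from the standard library; the function name missing_modules is hypothetical, and the module names in the usage note (boto3, fastapi) are taken from this README rather than from the requirements files, which may list more.

```python
import importlib.util


def missing_modules(names: list[str]) -> list[str]:
    """Return the top-level module names that cannot be found.

    Only top-level names are supported; dotted names would trigger
    an import of the parent package.
    """
    return [name for name in names if importlib.util.find_spec(name) is None]
```

For example, missing_modules(["boto3", "fastapi"]) returns the subset that still needs to be installed in the active environment.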

⚙️ Environment Variables

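Key environment variables (set in .env). ✅ marks a variable that is always required; 🔷 marks one that is needed only in the case noted in its description: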

Variable	Required	Description
`CLOUD_PROVIDER`	✅	`local`, `aws`, `azure`, or `gcp`
`OPENAI_API_KEY`	✅	OpenAI API key
`ANTHROPIC_API_KEY`	🔷	Claude/Anthropic key (if using Claude)
`AWS_ACCESS_KEY_ID`	🔷	AWS credentials (if using AWS)
`AWS_SECRET_ACCESS_KEY`	🔷	AWS credentials (if using AWS)
`AZURE_STORAGE_CONNECTION_STRING`	🔷	Azure credentials (if using Azure)
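
The table can be turned into a small startup check. This is a sketch under assumptions: the names ALWAYS_REQUIRED, PROVIDER_REQUIRED, and missing_variables are hypothetical, the gcp entry is empty only because the table lists no GCP-specific variables, and ANTHROPIC_API_KEY is left out because it depends on the chosen model rather than on the provider.

```python
import os

# Variables marked as always required in the table.
ALWAYS_REQUIRED = ["CLOUD_PROVIDER", "OPENAI_API_KEY"]

# Provider-specific variables from the table.
# ASSUMPTION: "local" and "gcp" need nothing extra, since none is listed.
PROVIDER_REQUIRED = {
    "local": [],
    "aws": ["AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"],
    "azure": ["AZURE_STORAGE_CONNECTION_STRING"],
    "gcp": [],
}


def missing_variables(env=None) -> list[str]:
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    missing = [name for name in ALWAYS_REQUIRED if not env.get(name)]
    provider = env.get("CLOUD_PROVIDER", "")
    if provider and provider not in PROVIDER_REQUIRED:
        raise ValueError(f"unsupported CLOUD_PROVIDER: {provider!r}")
    missing += [n for n in PROVIDER_REQUIRED.get(provider, []) if not env.get(n)]
    return missing
```

Called with no argument, missing_variables() inspects the current process environment; passing a dict makes it easy to test without touching real credentials.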

📝 License

See LICENSE file in project root.

🤝 Contributing

See CONTRIBUTING.md for guidelines.

Quick Command Reference

# Setup
cp .env.example .env
cp config.ini.example config.ini
pip install -r requirements.txt -r requirements-api.txt

# Run server
python api_server.py

# Test
curl http://localhost:8000/health
pytest

# Install SDK (for client usage)
cd ../../sdks/python && pip install -e .

Release history

Version	Release date
1.0.10	May 16, 2026
1.0.9	May 16, 2026
1.0.8	Apr 29, 2026
1.0.7	Apr 28, 2026
1.0.6 (this version)	Apr 3, 2026
1.0.5	Apr 3, 2026
1.0.4	Apr 3, 2026
1.0.3	Apr 3, 2026
1.0.2	Apr 3, 2026
1.0.1	Apr 2, 2026
1.0.0	Apr 2, 2026

Download files

Source Distribution

pdf_autofillr_mapper-1.0.6.tar.gz (15.5 MB)

Uploaded Apr 3, 2026 Source

Built Distribution

pdf_autofillr_mapper-1.0.6-py3-none-any.whl (15.5 MB)

Uploaded Apr 3, 2026 Python 3

File details

Details for the file pdf_autofillr_mapper-1.0.6.tar.gz.

File metadata

Download URL: pdf_autofillr_mapper-1.0.6.tar.gz
Upload date: Apr 3, 2026
Size: 15.5 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.5

File hashes

Hashes for pdf_autofillr_mapper-1.0.6.tar.gz
Algorithm	Hash digest
SHA256	`12014c1d40005e67dd53c174dde314a90f790a6cb22781cde2d4f3f8aa08e056`
MD5	`2814e1c715c1f9e0e5393ee3307e4225`
BLAKE2b-256	`da24e7f5a1236539da4b4d55cfca37e77b22884ae4e27fea4483085568f4fd23`
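
A downloaded file can be compared against the SHA256 digest listed above with the standard-library hashlib module. This is a minimal sketch; the function names sha256_of and matches are hypothetical, and the file is read in chunks so that a 15.5 MB archive does not need to be held in memory at once.

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path: str, expected_hex: str) -> bool:
    """Compare a file's SHA256 digest with an expected hex string."""
    return sha256_of(path) == expected_hex.strip().lower()
```

To verify the source distribution, pass the path of the downloaded pdf_autofillr_mapper-1.0.6.tar.gz and the SHA256 value from the table to matches().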

File details

Details for the file pdf_autofillr_mapper-1.0.6-py3-none-any.whl.

File metadata

Download URL: pdf_autofillr_mapper-1.0.6-py3-none-any.whl
Upload date: Apr 3, 2026
Size: 15.5 MB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.5

File hashes

Hashes for pdf_autofillr_mapper-1.0.6-py3-none-any.whl
Algorithm	Hash digest
SHA256	`6d654bfa6870459110b4de1dea4614612f04de56298c644a634eb56f274bffe1`
MD5	`f819147c7fc6cc182797952743889089`
BLAKE2b-256	`6bceb9fe6a8cb5e21287a5bee079b700500d7609bdc050a867862b4edc605795`

pdf-autofillr-mapper 1.0.6
