PDF field extraction, semantic mapping, embedding and filling engine

PDF Mapper Module

The core PDF field extraction, mapping, embedding, and filling engine.

🚀 Quick Start

1. Configure the Module

IMPORTANT: You must configure the module before running it!

# Copy configuration templates
cp .env.example .env
cp config.ini.example config.ini

# Edit .env - Add your API keys
nano .env

# Edit config.ini - Set your storage paths
nano config.ini

See SETUP_GUIDE.md for detailed configuration instructions.

2. Install Dependencies

# Core dependencies
pip install -r requirements.txt

# For API server
pip install -r requirements-api.txt

3. Run API Server

python api_server.py

Server will be available at: http://localhost:8000

Interactive docs at: http://localhost:8000/docs


📁 File Structure

modules/mapper/
├── .env.example            ← Copy to .env (add your API keys)
├── config.ini.example      ← Copy to config.ini (configure paths)
├── config.ini              ← Active configuration (DO NOT COMMIT)
├── .env                    ← Active environment (DO NOT COMMIT)
├── SETUP_GUIDE.md          ← Detailed setup instructions
├── API_SERVER.md           ← API server documentation
├── api_server.py           ← FastAPI server (run this!)
├── requirements.txt        ← Python dependencies
├── requirements-api.txt    ← API server dependencies
├── setup.py                ← Package setup
└── src/                    ← Core source code
    ├── orchestrator.py     ← Main orchestration logic
    ├── extractors/         ← PDF field extraction
    ├── mappers/            ← Field mapping (LLM)
    ├── embedders/          ← Metadata embedding
    ├── fillers/            ← PDF filling
    ├── chunkers/           ← Document chunking
    ├── groupers/           ← Field grouping
    ├── headers/            ← Header detection
    ├── validators/         ← Field validation
    └── core/               ← Configuration & logging

🎯 What This Module Does

  1. Extract - Extracts form fields from PDF files
  2. Map - Maps the extracted fields to your data schema using an LLM
  3. Embed - Embeds the mapping metadata into the PDF for reuse
  4. Fill - Fills embedded PDFs with actual data

Operations

Operation    Input                 Output            Use Case
extract      PDF file              Field list JSON   Discover what fields exist
map          Extracted fields      Mapping JSON      Create field-to-schema mapping
embed        PDF + mapping         Embedded PDF      Prepare PDF for filling
fill         Embedded PDF + data   Filled PDF        Generate completed forms
make-embed   PDF file              Embedded PDF      One-step: extract+map+embed
run-all      PDF + data            Filled PDF        Complete pipeline
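
The composite operations in the table can be read as fixed sequences of the four basic steps. The following is a minimal sketch of that relationship, not the module's actual implementation: `PIPELINES` and `expand` are hypothetical names introduced here for illustration, and treating `run-all` as `make-embed` followed by `fill` is an assumption that the table implies but does not state outright.

```python
# Hypothetical illustration of how the operations in the table compose.
# PIPELINES and expand() are not part of the module's API.
BASIC_STEPS = ("extract", "map", "embed", "fill")

PIPELINES = {
    "make-embed": ("extract", "map", "embed"),  # stated in the table
    "run-all": ("make-embed", "fill"),          # assumption: make-embed, then fill
}


def expand(operation: str) -> list[str]:
    """Return the ordered basic steps that an operation consists of."""
    if operation in BASIC_STEPS:
        return [operation]
    if operation not in PIPELINES:
        raise ValueError(f"unknown operation: {operation}")
    steps: list[str] = []
    for part in PIPELINES[operation]:
        steps.extend(expand(part))
    return steps


print(expand("make-embed"))  # ['extract', 'map', 'embed']
print(expand("run-all"))     # ['extract', 'map', 'embed', 'fill']
```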

🔧 Configuration Overview

Required Configuration

In .env:

# Choose one
CLOUD_PROVIDER=local          # For local development
# CLOUD_PROVIDER=aws          # For AWS deployment
# CLOUD_PROVIDER=azure        # For Azure deployment
# CLOUD_PROVIDER=gcp          # For GCP deployment

# Add your LLM API key
OPENAI_API_KEY=sk-your-key-here

In config.ini:

[general]
source_type = local

[mapping]
llm_model = gpt-4o
use_second_mapper = false

[local]
cache_registry_path = /path/to/cache/hash_registry.json
output_base_path = /path/to/output

See SETUP_GUIDE.md for complete details.
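
Before starting the server, it can help to confirm that config.ini contains the settings shown above. The sketch below uses only the standard library; `EXPECTED_KEYS` and `missing_config_keys` are hypothetical names, and treating every key from the example as mandatory is an assumption (SETUP_GUIDE.md remains the authoritative reference).

```python
import configparser

# Sections and keys copied from the config.ini example above.
# Assumption: all of them are required for a local setup.
EXPECTED_KEYS = {
    "general": ["source_type"],
    "mapping": ["llm_model", "use_second_mapper"],
    "local": ["cache_registry_path", "output_base_path"],
}


def missing_config_keys(config_text: str) -> list[str]:
    """Return 'section.key' names that are absent or empty in the INI text."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    missing = []
    for section, keys in EXPECTED_KEYS.items():
        for key in keys:
            # fallback covers both a missing section and a missing key
            value = parser.get(section, key, fallback="")
            if not value.strip():
                missing.append(f"{section}.{key}")
    return missing


example = """
[general]
source_type = local

[mapping]
llm_model = gpt-4o
use_second_mapper = false

[local]
cache_registry_path = /path/to/cache/hash_registry.json
output_base_path =
"""

print(missing_config_keys(example))  # ['local.output_base_path']
```

To check a real file, read it first, for example `missing_config_keys(open("config.ini").read())`.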


🌐 Running as API Server

# Start server
python api_server.py

# In another terminal, test it
curl http://localhost:8000/health

Available endpoints:

  • GET / - API info
  • GET /health - Health check
  • POST /mapper/extract - Extract fields
  • POST /mapper/map - Map fields
  • POST /mapper/embed - Embed metadata
  • POST /mapper/fill - Fill PDF
  • POST /mapper/make-embed - Extract+Map+Embed
  • POST /mapper/fill-pdf - Fill embedded PDF
  • POST /mapper/check-embed-file - Check if PDF has embeddings
  • POST /mapper/run-all - Complete pipeline

See API_SERVER.md for API documentation.
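
The health check can also be scripted from Python. This is a minimal sketch using only the standard library; `endpoint_url` and `is_healthy` are hypothetical helper names, and it assumes only that `GET /health` answers with HTTP 200 when the server is up. Request bodies for the `POST /mapper/...` endpoints are not described here, so the sketch does not attempt to call them.

```python
import urllib.request

BASE_URL = "http://localhost:8000"  # default address from the Quick Start


def endpoint_url(path: str, base_url: str = BASE_URL) -> str:
    """Join the server address and an endpoint path such as '/health'."""
    return base_url.rstrip("/") + "/" + path.lstrip("/")


def is_healthy(base_url: str = BASE_URL, timeout: float = 5.0) -> bool:
    """Return True if GET /health answers with HTTP 200, False otherwise."""
    url = endpoint_url("/health", base_url)
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # URLError, refused connections, and timeouts
        return False


if __name__ == "__main__":
    print("server healthy:", is_healthy())
```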


📦 Using as Python Module

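from src.orchestrator import run_extraction, run_mapping, run_embedding, run_filling

# Placeholder values for illustration; replace them with your own
pdf_path = "input.pdf"
embedded_pdf_path = "embedded.pdf"      # PDF produced by the embed step
user_id = 1
pdf_doc_id = 100
input_data = {"full_name": "Jane Doe"}  # hypothetical field name and value

# Extract fields
extracted = run_extraction(pdf_path, user_id, pdf_doc_id)

# Map fields
mapped = run_mapping(user_id, pdf_doc_id)

# Embed metadata
embedded = run_embedding(pdf_path, user_id, pdf_doc_id)

# Fill PDF
filled = run_filling(embedded_pdf_path, user_id, pdf_doc_id, input_data)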
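
The four calls above are meant to run in this order, so they can be wrapped in a single helper. The sketch below is one possible way to do that, not part of the package: `run_pipeline` is a hypothetical name, and the four steps are passed in as parameters so the example runs without the package installed. The stand-in functions only record the call order; what the real functions return is not documented here.

```python
from typing import Any, Callable


def run_pipeline(
    pdf_path: str,
    embedded_pdf_path: str,
    user_id: int,
    pdf_doc_id: int,
    input_data: dict,
    *,
    extract: Callable[..., Any],
    map_fields: Callable[..., Any],
    embed: Callable[..., Any],
    fill: Callable[..., Any],
) -> dict:
    """Call the four steps in order, using the argument lists shown above."""
    results = {}
    results["extracted"] = extract(pdf_path, user_id, pdf_doc_id)
    results["mapped"] = map_fields(user_id, pdf_doc_id)
    results["embedded"] = embed(pdf_path, user_id, pdf_doc_id)
    results["filled"] = fill(embedded_pdf_path, user_id, pdf_doc_id, input_data)
    return results


# Stand-in functions so the sketch runs without the package installed.
calls = []


def _stub(name):
    def record(*args):
        calls.append(name)
        return f"{name}-result"
    return record


out = run_pipeline(
    "input.pdf", "embedded.pdf", 1, 100, {"full_name": "Jane Doe"},
    extract=_stub("extract"), map_fields=_stub("map"),
    embed=_stub("embed"), fill=_stub("fill"),
)
print(calls)  # ['extract', 'map', 'embed', 'fill']
```

With the package installed, the same helper could be called with `extract=run_extraction`, `map_fields=run_mapping`, `embed=run_embedding`, and `fill=run_filling`.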

🧪 Testing

# Run all tests
pytest

# Run specific test
pytest tests/test_extract.py

# With coverage
pytest --cov=src

🐳 Deployment Options

Local Development

python api_server.py

Docker

docker build -t pdf-mapper .
docker run -p 8000:8000 --env-file .env pdf-mapper

AWS Lambda

See deployment/aws/ for Lambda deployment scripts.

Azure Functions

See deployment/azure/ for Azure deployment scripts.

GCP Cloud Functions

See deployment/gcp/ for GCP deployment scripts.


🔗 Integration with SDK

Once the API server is running, install the SDK:

cd ../../sdks/python
pip install -e .

# Use CLI
pdf-autofiller --api-url http://localhost:8000 extract input.pdf

# Or Python
from pdf_autofiller import PDFMapperClient
client = PDFMapperClient("http://localhost:8000")
result = client.extract("input.pdf", 1, 100)

📚 Documentation

  • SETUP_GUIDE.md - Detailed setup instructions
  • API_SERVER.md - API server documentation

🔍 Troubleshooting

Module 'boto3' not found

# In config.ini, set:
[general]
rag_api_url = 
# Leave empty to disable RAG

API key errors

# Make sure .env has your key:
OPENAI_API_KEY=sk-your-actual-key-here

Import errors

# Install dependencies:
pip install -r requirements.txt

Server won't start

# Install API dependencies:
pip install -r requirements-api.txt
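
When an import error appears, a quick check of which packages are actually installed can narrow down the cause. This is a minimal standard-library sketch; `missing_modules` is a hypothetical helper, and the module list is an assumption based on names mentioned in this README (`boto3` from the entry above, `fastapi` because api_server.py is described as a FastAPI server).

```python
import importlib.util


def missing_modules(names: list[str]) -> list[str]:
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Adjust this list to match your requirements files.
print(missing_modules(["boto3", "fastapi"]))
```

An empty list means every listed module can be imported; any name that is printed still needs to be installed.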

⚙️ Environment Variables

Key environment variables (set in .env):

Variable                          Required   Description
CLOUD_PROVIDER                    ✅         local, aws, azure, or gcp
OPENAI_API_KEY                    ✅         OpenAI API key
ANTHROPIC_API_KEY                 🔷         Claude/Anthropic key (if using Claude)
AWS_ACCESS_KEY_ID                 🔷         AWS credentials (if using AWS)
AWS_SECRET_ACCESS_KEY             🔷         AWS credentials (if using AWS)
AZURE_STORAGE_CONNECTION_STRING   🔷         Azure credentials (if using Azure)
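
The table can be turned into a small start-up check. The sketch below is an illustration under stated assumptions, not the module's own validation: `missing_env_vars` is a hypothetical helper, the provider-specific variables are taken from the table, and no extra variables are assumed for `local` or `gcp` because the table lists none.

```python
import os

ALWAYS_REQUIRED = ["CLOUD_PROVIDER", "OPENAI_API_KEY"]

# Provider-specific variables, as listed in the table above.
PROVIDER_VARS = {
    "aws": ["AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"],
    "azure": ["AZURE_STORAGE_CONNECTION_STRING"],
}


def missing_env_vars(env) -> list[str]:
    """Return names of variables that are unset or empty for the chosen provider."""
    provider = env.get("CLOUD_PROVIDER", "").strip().lower()
    needed = ALWAYS_REQUIRED + PROVIDER_VARS.get(provider, [])
    return [name for name in needed if not env.get(name, "").strip()]


print(missing_env_vars({"CLOUD_PROVIDER": "aws", "OPENAI_API_KEY": "sk-your-key-here"}))
# ['AWS_ACCESS_KEY_ID', 'AWS_SECRET_ACCESS_KEY']
```

To check the current process, call `missing_env_vars(os.environ)` after the values from .env have been loaded into the environment.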

📝 License

See LICENSE file in project root.


🤝 Contributing

See CONTRIBUTING.md for guidelines.


Quick Command Reference

# Setup
cp .env.example .env
cp config.ini.example config.ini
pip install -r requirements.txt -r requirements-api.txt

# Run server
python api_server.py

# Test
curl http://localhost:8000/health
pytest

# Install SDK (for client usage)
cd ../../sdks/python && pip install -e .

Download files

Source Distribution

pdf_autofillr_mapper-1.0.6.tar.gz (15.5 MB)

  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.5

Hashes for pdf_autofillr_mapper-1.0.6.tar.gz

Algorithm     Hash digest
SHA256        12014c1d40005e67dd53c174dde314a90f790a6cb22781cde2d4f3f8aa08e056
MD5           2814e1c715c1f9e0e5393ee3307e4225
BLAKE2b-256   da24e7f5a1236539da4b4d55cfca37e77b22884ae4e27fea4483085568f4fd23

Built Distribution

pdf_autofillr_mapper-1.0.6-py3-none-any.whl (15.5 MB)

  • Tags: Python 3

Hashes for pdf_autofillr_mapper-1.0.6-py3-none-any.whl

Algorithm     Hash digest
SHA256        6d654bfa6870459110b4de1dea4614612f04de56298c644a634eb56f274bffe1
MD5           f819147c7fc6cc182797952743889089
BLAKE2b-256   6bceb9fe6a8cb5e21287a5bee079b700500d7609bdc050a867862b4edc605795
