PDF field extraction, semantic mapping, embedding and filling engine

PDF Mapper Module

The core PDF field extraction, mapping, embedding, and filling engine.

🚀 Quick Start

1. Configure the Module

IMPORTANT: Configure the module before running it.

# Copy configuration templates
cp .env.example .env
cp config.ini.example config.ini

# Edit .env - Add your API keys
nano .env

# Edit config.ini - Set your storage paths
nano config.ini

See SETUP_GUIDE.md for detailed configuration instructions.

2. Install Dependencies

# Core dependencies
pip install -r requirements.txt

# For API server
pip install -r requirements-api.txt

3. Run API Server

python api_server.py

Server will be available at: http://localhost:8000

Interactive docs at: http://localhost:8000/docs


📁 File Structure

modules/mapper/
├── .env.example            ← Copy to .env (add your API keys)
├── config.ini.example      ← Copy to config.ini (configure paths)
├── config.ini              ← Active configuration (DO NOT COMMIT)
├── .env                    ← Active environment (DO NOT COMMIT)
├── SETUP_GUIDE.md          ← Detailed setup instructions
├── API_SERVER.md           ← API server documentation
├── api_server.py           ← FastAPI server (run this!)
├── requirements.txt        ← Python dependencies
├── requirements-api.txt    ← API server dependencies
├── setup.py                ← Package setup
└── src/                    ← Core source code
    ├── orchestrator.py     ← Main orchestration logic
    ├── extractors/         ← PDF field extraction
    ├── mappers/            ← Field mapping (LLM)
    ├── embedders/          ← Metadata embedding
    ├── fillers/            ← PDF filling
    ├── chunkers/           ← Document chunking
    ├── groupers/           ← Field grouping
    ├── headers/            ← Header detection
    ├── validators/         ← Field validation
    └── core/               ← Configuration & logging

🎯 What This Module Does

  1. Extract - Extracts form fields from PDF files
  2. Map - Maps the extracted fields to your data schema using an LLM
  3. Embed - Embeds the mapping metadata into the PDF so it can be reused
  4. Fill - Fills embedded PDFs with actual data

Operations

| Operation | Input | Output | Use Case |
|---|---|---|---|
| extract | PDF file | Field list JSON | Discover what fields exist |
| map | Extracted fields | Mapping JSON | Create field-to-schema mapping |
| embed | PDF + mapping | Embedded PDF | Prepare PDF for filling |
| fill | Embedded PDF + data | Filled PDF | Generate completed forms |
| make-embed | PDF file | Embedded PDF | One-step: extract+map+embed |
| run-all | PDF + data | Filled PDF | Complete pipeline |
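
The relationship between the single-step and combined operations can be sketched in a few lines of Python. This is only an illustration: the names `PRIMITIVE_STEPS`, `COMPOSITE_OPERATIONS`, and `expand_operation` are hypothetical, and the assumption that run-all is make-embed followed by fill is inferred from the table rather than taken from the module's source.

```python
# Hypothetical description of the operations table. The step list for
# run-all is an assumption (make-embed followed by fill).
PRIMITIVE_STEPS = ("extract", "map", "embed", "fill")

COMPOSITE_OPERATIONS = {
    "make-embed": ("extract", "map", "embed"),
    "run-all": ("extract", "map", "embed", "fill"),  # assumed
}


def expand_operation(name: str) -> tuple[str, ...]:
    """Return the primitive steps an operation runs, in order."""
    if name in PRIMITIVE_STEPS:
        return (name,)
    try:
        return COMPOSITE_OPERATIONS[name]
    except KeyError:
        raise ValueError(f"unknown operation: {name!r}") from None


print(expand_operation("make-embed"))
```

Running make-embed once and then calling fill repeatedly is the pattern the table suggests for reusing a prepared PDF across many data sets.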

🔧 Configuration Overview

Required Configuration

In .env:

# Choose one
CLOUD_PROVIDER=local          # For local development
# CLOUD_PROVIDER=aws          # For AWS deployment
# CLOUD_PROVIDER=azure        # For Azure deployment

# Add your LLM API key
OPENAI_API_KEY=sk-your-key-here

In config.ini:

[general]
source_type = local

[mapping]
llm_model = gpt-4o
use_second_mapper = false

[local]
cache_registry_path = /path/to/cache/hash_registry.json
output_base_path = /path/to/output

See SETUP_GUIDE.md for complete details.
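
Before starting the server it can help to confirm that config.ini contains the keys shown above. The following is a minimal sketch using the standard-library `configparser`; treating every listed key as required is an assumption made for the example, and `EXPECTED_KEYS` and `missing_config_keys` are hypothetical names, not part of the module.

```python
import configparser

# Sections and keys from the config.ini example above. Treating all of
# them as required is an assumption of this sketch.
EXPECTED_KEYS = {
    "general": ["source_type"],
    "mapping": ["llm_model", "use_second_mapper"],
    "local": ["cache_registry_path", "output_base_path"],
}


def missing_config_keys(config: configparser.ConfigParser) -> list[str]:
    """Return 'section.key' names that are absent or empty."""
    missing = []
    for section, keys in EXPECTED_KEYS.items():
        for key in keys:
            # fallback covers both a missing section and a missing key
            value = config.get(section, key, fallback="")
            if not value.strip():
                missing.append(f"{section}.{key}")
    return missing


# Demonstration on an in-memory config with one empty value.
demo = configparser.ConfigParser()
demo.read_string(
    "[general]\nsource_type = local\n"
    "[mapping]\nllm_model = gpt-4o\nuse_second_mapper = false\n"
    "[local]\ncache_registry_path = /path/to/cache/hash_registry.json\n"
    "output_base_path =\n"
)
print(missing_config_keys(demo))
```

To check a real file, create a `ConfigParser`, call `config.read("config.ini")`, and pass it to the same function.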


🌐 Running as API Server

# Start server
python api_server.py

# In another terminal, test it
curl http://localhost:8000/health

Available endpoints:

  • GET / - API info
  • GET /health - Health check
  • POST /mapper/extract - Extract fields
  • POST /mapper/map - Map fields
  • POST /mapper/embed - Embed metadata
  • POST /mapper/fill - Fill PDF
  • POST /mapper/make-embed - Extract+Map+Embed
  • POST /mapper/fill-pdf - Fill embedded PDF
  • POST /mapper/check-embed-file - Check if PDF has embeddings
  • POST /mapper/run-all - Complete pipeline

See API_SERVER.md for API documentation.
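
The curl health check above can also be done from Python. This is a minimal sketch using only the standard library; `endpoint_url` and `is_healthy` are hypothetical helper names, and because the response body of /health is not described here, only the HTTP status code is inspected. Request payloads for the POST endpoints are not shown here, so they are left out.

```python
import urllib.request


def endpoint_url(base_url: str, path: str) -> str:
    """Join the server base URL and an endpoint path with a single slash."""
    return base_url.rstrip("/") + "/" + path.lstrip("/")


def is_healthy(base_url: str = "http://localhost:8000",
               opener=urllib.request.urlopen) -> bool:
    """Return True if GET /health answers with HTTP 200.

    `opener` is injectable so the function can be exercised without a
    running server.
    """
    try:
        with opener(endpoint_url(base_url, "/health"), timeout=5) as response:
            return response.status == 200
    except OSError:  # URLError and HTTPError are OSError subclasses
        return False
```

Calling `is_healthy()` with no arguments targets the default local address from the Quick Start.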


📦 Using as Python Module

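from src.orchestrator import run_extraction, run_mapping, run_embedding, run_filling

# Placeholder inputs; replace these with your own values
pdf_path = "input.pdf"
embedded_pdf_path = "embedded.pdf"  # placeholder path to the embedded PDF
user_id = 1
pdf_doc_id = 100
input_data = {}  # placeholder for the data to fill into the form

# Extract fields
extracted = run_extraction(pdf_path, user_id, pdf_doc_id)

# Map fields
mapped = run_mapping(user_id, pdf_doc_id)

# Embed metadata
embedded = run_embedding(pdf_path, user_id, pdf_doc_id)

# Fill PDF
filled = run_filling(embedded_pdf_path, user_id, pdf_doc_id, input_data)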
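
If the four calls are always made together, they can be wrapped in one helper. The sketch below is an assumption-laden illustration: `run_pipeline` is a hypothetical name, the integer type of the IDs is a guess, and the return values of the orchestrator functions are not documented here, so they are stored untouched. The steps are passed in as callables so the example runs with stand-ins when the module is not installed.

```python
from typing import Any, Callable, Mapping


def run_pipeline(
    steps: Mapping[str, Callable[..., Any]],
    pdf_path: str,
    embedded_pdf_path: str,
    user_id: int,
    pdf_doc_id: int,
    input_data: Any,
) -> dict[str, Any]:
    """Call the four steps in order, using the argument order shown above."""
    results: dict[str, Any] = {}
    results["extract"] = steps["extract"](pdf_path, user_id, pdf_doc_id)
    results["map"] = steps["map"](user_id, pdf_doc_id)
    results["embed"] = steps["embed"](pdf_path, user_id, pdf_doc_id)
    results["fill"] = steps["fill"](embedded_pdf_path, user_id, pdf_doc_id, input_data)
    return results


# Stand-in callables that record the call order and echo their arguments.
calls = []


def make_stub(name):
    def stub(*args):
        calls.append(name)
        return args
    return stub


stubs = {name: make_stub(name) for name in ("extract", "map", "embed", "fill")}
results = run_pipeline(stubs, "input.pdf", "embedded.pdf", 1, 100, {"field": "value"})
print(calls)
```

With the real module, the mapping would be `{"extract": run_extraction, "map": run_mapping, "embed": run_embedding, "fill": run_filling}`.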

🧪 Testing

# Run all tests
pytest

# Run specific test
pytest tests/test_extract.py

# With coverage
pytest --cov=src

๐Ÿณ Deployment Options

Local Development

python api_server.py

Docker

docker build -t pdf-mapper .
docker run -p 8000:8000 --env-file .env pdf-mapper

AWS Lambda

See deployment/aws/ for Lambda deployment scripts.

Azure Functions

See deployment/azure/ for Azure deployment scripts.

GCP Cloud Functions

See deployment/gcp/ for GCP deployment scripts.


🔗 Integration with SDK

Once the API server is running, install the SDK:

cd ../../sdks/python
pip install -e .

# Use CLI
pdf-autofiller --api-url http://localhost:8000 extract input.pdf

# Or Python
from pdf_autofiller import PDFMapperClient
client = PDFMapperClient("http://localhost:8000")
result = client.extract("input.pdf", 1, 100)

📚 Documentation


🔍 Troubleshooting

Module 'boto3' not found

# In config.ini, set:
[general]
rag_api_url = 
# Leave empty to disable RAG

API key errors

# Make sure .env has your key:
OPENAI_API_KEY=sk-your-actual-key-here

Import errors

# Install dependencies:
pip install -r requirements.txt

Server won't start

# Install API dependencies:
pip install -r requirements-api.txt
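
A quick way to tell which of the import problems above applies is to ask Python which modules it can find. This is a minimal sketch with a hypothetical helper, `missing_modules`. The module list is an assumption: boto3 is named in the troubleshooting notes and FastAPI in the file listing, but the full contents of the requirements files are not shown here.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Assumed module names; extend the list to match your requirements files.
print(missing_modules(["boto3", "fastapi"]))
```

An empty list means both modules are importable; any name that is printed still needs to be installed.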

⚙️ Environment Variables

Key environment variables (set in .env):

| Variable | Required | Description |
|---|---|---|
| CLOUD_PROVIDER | ✅ | local, aws, azure, or gcp |
| OPENAI_API_KEY | ✅ | OpenAI API key |
| ANTHROPIC_API_KEY | 🔷 | Claude/Anthropic key (if using Claude) |
| AWS_ACCESS_KEY_ID | 🔷 | AWS credentials (if using AWS) |
| AWS_SECRET_ACCESS_KEY | 🔷 | AWS credentials (if using AWS) |
| AZURE_STORAGE_CONNECTION_STRING | 🔷 | Azure credentials (if using Azure) |
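
The rules in this table can be turned into a small pre-flight check. The sketch below is illustrative only: `env_problems` is a hypothetical helper, and it checks a plain mapping, so loading .env into the process environment is outside its scope. ANTHROPIC_API_KEY is left out because the sketch cannot tell whether Claude is in use.

```python
import os
from typing import Mapping

VALID_PROVIDERS = {"local", "aws", "azure", "gcp"}

# Variables the table marks as needed only for a given provider.
PROVIDER_VARS = {
    "aws": ["AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"],
    "azure": ["AZURE_STORAGE_CONNECTION_STRING"],
}


def env_problems(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return human-readable problems found in the given environment."""
    problems = []
    provider = env.get("CLOUD_PROVIDER", "")
    if provider not in VALID_PROVIDERS:
        problems.append(f"CLOUD_PROVIDER must be one of {sorted(VALID_PROVIDERS)}")
    required = ["OPENAI_API_KEY"] + PROVIDER_VARS.get(provider, [])
    problems += [f"{name} is not set" for name in required if not env.get(name)]
    return problems


print(env_problems({"CLOUD_PROVIDER": "aws", "OPENAI_API_KEY": "sk-your-key-here"}))
```

Calling `env_problems()` without arguments checks the current process environment.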

📝 License

See LICENSE file in project root.


🤝 Contributing

See CONTRIBUTING.md for guidelines.


Quick Command Reference

# Setup
cp .env.example .env
cp config.ini.example config.ini
pip install -r requirements.txt -r requirements-api.txt

# Run server
python api_server.py

# Test
curl http://localhost:8000/health
pytest

# Install SDK (for client usage)
cd ../../sdks/python && pip install -e .

Download files

Source Distribution

pdf_autofillr_mapper-1.0.1.tar.gz (15.5 MB)

Uploaded Source

Built Distribution

pdf_autofillr_mapper-1.0.1-py3-none-any.whl (15.5 MB)

Uploaded Python 3

File details

Details for the file pdf_autofillr_mapper-1.0.1.tar.gz.

File metadata

  • Download URL: pdf_autofillr_mapper-1.0.1.tar.gz
  • Size: 15.5 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.5

File hashes

Hashes for pdf_autofillr_mapper-1.0.1.tar.gz
| Algorithm | Hash digest |
|---|---|
| SHA256 | dc5921ac17dfd7f44729af9beb13e182ccc3133918200cca291b791729953e9e |
| MD5 | 09bb320b41809d9cf3f1a4a6a9d955c0 |
| BLAKE2b-256 | 50f9c4aefa5eaa86996d9fc6540ba056aa851845e39839f00526c81a62a802d6 |

File details

Details for the file pdf_autofillr_mapper-1.0.1-py3-none-any.whl.

File hashes

Hashes for pdf_autofillr_mapper-1.0.1-py3-none-any.whl
| Algorithm | Hash digest |
|---|---|
| SHA256 | 896240404a657b7bb9d31b64daa97a0e102cbf3a19abc27948b0c3b8ef4e2a63 |
| MD5 | f1da33a45fa6613a6da1b1413c002a0b |
| BLAKE2b-256 | b03f45bb89b1edf4a5f64826b56eb7126806feff618b2eafca6beebf5278e7ea |
