askemblaex

Document extraction, reconciliation, entity extraction, and embedding for genealogical sources.

askemblaex processes PDF and image-based genealogical documents through a multi-stage pipeline:

  1. Extract — run multiple OCR providers per page
  2. Reconcile — merge OCR outputs into a single best-quality transcription via OpenAI
  3. Entities — extract structured persons, events, and claims from each page via windowed AI analysis
  4. Embed — generate vector embeddings from reconciled text or person descriptions

Each stage is independently runnable and fully incremental — nothing is overwritten unless explicitly forced.


Features

  • Multi-method OCR extraction — Azure Computer Vision, Azure Document Intelligence, PyMuPDF, and pdfplumber run per page
  • AI reconciliation — OpenAI merges all OCR outputs into the most accurate transcription, preferring Azure sources
  • Windowed entity extraction — structured persons, events, claims, and places extracted from each page with surrounding page context
  • Per-person embeddings — each person mention gets an embedding-ready text string and optional vector
  • Structured output — every document gets a hash-keyed folder; every page gets its own JSON file
  • Incremental processing — partial runs pick up where they left off; model-change detection triggers re-reconciliation automatically
  • Interactive preflight — missing credentials trigger a prompt to enter them, skip the service, or quit

Installation

pip install askemblaex

System dependencies

pdf2image requires Poppler:

# macOS
brew install poppler

# Ubuntu / Debian
sudo apt-get install poppler-utils

# Windows — download from https://github.com/oschwartz10612/poppler-windows
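If Poppler is missing, pdf2image fails only at runtime, partway through extraction. A quick preflight check (a hypothetical helper, not part of askemblaex) is to look for Poppler's pdftoppm binary on PATH:

```python
import shutil

def poppler_available() -> bool:
    """Return True if Poppler's pdftoppm binary (used by pdf2image) is on PATH."""
    return shutil.which("pdftoppm") is not None

if not poppler_available():
    print("Poppler not found; install it before running --extract on PDFs")
```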

Configuration

Create a .env file in your working directory (or set environment variables directly):

# Azure Computer Vision (optional — enables azure_computer_vision method)
AZURE_VISION_ENDPOINT=https://<resource>.cognitiveservices.azure.com/
AZURE_VISION_KEY=<your-key>

# Azure Document Intelligence (optional — enables azure_docint method)
AZURE_DOCINT_ENDPOINT=https://<resource>.cognitiveservices.azure.com/
AZURE_DOCINT_KEY=<your-key>

# OpenAI (required for --reconcile and --entities)
OPENAI_KEY=<your-key>
OPENAI_MODEL=gpt-4o
OPENAI_BASE_URL=https://api.openai.com/v1   # override for proxies / compatible APIs

# OpenAI entity extraction model (optional — falls back to OPENAI_MODEL then gpt-4o)
OPENAI_ENTITY_MODEL=gpt-4o

# Ollama embeddings (optional — takes priority over OpenAI for --embed)
OLLAMA_ENDPOINT=http://localhost:11434
OLLAMA_EMODEL=nomic-embed-text
OLLAMA_EDIM=768       # optional dimension validation

# OpenAI embeddings (used if Ollama is not configured)
OPENAI_EMODEL=text-embedding-3-small
OPENAI_EDIM=1536      # optional dimension validation

# Override recognised image extensions (default: .png,.jpg,.jpeg,.tiff,.tif)
ASKEMBLAEX_IMG_EXTS=.png,.jpg,.tif
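Because every service above is enabled purely by environment variables, the preflight behaviour can be approximated in plain Python. This sketch (a hypothetical helper, not the package's own preflight) reports which variables each service is still missing:

```python
import os

# Each service is enabled only when all of its variables are set.
SERVICES = {
    "azure_computer_vision": ["AZURE_VISION_ENDPOINT", "AZURE_VISION_KEY"],
    "azure_docint": ["AZURE_DOCINT_ENDPOINT", "AZURE_DOCINT_KEY"],
    "openai": ["OPENAI_KEY"],
    "ollama": ["OLLAMA_ENDPOINT", "OLLAMA_EMODEL"],
}

def available_services(env=None) -> dict:
    """Map each service name to the list of variables it is still missing."""
    env = os.environ if env is None else env
    return {name: [v for v in required if not env.get(v)]
            for name, required in SERVICES.items()}

missing = available_services({"OPENAI_KEY": "sk-test"})
# "openai" maps to an empty list; the Azure and Ollama entries list their gaps
```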

CLI usage

askemblaex [--source DIR] [--output DIR]
           [--extract] [--reconcile] [--entities] [--embed]
           [--methods METHOD ...] [--skip-methods METHOD ...]
           [--force-extract] [--force-reconcile] [--force-entities] [--force-embed]
           [--dpi DPI] [--recursive] [-v|-vv|-vvv]
           [--list-methods]

Flags reference

Directories

Flag                  Description
--source DIR, -s DIR  Source directory containing PDFs or images
--output DIR, -o DIR  Output directory. Defaults to --source

Pipeline stages

Flag            Description                                                    Requires
--extract       Run multi-method OCR extraction                                --source
--reconcile     Merge OCR outputs into best-quality text via OpenAI            Extracted pages, OPENAI_KEY
--entities      Extract structured persons/events/claims per page via OpenAI   Reconciled pages, OPENAI_KEY
--embed         Generate vector embeddings from reconciled page text           Reconciled pages, Ollama or OpenAI embeddings
--list-methods  Print available extraction methods and exit

Method control (extraction only)

Flag                         Description
--methods METHOD [...]       Whitelist specific extraction methods (default: all)
--skip-methods METHOD [...]  Exclude specific extraction methods

Force flags

Flag               Description
--force-extract    Re-extract even if already marked complete
--force-reconcile  Re-reconcile even if model matches
--force-entities   Re-extract entities even if already done with same model
--force-embed      Re-embed even if already embedded with same model

Other options

Flag             Default  Description
--dpi INT        300      DPI for PDF page rendering
--recursive, -r  off      Search source directory recursively
-v / -vv / -vvv           Verbosity: INFO / DEBUG / DEBUG + tracebacks

Common workflows

# List available extraction methods
askemblaex --list-methods

# Extract all PDFs in a folder (output defaults to source)
askemblaex --source /path/to/pdfs --extract

# Extract to a separate output folder
askemblaex --source /path/to/pdfs --output /path/to/output --extract

# Full pipeline in one command
askemblaex --source /path/to/pdfs --output /path/to/output \
    --extract --reconcile --entities --embed

# Reconcile previously extracted pages
askemblaex --output /path/to/output --reconcile

# Entity extraction only (requires reconciled text)
askemblaex --output /path/to/output --entities

# Generate embeddings only (requires reconciled text)
askemblaex --output /path/to/output --embed

# Only run specific extraction methods
askemblaex --source /path/to/pdfs --extract \
    --methods azure_computer_vision pymupdf

# Skip a method
askemblaex --source /path/to/pdfs --extract --skip-methods pdfplumber

# Custom DPI for PDF rendering (default: 300)
askemblaex --source /path/to/pdfs --extract --dpi 400

# Force re-extraction (will not touch reconciled or entity data)
askemblaex --source /path/to/pdfs --output /path/to/output \
    --extract --force-extract

# Force re-reconciliation only
askemblaex --output /path/to/output --reconcile --force-reconcile

# Force re-entity-extraction only
askemblaex --output /path/to/output --entities --force-entities

# Recursive source tree + verbose output
askemblaex --source /path/to/pdfs --output /path/to/output \
    --extract --reconcile --recursive -vv

Verbosity levels

Flag    Level    Output
(none)  WARNING  Errors and service failures only
-v      INFO     Per-file progress
-vv     DEBUG    Per-page detail
-vvv    DEBUG    Full tracebacks on errors
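This count-to-level mapping is the usual argparse pattern; a sketch of how such a translation typically works (assumed, not the package's exact code):

```python
import logging

def verbosity_to_level(count: int) -> int:
    """Translate a repeated -v count into a logging level; 2+ means DEBUG."""
    return {0: logging.WARNING, 1: logging.INFO}.get(count, logging.DEBUG)

logging.basicConfig(level=verbosity_to_level(2))  # -vv selects DEBUG
```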

Output structure

Each processed document produces a hash-keyed folder under the output root:

output/
  2e7698fc.../
    2e7698fc....metadata._.json       ← document metadata + processing state
    2e7698fc....log                   ← per-document log
    2e7698fc....pymupdf.image.0.0.png ← embedded images extracted by PyMuPDF
    2e7698fc....pdfplumber.table.0.0.csv
    pages/
      2e7698fc....page.0000.json      ← all extraction data for page 0
      2e7698fc....page.0001.json
      ...
      2e7698fc....person.0000.0000.json  ← person mention, page 0, person 0
      2e7698fc....person.0000.0001.json  ← person mention, page 0, person 1
      ...
    images/
      page.0000.png                   ← rendered page images (for Azure CV)
      page.0000.pdf                   ← single-page PDFs (for Azure DocInt)
      ...
    logs/
      askemblaex.log

Metadata file (<hash>.metadata._.json)

{
  "_key": "2e7698fc...",
  "source": {
    "filename": "The Freewill Baptist Register.pdf",
    "type": ".pdf",
    "title": "The Freewill Baptist Register",
    "created_utc": "2026-02-24T10:00:00Z",
    "local": true,
    "uris": []
  },
  "raw": {
    "page_count": 42
  },
  "extraction": {
    "complete": true,
    "started_utc": "2026-02-24T10:01:00Z",
    "completed_utc": "2026-02-24T10:05:00Z",
    "steps": {
      "azure_computer_vision": true,
      "azure_docint": true,
      "pymupdf": true,
      "pdfplumber": true,
      "reconciled": true,
      "entities": true,
      "embeddings": true
    }
  }
}
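The extraction.steps block is what makes the pipeline incremental: a script can read it to decide what still needs to run. A sketch (hypothetical helper) against the metadata shape above:

```python
def pending_stages(metadata: dict) -> list:
    """Return the names of steps not yet marked True in extraction.steps."""
    steps = metadata.get("extraction", {}).get("steps", {})
    return [name for name, done in steps.items() if not done]

meta = {"extraction": {"steps": {"pymupdf": True, "reconciled": False, "embeddings": False}}}
```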

Page file (<hash>.page.<NNNN>.json)

{
  "schema_version": 1.0,
  "doc_id": "2e7698fc...",
  "page_num": 0,
  "default": "reconciled",
  "created_at": "...",
  "updated_at": "...",
  "extractions": {
    "azure_computer_vision": {
      "text": "...",
      "text_hash": "...",
      "method": "azure_computer_vision",
      "extracted_at": "..."
    },
    "azure_docint":  { "text": "...", "..." : "..." },
    "pymupdf":       { "text": "...", "..." : "..." },
    "pdfplumber":    { "text": "...", "..." : "..." },
    "reconciled": {
      "text": "...",
      "model": "gpt-4o",
      "provider": "openai",
      "source_methods": ["azure_computer_vision", "azure_docint", "pymupdf", "pdfplumber"],
      "extracted_at": "..."
    },
    "entities": {
      "raw": { "persons": [...], "events": [...], "claims": [...], "places": [...] },
      "model": "gpt-4o",
      "extracted_at": "...",
      "person_count": 3,
      "window_pages": [0, 1, 2],
      "person_files": ["2e7698fc....person.0000.0000.json", "..."]
    },
    "embedding": {
      "values": [0.012, -0.034, "..."],
      "model": "nomic-embed-text",
      "provider": "ollama",
      "dim": 768,
      "created_at": "..."
    }
  }
}

Person mention file (<hash>.person.<page>.<idx>.json)

One file is written per person mention found on a page.

{
  "_key": "person_john_smith_2e7698fc_p0_n0",
  "type": { "entity": "person", "source": "mention" },
  "process_version": 2,
  "schema_version": 4,
  "method": "windowed_default",
  "datetime": "...",
  "person_name": "John Smith",
  "source_document": "2e7698fc...",
  "page_num": 0,
  "person_num": 0,
  "page_text": "...",
  "window_text": "=== PAGE 0 (TARGET) ===\n...",
  "embedding_text": "TYPE: Person\nNAME: John Smith\nVITALS: born 1842 County Cork...",
  "embedding_model": "nomic-embed-text",
  "embedding": [0.012, -0.034, "..."],
  "structure": { "persons": [...], "events": [...] }
}

Available extraction methods

Method                 Description                                           Credentials required
azure_computer_vision  Azure Computer Vision Read OCR                        AZURE_VISION_ENDPOINT, AZURE_VISION_KEY
azure_docint           Azure Document Intelligence layout + key-value pairs  AZURE_DOCINT_ENDPOINT, AZURE_DOCINT_KEY
pymupdf                PyMuPDF embedded text layer                           None
pdfplumber             pdfplumber embedded text layer                        None

Module reference

askemblaex.main

CLI entrypoint and pipeline orchestration.

Symbol Description
main() Parse args, run preflight, execute pipeline stages
build_parser() Return configured argparse.ArgumentParser
run_extraction(source, output, *, ...) Run multi-method OCR extraction
run_reconciliation(output, *, ...) Run OpenAI reconciliation
run_entities(output, *, ...) Run windowed entity extraction
run_embedding(output, *, ...) Generate page-level embeddings

askemblaex.extract

File discovery and per-page text extraction.

Symbol Description
discover_files(root, *, recursive) Find all PDFs and images under a directory
process_one(src_path, output_root, logger, *, ...) Extract a single source file into a hash-keyed output folder
process_all(source_root, output_root, logger) Extract all files found under a directory
extract_page_text(path, logger, *, ...) Low-level per-page extraction returning List[PageExtraction]
PageExtraction Dataclass holding page_index and per-method text dict

askemblaex.reconcile

OpenAI-powered multi-source reconciliation.

Symbol Description
reconcile_folder(folder, *, client, model, ...) Reconcile all pages in a hash-keyed document folder
reconcile_page_file(page_file, doc_id, page_num, parent_folder, *, ...) Load, reconcile, and write back a single page file
reconcile_page(page_data, *, client, model) Reconcile one page's extraction data dict; returns text or None

askemblaex.window

Dynamic context-window builder for entity extraction.

Symbol Description
build_dynamic_extraction_window(pages, anchor_page, *, context_pages, max_chars) Build an ExtractionWindow centred on anchor_page with surrounding context pages
ExtractionWindow Dataclass with anchor_page, pages_included, text, char_count

How windowing works: the anchor page is always included in full. Up to context_pages (default 2) pages before and after are added alternately until the max_chars (default 30 000) budget is exhausted. The anchor page is labelled === PAGE N (TARGET) === in the text so the model knows which page to extract from.
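That alternating fill can be sketched as follows (a simplified reimplementation for illustration, not the package's own build_dynamic_extraction_window):

```python
def build_window(pages: dict, anchor: int, context_pages: int = 2, max_chars: int = 30_000) -> list:
    """Return the sorted page numbers included: the anchor, then neighbours added
    alternately before/after while they fit in the remaining character budget."""
    included = [anchor]
    budget = max_chars - len(pages[anchor])  # the anchor is always included in full
    for offset in range(1, context_pages + 1):
        for candidate in (anchor - offset, anchor + offset):  # alternate before/after
            text = pages.get(candidate)
            if text is not None and len(text) <= budget:
                included.append(candidate)
                budget -= len(text)
    return sorted(included)
```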


askemblaex.entities

Structured entity extraction and person embedding generation.

Symbol Description
entity_extract_folder(folder, *, client, model, embed_provider, ...) Extract entities for all pages in a document folder
entity_extract_page_file(page_file, parent_folder, page_num, all_page_texts, *, ...) Extract entities for one page, write person files and update page JSON
call_entity_extraction(window_text, *, client, model) Call OpenAI and return parsed entity JSON dict
build_person_embeddings(extraction_json) Build one embedding-ready text string per person from entity JSON

Entity schema (returned by call_entity_extraction):

{
  "persons": [
    {
      "name": "John Smith",
      "alt_names": ["J. Smith"],
      "summary": "Head of household in the 1881 census.",
      "roles": ["head of household"],
      "birth": { "date_text": "1842", "place": "County Cork, Ireland" },
      "death": { "date_text": "1910", "place": "Melbourne, Victoria" },
      "residences": ["Melbourne"],
      "relationships": {
        "parents": ["Thomas Smith", "Mary O'Brien"],
        "spouse": ["Catherine Murphy"],
        "children": ["William Smith"],
        "siblings": []
      },
      "attributes": ["farmer", "Catholic"],
      "events": ["Arrived Melbourne 1865"],
      "evidence_phrases": ["head of household", "born County Cork"]
    }
  ],
  "events": [ { "type": "marriage", "date_text": "1865", "people": ["John Smith", "Catherine Murphy"], "details": [] } ],
  "claims": [],
  "places": ["Melbourne", "County Cork"],
  "notes": []
}

askemblaex.embed

Vector embedding generation for reconciled page text.

Symbol Description
embed_folder(folder, *, provider, force, verbosity) Embed all reconciled pages in a document folder
embed_page_file(page_file, doc_id, page_num, parent_folder, *, provider, ...) Embed a single page file's reconciled text
generate_embedding(text, provider) Generate an embedding vector via Ollama or OpenAI
detect_provider() Return "ollama", "openai", or None based on environment

askemblaex.preflight

Service credential checks and interactive recovery.

Symbol Description
run_preflight(requested_methods, needs_reconcile, needs_entities, needs_embed, *, verbose) Run all required preflight checks; returns PreflightResult
PreflightResult Dataclass with active_methods, openai_available, openai_client, openai_model, embed_provider, services
ServiceStatus Dataclass with name, available, reason, env_vars

askemblaex.pages

Per-page JSON file I/O utilities.

Symbol Description
save_or_merge_page(parent_folder, doc_id, page, data) Create or update a page JSON file, merging extraction data
load_page(parent_folder, doc_id, page) Load a page JSON file; returns None if missing
build_page_schema(doc_id, page) Build a fresh page dict with empty extraction slots
get_page_number(filepath) Parse zero-based page number from a page filename
page_file_path(out_dir, doc_id, page) Return canonical path for a page JSON file

askemblaex.metadata

Document metadata building, reading, and writing.

Symbol Description
build_metadata(src_path, *, file_hash) Build a fresh metadata dict from a source file
write_metadata(out_dir, file_hash, metadata) Write metadata dict to <out_dir>/<hash>.metadata._.json
load_metadata(out_dir, file_hash) Load metadata dict; returns None if missing
merge_metadata(existing, new, *, overwrite) Shallow-merge two metadata dicts

askemblaex.hash

Content-based file hashing.

Symbol Description
hash_file(path, algo) Hash a file by content using chunked reads; returns hex digest
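Chunked content hashing like hash_file needs only the standard library; a sketch (not the package's exact implementation):

```python
import hashlib
from pathlib import Path

def hash_file(path: Path, algo: str = "sha256", chunk_size: int = 1 << 20) -> str:
    """Hash a file's content in fixed-size chunks so large PDFs never load fully into memory."""
    h = hashlib.new(algo)
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()
```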

Python API examples

Extract a single file

from askemblaex.extract import process_one
from pathlib import Path
import logging

logger = logging.getLogger("myapp")

process_one(
    Path("registers/freewill_baptist.pdf"),
    Path("output/"),
    logger,
    active_methods={"azure_computer_vision", "pymupdf"},
    pdf_dpi=300,
)

Reconcile a document folder

from askemblaex.reconcile import reconcile_folder
from openai import OpenAI
from pathlib import Path

client = OpenAI(api_key="...")
reconcile_folder(
    Path("output/2e7698fc.../"),
    client=client,
    model="gpt-4o",
)

Extract entities from a reconciled folder

from askemblaex.entities import entity_extract_folder
from openai import OpenAI
from pathlib import Path

client = OpenAI(api_key="...")
extracted, skipped = entity_extract_folder(
    Path("output/2e7698fc.../"),
    client=client,
    model="gpt-4o",
    embed_provider="ollama",   # or "openai", or None to skip embeddings
)
print(f"{extracted} pages extracted, {skipped} skipped")

Build a context window manually

from askemblaex.window import build_dynamic_extraction_window

pages = {
    0: "Page zero text...",
    1: "Page one text...",
    2: "Page two text...",
    3: "Page three text...",
}

window = build_dynamic_extraction_window(pages, anchor_page=2, context_pages=1)
print(f"Window covers pages {window.pages_included}, {window.char_count} chars")
# Window covers pages [1, 2, 3], 123 chars

Build person embedding text

from askemblaex.entities import build_person_embeddings
import json

structured = {
    "persons": [{
        "name": "John Smith",
        "summary": "Head of household.",
        "birth": {"date_text": "1842", "place": "County Cork"},
        "death": {"date_text": "1910", "place": "Melbourne"},
        "relationships": {"spouse": ["Catherine Murphy"], "children": ["William Smith"]},
        "attributes": ["farmer"],
        "events": ["Arrived Melbourne 1865"],
        "evidence_phrases": ["head of household"],
    }],
    "events": [],
    "claims": [],
}

texts = build_person_embeddings(json.dumps(structured))
print(texts[0])
# TYPE: Person
# NAME: John Smith
# ALT_NAMES: none
# SUMMARY: Head of household.
# VITALS: born 1842 County Cork; died 1910 Melbourne
# ATTRIBUTES: farmer
# RELATIONSHIPS:
# - spouse: Catherine Murphy
# - children: William Smith
# EVENTS:
# - Arrived Melbourne 1865
# EVIDENCE_ANCHORS:
# - head of household
# YEAR_HINT: 1842–1910

Generate an embedding

from askemblaex.embed import generate_embedding

vector = generate_embedding("John Smith, farmer, born 1842 County Cork", provider="ollama")
print(len(vector))  # e.g. 768

Publishing to PyPI

pip install hatch

# Build
hatch build

# Upload to TestPyPI first
hatch publish --repo test

# Upload to PyPI
hatch publish

Development

git clone https://github.com/askembla/askemblaex
cd askemblaex
pip install -e ".[dev]"
pytest

License

MIT — see LICENSE.
