askemblaex

Document extraction, reconciliation, entity extraction, and embedding for genealogical sources.

askemblaex processes PDF and image-based genealogical documents through a multi-stage pipeline:

  1. Extract — run multiple OCR providers per page
  2. Reconcile — merge OCR outputs into a single best-quality transcription via OpenAI
  3. Entities — extract structured persons, events, and claims from each page via windowed AI analysis
  4. Embed — generate vector embeddings from reconciled text or person descriptions

Each stage is independently runnable and fully incremental — nothing is overwritten unless explicitly forced.


Features

  • Multi-method OCR extraction — Azure Computer Vision, Azure Document Intelligence, PyMuPDF, and pdfplumber run per page
  • AI reconciliation — OpenAI merges all OCR outputs into the most accurate transcription, preferring Azure sources
  • Windowed entity extraction — structured persons, events, claims, and places extracted from each page with surrounding page context
  • Per-person embeddings — each person mention gets an embedding-ready text string and optional vector
  • Structured output — every document gets a hash-keyed folder; every page gets its own JSON file
  • Incremental processing — partial runs pick up where they left off; model-change detection triggers re-reconciliation automatically
  • Interactive preflight — missing credentials trigger a prompt to enter them, skip the service, or quit

Installation

pip install askemblaex

System dependencies

pdf2image requires Poppler:

# macOS
brew install poppler

# Ubuntu / Debian
sudo apt-get install poppler-utils

# Windows — download from https://github.com/oschwartz10612/poppler-windows
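If Poppler is missing, pdf2image fails only at runtime, partway through extraction. A quick preflight check (a hypothetical helper, not part of askemblaex) is to look for Poppler's pdftoppm binary on PATH:

```python
import shutil

def poppler_available() -> bool:
    """Return True if Poppler's pdftoppm binary (used by pdf2image) is on PATH."""
    return shutil.which("pdftoppm") is not None

if not poppler_available():
    print("Poppler not found; install it before running --extract on PDFs")
```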

Configuration

Create a .env file in your working directory (or set environment variables directly):

# Azure Computer Vision (optional — enables azure_computer_vision method)
AZURE_VISION_ENDPOINT=https://<resource>.cognitiveservices.azure.com/
AZURE_VISION_KEY=<your-key>

# Azure Document Intelligence (optional — enables azure_docint method)
AZURE_DOCINT_ENDPOINT=https://<resource>.cognitiveservices.azure.com/
AZURE_DOCINT_KEY=<your-key>

# OpenAI (required for --reconcile and --entities)
OPENAI_KEY=<your-key>
OPENAI_MODEL=gpt-4o
OPENAI_BASE_URL=https://api.openai.com/v1   # override for proxies / compatible APIs

# OpenAI entity extraction model (optional — falls back to OPENAI_MODEL then gpt-4o)
OPENAI_ENTITY_MODEL=gpt-4o

# Ollama embeddings (optional — takes priority over OpenAI for --embed)
OLLAMA_ENDPOINT=http://localhost:11434
OLLAMA_EMODEL=nomic-embed-text
OLLAMA_EDIM=768       # optional dimension validation

# OpenAI embeddings (used if Ollama is not configured)
OPENAI_EMODEL=text-embedding-3-small
OPENAI_EDIM=1536      # optional dimension validation

# Override recognised image extensions (default: .png,.jpg,.jpeg,.tiff,.tif)
ASKEMBLAEX_IMG_EXTS=.png,.jpg,.tif
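Because every service above is enabled purely by environment variables, the preflight behaviour can be approximated in plain Python. This sketch (a hypothetical helper, not the package's own preflight) reports which variables each service is still missing:

```python
import os

# Each service is enabled only when all of its variables are set.
SERVICES = {
    "azure_computer_vision": ["AZURE_VISION_ENDPOINT", "AZURE_VISION_KEY"],
    "azure_docint": ["AZURE_DOCINT_ENDPOINT", "AZURE_DOCINT_KEY"],
    "openai": ["OPENAI_KEY"],
    "ollama": ["OLLAMA_ENDPOINT", "OLLAMA_EMODEL"],
}

def available_services(env=None) -> dict:
    """Map each service name to the list of variables it is still missing."""
    env = os.environ if env is None else env
    return {name: [v for v in required if not env.get(v)]
            for name, required in SERVICES.items()}

missing = available_services({"OPENAI_KEY": "sk-test"})
# "openai" maps to an empty list; the Azure and Ollama entries list their gaps
```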

CLI usage

askemblaex [--source DIR] [--output DIR]
           [--extract] [--reconcile] [--entities] [--embed]
           [--methods METHOD ...] [--skip-methods METHOD ...]
           [--force-extract] [--force-reconcile] [--force-entities] [--force-embed]
           [--dpi DPI] [--recursive] [-v|-vv|-vvv]
           [--list-methods]

Flags reference

Directories

Flag                  Description
--source DIR, -s DIR  Source directory containing PDFs or images
--output DIR, -o DIR  Output directory. Defaults to --source

Pipeline stages

Flag            Description                                                    Requires
--extract       Run multi-method OCR extraction                                --source
--reconcile     Merge OCR outputs into best-quality text via OpenAI            Extracted pages, OPENAI_KEY
--entities      Extract structured persons/events/claims per page via OpenAI   Reconciled pages, OPENAI_KEY
--embed         Generate vector embeddings from reconciled page text           Reconciled pages, Ollama or OpenAI embeddings
--list-methods  Print available extraction methods and exit

Method control (extraction only)

Flag                         Description
--methods METHOD [...]       Whitelist specific extraction methods (default: all)
--skip-methods METHOD [...]  Exclude specific extraction methods

Force flags

Flag               Description
--force-extract    Re-extract even if already marked complete
--force-reconcile  Re-reconcile even if model matches
--force-entities   Re-extract entities even if already done with same model
--force-embed      Re-embed even if already embedded with same model

Other options

Flag             Default  Description
--dpi INT        300      DPI for PDF page rendering
--recursive, -r  off      Search source directory recursively
-v / -vv / -vvv           Verbosity: INFO / DEBUG / DEBUG + tracebacks

Common workflows

# List available extraction methods
askemblaex --list-methods

# Extract all PDFs in a folder (output defaults to source)
askemblaex --source /path/to/pdfs --extract

# Extract to a separate output folder
askemblaex --source /path/to/pdfs --output /path/to/output --extract

# Full pipeline in one command
askemblaex --source /path/to/pdfs --output /path/to/output \
    --extract --reconcile --entities --embed

# Reconcile previously extracted pages
askemblaex --output /path/to/output --reconcile

# Entity extraction only (requires reconciled text)
askemblaex --output /path/to/output --entities

# Generate embeddings only (requires reconciled text)
askemblaex --output /path/to/output --embed

# Only run specific extraction methods
askemblaex --source /path/to/pdfs --extract \
    --methods azure_computer_vision pymupdf

# Skip a method
askemblaex --source /path/to/pdfs --extract --skip-methods pdfplumber

# Custom DPI for PDF rendering (default: 300)
askemblaex --source /path/to/pdfs --extract --dpi 400

# Force re-extraction (will not touch reconciled or entity data)
askemblaex --source /path/to/pdfs --output /path/to/output \
    --extract --force-extract

# Force re-reconciliation only
askemblaex --output /path/to/output --reconcile --force-reconcile

# Force re-entity-extraction only
askemblaex --output /path/to/output --entities --force-entities

# Recursive source tree + verbose output
askemblaex --source /path/to/pdfs --output /path/to/output \
    --extract --reconcile --recursive -vv

Verbosity levels

Flag    Level    Output
(none)  WARNING  Errors and service failures only
-v      INFO     Per-file progress
-vv     DEBUG    Per-page detail
-vvv    DEBUG    Full tracebacks on errors
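This count-to-level mapping is the usual argparse pattern; a sketch of how such a translation typically works (assumed, not the package's exact code):

```python
import logging

def verbosity_to_level(count: int) -> int:
    """Translate a repeated -v count into a logging level; 2+ means DEBUG."""
    return {0: logging.WARNING, 1: logging.INFO}.get(count, logging.DEBUG)

logging.basicConfig(level=verbosity_to_level(2))  # -vv selects DEBUG
```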

Output structure

Each processed document produces a hash-keyed folder under the output root:

output/
  2e7698fc.../
    2e7698fc....metadata._.json       ← document metadata + processing state
    2e7698fc....log                   ← per-document log
    2e7698fc....pymupdf.image.0.0.png ← embedded images extracted by PyMuPDF
    2e7698fc....pdfplumber.table.0.0.csv
    pages/
      2e7698fc....page.0000.json      ← all extraction data for page 0
      2e7698fc....page.0001.json
      ...
      2e7698fc....person.0000.0000.json  ← person mention, page 0, person 0
      2e7698fc....person.0000.0001.json  ← person mention, page 0, person 1
      ...
    images/
      page.0000.png                   ← rendered page images (for Azure CV)
      page.0000.pdf                   ← single-page PDFs (for Azure DocInt)
      ...
    logs/
      askemblaex.log

Metadata file (<hash>.metadata._.json)

{
  "_key": "2e7698fc...",
  "source": {
    "filename": "The Freewill Baptist Register.pdf",
    "type": ".pdf",
    "title": "The Freewill Baptist Register",
    "created_utc": "2026-02-24T10:00:00Z",
    "local": true,
    "uris": []
  },
  "raw": {
    "page_count": 42
  },
  "extraction": {
    "complete": true,
    "started_utc": "2026-02-24T10:01:00Z",
    "completed_utc": "2026-02-24T10:05:00Z",
    "steps": {
      "azure_computer_vision": true,
      "azure_docint": true,
      "pymupdf": true,
      "pdfplumber": true,
      "reconciled": true,
      "entities": true,
      "embeddings": true
    }
  }
}
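The extraction.steps block is what makes the pipeline incremental: a script can read it to decide what still needs to run. A sketch (hypothetical helper) against the metadata shape above:

```python
def pending_stages(metadata: dict) -> list:
    """Return the names of steps not yet marked True in extraction.steps."""
    steps = metadata.get("extraction", {}).get("steps", {})
    return [name for name, done in steps.items() if not done]

meta = {"extraction": {"steps": {"pymupdf": True, "reconciled": False, "embeddings": False}}}
```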

Page file (<hash>.page.<NNNN>.json)

{
  "schema_version": 1.0,
  "doc_id": "2e7698fc...",
  "page_num": 0,
  "default": "reconciled",
  "created_at": "...",
  "updated_at": "...",
  "extractions": {
    "azure_computer_vision": {
      "text": "...",
      "text_hash": "...",
      "method": "azure_computer_vision",
      "extracted_at": "..."
    },
    "azure_docint":  { "text": "...", "..." : "..." },
    "pymupdf":       { "text": "...", "..." : "..." },
    "pdfplumber":    { "text": "...", "..." : "..." },
    "reconciled": {
      "text": "...",
      "model": "gpt-4o",
      "provider": "openai",
      "source_methods": ["azure_computer_vision", "azure_docint", "pymupdf", "pdfplumber"],
      "extracted_at": "..."
    },
    "entities": {
      "raw": { "persons": [...], "events": [...], "claims": [...], "places": [...] },
      "model": "gpt-4o",
      "extracted_at": "...",
      "person_count": 3,
      "window_pages": [0, 1, 2],
      "person_files": ["2e7698fc....person.0000.0000.json", "..."]
    },
    "embedding": {
      "values": [0.012, -0.034, "..."],
      "model": "nomic-embed-text",
      "provider": "ollama",
      "dim": 768,
      "created_at": "..."
    }
  }
}

Person mention file (<hash>.person.<page>.<idx>.json)

One file is written per person mention found on a page.

{
  "_key": "person_john_smith_2e7698fc_p0_n0",
  "type": { "entity": "person", "source": "mention" },
  "process_version": 2,
  "schema_version": 4,
  "method": "windowed_default",
  "datetime": "...",
  "person_name": "John Smith",
  "source_document": "2e7698fc...",
  "page_num": 0,
  "person_num": 0,
  "page_text": "...",
  "window_text": "=== PAGE 0 (TARGET) ===\n...",
  "embedding_text": "TYPE: Person\nNAME: John Smith\nVITALS: born 1842 County Cork...",
  "embedding_model": "nomic-embed-text",
  "embedding": [0.012, -0.034, "..."],
  "structure": { "persons": [...], "events": [...] }
}

Available extraction methods

Method                 Description                                           Credentials required
azure_computer_vision  Azure Computer Vision Read OCR                        AZURE_VISION_ENDPOINT, AZURE_VISION_KEY
azure_docint           Azure Document Intelligence layout + key-value pairs  AZURE_DOCINT_ENDPOINT, AZURE_DOCINT_KEY
pymupdf                PyMuPDF embedded text layer                           None
pdfplumber             pdfplumber embedded text layer                        None

Module reference

askemblaex.main

CLI entrypoint and pipeline orchestration.

Symbol Description
main() Parse args, run preflight, execute pipeline stages
build_parser() Return configured argparse.ArgumentParser
run_extraction(source, output, *, ...) Run multi-method OCR extraction
run_reconciliation(output, *, ...) Run OpenAI reconciliation
run_entities(output, *, ...) Run windowed entity extraction
run_embedding(output, *, ...) Generate page-level embeddings

askemblaex.extract

File discovery and per-page text extraction.

Symbol Description
discover_files(root, *, recursive) Find all PDFs and images under a directory
process_one(src_path, output_root, logger, *, ...) Extract a single source file into a hash-keyed output folder
process_all(source_root, output_root, logger) Extract all files found under a directory
extract_page_text(path, logger, *, ...) Low-level per-page extraction returning List[PageExtraction]
PageExtraction Dataclass holding page_index and per-method text dict

askemblaex.reconcile

OpenAI-powered multi-source reconciliation.

Symbol Description
reconcile_folder(folder, *, client, model, ...) Reconcile all pages in a hash-keyed document folder
reconcile_page_file(page_file, doc_id, page_num, parent_folder, *, ...) Load, reconcile, and write back a single page file
reconcile_page(page_data, *, client, model) Reconcile one page's extraction data dict; returns text or None

askemblaex.window

Dynamic context-window builder for entity extraction.

Symbol Description
build_dynamic_extraction_window(pages, anchor_page, *, context_pages, max_chars) Build an ExtractionWindow centred on anchor_page with surrounding context pages
ExtractionWindow Dataclass with anchor_page, pages_included, text, char_count

How windowing works: the anchor page is always included in full. Up to context_pages (default 2) pages before and after are added alternately until the max_chars (default 30 000) budget is exhausted. The anchor page is labelled === PAGE N (TARGET) === in the text so the model knows which page to extract from.
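That alternating fill can be sketched as follows (a simplified reimplementation for illustration, not the package's own build_dynamic_extraction_window):

```python
def build_window(pages: dict, anchor: int, context_pages: int = 2, max_chars: int = 30_000) -> list:
    """Return the sorted page numbers included: the anchor, then neighbours added
    alternately before/after while they fit in the remaining character budget."""
    included = [anchor]
    budget = max_chars - len(pages[anchor])  # the anchor is always included in full
    for offset in range(1, context_pages + 1):
        for candidate in (anchor - offset, anchor + offset):  # alternate before/after
            text = pages.get(candidate)
            if text is not None and len(text) <= budget:
                included.append(candidate)
                budget -= len(text)
    return sorted(included)
```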


askemblaex.entities

Structured entity extraction and person embedding generation.

Symbol Description
entity_extract_folder(folder, *, client, model, embed_provider, ...) Extract entities for all pages in a document folder
entity_extract_page_file(page_file, parent_folder, page_num, all_page_texts, *, ...) Extract entities for one page, write person files and update page JSON
call_entity_extraction(window_text, *, client, model) Call OpenAI and return parsed entity JSON dict
build_person_embeddings(extraction_json) Build one embedding-ready text string per person from entity JSON

Entity schema (returned by call_entity_extraction):

{
  "persons": [
    {
      "name": "John Smith",
      "alt_names": ["J. Smith"],
      "summary": "Head of household in the 1881 census.",
      "roles": ["head of household"],
      "birth": { "date_text": "1842", "place": "County Cork, Ireland" },
      "death": { "date_text": "1910", "place": "Melbourne, Victoria" },
      "residences": ["Melbourne"],
      "relationships": {
        "parents": ["Thomas Smith", "Mary O'Brien"],
        "spouse": ["Catherine Murphy"],
        "children": ["William Smith"],
        "siblings": []
      },
      "attributes": ["farmer", "Catholic"],
      "events": ["Arrived Melbourne 1865"],
      "evidence_phrases": ["head of household", "born County Cork"]
    }
  ],
  "events": [ { "type": "marriage", "date_text": "1865", "people": ["John Smith", "Catherine Murphy"], "details": [] } ],
  "claims": [],
  "places": ["Melbourne", "County Cork"],
  "notes": []
}

askemblaex.embed

Vector embedding generation for reconciled page text.

Symbol Description
embed_folder(folder, *, provider, force, verbosity) Embed all reconciled pages in a document folder
embed_page_file(page_file, doc_id, page_num, parent_folder, *, provider, ...) Embed a single page file's reconciled text
generate_embedding(text, provider) Generate an embedding vector via Ollama or OpenAI
detect_provider() Return "ollama", "openai", or None based on environment

askemblaex.preflight

Service credential checks and interactive recovery.

Symbol Description
run_preflight(requested_methods, needs_reconcile, needs_entities, needs_embed, *, verbose) Run all required preflight checks; returns PreflightResult
PreflightResult Dataclass with active_methods, openai_available, openai_client, openai_model, embed_provider, services
ServiceStatus Dataclass with name, available, reason, env_vars

askemblaex.pages

Per-page JSON file I/O utilities.

Symbol Description
save_or_merge_page(parent_folder, doc_id, page, data) Create or update a page JSON file, merging extraction data
load_page(parent_folder, doc_id, page) Load a page JSON file; returns None if missing
build_page_schema(doc_id, page) Build a fresh page dict with empty extraction slots
get_page_number(filepath) Parse zero-based page number from a page filename
page_file_path(out_dir, doc_id, page) Return canonical path for a page JSON file

askemblaex.metadata

Document metadata building, reading, and writing.

Symbol Description
build_metadata(src_path, *, file_hash) Build a fresh metadata dict from a source file
write_metadata(out_dir, file_hash, metadata) Write metadata dict to <out_dir>/<hash>.metadata._.json
load_metadata(out_dir, file_hash) Load metadata dict; returns None if missing
merge_metadata(existing, new, *, overwrite) Shallow-merge two metadata dicts

askemblaex.hash

Content-based file hashing.

Symbol Description
hash_file(path, algo) Hash a file by content using chunked reads; returns hex digest
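Chunked content hashing like hash_file needs only the standard library; a sketch (not the package's exact implementation):

```python
import hashlib
from pathlib import Path

def hash_file(path: Path, algo: str = "sha256", chunk_size: int = 1 << 20) -> str:
    """Hash a file's content in fixed-size chunks so large PDFs never load fully into memory."""
    h = hashlib.new(algo)
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()
```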

Python API examples

Extract a single file

from askemblaex.extract import process_one
from pathlib import Path
import logging

logger = logging.getLogger("myapp")

process_one(
    Path("registers/freewill_baptist.pdf"),
    Path("output/"),
    logger,
    active_methods={"azure_computer_vision", "pymupdf"},
    pdf_dpi=300,
)

Reconcile a document folder

from askemblaex.reconcile import reconcile_folder
from openai import OpenAI
from pathlib import Path

client = OpenAI(api_key="...")
reconcile_folder(
    Path("output/2e7698fc.../"),
    client=client,
    model="gpt-4o",
)

Extract entities from a reconciled folder

from askemblaex.entities import entity_extract_folder
from openai import OpenAI
from pathlib import Path

client = OpenAI(api_key="...")
extracted, skipped = entity_extract_folder(
    Path("output/2e7698fc.../"),
    client=client,
    model="gpt-4o",
    embed_provider="ollama",   # or "openai", or None to skip embeddings
)
print(f"{extracted} pages extracted, {skipped} skipped")

Build a context window manually

from askemblaex.window import build_dynamic_extraction_window

pages = {
    0: "Page zero text...",
    1: "Page one text...",
    2: "Page two text...",
    3: "Page three text...",
}

window = build_dynamic_extraction_window(pages, anchor_page=2, context_pages=1)
print(f"Window covers pages {window.pages_included}, {window.char_count} chars")
# Window covers pages [1, 2, 3], 123 chars

Build person embedding text

from askemblaex.entities import build_person_embeddings
import json

structured = {
    "persons": [{
        "name": "John Smith",
        "summary": "Head of household.",
        "birth": {"date_text": "1842", "place": "County Cork"},
        "death": {"date_text": "1910", "place": "Melbourne"},
        "relationships": {"spouse": ["Catherine Murphy"], "children": ["William Smith"]},
        "attributes": ["farmer"],
        "events": ["Arrived Melbourne 1865"],
        "evidence_phrases": ["head of household"],
    }],
    "events": [],
    "claims": [],
}

texts = build_person_embeddings(json.dumps(structured))
print(texts[0])
# TYPE: Person
# NAME: John Smith
# ALT_NAMES: none
# SUMMARY: Head of household.
# VITALS: born 1842 County Cork; died 1910 Melbourne
# ATTRIBUTES: farmer
# RELATIONSHIPS:
# - spouse: Catherine Murphy
# - children: William Smith
# EVENTS:
# - Arrived Melbourne 1865
# EVIDENCE_ANCHORS:
# - head of household
# YEAR_HINT: 1842–1910

Generate an embedding

from askemblaex.embed import generate_embedding

vector = generate_embedding("John Smith, farmer, born 1842 County Cork", provider="ollama")
print(len(vector))  # e.g. 768

Publishing to PyPI

pip install hatch

# Build
hatch build

# Upload to TestPyPI first
hatch publish --repo test

# Upload to PyPI
hatch publish

Development

git clone https://github.com/askembla/askemblaex
cd askemblaex
pip install -e ".[dev]"
pytest

License

MIT — see LICENSE.
