# askemblaex

Document extraction, reconciliation, entity extraction, and embedding for genealogical sources.
askemblaex processes PDF and image-based genealogical documents through a multi-stage pipeline:

- **Extract** — run multiple OCR providers per page
- **Reconcile** — merge OCR outputs into a single best-quality transcription via OpenAI
- **Entities** — extract structured persons, events, and claims from each page via windowed AI analysis
- **Embed** — generate vector embeddings from reconciled text or person descriptions

Each stage is independently runnable and fully incremental — nothing is overwritten unless explicitly forced.
## Features

- **Multi-method OCR extraction** — Azure Computer Vision, Azure Document Intelligence, PyMuPDF, and pdfplumber run per page
- **AI reconciliation** — OpenAI merges all OCR outputs into the most accurate transcription, preferring Azure sources
- **Windowed entity extraction** — structured persons, events, claims, and places extracted from each page with surrounding page context
- **Per-person embeddings** — each person mention gets an embedding-ready text string and an optional vector
- **Structured output** — every document gets a hash-keyed folder; every page gets its own JSON file
- **Incremental processing** — partial runs pick up where they left off; model-change detection triggers re-reconciliation automatically
- **Interactive preflight** — missing credentials trigger a prompt to enter them, skip the service, or quit
## Installation

```bash
pip install askemblaex
```
### System dependencies

pdf2image requires Poppler:

```bash
# macOS
brew install poppler

# Ubuntu / Debian
sudo apt-get install poppler-utils
```

On Windows, download Poppler from https://github.com/oschwartz10612/poppler-windows.
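Before a long batch run it can help to confirm Poppler is actually reachable. A minimal sketch, assuming only that pdf2image shells out to Poppler's `pdftoppm` binary:

```python
import shutil


def poppler_available() -> bool:
    """Return True if Poppler's pdftoppm binary is on PATH (pdf2image needs it)."""
    return shutil.which("pdftoppm") is not None
```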
## Configuration

Create a `.env` file in your working directory (or set environment variables directly):

```ini
# Azure Computer Vision (optional — enables the azure_computer_vision method)
AZURE_VISION_ENDPOINT=https://<resource>.cognitiveservices.azure.com/
AZURE_VISION_KEY=<your-key>

# Azure Document Intelligence (optional — enables the azure_docint method)
AZURE_DOCINT_ENDPOINT=https://<resource>.cognitiveservices.azure.com/
AZURE_DOCINT_KEY=<your-key>

# OpenAI (required for --reconcile and --entities)
OPENAI_KEY=<your-key>
OPENAI_MODEL=gpt-4o
OPENAI_BASE_URL=https://api.openai.com/v1  # override for proxies / compatible APIs

# OpenAI entity extraction model (optional — falls back to OPENAI_MODEL, then gpt-4o)
OPENAI_ENTITY_MODEL=gpt-4o

# Ollama embeddings (optional — takes priority over OpenAI for --embed)
OLLAMA_ENDPOINT=http://localhost:11434
OLLAMA_EMODEL=nomic-embed-text
OLLAMA_EDIM=768  # optional dimension validation

# OpenAI embeddings (used if Ollama is not configured)
OPENAI_EMODEL=text-embedding-3-small
OPENAI_EDIM=1536  # optional dimension validation

# Override recognised image extensions (default: .png,.jpg,.jpeg,.tiff,.tif)
ASKEMBLAEX_IMG_EXTS=.png,.jpg,.tif
```
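askemblaex reads these values from the environment. If you want to inspect a `.env` file yourself, here is a standard-library-only sketch of the parsing; the project may rely on a dedicated library such as python-dotenv, so this is illustrative, not its actual loader:

```python
def load_env_file(path: str) -> dict[str, str]:
    """Parse KEY=VALUE lines from a .env-style file.

    Skips blank lines and # comments, and strips trailing inline comments
    such as 'OLLAMA_EDIM=768  # optional dimension validation'.
    """
    values: dict[str, str] = {}
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            value = value.split(" #", 1)[0].strip()
            values[key.strip()] = value
    return values
```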
## CLI usage

```text
askemblaex [--source DIR] [--output DIR]
           [--extract] [--reconcile] [--entities] [--embed]
           [--methods METHOD ...] [--skip-methods METHOD ...]
           [--force-extract] [--force-reconcile] [--force-entities] [--force-embed]
           [--dpi DPI] [--recursive] [-v|-vv|-vvv]
           [--list-methods]
```
## Flags reference

### Directories

| Flag | Description |
|---|---|
| `--source DIR`, `-s DIR` | Source directory containing PDFs or images |
| `--output DIR`, `-o DIR` | Output directory; defaults to `--source` |
### Pipeline stages

| Flag | Description | Requires |
|---|---|---|
| `--extract` | Run multi-method OCR extraction | `--source` |
| `--reconcile` | Merge OCR outputs into best-quality text via OpenAI | Extracted pages, `OPENAI_KEY` |
| `--entities` | Extract structured persons/events/claims per page via OpenAI | Reconciled pages, `OPENAI_KEY` |
| `--embed` | Generate vector embeddings from reconciled page text | Reconciled pages, Ollama or OpenAI embeddings |
| `--list-methods` | Print available extraction methods and exit | — |
### Method control (extraction only)

| Flag | Description |
|---|---|
| `--methods METHOD [...]` | Whitelist specific extraction methods (default: all) |
| `--skip-methods METHOD [...]` | Exclude specific extraction methods |
### Force flags

| Flag | Description |
|---|---|
| `--force-extract` | Re-extract even if already marked complete |
| `--force-reconcile` | Re-reconcile even if the model matches |
| `--force-entities` | Re-extract entities even if already done with the same model |
| `--force-embed` | Re-embed even if already embedded with the same model |
### Other options

| Flag | Default | Description |
|---|---|---|
| `--dpi INT` | 300 | DPI for PDF page rendering |
| `--recursive`, `-r` | off | Search the source directory recursively |
| `-v` / `-vv` / `-vvv` | — | Verbosity: INFO / DEBUG / DEBUG + tracebacks |
## Common workflows

```bash
# List available extraction methods
askemblaex --list-methods

# Extract all PDFs in a folder (output defaults to source)
askemblaex --source /path/to/pdfs --extract

# Extract to a separate output folder
askemblaex --source /path/to/pdfs --output /path/to/output --extract

# Full pipeline in one command
askemblaex --source /path/to/pdfs --output /path/to/output \
    --extract --reconcile --entities --embed

# Reconcile previously extracted pages
askemblaex --output /path/to/output --reconcile

# Entity extraction only (requires reconciled text)
askemblaex --output /path/to/output --entities

# Generate embeddings only (requires reconciled text)
askemblaex --output /path/to/output --embed

# Only run specific extraction methods
askemblaex --source /path/to/pdfs --extract \
    --methods azure_computer_vision pymupdf

# Skip a method
askemblaex --source /path/to/pdfs --extract --skip-methods pdfplumber

# Custom DPI for PDF rendering (default: 300)
askemblaex --source /path/to/pdfs --extract --dpi 400

# Force re-extraction (will not touch reconciled or entity data)
askemblaex --source /path/to/pdfs --output /path/to/output \
    --extract --force-extract

# Force re-reconciliation only
askemblaex --output /path/to/output --reconcile --force-reconcile

# Force re-entity-extraction only
askemblaex --output /path/to/output --entities --force-entities

# Recursive source tree + verbose output
askemblaex --source /path/to/pdfs --output /path/to/output \
    --extract --reconcile --recursive -vv
```
## Verbosity levels

| Flag | Level | Output |
|---|---|---|
| (none) | WARNING | Errors and service failures only |
| `-v` | INFO | Per-file progress |
| `-vv` | DEBUG | Per-page detail |
| `-vvv` | DEBUG | Full tracebacks on errors |
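In Python terms, the table corresponds to a mapping like the following sketch; the actual flag handling lives in `askemblaex.main`, so this is illustrative only:

```python
import logging


def verbosity_to_level(v: int) -> int:
    """Map a count of -v flags to a logging level, per the table above."""
    if v <= 0:
        return logging.WARNING  # default: errors and service failures only
    if v == 1:
        return logging.INFO     # per-file progress
    return logging.DEBUG        # -vv per-page detail; -vvv also prints tracebacks
```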
## Output structure

Each processed document produces a hash-keyed folder under the output root:

```text
output/
  2e7698fc.../
    2e7698fc....metadata._.json          ← document metadata + processing state
    2e7698fc....log                      ← per-document log
    2e7698fc....pymupdf.image.0.0.png    ← embedded images extracted by PyMuPDF
    2e7698fc....pdfplumber.table.0.0.csv
    pages/
      2e7698fc....page.0000.json         ← all extraction data for page 0
      2e7698fc....page.0001.json
      ...
      2e7698fc....person.0000.0000.json  ← person mention, page 0, person 0
      2e7698fc....person.0000.0001.json  ← person mention, page 0, person 1
      ...
    images/
      page.0000.png                      ← rendered page images (for Azure CV)
      page.0000.pdf                      ← single-page PDFs (for Azure DocInt)
      ...
  logs/
    askemblaex.log
```
### Metadata file (`<hash>.metadata._.json`)

```json
{
  "_key": "2e7698fc...",
  "source": {
    "filename": "The Freewill Baptist Register.pdf",
    "type": ".pdf",
    "title": "The Freewill Baptist Register",
    "created_utc": "2026-02-24T10:00:00Z",
    "local": true,
    "uris": []
  },
  "raw": {
    "page_count": 42
  },
  "extraction": {
    "complete": true,
    "started_utc": "2026-02-24T10:01:00Z",
    "completed_utc": "2026-02-24T10:05:00Z",
    "steps": {
      "azure_computer_vision": true,
      "azure_docint": true,
      "pymupdf": true,
      "pdfplumber": true,
      "reconciled": true,
      "entities": true,
      "embeddings": true
    }
  }
}
```
### Page file (`<hash>.page.<NNNN>.json`)

```json
{
  "schema_version": 1.0,
  "doc_id": "2e7698fc...",
  "page_num": 0,
  "default": "reconciled",
  "created_at": "...",
  "updated_at": "...",
  "extractions": {
    "azure_computer_vision": {
      "text": "...",
      "text_hash": "...",
      "method": "azure_computer_vision",
      "extracted_at": "..."
    },
    "azure_docint": { "text": "...", "...": "..." },
    "pymupdf": { "text": "...", "...": "..." },
    "pdfplumber": { "text": "...", "...": "..." },
    "reconciled": {
      "text": "...",
      "model": "gpt-4o",
      "provider": "openai",
      "source_methods": ["azure_computer_vision", "azure_docint", "pymupdf", "pdfplumber"],
      "extracted_at": "..."
    },
    "entities": {
      "raw": { "persons": [...], "events": [...], "claims": [...], "places": [...] },
      "model": "gpt-4o",
      "extracted_at": "...",
      "person_count": 3,
      "window_pages": [0, 1, 2],
      "person_files": ["2e7698fc....person.0000.0000.json", "..."]
    },
    "embedding": {
      "values": [0.012, -0.034, "..."],
      "model": "nomic-embed-text",
      "provider": "ollama",
      "dim": 768,
      "created_at": "..."
    }
  }
}
```
### Person mention file (`<hash>.person.<page>.<idx>.json`)

One file is written per person mention found on a page.

```json
{
  "_key": "person_john_smith_2e7698fc_p0_n0",
  "type": { "entity": "person", "source": "mention" },
  "process_version": 2,
  "schema_version": 4,
  "method": "windowed_default",
  "datetime": "...",
  "person_name": "John Smith",
  "source_document": "2e7698fc...",
  "page_num": 0,
  "person_num": 0,
  "page_text": "...",
  "window_text": "=== PAGE 0 (TARGET) ===\n...",
  "embedding_text": "TYPE: Person\nNAME: John Smith\nVITALS: born 1842 County Cork...",
  "embedding_model": "nomic-embed-text",
  "embedding": [0.012, -0.034, "..."],
  "structure": { "persons": [...], "events": [...] }
}
```
## Available extraction methods

| Method | Description | Credentials required |
|---|---|---|
| `azure_computer_vision` | Azure Computer Vision Read OCR | `AZURE_VISION_ENDPOINT`, `AZURE_VISION_KEY` |
| `azure_docint` | Azure Document Intelligence layout + key-value pairs | `AZURE_DOCINT_ENDPOINT`, `AZURE_DOCINT_KEY` |
| `pymupdf` | PyMuPDF embedded text layer | None |
| `pdfplumber` | pdfplumber embedded text layer | None |
## Module reference

### askemblaex.main

CLI entrypoint and pipeline orchestration.

| Symbol | Description |
|---|---|
| `main()` | Parse args, run preflight, execute pipeline stages |
| `build_parser()` | Return a configured `argparse.ArgumentParser` |
| `run_extraction(source, output, *, ...)` | Run multi-method OCR extraction |
| `run_reconciliation(output, *, ...)` | Run OpenAI reconciliation |
| `run_entities(output, *, ...)` | Run windowed entity extraction |
| `run_embedding(output, *, ...)` | Generate page-level embeddings |
### askemblaex.extract

File discovery and per-page text extraction.

| Symbol | Description |
|---|---|
| `discover_files(root, *, recursive)` | Find all PDFs and images under a directory |
| `process_one(src_path, output_root, logger, *, ...)` | Extract a single source file into a hash-keyed output folder |
| `process_all(source_root, output_root, logger)` | Extract all files found under a directory |
| `extract_page_text(path, logger, *, ...)` | Low-level per-page extraction returning `List[PageExtraction]` |
| `PageExtraction` | Dataclass holding `page_index` and a per-method text dict |
### askemblaex.reconcile

OpenAI-powered multi-source reconciliation.

| Symbol | Description |
|---|---|
| `reconcile_folder(folder, *, client, model, ...)` | Reconcile all pages in a hash-keyed document folder |
| `reconcile_page_file(page_file, doc_id, page_num, parent_folder, *, ...)` | Load, reconcile, and write back a single page file |
| `reconcile_page(page_data, *, client, model)` | Reconcile one page's extraction data dict; returns text or `None` |
### askemblaex.window

Dynamic context-window builder for entity extraction.

| Symbol | Description |
|---|---|
| `build_dynamic_extraction_window(pages, anchor_page, *, context_pages, max_chars)` | Build an `ExtractionWindow` centred on `anchor_page` with surrounding context pages |
| `ExtractionWindow` | Dataclass with `anchor_page`, `pages_included`, `text`, `char_count` |

How windowing works: the anchor page is always included in full. Up to `context_pages` (default 2) pages before and after are added alternately until the `max_chars` (default 30,000) budget is exhausted. The anchor page is labelled `=== PAGE N (TARGET) ===` in the text so the model knows which page to extract from.
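The page-selection rule described above can be sketched as follows. This is an illustration, not the library's actual code, and it assumes that at each distance the page before the anchor is tried before the page after:

```python
def build_window_pages(pages: dict[int, str], anchor: int,
                       context_pages: int = 2, max_chars: int = 30_000) -> list[int]:
    """Choose page numbers for a context window: anchor first, then pages
    at increasing distance on either side, while the character budget holds."""
    chosen = [anchor]
    budget = max_chars - len(pages[anchor])  # the anchor is always included in full
    for offset in range(1, context_pages + 1):
        for candidate in (anchor - offset, anchor + offset):
            text = pages.get(candidate)
            if text is None or len(text) > budget:
                continue  # missing page, or it would blow the budget
            chosen.append(candidate)
            budget -= len(text)
    return sorted(chosen)
```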
### askemblaex.entities

Structured entity extraction and person embedding generation.

| Symbol | Description |
|---|---|
| `entity_extract_folder(folder, *, client, model, embed_provider, ...)` | Extract entities for all pages in a document folder |
| `entity_extract_page_file(page_file, parent_folder, page_num, all_page_texts, *, ...)` | Extract entities for one page, write person files, and update the page JSON |
| `call_entity_extraction(window_text, *, client, model)` | Call OpenAI and return the parsed entity JSON dict |
| `build_person_embeddings(extraction_json)` | Build one embedding-ready text string per person from entity JSON |
Entity schema (returned by `call_entity_extraction`):

```json
{
  "persons": [
    {
      "name": "John Smith",
      "alt_names": ["J. Smith"],
      "summary": "Head of household in the 1881 census.",
      "roles": ["head of household"],
      "birth": { "date_text": "1842", "place": "County Cork, Ireland" },
      "death": { "date_text": "1910", "place": "Melbourne, Victoria" },
      "residences": ["Melbourne"],
      "relationships": {
        "parents": ["Thomas Smith", "Mary O'Brien"],
        "spouse": ["Catherine Murphy"],
        "children": ["William Smith"],
        "siblings": []
      },
      "attributes": ["farmer", "Catholic"],
      "events": ["Arrived Melbourne 1865"],
      "evidence_phrases": ["head of household", "born County Cork"]
    }
  ],
  "events": [
    { "type": "marriage", "date_text": "1865", "people": ["John Smith", "Catherine Murphy"], "details": [] }
  ],
  "claims": [],
  "places": ["Melbourne", "County Cork"],
  "notes": []
}
```
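A quick way to sanity-check a model response against this schema before writing person files, using a hypothetical helper that is not part of the package:

```python
def missing_entity_keys(payload: dict) -> list[str]:
    """Return the top-level keys from the entity schema absent in a response."""
    required = ("persons", "events", "claims", "places", "notes")
    return [key for key in required if key not in payload]
```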
### askemblaex.embed

Vector embedding generation for reconciled page text.

| Symbol | Description |
|---|---|
| `embed_folder(folder, *, provider, force, verbosity)` | Embed all reconciled pages in a document folder |
| `embed_page_file(page_file, doc_id, page_num, parent_folder, *, provider, ...)` | Embed a single page file's reconciled text |
| `generate_embedding(text, provider)` | Generate an embedding vector via Ollama or OpenAI |
| `detect_provider()` | Return `"ollama"`, `"openai"`, or `None` based on the environment |
### askemblaex.preflight

Service credential checks and interactive recovery.

| Symbol | Description |
|---|---|
| `run_preflight(requested_methods, needs_reconcile, needs_entities, needs_embed, *, verbose)` | Run all required preflight checks; returns a `PreflightResult` |
| `PreflightResult` | Dataclass with `active_methods`, `openai_available`, `openai_client`, `openai_model`, `embed_provider`, `services` |
| `ServiceStatus` | Dataclass with `name`, `available`, `reason`, `env_vars` |
### askemblaex.pages

Per-page JSON file I/O utilities.

| Symbol | Description |
|---|---|
| `save_or_merge_page(parent_folder, doc_id, page, data)` | Create or update a page JSON file, merging extraction data |
| `load_page(parent_folder, doc_id, page)` | Load a page JSON file; returns `None` if missing |
| `build_page_schema(doc_id, page)` | Build a fresh page dict with empty extraction slots |
| `get_page_number(filepath)` | Parse the zero-based page number from a page filename |
| `page_file_path(out_dir, doc_id, page)` | Return the canonical path for a page JSON file |
### askemblaex.metadata

Document metadata building, reading, and writing.

| Symbol | Description |
|---|---|
| `build_metadata(src_path, *, file_hash)` | Build a fresh metadata dict from a source file |
| `write_metadata(out_dir, file_hash, metadata)` | Write the metadata dict to `<out_dir>/<hash>.metadata._.json` |
| `load_metadata(out_dir, file_hash)` | Load the metadata dict; returns `None` if missing |
| `merge_metadata(existing, new, *, overwrite)` | Shallow-merge two metadata dicts |
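A shallow merge means top-level keys are copied whole rather than recursively combined. An illustrative sketch of the described behaviour, assuming `overwrite` controls whether `new` wins on conflicts:

```python
def merge_metadata(existing: dict, new: dict, *, overwrite: bool = False) -> dict:
    """Shallow-merge two metadata dicts: keys from `new` fill gaps in
    `existing`; with overwrite=True, `new` also replaces existing keys."""
    merged = dict(existing)
    for key, value in new.items():
        if overwrite or key not in merged:
            merged[key] = value
    return merged
```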
### askemblaex.hash

Content-based file hashing.

| Symbol | Description |
|---|---|
| `hash_file(path, algo)` | Hash a file by content using chunked reads; returns the hex digest |
## Python API examples

### Extract a single file

```python
import logging
from pathlib import Path

from askemblaex.extract import process_one

logger = logging.getLogger("myapp")

process_one(
    Path("registers/freewill_baptist.pdf"),
    Path("output/"),
    logger,
    active_methods={"azure_computer_vision", "pymupdf"},
    pdf_dpi=300,
)
```
### Reconcile a document folder

```python
from pathlib import Path

from openai import OpenAI

from askemblaex.reconcile import reconcile_folder

client = OpenAI(api_key="...")
reconcile_folder(
    Path("output/2e7698fc.../"),
    client=client,
    model="gpt-4o",
)
```
### Extract entities from a reconciled folder

```python
from pathlib import Path

from openai import OpenAI

from askemblaex.entities import entity_extract_folder

client = OpenAI(api_key="...")
extracted, skipped = entity_extract_folder(
    Path("output/2e7698fc.../"),
    client=client,
    model="gpt-4o",
    embed_provider="ollama",  # or "openai", or None to skip embeddings
)
print(f"{extracted} pages extracted, {skipped} skipped")
```
### Build a context window manually

```python
from askemblaex.window import build_dynamic_extraction_window

pages = {
    0: "Page zero text...",
    1: "Page one text...",
    2: "Page two text...",
    3: "Page three text...",
}

window = build_dynamic_extraction_window(pages, anchor_page=2, context_pages=1)
print(f"Window covers pages {window.pages_included}, {window.char_count} chars")
# Window covers pages [1, 2, 3], 123 chars
```
### Build person embedding text

```python
import json

from askemblaex.entities import build_person_embeddings

structured = {
    "persons": [{
        "name": "John Smith",
        "summary": "Head of household.",
        "birth": {"date_text": "1842", "place": "County Cork"},
        "death": {"date_text": "1910", "place": "Melbourne"},
        "relationships": {"spouse": ["Catherine Murphy"], "children": ["William Smith"]},
        "attributes": ["farmer"],
        "events": ["Arrived Melbourne 1865"],
        "evidence_phrases": ["head of household"],
    }],
    "events": [],
    "claims": [],
}

texts = build_person_embeddings(json.dumps(structured))
print(texts[0])
# TYPE: Person
# NAME: John Smith
# ALT_NAMES: none
# SUMMARY: Head of household.
# VITALS: born 1842 County Cork; died 1910 Melbourne
# ATTRIBUTES: farmer
# RELATIONSHIPS:
# - spouse: Catherine Murphy
# - children: William Smith
# EVENTS:
# - Arrived Melbourne 1865
# EVIDENCE_ANCHORS:
# - head of household
# YEAR_HINT: 1842–1910
```
### Generate an embedding

```python
from askemblaex.embed import generate_embedding

vector = generate_embedding("John Smith, farmer, born 1842 County Cork", provider="ollama")
print(len(vector))  # e.g. 768
```
## Publishing to PyPI

```bash
pip install hatch

# Build
hatch build

# Upload to TestPyPI first
hatch publish --repo test

# Upload to PyPI
hatch publish
```
## Development

```bash
git clone https://github.com/askembla/askemblaex
cd askemblaex
pip install -e ".[dev]"
pytest
```
## License

MIT — see LICENSE.