CLI toolkit for digitizing, organizing, transcribing, and searching family archives

HistoryTools

A CLI toolkit for digitizing, organizing, transcribing, and searching family archives — scanned documents, photos, audio recordings, and more. Turn a box of old scans, cassette tapes, and photos into a searchable, organized, transcribed digital archive.

Installation

git clone https://github.com/mmackelprang/HistoryTools.git
cd HistoryTools
pip install -e ".[all]"

After installation, the family-archive command is available:

family-archive --help

System tools (also needed)

Tesseract OCR: https://github.com/tesseract-ocr/tesseract
FFmpeg: https://ffmpeg.org/download.html

Quick Start

# 1. Run the setup wizard (creates config, sets up API keys)
family-archive init

# 2. Process your archive
family-archive ingest /path/to/your/scans

That's it. The init wizard walks you through configuration interactively. For manual setup, see docs/WORKFLOW.md.

Commands

Ingest — Process Everything at Once

The fastest way to process a folder of scans, recordings, and photos:

# Scan and classify all files, produce a plan for review
family-archive ingest /path/to/scans --scan

# Review _ingest-plan.json (edit classifications if needed)

# Execute the full pipeline (copy, transcribe, format, rename)
family-archive ingest --execute

# Or do it interactively (scan → approve → execute)
family-archive ingest /path/to/scans

# Merge new files into an existing archive
family-archive ingest /path/to/new-scans --scan --mode merge

# Source can be a ZIP file (nested ZIPs are handled too)
family-archive ingest /path/to/archive.zip --scan

Ingest is fully restartable: if a run is interrupted, run family-archive ingest --execute again to resume.

Individual Steps

You can also run each step individually for more control:

# Organize files into the archive structure
family-archive organize --dry-run        # preview
family-archive organize                  # run

# Transcribe PDFs — tiered approach (free first, AI only when needed)
python scripts/transcribe_pdfs.py        # free: native text + Tesseract OCR
family-archive transcribe --low-confidence-only   # paid: AI only for low-confidence results
family-archive transcribe                # paid: AI for all untranscribed PDFs

# Transcribe audio (AssemblyAI — with speaker diarization)
family-archive transcribe-audio --dry-run
family-archive transcribe-audio

# Assign real names to speaker labels (e.g., Speaker A → Alice)
family-archive speakers path/to/transcript.md              # interactive
family-archive speakers --dir AudioRecordings --map "A=Alice,B=Bob"  # batch

# Format transcripts with summaries and markdown structure
family-archive format --dry-run
family-archive format                    # free mechanical cleanup
family-archive format --with-summary     # + AI summary (requires API key)

# Propose descriptive filenames for generic files
family-archive rename --dry-run          # preview
family-archive rename                    # generate proposals
# Review _rename-proposals.md, then:
family-archive rename --apply            # apply approved renames

# Detect dates in undated files
family-archive detect-dates              # generate proposals
family-archive detect-dates --apply      # apply approved dates

# Split compilation PDFs into individual documents
family-archive split --dry-run           # preview splittable files
family-archive split                     # generate split proposals
# Review _split-proposals.md, then:
family-archive split --apply --dry-run   # preview
family-archive split --apply             # apply approved splits
family-archive split --apply --archive-original  # move originals to _compilations/

# Extract text from Office documents (DOC, DOCX, XLS, XLSX)
family-archive extract --dry-run          # preview
family-archive extract                    # extract all
family-archive extract --folder NeedsReview  # one folder

# Catalog photos
family-archive photos

# Detect and manage duplicate files
family-archive duplicates --scan           # detect duplicates
family-archive duplicates --apply          # quarantine approved duplicates
family-archive duplicates --status         # check quarantine
family-archive duplicates --purge          # delete quarantined files past their TTL

# Generate the archive report
family-archive report

# Build the search index (rebuilds from filesystem)
family-archive reindex
family-archive reindex --check    # verify index matches filesystem

# Search across all transcripts
family-archive search "Springfield"
family-archive search "Springfield" --folder Letters
family-archive search "Springfield" --type audio --year 1984

# Archive statistics
family-archive stats

# Review AI API costs
family-archive costs              # summary by pipeline step
family-archive costs --detail     # per-session breakdown

# Check tool installation
family-archive verify
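
The --map "A=Alice,B=Bob" option shown above for family-archive speakers pairs each speaker label with a real name. The sketch below illustrates one way such a mapping could be parsed and applied. It is not the toolkit's code: the function names and the assumption that labels appear in transcripts as "Speaker A", "Speaker B", and so on are hypothetical.

```python
import re


def parse_speaker_map(spec: str) -> dict[str, str]:
    """Turn "A=Alice,B=Bob" into {"A": "Alice", "B": "Bob"}."""
    mapping = {}
    for pair in spec.split(","):
        label, sep, name = pair.partition("=")
        if not sep or not label.strip() or not name.strip():
            raise ValueError(f"expected LABEL=NAME, got {pair!r}")
        mapping[label.strip()] = name.strip()
    return mapping


def rename_speakers(transcript: str, mapping: dict[str, str]) -> str:
    """Replace "Speaker X" labels; labels missing from the mapping are left alone."""
    def swap(match: re.Match) -> str:
        return mapping.get(match.group(1), match.group(0))

    # Assumed label format: the word "Speaker" followed by a short identifier.
    return re.sub(r"\bSpeaker (\w+)\b", swap, transcript)


text = "Speaker A: Do you remember the farm?\nSpeaker B: Of course.\nSpeaker C: Hm."
print(rename_speakers(text, parse_speaker_map("A=Alice,B=Bob")))
```

Leaving unmapped labels untouched means a partial mapping never loses information; the remaining speakers can be named in a later pass.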

Targeting Specific Files or Folders

Most commands support --folder and --file for targeted processing:

family-archive transcribe --folder Journals
family-archive format --file Letters/1983/letter.transcript.md
family-archive rename --folder FamilyMembers

Transcription Strategy

PDF transcription uses a tiered approach to minimize AI costs:

1. Native text extraction (free, instant): PDFs with embedded text are extracted using PyMuPDF
2. Tesseract OCR (free, slower): Scanned/image PDFs are OCR'd locally
3. Gemini AI vision (paid, best quality): Only used for files where steps 1-2 produced low-confidence results (typically handwritten documents)
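
The tiered fallback above can be sketched as follows. This is a minimal illustration, not the toolkit's implementation: the Result type, the transcribe_tiered function, the stand-in extractors, and the 0.8 confidence threshold are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Result:
    text: str
    confidence: float  # 0.0 (unusable) to 1.0 (clean text)
    tier: str


# Hypothetical signature: an extractor takes a PDF path and returns
# (text, confidence), or None when it cannot handle the file.
Extractor = Callable[[str], Optional[tuple[str, float]]]


def transcribe_tiered(
    pdf_path: str,
    free_tiers: list[tuple[str, Extractor]],
    paid_tier: Optional[tuple[str, Extractor]] = None,
    threshold: float = 0.8,  # assumed cut-off, not the toolkit's value
) -> Optional[Result]:
    """Try the free tiers in order; use the paid tier only if they fall short."""
    best: Optional[Result] = None
    for name, extract in free_tiers:
        out = extract(pdf_path)
        if out is None:
            continue
        text, conf = out
        if best is None or conf > best.confidence:
            best = Result(text, conf, name)
        if conf >= threshold:
            return best  # good enough, skip the remaining tiers
    if paid_tier is not None:
        name, extract = paid_tier
        out = extract(pdf_path)
        if out is not None:
            return Result(out[0], out[1], name)
    return best  # best free result, or None if nothing worked


# Stand-in extractors so the sketch runs without PyMuPDF, Tesseract, or an API key.
def native_text(path: str):
    return None  # pretend the PDF has no embedded text


def tesseract_ocr(path: str):
    return ("illegible handwr1ting", 0.35)


def ai_vision(path: str):
    return ("Dear Alice, the harvest was good this year.", 0.95)


free = [("native", native_text), ("tesseract", tesseract_ocr)]
result = transcribe_tiered("letter.pdf", free, paid_tier=("ai", ai_vision))
print(result.tier, result.confidence)
```

Calling transcribe_tiered without a paid_tier returns the best free result, which mirrors running only the free tiers and leaving low-confidence files for a later paid pass.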

The ingest pipeline runs all three tiers automatically. When running manually:

python scripts/transcribe_pdfs.py                    # free: tiers 1 + 2
family-archive transcribe --low-confidence-only       # paid: tier 3 for low-confidence only

# Batch mode (default — 50% cheaper, processes overnight)
family-archive transcribe                  # submit batch jobs
family-archive transcribe --status         # check batch progress
family-archive transcribe --collect        # retrieve results

# Real-time mode (immediate results)
family-archive transcribe --fast           # cross-PDF parallelism

AI-Powered Features

These features require API keys (see docs/SETUP-API-KEYS.md):

Command	Default Service	Alternatives	What It Does	Estimated Cost
`family-archive transcribe`	Google Gemini	OpenAI GPT-4o	AI vision (batch default, 50% cheaper)	~$0.25-0.50 per 1000 pages
`family-archive transcribe --low-confidence-only`	Google Gemini	OpenAI GPT-4o	AI only for low-confidence files	Much less (only handwriting)
`family-archive transcribe-audio`	AssemblyAI	--	Speaker-diarized audio transcription	~$0.01/minute
`family-archive format`	— (mechanical)	—	Page breaks, whitespace cleanup, artifact removal	Free
`family-archive format --with-summary`	Any AI vendor	—	+ AI-generated summary at top	~$0.10-0.20 per 500 files
`family-archive rename`	Google Gemini	OpenAI GPT-4o	AI-suggested filenames	~$0.10-0.30 per 500 files
`family-archive detect-dates`	Google Gemini	OpenAI GPT-4o	Date detection in undated files	~$0.05-0.10 per 200 files
`family-archive split`	Google Gemini	OpenAI GPT-4o	Document boundary detection	~$0.01-0.05 per file

All AI features are optional. Without API keys, local tools (Tesseract OCR, Whisper) are used instead. A unified AI client (ai_client.py) supports Gemini, OpenAI, and Anthropic — vendor swapping via a --vendor CLI flag is planned for an upcoming release.

AI costs are tracked automatically. Run family-archive costs to see token usage and estimated spend across all sessions. Costs are estimates based on published per-token pricing — compare against your vendor dashboards for exact billing.
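
For a rough budget before starting, the per-unit figures in the table above can be multiplied out. The helper below is only a back-of-the-envelope sketch: the RATES dictionary and the estimate function are hypothetical, and the rates are the approximate ranges copied from the table, not live vendor pricing.

```python
# Approximate (low, high) rates in USD per unit, copied from the table above.
RATES = {
    "transcribe": (0.25 / 1000, 0.50 / 1000),  # per page
    "transcribe-audio": (0.01, 0.01),          # per minute
    "rename": (0.10 / 500, 0.30 / 500),        # per file
}


def estimate(step: str, units: int) -> tuple[float, float]:
    """Return the (low, high) estimated cost in USD for a pipeline step."""
    low, high = RATES[step]
    return round(low * units, 2), round(high * units, 2)


# Example archive: 4,000 scanned pages and 10 hours (600 minutes) of tape.
print("pages:", estimate("transcribe", 4000))
print("audio:", estimate("transcribe-audio", 600))
```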

Modes

Standalone (`"mode": "standalone"`)

Creates a fresh organized archive from scratch.

Merge (`"mode": "merge"`)

Adds new files into an existing organized archive. Detects duplicates by MD5 hash.
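
Hash-based duplicate detection of this kind can be sketched with the standard library. The function names and the demo file names below are hypothetical, and the toolkit's own merge logic may differ; the sketch only shows the general technique of comparing MD5 digests of file contents.

```python
import hashlib
import tempfile
from pathlib import Path


def md5_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large scans or recordings don't fill memory."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def split_new_and_duplicate(incoming: list[Path], archive_root: Path):
    """Return (new_files, duplicates) by comparing content hashes."""
    known = {md5_of(p) for p in archive_root.rglob("*") if p.is_file()}
    new_files, duplicates = [], []
    for path in incoming:
        h = md5_of(path)
        if h in known:
            duplicates.append(path)
        else:
            new_files.append(path)
            known.add(h)  # also catches duplicates within the incoming batch
    return new_files, duplicates


# Demo in a throwaway directory: one file already archived, two incoming.
with tempfile.TemporaryDirectory() as tmp:
    archive = Path(tmp, "Organized")
    inbox = Path(tmp, "new-scans")
    archive.mkdir()
    inbox.mkdir()
    (archive / "1983-06-12_letter.pdf").write_bytes(b"same bytes")
    (inbox / "scan_001.pdf").write_bytes(b"same bytes")       # duplicate content
    (inbox / "scan_002.pdf").write_bytes(b"different bytes")  # new content
    new_files, duplicates = split_new_and_duplicate(sorted(inbox.iterdir()), archive)
    new_names = [p.name for p in new_files]
    duplicate_names = [p.name for p in duplicates]

print(new_names, duplicate_names)
```

Because the comparison is on content rather than file name, a rescanned copy saved under a different name is still recognized as a duplicate.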

Configuration

config.json — Paths and settings

{
  "source_root": "/path/to/source/files",
  "dest_root": "/path/to/Organized",
  "mode": "standalone",
  "whisper_model": "base",
  "transcribe_folders": ["Letters", "Journals", "Cards", "Documents/Writings"]
}
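
If you generate or edit config.json by hand, a quick sanity check can catch typos before a long run. The loader below is a hypothetical sketch, not part of the toolkit: the DEFAULTS values and the choice of required keys are assumptions based only on the example above and the two modes described in this README.

```python
import json

# Assumed defaults, taken from the example config above.
DEFAULTS = {"mode": "standalone", "whisper_model": "base", "transcribe_folders": []}
VALID_MODES = {"standalone", "merge"}


def load_config(text: str) -> dict:
    """Parse config JSON, fill in defaults, and reject obviously bad values."""
    config = {**DEFAULTS, **json.loads(text)}
    missing = [key for key in ("source_root", "dest_root") if key not in config]
    if missing:
        raise ValueError(f"missing required keys: {', '.join(missing)}")
    if config["mode"] not in VALID_MODES:
        raise ValueError(f"mode must be one of {sorted(VALID_MODES)}, got {config['mode']!r}")
    return config


example = """
{
  "source_root": "/path/to/source/files",
  "dest_root": "/path/to/Organized",
  "mode": "standalone",
  "whisper_model": "base",
  "transcribe_folders": ["Letters", "Journals", "Cards", "Documents/Writings"]
}
"""
config = load_config(example)
print(config["mode"], config["transcribe_folders"])
```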

taxonomy.json — File classification rules

Controls how files are classified, which keywords trigger which folders, and which processing steps apply to each file type. Ships with sensible defaults, fully customizable.

# Add a new file extension (e.g., .webp as a photo type)
# Edit taxonomy.json → file_types → photo → extensions

# Add a new classification folder (e.g., Military records)
# Edit taxonomy.json → folders → add "Military/Service" with keywords

# Customize processing pipelines
# Edit taxonomy.json → processing_pipelines

See taxonomy.example.json for a fully commented reference. If taxonomy.json is missing, built-in defaults are used automatically.
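
To make the idea of keyword-driven classification concrete, here is a small sketch. The dictionary layout is a simplified, hypothetical stand-in that borrows only the key names mentioned above (file_types, extensions, folders, keywords); the real taxonomy.json schema may differ, so treat taxonomy.example.json as the authority.

```python
from pathlib import PurePath

# Hypothetical, simplified taxonomy; the real taxonomy.json schema may differ.
TAXONOMY = {
    "file_types": {
        "photo": {"extensions": [".jpg", ".png", ".webp"]},
        "document": {"extensions": [".pdf"]},
    },
    "folders": {
        "Letters": {"keywords": ["letter", "dear"]},
        "Military/Service": {"keywords": ["army", "discharge", "enlistment"]},
    },
}


def file_type(name: str, taxonomy: dict) -> str:
    """Look up the file type by extension, case-insensitively."""
    ext = PurePath(name).suffix.lower()
    for kind, spec in taxonomy["file_types"].items():
        if ext in spec["extensions"]:
            return kind
    return "unknown"


def classify(name: str, taxonomy: dict) -> str:
    """Return the first folder whose keywords match, else NeedsReview."""
    lowered = name.lower()
    for folder, spec in taxonomy["folders"].items():
        if any(keyword in lowered for keyword in spec["keywords"]):
            return folder
    return "NeedsReview"


print(file_type("scan_001.WEBP", TAXONOMY))
print(classify("army_discharge_1945.pdf", TAXONOMY))
print(classify("IMG_0042.jpg", TAXONOMY))
```

Falling back to NeedsReview rather than guessing keeps unmatched files visible for manual sorting.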

.env — API keys

cp .env.example .env
# Edit with your keys (see docs/SETUP-API-KEYS.md)

File Naming Convention

All files are renamed to: YYYY-MM-DD_descriptive-slug.ext

Date sources, in order of precedence: filename, then EXIF metadata, then content analysis
Unknown dates: undated_slug.ext
Partial dates: 1983-06-00_slug.ext (month known, day unknown)
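
The naming pattern can be expressed as a short function. This is a sketch under assumptions: slugify and archive_name are hypothetical names, the exact slug rules are a guess, and writing an unknown month as 00 extends the documented day-unknown case rather than reflecting confirmed behavior.

```python
import re
from typing import Optional


def slugify(text: str) -> str:
    """Lowercase, and collapse runs of non-alphanumerics into single hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def archive_name(
    description: str,
    ext: str,
    year: Optional[int] = None,
    month: Optional[int] = None,
    day: Optional[int] = None,
) -> str:
    """Build YYYY-MM-DD_descriptive-slug.ext, or undated_slug.ext without a year."""
    slug = slugify(description)
    ext = ext.lower().lstrip(".")
    if year is None:
        return f"undated_{slug}.{ext}"
    # Unknown parts are written as 00, as in 1983-06-00_slug.ext.
    mm = month or 0
    dd = day or 0
    return f"{year:04d}-{mm:02d}-{dd:02d}_{slug}.{ext}"


print(archive_name("Letter to Grandma", ".PDF", 1983, 6, 12))
print(archive_name("Letter to Grandma", "pdf", 1983, 6))
print(archive_name("Old photo!", "jpg"))
```

A useful property of this pattern is that plain alphabetical sorting puts dated files in chronological order, with partial dates sorting ahead of full dates in the same month.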

Safety

Source files are never modified or deleted — all operations produce copies
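Duplicates are moved to quarantine rather than deleted, so you can review them at your leisure; nothing is removed until you run family-archive duplicates --purge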
Unclassifiable files go to NeedsReview — nothing is silently discarded
All operations support --dry-run — preview before committing
All operations are restartable — interrupted jobs resume where they left off
Proposals require review — renames, date changes, and splits are proposed then applied

Dependencies

Installed automatically via pip install -e ".[all]":

Package	Purpose
PyMuPDF	PDF text extraction & page rendering
Pillow	Image processing
python-dotenv	Load API keys from .env
google-genai	Gemini AI for handwriting OCR
openai	OpenAI API (alternative to Gemini/Claude)
assemblyai	Audio transcription with speaker ID
anthropic	Transcript formatting with Claude
openai-whisper	Local audio transcription
exifread	EXIF metadata from photos
imagehash	Perceptual duplicate detection
python-docx	DOCX text extraction
openpyxl	XLSX spreadsheet extraction
xlrd	XLS (legacy) spreadsheet extraction
olefile	DOC (legacy Word) extraction

System tools (install separately): Tesseract OCR, FFmpeg

Documentation

File	Contents
docs/SETUP-API-KEYS.md	API keys, costs, recommended models, vendor options
docs/SYSTEM-REQUIREMENTS.md	OS, Python, disk space, RAM, system tools
docs/WORKFLOW.md	Step-by-step processing guide
docs/VISION.md	Long-term product vision and roadmap

Roadmap

Open source (this repo):

Phase 1 ✅: CLI toolkit for documents, audio, and basic organization
Phase 2 ✅: Core library, SQLite/FTS5 search, document splitting, duplicate detection, Gemini batch processing, Office document extraction
Phase 3: Web UI, video transcription, email import, photo AI, embedding-based search

Subscription service (historytools.io):

Phase 3+: Web UI, managed AI gateway, photo AI, timeline/map/people graph, narrative generation, FamilySearch integration, multi-family collaboration

The open-source CLI is fully functional on its own. The subscription service adds a web UI, managed AI, and advanced visualization features. Data is fully portable between both. See docs/VISION.md for the complete roadmap.

License

MIT License — see LICENSE

This version

0.2.0

Apr 13, 2026

Download files

Source Distribution

family_archive_toolkit-0.2.0.tar.gz (142.6 kB view details)

Uploaded Apr 13, 2026 Source

Built Distribution

family_archive_toolkit-0.2.0-py3-none-any.whl (121.8 kB view details)

Uploaded Apr 13, 2026 Python 3

File details

Details for the file family_archive_toolkit-0.2.0.tar.gz.

File metadata

Download URL: family_archive_toolkit-0.2.0.tar.gz
Upload date: Apr 13, 2026
Size: 142.6 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.7

File hashes

Hashes for family_archive_toolkit-0.2.0.tar.gz
Algorithm	Hash digest
SHA256	`ff51bd39da9de3f7752030254976f4ee57f99a93f6bbb2cd77a7f8b16cf2564f`
MD5	`ba028c633762af4e1e8c4314ab85be3f`
BLAKE2b-256	`5c7317fa157698e3c8b549b1a176dbc02190f66f2fc190b21d88f030c32fc664`

File details

Details for the file family_archive_toolkit-0.2.0-py3-none-any.whl.

File metadata

Download URL: family_archive_toolkit-0.2.0-py3-none-any.whl
Upload date: Apr 13, 2026
Size: 121.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.7

File hashes

Hashes for family_archive_toolkit-0.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`9622e2784e95a3238672bf206354ef09a44eeed0f3bea5916a0ccbf025ff6dbd`
MD5	`0d9244575da7806f425a31baa9e791ea`
BLAKE2b-256	`cf22bffc83013c4b3ab8b8e2f808d291e6d32be79baa3f1e3bcc85d8e230128e`

family-archive-toolkit 0.2.0
