CLI toolkit for digitizing, organizing, transcribing, and searching family archives

HistoryTools

A CLI toolkit for digitizing, organizing, transcribing, and searching family archives — scanned documents, photos, audio recordings, and more. Turn a box of old scans, cassette tapes, and photos into a searchable, organized, transcribed digital archive.

Installation

git clone https://github.com/mmackelprang/HistoryTools.git
cd HistoryTools
pip install -e ".[all]"

After installation, the family-archive command is available:

family-archive --help

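System tools (also needed)

Tesseract OCR and FFmpeg are not installed by pip and must be installed separately. Run family-archive verify to check the tool installation.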

Quick Start

# 1. Run the setup wizard (creates config, sets up API keys)
family-archive init

# 2. Process your archive
family-archive ingest /path/to/your/scans

That's it. The init wizard walks you through configuration interactively. For manual setup, see docs/WORKFLOW.md.

Commands

Ingest — Process Everything at Once

The fastest way to process a folder of scans, recordings, and photos:

# Scan and classify all files, produce a plan for review
family-archive ingest /path/to/scans --scan

# Review _ingest-plan.json (edit classifications if needed)

# Execute the full pipeline (copy, transcribe, format, rename)
family-archive ingest --execute

# Or do it interactively (scan → approve → execute)
family-archive ingest /path/to/scans

# Merge new files into an existing archive
family-archive ingest /path/to/new-scans --scan --mode merge

# Source can be a ZIP file (nested ZIPs are handled too)
family-archive ingest /path/to/archive.zip --scan

Ingest is fully restartable — if interrupted, run --execute again.

Individual Steps

You can also run each step individually for more control:

# Organize files into the archive structure
family-archive organize --dry-run        # preview
family-archive organize                  # run

# Transcribe PDFs — tiered approach (free first, AI only when needed)
python scripts/transcribe_pdfs.py        # free: native text + Tesseract OCR
family-archive transcribe --low-confidence-only   # paid: AI only for low-confidence results
family-archive transcribe                # paid: AI for all untranscribed PDFs

# Transcribe audio (AssemblyAI — with speaker diarization)
family-archive transcribe-audio --dry-run
family-archive transcribe-audio

# Assign real names to speaker labels (e.g., Speaker A → Alice)
family-archive speakers path/to/transcript.md              # interactive
family-archive speakers --dir AudioRecordings --map "A=Alice,B=Bob"  # batch

# Format transcripts with summaries and markdown structure
family-archive format --dry-run
family-archive format                    # free mechanical cleanup
family-archive format --with-summary     # + AI summary (requires API key)

# Propose descriptive filenames for generic files
family-archive rename --dry-run          # preview
family-archive rename                    # generate proposals
# Review _rename-proposals.md, then:
family-archive rename --apply            # apply approved renames

# Detect dates in undated files
family-archive detect-dates              # generate proposals
family-archive detect-dates --apply      # apply approved dates

# Split compilation PDFs into individual documents
family-archive split --dry-run           # preview splittable files
family-archive split                     # generate split proposals
# Review _split-proposals.md, then:
family-archive split --apply --dry-run   # preview
family-archive split --apply             # apply approved splits
family-archive split --apply --archive-original  # move originals to _compilations/

# Extract text from Office documents (DOC, DOCX, XLS, XLSX)
family-archive extract --dry-run          # preview
family-archive extract                    # extract all
family-archive extract --folder NeedsReview  # one folder

# Catalog photos, detect duplicates, generate report
family-archive photos

# Detect and manage duplicate files
family-archive duplicates --scan           # detect duplicates
family-archive duplicates --apply          # quarantine approved duplicates
family-archive duplicates --status         # check quarantine
family-archive duplicates --purge          # delete quarantined files past their TTL

# Generate a report
family-archive report

# Build the search index (rebuilds from filesystem)
family-archive reindex
family-archive reindex --check    # verify index matches filesystem

# Search across all transcripts
family-archive search "Springfield"
family-archive search "Springfield" --folder Letters
family-archive search "Springfield" --type audio --year 1984

# Archive statistics
family-archive stats

# Review AI API costs
family-archive costs              # summary by pipeline step
family-archive costs --detail     # per-session breakdown

# Check tool installation
family-archive verify

Targeting Specific Files or Folders

Most commands support --folder and --file for targeted processing:

family-archive transcribe --folder Journals
family-archive format --file Letters/1983/letter.transcript.md
family-archive rename --folder FamilyMembers

Transcription Strategy

PDF transcription uses a tiered approach to minimize AI costs:

  1. Native text extraction (free, instant) — PDFs with embedded text are extracted using PyMuPDF
  2. Tesseract OCR (free, slower) — Scanned/image PDFs are OCR'd locally
  3. Gemini AI vision (paid, best quality) — Only used for files where steps 1-2 produced low-confidence results (typically handwritten documents)
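
The tier ordering above can be sketched as a small fallback loop. This is a minimal illustration, not the toolkit's implementation: the `Result` type, the `transcribe_tiered` function, the 0.80 threshold, and the three stub extractors are all hypothetical stand-ins for PyMuPDF, Tesseract, and Gemini.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical cutoff; the toolkit's real low-confidence threshold may differ.
LOW_CONFIDENCE = 0.80


@dataclass
class Result:
    text: str
    confidence: float  # 0.0 to 1.0
    tier: str


Extractor = Callable[[str], Optional[Result]]


def transcribe_tiered(pdf_path: str, free_tiers: list[Extractor],
                      paid_tier: Optional[Extractor] = None) -> Optional[Result]:
    """Try the free extractors in order; call the paid one only if needed."""
    best: Optional[Result] = None
    for extract in free_tiers:
        result = extract(pdf_path)
        if result is None:
            continue  # this tier produced nothing, try the next one
        if result.confidence >= LOW_CONFIDENCE:
            return result  # good enough, stop before spending money
        if best is None or result.confidence > best.confidence:
            best = result
    if paid_tier is not None:
        paid = paid_tier(pdf_path)
        if paid is not None:
            return paid
    return best  # best free result, possibly low confidence or None


# Stub extractors standing in for the real tools (assumed behavior).
def native_text(path: str) -> Optional[Result]:
    return None  # scanned PDF: no embedded text


def tesseract_ocr(path: str) -> Optional[Result]:
    return Result("h?ndwr1tten lett3r", 0.41, "tesseract")


def gemini_vision(path: str) -> Optional[Result]:
    return Result("handwritten letter", 0.95, "gemini")


final = transcribe_tiered("letter.pdf", [native_text, tesseract_ocr], gemini_vision)
print(final.tier)  # prints "gemini"
```

Omitting `paid_tier` corresponds to the free-only run: the function then returns the best low-confidence result so it can be flagged for a later paid pass.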

The ingest pipeline runs all three tiers automatically. When running manually:

python scripts/transcribe_pdfs.py                    # free: tiers 1 + 2
family-archive transcribe --low-confidence-only       # paid: tier 3 for low-confidence only

# Batch mode (default: 50% cheaper, processes overnight)
family-archive transcribe                  # submit batch jobs
family-archive transcribe --status         # check batch progress
family-archive transcribe --collect        # retrieve results

# Real-time mode (immediate results)
family-archive transcribe --fast           # cross-PDF parallelism

AI-Powered Features

These features require API keys (see docs/SETUP-API-KEYS.md):

| Command | Default Service | Alternatives | What It Does | Estimated Cost |
|---|---|---|---|---|
| family-archive transcribe | Google Gemini | OpenAI GPT-4o | AI vision (batch default, 50% cheaper) | ~$0.25-0.50 per 1000 pages |
| family-archive transcribe --low-confidence-only | Google Gemini | OpenAI GPT-4o | AI only for low-confidence files | Much less (only handwriting) |
| family-archive transcribe-audio | AssemblyAI | -- | Speaker-diarized audio transcription | ~$0.01/minute |
| family-archive format | None (mechanical) | -- | Page breaks, whitespace cleanup, artifact removal | Free |
| family-archive format --with-summary | Any AI vendor | -- | + AI-generated summary at top | ~$0.10-0.20 per 500 files |
| family-archive rename | Google Gemini | OpenAI GPT-4o | AI-suggested filenames | ~$0.10-0.30 per 500 files |
| family-archive detect-dates | Google Gemini | OpenAI GPT-4o | Date detection in undated files | ~$0.05-0.10 per 200 files |
| family-archive split | Google Gemini | OpenAI GPT-4o | Document boundary detection | ~$0.01-0.05 per file |

All AI features are optional. Without API keys, local tools (Tesseract OCR, Whisper) are used instead. A unified AI client (ai_client.py) supports Gemini, OpenAI, and Anthropic — vendor swapping via a --vendor CLI flag is planned for an upcoming release.

AI costs are tracked automatically. Run family-archive costs to see token usage and estimated spend across all sessions. Costs are estimates based on published per-token pricing — compare against your vendor dashboards for exact billing.
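
For a rough budget before running anything, the per-unit figures from the table above can be multiplied out by hand. The sketch below does that arithmetic for PDF and audio transcription only; `estimate_cost` is a hypothetical helper, not part of the toolkit, and its output is no more precise than the table's estimates.

```python
# Rates copied from the cost table (USD); rough estimates, not billing data.
TRANSCRIBE_PER_1000_PAGES = (0.25, 0.50)  # low and high end
AUDIO_PER_MINUTE = 0.01


def estimate_cost(pages: int, audio_minutes: float) -> tuple[float, float]:
    """Return a (low, high) USD estimate for AI transcription of an archive."""
    audio = audio_minutes * AUDIO_PER_MINUTE
    low = pages / 1000 * TRANSCRIBE_PER_1000_PAGES[0] + audio
    high = pages / 1000 * TRANSCRIBE_PER_1000_PAGES[1] + audio
    return round(low, 2), round(high, 2)


# Example: 4000 scanned pages and 10 hours of cassette recordings.
low, high = estimate_cost(pages=4000, audio_minutes=600)
print(f"${low:.2f} to ${high:.2f}")  # prints "$7.00 to $8.00"
```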

Modes

Standalone ("mode": "standalone")

Creates a fresh organized archive from scratch.

Merge ("mode": "merge")

Adds new files into an existing organized archive. Detects duplicates by MD5 hash.
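
Hash-based duplicate detection of this kind can be sketched with the standard library alone. The function names and the in-memory set of known hashes are assumptions for illustration; the toolkit may store and compare hashes differently.

```python
import hashlib
from pathlib import Path


def md5_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large scans are not loaded into memory at once."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def split_new_and_duplicate(incoming: list[Path], archive_root: Path):
    """Partition incoming files into (new_files, duplicates) by content hash."""
    known = {md5_of(p) for p in archive_root.rglob("*") if p.is_file()}
    new_files, duplicates = [], []
    for path in incoming:
        file_hash = md5_of(path)
        if file_hash in known:
            duplicates.append(path)
        else:
            new_files.append(path)
            known.add(file_hash)  # also catches repeats within the incoming batch
    return new_files, duplicates
```

Because the comparison uses file contents, a scan that was renamed before being re-imported is still recognized as a duplicate.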

Configuration

config.json — Paths and settings

{
  "source_root": "/path/to/source/files",
  "dest_root": "/path/to/Organized",
  "mode": "standalone",
  "whisper_model": "base",
  "transcribe_folders": ["Letters", "Journals", "Cards", "Documents/Writings"]
}
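
When editing config.json by hand, a quick sanity check can catch a typo before a long run. The sketch below is an assumption about what is worth checking, not the toolkit's own validation: the `load_config` helper and the choice of required keys are hypothetical, while the two mode values come from the Modes section.

```python
import json
from pathlib import Path

VALID_MODES = {"standalone", "merge"}
# Assumed minimum set of keys; the toolkit may accept defaults for some of these.
REQUIRED_KEYS = ("source_root", "dest_root", "mode")


def load_config(path: str = "config.json") -> dict:
    """Read config.json and fail early on missing keys or an unknown mode."""
    config = json.loads(Path(path).read_text(encoding="utf-8"))
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise ValueError(f"config is missing keys: {', '.join(missing)}")
    if config["mode"] not in VALID_MODES:
        raise ValueError(
            f"unknown mode {config['mode']!r}; expected one of {sorted(VALID_MODES)}"
        )
    return config
```

Calling `load_config("config.json")` returns the parsed dictionary, or raises `ValueError` with a message naming the problem.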

taxonomy.json — File classification rules

Controls how files are classified, which keywords trigger which folders, and which processing steps apply to each file type. Ships with sensible defaults, fully customizable.

# Add a new file extension (e.g., .webp as a photo type)
# Edit taxonomy.json → file_types → photo → extensions

# Add a new classification folder (e.g., Military records)
# Edit taxonomy.json → folders → add "Military/Service" with keywords

# Customize processing pipelines
# Edit taxonomy.json → processing_pipelines

See taxonomy.example.json for a fully commented reference. If taxonomy.json is missing, built-in defaults are used automatically.

.env — API keys

cp .env.example .env
# Edit with your keys (see docs/SETUP-API-KEYS.md)

File Naming Convention

All files are renamed to: YYYY-MM-DD_descriptive-slug.ext

  • Dates sourced from: filename > EXIF > content analysis
  • Unknown dates: undated_slug.ext
  • Partial dates: 1983-06-00_slug.ext (month known, day unknown)
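
The naming rules above can be illustrated with a short helper. This is a sketch under assumptions: `slugify` and `archive_name` are hypothetical names, the slug rules (lowercase, hyphen-separated) are a guess at what "descriptive-slug" means, and padding an unknown month with 00 extends the documented unknown-day case.

```python
import re
from typing import Optional


def slugify(description: str) -> str:
    """Lowercase the text and join alphanumeric runs with hyphens."""
    slug = re.sub(r"[^a-z0-9]+", "-", description.lower()).strip("-")
    return slug or "untitled"


def archive_name(description: str, ext: str, year: Optional[int] = None,
                 month: Optional[int] = None, day: Optional[int] = None) -> str:
    """Build YYYY-MM-DD_slug.ext, using 00 for unknown parts and undated_ for no date."""
    slug = slugify(description)
    ext = ext.lstrip(".").lower()
    if year is None:
        return f"undated_{slug}.{ext}"
    if month is None:
        day = None  # a day without a month is not meaningful
    mm = month or 0
    dd = day or 0
    return f"{year:04d}-{mm:02d}-{dd:02d}_{slug}.{ext}"


print(archive_name("Letter to Grandma", "pdf", 1983, 6, 12))
# prints "1983-06-12_letter-to-grandma.pdf"
print(archive_name("Letter to Grandma", ".PDF", 1983, 6))
# prints "1983-06-00_letter-to-grandma.pdf"
print(archive_name("Letter to Grandma", "pdf"))
# prints "undated_letter-to-grandma.pdf"
```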

Safety

  • Source files are never modified or deleted — all operations produce copies
  • Duplicates are moved, never deleted — review at your leisure
  • Unclassifiable files go to NeedsReview — nothing is silently discarded
  • All operations support --dry-run — preview before committing
  • All operations are restartable — interrupted jobs resume where they left off
  • Proposals require review — renames, date changes, and splits are proposed then applied

Dependencies

Installed automatically via pip install -e ".[all]":

| Package | Purpose |
|---|---|
| PyMuPDF | PDF text extraction & page rendering |
| Pillow | Image processing |
| python-dotenv | Load API keys from .env |
| google-genai | Gemini AI for handwriting OCR |
| openai | OpenAI API (alternative to Gemini/Claude) |
| assemblyai | Audio transcription with speaker ID |
| anthropic | Transcript formatting with Claude |
| openai-whisper | Local audio transcription |
| exifread | EXIF metadata from photos |
| imagehash | Perceptual duplicate detection |
| python-docx | DOCX text extraction |
| openpyxl | XLSX spreadsheet extraction |
| xlrd | XLS (legacy) spreadsheet extraction |
| olefile | DOC (legacy Word) extraction |

System tools (install separately): Tesseract OCR, FFmpeg

Documentation

| File | Contents |
|---|---|
| docs/SETUP-API-KEYS.md | API keys, costs, recommended models, vendor options |
| docs/SYSTEM-REQUIREMENTS.md | OS, Python, disk space, RAM, system tools |
| docs/WORKFLOW.md | Step-by-step processing guide |
| docs/VISION.md | Long-term product vision and roadmap |

Roadmap

Open source (this repo):

  • Phase 1 ✅: CLI toolkit for documents, audio, and basic organization
  • Phase 2 ✅: Core library, SQLite/FTS5 search, document splitting, duplicate detection, Gemini batch processing, Office document extraction
  • Phase 3: Web UI, video transcription, email import, photo AI, embedding-based search

Subscription service (historytools.io):

  • Phase 3+: Web UI, managed AI gateway, photo AI, timeline/map/people graph, narrative generation, FamilySearch integration, multi-family collaboration

The open-source CLI is fully functional on its own. The subscription service adds a web UI, managed AI, and advanced visualization features. Data is fully portable between both. See docs/VISION.md for the complete roadmap.

License

MIT License — see LICENSE
