Historical document analysis CLI - Extract, analyze, and present handwritten text from document images
Flatfish
Features
- 📜 Handwritten Text Recognition (HTR) - Extract text from historical document images
- 🏷️ Named Entity Recognition - Identify people, places, dates, and more with contextual descriptions
- 📊 AI-Powered Summaries - Generate timelines, track changes, and suggest research questions
- 🌐 Static Website Builder - Create searchable, browsable document collections
Installation
pip install flatfish
Quick Start
# Initialize a new project
flatfish init
# Edit configuration
nano flatfish.yaml
nano .env
# Validate setup
flatfish validate
# Process documents
flatfish process
# Preview the site locally
flatfish serve
Configuration
flatfish.yaml
dataset:
  source: "username/dataset-name"
  splits:
    - "train"
  image_column: "image"
processing:
  extract_entities: true
  entity_context: true
summary:
  enabled: true
  model: "qwen-vl-max"
website:
  title: "Document Collection"
  password: "changeme"
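A quick sanity check of the configuration can catch obvious problems before running `flatfish validate`. The sketch below is not part of flatfish: `check_config` is a hypothetical helper, and it assumes the configuration has already been loaded into a plain dictionary whose top-level sections are `dataset`, `processing`, `summary`, and `website`.

```python
from typing import Any


def check_config(config: dict[str, Any]) -> list[str]:
    """Return a list of human-readable problems found in a config dict.

    Hypothetical helper, not flatfish's own validation logic.
    """
    problems: list[str] = []

    dataset = config.get("dataset", {})
    if not dataset.get("source"):
        problems.append("dataset.source is missing")
    if not dataset.get("splits"):
        problems.append("dataset.splits is empty")
    if not dataset.get("image_column"):
        problems.append("dataset.image_column is missing")

    # The example config ships with a placeholder password.
    website = config.get("website", {})
    if website.get("password") == "changeme":
        problems.append("website.password is still the placeholder 'changeme'")

    return problems


# A dictionary mirroring the example configuration.
example = {
    "dataset": {
        "source": "username/dataset-name",
        "splits": ["train"],
        "image_column": "image",
    },
    "processing": {"extract_entities": True, "entity_context": True},
    "summary": {"enabled": True, "model": "qwen-vl-max"},
    "website": {"title": "Document Collection", "password": "changeme"},
}

for problem in check_config(example):
    print(problem)
```

With the example values, the only reported problem is the unchanged placeholder password.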
.env
HUGGINGFACE_TOKEN=hf_xxxxxxxxxxxxx
DASHSCOPE_API_KEY=sk-xxxxxxxxxxxxx
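To confirm that both keys are present before processing, a small script can read the `.env` file and report anything missing. This is a minimal sketch using only the standard library. `parse_env` and `missing_keys` are hypothetical helpers, and treating both keys as required is an assumption based on the example above.

```python
# Keys shown in the example .env file; treating both as required is an assumption.
REQUIRED_KEYS = ("HUGGINGFACE_TOKEN", "DASHSCOPE_API_KEY")


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values: dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(values: dict[str, str]) -> list[str]:
    """Return the required keys that are absent or empty."""
    return [key for key in REQUIRED_KEYS if not values.get(key)]


sample = """
# API credentials
HUGGINGFACE_TOKEN=hf_xxxxxxxxxxxxx
"""
print(missing_keys(parse_env(sample)))
```

To check a real project, read the file first, for example `parse_env(open(".env").read())`.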
Commands
| Command | Description |
|---|---|
| flatfish init | Initialize a new project |
| flatfish process | Run the full pipeline |
| flatfish extract | Extract text from images only |
| flatfish entities | Extract entities only |
| flatfish summarize | Generate AI summary only |
| flatfish build | Build static site only |
| flatfish serve | Preview site locally |
| flatfish deploy | Deploy to Netlify |
| flatfish status | Show processing status |
| flatfish validate | Validate configuration |
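Because each stage has its own command, a failed run can be resumed from a later stage instead of repeating the whole pipeline. The sketch below shows one way to script that from Python. The stage order (extract, entities, summarize, build) is an assumption inferred from the table, and `stage_commands` and `run_stages` are hypothetical helpers rather than part of flatfish.

```python
import subprocess

# Assumed stage order; the command table lists the stages but not their order.
STAGES = ["extract", "entities", "summarize", "build"]


def stage_commands(start: str = "extract") -> list[list[str]]:
    """Return the flatfish commands from `start` through the last stage."""
    first = STAGES.index(start)  # raises ValueError for an unknown stage
    return [["flatfish", stage] for stage in STAGES[first:]]


def run_stages(start: str = "extract", dry_run: bool = True) -> None:
    """Print the commands, or run them and stop at the first failure."""
    for command in stage_commands(start):
        if dry_run:
            print(" ".join(command))
        else:
            subprocess.run(command, check=True)


# Dry run: resume from the summary stage.
run_stages("summarize")
```

Passing `dry_run=False` runs the commands for real; `check=True` makes `subprocess.run` raise if a stage exits with a non-zero status, so later stages do not run on incomplete output.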
Deployment
Deploy your site to Netlify:
# Install netlify-python
pip install netlify-python
# Set your Netlify token and site ID (create a token at https://app.netlify.com/user/applications)
export NETLIFY_TOKEN=your-token
export NETLIFY_SITE_ID=your-site-id
# Deploy a draft preview
flatfish deploy
# Deploy to production
flatfish deploy --prod
# Specify a site ID directly
flatfish deploy --prod --site your-site-id
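In a script or CI job, it can help to fail early when the token is missing instead of discovering it mid-deploy. The sketch below builds the deploy command from an environment mapping. `deploy_command` is a hypothetical helper, not part of flatfish, and it uses only the `--prod` and `--site` flags shown above.

```python
from typing import Mapping, Optional


def deploy_command(
    env: Mapping[str, str], prod: bool = False, site: Optional[str] = None
) -> list[str]:
    """Build the `flatfish deploy` argument list, failing fast without a token."""
    if not env.get("NETLIFY_TOKEN"):
        raise RuntimeError("NETLIFY_TOKEN is not set")

    command = ["flatfish", "deploy"]
    if prod:
        command.append("--prod")
    # The source only shows --site combined with --prod; using it for a
    # draft deploy is an untested assumption.
    if site:
        command += ["--site", site]
    return command


# Example with a stand-in environment; pass os.environ in real use.
sample_env = {"NETLIFY_TOKEN": "your-token"}
print(deploy_command(sample_env, prod=True, site="your-site-id"))
```

The returned list can be passed to `subprocess.run(command, check=True)` to perform the deploy.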
Output
project/
├── transcriptions/ # Extracted text files
├── entities/ # Entity JSON files
├── summaries/ # AI-generated summaries
└── _site/ # Built static website
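A short script can summarize how much output each stage has produced so far, alongside `flatfish status`. This sketch assumes only the directory names shown above and makes no assumption about file names or extensions. `count_outputs` is a hypothetical helper.

```python
from pathlib import Path

# Output directories from the project layout above.
OUTPUT_DIRS = ("transcriptions", "entities", "summaries", "_site")


def count_outputs(project_root: str) -> dict[str, int]:
    """Count files (recursively) in each output directory; missing dirs count as 0."""
    counts: dict[str, int] = {}
    for name in OUTPUT_DIRS:
        folder = Path(project_root) / name
        if folder.is_dir():
            counts[name] = sum(1 for path in folder.rglob("*") if path.is_file())
        else:
            counts[name] = 0
    return counts


# Report on the current directory.
for name, count in count_outputs(".").items():
    print(f"{name}: {count} file(s)")
```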
License
MIT
Disclosure of Delegation to Generative AI
The authors declare the use of generative AI in the research and writing process. According to the GAIDeT taxonomy (2025), the following tasks were delegated to GAI tools under full human supervision:
- Code generation
- Code optimization
The GAI tool used was: Claude Sonnet. Responsibility for the final manuscript lies entirely with the authors. GAI tools are not listed as authors and do not bear responsibility for the final outcomes.
Declaration submitted by: Andrew Janco
File details
Details for the file flatfish-0.1.0.tar.gz.
File metadata
- Download URL: flatfish-0.1.0.tar.gz
- Upload date:
- Size: 59.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.14
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | fcb3bd40cd7e839b22b40c3b98c97928f502018e356b9c5e76fa665d25017676 |
| MD5 | 622c4477e8036e0f8d48080e081781bf |
| BLAKE2b-256 | 84b596a133cc2fc63171e751aab18ea83258353212b50045c39f5fbe88ee7019 |
File details
Details for the file flatfish-0.1.0-py3-none-any.whl.
File metadata
- Download URL: flatfish-0.1.0-py3-none-any.whl
- Upload date:
- Size: 74.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.14
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | f8355d2fd4876975a840a94e9755be37889b0020d4ef3f23a165d4efbceb4e58 |
| MD5 | 67d19ffe8545d3bca3f9e7706e3e426e |
| BLAKE2b-256 | 656e4a02aec666b8f7162191da309ccd4ce28352e525250058b4e3c56e734b9e |