CLI pipeline that extracts anime titles, URLs, and recipes from Instagram screenshots
paku
CLI tool that turns Instagram screenshots into structured data. Feed it a screenshot. It runs OCR (Google Cloud Vision), figures out whether you've shown it an anime recommendation, a GitHub link, or a recipe, pulls the relevant fields, and writes them somewhere you can use.
What it does
Three extractors:
- URL: 4-tier cascade tested on 34 real screenshots. Matches full URLs (github.com, arxiv.org, etc.), spots non-GitHub domains via a curated TLD allowlist, rebuilds GitHub author/repo from repo-card layouts, and stubs project-name-only cases for manual review. Survives browser-bar truncation (with or without a visible ellipsis), hyphen-broken URLs, and social-platform false positives. Phase 1 gate: Tier 1 100%, Tier 2-3 71.4%, Tier 4 100%, zero false positives.
- Anime: 10-pattern title cascade plus AniList GraphQL enrichment. Strips Instagram UI chrome (15+ filter categories), recognises platform context (AniList app, TikTok, Threads), and pulls every title out of carousel and numbered-list posts. An enhanced Levenshtein ratio (substring containment plus a word-overlap boost) decides auto-accept (>= 0.8) vs review queue. Phase 2 gate: 30/30 = 100% auto-accepted.
- Recipe: multilingual ingredient-block detection (English and Italian anchors). Splits every line into quantity, unit, and name. Never stored as "100g", always as {qty: 100, unit: "g"}. Handles unicode fractions, wrapped OCR lines, the reversed metric-parens format giallozafferano.com uses, instructions extraction, and source-account detection. Outputs .txt, .csv, and .json. Phase 3 gate: 10/10 = 100%.
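The quantity/unit/name split described in the Recipe bullet can be illustrated with a minimal sketch. This is not paku's actual code: the function name `split_ingredient`, the unit list, and the regular expression are assumptions made for illustration, and the sketch ignores wrapped lines, mixed numbers such as "1½", and the metric-parens format.

```python
import re
import unicodedata

# Hypothetical unit list; the units paku really recognises are not documented here.
UNITS = r"kg|g|ml|l|tbsp|tsp|cups?"
# Single-character vulgar fractions that OCR commonly returns.
VULGAR = "¼½¾⅓⅔⅛⅜⅝⅞"

LINE_RE = re.compile(
    rf"^\s*(?P<qty>\d+(?:[.,]\d+)?|[{VULGAR}])\s*"
    rf"(?:(?P<unit>{UNITS})(?![a-z]))?\s*(?P<name>.*)$",
    re.IGNORECASE,
)


def split_ingredient(line):
    """Split one ingredient line into qty, unit, and name; None if no quantity leads."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    raw = match.group("qty")
    if raw in VULGAR:
        # unicodedata.numeric maps a fraction character to its float value.
        qty = unicodedata.numeric(raw)
    else:
        # Accept both "1.5" and the Italian-style "1,5".
        qty = float(raw.replace(",", "."))
    unit = match.group("unit")
    return {
        "qty": qty,
        "unit": unit.lower() if unit else None,
        "name": match.group("name").strip(),
    }


print(split_ingredient("100g flour"))
print(split_ingredient("½ cup sugar"))
print(split_ingredient("3 lemons"))
```

The negative lookahead after the unit keeps "3 lemons" from being read as 3 litres of "emons"; a line with no leading quantity returns None so the caller can decide what to do with it.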
Anything the pipeline isn't confident about goes into the review queue instead of getting silently dropped.
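As a rough illustration of how a score could route a title to auto-accept or the review queue, here is a sketch of a Levenshtein ratio with a containment floor and a word-overlap boost. Only the 0.8 threshold comes from this README. The function names, the 0.85 containment floor, and the 0.1 boost weight are assumptions, and the real scoring in paku may differ.

```python
def levenshtein(a, b):
    """Classic edit distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def enhanced_ratio(candidate, official):
    """Similarity in [0, 1]; the floor and boost values are illustrative guesses."""
    a, b = candidate.casefold().strip(), official.casefold().strip()
    if not a or not b:
        return 0.0
    score = 1.0 - levenshtein(a, b) / max(len(a), len(b))
    if a in b or b in a:
        score = max(score, 0.85)   # assumed floor for substring containment
    words_a, words_b = set(a.split()), set(b.split())
    overlap = len(words_a & words_b) / len(words_a | words_b)
    score += 0.1 * overlap         # assumed weight for shared words
    return min(score, 1.0)


def route(candidate, official, threshold=0.8):
    """Return 'auto_accept' at or above the threshold, otherwise 'review'."""
    if enhanced_ratio(candidate, official) >= threshold:
        return "auto_accept"
    return "review"


print(route("Frieren", "Frieren: Beyond Journey's End"))
print(route("One Piece", "Naruto"))
```

The containment floor is what lets a shortened OCR title match a longer official title, where the plain edit-distance ratio alone would fall well below the threshold.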
paku serve starts a local dashboard (FastAPI plus a vanilla-JS SPA) for browsing what you've extracted, uploading new screenshots, and tracking watch status. The Collection tab has a "Recommended for you" panel that pulls AniList recommendations for your most recently saved title, marks entries you already own, and lets you add the rest with one click. No cloud accounts. SQLite-backed. Runs on 127.0.0.1. Phase 5 gate passed.
Status
v1.0.1 — three extractors and the dashboard are complete; the AniList recommendations panel ships with this release. 521 tests pass. CI runs on every push: lint, test matrix (Python 3.11 and 3.12), wheel build. Tagged v* pushes auto-publish to PyPI via OIDC Trusted Publishing.
The --smart flag enables a confidence-gated re-run: when fast-path extraction returns confidence < 0.4, the pipeline re-OCRs the image with a local Ollama VLM (Gemma 4) for richer text and re-extracts. It falls back cleanly to the fast-path result if Ollama is unavailable.
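The control flow of a confidence-gated re-run might look like the sketch below. Everything except the 0.4 threshold is hypothetical: `Extraction`, `digest_with_smart`, and the three callables stand in for paku's real engines and extractors, and keeping whichever result scores higher is an assumption rather than documented behaviour.

```python
from dataclasses import dataclass


@dataclass
class Extraction:
    text: str
    confidence: float


def digest_with_smart(image_path, fast_ocr, vlm_ocr, extract, threshold=0.4):
    """Run the fast path, then retry with a VLM only when confidence is low."""
    first = extract(fast_ocr(image_path))
    if first.confidence >= threshold:
        return first
    try:
        richer_text = vlm_ocr(image_path)
    except OSError:
        # ConnectionError is a subclass of OSError, so an unreachable
        # Ollama host ends up here and the fast-path result is kept.
        return first
    second = extract(richer_text)
    # Assumption: keep whichever attempt the extractor trusts more.
    return second if second.confidence > first.confidence else first


def fake_fast_ocr(path):
    return "blurry text"


def fake_vlm_ocr(path):
    return "Frieren: Beyond Journey's End"


def fake_extract(text):
    return Extraction(text, 0.9 if "Frieren" in text else 0.2)


result = digest_with_smart("shot.png", fake_fast_ocr, fake_vlm_ocr, fake_extract)
print(result.confidence)
```

Passing the engines in as callables keeps the gate testable without Google Cloud Vision or a running Ollama instance.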
Batch mode produces three consolidated text files, anime_titles.txt, urls.txt, and recipe_titles.txt (one entry per line, deduped), plus anime_export.csv (9 property columns, ready to import). Per-image JSON is written throughout.
Install
pip install paku # core + stub OCR (for testing)
pip install "paku[ocr]" # + Google Cloud Vision (real OCR)
pip install "paku[web]" # + FastAPI dashboard (paku serve)
pip install "paku[smart]" # + Ollama VLM (--smart flag)
Then set OCR credentials, either:
- GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account.json (env var), or
- google_vision.api_key: <key> in config.yaml
Google Cloud Vision free tier covers 1,000 images/month.
Development install
git clone https://github.com/loremcc/paku.git
cd paku
pip install -e ".[dev]"
Usage
# Single image
paku digest screenshot.png
# Single image — force extraction mode + output formats
paku digest screenshot.png --mode url --output json --output txt
# Smart re-run (re-OCR with Ollama VLM when confidence is low)
paku digest screenshot.png --mode anime --smart
# Batch — directory of images
paku digest ./screenshots/ --mode anime --output csv --output txt --output json
# Batch — resume interrupted run (default behavior: skips already-processed images)
paku digest ./screenshots/ --mode anime --output csv --resume
# Batch — start fresh, ignore checkpoint
paku digest ./screenshots/ --mode anime --output csv --no-resume
# Batch — print breakdown by content type after completion
paku digest ./screenshots/ --report
# Dashboard — browse collection, upload screenshots, manage watch status
paku serve
paku serve --port 8080 --host 127.0.0.1
Batch mode writes a .paku_checkpoint file in the output directory. Each successfully processed image is recorded there, so --resume (the default) skips it on the next run.
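A checkpoint of this kind can be sketched in a few lines. The on-disk format of .paku_checkpoint is not documented in this README, so the sketch assumes one image path per line; `load_checkpoint`, `run_batch`, and the `process` callable are hypothetical names.

```python
from pathlib import Path


def load_checkpoint(path):
    """Return the set of image names already recorded (assumed one per line)."""
    checkpoint = Path(path)
    if not checkpoint.exists():
        return set()
    lines = checkpoint.read_text(encoding="utf-8").splitlines()
    return {line for line in lines if line}


def run_batch(images, out_dir, process, resume=True):
    """Process images, skipping recorded ones unless resume is False."""
    checkpoint = Path(out_dir) / ".paku_checkpoint"
    if not resume and checkpoint.exists():
        checkpoint.unlink()  # start fresh, ignoring earlier progress
    done = load_checkpoint(checkpoint)
    processed = []
    for image in images:
        name = str(image)
        if name in done:
            continue
        process(image)  # if this raises, the image is never recorded
        # Append right after each success so an interruption loses at most one image.
        with checkpoint.open("a", encoding="utf-8") as handle:
            handle.write(name + "\n")
        processed.append(name)
    return processed
```

Recording an image only after `process` returns means a crash mid-image causes that image to be retried on the next run rather than skipped.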
Consolidated outputs written after a batch completes:
- --output txt → anime_titles.txt, urls.txt, recipe_titles.txt (one entry per line, deduped, sorted)
- --output csv with --mode anime → anime_export.csv (9 property columns, deduped by AniList ID)
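The two dedup rules above are simple enough to sketch. The helper names and the `anilist_id` key are assumptions for illustration; the actual column names in anime_export.csv are not listed in this README, and keeping the first row per ID is a guess.

```python
def consolidate_titles(titles):
    """One entry per line: strip whitespace, drop blanks and duplicates, sort."""
    return sorted({title.strip() for title in titles if title.strip()})


def dedupe_by_anilist_id(rows):
    """Keep the first row seen for each AniList ID (key name is hypothetical)."""
    first_seen = {}
    for row in rows:
        first_seen.setdefault(row["anilist_id"], row)
    return list(first_seen.values())


print(consolidate_titles(["Mushishi", "Frieren", "Mushishi ", ""]))
print(dedupe_by_anilist_id([
    {"anilist_id": 1, "title": "A"},
    {"anilist_id": 1, "title": "A (duplicate)"},
    {"anilist_id": 2, "title": "B"},
]))
```

Deduplicating on the AniList ID rather than the title string means two differently spelled OCR reads of the same show collapse into one CSV row.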
Config
Copy config.yaml.template to config.yaml and fill in your keys. The file is gitignored.
google_vision:
  api_key: ""            # or use GOOGLE_APPLICATION_CREDENTIALS env var
  credentials_file: ""   # path to service account JSON file
anilist:
  base_url: "https://graphql.anilist.co"
  confidence_threshold: 0.8
ollama:
  base_url: "http://localhost:11434"   # or LAN host running Ollama
  model: "gemma4-paku:latest"          # custom model (see Modelfile.paku)
Everything works with defaults except OCR credentials. The ollama section is optional — --smart falls back gracefully if Ollama is unavailable.
Tests
# All tests (521 currently)
python -m pytest
# With coverage
pytest --cov=paku --cov-report=term-missing
# Integration tests (require real OCR credentials + fixture images)
pytest tests/test_google_vision_engine.py -m integration -s
Test fixtures go in tests/fixtures/. Real screenshots are gitignored — populate them manually.
Roadmap
| Version | What | Status |
|---|---|---|
| v0.1 | Scaffold + OCR baseline | Done |
| v0.2 | URL extractor | Done (gate passed) |
| v0.3 | Anime extractor + AniList | Done (gate passed) |
| v0.4 | Recipe extractor | Done (gate passed) |
| v0.5 | Batch processing + anime CSV | Done (gate passed 2026-04-24) |
| v0.6 | Dashboard + product identity | Done (gate passed 2026-04-23) |
| v1.0 | Polish + open source | Done (2026-04-26) |
| v1.0.1 | AniList recommendations panel + PyPI auto-publish | Done (2026-04-28) |
Each version has an explicit gate — a minimum accuracy threshold or throughput test measured on real screenshots — that must pass before the next version starts.
Project structure
paku/
  cli.py               # Click commands (digest: single + batch, --resume/--no-resume, --report)
  pipeline.py          # OCR -> classify -> extract -> output; process_batch() + BatchReport
  config.py            # YAML config loader
  context.py           # Singleton: config + logger + OCR registry
  models.py            # Pydantic v2: OcrResult, ExtractionResult, URLExtractionResult, AnimeExtractionResult, RecipeExtractionResult, Ingredient
  ocr/
    base.py            # OCREngine ABC
    stub.py            # Fake engine for tests
    google_vision.py   # Google Cloud Vision (document_text_detection)
    ollama.py          # OllamaVLMEngine: smart re-run (stream-parsed NDJSON)
    router.py          # light/heavy/auto/smart strategy selection
  extractors/
    url.py             # 4-tier URL extraction cascade
    anime.py           # 10-pattern title cascade + AniList enrichment
    recipe.py          # multilingual ingredient block detection + qty/unit split
  outputs/
    json_out.py        # Pretty-printed JSON writer (per image)
    txt_out.py         # Per-image text writer + write_batch_txt() (consolidated, deduped)
    csv_out.py         # Recipe ingredient CSV (per image) + write_anime_csv() (post-batch import)
  web/
    database.py        # SQLite layer: Database class, ingest_pipeline_result, Pydantic models
    app.py             # FastAPI factory create_app(db_path), 9 endpoints
    static/
      index.html       # Vanilla JS + Tailwind SPA: Collection, Add, Review, Dashboard tabs
Modelfile.paku         # Ollama Modelfile for "gemma4-paku:latest" custom model
License
This project is licensed under the Mozilla Public License 2.0.