Read, annotate, train and decrypt text captchas with a CRNN+CTC model.
txtcaptcha
Read, annotate, train and decrypt text captchas in images with a modern CRNN + CTC pipeline in PyTorch.
txtcaptcha ships:
- a CRNN architecture that handles arbitrary input sizes and variable-length labels,
- the full alphanumeric vocabulary 0-9a-zA-Z (62 classes + CTC blank),
- decode-time masking so a single trained model can be restricted per site (e.g. mask="[0-9]"),
- fixed-length decoding via length=N for sites with a known length,
- a pretrained unified model hosted on the Hugging Face Hub with ~89% captcha-level accuracy across ten Brazilian court captcha datasets.
Installation
pip install txtcaptcha
Or from source with uv:
git clone https://github.com/jtrecenti/txtcaptcha
cd txtcaptcha
uv sync --extra dev
Quick start
The first decrypt call downloads the pretrained model from the Hugging Face
Hub into ~/.cache/huggingface/hub; subsequent calls reuse the cached copy.
from txtcaptcha import read_captcha, decrypt
cap = read_captcha("path/to/captcha.png")
print(decrypt(cap)) # greedy, variable length
print(decrypt(cap, mask="[0-9]")) # digits only
print(decrypt(cap, length=5)) # force exactly 5 chars
print(decrypt(cap, mask=list("abcdef0123"))) # explicit allowed set
Pin a specific release or load a different Hub repo explicitly:
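from txtcaptcha import decrypt, from_pretrained, read_captcha
# Load a pinned revision instead of the default pretrained model.
model = from_pretrained("jtrecenti/txtcaptcha-crnn", revision="v0.1.0")
cap = read_captcha("path/to/captcha.png")
print(decrypt(cap, model=model))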
Training your own model
from txtcaptcha import fit_model, save_model, download_dataset
data_dir = download_dataset("tjmg", "data")
model, history = fit_model(
data_dir,
epochs=30,
batch_size=64,
case_sensitive=False,
)
save_model(model, "tjmg.pt")
Publishing your own model to the Hub
from txtcaptcha import push_to_hub
push_to_hub(
model,
repo_id="your-username/your-captcha-model",
model_card="# My captcha model\n\nTrained on ...",
tag="v0.1.0",
)
Public API
| Function | Purpose |
|---|---|
| read_captcha(files, lab_in_path=False) | Load image(s) into a Captcha object. |
| Captcha | Container with images, labels, paths, plot(). |
| annotate(files, labels=None, ...) | Interactive/batch labeling (filename convention). |
| CaptchaDataset(root, vocab, height, case_sensitive) | PyTorch dataset over a folder of <id>_<label>.<ext> files. |
| transform_image(files, height=32) | Load + resize + width-pad for batching. |
| encode_label, decode_indices | Vocab ↔ tensor (CTC blank index 0). |
| pad_collate | DataLoader collate fn for variable-width batching. |
| CRNN(vocab, ...) | CNN + BiLSTM + linear head. |
| fit_model(dir, ...) | Training loop with CTC loss + early stopping. |
| decrypt(files, model=None, mask=None, case_sensitive=True, length=None) | Predict labels; auto-downloads the pretrained model when model=None. |
| save_model, load_model | Local checkpoint persistence. |
| from_pretrained, save_pretrained, push_to_hub | Hugging Face Hub integration. |
| download_dataset, available_datasets | Fetch labeled training datasets. |
| download_captchas (CLI) | Download live, unlabeled captchas from 10 Brazilian sources. |
| sequence_accuracy(preds, targets) | Exact-match accuracy metric. |
Full API reference: https://jtrecenti.github.io/txtcaptcha/.
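To make a few of the conventions in the table concrete, here is a minimal sketch of label encoding with the CTC blank at index 0, label extraction from an <id>_<label>.<ext> filename, and exact-match accuracy. The helper names below are hypothetical stand-ins written for illustration; they are not the library's own implementations, and the rule of splitting on the last underscore is an assumption.

```python
from pathlib import Path

# Vocabulary in the order 0-9, a-z, A-Z; index 0 is reserved for the CTC blank.
VOCAB = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"


def encode(label: str, vocab: str = VOCAB) -> list[int]:
    """Map each character to its 1-based vocabulary index (0 stays free for blank)."""
    return [vocab.index(ch) + 1 for ch in label]


def decode(indices: list[int], vocab: str = VOCAB) -> str:
    """Inverse of encode; blank indices (0) are skipped."""
    return "".join(vocab[i - 1] for i in indices if i != 0)


def label_from_path(path: str) -> str:
    """Take the text after the last underscore of the file stem (assumed convention)."""
    return Path(path).stem.rsplit("_", 1)[1]


def exact_match_accuracy(preds: list[str], targets: list[str]) -> float:
    """Fraction of predictions that equal their target string exactly."""
    if len(preds) != len(targets):
        raise ValueError("preds and targets must have the same length")
    if not targets:
        return 0.0
    return sum(p == t for p, t in zip(preds, targets)) / len(targets)


encoded = encode("a1Z")                                   # [11, 2, 62]
label = label_from_path("data/tjmg/00042_x7k2p.png")      # "x7k2p"
accuracy = exact_match_accuracy(["ab", "cd"], ["ab", "ce"])  # 0.5
```

Because the metric is exact match, a prediction with a single wrong character counts as a full miss, which is why captcha-level accuracy is lower than per-character accuracy.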
Architecture
CRNN is a Convolutional Recurrent Neural Network:
- CNN backbone: ResNet-style basic blocks (64 → 128 → 256 → 256 channels) with strided pooling. Down-samples height by 8 and width by 4, preserving width resolution for the sequence dimension.
- Adaptive pool: collapses the remaining height to 1, producing a width-indexed sequence of feature vectors.
- BiLSTM: 2-layer bidirectional LSTM (hidden 256).
- Linear head: projects to len(vocab) + 1 logits per timestep (the extra slot is the CTC blank).
- CTC loss: handles variable-length targets, no per-position softmax.
Variable image dimensions are handled by resizing height to 32 at load time,
preserving the aspect-ratio width, and padding widths within each batch via
pad_collate. CRNN+CTC is the de-facto baseline for short-text scene-text
recognition — lighter than transformer OCR (e.g. TrOCR) and consistently
strong on short captcha images.
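The shape arithmetic described above can be sketched in a few lines of plain Python. This is only an illustration: the function names are hypothetical, and the exact rounding used when resizing and the exact width-to-timestep mapping inside the backbone are assumptions, so the real library may differ by a pixel or a timestep.

```python
def resized_width(orig_w: int, orig_h: int, target_h: int = 32) -> int:
    """Width after scaling the image so that its height equals target_h."""
    return max(1, round(orig_w * target_h / orig_h))


def pad_amounts(widths: list[int]) -> list[int]:
    """Columns of right-padding each image needs to reach the batch maximum."""
    target = max(widths)
    return [target - w for w in widths]


def num_timesteps(width: int, width_stride: int = 4) -> int:
    """Approximate sequence length T when the backbone divides width by 4."""
    return max(1, width // width_stride)


# Three hypothetical captchas given as (width, height) in pixels.
sizes = [(140, 50), (180, 60), (100, 40)]
widths = [resized_width(w, h) for w, h in sizes]  # [90, 96, 80]
pads = pad_amounts(widths)                        # [6, 0, 16]
steps = num_timesteps(max(widths))                # 24
```

With a width stride of 4, a 5-character captcha resized to roughly 90 pixels wide gets about 22 timesteps, which leaves room for the blanks CTC needs between repeated characters.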
Variable-length labels
CRNN + CTC handles variable label lengths natively. The network emits one
logit vector for each of T timesteps per image; CTC collapsing (merge
consecutive repeats, then remove blanks) turns any path into a string of
length between 0 and T. Training mixes 4-char and 5-char labels in the same
batch, with no length head and no padding tokens.
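The collapsing rule is short enough to write out. The sketch below is a plain-Python illustration of greedy CTC decoding (argmax per timestep, merge repeats, drop blanks), assuming the blank sits at index 0 and the vocabulary is ordered 0-9, a-z, A-Z; the function names are illustrative and not taken from the library.

```python
BLANK = 0
VOCAB = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"


def ctc_collapse(path: list[int], blank: int = BLANK) -> list[int]:
    """Merge consecutive repeats, then drop blanks, in a single pass."""
    out = []
    prev = None
    for idx in path:
        # Keep an index only when it starts a new run and is not the blank.
        if idx != prev and idx != blank:
            out.append(idx)
        prev = idx
    return out


def greedy_decode(logits: list[list[float]], vocab: str = VOCAB) -> str:
    """Pick the argmax class at every timestep and collapse the resulting path."""
    path = [max(range(len(row)), key=row.__getitem__) for row in logits]
    return "".join(vocab[i - 1] for i in ctc_collapse(path))


# 'a' is index 11 and 'b' is index 12; the blank between the two runs of 11
# is what lets the repeated letter survive collapsing.
demo_path = [11, 11, 0, 11, 12, 0, 0]
collapsed = ctc_collapse(demo_path)  # [11, 11, 12], i.e. "aab"
```

Without the blank at position 2, the three consecutive 11s would merge into a single "a", which is why CTC needs at least one blank between identical neighbouring characters.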
The downside of greedy CTC is that a confident wrong timestep can yield a
prediction of the wrong length. When you know the expected length, pass
length= to switch to an exact dynamic-programming search over CTC paths
that collapse to exactly that many characters:
decrypt(cap) # greedy
decrypt(cap, length=5) # force 5 chars
decrypt(cap, length=4, mask="[0-9]") # combine with masking
The DP runs in O(T · L · |vocab|) per image: it tracks the best path for
every (collapsed_count, last_index) state and reconstructs the argmax. When
the true length is known, it is at least as good as greedy decoding and
never emits a wrong-length prediction.
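As a rough illustration of such a search, the sketch below runs a Viterbi-style DP over (collapsed_count, last_index) states on per-timestep log-probabilities. It is written from the description above under stated assumptions (blank at index 0, list-of-lists input, hypothetical function names) and is not the library's code. Keeping the best and runner-up previous state per count is one way to stay within O(T · L · |vocab|).

```python
import math

NEG_INF = float("-inf")


def top2(row):
    """Indices of the largest and second-largest entries of row."""
    first = second = None
    for i, v in enumerate(row):
        if first is None or v > row[first]:
            first, second = i, first
        elif second is None or v > row[second]:
            second = i
    return first, second


def fixed_length_decode(log_probs, length, blank=0):
    """Best CTC path whose collapsed output has exactly `length` symbols."""
    T, K = len(log_probs), len(log_probs[0])
    # score[c][s]: best log-score of a prefix collapsing to c symbols, ending on s.
    score = [[NEG_INF] * K for _ in range(length + 1)]
    for s in range(K):
        c = 0 if s == blank else 1
        if c <= length:
            score[c][s] = log_probs[0][s]
    back = []  # one table of (prev_count, prev_index) pointers per timestep
    for t in range(1, T):
        tops = [top2(row) for row in score]
        new = [[NEG_INF] * K for _ in range(length + 1)]
        ptr = [[(0, blank)] * K for _ in range(length + 1)]
        for c in range(length + 1):
            # Emitting a blank keeps the count; any previous index is allowed.
            b = tops[c][0]
            new[c][blank] = score[c][b] + log_probs[t][blank]
            ptr[c][blank] = (c, b)
            for k in range(K):
                if k == blank:
                    continue
                prev, val = (c, k), score[c][k]  # repeat k: count unchanged
                if c > 0:
                    a1, a2 = tops[c - 1]
                    a = a1 if a1 != k else a2  # new symbol: previous index != k
                    if score[c - 1][a] > val:
                        prev, val = (c - 1, a), score[c - 1][a]
                new[c][k] = val + log_probs[t][k]
                ptr[c][k] = prev
        back.append(ptr)
        score = new
    end = max(range(K), key=lambda s: score[length][s])
    if score[length][end] == NEG_INF:
        raise ValueError("no CTC path collapses to the requested length")
    # Follow the pointers backwards to recover the full path.
    path, c, s = [end], length, end
    for ptr in reversed(back):
        c, s = ptr[c][s]
        path.append(s)
    path.reverse()
    out, prev = [], None
    for idx in path:  # standard CTC collapse of the recovered path
        if idx != prev and idx != blank:
            out.append(idx)
        prev = idx
    return out


# Toy input with classes (blank, 'a', 'b'); greedy would pick a, blank, blank.
probs = [[0.1, 0.8, 0.1], [0.5, 0.1, 0.4], [0.6, 0.3, 0.1]]
toy = [[math.log(p) for p in row] for row in probs]
result = fixed_length_decode(toy, length=2)  # [1, 2], i.e. "ab"
```

On the toy input, greedy decoding yields a single character, while the constrained search picks the path a, b, blank (probability 0.192), the most likely path that collapses to two symbols.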
Decode-time masking
decrypt(..., mask=...) masks out the logits of forbidden vocabulary entries
before CTC decoding, so the same trained model can be specialized per site:
decrypt(cap, mask=["a", "b", "c", "1", "2", "3"]) # explicit list
decrypt(cap, mask="[0-9a-z]") # regex char-class
decrypt(cap, mask="[A-Z]", case_sensitive=True) # uppercase only
decrypt(cap, mask="[a-z]", case_sensitive=False) # output lowercased
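One way such a mask can work is sketched below: resolve the mask to a set of allowed characters, then make every other class unselectable before taking the argmax. Setting the forbidden logits to negative infinity and always keeping the CTC blank are assumptions about a typical implementation, and the helper names are hypothetical; the library's internals may differ.

```python
import re

BLANK = 0
VOCAB = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"


def allowed_chars(mask, vocab: str = VOCAB) -> set[str]:
    """Resolve a regex character class or an explicit list to a set of vocab chars."""
    if isinstance(mask, str):
        pattern = re.compile(mask)
        return {ch for ch in vocab if pattern.fullmatch(ch)}
    return {ch for ch in mask if ch in vocab}


def apply_mask(logits_row: list[float], mask, vocab: str = VOCAB) -> list[float]:
    """Copy one timestep's logits with forbidden classes set to -inf."""
    keep = allowed_chars(mask, vocab)
    out = list(logits_row)
    # Index 0 is the CTC blank and is never masked; characters start at index 1.
    for i, ch in enumerate(vocab, start=1):
        if ch not in keep:
            out[i] = float("-inf")
    return out


# One timestep where 'a' (index 11) scores highest and '7' (index 8) is second.
row = [0.0] * (len(VOCAB) + 1)
row[BLANK], row[8], row[11] = 1.0, 3.0, 5.0
masked = apply_mask(row, "[0-9]")
best = max(range(len(masked)), key=masked.__getitem__)  # 8, the digit '7'
```

Because the mask is applied at decode time only, the trained weights are untouched and different sites can use different masks with the same checkpoint.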
Notebooks
- notebooks/train_unified_model.ipynb: downloads every dataset, merges them and trains the unified CRNN. Designed for a cloud GPU machine.
- notebooks/eval_per_dataset.ipynb: per-dataset accuracy on a held-out split.
- notebooks/eval_per_dataset_live.ipynb: predictions on freshly downloaded, unlabeled captchas (overfit check).
Tests
uv run pytest
License
MIT © Julio Trecenti