# latincy-preprocess
Latin text preprocessing: U/V normalization, long-s OCR correction, diacritics stripping, and macron removal — with optional Rust acceleration and spaCy integration.
Consolidates latincy-uv and latincy-long-s into a single package.
## Installation

```bash
pip install latincy-preprocess
```

For the spaCy pipeline components:

```bash
pip install latincy-preprocess[spacy]
```
## Quick Start

```python
from latincy_preprocess import normalize

normalize("Gallia eft omnis diuisa in partes tres")
# 'Gallia est omnis divisa in partes tres'
```
## Per-Normalizer Usage

### U/V Normalization
Converts u-only Latin spelling to proper u/v distinction using rule-based analysis:

```python
from latincy_preprocess import normalize_uv

normalize_uv("Arma uirumque cano")
# 'Arma virumque cano'
```
Rules handle digraphs (qu), trigraphs (ngu), morphological exceptions (cui, fuit), positional context (initial, intervocalic, post-consonant), and case preservation.
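The positional rules can be illustrated with a simplified, self-contained sketch (hypothetical code, far less complete than the package's actual rule set, and ignoring case preservation):

```python
# Simplified sketch of rule-based u/v normalization. Hypothetical:
# the real rules also cover trigraphs, morphological exceptions,
# and case preservation.
VOWELS = set("aeiou")

def normalize_uv_sketch(word: str) -> str:
    chars = list(word.lower())
    out = []
    for i, c in enumerate(chars):
        if c != "u":
            out.append(c)
            continue
        prev = chars[i - 1] if i > 0 else ""
        nxt = chars[i + 1] if i + 1 < len(chars) else ""
        if prev == "q":
            # keep the digraph 'qu' intact: quoque stays quoque
            out.append("u")
        elif i == 0 and nxt in VOWELS:
            # word-initial u before a vowel is consonantal: uirum -> virum
            out.append("v")
        elif prev in VOWELS and nxt in VOWELS:
            # intervocalic u is consonantal: diuisa -> divisa
            out.append("v")
        else:
            out.append("u")
    return "".join(out)
```

With these three rules alone, `normalize_uv_sketch("diuisa")` yields `'divisa'`, while `"cui"` is left unchanged because its `u` is neither word-initial nor intervocalic.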
### Long-S OCR Correction

Corrects OCR errors where historical long-s (ſ) was misread as f, using n-gram frequency analysis from Latin treebank data:

```python
from latincy_preprocess import LongSNormalizer

normalizer = LongSNormalizer()

word, rules = normalizer.normalize_word_full("ftatua")
# ('statua', [TransformationRule(...)])

text = normalizer.normalize_text_full("funt in fundamento reipublicae ftatua")
# 'sunt in fundamento reipublicae statua'
```
Two-pass strategy: Pass 1 applies high-confidence rules (impossible bigrams like ft, fp, fc). Pass 2 uses 4-gram frequency disambiguation for ambiguous word-initial f- patterns.
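The two-pass idea can be sketched in a few lines (toy rules and toy frequency counts; the package's bigram rules and 4-gram tables are far more extensive):

```python
# Pass 1: in Latin, 'f' cannot precede t/p/c, so such an 'f' must be
# a misread long-s and is rewritten to 's' unconditionally.
def pass1(word: str) -> str:
    out = list(word)
    for i in range(len(out) - 1):
        if out[i] == "f" and out[i + 1] in "tpc":
            out[i] = "s"
    return "".join(out)

# Pass 2: a word-initial 'f' before a vowel is ambiguous ('funt' could
# be 'sunt', but 'fert' is a real word), so compare corpus frequencies.
# Toy counts standing in for the package's n-gram tables:
FREQ = {"sunt": 1000, "funt": 0, "fert": 50, "sert": 0}

def pass2(word: str) -> str:
    if word.startswith("f"):
        candidate = "s" + word[1:]
        if FREQ.get(candidate, 0) > FREQ.get(word, 0):
            return candidate
    return word

def correct(word: str) -> str:
    return pass2(pass1(word))
```

Here `correct("ftatua")` is fixed by pass 1 alone, `correct("funt")` by the pass-2 frequency comparison, and `correct("fert")` is correctly left untouched.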
### Diacritics and Macrons

```python
from latincy_preprocess import strip_diacritics, strip_macrons

strip_macrons("ārma")
# 'arma'

strip_diacritics("λόγος")
# 'λογος'
```
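Diacritic stripping of this kind is typically done with Unicode NFD decomposition; a minimal self-contained sketch (not necessarily the package's internals):

```python
import unicodedata

def strip_diacritics_sketch(text: str) -> str:
    # Decompose each character into base letter + combining marks,
    # drop the combining marks, then recompose what remains.
    decomposed = unicodedata.normalize("NFD", text)
    base_only = "".join(c for c in decomposed if not unicodedata.combining(c))
    return unicodedata.normalize("NFC", base_only)
```

A macron-only variant would filter just U+0304 (COMBINING MACRON) from the decomposed form, leaving other marks intact.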
## spaCy Integration

Three pipeline components are available as spaCy factories.

### Unified Preprocessor (recommended)

Chains long-s correction → U/V normalization in the correct order:

```python
import spacy

nlp = spacy.blank("la")
nlp.add_pipe("latin_preprocessor")

doc = nlp("Gallia eft omnis diuisa in partes tres")
doc._.preprocessed           # 'Gallia est omnis divisa in partes tres'
doc[2]._.preprocessed        # 'est'
doc[2]._.preprocessed_lemma  # normalized lemma
```

Either normalizer can be disabled:

```python
nlp.add_pipe("latin_preprocessor", config={"uv": False})
nlp.add_pipe("latin_preprocessor", config={"long_s": False})
```
### Standalone Components

```python
nlp.add_pipe("uv_normalizer")
# doc._.uv_normalized, token._.uv_normalized, token._.uv_normalized_lemma

nlp.add_pipe("long_s_normalizer")
# doc._.long_s_normalized, token._.long_s_normalized
```
## Rust Backend

When compiled with maturin, a Rust backend provides roughly 3x the throughput of the pure-Python implementation for both normalizers. The backend is selected automatically:

```python
from latincy_preprocess import backend

backend()  # 'rust' or 'python'
```

The pure-Python backend is fully functional and serves as the fallback.
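Automatic selection of a native backend usually follows the standard try-import pattern; a sketch under that assumption (the extension module name here is hypothetical):

```python
def select_backend() -> str:
    """Return 'rust' if a compiled extension is importable, else 'python'."""
    try:
        # Hypothetical name for the maturin-built extension module.
        import _latincy_preprocess_rs  # noqa: F401
        return "rust"
    except ImportError:
        return "python"
```

Because the fallback is decided once at import time, callers never need to know which implementation is active.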
## Accuracy

### U/V Normalization
| Dataset | Accuracy |
|---|---|
| Curated test set (100 sentences) | 100% |
| UD Latin PROIEL (~21K u/v chars) | ~98% |
| UD Latin Perseus (~18K u/v chars) | ~97% |
### Long-S Correction
Pass 1 rules have a measured 0.00% false-positive rate. Pass 2 disambiguation uses a protected allowlist of ~170 common Latin f- words (inline in `long_s/_rules.py`) plus n-gram frequency tables (JSON files in `long_s/data/ngrams/`).
## Changelog

See CHANGELOG.md for release history.
## Citation

```bibtex
@software{latincy_preprocess,
  title  = {latincy-preprocess: Text Preprocessing for LatinCy Projects},
  author = {Burns, Patrick J.},
  year   = {2026},
  url    = {https://github.com/latincy/latincy-preprocess}
}
```
## License

MIT