
latincy-preprocess

Latin text preprocessing: U/V normalization, long-s OCR correction, diacritics stripping, and macron removal — with optional Rust acceleration and spaCy integration.

Consolidates latincy-uv and latincy-long-s into a single package.

Installation

pip install latincy-preprocess

For spaCy pipeline components:

pip install latincy-preprocess[spacy]

Quick Start

from latincy_preprocess import normalize

normalize("Gallia eft omnis diuisa in partes tres")
# 'Gallia est omnis divisa in partes tres'

Per-Normalizer Usage

U/V Normalization

Converts u-only Latin spelling to proper u/v distinction using rule-based analysis:

from latincy_preprocess import normalize_uv

normalize_uv("Arma uirumque cano")
# 'Arma virumque cano'

Rules handle digraphs (qu), trigraphs (ngu), morphological exceptions (cui, fuit), positional context (initial, intervocalic, post-consonant), and case preservation.
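To make the positional rules concrete, here is a toy sketch in plain Python. It is not the package's implementation: it handles lowercase input only, covers just the qu/ngu digraph-trigraph cases plus initial and intervocalic position, and omits the morphological exception list and case preservation.

```python
VOWELS = "aeiou"

def uv_sketch(word: str) -> str:
    # Toy rule-based u/v normalization (lowercase only).
    chars = list(word.lower())
    out = []
    for i, c in enumerate(chars):
        if c != "u":
            out.append(c)
            continue
        prev = chars[i - 1] if i > 0 else ""
        nxt = chars[i + 1] if i + 1 < len(chars) else ""
        # Keep 'u' inside the digraph 'qu' and the trigraph 'ngu'.
        if prev == "q" or (prev == "g" and i >= 2 and chars[i - 2] == "n"):
            out.append("u")
        # Consonantal: word-initial or intervocalic 'u' before a vowel.
        elif nxt in VOWELS and (i == 0 or prev in VOWELS):
            out.append("v")
        else:
            out.append("u")
    return "".join(out)

uv_sketch("uirumque")  # 'virumque'
uv_sketch("diuisa")    # 'divisa'
```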

Long-S OCR Correction

Corrects OCR errors where historical long-s (ſ) was misread as f, using n-gram frequency analysis from Latin treebank data:

from latincy_preprocess import LongSNormalizer

normalizer = LongSNormalizer()

word, rules = normalizer.normalize_word_full("ftatua")
# ('statua', [TransformationRule(...)])

text = normalizer.normalize_text_full("funt in fundamento reipublicae ftatua")
# 'sunt in fundamento reipublicae statua'

Two-pass strategy: Pass 1 applies high-confidence rules (impossible bigrams like ft, fp, fc). Pass 2 uses 4-gram frequency disambiguation for ambiguous word-initial f- patterns.
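The Pass-1 idea can be sketched in a few lines. This toy version covers only the three impossible bigrams named above, works on lowercase text, and deliberately leaves ambiguous words for a Pass-2 frequency model; the package's actual rule set is larger.

```python
import re

# The three impossible bigrams named above; the real rule set is larger.
_IMPOSSIBLE_F = re.compile(r"f(?=[tpc])")

def pass1_sketch(text: str) -> str:
    # An 'f' that would start an impossible Latin bigram must be a
    # misread long-s, so rewrite it to 's'. Ambiguous word-initial f-
    # ('funt' vs 'sunt') is left untouched for Pass 2.
    return _IMPOSSIBLE_F.sub("s", text)

pass1_sketch("funt in fundamento reipublicae ftatua")
# 'funt in fundamento reipublicae statua' — 'funt' awaits Pass 2
```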

Diacritics and Macrons

from latincy_preprocess import strip_diacritics, strip_macrons

strip_macrons("ārma")
# 'arma'

strip_diacritics("λόγος")
# 'λογος'
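Both operations can be approximated with standard Unicode decomposition. A sketch of the idea, not the package's code: decompose to NFD, drop the unwanted combining marks, and recompose.

```python
import unicodedata

def strip_macrons_sketch(text: str) -> str:
    # Remove only the combining macron (U+0304), keeping other diacritics.
    nfd = unicodedata.normalize("NFD", text)
    return unicodedata.normalize("NFC", "".join(c for c in nfd if c != "\u0304"))

def strip_diacritics_sketch(text: str) -> str:
    # Remove every combining mark (Unicode category Mn); case is untouched.
    nfd = unicodedata.normalize("NFD", text)
    return unicodedata.normalize(
        "NFC", "".join(c for c in nfd if unicodedata.category(c) != "Mn")
    )

strip_macrons_sketch("ārma")      # 'arma'
strip_diacritics_sketch("λόγος")  # 'λογος'
```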

spaCy Integration

Three pipeline components are available as spaCy factories:

Unified Preprocessor (recommended)

Chains long-s correction → U/V normalization in the correct order:

import spacy

nlp = spacy.blank("la")
nlp.add_pipe("latin_preprocessor")

doc = nlp("Gallia eft omnis diuisa in partes tres")
doc._.preprocessed          # 'Gallia est omnis divisa in partes tres'
doc[2]._.preprocessed       # 'est'
doc[2]._.preprocessed_lemma # normalized lemma

Either normalizer can be disabled:

nlp.add_pipe("latin_preprocessor", config={"uv": False})
nlp.add_pipe("latin_preprocessor", config={"long_s": False})

Standalone Components

nlp.add_pipe("uv_normalizer")
# doc._.uv_normalized, token._.uv_normalized, token._.uv_normalized_lemma

nlp.add_pipe("long_s_normalizer")
# doc._.long_s_normalized, token._.long_s_normalized

Rust Backend

When compiled with maturin, a Rust backend provides ~3x throughput for both normalizers. The backend is selected automatically:

from latincy_preprocess import backend

backend()  # 'rust' or 'python'

The Python backend is fully functional and used as the fallback.

Accuracy

U/V Normalization

Dataset                               Accuracy
Curated test set (100 sentences)      100%
UD Latin PROIEL (~21K u/v chars)      ~98%
UD Latin Perseus (~18K u/v chars)     ~97%

Long-S Correction

Pass 1 rules have a 0.00% false positive rate. Pass 2 disambiguation uses a protected allowlist of ~170 common Latin f- words (inline in long_s/_rules.py) plus n-gram frequency tables (JSON files in long_s/data/ngrams/).
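A toy version of the Pass-2 disambiguator: compare the corpus frequency of the word-initial 4-gram read as f- against its s- variant, skipping allowlisted words. The frequency counts and allowlist entries below are invented for illustration; the package ships real tables and the ~170-word allowlist described above.

```python
# Invented counts, for illustration only.
NGRAM_FREQ = {"sunt": 9500, "funt": 0, "fors": 120, "sors": 90}
ALLOWLIST = {"facio", "fero", "fors"}  # hypothetical entries

def pass2_sketch(word: str) -> str:
    # Skip non-f-initial, protected, or too-short words.
    if not word.startswith("f") or word in ALLOWLIST or len(word) < 4:
        return word
    as_f, as_s = word[:4], "s" + word[1:4]
    # Rewrite only when the s- variant 4-gram is more frequent.
    if NGRAM_FREQ.get(as_s, 0) > NGRAM_FREQ.get(as_f, 0):
        return "s" + word[1:]
    return word

pass2_sketch("funt")  # 'sunt'
pass2_sketch("fors")  # 'fors' (allowlisted)
```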

Changelog

0.1.1

  • Fix: strip_diacritics() no longer lowercases text — now preserves original case. Lowercasing was an unintended side effect conflating two separate operations.

0.1.0

  • Initial release: U/V normalization, long-s OCR correction, diacritics stripping, macron removal, spaCy integration, optional Rust backend.

Citation

@software{latincy_preprocess,
  title = {latincy-preprocess: Text Preprocessing for LatinCy Projects},
  author = {Burns, Patrick J.},
  year = {2026},
  url = {https://github.com/diyclassics/latincy-preprocess}
}

License

MIT
