Latin text preprocessing: U/V normalization, long-s correction, and more

LatinCy Preprocess

Latin text preprocessing: U/V normalization, long-s OCR correction, diacritics stripping, macron removal, and Beta Code → Unicode Greek conversion — with optional Rust acceleration and spaCy integration.

Consolidates latincy-uv and latincy-long-s into a single package.

Installation

pip install latincy-preprocess

For spaCy pipeline components:

pip install latincy-preprocess[spacy]

Quick Start

from latincy_preprocess import normalize

normalize("Gallia eft omnis diuisa in partes tres")
# 'Gallia est omnis divisa in partes tres'

Per-Normalizer Usage

U/V Normalization

Converts u-only Latin spelling to the conventional u/v distinction using rule-based analysis:

from latincy_preprocess import normalize_uv

normalize_uv("Arma uirumque cano")
# 'Arma virumque cano'

Rules handle digraphs (qu), trigraphs (ngu), morphological exceptions (cui, fuit), positional context (initial, intervocalic, post-consonant), and case preservation.
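To make the positional rules concrete, here is a minimal sketch of how a rule-based u/v pass could look. It is not the package's implementation: the function names `toy_uv_word` and `toy_uv_text` are hypothetical, and only four rules are covered (the qu digraph, word-initial u before a vowel, intervocalic u, and u after l or r before a vowel). Words such as iuuenis or uua need the fuller rule set described above.

```python
import re

VOWELS = set("aeiou")
LIQUIDS = set("lr")  # assumption: only l/r trigger the post-consonant rule in this toy


def toy_uv_word(word: str) -> str:
    """Apply a few positional u -> v rules to one ASCII word (hypothetical helper)."""
    lower = word.lower()
    out = []
    for i, ch in enumerate(word):
        if ch not in ("u", "U"):
            out.append(ch)
            continue
        prev = lower[i - 1] if i > 0 else ""
        nxt = lower[i + 1] if i + 1 < len(lower) else ""
        if prev == "q" or nxt not in VOWELS:
            # 'qu' digraph, or u not followed by a vowel: keep vocalic u
            consonantal = False
        else:
            # word-initial, intervocalic, or after a liquid: consonantal v
            consonantal = prev == "" or prev in VOWELS or prev in LIQUIDS
        if consonantal:
            out.append("V" if ch == "U" else "v")  # preserve case
        else:
            out.append(ch)
    return "".join(out)


def toy_uv_text(text: str) -> str:
    """Run the word-level rules over every alphabetic token in a string."""
    return re.sub(r"[A-Za-z]+", lambda m: toy_uv_word(m.group()), text)


print(toy_uv_text("Arma uirumque cano"))
```

Because cui and fuit have u after a non-liquid consonant, this toy leaves them alone without an explicit exception list; a fuller rule set needs such a list once broader post-consonant rules are added.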

Long-S OCR Correction

Corrects OCR errors where historical long-s (ſ) was misread as f, using n-gram frequency analysis from Latin treebank data:

from latincy_preprocess import LongSNormalizer

normalizer = LongSNormalizer()

word, rules = normalizer.normalize_word_full("ftatua")
# ('statua', [TransformationRule(...)])

text = normalizer.normalize_text_full("funt in fundamento reipublicae ftatua")
# 'sunt in fundamento reipublicae statua'

Two-pass strategy: Pass 1 applies high-confidence rules (impossible bigrams like ft, fp, fc). Pass 2 uses 4-gram frequency disambiguation for ambiguous word-initial f- patterns.
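The two passes can be sketched roughly as follows. This is an illustration under stated assumptions, not the package's code: the function names are hypothetical, the protected word list is a tiny stand-in for the real allowlist, and the 4-gram counts are invented toy values rather than treebank frequencies. It handles lowercase words only.

```python
import re

# Pass 1: bigrams treated as impossible in Latin, so f is assumed to be a misread long-s
IMPOSSIBLE_BIGRAMS = {"ft": "st", "fp": "sp", "fc": "sc"}

# Pass 2 inputs (both are toy stand-ins, not the package's data)
PROTECTED_F_WORDS = {"fundamento", "fuit", "facit", "forma"}
TOY_4GRAM_COUNTS = {"sunt": 950, "funt": 0, "summ": 300, "fumm": 0}  # invented counts


def pass1(word: str) -> str:
    """Rewrite high-confidence bigrams wherever they occur in the word."""
    for bad, good in IMPOSSIBLE_BIGRAMS.items():
        word = word.replace(bad, good)
    return word


def pass2(word: str) -> str:
    """Disambiguate word-initial f- by comparing 4-gram counts for f- vs s-."""
    if len(word) < 4 or not word.startswith("f") or word in PROTECTED_F_WORDS:
        return word
    f_gram, s_gram = word[:4], "s" + word[1:4]
    if TOY_4GRAM_COUNTS.get(s_gram, 0) > TOY_4GRAM_COUNTS.get(f_gram, 0):
        return "s" + word[1:]
    return word


def toy_long_s(text: str) -> str:
    """Apply pass 1, then pass 2, to every alphabetic token."""
    return re.sub(r"[A-Za-z]+", lambda m: pass2(pass1(m.group())), text)


print(toy_long_s("funt in fundamento reipublicae ftatua"))
```

Running the high-confidence rules first means the frequency lookup only has to decide the genuinely ambiguous cases, and unseen 4-grams (count 0 on both sides) leave the word unchanged.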

Diacritics and Macrons

from latincy_preprocess import strip_diacritics, strip_macrons

strip_macrons("ārma")
# 'arma'

strip_diacritics("λόγος")
# 'λογος'
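For readers curious how this kind of stripping can work, a common approach uses Unicode decomposition from the standard library. The sketch below assumes that approach; the function names are hypothetical and the package's own implementation may differ.

```python
import unicodedata

MACRON = "\u0304"  # COMBINING MACRON


def toy_strip_macrons(text: str) -> str:
    """Remove only combining macrons, keeping every other diacritic."""
    decomposed = unicodedata.normalize("NFD", text)
    return unicodedata.normalize("NFC", decomposed.replace(MACRON, ""))


def toy_strip_diacritics(text: str) -> str:
    """Remove all combining marks (accents, breathings, macrons)."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(c for c in decomposed if not unicodedata.combining(c))
    return unicodedata.normalize("NFC", kept)


print(toy_strip_macrons("ārma"))
print(toy_strip_diacritics("λόγος"))
```

Decomposing to NFD first separates each base letter from its marks, so the two functions differ only in which marks they drop.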

Beta Code → Unicode Greek

Latin prose corpora often encode embedded Greek quotations as TLG/Perseus-style Beta Code. Convert it to polytonic Unicode (NFC):

from latincy_preprocess import beta_to_unicode

beta_to_unicode("zei/dwros a)/roura")
# 'ζείδωρος ἄρουρα'
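As a rough illustration of what such a conversion involves, the sketch below maps Beta Code letters to Greek, turns diacritic markers into combining characters, and lets NFC normalization compose the result. It is a simplified stand-in with a hypothetical name (`toy_beta_to_unicode`), not the adapted CLTK algorithm: it handles lowercase input only and ignores the `*` capital marker and other Beta Code escapes.

```python
import re
import unicodedata

# Beta Code letter -> Greek letter (lowercase only in this toy)
LETTERS = dict(zip("abgdezhqiklmncoprstufxyw", "αβγδεζηθικλμνξοπρστυφχψω"))

# Beta Code diacritic marker -> Unicode combining character
MARKS = {
    ")": "\u0313",   # smooth breathing
    "(": "\u0314",   # rough breathing
    "/": "\u0301",   # acute
    "\\": "\u0300",  # grave
    "=": "\u0342",   # circumflex (perispomeni)
    "|": "\u0345",   # iota subscript
    "+": "\u0308",   # diaeresis
}


def toy_beta_to_unicode(text: str) -> str:
    """Convert a lowercase Beta Code span to precomposed (NFC) Greek."""
    greek = "".join(LETTERS.get(c) or MARKS.get(c) or c for c in text.lower())
    # Word-final sigma: a sigma not followed by a Greek letter or combining mark
    greek = re.sub(r"σ(?![α-ω\u0300-\u036f])", "ς", greek)
    return unicodedata.normalize("NFC", greek)


print(toy_beta_to_unicode("zei/dwros a)/roura"))
```

Note that the marker table treats every parenthesis and slash as a diacritic, which is exactly why input of this kind has to be isolated Beta Code rather than running prose.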

Note: this transliterates every ASCII letter to Greek, so apply it only to isolated Beta Code spans, not mixed Latin/Greek text. Use is_betacode() to guard or segment input:

from latincy_preprocess import beta_to_unicode, is_betacode

span = "a)/nqrwpos"
clean = beta_to_unicode(span) if is_betacode(span) else span
# 'ἄνθρωπος'  —  Latin spans are left untouched

is_betacode() is a heuristic (Beta Code written with no diacritics is indistinguishable from Latin), but it reliably catches accented Greek and ignores ordinary Latin punctuation.
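A heuristic of this kind can be approximated by looking for diacritic markers attached to letters. The sketch below is one possible approach with hypothetical names (`looks_like_betacode`, `tag_tokens`); it is not the logic behind is_betacode(), and like any such heuristic it cannot recognise Beta Code written without diacritics.

```python
import re

# Signals: a letter followed by an accent marker (optionally after a breathing),
# or a breathing marker sandwiched between two letters.
BETA_SIGNAL = re.compile(r"[A-Za-z][()]?[/\\=]|[A-Za-z][()][A-Za-z]")


def looks_like_betacode(span: str) -> bool:
    """Guess whether a span contains accented Beta Code (hypothetical heuristic)."""
    return bool(BETA_SIGNAL.search(span))


def tag_tokens(text: str) -> list:
    """Split on whitespace and flag each token, as a first step toward segmentation."""
    return [(tok, looks_like_betacode(tok)) for tok in text.split()]


print(tag_tokens("Graece a)/nqrwpos dicitur (ut dixi)"))
```

Parentheses at the edge of a Latin word do not match either pattern, so ordinary parenthetical remarks are left unflagged, while a token such as lo/gos or e)n is flagged.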

spaCy Integration

Three pipeline components are available as spaCy factories:

Unified Preprocessor (recommended)

Chains long-s correction → U/V normalization in the correct order:

import spacy

nlp = spacy.blank("la")
nlp.add_pipe("latin_preprocessor")

doc = nlp("Gallia eft omnis diuisa in partes tres")
doc._.preprocessed          # 'Gallia est omnis divisa in partes tres'
doc[2]._.preprocessed       # 'est'
doc[2]._.preprocessed_lemma # normalized lemma

Either normalizer can be disabled:

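# Use one of the following per pipeline (adding the same component twice raises an error):
nlp.add_pipe("latin_preprocessor", config={"uv": False})      # long-s correction only
nlp.add_pipe("latin_preprocessor", config={"long_s": False})  # U/V normalization only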
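The effect of the two flags can be pictured as building an ordered list of steps. The sketch below is a plain-Python illustration that needs no spaCy install; the stand-in normalizers and the `make_preprocessor` helper are hypothetical and only mirror the `long_s` and `uv` option names shown above.

```python
from typing import Callable, List


def toy_long_s(text: str) -> str:
    """Stand-in for long-s correction (single hard-coded fix)."""
    return text.replace("eft", "est")


def toy_uv(text: str) -> str:
    """Stand-in for U/V normalization (single hard-coded fix)."""
    return text.replace("diuisa", "divisa")


def make_preprocessor(long_s: bool = True, uv: bool = True) -> Callable[[str], str]:
    """Build a function that runs the enabled steps, long-s first, then U/V."""
    steps: List[Callable[[str], str]] = []
    if long_s:
        steps.append(toy_long_s)
    if uv:
        steps.append(toy_uv)

    def run(text: str) -> str:
        for step in steps:
            text = step(text)
        return text

    return run


sentence = "Gallia eft omnis diuisa in partes tres"
print(make_preprocessor()(sentence))
print(make_preprocessor(uv=False)(sentence))
```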

Standalone Components

nlp.add_pipe("uv_normalizer")
# doc._.uv_normalized, token._.uv_normalized, token._.uv_normalized_lemma

nlp.add_pipe("long_s_normalizer")
# doc._.long_s_normalized, token._.long_s_normalized

Rust Backend

When compiled with maturin, a Rust backend provides ~3x throughput for both normalizers. The backend is selected automatically:

from latincy_preprocess import backend

backend()  # 'rust' or 'python'

The Python backend is fully functional and used as the fallback.
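Automatic selection of this kind is commonly done with an import-time fallback. The sketch below shows that general pattern only; the extension module name `_hypothetical_rust_ext` and the placeholder normalizer are assumptions for illustration, not the package's internals.

```python
def _python_normalize(text: str) -> str:
    """Pure-Python fallback (placeholder logic for the sketch)."""
    return text.replace("eft", "est")


try:
    # Hypothetical compiled extension; absent unless a Rust build is installed
    from _hypothetical_rust_ext import normalize as _impl
    _BACKEND = "rust"
except ImportError:
    _impl = _python_normalize
    _BACKEND = "python"


def backend() -> str:
    """Report which implementation was selected at import time."""
    return _BACKEND


def normalize(text: str) -> str:
    """Dispatch to whichever backend was loaded."""
    return _impl(text)


print(backend(), normalize("Gallia eft omnis"))
```

Callers use the same `normalize` entry point either way, so code written against the fallback keeps working when the faster backend becomes available.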

Accuracy

U/V Normalization

| Dataset | Accuracy |
|---|---|
| Curated test set (100 sentences) | 100% |
| UD Latin PROIEL (~21K u/v chars) | ~98% |
| UD Latin Perseus (~18K u/v chars) | ~97% |
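The dataset sizes are given in u/v characters, which suggests a per-character metric. Assuming that reading, a score of this kind could be computed as in the sketch below; the function name `uv_char_accuracy` is hypothetical and this is not the project's evaluation script.

```python
def uv_char_accuracy(gold: str, predicted: str) -> float:
    """Share of gold u/v characters that the prediction reproduces exactly."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must be aligned character by character")
    scored = correct = 0
    for g, p in zip(gold, predicted):
        if g.lower() in ("u", "v"):  # only u/v positions count toward the score
            scored += 1
            correct += g == p
    return correct / scored if scored else 1.0


gold = "Arma virumque cano"
unnormalized = "Arma uirumque cano"
print(uv_char_accuracy(gold, gold))
print(uv_char_accuracy(gold, unnormalized))
```

Scoring only the u/v positions keeps the figure from being inflated by the many characters that normalization never touches.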

Long-S Correction

Pass 1 rules have a 0.00% false positive rate. Pass 2 disambiguation uses a protected allowlist of ~170 common Latin f- words (inline in long_s/_rules.py) plus n-gram frequency tables (JSON files in long_s/data/ngrams/).

Changelog

See CHANGELOG.md for release history.

Citation

@software{latincy_preprocess,
  title = {latincy-preprocess: Text Preprocessing for LatinCy Projects},
  author = {Burns, Patrick J.},
  year = {2026},
  url = {https://github.com/latincy/latincy-preprocess}
}

Acknowledgments

The betacode submodule adapts the Beta Code → Unicode conversion tables and algorithm from the Classical Language Toolkit (cltk.alphabet.grc.beta_to_unicode), used under the MIT License (Copyright © 2013 Classical Language Toolkit). It is reimplemented here on the Python standard library so the package remains dependency-free.

License

MIT
