
Whitaker's Words lexical data for LatinCy


Whitaker's Words as LatinCy pipeline components for Latin NLP.

latincy-lexicon makes the lexical data and morphological analysis engine from Whitaker's Words available as spaCy pipeline components, designed for use with LatinCy language models.

Quick Start

import spacy

nlp = spacy.load("la_core_web_lg")
nlp.add_pipe("whitakers_words", config={
    "lexicon_path": "data/json/lexicon.json",
    "analyzer_path": "data/json/analyzer.json",
})
nlp.add_pipe("paradigm_generator", config={
    "analyzer_path": "data/json/analyzer.json",
})

doc = nlp("Poeta bonus carmina pulchra scribit.")

# Dictionary glosses
for token in doc:
    if token._.gloss:
        print(f"{token.text:12} {token._.gloss}")
# Poeta        poet
# bonus        good, honest, brave, noble, kind, pleasant, right
# carmina      song, poem
# pulchra      pretty, beautiful, handsome, noble, illustrious
# scribit      write

# Reinflection: change morphological features, get the right Latin form
scribit = doc[4]
print(scribit._.reinflect(Number="Plur"))    # scribunt
print(scribit._.reinflect(Tense="Imp"))      # scribebat
print(scribit._.reinflect(Voice="Pass"))     # scribitur

Features

  • whitakers_words — Single pipeline component providing dictionary glosses (token._.lexicon), rule-based morphological analysis (token._.ww), and short definitions (token._.gloss)
  • paradigm_generator — Generates complete inflectional paradigms for any lemma, with reinflection support (token._.paradigm, token._.reinflect)
  • Standalone Generator API — Produce all inflected forms for a lemma, or build form-to-lemma lookup tables, without requiring spaCy
  • POS-aware ranking — Uses upstream tagger/morphologizer output to rank ambiguous entries and parses
  • Multi-signal disambiguation — Scores candidates using lemma match, morphological features, dependency labels, NER context, and dictionary frequency
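The multi-signal ranking can be pictured as a weighted score over candidate parses. The sketch below is purely illustrative: the weights, field names, and `score_candidate` helper are assumptions for exposition, not the package's actual implementation.

```python
# Illustrative ranking of ambiguous parses, in the spirit of the
# multi-signal disambiguation described above. All weights and field
# names here are hypothetical.

def score_candidate(cand, token):
    """Score one candidate parse against upstream spaCy annotations."""
    score = 0.0
    if cand["lemma"] == token["lemma"]:        # lemma match
        score += 3.0
    if cand["upos"] == token["upos"]:          # POS-aware ranking
        score += 2.0
    shared = set(cand["feats"].items()) & set(token["feats"].items())
    score += 0.5 * len(shared)                 # morphological feature overlap
    score += cand.get("freq", 0.0)             # dictionary frequency prior
    return score

# "carmina" is ambiguous between a noun and a (rarer) verb reading.
token = {"lemma": "carmen", "upos": "NOUN",
         "feats": {"Case": "Acc", "Number": "Plur"}}
candidates = [
    {"lemma": "carmen", "upos": "NOUN",
     "feats": {"Case": "Acc", "Number": "Plur"}, "freq": 0.9},
    {"lemma": "carmino", "upos": "VERB",
     "feats": {"Mood": "Ind"}, "freq": 0.1},
]
ranked = sorted(candidates, key=lambda c: score_candidate(c, token),
                reverse=True)
print(ranked[0]["lemma"])  # the noun reading wins
```

The real component also folds in dependency labels and NER context; the point is only that each signal contributes additively to a candidate's rank.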

Installation

pip install latincy-lexicon

Or for development:

git clone https://github.com/latincy/latincy-lexicon.git
cd latincy-lexicon
uv venv && source .venv/bin/activate
uv pip install -e ".[dev,spacy]"

Data Setup

The Whitaker's Words data files are bundled in the package. Build the JSON data files with a single command:

latincy-lexicon build

This parses the bundled DICTLINE, INFLECTS, UNIQUES, and ADDONS files, applies patches (sum/esse, pronoun endings), reconstructs headwords, and writes analyzer.json and lexicon.json to data/json/.
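In outline, the build step reads raw entries, applies patch overrides, and serializes JSON artifacts. The sketch below only mirrors that flow with invented stand-in data; it does not reflect the package's real parsers or file formats.

```python
# Outline of the build flow described above, with invented stand-in
# data: collect entries by headword, apply patch overrides, write JSON.
import json
import pathlib
import tempfile

def build(raw_entries, patches, out_dir):
    """Apply patch overrides by headword, then write a lexicon.json."""
    lexicon = {e["headword"]: e for e in raw_entries}
    for headword, fixes in patches.items():
        if headword in lexicon:
            lexicon[headword].update(fixes)   # e.g. sum/esse fix-ups
    out = pathlib.Path(out_dir) / "lexicon.json"
    out.write_text(json.dumps(lexicon, indent=2))
    return out

entries = [{"headword": "sum", "gloss": "be"},
           {"headword": "amo", "gloss": "love"}]
patches = {"sum": {"gloss": "be, exist"}}
with tempfile.TemporaryDirectory() as d:
    path = build(entries, patches, d)
    data = json.loads(path.read_text())
print(data["sum"]["gloss"])   # be, exist
```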

Usage

import spacy

nlp = spacy.load("la_core_web_lg")

# Add Whitaker's Words (lexicon + analyzer in one component)
nlp.add_pipe("whitakers_words", config={
    "lexicon_path": "data/json/lexicon.json",
    "analyzer_path": "data/json/analyzer.json",
})

doc = nlp("Gallia est omnis divisa in partes tres.")

for token in doc:
    print(f"{token.text:12} {token._.gloss}")

Pipeline Components

whitakers_words

A single component that provides three token extensions:

  • token._.lexicon — list of dictionary entries matching the token's lemma, with glosses, part of speech, principal parts, and age/frequency metadata
  • token._.ww — full morphological parse list from the Words stem+ending engine, ranked by POS match, morphological features, dependency labels, NER context, and frequency
  • token._.gloss — short definition from the top-ranked parse

Both data paths are optional: pass only lexicon_path for dictionary lookups, only analyzer_path for morphological analysis, or both together. For best results, add the component after all other LatinCy pipeline components, so it can use their output for ranking.
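Whitaker's engine analyzes a word by splitting it into a known stem plus a known ending. A toy version of that idea, with a tiny hand-written stem list and ending table (the data and function below are invented for illustration):

```python
# Toy stem+ending analysis in the style of Whitaker's Words: try every
# split of the surface form into a lexicon stem and an inflection-table
# ending. The data here is a tiny invented sample.

STEMS = {"scrib": ("scribo", "VERB"), "carmin": ("carmen", "NOUN")}
ENDINGS = {
    "it":  {"Mood": "Ind", "Number": "Sing", "Person": "3", "Tense": "Pres"},
    "unt": {"Mood": "Ind", "Number": "Plur", "Person": "3", "Tense": "Pres"},
    "a":   {"Case": "Acc", "Number": "Plur"},   # neuter plural, simplified
}

def analyze(form):
    """Return every (stem, ending) split that both tables license."""
    parses = []
    for i in range(1, len(form) + 1):
        stem, ending = form[:i], form[i:]
        if stem in STEMS and ending in ENDINGS:
            lemma, upos = STEMS[stem]
            parses.append({"lemma": lemma, "upos": upos,
                           "feats": dict(ENDINGS[ending])})
    return parses

print(analyze("scribit"))   # one verbal parse for lemma "scribo"
print(analyze("carmina"))   # one nominal parse for lemma "carmen"
```

The real engine works over far larger tables and also consults UNIQUES and ADDONS, but the split-and-match core is the same idea.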

paradigm_generator

Generates complete inflectional paradigms for Latin words. It is the inverse of the analyzer: given a lemma, it produces all inflected forms with UD morphological features.

nlp.add_pipe("paradigm_generator", config={
    "analyzer_path": "data/json/analyzer.json",
})

doc = nlp("Amat puellam.")
for token in doc:
    if token._.paradigm:
        print(f"{token.text}: {len(token._.paradigm)} forms")

Token extensions:

  • token._.paradigm — list of all inflected forms for the token's lemma, each with form, lemma, upos, and feats (dict of UD features). None for punctuation or unknown lemmas.
  • token._.reinflect(**overrides) — returns a surface form matching the token's current morphology merged with the provided UD feature overrides, or None if no match exists.

doc = nlp("amat")
doc[0]._.reinflect(Number="Plur")           # "amant"
doc[0]._.reinflect(Tense="Imp")             # "amabat"
doc[0]._.reinflect(Tense="Imp", Number="Plur")  # "amabant"
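Conceptually, reinflection merges the token's current UD features with the overrides and looks the result up in the generated paradigm. A self-contained sketch of that idea, with a hand-written sample paradigm (not output of the package):

```python
# Illustrative reinflection: merge current features with the overrides,
# then find the paradigm cell whose features match.

PARADIGM = [
    {"form": "amat",    "feats": {"Number": "Sing", "Person": "3", "Tense": "Pres"}},
    {"form": "amant",   "feats": {"Number": "Plur", "Person": "3", "Tense": "Pres"}},
    {"form": "amabat",  "feats": {"Number": "Sing", "Person": "3", "Tense": "Imp"}},
    {"form": "amabant", "feats": {"Number": "Plur", "Person": "3", "Tense": "Imp"}},
]

def reinflect(current_feats, **overrides):
    """Merge overrides into the current features and look up the form."""
    target = {**current_feats, **overrides}
    for entry in PARADIGM:
        if entry["feats"] == target:
            return entry["form"]
    return None  # no matching cell in the paradigm

amat = {"Number": "Sing", "Person": "3", "Tense": "Pres"}
print(reinflect(amat, Number="Plur"))               # amant
print(reinflect(amat, Tense="Imp"))                 # amabat
print(reinflect(amat, Tense="Imp", Number="Plur"))  # amabant
```

Because unmentioned features are carried over from the token, a single override like Number="Plur" is enough to move across the paradigm.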

Standalone Generator API

The Generator class can be used independently of spaCy:

from latincy_lexicon.generator import Generator

gen = Generator.from_json("data/json/analyzer.json")

# Generate all forms of a lemma
forms = gen.generate("amo")              # all forms of "amo"
forms = gen.generate("rex", pos="N")     # noun forms only

for f in forms[:5]:
    print(f"{f.form:15} {f.upos:6} {f.feats}")
# amo             VERB   Mood=Ind|Number=Sing|Person=1|Tense=Pres|VerbForm=Fin|Voice=Act
# amas            VERB   Mood=Ind|Number=Sing|Person=2|Tense=Pres|VerbForm=Fin|Voice=Act
# amat            VERB   Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin|Voice=Act
# amamus          VERB   Mood=Ind|Number=Plur|Person=1|Tense=Pres|VerbForm=Fin|Voice=Act
# amatis          VERB   Mood=Ind|Number=Plur|Person=2|Tense=Pres|VerbForm=Fin|Voice=Act

# Build form→lemma lookup tables for batch processing
lookup = gen.to_lookup_dict(["rex", "puella"])
# {"rex": "rex", "regis": "rex", "regi": "rex", ..., "puella": "puella", ...}

Each Form has four fields: form (surface), lemma (citation), upos (UD POS), and feats (UD feature string).
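A lookup table like this turns batch lemmatization into a single dictionary access per token. A small self-contained illustration (the table below is hand-written sample data, not Generator output):

```python
# Using a form-to-lemma table for fast batch lemmatization.
LOOKUP = {"rex": "rex", "regis": "rex", "regi": "rex",
          "puella": "puella", "puellam": "puella"}

tokens = ["regis", "puellam", "ignotum"]
# Fall back to the surface form itself for unknown words.
lemmas = [LOOKUP.get(t, t) for t in tokens]
print(lemmas)   # ['rex', 'puella', 'ignotum']
```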

Acknowledgments

This project is built on Whitaker's Words, a Latin dictionary and morphological analysis program created by Colonel William A. Whitaker (USAF, Retired). The WORDS system — including its lexicon (DICTLINE), inflection tables (INFLECTS), and morphological analysis logic — is the foundation of latincy-lexicon. Whitaker made all parts of the WORDS system freely available for any purpose ("Permission is hereby freely given for any and all use of program and data."); this project exists because of that generosity.

The WORDS data files used by this project are maintained at mk270/whitakers-words. Thank you to Martin Keegan for continuing Whitaker's work and sharing that work in the same spirit.

License

The original Python code in this project is released under the MIT License.

The Whitaker's Words data and analysis logic incorporated in this project are copyright William A. Whitaker (1936–2010) and distributed under his original permissive license (see LICENSE for full text).
