latincy-readers

Corpus readers for Latin texts with LatinCy/spaCy integration


LatinCy Readers

Corpus readers for Latin texts with LatinCy integration. Now also supporting Ancient Greek texts with OdyCy.

Version 1.2.0; Python 3.10+; LatinCy 3.8.0+

Installation

# Install from PyPI
pip install latincy-readers

# Install the LatinCy model (for Latin texts)
pip install https://huggingface.co/latincy/la_core_web_lg/resolve/main/la_core_web_lg-3.8.0-py3-none-any.whl

# Install the OdyCy model (for Ancient Greek texts)
pip install https://huggingface.co/chcaa/grc_odycy_joint_sm/resolve/main/grc_odycy_joint_sm-any-py3-none-any.whl

# For development (editable install)
git clone https://github.com/diyclassics/latincy-readers.git
cd latincy-readers
pip install -e ".[dev]"

Quick Start

from latincyreaders import TesseraeReader, AnnotationLevel

# Auto-download corpus on first use
reader = TesseraeReader()

# Or specify a custom path
reader = TesseraeReader("/path/to/tesserae/corpus")

# Iterate over documents as spaCy Docs
for doc in reader.docs():
    print(f"{doc._.fileid}: {len(list(doc.sents))} sentences")

# Search for sentences containing specific forms
for result in reader.find_sents(forms=["Caesar", "Caesarem"]):
    print(f"{result['citation']}: {result['sentence']}")

# Get raw text (no NLP processing)
for text in reader.texts():
    print(text[:100])

Readers

Reader	Format	Auto-Download	Description
`TesseraeReader`	`.tess`	Yes	CLTK Latin Tesserae corpus
`GreekTesseraeReader`	`.tess`	Yes	CLTK Greek Tesserae corpus (OdyCy)
`PlaintextReader`	`.txt`	No	Plain text files
`LatinLibraryReader`	`.txt`	Yes	Latin Library corpus
`TEIReader`	`.xml`	No	TEI-XML documents
`PerseusReader`	`.xml`	No	Perseus Digital Library TEI
`CamenaReader`	`.xml`	Yes	CAMENA Neo-Latin corpus
`TxtdownReader`	`.txtd`	No	Txtdown format with citations
`UDReader`	`.conllu`	No	Universal Dependencies CoNLL-U
`LatinUDReader`	`.conllu`	Yes	All 6 Latin UD treebanks

Auto-Download

Readers with auto-download support fetch their corpus on first use:

# Downloads to ~/latincy_data/lat_text_tesserae/texts if not found
reader = TesseraeReader()

# Disable auto-download
reader = TesseraeReader(auto_download=False)

# Use environment variable for custom location
# export TESSERAE_PATH=/custom/path
reader = TesseraeReader()

# Manual download to specific location
TesseraeReader.download("/path/to/destination")
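
The options above imply a lookup order: an explicit path, then the environment variable, then the default under ~/latincy_data. A minimal stdlib-only sketch of that precedence follows. The helper `resolve_corpus_path`, its parameters, and the exact ordering are assumptions made for illustration; they are not taken from the library's source.

```python
import os
from pathlib import Path

# Hypothetical helper: NOT part of latincy-readers. It only illustrates one
# plausible precedence for locating a corpus directory.
def resolve_corpus_path(explicit=None, env_var="TESSERAE_PATH",
                        default="~/latincy_data/lat_text_tesserae/texts",
                        environ=None):
    environ = os.environ if environ is None else environ
    if explicit is not None:
        # 1. A path passed by the caller wins.
        return Path(explicit).expanduser()
    if environ.get(env_var):
        # 2. Otherwise fall back to the environment variable, if set.
        return Path(environ[env_var]).expanduser()
    # 3. Otherwise use the default download location.
    return Path(default).expanduser()

print(resolve_corpus_path("/path/to/tesserae/corpus", environ={}))
print(resolve_corpus_path(environ={"TESSERAE_PATH": "/custom/path"}))
```

Passing `environ` explicitly keeps the sketch testable without touching the real process environment.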

Ancient Greek (GreekTesseraeReader)

Read Ancient Greek texts from the CLTK Greek Tesserae corpus using OdyCy NLP models:

from latincyreaders import GreekTesseraeReader, AnnotationLevel

# Auto-download Greek Tesserae corpus on first use
reader = GreekTesseraeReader()

# Use TOKENIZE level (no OdyCy model needed)
reader = GreekTesseraeReader(annotation_level=AnnotationLevel.TOKENIZE)

# Iterate over citation lines
for citation, text in reader.texts_by_line():
    print(f"{citation}: {text[:60]}...")

# Search for Greek words
for fid, cit, text, matches in reader.search(r"Ἀχιλ"):
    print(f"{cit}: found {matches}")

# Environment variable for custom location
# export GRC_TESSERAE_PATH=/custom/path
reader = GreekTesseraeReader()
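
To make the idea of citation lines concrete, here is a stdlib-only sketch that splits .tess-style content into (citation, text) pairs and filters them by substring. It assumes each line has the form `<citation>` followed by whitespace and the text; `parse_tess_lines` is a hypothetical helper, not the reader's implementation.

```python
import re

# Assumed .tess layout: "<citation>" + whitespace + text, one unit per line.
TESS_LINE = re.compile(r"^<(?P<citation>[^>]+)>\s+(?P<text>.+)$")

def parse_tess_lines(raw):
    """Yield (citation, text) pairs from .tess-style content (hypothetical helper)."""
    for line in raw.splitlines():
        m = TESS_LINE.match(line.strip())
        if m:  # silently skip blank or malformed lines
            yield m.group("citation"), m.group("text")

sample = (
    "<hom. il. 1.1>\tμῆνιν ἄειδε θεὰ Πηληϊάδεω Ἀχιλῆος\n"
    "<hom. il. 1.2>\tοὐλομένην, ἣ μυρί᾽ Ἀχαιοῖς ἄλγε᾽ ἔθηκε\n"
)

pairs = list(parse_tess_lines(sample))
# Keep only the citations whose text contains the search string.
hits = [cit for cit, text in pairs if "Ἀχιλ" in text]
print(hits)
```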

Universal Dependencies Treebanks

Access gold-standard linguistic annotations from Latin UD treebanks:

from latincyreaders import LatinUDReader, PROIELReader

# See available treebanks
LatinUDReader.available_treebanks()
# {'proiel': 'Vulgate, Caesar, Cicero, Palladius',
#  'perseus': 'Classical texts from Perseus Digital Library',
#  'ittb': 'Index Thomisticus (Thomas Aquinas)',
#  'llct': 'Late Latin Charter Treebank',
#  'udante': "Dante's Latin works",
#  'circse': 'CIRCSE Latin treebank'}

# Use a specific treebank
reader = PROIELReader()

# Iterate sentences with UD annotations
for sent in reader.ud_sents():
    print(f"{sent._.citation}: {sent.text}")

# Access full UD token data
for doc in reader.docs():
    for token in doc:
        ud = token._.ud  # dict with all 10 CoNLL-U columns
        print(f"{token.text}: {ud['upos']} {ud['feats']}")

# Read from all treebanks at once
reader = LatinUDReader()
LatinUDReader.download_all()  # Download all 6 treebanks

Note: Unlike other readers, UDReader constructs spaCy Docs directly from gold UD annotations rather than running the spaCy NLP pipeline.
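
For reference, the ten CoNLL-U columns can be read from a token line with nothing but the standard library. The sketch below is illustrative only: the sample line is made up rather than taken from a treebank, and apart from `upos` and `feats` (which appear in the example above) the dictionary key names are assumptions about how the columns might be labelled.

```python
# The ten CoNLL-U columns, in file order. Only 'upos' and 'feats' are
# confirmed key names in this README; the others are assumed spellings.
CONLLU_FIELDS = ("id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc")

def parse_conllu_token(line):
    """Split one tab-separated CoNLL-U token line into a dict of its 10 columns."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(CONLLU_FIELDS):
        raise ValueError(f"expected 10 columns, got {len(values)}")
    return dict(zip(CONLLU_FIELDS, values))

# Illustrative token line (not copied from any treebank).
line = "1\tGallia\tGallia\tPROPN\t_\tCase=Nom|Gender=Fem|Number=Sing\t3\tnsubj\t_\t_"
ud = parse_conllu_token(line)
print(f"{ud['form']}: {ud['upos']} {ud['feats']}")
```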

Core API

All readers provide a consistent interface:

reader.fileids()              # List available files
reader.texts(fileids=...)     # Raw text strings (generator)
reader.docs(fileids=...)      # spaCy Doc objects (generator)
reader.sents(fileids=...)     # Sentence spans (generator)
reader.tokens(fileids=...)    # Token objects (generator)
reader.metadata(fileids=...)  # File metadata (generator)
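
Because these methods return generators, results are produced lazily, so a small sample can be taken without processing the whole corpus. The sketch below uses a stand-in generator in place of a real reader method; `texts` here is a placeholder, not the library's function.

```python
from itertools import islice

produced = []  # records which items the generator has actually yielded

def texts():
    """Stand-in for a reader method that yields raw text lazily."""
    for i in range(1, 1001):
        produced.append(i)
        yield f"document {i}"

# islice stops pulling from the generator after three items.
first_three = list(islice(texts(), 3))
print(first_three)
print(f"items produced: {len(produced)}")
```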

FileSelector: Fluent File Filtering

Use the select() method for complex file queries combining filename patterns and metadata:

# Filter by filename pattern (regex)
vergil_docs = reader.select().match(r"vergil\..*")

# Filter by metadata
epics = reader.select().where(genre="epic")

# Multiple conditions (AND)
vergil_epics = reader.select().where(author="Vergil", genre="epic")

# Match any of multiple values
major_authors = reader.select().where(author__in=["Vergil", "Ovid", "Horace"])

# Date ranges
augustan = reader.select().date_range(-50, 50)

# Chain multiple filters
selection = (reader.select()
    .match(r".*aen.*")
    .where(genre="epic")
    .date_range(-50, 50))

# Use with docs(), sents(), etc.
for doc in reader.docs(selection):
    print(doc._.fileid)

# Preview results
print(selection.preview(5))
print(f"Found {len(selection)} files")
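
The chaining style above works when every filter returns a new selection instead of mutating the old one. The following stdlib-only sketch shows that pattern. The `Selection` class, its use of `re.match`, and the sample metadata are assumptions for illustration and do not describe the library's internals.

```python
import re

class Selection:
    """Hypothetical stand-in for a fluent file selection (not the library's class)."""

    def __init__(self, files):
        self.files = dict(files)  # fileid -> metadata dict

    def _keep(self, predicate):
        # Every filter returns a NEW Selection, which makes chaining safe.
        return Selection({f: m for f, m in self.files.items() if predicate(f, m)})

    def match(self, pattern):
        rx = re.compile(pattern)
        return self._keep(lambda f, m: rx.match(f) is not None)

    def where(self, **conds):
        def ok(f, m):
            for key, want in conds.items():
                if key.endswith("__in"):          # author__in=[...] style
                    if m.get(key[:-4]) not in want:
                        return False
                elif m.get(key) != want:          # plain equality
                    return False
            return True
        return self._keep(ok)

    def date_range(self, start, end):
        return self._keep(lambda f, m: "date" in m and start <= m["date"] <= end)

    def __iter__(self):
        return iter(self.files)

    def __len__(self):
        return len(self.files)

# Illustrative metadata with approximate dates.
files = {
    "vergil.aeneid.tess": {"author": "Vergil", "genre": "epic", "date": -19},
    "vergil.eclogues.tess": {"author": "Vergil", "genre": "pastoral", "date": -38},
    "lucan.bellum_civile.tess": {"author": "Lucan", "genre": "epic", "date": 65},
}

selection = (Selection(files)
    .match(r"vergil\..*")
    .where(genre="epic")
    .date_range(-50, 50))
print(list(selection), len(selection))
```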

Search API

# Fast regex search (no NLP)
reader.search(pattern=r"\bbell\w+")

# Form-based sentence search
reader.find_sents(forms=["amor", "amoris"])

# Lemma-based search (requires NLP)
reader.find_sents(lemma="amo")

# spaCy Matcher patterns
reader.find_sents(matcher_pattern=[{"POS": "ADJ"}, {"POS": "NOUN"}])
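
The difference between a regex search and a form-based search can be seen on plain strings. This sketch uses only the `re` module; `sents_with_forms` is a hypothetical helper, and whether the library's own search is case-sensitive is not stated here, so both behaviours are shown.

```python
import re

sentences = [
    "Caesar bellum in Gallia gessit.",
    "Bella civilia secuta sunt.",
]
text = " ".join(sentences)

# Regex search: any word starting with "bell" (case-sensitive as written).
case_sensitive = re.findall(r"\bbell\w+", text)
case_insensitive = re.findall(r"\bbell\w+", text, flags=re.IGNORECASE)

def sents_with_forms(sents, forms):
    """Return sentences containing at least one of the exact word forms."""
    wanted = set(forms)
    return [s for s in sents if wanted & set(re.findall(r"\w+", s))]

print(case_sensitive)
print(case_insensitive)
print(sents_with_forms(sentences, ["bellum", "belli"]))
```

A form search matches whole tokens exactly, so "bellum" does not match "Bella"; finding every inflected form is what the lemma-based search is for.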

Text Analysis

# Build a concordance (word -> citations mapping)
conc = reader.concordance(basis="lemma")
print(conc["amor"])  # ['<catull. 1.1>', '<verg. aen. 4.1>', ...]

# Keyword in Context
for hit in reader.kwic("amor", window=5, by_lemma=True):
    print(f"{hit['left']} [{hit['match']}] {hit['right']}")
    print(f"  -- {hit['citation']}")

# N-grams
for ngram in reader.ngrams(n=2, basis="lemma"):
    print(ngram)  # "qui do", "do lepidus", ...

# Skip-grams (n-grams with gaps)
for sg in reader.skipgrams(n=2, k=1):
    print(sg)
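
For readers unfamiliar with these terms, the sketch below computes n-grams, skip-grams, and keyword-in-context hits over a plain token list. It is a stdlib-only illustration, not the library's code. In particular, the skip-gram definition used here (each start token combined with n-1 tokens drawn from the next n-1+k positions) is an assumption, since definitions vary.

```python
from itertools import combinations

def ngrams(tokens, n):
    """Contiguous n-grams as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def skipgrams(tokens, n, k):
    """n-grams allowing up to k skipped tokens after each start token (assumed definition)."""
    out = []
    for i in range(len(tokens) - n + 1):
        window = tokens[i + 1:i + n + k]       # candidates for the remaining slots
        for rest in combinations(window, n - 1):
            out.append((tokens[i], *rest))
    return out

def kwic(tokens, keyword, window=2):
    """Yield dicts with the left context, the match, and the right context."""
    for i, tok in enumerate(tokens):
        if tok == keyword:
            yield {
                "left": " ".join(tokens[max(0, i - window):i]),
                "match": tok,
                "right": " ".join(tokens[i + 1:i + 1 + window]),
            }

tokens = ["cui", "dono", "lepidum", "novum", "libellum"]
print(ngrams(tokens, 2))
print(skipgrams(tokens, 2, 1))
print(list(kwic(tokens, "lepidum", window=1)))
```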

Document Caching

Documents are cached by default for better performance when accessing the same file multiple times:

# Caching enabled by default
reader = TesseraeReader()

# Disable caching
reader = TesseraeReader(cache=False)

# Configure cache size
reader = TesseraeReader(cache_maxsize=256)

# Check cache statistics
print(reader.cache_stats())  # {'hits': 5, 'misses': 3, 'size': 3, 'maxsize': 128}

# Clear the cache
reader.clear_cache()
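
The hit and miss counters reported by a cache like this can be reproduced with Python's built-in `functools.lru_cache`. The sketch below only demonstrates the general mechanism; whether the library uses `lru_cache` internally is not known, and `load_doc` is a placeholder for an expensive parse.

```python
from functools import lru_cache

@lru_cache(maxsize=128)
def load_doc(fileid):
    """Stand-in for an expensive parse; returns a placeholder string."""
    return f"<parsed {fileid}>"

# Request three distinct files, one of them three times.
for fid in ["a.tess", "b.tess", "a.tess", "c.tess", "a.tess"]:
    load_doc(fid)

info = load_doc.cache_info()
stats = {"hits": info.hits, "misses": info.misses,
         "size": info.currsize, "maxsize": info.maxsize}
print(stats)

load_doc.cache_clear()  # analogous to clearing the reader's cache
```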

Annotation Levels

Linguistic annotations come from spaCy-based pipelines (LatinCy for Latin, OdyCy for Ancient Greek). The full pipeline provides POS tagging, lemmatization, morphological analysis, and named entity recognition, but it can be slow on large corpora. If you don't need every annotation, a lighter annotation level can be significantly faster:

from latincyreaders import TesseraeReader, AnnotationLevel

# Full pipeline: POS, lemma, morphology, NER (default)
reader = TesseraeReader(annotation_level=AnnotationLevel.FULL)

# Basic: tokenization + sentence boundaries only
reader = TesseraeReader(annotation_level=AnnotationLevel.BASIC)

# Tokenization only (no sentence boundaries)
reader = TesseraeReader(annotation_level=AnnotationLevel.TOKENIZE)

# No NLP at all - use texts() for raw strings
for text in reader.texts():
    print(text)

Metadata Management

from latincyreaders import MetadataManager, MetadataSchema

# Load and merge metadata from JSON files
manager = MetadataManager("/path/to/corpus")

# Access metadata
meta = manager.get("vergil.aen.tess")
print(meta["author"], meta["date"])

# Filter files by metadata
for fileid in manager.filter_by(author="Vergil", genre="epic"):
    print(fileid)

# Date range filtering
for fileid in manager.filter_by_range("date", -50, 50):
    print(fileid)

# Validate metadata against a schema
schema = MetadataSchema(
    required={"author": str, "title": str},
    optional={"date": int, "genre": str}
)
manager = MetadataManager("/path/to/corpus", schema=schema)
result = manager.validate()
if not result.is_valid:
    print(result.errors)
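
The kind of check a schema performs can be sketched in a few lines: required fields must be present with the right type, and optional fields are type-checked only when present. The `validate` function and its error messages below are hypothetical and do not reproduce MetadataSchema's behaviour.

```python
def validate(metadata, required, optional=None):
    """Return a list of error strings for a {fileid: metadata} mapping (hypothetical)."""
    optional = optional or {}
    errors = []
    for fileid, meta in metadata.items():
        for key, typ in required.items():
            if key not in meta:
                errors.append(f"{fileid}: missing required field '{key}'")
            elif not isinstance(meta[key], typ):
                errors.append(f"{fileid}: field '{key}' should be {typ.__name__}")
        for key, typ in optional.items():
            # Optional fields are only checked when they are present.
            if key in meta and not isinstance(meta[key], typ):
                errors.append(f"{fileid}: field '{key}' should be {typ.__name__}")
    return errors

# Illustrative records: the second one lacks an author and has a non-integer date.
metadata = {
    "vergil.aen.tess": {"author": "Vergil", "title": "Aeneid", "date": -19},
    "anon.tess": {"title": "Fragment", "date": "unknown"},
}
errors = validate(metadata,
                  required={"author": str, "title": str},
                  optional={"date": int, "genre": str})
for err in errors:
    print(err)
```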

Corpora Supported

Tesserae Latin Corpus
Tesserae Greek Corpus
Perseus Digital Library TEI
Latin Library
CAMENA Neo-Latin
Universal Dependencies Latin Treebanks (PROIEL, Perseus, ITTB, LLCT, UDante, CIRCSE)
Any plaintext, TEI-XML, or CoNLL-U collection

CLI Tools

The cli/ directory contains a search tool:

# Lemma search (slower, finds all inflected forms)
python cli/reader_search.py --lemmas Caesar --limit 100
python cli/reader_search.py --lemmas bellum pax --fileids "cicero.*"

# Form search (fast, exact match)
python cli/reader_search.py --forms Caesar Caesarem --limit 100

# Pattern search (fast, regex)
python cli/reader_search.py --pattern "\\bTheb\\w+" --output thebes.tsv

Bibliography

Bird, S., E. Loper, and E. Klein. 2009. Natural Language Processing with Python. O'Reilly: Sebastopol, CA.
Bengfort, Benjamin, Rebecca Bilbro, and Tony Ojeda. 2018. Applied Text Analysis with Python: Enabling Language-Aware Data Products with Machine Learning. O'Reilly: Sebastopol, CA.

Developed by Patrick J. Burns with Claude Opus 4.5 in January 2026.


Release history

1.4.0 (Mar 20, 2026)
1.2.0 (Feb 20, 2026; this version)

