Corpus readers for Latin texts with LatinCy/spaCy integration

LatinCy Readers

Corpus readers for Latin and Ancient Greek texts with LatinCy NLP integration.

Version 1.6.0; Python 3.10+; LatinCy 3.9.0+

Installation

# Install the package
pip install latincy-readers

# With sentence vector search support
pip install latincy-readers[vectors]

# For development (editable install)
git clone https://github.com/latincy/latincy-readers.git
cd latincy-readers
pip install -e ".[dev]"

Models

LatinCy NLP models are hosted on Hugging Face and installed separately (mirroring spaCy's pattern for language models). Install whichever you need:

# Latin model (la_core_web_lg)
pip install https://huggingface.co/latincy/la_core_web_lg/resolve/main/la_core_web_lg-3.9.0-py3-none-any.whl

# Ancient Greek model (grc_dep_web_lg)
pip install https://huggingface.co/latincy/grc_dep_web_lg/resolve/main/grc_dep_web_lg-3.8.1-py3-none-any.whl

You can skip model installation if you only need raw text iteration or AnnotationLevel.TOKENIZE.
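
Because the models install as ordinary Python packages, one way to decide at runtime whether full annotation is possible is to check whether the package can be imported. The sketch below uses only the standard library; the helper name `model_available` is hypothetical and not part of latincy-readers.

```python
from importlib.util import find_spec


def model_available(package_name: str) -> bool:
    """Return True if a top-level package with this name can be imported."""
    return find_spec(package_name) is not None


# Package names taken from the install commands above.
for name in ("la_core_web_lg", "grc_dep_web_lg"):
    status = "installed" if model_available(name) else "missing"
    print(f"{name}: {status}")
```

A script could use a check like this to fall back to AnnotationLevel.TOKENIZE when a model is missing.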

Quick Start

from latincyreaders import TesseraeReader, AnnotationLevel

# Auto-download corpus on first use
reader = TesseraeReader()

# Or specify a custom path
reader = TesseraeReader("/path/to/tesserae/corpus")

# Iterate over documents as spaCy Docs
for doc in reader.docs():
    print(f"{doc._.fileid}: {len(list(doc.sents))} sentences")

# Search for sentences containing specific forms
for result in reader.find_sents(forms=["Caesar", "Caesarem"]):
    print(f"{result['citation']}: {result['sentence']}")

# Get raw text (no NLP processing)
for text in reader.texts():
    print(text[:100])

Readers

Reader | Format | Auto-Download | Description
TesseraeReader | .tess | Yes | CLTK Latin Tesserae corpus
GreekTesseraeReader | .tess | Yes | CLTK Greek Tesserae corpus
PlaintextReader | .txt | No | Plain text files
LatinLibraryReader | .txt | Yes | Latin Library corpus
TEIReader | .xml | No | TEI-XML documents
PerseusReader | .xml | No | Perseus Digital Library TEI
CamenaReader | .xml | Yes | CAMENA Neo-Latin corpus
DigilibLTReader | .xml | No | digilibLT Late-Antique Latin TEI corpus
PTAReader | .xml | Yes | Patristic Text Archive (Greek & Latin)
CSELReader | .xml | No | Corpus Scriptorum Ecclesiasticorum Latinorum
ProjectGutenbergReader | .txt | Yes (fetch) | Project Gutenberg plain-text files
TxtdownReader | .txtd | No | Txtdown format with citations and critical markup
UDReader | .conllu | No | Universal Dependencies CoNLL-U
LatinUDReader | .conllu | Yes | All 6 Latin UD treebanks

Auto-Download

Readers with auto-download support fetch the corpus on first use if it is not already present:

# Downloads to ~/latincy_data/lat_text_tesserae/texts if not found
reader = TesseraeReader()

# Disable auto-download
reader = TesseraeReader(auto_download=False)

# Use environment variable for custom location
# export TESSERAE_PATH=/custom/path
reader = TesseraeReader()

# Manual download to specific location
TesseraeReader.download("/path/to/destination")
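
The three ways of locating a corpus shown above (explicit path, environment variable, default directory) suggest a lookup order. The following is a minimal sketch of such a resolution step, assuming the explicit argument wins over the environment variable, which in turn wins over the default. That precedence and the helper name `resolve_corpus_path` are assumptions made for illustration, not the library's documented behavior.

```python
import os
from pathlib import Path

# Default location quoted in the snippet above; the precedence order is an assumption.
DEFAULT_TESSERAE = Path.home() / "latincy_data" / "lat_text_tesserae" / "texts"


def resolve_corpus_path(explicit=None, env_var="TESSERAE_PATH", default=DEFAULT_TESSERAE):
    """Pick a corpus directory: explicit argument, then env var, then default."""
    if explicit is not None:
        return Path(explicit).expanduser()
    from_env = os.environ.get(env_var)
    if from_env:
        return Path(from_env).expanduser()
    return Path(default)


print(resolve_corpus_path("/path/to/tesserae/corpus"))
print(resolve_corpus_path())
```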

Ancient Greek (GreekTesseraeReader)

Read Ancient Greek texts from the CLTK Greek Tesserae corpus using LatinCy Greek NLP models:

from latincyreaders import GreekTesseraeReader, AnnotationLevel

# Auto-download Greek Tesserae corpus on first use
reader = GreekTesseraeReader()

# Use TOKENIZE level (no Greek model needed)
reader = GreekTesseraeReader(annotation_level=AnnotationLevel.TOKENIZE)

# Iterate over citation lines
for citation, text in reader.texts_by_line():
    print(f"{citation}: {text[:60]}...")

# Search for Greek words
for fid, cit, text, matches in reader.search(r"Ἀχιλ"):
    print(f"{cit}: found {matches}")

# Environment variable for custom location
# export GRC_TESSERAE_PATH=/custom/path
reader = GreekTesseraeReader()
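
When searching polytonic Greek with a regex, it helps to know that the same visible word can be encoded with precomposed characters (NFC) or with base letters plus combining marks (NFD), and a pattern in one form will not match text in the other. Whether the reader normalizes text internally is not stated here, so the sketch below normalizes both sides itself using only the standard library. The helper `search_normalized` is hypothetical.

```python
import re
import unicodedata


def search_normalized(pattern: str, text: str, form: str = "NFC"):
    """Regex-search after bringing pattern and text to the same Unicode form."""
    return re.search(unicodedata.normalize(form, pattern),
                     unicodedata.normalize(form, text))


word = "Ἀχιλλεύς"
composed = unicodedata.normalize("NFC", word)    # precomposed code points
decomposed = unicodedata.normalize("NFD", word)  # base letters + combining marks
pattern = unicodedata.normalize("NFC", "Ἀχιλ")

# The naive search misses the decomposed text; the normalized search finds it.
print(re.search(pattern, decomposed) is not None)
print(search_normalized(pattern, decomposed) is not None)
```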

Universal Dependencies Treebanks

Access gold-standard linguistic annotations from Latin UD treebanks:

from latincyreaders import LatinUDReader, PROIELReader

# See available treebanks
LatinUDReader.available_treebanks()
# {'proiel': 'Vulgate, Caesar, Cicero, Palladius',
#  'perseus': 'Classical texts from Perseus Digital Library',
#  'ittb': 'Index Thomisticus (Thomas Aquinas)',
#  'llct': 'Late Latin Charter Treebank',
#  'udante': "Dante's Latin works",
#  'circse': 'CIRCSE Latin treebank'}

# Use a specific treebank
reader = PROIELReader()

# Iterate sentences with UD annotations
for sent in reader.ud_sents():
    print(f"{sent._.citation}: {sent.text}")

# Access full UD token data
for doc in reader.docs():
    for token in doc:
        ud = token._.ud  # dict with all 10 CoNLL-U columns
        print(f"{token.text}: {ud['upos']} {ud['feats']}")

# Read from all treebanks at once
reader = LatinUDReader()
LatinUDReader.download_all()  # Download all 6 treebanks

Note: Unlike other readers, UDReader constructs spaCy Docs directly from gold UD annotations rather than running the spaCy NLP pipeline.
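
For orientation, the ten CoNLL-U columns are ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, and MISC, separated by tabs. The sketch below parses one token line into a dict with the standard library. It is not the reader's implementation; the lowercase key names follow the `ud['upos']` and `ud['feats']` usage above, and the remaining keys are assumed to follow the same pattern.

```python
CONLLU_FIELDS = ("id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc")


def parse_conllu_line(line: str) -> dict:
    """Split one CoNLL-U token line into a dict keyed by column name."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(CONLLU_FIELDS):
        raise ValueError(f"expected 10 tab-separated columns, got {len(values)}")
    return dict(zip(CONLLU_FIELDS, values))


def parse_feats(feats: str) -> dict:
    """Turn 'Case=Nom|Number=Sing' into {'Case': 'Nom', 'Number': 'Sing'}."""
    if feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


# Hand-written sample line, not copied from any treebank.
sample = "1\tarma\tarma\tNOUN\t_\tCase=Acc|Gender=Neut|Number=Plur\t3\tobj\t_\t_"
ud = parse_conllu_line(sample)
print(ud["form"], ud["upos"], parse_feats(ud["feats"]))
```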

Core API

All readers provide a consistent interface:

reader.fileids()              # List available files
reader.texts(fileids=...)     # Raw text strings (generator)
reader.docs(fileids=...)      # spaCy Doc objects (generator)
reader.sents(fileids=...)     # Sentence spans (generator)
reader.tokens(fileids=...)    # Token objects (generator)
reader.metadata(fileids=...)  # File metadata (generator)
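
Since most of these methods return generators, nothing is read or parsed until you iterate, and you can stop early without paying for the rest of the corpus. The toy class below is a hypothetical in-memory stand-in (it is not a latincy-readers class) that shows the pattern with `itertools.islice`. Accepting either a single fileid or a list is an assumption of this sketch.

```python
from itertools import islice


class ToyReader:
    """Hypothetical in-memory stand-in that mimics the generator-style interface."""

    def __init__(self, files: dict):
        self._files = files

    def fileids(self) -> list:
        return sorted(self._files)

    def texts(self, fileids=None):
        # Accept a single fileid or a list (an assumption for this sketch);
        # yield lazily, one text at a time.
        if fileids is None:
            fileids = self.fileids()
        elif isinstance(fileids, str):
            fileids = [fileids]
        for fileid in fileids:
            yield self._files[fileid]


toy = ToyReader({"a.txt": "arma virumque cano", "b.txt": "gallia est omnis divisa"})
first = list(islice(toy.texts(), 1))  # only the first text is produced
print(first)
```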

FileSelector: Fluent File Filtering

Use the select() method for complex file queries combining filename patterns and metadata:

# Filter by filename pattern (regex)
vergil_docs = reader.select().match(r"vergil\..*")

# Filter by metadata
epics = reader.select().where(genre="epic")

# Multiple conditions (AND)
vergil_epics = reader.select().where(author="Vergil", genre="epic")

# Match any of multiple values
major_authors = reader.select().where(author__in=["Vergil", "Ovid", "Horace"])

# Date ranges
augustan = reader.select().date_range(-50, 50)

# Chain multiple filters
selection = (reader.select()
    .match(r".*aen.*")
    .where(genre="epic")
    .date_range(-50, 50))

# Use with docs(), sents(), etc.
for doc in reader.docs(selection):
    print(doc._.fileid)

# Preview results
print(selection.preview(5))
print(f"Found {len(selection)} files")
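
To make the chaining idea concrete, here is a minimal sketch of a fluent selector in which every filter returns a new, narrower selector, so calls can be combined in any order. `ToySelector` and its toy metadata are hypothetical. The sketch only mirrors the method names used above and makes no claim about how FileSelector is implemented.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class ToySelector:
    """Hypothetical immutable selector over {fileid: metadata} pairs."""
    files: dict

    def _keep(self, predicate):
        kept = {f: m for f, m in self.files.items() if predicate(f, m)}
        return ToySelector(kept)

    def match(self, pattern):
        rx = re.compile(pattern)
        return self._keep(lambda f, m: rx.search(f) is not None)

    def where(self, **conditions):
        def ok(f, m):
            for key, wanted in conditions.items():
                if key.endswith("__in"):  # e.g. author__in=[...]
                    if m.get(key[:-4]) not in wanted:
                        return False
                elif m.get(key) != wanted:
                    return False
            return True
        return self._keep(ok)

    def date_range(self, start, end):
        return self._keep(lambda f, m: "date" in m and start <= m["date"] <= end)

    def __len__(self):
        return len(self.files)

    def __iter__(self):
        return iter(sorted(self.files))


# Toy metadata for illustration only.
corpus = {
    "vergil.aen.tess": {"author": "Vergil", "genre": "epic", "date": -19},
    "ovid.met.tess": {"author": "Ovid", "genre": "epic", "date": 8},
    "cicero.cat.tess": {"author": "Cicero", "genre": "oratory", "date": -63},
}
selection = ToySelector(corpus).match(r"\.tess$").where(genre="epic").date_range(-50, 50)
print(list(selection), len(selection))
```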

Search API

# Fast regex search (no NLP)
reader.search(pattern=r"\bbell\w+")

# Form-based sentence search
reader.find_sents(forms=["amor", "amoris"])

# Lemma-based search (requires NLP)
reader.find_sents(lemma="amo")

# spaCy Matcher patterns
reader.find_sents(matcher_pattern=[{"POS": "ADJ"}, {"POS": "NOUN"}])
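
To see what a pattern like `\bbell\w+` will and will not catch before running it over a corpus, it can be tried on a short string with the standard `re` module. The sample line is made up. The word boundary `\b` keeps the pattern from matching inside "rebellio", and `\w+` requires at least one character after "bell".

```python
import re

pattern = re.compile(r"\bbell\w+")

# Made-up sample line, lowercased for simplicity.
text = "bellum gerere, rebellio, bello finito"
print(pattern.findall(text))

# Capitalized forms need an explicit flag.
print(re.findall(r"\bbell\w+", "Bellum Gallicum", flags=re.IGNORECASE))
```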

Text Analysis

# Build a concordance (word -> citations mapping)
conc = reader.concordance(basis="lemma")
print(conc["amor"])  # ['<catull. 1.1>', '<verg. aen. 4.1>', ...]

# Keyword in Context
for hit in reader.kwic("amor", window=5, by_lemma=True):
    print(f"{hit['left']} [{hit['match']}] {hit['right']}")
    print(f"  -- {hit['citation']}")

# N-grams
for ngram in reader.ngrams(n=2, basis="lemma"):
    print(ngram)  # "qui do", "do lepidus", ...

# Skip-grams (n-grams with gaps)
for sg in reader.skipgrams(n=2, k=1):
    print(sg)
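
For readers unfamiliar with the terms, the sketch below shows one common definition of n-grams and skip-grams on a plain list of lemmas: an n-gram is n adjacent items, and a k-skip-n-gram allows up to k items in total to be skipped. The functions are stand-alone illustrations with assumed semantics; the library's own `ngrams()` and `skipgrams()` may differ in details such as sentence boundaries or output type.

```python
from itertools import combinations


def ngrams(tokens, n=2):
    """Yield contiguous n-grams as space-joined strings."""
    for i in range(len(tokens) - n + 1):
        yield " ".join(tokens[i:i + n])


def skipgrams(tokens, n=2, k=1):
    """Yield n-grams that skip at most k tokens in total."""
    for i in range(len(tokens)):
        window = tokens[i + 1:i + n + k]  # candidates for the remaining n-1 slots
        for rest in combinations(window, n - 1):
            yield " ".join((tokens[i], *rest))


# Hand-typed lemmas for illustration.
lemmas = ["qui", "do", "lepidus", "novus"]
print(list(ngrams(lemmas, n=2)))
print(list(skipgrams(lemmas, n=2, k=1)))
```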

Sentence Vector Search

Find semantically similar sentences across the corpus using sentence-level embeddings. Requires the vectors extra (pip install latincy-readers[vectors]).

from latincyreaders import TesseraeReader
from latincyreaders.cache.vectors import SentenceVectorConfig, SentenceVectorStore

reader = TesseraeReader()

# Build a vector index (saved to ~/latincy_data/vectors/<collection>/)
cfg = SentenceVectorConfig(collection="tesserae")
store = SentenceVectorStore(cfg)
store.build(reader)

# Semantic search
results = store.similar_to_sent("arma virumque cano", reader.nlp, top_k=5)
for r in results:
    print(f"[{r['score']:.3f}] {r['citation']}: {r['text'][:80]}")

# Or use the reader shortcut
results = reader.find_similar("amor", top_k=5, config=cfg)

# Auto-build on first query (builds index if none exists)
results = reader.find_similar("amor", auto_build=True)

# Find sentences similar to one already in the index
results = store.similar_to_doc_sent("vergil.aeneid.part.1.tess", 0, top_k=5)

# Index statistics
print(store.stats())
# {'collection': 'tesserae', 'sentences': 15800, 'vector_dim': 300, ...}

Vectors are stored as memory-mapped NumPy arrays for efficient search without external dependencies. See notebooks/vector-search-demo.ipynb for a full walkthrough.
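
Conceptually, this kind of search ranks stored sentence vectors by their similarity to a query vector. The sketch below shows that ranking step with cosine similarity in pure Python on toy 3-dimensional vectors. That the store scores with cosine similarity is an assumption, and the function names are hypothetical; the real index works on NumPy arrays.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k(query, index, k=5):
    """Rank (label, vector) pairs by similarity to the query vector."""
    scored = [(cosine(query, vec), label) for label, vec in index]
    return sorted(scored, reverse=True)[:k]


# Toy 3-dimensional vectors; real sentence vectors have hundreds of dimensions.
index = [("s1", [1.0, 0.0, 0.0]), ("s2", [0.9, 0.1, 0.0]), ("s3", [0.0, 0.0, 1.0])]
for score, label in top_k([1.0, 0.0, 0.0], index, k=2):
    print(f"[{score:.3f}] {label}")
```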

Document Caching

Documents are cached by default for better performance when accessing the same file multiple times:

# Caching enabled by default
reader = TesseraeReader()

# Disable caching
reader = TesseraeReader(cache=False)

# Configure cache size
reader = TesseraeReader(cache_maxsize=256)

# Check cache statistics
print(reader.cache_stats())  # {'hits': 5, 'misses': 3, 'size': 3, 'maxsize': 128}

# Clear the cache
reader.clear_cache()
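
The hit and miss counters reported by `cache_stats()` behave like those of any bounded in-memory cache. As a rough analogy (not a statement about how the reader caches internally), the standard library's `functools.lru_cache` exposes the same kind of statistics. `load_doc` below is a hypothetical stand-in for an expensive parse.

```python
from functools import lru_cache


@lru_cache(maxsize=128)
def load_doc(fileid: str) -> str:
    """Stand-in for an expensive parse; returns a placeholder string."""
    return f"<parsed {fileid}>"


# Three distinct files, five repeat requests.
for fileid in ["a.tess", "b.tess", "c.tess", "a.tess", "a.tess", "b.tess", "c.tess", "a.tess"]:
    load_doc(fileid)

info = load_doc.cache_info()
print({"hits": info.hits, "misses": info.misses, "size": info.currsize, "maxsize": info.maxsize})
```

Each first request for a file is a miss and every repeat is a hit, so a low hit count on a large corpus may mean the cache size is smaller than your working set.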

Persistent Disk Cache

For large corpora, enable persistent caching to avoid re-running the NLP pipeline across sessions. Cached documents are stored as .spacy DocBin files in ~/.latincy_cache/<collection>/ by default:

from latincyreaders import TesseraeReader
from latincyreaders.cache.disk import CacheConfig

# Enable disk caching for the Tesserae corpus
config = CacheConfig(persist=True, collection="tesserae")
reader = TesseraeReader(model_name="la_core_web_lg", cache_config=config)

# First call runs NLP and caches to disk
doc = next(reader.docs(fileids="vergil.aeneid.part.1.tess"))

# Subsequent calls load from cache (~100x faster)
doc = next(reader.docs(fileids="vergil.aeneid.part.1.tess"))

# Custom cache location
config = CacheConfig(
    persist=True,
    collection="tesserae",
    cache_dir="/path/to/cache",
)

# Time-to-live (auto-expire after N days)
config = CacheConfig(persist=True, collection="tesserae", ttl_days=30)
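
A time-to-live check usually comes down to comparing a cached file's age with the configured limit. The sketch below illustrates that idea with file modification times and the standard library only. The helper `is_fresh`, the file name, and the use of mtime are all assumptions made for illustration, not a description of CacheConfig's internals.

```python
import os
import tempfile
import time
from pathlib import Path

SECONDS_PER_DAY = 86_400


def is_fresh(path: Path, ttl_days, now=None) -> bool:
    """True if the file exists and was modified within the last ttl_days days."""
    if not path.exists():
        return False
    if ttl_days is None:  # no expiry configured
        return True
    now = time.time() if now is None else now
    return (now - path.stat().st_mtime) <= ttl_days * SECONDS_PER_DAY


with tempfile.TemporaryDirectory() as tmp:
    cached = Path(tmp) / "example.spacy"  # hypothetical cache file name
    cached.write_bytes(b"placeholder")
    fresh_now = is_fresh(cached, ttl_days=30)
    # Pretend the file was written 31 days ago.
    old = time.time() - 31 * SECONDS_PER_DAY
    os.utime(cached, (old, old))
    fresh_later = is_fresh(cached, ttl_days=30)

print(fresh_now, fresh_later)
```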

Annotation Levels

All linguistic annotations come from LatinCy's spaCy-based pipelines. The full pipeline provides POS tagging, lemmatization, morphological analysis, and named entity recognition, but it can be slow on large corpora. If you don't need every annotation, choosing a lighter annotation level can give significant performance gains:

from latincyreaders import AnnotationLevel

# Full pipeline: POS, lemma, morphology, NER (default)
reader = TesseraeReader(annotation_level=AnnotationLevel.FULL)

# Basic: tokenization + sentence boundaries only
reader = TesseraeReader(annotation_level=AnnotationLevel.BASIC)

# Tokenization only (no sentence boundaries)
reader = TesseraeReader(annotation_level=AnnotationLevel.TOKENIZE)

# No NLP at all - use texts() for raw strings
for text in reader.texts():
    print(text)
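
One way to choose a level is to list the annotations a task needs and take the lightest level that covers them. The sketch below encodes the three levels as described in the comments above. `ToyLevel`, `PROVIDES`, and `lightest_level` are hypothetical names, and it assumes that FULL also includes sentence boundaries.

```python
from enum import IntEnum


class ToyLevel(IntEnum):
    """Hypothetical mirror of the three levels, ordered from light to heavy."""
    TOKENIZE = 1
    BASIC = 2
    FULL = 3


# What each level provides, per the comments in the snippet above.
PROVIDES = {
    ToyLevel.TOKENIZE: {"tokens"},
    ToyLevel.BASIC: {"tokens", "sentences"},
    ToyLevel.FULL: {"tokens", "sentences", "pos", "lemma", "morph", "ner"},
}


def lightest_level(needs: set) -> ToyLevel:
    """Return the cheapest level whose annotations cover everything in needs."""
    for level in sorted(ToyLevel):
        if needs <= PROVIDES[level]:
            return level
    raise ValueError(f"no level provides {sorted(needs - PROVIDES[ToyLevel.FULL])}")


print(lightest_level({"tokens", "sentences"}).name)
print(lightest_level({"lemma"}).name)
```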

Metadata Management

from latincyreaders import MetadataManager, MetadataSchema

# Load and merge metadata from JSON files
manager = MetadataManager("/path/to/corpus")

# Access metadata
meta = manager.get("vergil.aen.tess")
print(meta["author"], meta["date"])

# Filter files by metadata
for fileid in manager.filter_by(author="Vergil", genre="epic"):
    print(fileid)

# Date range filtering
for fileid in manager.filter_by_range("date", -50, 50):
    print(fileid)

# Validate metadata against a schema
schema = MetadataSchema(
    required={"author": str, "title": str},
    optional={"date": int, "genre": str}
)
manager = MetadataManager("/path/to/corpus", schema=schema)
result = manager.validate()
if not result.is_valid:
    print(result.errors)
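
The required/optional split in the schema amounts to two checks per record: required keys must be present with the right type, and optional keys must have the right type only if present. A minimal stand-alone sketch of that logic follows. `validate_record` is a hypothetical helper, not the MetadataManager implementation, and the error message wording is invented.

```python
def validate_record(record: dict, required: dict, optional: dict) -> list:
    """Return a list of error strings; an empty list means the record is valid."""
    errors = []
    for key, expected in required.items():
        if key not in record:
            errors.append(f"missing required field: {key}")
        elif not isinstance(record[key], expected):
            errors.append(f"{key}: expected {expected.__name__}")
    for key, expected in optional.items():
        if key in record and not isinstance(record[key], expected):
            errors.append(f"{key}: expected {expected.__name__}")
    return errors


required = {"author": str, "title": str}
optional = {"date": int, "genre": str}

good = validate_record({"author": "Vergil", "title": "Aeneid", "date": -19}, required, optional)
bad = validate_record({"author": "Vergil", "date": "19 BCE"}, required, optional)
print(good)
print(bad)
```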

Corpora Supported

CLI Tools

Tools in cli/:

# Sentence search
python cli/reader_search.py --lemmas Caesar --limit 100
python cli/reader_search.py --forms Caesar Caesarem --limit 100
python cli/reader_search.py --pattern "\\bTheb\\w+" --output thebes.tsv

# Vector search — build and query sentence vector indices
python cli/vector_search.py build
python cli/vector_search.py build --collection vergil --fileids "vergil.*"
python cli/vector_search.py query "arma virumque cano" --top-k 10
python cli/vector_search.py stats

Bibliography

  • Bird, S., E. Loper, and E. Klein. 2009. Natural Language Processing with Python. O'Reilly: Sebastopol, CA.
  • Bengfort, Benjamin, Rebecca Bilbro, and Tony Ojeda. 2018. Applied Text Analysis with Python: Enabling Language-Aware Data Products with Machine Learning. O'Reilly: Sebastopol, CA.

Developed by Patrick J. Burns with Claude Code in 2026.
