A specialist Python toolkit for Ancient Greek — alphabetic Greek NLP (incl. a state-of-the-art neural pipeline) and the Aegean syllabic scripts (Linear A, Linear B, Cypriot, Cypro-Minoan).

These details have not been verified by PyPI

Project description

pyaegean

A specialist Python toolkit for Ancient Greek and the Aegean syllabic scripts: alphabetic Greek and Linear A, Linear B, the Cypriot syllabary, and Cypro-Minoan, through one small, dependency-light library.

Status: v0.8.10 (beta). Usable and tested, but the API may still shift before 1.0. Analytical and generative output on the undeciphered material (Linear A, Cypro-Minoan) is exploratory: leads for a human expert, never ground truth. The bundled Linear A corpus is a normalized transcription (no full epigraphic apparatus); for edition-grade readings consult GORILA / SigLA.

What this is

The Greek world wrote in more than one script. Alphabetic Greek carries Homer, the tragedians, and the New Testament. Centuries earlier, the Aegean syllabic scripts recorded the Bronze Age: Linear B (Mycenaean Greek, deciphered), the Cypriot syllabary (Arcado-Cypriot Greek, deciphered), and two scripts we still cannot read: Linear A (Minoan) and Cypro-Minoan.

pyaegean is a narrow, deep toolkit for all of it: a script-agnostic corpus data layer, a full Greek NLP pipeline, the analytical methods of the Linear A Research Workbench ported to Python, and a grounded, multi-provider AI layer in which every result is labeled with its confidence level and source data. The core installs with zero heavy dependencies and imports instantly; heavier backends (models, treebanks, lexica) are opt-in and fetched to a local cache, never bundled.

For classicists, computational philologists, linguists, and students: anyone who wants a clean, citable data layer over Greek and the Aegean scripts. The Getting Started guide assumes no prior programming experience.

Highlights


All four Aegean scripts, one API	`aegean.load("lineara")` gives the bundled 1,721-inscription Linear A corpus over the full Unicode Linear A sign repertoire (47 signs carry conventional sound values, the rest are undeciphered); Linear B, the Cypriot syllabary, and Cypro-Minoan add Unicode-built inventories with small illustrative text samples (bring your own corpus for Linear B: see below). The two deciphered syllabaries transliterate and bridge into Greek: `po-me → ποιμήν` (Linear B), `pa-si-le-u-se → βασιλεύς` (Cypriot).
A deep Greek NLP pipeline	Beta Code ↔ Unicode (Beta Code is the plain-ASCII way of typing polytonic Greek), tokenize, syllabify, accent & prosody, metrical scansion (scans the Odyssey's opening; rejects lines that require synizesis), reconstructed IPA (Attic / Koine), POS, morphology, and lemmatization. Opt-in backends add attested lemmas/POS (Perseus treebank), a dictionary registry (LSJ, Middle Liddell, Cunliffe, Abbott-Smith) with Logeion deep-links, and pure-Python generalizing taggers/lemmatizers.
State-of-the-art neural NLP	The opt-in neural pipeline (`greek.use_neural_pipeline()`; runs without PyTorch): one jointly-trained model for tagging, full morphology, dependency parsing (Universal Dependencies trees), and lemmatization; in plain terms, it reads a Greek sentence and tells you each word's part of speech, grammatical form, dictionary headword, and place in the sentence's structure. Measured end-to-end through this package at 97.0 UPOS / 96.0 UFeats / 94.3 lemma / 90.2 UAS / 85.6 LAS on the UD Ancient Greek (Perseus) test benchmark, to our knowledge the best published results on every metric and robust across five training seeds (LAS 85.6 ± 0.1) (protocol & tables).
Real texts on demand	`greek.load_work("tlg0012.tlg001")` fetches a complete work (the Iliad arrives as 24 books / ~127k tokens) from Perseus canonical-greekLit / First1KGreek (CC BY-SA, commit-pinned, cached) straight into the corpus model. Don't know an id? `greek.catalog(author="Plato")` searches a bundled, offline index of 1,778 Greek works (every `-grc` edition in both repos): author, title (English or Greek), or free text, and every hit's id loads with `load_work`.
Bring your own text	`aegean.io.from_text` / `from_text_file` / `from_text_dir` / `from_csv` turn a passage, a folder of `.txt`, or a CSV into a real `Corpus`: `aegean.io.from_text("ἐν ἀρχῇ ἦν ὁ λόγος.")` gives the full filter / query / analyse / export API over your own material, with Greek run through the Greek tokenizer.
The Greek New Testament, annotated	`greek.load_nt("John", ref="1.1-18")` loads the Nestle 1904 NT with a gold lemma, morphology, and Strong's number on every token; `greek.use_dodson()` adds Koine glosses (`gloss_strongs("3056") → "a word, speech…"`). So you can lemmatize, gloss, and cite a chapter, offline. Public-domain text + CC0 annotations; one book is bundled, the full 27 fetch on demand.
Accounting reconciliation	Parses Aegean decimal numerals and metrological fractions, sums each tablet's line items, and checks them against the stated KU-RO (Linear A) / to-so (Linear B) total, flagging which tablets balance and which do not. (≈40 of the 1,721 Linear A tablets carry a checkable total; most are too fragmentary to check.)
An analyst's toolkit	Ported from the Linear A Workbench: wildcard sign-pattern search (`KU-*-RO`), weighted phonetic distance + alignment, morphological clustering, collocation statistics (PMI, log-likelihood, Fisher's exact), and a compound query engine with AND / OR / NOT.
A clean, citable data layer	`Corpus` / `Document` / `Token` / `Sign` value objects, a pandas `to_dataframe()`, a lossless JSON round-trip (`to_json` / `from_json`), a first-class `query()`, and schema-valid EpiDoc / CSV / Parquet export via `aegean.io` (the EpiDoc validates against the official EpiDoc RelaxNG and round-trips editorial status, and any EpiDoc edition reads back in with `from_epidoc`). Every corpus carries provenance and a one-line citation.
A browser UI for any corpus	`aegean.io.to_workbench(corpus, "my.json")` emits a file the Linear A Research Workbench opens via `?corpus=`: your own inscriptions get its 50 analysis modules, maps, and imagery browser with zero setup. `from_workbench_export()` loads the workbench's corpus exports (and its static data API) back into Python.
Map the find-sites	`aegean.geo` turns a corpus into a geopandas GeoDataFrame: a point per inscription or per site (EPSG:4326) from a bundled Aegean gazetteer, so you can map where a word clusters or how far a script reaches. `pip install pyaegean[geo]`.
Grounded, multi-provider AI	`aegean.ai` / `aegean.translate` front Anthropic, OpenAI, Grok, Gemini, and OpenRouter. Every generative reading is built on a local, deterministic grounding step from the tools above, and is labeled exploratory with its provenance: a hypothesis, never an assertion.
Measured accuracy	Deciphered Greek uses real scholarship (attested lemmas, gold POS, measured accuracy). The undeciphered material (Linear A, Cypro-Minoan) is labeled EXPLORATORY everywhere: the tools surface leads, never answers.
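
The accounting reconciliation described in the table comes down to summing a tablet's line items and comparing the sum with the stated total. The following is a minimal, library-independent sketch of that idea, not pyaegean's implementation: the `Entry` record and `reconcile` function are hypothetical names, and the quantities are made up for illustration rather than read from any tablet.

```python
from dataclasses import dataclass
from fractions import Fraction


@dataclass(frozen=True)
class Entry:
    """One line item: a word and its quantity (hypothetical structure)."""
    word: str
    quantity: Fraction


def reconcile(entries, stated_total):
    """Sum the line items and compare them with the stated total."""
    computed = sum((e.quantity for e in entries), Fraction(0))
    return {
        "computed": computed,
        "stated": stated_total,
        "balances": computed == stated_total,
        "difference": stated_total - computed,
    }


# Illustrative values only, not a transcription of a real tablet.
items = [
    Entry("A", Fraction(5)),
    Entry("B", Fraction(1, 2)),
    Entry("C", Fraction(3) + Fraction(1, 4)),
]
result = reconcile(items, stated_total=Fraction(35, 4))
print(result["balances"], result["difference"])
```

Exact fractions are used instead of floats so that a total built from halves and quarters compares equal without rounding error.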

Install

pip install pyaegean              # core + Linear A + Greek (zero heavy dependencies)
pip install "pyaegean[cli]"       # + the `aegean` command line
pip install "pyaegean[neural]"    # + the neural Greek pipeline & lemmatizer (onnxruntime; no torch)
pip install "pyaegean[ai]"        # + Anthropic / OpenAI / Grok / Gemini / OpenRouter clients
pip install "pyaegean[mcp]"       # + the `aegean-mcp` Model Context Protocol server (for agents)
pip install "pyaegean[all]"       # the data, AI, EpiDoc, geo, CLI, and MCP extras
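
Because the heavier backends are opt-in, a script may want to check which optional dependencies are present before relying on them. A small standard-library sketch follows; the mapping from extras to module names is an assumption based on the descriptions on this page (the `[neural]` extra mentions onnxruntime, the `[geo]` extra mentions geopandas), and `is_installed` is a hypothetical helper, not part of pyaegean.

```python
from importlib.util import find_spec


def is_installed(module_name):
    """True if a top-level module can be found in this environment."""
    return find_spec(module_name) is not None


# Assumed mapping of extras to the modules they are described as using.
optional_backends = {"neural": "onnxruntime", "geo": "geopandas"}
available = {extra: is_installed(mod) for extra, mod in optional_backends.items()}
print(available)
```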

Try it

No install required: run the guided tour (the example notebook, which opens in Colab) in your browser, with nothing to set up.

Or try the toolkit live in your browser, with the core pipeline running client-side via Pyodide: ryanpavlicek.github.io/pyaegean/demo.

import aegean

corpus = aegean.load("lineara")          # 1,721 inscriptions, bundled, offline
ht = corpus.filter(site="Haghia Triada") # filter by metadata (full site name)
df = corpus.to_dataframe(level="word")   # pandas-native, one row per word

from aegean.analysis import balance_check, word_matches_sign_pattern
balance_check(corpus.get("HT13"))                       # KU-RO accounting reconciliation
[w for w, _ in corpus.word_frequencies()
 if word_matches_sign_pattern(w, "KU-*-RO")]            # wildcard sign search → ['KU-MA-RO']

from aegean import greek

greek.betacode_to_unicode("mh=nin")     # 'μῆνιν'   (type Greek in plain ASCII)
greek.syllabify("ἄνθρωπος")             # ['ἄν', 'θρω', 'πος']
greek.scan_hexameter("ἄνδρα μοι ἔννεπε, Μοῦσα, πολύτροπον, ὃς μάλα πολλὰ").pattern
# '—⏑⏑|—⏑⏑|—⏑⏑|—⏑⏑|—⏑⏑|—×'             (Odyssey 1.1)

[(r.text, r.upos, r.lemma) for r in greek.pipeline("ἐν ἀρχῇ ἦν ὁ λόγος.")]
# [('ἐν','ADP','ἐν'), ('ἀρχῇ','NOUN','ἀρχή'), ('ἦν','VERB','εἰμί'), …]   one call, per-token records

greek.catalog(author="Plato")[0]   # find a work id to load — bundled, offline, instant
# {'id': 'tlg0059.tlg001', 'author': 'Plato', 'title': 'Euthyphro', 'greek_title': 'Εὐθύφρων', 'source': 'perseus'}
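
To make the wildcard sign search above more concrete, here is a standalone sketch of how such a match could work. It assumes that each `*` stands for exactly one sign, which is consistent with the `KU-*-RO` example but is not confirmed as the library's rule; `matches_sign_pattern` is a hypothetical stand-in, not pyaegean's `word_matches_sign_pattern`.

```python
def matches_sign_pattern(word, pattern, wildcard="*"):
    """True if a hyphen-separated word matches a hyphen-separated pattern.

    Assumption: each wildcard slot stands for exactly one sign.
    """
    signs = word.split("-")
    slots = pattern.split("-")
    if len(signs) != len(slots):
        return False
    return all(slot == wildcard or slot == sign for sign, slot in zip(signs, slots))


# A short sample word list for illustration.
words = ["KU-RO", "KU-MA-RO", "KI-RO", "KU-PA3-NU"]
hits = [w for w in words if matches_sign_pattern(w, "KU-*-RO")]
print(hits)  # ['KU-MA-RO']
```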

Or bring your own text: a string, a .txt file, a folder of texts, or a CSV becomes a full Corpus:

from aegean import io

corpus = io.from_text("ἐν ἀρχῇ ἦν ὁ λόγος.")   # offline; Greek tokenizer
[t.text for t in corpus.get("text").tokens]    # ['ἐν', 'ἀρχῇ', 'ἦν', 'ὁ', 'λόγος']
# now corpus.query(...), corpus.word_frequencies(), aegean.io.to_csv(corpus, …) — the whole API
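
For a sense of what tokenizing a passage and counting word frequencies involves, the sketch below uses only the standard library. The regular expression is a simple stand-in, not the library's Greek tokenizer, and it does not handle cases such as elision marks.

```python
import re
import unicodedata
from collections import Counter


def tokenize(text):
    """Split text into word tokens, dropping punctuation."""
    # Normalize to NFC first so accented letters are precomposed characters
    # that the word-character class can match.
    normalized = unicodedata.normalize("NFC", text)
    return re.findall(r"\w+", normalized)


tokens = tokenize("ἐν ἀρχῇ ἦν ὁ λόγος. ἐν ἀρχῇ.")
frequencies = Counter(tokens)
print(tokens[:5])
print(frequencies.most_common(2))
```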

Or skip Python entirely: the aegean CLI ([cli] extra) covers the whole toolkit, with --json on every command and stdin piping:

aegean repl                                    # interactive shell: run commands without the `aegean` prefix
aegean show lineara HT13                       # one tablet, line by line
aegean balance lineara --strict                # reconcile every stated total
aegean greek scan "ἄνδρα μοι ἔννεπε, Μοῦσα, πολύτροπον, ὃς μάλα πολλὰ"
aegean greek pipeline "ἐν ἀρχῇ ἦν ὁ λόγος." --neural --json
aegean greek catalog --author plato            # search 1,778 loadable works (offline)
aegean import myplato.txt -o myplato.json      # your own text → a corpus, then `aegean stats myplato.json`

Everything above runs offline with zero heavy dependencies. Large assets are fetched to a local cache only when you opt in (and never bundled inside the wheel): the full Linear B corpus (aegean.load("damos")), the SigLA Linear A dataset (aegean.load("sigla")), the Linear A facsimile mirror (aegean.data.fetch("lineara-images")), the AGDT-derived lexicon and models (greek.use_treebank() and friends: small prebuilt artifacts, with build-from-source as the fallback), the LSJ index (greek.use_lsj()), and the neural models (greek.use_neural_lemmatizer() / use_neural_pipeline()).
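
The fetch-once-to-cache behavior described above can be illustrated with a generic sketch. This is not pyaegean's code: `fetch_to_cache` and `fake_download` are hypothetical names, and the download is simulated so the example runs offline.

```python
import tempfile
from pathlib import Path


def fetch_to_cache(name, download, cache_dir):
    """Return the cached path for `name`, calling `download` only on a miss.

    `download` is any zero-argument callable returning bytes (hypothetical;
    a real one would make a network request).
    """
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    target = cache_dir / name
    if not target.exists():
        # Write under a temporary name, then rename, so an interrupted
        # download never leaves a partial file under the final name.
        partial = target.with_name(target.name + ".part")
        partial.write_bytes(download())
        partial.replace(target)
    return target


calls = []


def fake_download():
    """Simulated download that records how often it is called."""
    calls.append(1)
    return b"placeholder asset bytes"


with tempfile.TemporaryDirectory() as tmp:
    first = fetch_to_cache("asset.bin", fake_download, tmp)
    second = fetch_to_cache("asset.bin", fake_download, tmp)
    same_path = first == second
    cached_bytes = second.read_bytes()

print(len(calls), same_path)
```

The second call finds the file already in the cache directory, so the download function runs only once.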

Documentation

Full documentation lives in the project wiki:

Getting Started: for newcomers to Python
Example notebook: a runnable guided tour (open in Colab)
Tutorial: two guided, end-to-end research walkthroughs
Linear A · Linear B · Cypriot · Cypro-Minoan: per-script guides
Recipes: end-to-end scholarly workflows, each ending in a citation
Greek NLP · CLI · Analysis · AI Layer · Data & Provenance: reference
API reference: every public module, class, and function, generated from the source

Roadmap

Shipped through v0.8.10:

The script-agnostic core and all four Aegean scripts
The full Greek NLP track: treebank, dependency parser, generalizing tagger and lemmatizer, the neural joint pipeline, a benchmark harness, and a neutral out-of-AGDT evaluation
A pluggable lexicon registry with Middle Liddell, Cunliffe, Abbott-Smith, LSJ, and Dodson, plus Logeion deep-links
The annotated Greek New Testament with Koine glossing
The full DAMOS Linear B and SigLA Linear A corpora on demand
Corpus statistics (dispersion, keyness, bootstrap), one-line plots, and cross-script phonetic comparison
A complete data layer: lossless JSON round-trip, a compound query(), schema-valid EpiDoc / CSV / Parquet export, SQLite persistence with full-text search, an opt-in analysis cache, and Pleiades-aligned find-sites
A multi-provider AI layer (Anthropic, OpenAI, Grok, Gemini, OpenRouter) with grounded, exploratory-labeled translation
The aegean command line mirroring the Python API, and the aegean-mcp server
An in-browser demo

On the list next:

More public-domain dictionaries in the registry (Autenrieth, Slater), as their open digitizations are confirmed license-clean
SigLA editorial-apparatus decoding, richer load_work addressing, and wider Pleiades / gazetteer coverage, as the upstream apparatus data and verified coordinates become available

About the author

Ryan Pavlicek

I'm a software engineer who likes creating useful tools for exploring interesting problems.

Contact: email or create an issue on the GitHub repo.

Email: 'ryan [dot] pavlicek [dot] github [at] gmail [dot] com'

(Replace [at] with @ and [dot] with .)

Citation

If pyaegean helped with work you publish, please cite it. In the scholarly spirit, two layers:

Always cite the underlying scholarship pyaegean stands on: GORILA (Godart & Olivier 1976–1985; all five volumes are digitized in the École française d'Athènes' CEFAEL library) for Linear A; the Perseus AGDT treebank, LSJ, and (for fetched works) the Perseus Digital Library / Open Greek and Latin for Greek; the Unicode Character Database for the Linear B / Cypriot / Cypro-Minoan sign data; and GreBerta/GreTa plus the AGDT, Gorman, and Pedalion treebanks behind the neural models. The editions are listed in NOTICE, and every corpus emits its own source citation via corpus.cite().
Also cite pyaegean if you used its analysis, methods, or outputs (pin the version you ran, for reproducibility). GitHub's "Cite this repository" button, generated from CITATION.cff, gives APA / BibTeX in one click; or use:

@software{pavlicek_pyaegean,
  author  = {Pavlicek, Ryan},
  title   = {{pyaegean: a Python toolkit for Ancient Greek and the Aegean syllabic scripts}},
  year    = {2026},
  version = {0.8.10},
  url     = {https://github.com/ryanpavlicek/pyaegean}
}

No obligation for casual or exploratory use — but if it helped, I'd love to hear about it.

License

Apache-2.0. Linear A corpus data is GORILA (Godart & Olivier 1976–1985) via mwenge/lineara.xyz; the Linear B / Cypriot / Cypro-Minoan sign data is from the Unicode Character Database. Facsimile imagery © École Française d'Athènes (referenced, not redistributed). The opt-in Greek backends fetch small prebuilt artifacts derived from the Perseus AGDT (CC BY-SA 3.0) and LSJ (CC BY-SA 4.0) to cache, falling back to building from upstream. The DAMOS and SigLA corpora are CC BY-NC-SA 4.0, hosted as clearly-labeled release assets and fetched to cache: NC data is never bundled inside the wheel. See NOTICE.

Project details

Release history

0.10.0

Jun 25, 2026

0.9.0

Jun 25, 2026

This version

0.8.10

Jun 24, 2026

0.8.9

Jun 24, 2026

0.8.8

Jun 24, 2026

0.8.7

Jun 23, 2026

0.8.6

Jun 23, 2026

0.8.5

Jun 16, 2026

0.8.4 yanked

Jun 16, 2026

Reason this release was yanked:

broken CLI startup under typer ≥ 0.26; fixed in 0.8.5

0.8.3

Jun 15, 2026

0.8.2

Jun 15, 2026

0.8.1

Jun 15, 2026

0.8.0

Jun 12, 2026

0.7.0

Jun 10, 2026

0.6.0

Jun 10, 2026

0.5.0

Jun 10, 2026

0.4.0

Jun 10, 2026

0.3.0

Jun 10, 2026

0.2.0

Jun 8, 2026

0.1.0

Jun 8, 2026

Download files

Source Distribution

pyaegean-0.8.10.tar.gz (623.1 kB)

Uploaded Jun 24, 2026 Source

Built Distribution

pyaegean-0.8.10-py3-none-any.whl (680.6 kB)

Uploaded Jun 24, 2026 Python 3

File details

Details for the file pyaegean-0.8.10.tar.gz.

File metadata

Download URL: pyaegean-0.8.10.tar.gz
Upload date: Jun 24, 2026
Size: 623.1 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for pyaegean-0.8.10.tar.gz
Algorithm	Hash digest
SHA256	`09c1a24a49655838ea51318ec650b3d1e1a746d277bd76a10688b79170a6db25`
MD5	`7a4c09ec33bee8ff47bcb740e15d6e3a`
BLAKE2b-256	`6b47465049a0e4a820187afe2b0c69bdb73219e94c41fea82451d8f0ea08f9f7`
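
A downloaded archive can be checked against the SHA256 digest listed above with the standard library. The sketch below is one way to do it; the helper names are hypothetical, and the temporary sample file exists only so the example runs without the real archive on disk.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large archives need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_digest(path, expected_hex):
    """Compare a file's SHA256 with a published hex digest."""
    return sha256_of(path) == expected_hex.lower()


# Digest copied from the table above for pyaegean-0.8.10.tar.gz.
SDIST_SHA256 = "09c1a24a49655838ea51318ec650b3d1e1a746d277bd76a10688b79170a6db25"

# Demonstration on a throwaway file; for the real archive, call
# matches_published_digest("pyaegean-0.8.10.tar.gz", SDIST_SHA256).
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.bin"
    sample.write_bytes(b"abc")
    sample_ok = matches_published_digest(sample, hashlib.sha256(b"abc").hexdigest())
    sample_bad = matches_published_digest(sample, SDIST_SHA256)

print(sample_ok, sample_bad)
```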

Provenance

The following attestation bundles were made for pyaegean-0.8.10.tar.gz:

Publisher: release.yml on ryanpavlicek/pyaegean

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: pyaegean-0.8.10.tar.gz
- Subject digest: 09c1a24a49655838ea51318ec650b3d1e1a746d277bd76a10688b79170a6db25
- Sigstore transparency entry: 1941279624
- Sigstore integration time: Jun 24, 2026
Source repository:
- Permalink: ryanpavlicek/pyaegean@ac5693e1e97677172b2f123d747c961f7e87de42
- Branch / Tag: refs/tags/v0.8.10
- Owner: https://github.com/ryanpavlicek
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@ac5693e1e97677172b2f123d747c961f7e87de42
- Trigger Event: release

File details

Details for the file pyaegean-0.8.10-py3-none-any.whl.

File metadata

Download URL: pyaegean-0.8.10-py3-none-any.whl
Upload date: Jun 24, 2026
Size: 680.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for pyaegean-0.8.10-py3-none-any.whl
Algorithm	Hash digest
SHA256	`8cc9c1965e6fca334a4b2dae83686dc512090e5acaaa65dab9af3a4e4f6d1ff3`
MD5	`6347df08b8c3b2b4ce902b0ffc4e3850`
BLAKE2b-256	`f340284d145a15188b11f9c72261f44bc6646b68dd01002ca0ae17097042d33e`

Provenance

The following attestation bundles were made for pyaegean-0.8.10-py3-none-any.whl:

Publisher: release.yml on ryanpavlicek/pyaegean

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: pyaegean-0.8.10-py3-none-any.whl
- Subject digest: 8cc9c1965e6fca334a4b2dae83686dc512090e5acaaa65dab9af3a4e4f6d1ff3
- Sigstore transparency entry: 1941279751
- Sigstore integration time: Jun 24, 2026
Source repository:
- Permalink: ryanpavlicek/pyaegean@ac5693e1e97677172b2f123d747c961f7e87de42
- Branch / Tag: refs/tags/v0.8.10
- Owner: https://github.com/ryanpavlicek
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@ac5693e1e97677172b2f123d747c961f7e87de42
- Trigger Event: release
