A specialist Python toolkit for Ancient Greek — alphabetic Greek and the Aegean syllabic scripts (Linear A/B).

Project description

pyaegean

A specialist Python toolkit for Ancient Greek — alphabetic Greek and the Aegean syllabic scripts (Linear A, Linear B, Cypriot, and Cypro-Minoan). pyaegean focuses narrowly and deeply on Greek and the Aegean world: a script-agnostic corpus data layer, the analytical methods from the Linear A Research Workbench, translation, and a pluggable multi-provider AI layer. The excellent CLTK already serves many ancient languages broadly; pyaegean is intentionally narrower, and uses CLTK as a friendly benchmark to measure its Greek coverage against.

Status: v0.7.0 (alpha). Three areas are implemented. First, the script-agnostic core and the complete Aegean syllabic set: Linear A, Linear B (Mycenaean Greek), the Cypriot syllabary (Arcado-Cypriot Greek), and the undeciphered Cypro-Minoan script (sign inventory only). Second, the full Greek NLP track: opt-in Perseus-treebank lemmas/POS, a generalizing tagger and lemmatizer, a neural lemmatizer for unseen forms, LSJ glossing, a baseline dependency parser, a CLTK benchmark harness, and a neutral out-of-AGDT PROIEL evaluator. Third, the multi-provider AI layer. The corpus data layer adds a lossless JSON round-trip (to_json/from_json) and a compound query(), plus EpiDoc/CSV/Parquet export. Analytical output on the undeciphered Linear A material is exploratory; see the methodology and limitations.

Install

pip install pyaegean              # core + Linear A + Greek (zero heavy dependencies)
pip install "pyaegean[neural]"    # + the neural Greek lemmatizer (onnxruntime; no torch)
pip install "pyaegean[ai]"        # + Anthropic / OpenAI / Grok / Gemini clients
pip install "pyaegean[all]"       # the data, AI, EpiDoc, and geo extras

New to Python, or not a programmer? You're exactly who this tool is for. The Getting Started guide walks you from "I have nothing installed" to your first result — no prior coding assumed.

Quick start

Prefer to learn by doing? Run the guided tour in your browser with Open in Colab; there is nothing to install.

import aegean

corpus = aegean.load("lineara")          # 1,721 inscriptions, bundled, offline
print(len(corpus))                       # 1721

ht = corpus.filter(site="Haghia Triada") # filter by metadata (full site name)
df = corpus.to_dataframe(level="word")   # pandas-native, one row per word

from aegean.analysis import balance_check, word_matches_sign_pattern
checks = balance_check(corpus.get("HT13"))          # KU-RO accounting reconciliation
hits = [w for w, _ in corpus.word_frequencies()
        if word_matches_sign_pattern(w, "KU-*-RO")] # wildcard sign search
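
To make the wildcard search above concrete, here is a minimal standalone sketch of how a pattern such as "KU-*-RO" could be matched against hyphen-separated sign sequences. It is not pyaegean's implementation: the helper name matches_sign_pattern, the rule that "*" stands for exactly one sign, and the sample word list are all assumptions chosen for illustration, and the library's own wildcard semantics may differ.

```python
def matches_sign_pattern(word: str, pattern: str) -> bool:
    """Hypothetical helper: compare sign by sign; '*' matches exactly one sign."""
    signs = word.split("-")
    wanted = pattern.split("-")
    if len(signs) != len(wanted):
        return False
    return all(w == "*" or w == s for s, w in zip(signs, wanted))


# Illustrative strings only, not drawn from the bundled corpus.
words = ["KU-RO", "KU-MA-RO", "KI-RO", "KU-PA3-NU"]
hits = [w for w in words if matches_sign_pattern(w, "KU-*-RO")]
print(hits)  # ['KU-MA-RO']
```

Splitting on the hyphen keeps multi-character sign names such as PA3 intact, which a character-level glob would not.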

And a taste of the Greek pipeline:

from aegean import greek

greek.betacode_to_unicode("mh=nin")     # 'μῆνιν'   (type Greek in plain ASCII)
greek.syllabify("ἄνθρωπος")             # ['ἄν', 'θρω', 'πος']
greek.scan_hexameter("ἄνδρα μοι ἔννεπε, Μοῦσα, πολύτροπον, ὃς μάλα πολλὰ").pattern
# '—⏑⏑|—⏑⏑|—⏑⏑|—⏑⏑|—⏑⏑|—×'             (Odyssey 1.1)
[str(a) for a in greek.analyze("λόγον")][:2]
# ['λόγος [NOUN acc sg masc]', 'λόγος [NOUN acc sg fem]']
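
For readers curious how ASCII Beta Code becomes accented Greek, the sketch below shows the general idea: map each letter, turn each diacritic into a Unicode combining mark, and let NFC normalization compose the result. It is a simplified illustration, not the code behind greek.betacode_to_unicode; the function name betacode_to_unicode_sketch is hypothetical, and it assumes lowercase input only (no capitals marked with "*").

```python
import unicodedata

# Lowercase Beta Code letters and their Greek equivalents.
LETTERS = {
    "a": "α", "b": "β", "g": "γ", "d": "δ", "e": "ε", "z": "ζ", "h": "η",
    "q": "θ", "i": "ι", "k": "κ", "l": "λ", "m": "μ", "n": "ν", "c": "ξ",
    "o": "ο", "p": "π", "r": "ρ", "s": "σ", "t": "τ", "u": "υ", "f": "φ",
    "x": "χ", "y": "ψ", "w": "ω",
}
# Beta Code diacritics as Unicode combining marks.
DIACRITICS = {
    ")": "\u0313",   # smooth breathing
    "(": "\u0314",   # rough breathing
    "/": "\u0301",   # acute
    "\\": "\u0300",  # grave
    "=": "\u0342",   # circumflex (perispomeni)
    "|": "\u0345",   # iota subscript
    "+": "\u0308",   # diaeresis
}


def betacode_to_unicode_sketch(text: str) -> str:
    """Hypothetical, simplified converter (lowercase Beta Code only)."""
    out = []
    for i, ch in enumerate(text):
        if ch == "s":
            # Final sigma when no letter follows.
            nxt = text[i + 1] if i + 1 < len(text) else ""
            out.append("σ" if nxt in LETTERS else "ς")
        elif ch in LETTERS:
            out.append(LETTERS[ch])
        elif ch in DIACRITICS:
            out.append(DIACRITICS[ch])
        else:
            out.append(ch)  # spaces and punctuation pass through
    # NFC composes letter + combining marks into precomposed characters.
    return unicodedata.normalize("NFC", "".join(out))


print(betacode_to_unicode_sketch("mh=nin"))  # μῆνιν
print(betacode_to_unicode_sketch("lo/gos"))  # λόγος
```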

The full Linear A facsimile mirror (3,368 images, ~116 MB) is not bundled; fetch it on demand: aegean.data.fetch("lineara-images") (downloaded from the workbench repo, sha256-verified, cached locally — never re-hosted). The opt-in Greek backends likewise fetch large CC BY-SA assets to cache on first use (never bundled): the Perseus AGDT treebank (~75 MB, greek.use_treebank()) and the full Perseus LSJ (~270 MB, greek.use_lsj()).
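
The fetch-verify-cache pattern described above can be sketched in a few lines of standard-library Python. This is an illustration of the general technique, not the internals of aegean.data.fetch: the names sha256_of and fetch_to_cache, the injected download callable, and the cache layout are all assumptions.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Callable


def sha256_of(path: Path) -> str:
    """Hash a file in 1 MiB chunks so large assets need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fetch_to_cache(name: str, expected_sha256: str, cache_dir,
                   download: Callable[[Path], None]) -> Path:
    """Hypothetical helper: reuse a verified cached file, else download and verify."""
    target = Path(cache_dir) / name
    if target.exists() and sha256_of(target) == expected_sha256:
        return target  # cache hit: no network access needed
    target.parent.mkdir(parents=True, exist_ok=True)
    download(target)  # caller supplies the actual transfer
    actual = sha256_of(target)
    if actual != expected_sha256:
        target.unlink()  # never leave an unverified file in the cache
        raise ValueError(f"sha256 mismatch for {name}: got {actual}")
    return target


# Demo with an in-memory "download" so the sketch runs offline.
with tempfile.TemporaryDirectory() as tmp:
    payload = b"example asset"
    expected = hashlib.sha256(payload).hexdigest()
    path = fetch_to_cache("asset.bin", expected, tmp,
                          lambda p: p.write_bytes(payload))
    print(path.name, sha256_of(path) == expected)  # asset.bin True
```

Passing the downloader in as a callable keeps the verification logic testable without a network connection.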

What's here

  • aegean.core — script-agnostic model: Corpus, Document, Token, Sign, SignInventory, Numeral, the Script plugin registry, provenance; a lossless JSON round-trip (to_json/from_json) and a compound query().
  • aegean.scripts.lineara — Linear A: bundled corpus + 84-sign inventory + sign→sound map + transliteration.
  • aegean.scripts.linearb — Linear B (Mycenaean Greek): 211-sign Unicode inventory + transliteration + a Greek-reading bridge (po-me → ποιμήν) + accounting; bring-your-own corpus.
  • aegean.scripts.cypriot — Cypriot syllabary (Arcado-Cypriot Greek): 55-sign Unicode inventory + transliteration + a Greek-reading bridge (pa-si-le-u-se → βασιλεύς).
  • aegean.scripts.cyprominoan — Cypro-Minoan (undeciphered Bronze Age Cyprus): 99-sign Unicode inventory + sign-sequence tokenization. No phonetics or Greek bridge — the script is undeciphered.
  • aegean.analysis — ported from the workbench: accounting reconciliation, wildcard sign-pattern search, weighted phonetic distance + alignment, morphology clustering, collocation statistics, a compound-query engine, and heuristic tablet-structure classification (all with golden-fixture parity).
  • aegean.io — export adapters: EpiDoc (TEI) write (the inverse of the bring-your-own reader, round-tripping ids/find-places/tokens/lines) plus CSV and Parquet.
  • aegean.greek — the Greek NLP track: Unicode/Beta Code normalization, word/sentence tokenization, syllabification, accent and prosody analysis, metrical scansion (dactylic hexameter + elegiac pentameter), reconstructed IPA, POS tagging, and a rule-based morphological analyzer (with an optional Perseus-treebank–backed lexicon for attested, accented lemmas). Lemmatization ranges from a rule-based baseline, through a pure-Python edit-tree generalizer (use_lemmatizer) as the zero-dependency option, up to an opt-in neural lemmatizer (use_neural_lemmatizer; a GreTa seq2seq that reaches 76.3% on unseen forms). Also included: an opt-in generalizing POS tagger (use_tagger; ~84% on unseen forms), LSJ glossing (use_lsj with gloss/lookup), a baseline dependency parser (use_parser with parse; ~0.67 UAS / 0.57 LAS on projective AGDT), and a CLTK benchmark harness. aegean.load("greek") loads a small bundled sample corpus (Archaic→Koine).
  • aegean.data — bundled-data access + download-to-cache for large assets.
  • aegean.ai — multi-provider AI layer: a provider-agnostic LLMClient (Anthropic default, plus OpenAI, xAI Grok, Gemini — SDKs optional), response caching, corpus grounding, and capabilities (translate, gloss, decipherment hypotheses, NLP-assist, ask/summarize). Every generative result is labeled exploratory with provenance. aegean.translate is the hybrid lexicon+LLM front end.
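
As a rough picture of the accounting reconciliation listed under aegean.analysis, the sketch below sums the quantities on a tablet and compares them with the stated total (the KU-RO line). It is a conceptual illustration only: check_balance, BalanceResult, and the entry values are hypothetical and do not reflect balance_check's actual signature, return type, or any real tablet.

```python
from dataclasses import dataclass
from fractions import Fraction


@dataclass
class BalanceResult:
    stated_total: Fraction
    computed_total: Fraction

    @property
    def difference(self) -> Fraction:
        """Stated minus computed; zero means the tablet reconciles."""
        return self.stated_total - self.computed_total

    @property
    def balanced(self) -> bool:
        return self.difference == 0


def check_balance(entries, stated_total) -> BalanceResult:
    """Hypothetical helper: sum (label, quantity) entries and compare to the total."""
    computed = sum((Fraction(q) for _, q in entries), Fraction(0))
    return BalanceResult(Fraction(stated_total), computed)


# Made-up entries; Fraction keeps half-units exact (no float rounding).
entries = [("entry-1", Fraction(11, 2)), ("entry-2", 56), ("entry-3", Fraction(55, 2))]
result = check_balance(entries, 89)
print(result.balanced, result.difference)  # True 0
```

Exact fractions matter here because tablet quantities often include fractional units, and floating-point sums could report a spurious mismatch.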

Documentation

Full documentation lives in the project wiki.

Roadmap

Shipped (through v0.3): the script-agnostic core and Linear A; the multi-provider AI layer and translation; and the deep Greek NLP track — Perseus-treebank lemmas/POS, a generalizing tagger and lemmatizer, the neural lemmatizer, LSJ glossing, a baseline dependency parser, and a CLTK benchmark harness. v0.4 adds Linear B (Mycenaean Greek: a Unicode-built sign inventory, transliteration, a Greek-reading bridge, and accounting; the full corpus is bring-your-own) and the Cypriot syllabary (Arcado-Cypriot Greek). v0.5 adds Cypro-Minoan (the undeciphered Bronze Age script of Cyprus; sign inventory only), completing the Aegean set, and a neutral out-of-AGDT evaluator (greek.evaluate_on_proiel, scoring against the PROIEL treebank) backing the Greek-NLP numbers. v0.6 adds the corpus data layer's lossless JSON round-trip (to_json/from_json) and a compound query(); v0.7 adds aegean.io export adapters — EpiDoc (TEI), CSV, and Parquet — completing the EpiDoc read+write round-trip. Next: a context-aware lemmatizer and the workbench bridge toward a stable v1.0.

License

Apache-2.0. Corpus data is GORILA (Godart & Olivier 1976–1985) via mwenge/lineara.xyz; facsimile imagery © École Française d'Athènes (referenced, not redistributed). The opt-in Greek backends fetch the Perseus AGDT treebank (CC BY-SA 3.0) and Perseus LSJ (CC BY-SA 4.0) to cache — built locally, never bundled or re-hosted. See NOTICE.

Download files

Source Distribution

pyaegean-0.7.0.tar.gz (235.7 kB)

Uploaded Source

Built Distribution

pyaegean-0.7.0-py3-none-any.whl (233.3 kB)

Uploaded Python 3

File details

Details for the file pyaegean-0.7.0.tar.gz.

File metadata

  • Download URL: pyaegean-0.7.0.tar.gz
  • Upload date:
  • Size: 235.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for pyaegean-0.7.0.tar.gz
Algorithm Hash digest
SHA256 61c8938c651658028750162f06e83d2894b4960cc34ff84e28e25a2419da9df3
MD5 e146d3bfa988ae5fb15f571ebc504323
BLAKE2b-256 08cf7aa893f4238127bf762e94a1b84e83617279944c726997e4ceefbfa3f3ab

Provenance

The following attestation bundles were made for pyaegean-0.7.0.tar.gz:

Publisher: release.yml on ryanpavlicek/pyaegean

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file pyaegean-0.7.0-py3-none-any.whl.

File metadata

  • Download URL: pyaegean-0.7.0-py3-none-any.whl
  • Upload date:
  • Size: 233.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for pyaegean-0.7.0-py3-none-any.whl
Algorithm Hash digest
SHA256 96bbcd0155c995378dce33243abbd5c775a2dedf00e218cee513346d70cac8b1
MD5 d37a623d0ebb7d01f0d6305143b732bb
BLAKE2b-256 5adac29facebed6b59383642488952be4afc53511c2fca4acadbea102b0c635e

Provenance

The following attestation bundles were made for pyaegean-0.7.0-py3-none-any.whl:

Publisher: release.yml on ryanpavlicek/pyaegean

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
