Aeneas library for ancient text restoration and attribution.

Contextualising ancient texts with generative neural networks

Yannis Assael1,*, Thea Sommerschield2,*, Alison Cooley3, Brendan Shillingford1, John Pavlopoulos4, Priyanka Suresh1, Bailey Herms5, Justin Grayston5, Benjamin Maynard5, Nicholas Dietrich1, Robbe Wulgaert6, Jonathan Prag7, Alex Mullen2, Shakir Mohamed1

1 Google DeepMind
2 University of Nottingham, UK
3 University of Warwick, UK
4 Athens University of Economics and Business, Greece
5 Google
6 Sint-Lievenscollege, Belgium
7 University of Oxford, UK

*Authors contributed equally to this work.


Citation

When using any of the source code or outputs of this project, please cite:

@article{asssome2025contextualising,
  title={Contextualising ancient texts with generative neural networks},
  author={Assael*, Yannis and Sommerschield*, Thea and Cooley, Alison and Pavlopoulos, John and Shillingford, Brendan and Herms, Bailey and Suresh, Priyanka and Maynard, Benjamin and Grayston, Justin and Wulgaert, Robbe and Prag, Jonathan and Mullen, Alex and Mohamed, Shakir},
  journal={Nature},
  volume={643},
  number={8073},
  year={2025},
  publisher={Nature Publishing}
}


Human history is born in writing. Inscriptions, among the earliest written forms, offer direct insights into the thought, language, and history of ancient civilisations. Historians capture these insights by identifying parallels - inscriptions with shared phrasing, function, or cultural setting - to enable the contextualisation of texts within broader historical frameworks, and perform key tasks such as restoration and geographical or chronological attribution. However, current digital methods are restricted to literal matches and narrow historical scopes. We introduce Aeneas, the first generative neural network for contextualising ancient texts. Aeneas retrieves textual and contextual parallels, leverages visual inputs, handles arbitrary-length text restoration, and advances the state-of-the-art in key tasks.

Figure: Restoration of a damaged inscription.
Fragment of a bronze military diploma from Sardinia, issued by the Emperor Trajan to a sailor on a warship. 113/14 CE (CIL XVI, 60, The Metropolitan Museum of Art, Public Domain).

To evaluate its impact, we conduct the largest Historian-AI study to date, with historians considering Aeneas’ retrieved parallels useful research starting points in 90% of cases, improving their confidence in key tasks by 44%. Restoration and geographical attribution tasks yielded superior results when historians were paired with Aeneas, outperforming both humans and AI alone. For dating, Aeneas achieved a 13-year distance from ground-truth ranges. We demonstrate Aeneas’ contribution to historical workflows through analysis of key traits in the Res Gestae Divi Augusti, the most renowned Roman inscription, showing how integrating Science and Humanities can create transformative tools to assist historians and advance our understanding of the past.
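As a rough illustration of what a "distance from ground-truth ranges" can mean, the sketch below assumes the metric is the number of years between a predicted date and the nearest bound of the ground-truth interval, and zero when the prediction falls inside the interval. This definition, the function name distance_to_range, and the predicted years are assumptions made for illustration; only the 113/14 CE range comes from the diploma caption above.

```python
def distance_to_range(predicted_year, date_min, date_max):
    """Years between a prediction and the nearest bound of a date range.

    Assumed definition (not taken from the Aeneas code): the distance is
    zero whenever the prediction lies inside [date_min, date_max].
    """
    if date_min > date_max:
        raise ValueError("date_min must not exceed date_max")
    if predicted_year < date_min:
        return date_min - predicted_year
    if predicted_year > date_max:
        return predicted_year - date_max
    return 0


# Ground-truth range of the military diploma: 113-114 CE.
# The predicted years below are hypothetical.
predictions = [120, 113, 100]
distances = [distance_to_range(year, 113, 114) for year in predictions]
mean_distance = sum(distances) / len(distances)
print(distances, mean_distance)
```

Under this assumed definition, a model is not penalised for any prediction inside the interval that historians have assigned, which matters because many inscriptions are dated only to a range.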

Figure: Aeneas model architecture.
Given the image and textual transcription of an inscription (with damaged sections of unknown length marked with the "#" character), Aeneas uses a transformer-based decoder, the "torso", to process the text. Specialised networks, called "heads", handle character restoration, date attribution, and geographical attribution (the latter also incorporating visual features). The torso's intermediate representations are merged into a unified, historically enriched embedding, which is used to retrieve similar inscriptions from the Latin Epigraphic Dataset (LED), ranked by relevance.
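The retrieval step can be pictured as a nearest-neighbour search over embeddings. The sketch below is not the Aeneas implementation: it assumes cosine similarity as the relevance score, and the inscription identifiers and three-dimensional vectors are made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_parallels(query, corpus, top_k=3):
    """Return the top_k (identifier, score) pairs, most similar first."""
    scored = [(key, cosine_similarity(query, emb)) for key, emb in corpus.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical embeddings; real ones would come from the model.
corpus = {
    "inscription_a": [1.0, 0.0, 0.0],
    "inscription_b": [0.7, 0.7, 0.0],
    "inscription_c": [0.0, 0.0, 1.0],
}
query = [0.9, 0.1, 0.0]

for identifier, score in rank_parallels(query, corpus):
    print(f"{identifier}: {score:.3f}")
```

Because the embedding is described as combining information from the restoration, dating, and geographical tasks, inscriptions that score highly under such a search may be related by period or region as well as by wording.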

Aeneas Inference Online

To aid further research in the field, we created an online interactive Python notebook, where researchers can query one of our trained models to get text restorations, visualise attention weights, and more.

Aeneas Inference Offline

Advanced users who want to run inference with the trained model themselves can use the predictingthepast library directly.

First, install the predictingthepast library and its dependencies. From the root of the source tree (for example, a clone of the repository), run:

pip install .

Then, download the model files.

Latin Model

curl --output aeneas_117149994_2.pkl \
    https://storage.googleapis.com/ithaca-resources/models/aeneas_117149994_2.pkl
curl --output led.json \
    https://storage.googleapis.com/ithaca-resources/models/led.json
curl --output led_emb_xid117149994.pkl \
    https://storage.googleapis.com/ithaca-resources/models/led_emb_xid117149994.pkl

Ancient Greek Model

curl --output ithaca_153143996_2.pkl \
    https://storage.googleapis.com/ithaca-resources/models/ithaca_153143996_2.pkl
curl --output iphi.json \
    https://storage.googleapis.com/ithaca-resources/models/iphi.json
curl --output iphi_emb_xid153143996.pkl \
    https://storage.googleapis.com/ithaca-resources/models/iphi_emb_xid153143996.pkl
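For scripted setups, the same downloads can be done from Python with the standard library. This is a minimal sketch: the URLs and file names are the ones listed above, while the helper names model_urls and download_models and the dictionary keys "latin" and "greek" are labels chosen here for illustration (they are not necessarily values accepted by the --language flag).

```python
import urllib.request
from pathlib import Path

BASE_URL = "https://storage.googleapis.com/ithaca-resources/models/"

# File names as listed in the curl commands above.
MODEL_FILES = {
    "latin": ["aeneas_117149994_2.pkl", "led.json", "led_emb_xid117149994.pkl"],
    "greek": ["ithaca_153143996_2.pkl", "iphi.json", "iphi_emb_xid153143996.pkl"],
}


def model_urls(language):
    """Return the download URLs for one model's files."""
    return [BASE_URL + name for name in MODEL_FILES[language]]


def download_models(language, dest_dir="."):
    """Download any missing files for a model and return their local paths."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    paths = []
    for url in model_urls(language):
        target = dest / url.rsplit("/", 1)[-1]
        if not target.exists():  # skip files that are already present
            urllib.request.urlretrieve(url, str(target))
        paths.append(target)
    return paths
```

Calling download_models("latin") would fetch the three Latin files into the current directory. The existence check only avoids repeated downloads; it does not detect a partially downloaded file.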

Inference Example

An example of using the library can be run via:

python inference_example.py \
    --input_file="example_input.txt" \
    --checkpoint_path="aeneas_117149994_2.pkl" \
    --dataset_path="led.json" \
    --retrieval_path="led_emb_xid117149994.pkl" \
    --language="latin"

This will run restoration and attribution on the text in example_input.txt.

To run it with different input text, use the --input argument:

python inference_example.py \
    --input="..." \
    --checkpoint_path="aeneas_117149994_2.pkl" \
    --dataset_path="led.json" \
    --retrieval_path="led_emb_xid117149994.pkl" \
    --language="latin"

Or use text in a UTF-8 encoded text file:

python inference_example.py \
    --input_file="some_other_input_file.txt" \
    --checkpoint_path="aeneas_117149994_2.pkl" \
    --dataset_path="led.json" \
    --retrieval_path="led_emb_xid117149994.pkl" \
    --language="latin"

The restoration and attribution results can each be saved to a JSON file:

python inference_example.py \
    --input_file="example_input.txt" \
    --checkpoint_path="aeneas_117149994_2.pkl" \
    --dataset_path="led.json" \
    --retrieval_path="led_emb_xid117149994.pkl" \
    --language="latin" \
    --attribute_json="attribute.json" \
    --restore_json="restore.json"
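The structure of the saved JSON is not documented here, so a cautious first step is to inspect a file before relying on particular fields. The sketch below assumes nothing about the schema; describe_json is a hypothetical helper that reports only the top-level keys and the types of their values.

```python
import json


def describe_json(path):
    """Map each top-level key of a JSON file to the type name of its value."""
    with open(path, encoding="utf-8") as handle:
        data = json.load(handle)
    if isinstance(data, dict):
        return {key: type(value).__name__ for key, value in data.items()}
    # Non-object documents (for example a list) have no keys to report.
    return {"<top level>": type(data).__name__}


# Example usage, once the command above has produced the files:
#   print(describe_json("restore.json"))
#   print(describe_json("attribute.json"))
```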

For full help, run:

python inference_example.py --help
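When inference needs to be launched from another Python program, the script can be wrapped with subprocess. The following sketch only uses the flags shown in the commands above; the helper names build_command and run_inference are hypothetical, and the check that exactly one of --input and --input_file is supplied is an assumption rather than documented behaviour.

```python
import subprocess
import sys


def build_command(checkpoint_path, dataset_path, retrieval_path, language,
                  input_text=None, input_file=None,
                  attribute_json=None, restore_json=None):
    """Assemble the argument list for inference_example.py."""
    # Assumption: exactly one of --input / --input_file is given per run.
    if (input_text is None) == (input_file is None):
        raise ValueError("pass exactly one of input_text or input_file")
    source = (f"--input={input_text}" if input_text is not None
              else f"--input_file={input_file}")
    command = [
        sys.executable, "inference_example.py", source,
        f"--checkpoint_path={checkpoint_path}",
        f"--dataset_path={dataset_path}",
        f"--retrieval_path={retrieval_path}",
        f"--language={language}",
    ]
    # The JSON output flags are optional.
    if attribute_json is not None:
        command.append(f"--attribute_json={attribute_json}")
    if restore_json is not None:
        command.append(f"--restore_json={restore_json}")
    return command


def run_inference(**options):
    """Run the script; raises CalledProcessError on a non-zero exit code."""
    return subprocess.run(build_command(**options), check=True)
```

Passing the arguments as a list, rather than as a single shell string, avoids quoting problems when the inscription text contains spaces or characters that are special to the shell.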

Dataset Generation

For Latin, Aeneas was trained on data from:

  • Epigraphic Database Roma (EDR) [1]: Made available pursuant to a Creative Commons Attribution 4.0 International License (CC-BY) on Zenodo. EDR is also available at edr-edr.it.
  • Epigraphic Database Heidelberg (EDH) [2]: Made available pursuant to a Creative Commons Attribution-ShareAlike 4.0 International License (CC-BY-SA) on Zenodo. EDH is also available at edh.ub.uni-heidelberg.de.
  • ETL repository for Epigraphic Database Clauss Slaby (EDCS_ETL) [3]: Made available pursuant to a Creative Commons Attribution 4.0 International License (CC-BY) on Zenodo. EDCS_ETL is also available at manfredclauss.de and github.com/sdam-au/EDCS_ETL.

For ancient Greek, Aeneas was trained on Searchable Greek Inscriptions of The Packard Humanities Institute. The processed version is available at: I.PHI dataset.

Training Aeneas

See train/README.md for instructions.

License & Disclaimer

Copyright 2025 Google LLC

All software is licensed under the Apache License, Version 2.0 (Apache 2.0); you may not use this file except in compliance with the Apache 2.0 license. You may obtain a copy of the Apache 2.0 license at: https://www.apache.org/licenses/LICENSE-2.0

All other materials are licensed under the Creative Commons Attribution 4.0 International License (CC-BY). You may obtain a copy of the CC-BY license at: https://creativecommons.org/licenses/by/4.0/legalcode

The dataset contains modified data from the Epigraphic Database Heidelberg dataset. That data is licensed under the Creative Commons Attribution-ShareAlike 4.0 International License (CC-BY-SA). You may obtain a copy of the CC-BY-SA license at: https://creativecommons.org/licenses/by-sa/4.0/legalcode.en

Unless required by applicable law or agreed to in writing, all software and materials distributed here under the Apache 2.0, CC-BY-SA or CC-BY licenses are distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the licenses for the specific language governing permissions and limitations under those licenses.

This is not an official Google product.

  1. Silvio Panciera, Giuseppe Camodeca, Giovanni Cecconi, Silvia Orlandi, Lanfranco Fabriani, & Silvia Evangelisti. (2019). EDR - Epigraphic Database Roma EpiDoc files [Data set]. Zenodo.

  2. James M.S. Cowey, Francisca Feraudi-Gruénais, Brigitte Gräf, Frank Grieshaber, Regine Klar, & Jonas Osnabrügge. (2019). Epigraphic Database Heidelberg EpiDoc files [Data set]. Zenodo.

  3. Heřmánková, P. (2022). EDCS (2.0) [Data set]. Zenodo.
