Translating Akkadian signs to transliteration using NLP algorithms
Akkademia
Akkademia is a tool for automatically transliterating Unicode cuneiform glyphs. It is written in Python and uses HMM, MEMM, and BiLSTM models to determine appropriate sign readings and segmentation.
We trained these algorithms on the RINAP corpora (Royal Inscriptions of the Neo-Assyrian Period), which are available here in JSON and XML/TEI formats thanks to the efforts of the Official Inscriptions of the Middle East in Antiquity (OIMEA) Munich Project of Karen Radner and Jamie Novotny, funded by the Alexander von Humboldt Foundation. We achieve accuracy rates of 89.5% with HMM, 94% with MEMM, and 96.7% with BiLSTM on the trained corpora. Our model can also be used on texts from other periods and genres, with varying levels of success.
Getting Started
Akkademia can be accessed in three different ways:
- Website
- Python package
- Github clone
The website and python package are meant to be accessible to people without advanced programming knowledge.
Website
Go to the Babylonian Engine website (under development)
Go to the "Akkademia" tab and follow the instructions there for transliterating your signs.
Python Package
Our python package "akkadian" will enable you to use Akkademia on your local machine.
Prerequisites
You will need Python 3.7.x installed. Our package currently does not work with other versions of Python. You can follow the installation instructions here or go straight to Python's downloads page and pick an appropriate version.
Mac comes preinstalled with Python 2.7, which may remain the default Python version even after installing 3.7.x. To check, type python --version into Terminal. If the running version is Python 2.7, the simplest short-term solution is to type python3 or pip3 in Terminal throughout, instead of python and pip as in the instructions below.
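The same check can be made from inside Python itself. The snippet below is a minimal sketch; the helper name is_supported_python is hypothetical and not part of the akkadian package.

```python
import sys


def is_supported_python(version_info=sys.version_info):
    """Return True when the interpreter is a 3.7.x release."""
    # Only the major and minor numbers matter for the 3.7.x requirement.
    return tuple(version_info[:2]) == (3, 7)


if __name__ == "__main__":
    running = ".".join(str(part) for part in sys.version_info[:3])
    if is_supported_python():
        print("Python " + running + " is supported")
    else:
        print("Python " + running + " found; the package expects 3.7.x")
```

Run it with the same command you plan to use for the package (python or python3) to see which interpreter that command starts.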
Package Installation
You can install the package using the pip install command. If you do not have pip installed on your computer, or you are not sure whether it is installed, you can follow the instructions here.
Before installing the package akkadian, you will need to install the torch package. For Windows, copy the following into Command Prompt (CMD):
pip install torch==1.0.0 torchvision==0.2.1 -f https://download.pytorch.org/whl/torch_stable.html
For Mac and Linux copy the following into Terminal:
pip install torch torchvision
Then, type the following in Command Prompt (Windows), or Terminal (Mac and Linux):
pip install akkadian
The installation should then run. It may take several minutes.
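To confirm that the installation finished, you can ask Python whether it can locate the packages without actually importing them. This is a minimal sketch using the standard library; the helper name missing_packages is hypothetical.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level package names that Python cannot locate."""
    # find_spec returns None for a top-level package that is not installed.
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(["torch", "akkadian"])
    if missing:
        print("Not installed yet: " + ", ".join(missing))
    else:
        print("torch and akkadian were found")
```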
Running
Open a Python IDE (integrated development environment) in which Python code can be run. There are many possible IDEs; see realpython's guide or wiki python's list. For beginners, we recommend using Jupyter Notebook: see downloading instructions here, or see downloading instructions and beginners' tutorial here.
First, import akkadian.transliterate into your coding environment:
import akkadian.transliterate as akk
Then, you can use HMM, MEMM, or BiLSTM to transliterate the signs. The functions are:
akk.transliterate_hmm("Unicode_signs_here")
akk.transliterate_memm("Unicode_signs_here")
akk.transliterate_bilstm("Unicode_signs_here")
akk.transliterate_bilstm_top3("Unicode_signs_here")
akk.transliterate_bilstm_top3 gives the top three BiLSTM options, while akk.transliterate_bilstm gives only the top one.
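Because the four functions share a naming pattern, the model can be chosen by name at run time. The following is a minimal sketch: transliterate_with is a hypothetical helper, not part of the package, and it only assumes that the module passed in exposes the four functions listed above.

```python
def transliterate_with(module, model, signs):
    """Call module.transliterate_<model>(signs) for one of the listed models."""
    allowed = ("hmm", "memm", "bilstm", "bilstm_top3")
    if model not in allowed:
        raise ValueError("model must be one of: " + ", ".join(allowed))
    # Look up the function by name, e.g. transliterate_bilstm.
    function = getattr(module, "transliterate_" + model)
    return function(signs)
```

With the package installed and imported as akk, transliterate_with(akk, "bilstm", signs) would be expected to behave like akk.transliterate_bilstm(signs).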
To see the results immediately, wrap the call to the transliteration function in print(). Here are some examples with their output:
print(akk.transliterate_hmm("𒃻𒅘𒁀𒄿𒈬𒊒𒅖𒁲𒈠𒀀𒋾"))
ša₂ nak-ba-i-mu-ru iš-di-ma-a-ti
print(akk.transliterate_memm("𒃻𒅘𒁀𒄿𒈬𒊒𒅖𒁲𒈠𒀀𒋾"))
ša₂ SILIM ba-i-mu-ru-iš-di-ma-a-ti
print(akk.transliterate_bilstm("𒃻𒅘𒁀𒄿𒈬𒊒𒅖𒁲𒈠𒀀𒋾"))
ša₂ nak-ba-i-mu-ru iš-di-ma-a-ti
print(akk.transliterate_bilstm_top3("𒃻𒅘𒁀𒄿𒈬𒊒𒅖𒁲𒈠𒀀𒋾"))
('ša₂ nak-ba-i-mu-ru iš-di-ma-a-ti ', 'ša₂-di-ba i mu ru-iš di ma tukul-tu ', 'MUN kis BA še-MU-šub-šah-ṭi-nab-nu-ti-')
This line was taken from the first line of the Epic of Gilgamesh: ša₂ naq-ba i-mu-ru iš-di ma-a-ti; "He who saw the Deep, the foundation of the country" (George, A.R. 2003. The Babylonian Gilgamesh Epic: Introduction, Critical Edition and Cuneiform Texts. 2 vols. Oxford: Oxford University Press). Although the algorithms were not trained on this text genre, they show promising, useful results.
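One rough way to compare a model's output with a published reading is to split both on hyphens and spaces and count the sign readings that agree position by position. The sketch below is an illustration only and is not the evaluation used to obtain the accuracy figures above; the function names are hypothetical. It ignores word segmentation and treats conventions such as nak versus naq as a mismatch.

```python
import re


def sign_readings(transliteration):
    """Split a transliteration into sign readings on hyphens and whitespace."""
    return [part for part in re.split(r"[-\s]+", transliteration) if part]


def reading_agreement(candidate, reference):
    """Fraction of positions at which the two sequences give the same reading."""
    cand = sign_readings(candidate)
    ref = sign_readings(reference)
    if not cand and not ref:
        return 1.0
    matches = sum(1 for c, r in zip(cand, ref) if c == r)
    # Divide by the longer sequence so extra or missing readings count as errors.
    return matches / max(len(cand), len(ref))


reference = "ša₂ naq-ba i-mu-ru iš-di ma-a-ti"
hmm_output = "ša₂ nak-ba-i-mu-ru iš-di-ma-a-ti"
print(reading_agreement(hmm_output, reference))
```

For the HMM output above, ten of the eleven readings agree with the reference; the only difference is nak for naq.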
Github
These instructions will get you a copy of the project up and running on your local machine for development and testing purposes.
Prerequisites
You will need Python 3.7.x installed. Our package currently does not work with other versions of Python. Go to Python's downloads page and pick an appropriate version.
If you don't have git installed, install git here (Choose the appropriate operating system).
If you don't have a Github user, create one here.
Installing the python dependencies
In order to run the code, you will need the torch and allennlp libraries. If you have already installed the package akkadian, these were installed on your computer and you can skip to the next step.
Install torch: For Windows, copy the following to Command Prompt
pip install torch===1.3.1 torchvision===0.4.2 -f https://download.pytorch.org/whl/torch_stable.html
for Mac and Linux, copy the following to Terminal
pip install torch torchvision
Install allennlp: copy the following into Command Prompt (Windows) or Terminal (Mac and Linux):
pip install allennlp==0.8.5
Cloning the project
Copy the following into Command Prompt (Windows) or Terminal (Mac and Linux) to clone the project:
git clone https://github.com/gaigutherz/Translating-Akkadian-using-NLP.git
Running
Now you can develop the Akkademia repository and add your improvements!
Training
Use the file train.py to train the models on the datasets. Each model has its own function, which trains the model, stores the resulting pickle, and tests the model's performance on the given corpora.
The functions are as follows:
hmm_train_and_test(corpora)
memm_train_and_test(corpora)
biLSTM_train_and_test(corpora)
Transliterating
Use the file transliterate.py to transliterate using the models. Each model has a function that takes Unicode cuneiform signs as a parameter and returns their transliteration.
Example of usage:
cuneiform_signs = "𒃻𒅘𒁀𒄿𒈬𒊒𒅖𒁲𒈠𒀀𒋾"
print(transliterate(cuneiform_signs))
print(transliterate_bilstm(cuneiform_signs))
print(transliterate_bilstm_top3(cuneiform_signs))
print(transliterate_hmm(cuneiform_signs))
print(transliterate_memm(cuneiform_signs))
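Since these functions expect Unicode cuneiform signs, it can help to check the input string before passing it on. The sketch below uses the code point ranges of the Unicode cuneiform blocks; the helper names are hypothetical, and whether the models tolerate whitespace or other characters in the input is an assumption that this check does not settle.

```python
# Code point ranges of the Unicode cuneiform blocks.
CUNEIFORM_RANGES = (
    (0x12000, 0x123FF),  # Cuneiform
    (0x12400, 0x1247F),  # Cuneiform Numbers and Punctuation
    (0x12480, 0x1254F),  # Early Dynastic Cuneiform
)


def is_cuneiform(char):
    """Return True if the character lies in one of the cuneiform blocks."""
    code = ord(char)
    return any(low <= code <= high for low, high in CUNEIFORM_RANGES)


def non_cuneiform_characters(text):
    """Return the characters that are neither cuneiform nor whitespace."""
    return [char for char in text
            if not char.isspace() and not is_cuneiform(char)]


# The first three signs of the example line, written as escapes.
line = "\U000120FB\U00012158\U00012040"
print(non_cuneiform_characters(line))
```

An empty list means every non-space character in the string is a cuneiform code point.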
Datasets
For training the algorithms, we used the RINAP corpora (Royal Inscriptions of the Neo-Assyrian Period), which are available in JSON and XML/TEI formats thanks to the efforts of the Humboldt Foundation-funded Official Inscriptions of the Middle East in Antiquity (OIMEA) Munich Project led by Karen Radner and Jamie Novotny, available here. The current output in our website, package and code is based on training done on these corpora alone.
For additional future training, we added the following corpora (in JSON file format) to the repository:
- RIAO
- RIBO
- SAAO
- SUHU
These corpora were all prepared by the Munich Open-access Cuneiform Corpus Initiative (MOCCI) and OIMEA project teams, both led by Karen Radner and Jamie Novotny, and are fully accessible for download in JSON or XML/TEI format in their respective project webpages (see left side-panel on project webpages and look for project-name downloads).
We also included a separate dataset which includes all the corpora in XML/TEI format.
Datasets deployment
All the datasets are taken from their respective project webpages (see left side-panel on project webpages and look for project-name downloads) and are fully accessible from there.
In our repository the datasets are located in the "raw_data" directory. They can also be downloaded from the Github repository using git clone or zip download.
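After cloning, the JSON files under raw_data can be listed with the standard library. This is a minimal sketch: the helper name list_json_files is hypothetical, and it assumes only that the corpora are stored as files ending in .json somewhere below raw_data, without relying on any particular subdirectory layout.

```python
from pathlib import Path


def list_json_files(root):
    """Return all .json files below root, sorted by path."""
    # rglob searches the directory and all of its subdirectories.
    return sorted(Path(root).rglob("*.json"))


if __name__ == "__main__":
    # Run from the top of the cloned repository.
    for path in list_json_files("raw_data"):
        print(path)
```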
Project structure
BiLSTM_input:
Contains dictionaries used for transliteration by BiLSTM.
NMT_input:
Contains dictionaries used for neural machine translation.
akkadian.egg-info:
Information and settings for akkadian python package.
akkadian:
Sources and train's output.
output: Train's output for HMM, MEMM and BiLSTM - mostly pickles.
__init__.py: Init script for akkadian python package. Initializes global variables.
bilstm.py: Class for BiLSTM train and prediction using AllenNLP implementation.
build_data.py: Code for organizing the data in dictionaries.
check_translation.py: Code for translation accuracy checking.
combine_algorithms.py: Code for prediction using HMM, MEMM and BiLSTM together.
data.py: Utils for accuracy checks and dictionaries interpretations.
full_translation_build_data.py: Code for organizing the data for full translation task.
get_texts_details.py: Util for getting more information about the text.
hmm.py: Implementation of HMM for train and prediction.
memm.py: Implementation of MEMM for train and prediction.
parse_json: JSON parsing used for data organizing.
parse_xml.py: XML parsing used for data organizing.
train.py: API for training all 3 algorithms and storing the output.
translation_tokenize.py: Code for tokenization of translation task.
transliterate.py: API for transliterating using all 3 algorithms.
build/lib/akkadian:
Information and settings for akkadian python package.
dist:
Akkadian python package - wheel and tar.
raw_data:
Databases used for training the models:
RINAP 1, 3-5
Additional databases for future training:
RIAO
RIBO
SAAO
SUHU
Miscellanea:
tei - the same databases (RINAP, RIAO, RIBO, SAAO, SUHU) in XML/TEI format.
random - 4 texts used for testing texts outside of the training corpora. They were randomly selected from RIAO and RIBO.
Licensing
This repository is made freely available under the Attribution-ShareAlike 3.0 Unported (CC BY-SA 3.0) license. This means you are free to share and adapt the code and datasets, provided that you cite the project appropriately, note any changes you have made to the original code and datasets, and release any redistribution of the project, or a part thereof, under the same license or a similar one.
For more information about the license, see here.
Issues and Bugs
If you are experiencing any issues with the website, the python package akkadian or the git repository, please contact us at dhl.arieluni@gmail.com, and we would gladly assist you. We would also much appreciate feedback about using the code via the website or the python package, or about the repository itself, so please send us any comments or suggestions.
Authors
- Gai Gutherz
- Ariel Elazary
- Avital Romach
- Shai Gordin