A package collecting various functions to work with ancient Mediterranean datasets (textual, spatial, etc.)
anda
pip install anda
This is a Python package for collecting, manipulating and visualizing various ancient Mediterranean data. It focuses on their temporal, textual and spatial aspects. It is structured into several gradually evolving submodules, namely gr, imda, concs, and textnet.
anda.gr
from anda import gr
This module is dedicated to the preprocessing of ancient Greek textual data. It contains functions for lemmatization, POS tagging and translation. It relies heavily on the Morpheus dictionary.
Lemmatization
A minimal use case is to lemmatize an individual word. You can ask either for only the first lemma (return_first_lemma()) or for all possibilities (return_all_unique_lemmata()). In most cases, the outcome is the same:
gr.return_first_lemma("ἐπιστήμην")
> 'ἐπιστήμη'
gr.return_all_unique_lemmata("ἐπιστήμην")
> 'ἐπιστήμη'
Built on top of these are the functions gr.lemmatize_string() and gr.get_lemmatized_sentences(). Both work with a string of any length. The first returns a list of lemmata; the second returns a list of lemmatized sentences.
string = "Πρότασις μὲν οὖν ἐστὶ λόγος καταφατικὸς ἢ ἀποφατικὸς τινὸς κατά τινος. Οὗτος δὲ ἢ καθόλου ἢ ἐν μέρει ἢ ἀδιόριστος. Λέγω δὲ καθόλου μὲν τὸ παντὶ ἢ μηδενὶ ὑπάρχειν, ἐν μέρει δὲ τὸ τινὶ ἢ μὴ τινὶ ἢ μὴ παντὶ ὑπάρχειν, ἀδιόριστον δὲ τὸ ὑπάρχειν ἢ μὴ ὑπάρχειν ἄνευ τοῦ καθόλου, ἢ κατὰ μέρος, οἷον τὸ τῶν ἐναντίων εἶναι τὴν αὐτὴν ἐπιστήμην ἢ τὸ τὴν ἡδονὴν μὴ εἶναι ἀγαθόν."
gr.lemmatize_string(string)
> ['πρότασις', 'λόγος', 'καταφατικός', 'ἀποφατικός', 'καθόλου', 'μέρος', 'ἀδιόριστος', 'λέγω', 'καθόλου', 'πᾶς', 'μηδείς', 'ὑπάρχω', 'μέρος', 'πᾶς', 'ὑπάρχω', 'ἀδιόριστον', 'ὑπάρχω', 'ὑπάρχω', 'ἄνευ', 'καθόλου', 'μέρος', 'οἷος', 'ἐναντίος', 'αὐτην', 'ἐπιστήμη', 'ἡδονην', 'ἀγαθός']
gr.get_lemmatized_sentences(string)
> [['πρότασις', 'λόγος', 'καταφατικός', 'ἀποφατικός'], ['καθόλου', 'μέρος', 'ἀδιόριστος'], ['λέγω', 'καθόλου', 'πᾶς', 'μηδείς', 'ὑπάρχω', 'μέρος', 'πᾶς', 'ὑπάρχω', 'ἀδιόριστον', 'ὑπάρχω', 'ὑπάρχω', 'ἄνευ', 'καθόλου', 'μέρος', 'οἷος', 'ἐναντίος', 'αὐτην', 'ἐπιστήμη', 'ἡδονην', 'ἀγαθός']]
All lemmatization functions can be further parametrized by several arguments:
- all_lemmata=False: returns only the first lemma of each word; if True, it returns all unique lemmata
- filter_by_postag=["n","a","v"]: returns only nouns ("n"), adjectives ("a") and verbs ("v")
- involve_unknown=True: if False, it returns only words found in the dictionary
Thus, you can run:
lemmatized_sentences = gr.get_lemmatized_sentences(string, all_lemmata=False, filter_by_postag=["n","a","v"], involve_unknown=False)
print(lemmatized_sentences)
> [['λόγος'], ['μέρος'], ['πᾶς', 'μηδείς', 'ὑπάρχω', 'μέρος', 'πᾶς', 'ὑπάρχω', 'ὑπάρχω', 'ὑπάρχω', 'ἄνω/ἀνίημι', 'μέρος', 'οἷος', 'ἐναντίος', 'ἐπιστήμη', 'ἀγαθός']]
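The nested list returned by get_lemmatized_sentences() is easy to post-process with the standard library. As a minimal sketch, the snippet below copies the output shown above into a plain Python literal (so it runs without anda installed), flattens it, and counts lemma frequencies. The variable names are illustrative only.

```python
from collections import Counter
from itertools import chain

# Output of gr.get_lemmatized_sentences(...) copied from the example above,
# so that this sketch needs neither the package nor its dictionary.
lemmatized_sentences = [
    ["λόγος"],
    ["μέρος"],
    ["πᾶς", "μηδείς", "ὑπάρχω", "μέρος", "πᾶς", "ὑπάρχω", "ὑπάρχω", "ὑπάρχω",
     "ἄνω/ἀνίημι", "μέρος", "οἷος", "ἐναντίος", "ἐπιστήμη", "ἀγαθός"],
]

# Flatten the sentence structure into one list of lemmata.
flat_lemmata = list(chain.from_iterable(lemmatized_sentences))

# Count how often each lemma occurs in the whole passage.
lemma_counts = Counter(flat_lemmata)

# Show the three most frequent lemmata with their counts.
for lemma, count in lemma_counts.most_common(3):
    print(lemma, count)
```

For this passage the most frequent lemma is ὑπάρχω (4 occurrences), followed by μέρος (3) and πᾶς (2).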
get_lemmatized_sentences(string, all_lemmata=False, filter_by_postag=None, involve_unknown=False) receives a raw Greek text of any kind and length as its input. The input is processed by a series of functions nested within each other, each of which can also be used independently:
(1) get_sentences() splits the string into sentences by common sentence separators.
(2) lemmatize_string(sentence) first calls tokenize_string(), which performs basic cleaning and stopword filtering on the sentence and returns a list of words. Subsequently, each word from the tokenized sentence is sent either to return_first_lemma() or to return_all_unique_lemmata(), depending on the value of the parameter all_lemmata= (set to False by default).
(4) return_all_unique_lemmata() looks the word up in the morpheus_dict values and returns all unique lemmata.
(5) The parameter filter_by_postag= (default None) makes it possible to sub-select chosen word types from the tokens, on the basis of the first character of the tag "p". Thus, to choose only nouns, adjectives, and verbs, you can set filter_by_postag=["n", "a", "v"].
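The nesting described above can be sketched with a toy dictionary. This is not the package's implementation: TOY_DICT, TOY_STOPWORDS and every toy_* function are hypothetical stand-ins, the noun tags are illustrative, and both the "/" separator for multiple lemmata and the pass-through of unknown words are assumptions suggested by the example outputs above.

```python
import re
import unicodedata


def nfc(text):
    """Normalize to NFC so that visually identical Greek strings compare equal."""
    return unicodedata.normalize("NFC", text)


# Hypothetical stand-in for the dictionary: word form -> list of analyses.
# Each analysis has a lemma ("l") and a tag ("p"); the noun tags are illustrative.
TOY_DICT = {
    nfc("λόγος"): [{"l": nfc("λόγος"), "p": "n-s---mn-"}],
    nfc("ἐστί"): [{"l": nfc("εἰμί"), "p": "v3spia---"}],
    nfc("ἐπιστήμην"): [{"l": nfc("ἐπιστήμη"), "p": "n-s---fa-"}],
}
TOY_STOPWORDS = {nfc("οὖν")}


def toy_get_sentences(text):
    """Split on a full stop, a semicolon/Greek question mark, or an ano teleia."""
    parts = re.split(r"[.;\u037e\u00b7\u0387]", text)
    return [p.strip() for p in parts if p.strip()]


def toy_tokenize(sentence):
    """Lowercase, keep only runs of word characters, and drop stopwords."""
    words = re.findall(r"\w+", nfc(sentence.lower()))
    return [w for w in words if w not in TOY_STOPWORDS]


def toy_lemmatize_word(word, all_lemmata=False, filter_by_postag=None):
    """Return a lemma for `word`, or None if it is unknown or filtered out."""
    analyses = TOY_DICT.get(word, [])
    if filter_by_postag is not None:
        # Keep analyses whose tag starts with one of the requested letters.
        analyses = [a for a in analyses if a["p"][0] in filter_by_postag]
    lemmata = list(dict.fromkeys(a["l"] for a in analyses))  # unique, in order
    if not lemmata:
        return None
    return "/".join(lemmata) if all_lemmata else lemmata[0]


def toy_lemmatize_sentences(text, all_lemmata=False, filter_by_postag=None,
                            involve_unknown=True):
    """Chain sentence splitting, tokenization and lemma lookup."""
    result = []
    for sentence in toy_get_sentences(text):
        lemmata = []
        for word in toy_tokenize(sentence):
            lemma = toy_lemmatize_word(word, all_lemmata, filter_by_postag)
            if lemma is not None:
                lemmata.append(lemma)
            elif involve_unknown and word not in TOY_DICT:
                lemmata.append(word)  # unknown word form passes through as is
        result.append(lemmata)
    return result


text = "λόγος οὖν ἐστί. ἐπιστήμην λέγω."  # toy input, not a real passage
print(toy_lemmatize_sentences(text))
print(toy_lemmatize_sentences(text, filter_by_postag=["n"], involve_unknown=False))
```

In this sketch the first call keeps the unknown form λέγω unchanged, while the second call keeps only the two nouns, because the verb is removed by the tag filter and the unknown word by involve_unknown=False.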
Translation
Next to lemmatization, there is also a series of functions for translation, such as return_all_unique_translations(word, filter_by_postag=None, involve_unknown=False), which works for any word form, and lemma_translator(word), for cases where you already have a lemma.
gr.return_all_unique_translations("ὑπάρχειν", filter_by_postag=None, involve_unknown=False)
> 'to begin, make a beginning'
gr.lemma_translator("λόγος")
> 'the word'
Morphological analysis
You can also run a morphological analysis of a string:
gr.morphological_analysis(string)[1:4]
> [{'i': '564347',
'f': 'μέν',
'b': 'μεν',
'l': 'μέν',
'e': 'μεν',
'p': 'g--------',
'd': '20753',
's': 'on the one hand, on the other hand',
'a': None},
{'i': '642363',
'f': 'οὖν',
'b': 'ουν',
'l': 'οὖν',
'e': 'ουν',
'p': 'g--------',
'd': '23870',
's': 'really, at all events',
'a': None},
{'i': '264221',
'f': 'ἐστί',
'b': 'εστι',
'l': 'εἰμί',
'e': 'ειμι',
'p': 'v3spia---',
'd': '9722',
's': 'I have',
'a': None}]
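Each item returned by morphological_analysis() is a plain dictionary, so the result can be regrouped with ordinary Python. The sketch below reuses the three records shown above, reduced to the keys "f", "l" and "p", and groups the lemmata by the first character of the tag under "p". Reading "f" as the word form and "l" as the lemma is an inference from the example output, not documented behavior.

```python
from collections import defaultdict

# Records copied from the example output above, reduced to three keys.
records = [
    {"f": "μέν", "l": "μέν", "p": "g--------"},
    {"f": "οὖν", "l": "οὖν", "p": "g--------"},
    {"f": "ἐστί", "l": "εἰμί", "p": "v3spia---"},
]

# Group lemmata by the first character of the tag.
lemmata_by_pos = defaultdict(list)
for record in records:
    lemmata_by_pos[record["p"][0]].append(record["l"])

# Select only the records whose tag starts with "v" (verbs).
verbs = [(r["f"], r["l"]) for r in records if r["p"].startswith("v")]

print(dict(lemmata_by_pos))
print(verbs)
```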
imda
This module will serve for importing various ancient Mediterranean resources. Most of them will be imported directly from open third-party online resources. However, some of them have been preprocessed as part of the SDAM project.
The intended usage looks like this:
imda.list_datasets()
> ['roman_provinces_117', 'EDH', 'roman_cities_hanson', 'orbis_network']
And:
rp = imda.import_dataset("roman_provinces_117", "gdf")
type(rp)
> geopandas.geodataframe.GeoDataFrame
concs
This module contains functions for working
textnet
This module contains functions for generating, analyzing and visualizing word co-occurrence networks. It has been designed especially for working with textual data in ancient Greek.
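To illustrate what a word co-occurrence network is, the following sketch counts how often two lemmata appear in the same sentence and treats each pair as a weighted edge. It uses only the standard library and an excerpt of the lemmatized sentences from the anda.gr section. It does not use or reproduce the textnet API, and the function name cooccurrence_edges is hypothetical.

```python
from collections import Counter
from itertools import combinations

# Excerpt of the lemmatized sentences from the anda.gr example
# (the third sentence is shortened to its first six lemmata).
lemmatized_sentences = [
    ["πρότασις", "λόγος", "καταφατικός", "ἀποφατικός"],
    ["καθόλου", "μέρος", "ἀδιόριστος"],
    ["λέγω", "καθόλου", "πᾶς", "μηδείς", "ὑπάρχω", "μέρος"],
]


def cooccurrence_edges(sentences):
    """Count, for every unordered pair of lemmata, the sentences containing both."""
    edges = Counter()
    for sentence in sentences:
        # sorted(set(...)) removes repeats and gives each pair a canonical order.
        for pair in combinations(sorted(set(sentence)), 2):
            edges[pair] += 1
    return edges


edges = cooccurrence_edges(lemmatized_sentences)

# Print the strongest edges first.
for (source, target), weight in edges.most_common(3):
    print(source, target, weight)
```

The resulting Counter maps lemma pairs to edge weights and could be handed to a graph library for analysis or drawing. In this excerpt only the pair καθόλου and μέρος shares two sentences, so it is the single edge with weight 2.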
Version history
- 0.0.1 - initial installation