TEXTA Multilingual Processor (MLP)

Project description

TEXTA MLP Python package

http://pypi.texta.ee/texta-mlp/

Installation

Requirements

apt-get install python3-lxml

From PyPI

pip3 install texta-mlp

From Git

pip3 install git+https://git.texta.ee/texta/texta-mlp-python.git

Testing

python3 -m pytest -v tests

Usage

Load MLP

Supported languages: https://stanzanlp.github.io/stanzanlp/models.html

>>> from texta_mlp.mlp import MLP
>>> mlp = MLP(language_codes=["et","en","ru"])

Process & Lemmatize Estonian

>>> mlp.process("Selle eestikeelse lausega võiks midagi ehk öelda.")
{'text': {'text': 'Selle eestikeelse lausega võiks midagi ehk öelda .', 'lang': 'et', 'lemmas': 'see eestikeelne lause võima miski ehk ütlema .', 'pos_tags': 'P A S V P J V Z'}, 'texta_facts': []}
>>>
>>> mlp.lemmatize("Selle eestikeelse lausega võiks midagi ehk öelda.")
'see eestikeelne lause võima miski ehk ütlema .'
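
The "text", "lemmas" and "pos_tags" strings in the output above appear to be space-separated and aligned token by token. Assuming that holds in general (it does for this sample), a minimal sketch for pairing each token with its lemma and POS tag could look like this. The helper name align_tokens is hypothetical and not part of texta-mlp.

```python
def align_tokens(document):
    """Pair each token with its lemma and POS tag.

    Assumes the three strings are space-separated and have the same
    number of items, as in the sample output shown above.
    """
    fields = document["text"]
    tokens = fields["text"].split()
    lemmas = fields["lemmas"].split()
    pos_tags = fields["pos_tags"].split()
    if not (len(tokens) == len(lemmas) == len(pos_tags)):
        raise ValueError("token, lemma and POS tag counts differ")
    return list(zip(tokens, lemmas, pos_tags))


# Output of mlp.process() copied from the Estonian example above.
sample = {
    "text": {
        "text": "Selle eestikeelse lausega võiks midagi ehk öelda .",
        "lang": "et",
        "lemmas": "see eestikeelne lause võima miski ehk ütlema .",
        "pos_tags": "P A S V P J V Z",
    },
    "texta_facts": [],
}

for token, lemma, pos in align_tokens(sample):
    print(token, lemma, pos)
```

The length check guards against inputs where the three fields do not line up, in which case the pairing would be meaningless.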

Use the "analyzers" argument to limit which analyzers are run and which fields are returned, which speeds up processing. Accepted options are: ["lemmas", "pos_tags", "transliteration", "ner", "contacts", "entity_mapper", "all"], where "all" runs every analyzer (and takes the most time). The default value is "all".

>>> mlp.process("Selle eestikeelse lausega võiks midagi ehk öelda.", analyzers=["lemmas", "pos_tags"])
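
Because analyzer names are plain strings, a misspelled name is easy to miss. The following is a small sketch of a pre-flight check built only from the list of accepted options above. The names ACCEPTED_ANALYZERS and check_analyzers are hypothetical helpers, not part of texta-mlp, and how the library itself reacts to an unknown name is not covered here.

```python
# Accepted options as listed in this document.
ACCEPTED_ANALYZERS = {
    "lemmas",
    "pos_tags",
    "transliteration",
    "ner",
    "contacts",
    "entity_mapper",
    "all",
}


def check_analyzers(analyzers):
    """Return the analyzers as a list, or raise ValueError on unknown names."""
    unknown = sorted(set(analyzers) - ACCEPTED_ANALYZERS)
    if unknown:
        raise ValueError("Unknown analyzers: " + ", ".join(unknown))
    return list(analyzers)


print(check_analyzers(["lemmas", "pos_tags"]))
```

The checked list can then be passed on, for example as mlp.process(text, analyzers=check_analyzers(["lemmas", "pos_tags"])).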

Process & Lemmatize Russian

>>> mlp.process("Лукашенко заявил о договоренности Москвы и Минска по нефти.")
{'text': {'text': 'Лукашенко заявил о договоренности Москвы и Минска по нефти .', 'lang': 'ru', 'lemmas': 'лукашенко заявить о договоренность москва и минск по нефть .', 'pos_tags': 'X X X X X X X X X X', 'transliteration': 'Lukašenko zajavil o dogovorennosti Moskvõ i Minska po nefti .'}, 'texta_facts': []}
>>>
>>> mlp.lemmatize("Лукашенко заявил о договоренности Москвы и Минска по нефти.")
'лукашенко заявить о договоренность москва и минск по нефть .'

Process & Lemmatize English

>>> mlp.process("Test sencences are rather difficult to come up with.")
{'text': {'text': 'Test sencences are rather difficult to come up with .', 'lang': 'en', 'lemmas': 'Test sencence be rather difficult to come up with .', 'pos_tags': 'NN NNS VBP RB JJ TO VB RB IN .'}, 'texta_facts': []}
>>>
>>> mlp.lemmatize("Test sencences are rather difficult to come up with.")
'Test sencence be rather difficult to come up with .'

Make MLP Throw an Exception on Unknown Languages

By default, MLP falls back to Estonian if the detected language is not supported. To raise an exception instead, pass use_default_language_code=False when initializing MLP.

>>> mlp.process("المادة 1 يولد جميع الناس أحرارًا متساوين في الكرامة والحقوق. وقد وهبوا عقلاً وضميرًا وعليهم أن يعامل بعضهم بعضًا بروح الإخاء.")
{'text': {'text': 'المادة 1 يولد جميع الناس أحرارًا متساوين في الكرامة والحقوق . وقد وهبوا عقلاً وضميرًا وعليهم أن يعامل بعضهم بعضًا بروح الإخاء .', 'lang': 'et', 'lemmas': 'lee 1 يولد جميع الناس leele leele في leele leele . وقد وهبوا عقلاً leele lee أن يعامل بعضهم بعضًا بروح lee .', 'pos_tags': 'S N S S S S S S S S Z S S S S S S S S Y Y Y Z'}, 'texta_facts': []}
>>>
>>> mlp = MLP(language_codes=["et","en","ru"], use_default_language_code=False)
>>> mlp.process("المادة 1 يولد جميع الناس أحرارًا متساوين في الكرامة والحقوق. وقد وهبوا عقلاً وضميرًا وعليهم أن يعامل بعضهم بعضًا بروح الإخاء.")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/rsirel/dev/texta-mlp-package/texta_mlp/mlp.py", line 150, in process
    document = self.generate_document(raw_text, loaded_analyzers)
  File "/home/rsirel/dev/texta-mlp-package/texta_mlp/mlp.py", line 96, in generate_document
    lang = self.detect_language(processed_text)
  File "/home/rsirel/dev/texta-mlp-package/texta_mlp/mlp.py", line 89, in detect_language
    raise LanguageNotSupported("Detected language is not supported: {}.".format(lang))
texta_mlp.exceptions.LanguageNotSupported: Detected language is not supported: ar.
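
When processing many texts with use_default_language_code=False, it can be convenient to skip unsupported texts instead of stopping at the first one. The sketch below catches the LanguageNotSupported exception named in the traceback above. The helper process_supported is hypothetical, and the stand-in exception class is only defined so the sketch can run where texta-mlp is not installed.

```python
try:
    from texta_mlp.exceptions import LanguageNotSupported
except ImportError:
    class LanguageNotSupported(Exception):
        """Stand-in used only when texta-mlp is not installed."""


def process_supported(mlp, texts):
    """Process texts with mlp, collecting those in unsupported languages.

    mlp is any object with a process(text) method, for example
    MLP(language_codes=[...], use_default_language_code=False).
    """
    processed, skipped = [], []
    for text in texts:
        try:
            processed.append(mlp.process(text))
        except LanguageNotSupported:
            skipped.append(text)
    return processed, skipped
```

The function returns two lists, so the skipped texts can be inspected or reprocessed later with additional languages loaded.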

Change Default Language Code

To use some other language as the default, pass default_language_code when initializing MLP.

>>> mlp = MLP(language_codes=["et", "en", "ru"], default_language_code="en")
>>>
>>> mlp.process("المادة 1 يولد جميع الناس أحرارًا متساوين في الكرامة والحقوق. وقد وهبوا عقلاً وضميرًا وعليهم أن يعامل بعضهم بعضًا بروح الإخاء.")
{'text': {'text': 'المادة 1 يولد جميع الناس أحرارًا متساوين في الكرامة والحقوق . وقد وهبوا عقلاً وضميرًا وعليهم أن يعامل بعضهم بعضًا بروح الإخاء .', 'lang': 'en', 'lemmas': 'المادة 1 يولد جميع الناس أحرارًا متساوين في الكرامة والحقوق . وقد وهبوا عقلاً وضميرًا وعليهم أن يعامل بعضهم بعضًا بروح الإخاء .', 'pos_tags': 'NN CD , NN NN NN NN IN NN NN . UH NN NN NN NN NN NN NN NN NN NN .'}, 'texta_facts': []}

Process Arabic (for real this time)

>>> mlp = MLP(language_codes=["et","en","ru", "ar"])
>>> mlp.process("المادة 1 يولد جميع الناس أحرارًا متساوين في الكرامة والحقوق. وقد وهبوا عقلاً وضميرًا وعليهم أن يعامل بعضهم بعضًا بروح الإخاء.")
{'text': {'text': 'المادة 1 يولد جميع الناس أحرارًا متساوين في الكرامة والحقوق . وقد وهبوا عقلاً وضميرًا وعليهم أن يعامل بعضهم بعضا بروح الإخاء .', 'lang': 'ar', 'lemmas': 'مَادَّة 1 وَلَّد جَمِيع إِنسَان حَرَر مُتَسَاوِي فِي كَرَامَة والحقوق . وَقَد وَ عَقَل وضميراً وعليهم أَنَّ يعامل بعضهم بَعض بروح إِخَاء .', 'pos_tags': 'N------S1D Q--------- VIIA-3MS-- N------S4R N------P2D N------P4I A-----MP4I P--------- N------S2D U--------- G--------- U--------- VP-A-3MP-- N------S4I A-----MS4I U--------- C--------- VISA-3MS-- U--------- N------S4I U--------- N------S2D G---------', 'transliteration': "AlmAdp 1 ywld jmyE AlnAs >HrArFA mtsAwyn fy AlkrAmp wAlHqwq . wqd whbwA EqlAF wDmyrFA wElyhm >n yEAml bEDhm bEDA brwH Al<xA' ."}, 'texta_facts': []}
>>>
>>> mlp.lemmatize("المادة 1 يولد جميع الناس أحرارًا متساوين في الكرامة والحقوق. وقد وهبوا عقلاً وضميرًا وعليهم أن يعامل بعضهم بعضا بروح الإخاء.")
'مَادَّة 1 وَلَّد جَمِيع إِنسَان حَرَر مُتَسَاوِي فِي كَرَامَة والحقوق . وَقَد وَ عَقَل وضميراً وعليهم أَنَّ يعامل بعضهم بَعض بروح إِخَاء .'

Load MLP with Custom Resource Path

>>> mlp = MLP(language_codes=["et","en","ru"], resource_dir="/home/kalevipoeg/mlp_resources/")

Concatenate close entities

Let's test MLP() and Concatenator() on the following three letters. Letter 1:

Dear all, 

Let`s not forget that I intend to concure the whole of Persian Empire!

Best wishes,
Alexander Great
aleksandersuur356eKr@mail.ee
phone: 76883266

Letter 2:

От: Terry Pratchett < tpratchett@gmail.com >
Кому: Joe Abercrombie < jabercrombie@gmail.com >
Название: Разъяснение

Дорогой Joe,

Как вы? Надеюсь, у тебя все хорошо. Последний месяц я писал свой новый роман, 
который обещал представить в начале лета. Я тоже немного почитал и обожаю твою 
новую книгу!

Я просто хотел уточнить, что Alexander Great жил в Македонии.

Лучший,
Terry

Letter 3:

Dear Terry!

Terry Pratchett already created Discworld. This name is taken. Other than that I found 
the piece fascanating and see great potential in you! I strongly encourage you to take 
action in publishing your works. Btw, if you would like to show your works to Pratchett 
as well, he`s interested. I talked about you to him. His email is tpratchett@gmail.com. 
Feel free to write him!

Joe


From: Terry Berry < bigfan@gmail.com >
To: Joe Abercrombie < jabercrombie@gmail.com >
Title: Question

Hi Joe,

I finally finished my draft and I`m sending it to you. The hardest part 
was creating new places. What do you think of the names of the places I created?

Terry Berry

Let's read all those letters into a list called "mailbox". We will process the letters as described above and save them into a jsonlines file.
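
How the letters end up in "mailbox" is not specified here. One possible way, assuming each letter is stored as a separate UTF-8 .txt file in a single directory, is sketched below. The function read_mailbox and the directory name "letters" are hypothetical.

```python
from pathlib import Path


def read_mailbox(directory):
    """Return the text of every .txt file in directory, sorted by file name."""
    paths = sorted(Path(directory).glob("*.txt"))
    return [path.read_text(encoding="utf-8") for path in paths]


# Example (hypothetical directory):
# mailbox = read_mailbox("letters")
```

Sorting by file name keeps the order of the letters stable between runs.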

import jsonlines
from texta_mlp.mlp import MLP

mlp = MLP(language_codes=["et","en","ru"])

# Process every letter in the mailbox.
processed_letters = [mlp.process(letter) for letter in mailbox]

# Save the processed letters, one JSON document per line.
with jsonlines.open("letters.jsonl", mode="w") as writer:
    writer.write_all(processed_letters)

MLP() already creates a BOUNDED fact, which binds together the entities that appear closest to each other within a letter. To organize the information across the whole mailbox, we have to concatenate the BOUNDED facts, that is, build a database of personal information gathered from the different letters. For that we use Concatenator(), whose input is the processed letters.

from texta_mlp.concatenator import Concatenator

cn = Concatenator()
cn.load_bounded_from_jsonl(path="letters.jsonl")
# cn.load_bounded_from_jsonl() uses default path "mlpanalyzed.jsonl"

Then we will concatenate the BOUNDED facts. Be aware that with large mailboxes it might take 2 hours!

cn.concatenate()

We can inspect the length and content of the database lists with the following functions:

cn._just_pers_infos() (type "close_persons", persons appearing close to each other in the letter(s)),
cn._bounded() (the original unconcatenated BOUNDED facts),
cn._unsure_infos() (type "unsure_whose_entities", entities that have >=2 candidate persons, so it is not clear to whom they belong),
cn._no_personas_infos() (type "no_per_close_entities", entities appearing close to each other in the letter(s) without persons nearby),
cn._persona_infos() (type "person_info", the real deal, entities together with their person).

All of that can be saved to a .jsonl file.

cn.save_to_jsonlines(path="concatenated_bounds_from_mailbox.jsonl")
#cn.save_to_jsonlines() uses default path "concatenated_bounds.jsonl"

Output of "concatenated_bounds_from_mailbox.jsonl":

{"type": "person_info", "PER": "Alexander Great", "LOC": ["Македония", "Persian Empire"], "EMAIL": ["aleksandersuur356eKr@mail.ee"], "PHONE": ["76883266"]}
{"type": "person_info", "PER": "Joe Abercrombie", "EMAIL": ["jabercrombie@gmail.com"]}
{"type": "person_info", "PER": "Terry Berry", "EMAIL": ["bigfan@gmail.com"]}
{"type": "person_info", "PER": "Terry Pratchett", "EMAIL": ["tpratchett@gmail.com"]}
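
The saved file holds one JSON object per line, so it can be read back with the standard library alone. The sketch below indexes the "person_info" records by person name. The helper load_person_infos is hypothetical, and the sample lines are copied from the output above.

```python
import json


def load_person_infos(lines):
    """Index "person_info" records by the value of their "PER" field."""
    people = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        if record.get("type") == "person_info":
            people[record["PER"]] = record
    return people


sample_lines = [
    '{"type": "person_info", "PER": "Alexander Great", "LOC": ["Македония", "Persian Empire"], "EMAIL": ["aleksandersuur356eKr@mail.ee"], "PHONE": ["76883266"]}',
    '{"type": "person_info", "PER": "Joe Abercrombie", "EMAIL": ["jabercrombie@gmail.com"]}',
]

people = load_person_infos(sample_lines)
print(people["Alexander Great"]["EMAIL"])
```

Since the function accepts any iterable of lines, an open file handle for "concatenated_bounds_from_mailbox.jsonl" can be passed in directly.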

Dealing with Elasticsearch

We can also use Elasticsearch with Concatenator(). Here's a snippet that fetches documents already processed by MLP() from Elasticsearch, concatenates them, and uploads the result to a new index.

from texta_mlp.concatenator import Concatenator
cn = Concatenator()
cn.load_bounded_from_elastic(es_url= 'http://localhost:8888', index_name = "mlp_processed_mails")
cn.concatenate()
cn.save_to_elasticsearch(es_url = 'http://localhost:8888', index_name = "mails_concatenated_bounded")

Calling cn.load_bounded_from_elastic() without arguments uses the default settings:

cn.load_bounded_from_elastic(es_url= 'http://elastic-dev.texta.ee:9200', index_name = "mlp_processed_mails")

Calling cn.save_to_elasticsearch() without arguments uses the default settings:

cn.save_to_elasticsearch(es_url = 'http://elastic-dev.texta.ee:9200', index_name = "concatenated_BOUNDED")

Project details

Release history

Version	Released
1.22.0	Sep 18, 2024
1.21.0	Nov 29, 2023
1.20.0	Nov 15, 2023
1.19.0	Nov 6, 2023
1.18.0	Jun 1, 2023
1.17.3	Feb 6, 2023
1.17.2	Sep 15, 2022
1.17.1	Sep 8, 2022
1.17.0	Aug 16, 2022
1.16.0	May 25, 2022
1.15.7	Apr 20, 2022
1.15.6	Apr 11, 2022
1.15.5	Feb 2, 2022
1.15.4	Oct 25, 2021
1.15.3	Oct 22, 2021
1.15.2	Oct 22, 2021
1.15.1	Oct 22, 2021
1.15.0	Oct 18, 2021
1.14.5	Oct 14, 2021
1.14.4	Oct 14, 2021
1.14.3	Oct 12, 2021
1.14.2	Oct 11, 2021
1.14.1	Oct 8, 2021
1.14.0	Oct 7, 2021
1.13.0	Sep 29, 2021
1.12.1	Sep 28, 2021
1.12.0	Sep 28, 2021
1.11.6	Sep 27, 2021
1.11.5	Sep 15, 2021
1.11.4	Aug 12, 2021
1.11.3	Jun 29, 2021
1.11.2	Jun 7, 2021
1.11.1	May 14, 2021
1.11.0	May 14, 2021
1.10.5	May 7, 2021
1.10.4	Apr 1, 2021
1.10.3	Mar 31, 2021
1.10.2	Mar 23, 2021
1.10.1	Mar 12, 2021
1.10.0	Mar 10, 2021
1.9.1	Mar 10, 2021
1.9.0	Mar 3, 2021
1.8.3	Mar 1, 2021
1.8.2	Mar 1, 2021
1.8.0	Feb 23, 2021
1.7.5	Feb 15, 2021
1.7.4	Feb 8, 2021
1.7.3	Feb 5, 2021
1.7.1	Feb 4, 2021
1.7.0	Feb 2, 2021
1.6.4	Feb 1, 2021
1.6.3	Jan 25, 2021
1.6.2	Jan 22, 2021
1.6.1	Jan 22, 2021
1.6.0	Jan 21, 2021
1.5.4	Jan 7, 2021
1.5.3	Jan 6, 2021
1.5.2	Dec 22, 2020
1.5.1	Dec 8, 2020
1.5.0	Nov 12, 2020
1.4.2 (this version)	Sep 1, 2020
1.4.1	Aug 27, 2020

Download files

Source Distribution

texta-mlp-1.4.2.tar.gz (42.6 kB)

Uploaded Sep 1, 2020 Source

Hashes for texta-mlp-1.4.2.tar.gz

Algorithm	Hash digest
SHA256	`8a4574300d153cc1e715b90d436658703c5e0dd3fffcc4c5ad6d0da4469e7cae`
MD5	`0f472b27792f780d87c1fb5efdb164fd`
BLAKE2b-256	`69edfb847d78aa19a7ed84f7d7cf5ef1965f55bb974ca75a8c1dcca0b591e416`