Transformer-based model to estimate the coherence of Ukrainian-language texts

Project description

Python package to evaluate the coherence of Ukrainian-language texts

This package provides a pre-trained Transformer-based model for estimating the coherence of Ukrainian-language texts. The underlying neural network was trained on a corpus of Ukrainian news. A detailed description of the model is to be presented at the 2020 IEEE 2nd International Conference on Advanced Trends in Information Theory (ATIT).

Installation

Install the package with pip:

pip install coherence-ua

Caution: the package has several dependencies. One of them, udpipe, requires some extra utilities to compile parts of its code.

Usage

from coherence_ua.transformer_coherence import CoherenceModel

model = CoherenceModel()
text = """Мед – густа солодка маса, яку бджоли виробляють з нектару квітів.

Загалом у світі є до 320 видів меду. Вони різняться за своїми смаковими якостями та поживною цінністю.

В меді дійсно є такі поживні речовини як цинк, калій та залізо. Проте, на жаль, в дуже мізерних кількостях.

Наприклад, в одній столовій ложці меду заліза всього 0,5%. Проте цей продукт має велику кількість вуглеводів та калорій. 1 столова ложка меду еквівалентна 17 грамам цукру та 64 кілокалоріям.

Мед містить незначну кількість антиоксидантів та має доволі сильну бактеріальну дію. Антиоксиданти захищають клітини нашого організму від вільних радикалів.

Вільні радикали – це молекули, які виробляються, коли наш організм перетравлює їжу або ви були під впливом тютюну чи радіації."""

# Show output probabilities for each clique of a text (clique_number = 3) 
print(model.get_prediction_series(text))

# Evaluate the coherence of a text as the product of output probabilities
print(model.evaluate_coherence_as_product(text))

# Calculate the coherence of the text as the ratio of a number of coherent cliques over all cliques
# according to the corresponding threshold
print(model.evaluate_coherence_using_threshold(text, 0.1))

See the examples folder for more details. As the example shows, the model implements three methods:

get_prediction_series - estimates the output probability for each clique of a text (clique_number = 3). A "clique" is a group of consecutive sentences of the text, taken with an offset of one sentence. For instance, <s1, s2, s3>, <s2, s3, s4>, <s3, s4, s5>, where <si> denotes a single sentence.
evaluate_coherence_as_product - evaluates the coherence of a text as the product of the output probabilities of its cliques.
evaluate_coherence_using_threshold - calculates the coherence of a text as the ratio of the number of coherent cliques to the total number of cliques, according to the given threshold.
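
The clique construction and the two aggregation rules described above can be illustrated with a minimal sketch in plain Python. This is not the package's implementation: the function names, the probabilities in the usage lines, and the choice to count a clique as coherent when its probability is greater than or equal to the threshold are all assumptions made for illustration. The sketch also assumes the text has already been split into sentences.

```python
from typing import List, Sequence, Tuple


def build_cliques(sentences: Sequence[str], clique_size: int = 3) -> List[Tuple[str, ...]]:
    """Slide a window of `clique_size` sentences over the text, one sentence at a time."""
    if len(sentences) < clique_size:
        return []  # too few sentences to form a single clique
    return [
        tuple(sentences[i:i + clique_size])
        for i in range(len(sentences) - clique_size + 1)
    ]


def coherence_as_product(probabilities: Sequence[float]) -> float:
    """Multiply the per-clique probabilities (an empty input yields 1.0)."""
    result = 1.0
    for p in probabilities:
        result *= p
    return result


def coherence_using_threshold(probabilities: Sequence[float], threshold: float) -> float:
    """Share of cliques whose probability reaches the threshold.

    Assumption: a clique counts as coherent when p >= threshold;
    the package may use a strict comparison instead.
    """
    if not probabilities:
        return 0.0
    coherent = sum(1 for p in probabilities if p >= threshold)
    return coherent / len(probabilities)


# Hypothetical usage: five sentences give three cliques of size 3.
sentences = ["s1", "s2", "s3", "s4", "s5"]
cliques = build_cliques(sentences)

# Made-up probabilities, one per clique, standing in for model output.
probabilities = [0.9, 0.05, 0.6]
print(len(cliques))
print(coherence_as_product(probabilities))
print(coherence_using_threshold(probabilities, 0.1))
```

Because every probability is at most 1, the product can only shrink as cliques are added, so product scores are best compared between texts of similar length. The threshold ratio stays between 0 and 1 regardless of length.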


Project details

Release history

0.0.4 (this version): Nov 8, 2020
0.0.3: Aug 18, 2020

Download files

Source Distribution

coherence-ua-0.0.4.tar.gz (66.0 MB), uploaded Nov 8, 2020, Source

Built Distribution

coherence_ua-0.0.4-py3-none-any.whl (66.0 MB), uploaded Nov 8, 2020, Python 3

File details

Details for the file coherence-ua-0.0.4.tar.gz.

File metadata

Download URL: coherence-ua-0.0.4.tar.gz
Upload date: Nov 8, 2020
Size: 66.0 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/49.6.0 requests-toolbelt/0.9.1 tqdm/4.48.2 CPython/3.6.5

File hashes

Hashes for coherence-ua-0.0.4.tar.gz
Algorithm	Hash digest
SHA256	`60562f89297f9dd0b3a90b2afcba15d540a1d30aa26f3c6a98686f1ac10e00a0`
MD5	`e82651956f76058e7556ab18eeb30ab1`
BLAKE2b-256	`d20909eac3329c26e927cbfd4b7b873d379884afec4f0f6a2a382d663f675967`

File details

Details for the file coherence_ua-0.0.4-py3-none-any.whl.

File metadata

Download URL: coherence_ua-0.0.4-py3-none-any.whl
Upload date: Nov 8, 2020
Size: 66.0 MB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/49.6.0 requests-toolbelt/0.9.1 tqdm/4.48.2 CPython/3.6.5

File hashes

Hashes for coherence_ua-0.0.4-py3-none-any.whl
Algorithm	Hash digest
SHA256	`ed5f2a7a63b08f7e6ff229cd58e54aee1c2adfccc122e56379ef2c3febc058af`
MD5	`400426f827d56acce02df96776548fc5`
BLAKE2b-256	`f0f1dcd83116707337715c8cf3aaef73e4245a36beab24c81ca986e56c57ffe2`
