LingX

A library of state-of-the-art metrics for measuring linguistic complexity, developed by ContentSide and CRITT at Kent State University.


LingX is:

  • A library for calculating several psycholinguistic complexity metrics.
  • A library for obtaining helpful metrics for translational process studies.
  • A library for computing various factors related to text analysis.
  • A library with extended modules that make it easy to integrate translational studies based on the CRITT TPR-DB.

How does LingX generally work?

LingX calculates various token-based and segment-based complexity metrics, both monolingual and bilingual. It internally parses a given text into a dependency grammar graph. Using this graph and other linguistic information such as part-of-speech tags, it calculates different psycholinguistic, linguistic, and translational process metrics. See the reference section for detailed information.

LingX uses Stanza, a state-of-the-art NLP library, for its language-specific tasks. Stanza is a collection of accurate and efficient tools for the linguistic analysis of many human languages.
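
To make the idea of a graph-derived, token-based metric concrete, here is a minimal sketch in plain Python. It is not LingX's implementation and does not reproduce its IDT scores: the parse is hand-written (an assumption standing in for parser output), and the metric is a simplified count of the dependency arcs that are still open after each token. The names PARSE and open_dependencies are hypothetical.

```python
from typing import List, Tuple

# Hand-written dependency parse (an assumption for illustration, not Stanza output).
# Each entry is (token, head) with 1-based head indices; 0 marks the root.
PARSE: List[Tuple[str, int]] = [
    ("The", 2),
    ("reporter", 3),
    ("disliked", 0),
    ("the", 5),
    ("editor", 3),
]


def open_dependencies(parse: List[Tuple[str, int]]) -> List[list]:
    """For each token, count the arcs that start at or before it and end after it."""
    # Store every non-root arc as (left position, right position).
    arcs = [
        (min(i, head), max(i, head))
        for i, (_, head) in enumerate(parse, start=1)
        if head != 0
    ]
    scores = []
    for position, (token, _) in enumerate(parse, start=1):
        still_open = sum(1 for left, right in arcs if left <= position < right)
        scores.append([token, still_open])
    return scores


token_scores = open_dependencies(PARSE)
aggregated = sum(score for _, score in token_scores)

print(token_scores)
print(aggregated)
```

The result has the same general shape as a token-based metric with a sum aggregation: one score per token plus a single segment-level number.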

Quick Start

Requirements and Installation

The project is based on Stanza 1.2.1 and Python 3.6+. If you do not have Python 3.6 or later, install it first. Then, in your favorite virtual environment, run:

pip install lingx

If you are running the project in a Jupyter Notebook or Google Colab environment, run the following command instead:

!pip install lingx
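
After installing, it can be useful to check which versions of lingx and Stanza ended up in the environment. The following is a small sketch using only the standard library. It assumes Python 3.8 or later, where importlib.metadata is available, and the helper name installed_version is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Report the packages this README depends on.
for name in ("lingx", "stanza"):
    found = installed_version(name)
    print(f"{name}: {found if found is not None else 'not installed'}")
```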

Example Usage

Let's run a simple token-based psycholinguistic metric, the Incomplete Dependency Theory (IDT) metric, as a test. All you need to do is import the relevant functions and run the following code:

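from lingx.utils import download_lang_models  # imported in the original example; not called below
from lingx.core.lang_model import get_nlp_object
from lingx.utils.lx import get_sentence_lx

# Build the English pipeline with the "partut" package and default tokenization
nlp_en = get_nlp_object("en", use_critt_tokenization=False, package="partut")

# Named `sentence` rather than `input` to avoid shadowing the built-in input()
sentence = "The reporter who the senator who John met attacked disliked the editor."

# Per-token IDT scores, aggregated over the sentence by summation
tokens_scores_list, aggregated_score = get_sentence_lx(
    sentence,
    nlp_en,
    result_format="segment",
    complexity_type="idt",
    aggregation_type="sum",
)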

print(f"Tokens Scores List == {tokens_scores_list}")
print(f"Aggregated Score == {aggregated_score}")

This should print the IDT score of each token, followed by the aggregated score computed with the sum aggregation:

Tokens Scores List == [['The', 1], ['reporter', 2], ['who', 3], ['the', 4], ['senator', 3], ['who', 4], ['John', 5], ['met', 2], ['attacked', 2], ['disliked', 2], ['the', 3], ['editor', 1], ['.', 0]]
Aggregated Score == 32
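
The printed list is an ordinary Python list of [token, score] pairs, so it can be post-processed directly. The sketch below copies the output shown above into a literal, re-derives the aggregated score, and finds the highest-scoring token. The extra statistics are computed in plain Python and are not claimed to be LingX aggregation options.

```python
# Output copied from the example above.
tokens_scores_list = [
    ['The', 1], ['reporter', 2], ['who', 3], ['the', 4], ['senator', 3],
    ['who', 4], ['John', 5], ['met', 2], ['attacked', 2], ['disliked', 2],
    ['the', 3], ['editor', 1], ['.', 0],
]

scores = [score for _, score in tokens_scores_list]

# Summing the per-token scores reproduces the aggregated score.
total = sum(scores)

# The token with the highest score marks the local complexity peak.
peak_token, peak_score = max(tokens_scores_list, key=lambda pair: pair[1])

# Average score per token, computed outside LingX.
mean_score = total / len(scores)

print(f"Total == {total}")
print(f"Peak == {peak_token} ({peak_score})")
print(f"Mean == {mean_score:.2f}")
```

The peak falls on "John", the innermost subject of the doubly embedded relative clause.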

Tutorials

We provide a set of quick tutorials to get you started with the library:

The tutorials explain how the base metrics can be obtained. Let us know if anything is unclear.

CRITT Translation Process Database (TPR-DB)

The CRITT Translation Process Database (TPR-DB) is released under a Creative Commons license (CC BY-NC-SA). Note that the EN-ZH_IMBst18 database available in this GitHub repository belongs to the CRITT TPR-DB.


Citing LingX

Please cite:

@inproceedings{Zou2021,
  title={Syntactic Complexity and Translation Performance in English-to-Chinese Sight Translation},
  author={Zou, Longhui and Mirzapour, Mehdi and Jacquenet, Hélène},
  booktitle={Applied Linguistics and Professional Practice 2021},
  year={2021},
  publisher={Translational Data Analytics Institute, The Ohio State University}
}

For IDT-based and DLT-based complexities, please cite this paper:

@incollection{mirzapour2020,
  title={Measuring Linguistic Complexity: Introducing a New Categorial Metric},
  author={Mirzapour, Mehdi and Prost, Jean-Philippe and Retor{\'e}, Christian},
  booktitle={Logic and Algorithms in Computational Linguistics 2018 (LACompLing2018)},
  pages={95--123},
  year={2020},
  publisher={Springer}
}

Contact

Please email your questions or comments to Mehdi Mirzapour.

License

LingX is licensed under the MIT License. Copyright © 2021 ContentSide and CRITT at Kent State University.
