

LingX

A library of state-of-the-art metrics for measuring linguistic complexity, developed by ContentSide and CRITT at Kent State University.


LingX is:

  • A library for calculating several psycholinguistic complexity metrics.
  • A library for obtaining helpful metrics for translation process studies.
  • A library for computing various factors related to text analysis.
  • A library with extended modules that make it easy to integrate translation studies with the CRITT TPR-DB.

How does LingX generally work?

LingX calculates token-based and segment-based complexity metrics for monolingual and bilingual text. It internally parses a given text into a dependency grammar graph. Using the graph and other linguistic information such as part-of-speech tags, it calculates different psycholinguistic, linguistic, and translation process metrics. See the reference section for detailed information.

LingX uses Stanza, a state-of-the-art NLP library, for its language-specific tasks. Stanza is a collection of accurate and efficient tools for the linguistic analysis of many human languages, with pretrained models for each of them.
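To make the idea of a token-based metric computed from a dependency graph more concrete, here is a minimal sketch. It is not LingX's implementation and is not expected to reproduce LingX's scores: it uses one simplified reading of incomplete-dependency counting, namely the number of dependency arcs that are still open after each token. The function name `incomplete_dependency_counts` and the hand-written head indices are assumptions made for illustration.

```python
from typing import List


def incomplete_dependency_counts(heads: List[int]) -> List[int]:
    """Count open dependency arcs after each token.

    heads[i] is the 1-based position of the head of token i+1; 0 marks the root.
    An arc is "open" at position p when exactly one of its two ends is at or
    before p, i.e. the reader has seen one end and is still waiting for the other.
    """
    counts = []
    for pos in range(1, len(heads) + 1):
        open_arcs = 0
        for dependent, head in enumerate(heads, start=1):
            if head == 0:
                continue  # the root token has no incoming arc
            left, right = min(dependent, head), max(dependent, head)
            if left <= pos < right:
                open_arcs += 1
        counts.append(open_arcs)
    return counts


# Hand-annotated (hypothetical) parse of "The reporter disliked the editor":
# The -> reporter, reporter -> disliked, disliked -> ROOT,
# the -> editor, editor -> disliked
tokens = ["The", "reporter", "disliked", "the", "editor"]
heads = [2, 3, 0, 5, 3]

for token, count in zip(tokens, incomplete_dependency_counts(heads)):
    print(token, count)
```

In a real pipeline the head indices would come from a parser rather than being typed by hand; Stanza, for example, exposes a 1-based `id` and a `head` attribute on each parsed word.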

Quick Start

Requirements and Installation

The project is based on Stanza 1.2.1 and Python 3.6+. If you do not have Python 3.6 or later, install it first. Then, in your preferred virtual environment, run:

pip install lingx

If you are running the project in a Jupyter Notebook or Google Colab environment, run the following command instead:

!pip install lingx
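If you want to confirm the setup before going further, a small check along these lines may help. It is only a sketch: the helper name `environment_ready` is hypothetical, and it verifies only that the interpreter meets the stated minimum and that a package named `lingx` can be found, not that Stanza or its language models are available.

```python
import importlib.util
import sys


def environment_ready(min_python=(3, 6), package="lingx"):
    """Return (python_ok, package_ok) for the current interpreter."""
    python_ok = sys.version_info >= min_python
    # find_spec returns None when a top-level package cannot be located.
    package_ok = importlib.util.find_spec(package) is not None
    return python_ok, package_ok


python_ok, package_ok = environment_ready()
print(f"Python 3.6+ available: {python_ok}")
print(f"lingx importable: {package_ok}")
```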

Example Usage

Let's run a simple token-based psycholinguistic metric, the incomplete dependency theory (IDT) metric, as a test. All you need to do is import the relevant functions and call them as follows:

from lingx.utils import download_lang_models
from lingx.core.lang_model import get_nlp_object
from lingx.utils.lx import get_sentence_lx

# Load the English NLP object with the "partut" package, without CRITT tokenization.
nlp_en = get_nlp_object("en", use_critt_tokenization=False, package="partut")

sentence = "The reporter who the senator who John met attacked disliked the editor."

# Score every token with the IDT metric and sum the token scores.
tokens_scores_list, aggregated_score = get_sentence_lx(
    sentence,
    nlp_en,
    result_format="segment",
    complexity_type="idt",
    aggregation_type="sum",
)

print(f"Tokens Scores List == {tokens_scores_list}")
print(f"Aggregated Score == {aggregated_score}")

This should print the list of tokens with their IDT scores, followed by the aggregated score obtained with the sum aggregation function:

Tokens Scores List == [['The', 1], ['reporter', 2], ['who', 3], ['the', 4], ['senator', 3], ['who', 4], ['John', 5], ['met', 2], ['attacked', 2], ['disliked', 2], ['the', 3], ['editor', 1], ['.', 0]]
Aggregated Score == 32
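The aggregated score is simply the token scores combined by the chosen function. The sketch below recomputes it from the printed list in plain Python, without calling LingX. The helper `aggregate` is hypothetical, and the "mean" and "max" options are illustrative assumptions rather than documented values of `aggregation_type`.

```python
from statistics import mean

# Token scores copied from the example output above.
tokens_scores_list = [
    ['The', 1], ['reporter', 2], ['who', 3], ['the', 4], ['senator', 3],
    ['who', 4], ['John', 5], ['met', 2], ['attacked', 2], ['disliked', 2],
    ['the', 3], ['editor', 1], ['.', 0],
]


def aggregate(token_scores, how="sum"):
    """Combine per-token scores into one segment-level score."""
    scores = [score for _, score in token_scores]
    if how == "sum":
        return sum(scores)
    if how == "max":
        return max(scores)
    if how == "mean":
        return mean(scores)
    raise ValueError(f"Unknown aggregation: {how!r}")


print(aggregate(tokens_scores_list, "sum"))  # 32, matching the aggregated score above
print(aggregate(tokens_scores_list, "max"))  # 5, the score on 'John'
```

The peak of 5 falls on "John", the deepest point of the two nested relative clauses in this sentence.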

Tutorials

We provide a set of quick tutorials to get you started with the library. The tutorials explain how the base metrics can be obtained. Let us know if anything is unclear.

CRITT Translation Process Database (TPR-DB)

The CRITT Translation Process Database (TPR-DB) is released under a Creative Commons license (CC BY-NC-SA). Note that the EN-ZH_IMBst18 database available in this GitHub repository belongs to the CRITT TPR-DB.


Citing LingX

Please cite:

@inproceedings{Zou2021,
  title={Syntactic Complexity and Translation Performance in English-to-Chinese Sight Translation},
  author={Zou, Longhui and Mirzapour, Mehdi and Jacquenet, Hélène},
  booktitle={Applied Linguistics and Professional Practice 2021},
  year={2021},
  publisher={Translational Data Analytics Institute, The Ohio State University}
}

For IDT-based and DLT-based complexities, please cite this paper:

@incollection{mirzapour2020,
  title={Measuring Linguistic Complexity: Introducing a New Categorial Metric},
  author={Mirzapour, Mehdi and Prost, Jean-Philippe and Retor{\'e}, Christian},
  booktitle={Logic and Algorithms in Computational Linguistics 2018 (LACompLing2018)},
  pages={95--123},
  year={2020},
  publisher={Springer}
}

Contact

Please email your questions or comments to Mehdi Mirzapour.

License

LingX is licensed under the MIT License. Copyright © 2021 ContentSide and CRITT at Kent State University.
