A library introducing state-of-the-art metrics for measuring linguistic complexity
Project description
LingX
A library introducing state-of-the-art metrics for measuring linguistic complexity, developed by ContentSide and CRITT at Kent State University.
LingX is:
- A library for calculating psycholinguistic complexity metrics.
- A library for obtaining helpful metrics for translation process studies.
- A library for computing different factors related to text analysis.
- A library with extended modules to easily integrate translation studies with the CRITT TPR-DB.
How does LingX generally work?
LingX calculates different token-based and segment-based monolingual and bilingual complexity metrics. It internally parses a given text into a dependency grammar graph. Using this graph together with other linguistic information such as part-of-speech tags, it calculates different psycholinguistic, linguistic, and translation process metrics. See the reference section for detailed information.
LingX uses the Stanza state-of-the-art NLP library for the underlying language-processing tasks. Stanza is a collection of accurate and efficient tools for the linguistic analysis of many human languages, bringing state-of-the-art NLP models to a wide range of languages.
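For context, the dependency graph that LingX relies on can be inspected directly with Stanza. The snippet below is a minimal sketch of plain Stanza usage (not the LingX API) showing the head and dependency relation of each word, which is the raw material the complexity metrics are computed from:

import stanza

# Download the English models once (the "partut" package is also used in the example below).
stanza.download("en", package="partut")

nlp = stanza.Pipeline("en", package="partut")
doc = nlp("The reporter who the senator attacked disliked the editor.")

# Each word carries its head index and dependency relation.
for sentence in doc.sentences:
    for word in sentence.words:
        print(word.id, word.text, word.head, word.deprel)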
Quick Start
Requirements and Installation
The project is based on Stanza 1.2.1 and Python 3.6+. If you do not have Python 3.6 or later, install it first. Then, in your favorite virtual environment, simply run:
pip install lingx
If you are running the project in a Jupyter Notebook or Google Colab environment, run the following command instead:
!pip install lingx
Example Usage
Let's run a simple token-based psycholinguistic Incomplete Dependency Theory (IDT) metric as a test. All you need to do is import the related methods and run the following code:
from lingx.utils import download_lang_models
from lingx.core.lang_model import get_nlp_object
from lingx.utils.lx import get_sentence_lx
nlp_en = get_nlp_object("en", use_critt_tokenization = False, package="partut")
input = "The reporter who the senator who John met attacked disliked the editor."
tokens_scores_list, aggregated_score = get_sentence_lx(
    input,
    nlp_en,
    result_format="segment",
    complexity_type="idt",
    aggregation_type="sum")
print(f"Tokens Scores List == {tokens_scores_list}")
print(f"Aggregated Score == {aggregated_score}")
This should print the Incomplete Dependency Theory (IDT) score for each token, together with the aggregated score computed with the sum aggregation function:
Tokens Scores List == [['The', 1], ['reporter', 2], ['who', 3], ['the', 4], ['senator', 3], ['who', 4], ['John', 5], ['met', 2], ['attacked', 2], ['disliked', 2], ['the', 3], ['editor', 1], ['.', 0]]
Aggregated Score == 32
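As a quick sanity check on the output above, the aggregated score produced with aggregation_type="sum" is simply the sum of the per-token IDT scores:

# Per-token IDT scores copied from the output above.
scores = [1, 2, 3, 4, 3, 4, 5, 2, 2, 2, 3, 1, 0]

# 1 + 2 + 3 + ... + 0 == 32, matching the aggregated score.
assert sum(scores) == 32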
Tutorials
We provide a set of quick tutorials to get you started with the library:
- Tutorial 1: Basics
- Tutorial 2: Getting Incomplete Dependency Theory (IDT) Metric
- Tutorial 3: Getting Dependency Locality Theory (DLT) Metric
- Tutorial 4: Getting Combined IDT and DLT Metric
- Tutorial 5: Getting Left-Embeddedness Metric
- Tutorial 6: Getting Nested Nouns Distance Metric
- Tutorial 7: Getting Bilingual Complexity Ratio Metric
- Tutorial 8: Reading and Working with CRITT TPR-DB
The tutorials explain how the base metrics can be obtained. Let us know if anything is unclear.
CRITT Translation Process Database (TPR-DB)
The CRITT Translation Process Database (TPR-DB) is released under a Creative Commons license (CC BY-NC-SA). Note that the EN-ZH_IMBst18 database available in this GitHub repository belongs to the CRITT TPR-DB.
Citing LingX
Please cite:
@inproceedings{Zou2021,
title={Syntactic Complexity and Translation Performance in English-to-Chinese Sight Translation},
author={Zou, Longhui and Mirzapour, Mehdi and Jacquenet, Hélène},
booktitle={Applied Linguistics and Professional Practice 2021},
year={2021},
publisher={Translational Data Analytics Institute, The Ohio State University}
}
For IDT-based and DLT-based complexities, please cite this paper:
@incollection{mirzapour2020,
title={Measuring Linguistic Complexity: Introducing a New Categorial Metric},
author={Mirzapour, Mehdi and Prost, Jean-Philippe and Retor{\'e}, Christian},
booktitle={Logic and Algorithms in Computational Linguistics 2018 (LACompLing2018)},
pages={95--123},
year={2020},
publisher={Springer}
}
Contact
Please email your questions or comments to Mehdi Mirzapour.
License
LingX is licensed under the MIT License (MIT). Copyright © 2021 ContentSide and CRITT at Kent State University.
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
File details
Details for the file lingx-0.1.6.tar.gz.
File metadata
- Download URL: lingx-0.1.6.tar.gz
- Upload date:
- Size: 16.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.4.2 importlib_metadata/4.6.3 pkginfo/1.7.1 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.0 CPython/3.9.6
File hashes
Algorithm | Hash digest
---|---
SHA256 | 9e35e0b09b8967a1a793b1d016b7e8faf5dda2681bc9aa608ee78bf0428f7b9a
MD5 | c60032b50d54b24c9ec36218ca40f86b
BLAKE2b-256 | 5f739a20f232d4b441f0bc501a2c4e4d32fb18188814c1c7123770c381ce7761
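If you want to check the integrity of a downloaded archive against the digests above, here is a minimal sketch using Python's standard hashlib (assuming the file sits in the current directory):

import hashlib

# Compute the SHA256 digest of the downloaded source distribution.
with open("lingx-0.1.6.tar.gz", "rb") as f:
    digest = hashlib.sha256(f.read()).hexdigest()

# Should print True if the download matches the published digest.
print(digest == "9e35e0b09b8967a1a793b1d016b7e8faf5dda2681bc9aa608ee78bf0428f7b9a")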
File details
Details for the file lingx-0.1.6-py3-none-any.whl.
File metadata
- Download URL: lingx-0.1.6-py3-none-any.whl
- Upload date:
- Size: 22.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.4.2 importlib_metadata/4.6.3 pkginfo/1.7.1 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.0 CPython/3.9.6
File hashes
Algorithm | Hash digest
---|---
SHA256 | 475b0133c25fef0c8e8674d76b89ba0dc65bbfe99a25dc1d14d8b8c1b1880d1b
MD5 | a66b1929b1c90dec49834edf199d03d5
BLAKE2b-256 | 11a7d37390be31179025a5a9b47120b0afa8d1605211fe79ea78c6eae160a4d1