LangDive
LangDive is a PyPi-hosted library for measuring the level of linguistic diversity in multilingual NLP datasets.
The measures implemented here have been proposed and described in the following NAACL 2024 paper: A Measure for Transparent Comparison of Linguistic Diversity in Multilingual NLP Data Sets
Installation
pip install langdive
OS specific instructions
This library has PyICU as one of its dependencies, and its installation procedure is OS-specific.
Windows
You can find Windows wheels for PyICU here. Download the wheel for your Python version and install it within your environment, then run the pip install command above.
MacOS
The easiest way to set up PyICU on a Mac is to first install Homebrew and then run the following commands:
# install libicu (keg-only)
brew install pkg-config icu4c
# let setup.py discover keg-only icu4c via pkg-config
export PATH="/usr/local/opt/icu4c/bin:/usr/local/opt/icu4c/sbin:$PATH"
export PKG_CONFIG_PATH="$PKG_CONFIG_PATH:/usr/local/opt/icu4c/lib/pkgconfig"
Note that on Apple Silicon Macs, Homebrew installs under /opt/homebrew rather than /usr/local, so adjust the paths above accordingly. After this, the PyICU package will be installed automatically by pip during the installation of langdive.
Ubuntu
PyICU installation instructions can be found here.
In addition, make sure PyQt6 is installed to ensure that plots work properly.
Included Datasets
The library includes several datasets that have already been processed with a sample_size of 10000.
They are listed here in the format library_id - dataset name: number of languages.
- ud - Universal Dependencies (UD): 106 languages
- bible - Bible 100: 102 languages
- mbert - mBERT: 97 languages
- xtreme - XTREME: 40 languages
- xglue - XGLUE: 19 languages
- xnli - XNLI: 15 languages
- xcopa - XCOPA: 11 languages
- tydiqa - TyDiQA: 11 languages
- xquad - XQuAD: 12 languages
- teddi - TeDDi sample: 86 languages
Usage Example
from langdive import process_corpus
from langdive import LangDive
process_corpus("C:/Users/Name/corpus_folder_name" )
lang = LangDive()
lang.jaccard_morphology("./RESULTS_corpus_folder_name/corpus_folder_name.10000.stats.tsv", "teddi", plot=True, scaled=True)
In this example, the provided corpus files are first processed to calculate the measurements and statistics necessary for other library functions.
Afterwards, the library class is instantiated with the default arguments.
Finally, the Jaccard similarity index is calculated by comparing the distributions of the mean word length between the newly processed corpus and the selected built-in reference corpus (TeDDi). This calculation is performed using scaled values, and both distributions are also shown side-by-side on a plot.
API
process_corpus
process_corpus(input_folder_path, is_ISO6393 = False, output_folder_path = "default", sample_size_array = [10000])
Creates a results folder containing various measurements and statistics calculated from the provided input corpus. The input corpus folder should contain text files encoded in UTF-8. To use all functions of this library, each corpus file name (without the file extension) must equal the corresponding ISO 639-3 language code, and the is_ISO6393 argument must be set to True. If these conditions are not met, only the measures based on mean word length can be used, while those relying on syntactic features will report an error.
The created folder "RESULTS_corpus_folder_name" will be placed in the chosen output directory and will contain one or more "freqs_sample_size" subfolders and one or more "corpus_folder_name.sample_size.stats.tsv" files. Each stats file holds the measures for every corpus file, one line per file. The number of stats files and subfolders depends on how many sampling sizes are requested via the sample_size_array argument. The "freqs_sample_size" subfolders contain word frequency count files for each file in the corpus folder, calculated for every sampling size setting.
input_folder_path
- absolute or relative path to the input corpus folder
is_ISO6393
- boolean indicating whether the names of the input corpus files (without the file extension) correspond to the ISO 639-3 language code standard
output_folder_path
- absolute or relative path to the output folder. The default setting will place the outputs in the current working directory.
sample_size_array
- the size of the text sample to be taken from each language file, measured in tokens. Each sample represents a contiguous section of text, with a randomly chosen starting point, containing the selected number of tokens.
For example, for sample_size_array = [10000, 20000], there will be 2 result sets: one using samples of 10000 tokens per corpus file, and another using samples of 20000 tokens per corpus file.
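As a sketch of a typical call (the folder names here are hypothetical), processing a corpus whose files are named by ISO 639-3 code with two sampling sizes might look like this:
from langdive import process_corpus

# Hypothetical corpus folder with files named by ISO 639-3 code (eng.txt, deu.txt, ...)
process_corpus(
    "./my_corpus",
    is_ISO6393=True,
    output_folder_path="./results",
    sample_size_array=[10000, 20000],
)
# Per the description above, this should produce ./results/RESULTS_my_corpus
# with one stats TSV file and one freqs subfolder per sampling size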
process_file
process_file(input_file_path, is_ISO6393, output_file_path, sample_size=10000)
Does the same thing as process_corpus but for a single file.
input_file_path
- absolute or relative path to the input corpus file
is_ISO6393
- boolean indicating whether the name of the input corpus file (without the file extension) corresponds to the ISO 639-3 language code standard
output_file_path
- absolute or relative path to the output file where the results will be stored; the freq folder will be placed in the same directory as the output file
sample_size
- the size of the text sample to be taken, measured in tokens
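A minimal sketch, with hypothetical paths:
from langdive import process_file

# Process a single UTF-8 file named by its ISO 639-3 code
process_file("./my_corpus/eng.txt", True, "./results/eng.stats.tsv", sample_size=10000)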
LangDive
The class for working with the processed datasets.
constructor
LangDive(min = 1, max = 13, increment = 1, typological_index_binsize = 1)
min, max, increment
- controls the bin sizes to be used in the Jaccard measure based on mean word length (will also affect the result plots). The stated default values have been determined experimentally.
typological_index_binsize
- controls the bin size for the typological indexes
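For example (the non-default values below are illustrative, not recommendations):
from langdive import LangDive

lang = LangDive()  # experimentally determined defaults: bins from 1 to 13, step 1
lang_coarse = LangDive(min=1, max=13, increment=2, typological_index_binsize=2)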
jaccard_morphology
jaccard_morphology(dataset_path, reference_path, plot = True, scaled = False)
Returns the Jaccard score calculated by comparing the distributions of the mean word length between the given and the reference dataset.
dataset_path, reference_path
- absolute or relative path to the processed corpus TSV file. One of the included datasets that has already been processed can be used by stating its library_id.
plot
- boolean that determines whether a plot will be shown
scaled
- boolean that determines whether the datasets should be scaled. Each dataset is normalized independently.
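Since the included datasets can be referenced by their library_id, a quick sketch comparing two of them:
from langdive import LangDive

lang = LangDive()
# Compare the mean word length distributions of UD and the TeDDi sample
score = lang.jaccard_morphology("ud", "teddi", plot=False, scaled=True)
print(score)  # higher values indicate more similar distributions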
jaccard_syntax
jaccard_syntax(dataset_path, reference_path, plot = True, scaled = False)
Returns the Jaccard score calculated by comparing the values of 103 syntactic features from lang2vec between the given and the reference dataset.
dataset_path, reference_path
- absolute or relative path to the processed corpus TSV file. One of the included datasets that has already been processed can be used by stating its library_id.
plot
- boolean that determines whether a plot will be shown
scaled
- boolean that determines whether the datasets should be scaled. Each dataset is normalized independently.
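A sketch analogous to the one above, again using two included datasets:
from langdive import LangDive

lang = LangDive()
# Compare the lang2vec syntactic features of XTREME and XNLI
score = lang.jaccard_syntax("xtreme", "xnli", plot=False)
print(score)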
typological_index_syntactic_features
typological_index_syntactic_features(dataset_path)
Returns the typological index that uses the 103 syntactic features from lang2vec. The value ranges from 0 to 1 and values closer to 1 indicate higher diversity.
dataset_path
- absolute or relative path to the processed corpus TSV file. One of the included datasets that has already been processed can be used by stating its library_id.
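For example, using an included dataset:
from langdive import LangDive

lang = LangDive()
diversity = lang.typological_index_syntactic_features("xtreme")
print(diversity)  # between 0 and 1; closer to 1 means higher diversity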
typological_index_word_length
typological_index_word_length(dataset_path)
Returns the typological index adapted to use mean word length for calculations.
dataset_path
- absolute or relative path to the processed corpus TSV file. One of the included datasets that has already been processed can be used by stating its library_id.
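For example:
from langdive import LangDive

lang = LangDive()
print(lang.typological_index_word_length("teddi"))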
get_l2v
get_l2v(dataset_df)
Returns the values of 103 syntactic features from lang2vec for the given set of languages.
dataset_df
- pandas dataframe of a processed dataset, containing an ISO_6393 column
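A sketch, assuming a stats TSV produced by process_corpus with is_ISO6393=True (the path below is hypothetical):
import pandas as pd
from langdive import LangDive

lang = LangDive()
# The stats files are tab-separated; the ISO_6393 column is required here
df = pd.read_csv("./results/RESULTS_my_corpus/my_corpus.10000.stats.tsv", sep="\t")
features = lang.get_l2v(df)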
get_dict
get_dict(dataset_df)
Returns a dataframe containing pairs of bins and dictionaries (region: number of languages) based on the provided processed dataset (the measures dataframe).
dataset_df
- pandas dataframe of a processed dataset
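A sketch, using the same hypothetical stats file as above:
import pandas as pd
from langdive import LangDive

lang = LangDive()
df = pd.read_csv("./results/RESULTS_my_corpus/my_corpus.10000.stats.tsv", sep="\t")
bins = lang.get_dict(df)  # one (bin, {region: number of languages}) pair per row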
Acknowledgements
- Polyglot - A part of the langdive library (the polyglot_tokenizer file) has been taken from the Polyglot project, because installing Polyglot itself is difficult on Windows and macOS. If the Polyglot library gets updated, this file will be removed.
Authors and maintainers
This library has been developed and is maintained by members of the Natural Language Processing group at the Innovation Center of the School of Electrical Engineering in Belgrade.
This effort was made possible thanks to collaboration and consultations with Dr. Tanja Samardžić, University of Zurich.