
LangDive

LangDive is a PyPI-hosted library for measuring the level of linguistic diversity in multilingual NLP datasets.

The measures implemented here have been proposed and described in the following NAACL 2024 paper: A Measure for Transparent Comparison of Linguistic Diversity in Multilingual NLP Data Sets

Installation

  pip install langdive 

OS specific instructions

This library depends on PyICU, whose installation procedure is OS-specific.

Windows

Prebuilt PyICU wheels for Windows can be found here. Download the wheel for your Python version and install it within your environment, then run the pip install command above.

macOS

The easiest way to set up PyICU on macOS is to first install Homebrew. Then, run the following commands:

# install libicu (keg-only)
brew install pkg-config icu4c

# let setup.py discover keg-only icu4c via pkg-config
export PATH="/usr/local/opt/icu4c/bin:/usr/local/opt/icu4c/sbin:$PATH"
export PKG_CONFIG_PATH="$PKG_CONFIG_PATH:/usr/local/opt/icu4c/lib/pkgconfig"

Finally, the PyICU package will be installed automatically by pip during the installation of langdive. Note that on Apple Silicon Macs, Homebrew installs under /opt/homebrew rather than /usr/local, so adjust the paths above accordingly.

Ubuntu

PyICU installation instructions can be found here.

In addition, make sure PyQt6 is installed so that plots display correctly.

Included Datasets

The library includes several datasets that have already been processed with a sample_size of 10000.

They are listed here in the following format: library_id - name of the dataset : number of languages

Usage Example

from langdive import process_corpus
from langdive import LangDive

process_corpus("C:/Users/Name/corpus_folder_name" )

lang = LangDive()
lang.jaccard_morphology("./RESULTS_corpus_folder_name/corpus_folder_name.10000.stats.tsv", "teddi", plot=True, scaled=True)

In this example, the provided corpus files are first processed to calculate the measurements and statistics required by the other library functions.

Afterwards, the library class is instantiated with the default arguments.

Finally, the Jaccard similarity index is calculated by comparing the distributions of the mean word length between the newly processed corpus and the selected built-in reference corpus (TeDDi). This calculation is performed using scaled values, and both distributions are also shown side-by-side on a plot.

API

process_corpus

process_corpus(input_folder_path, is_ISO6393 = False, output_folder_path = "default", sample_size_array = [10000])

Creates a results folder containing various measurements and statistics calculated from the provided input corpus. The input corpus folder should contain text files encoded in UTF-8. To use all functions of this library, every corpus file name (without the file extension) must match that file's ISO 639-3 language code, and the is_ISO6393 argument must be set to True. If these conditions are not met, only the measures based on mean word length can be used, while those relying on syntactic features will report an error.

The created folder "RESULTS_corpus_folder_name" will be placed in the chosen output directory and will contain one or more "freqs_sample_size" subfolders and one or more "corpus_folder_name.sample_size.stats.tsv" files. The stats files contain various measures for each corpus file, one line per file; there is one stats file per sampling size setting, as defined by the sample_size_array argument. The "freqs_sample_size" subfolders contain word frequency count files for each file in the corpus folder, calculated for every sampling size setting.

input_folder_path - absolute or relative path to the input corpus folder

is_ISO6393 - boolean indicating whether the names of the input corpus files (without the file extension) are ISO 639-3 language codes

output_folder_path - absolute or relative path to the output folder. The default setting will place the outputs in the current working directory.

sample_size_array - the size of the text sample to be taken from each language file, measured in tokens. Each sample represents a contiguous section of text, with a randomly chosen starting point, containing the selected number of tokens. For example, for sample_size_array = [10000, 20000], there will be 2 result sets: one using samples of 10000 tokens per corpus file, and another using samples of 20000 tokens per corpus file.
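For illustration, a minimal sketch of a multi-sample run; the folder name "./my_corpus" and its ISO 639-3 file names (e.g. eng.txt) are hypothetical:

from langdive import process_corpus

# Hypothetical folder with UTF-8 files named by ISO 639-3 code (eng.txt, deu.txt, ...)
# Produces RESULTS_my_corpus with two result sets: one for 10000-token and one for 20000-token samples
process_corpus("./my_corpus", is_ISO6393=True, sample_size_array=[10000, 20000])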

process_file

process_file(input_file_path, is_ISO6393, output_file_path, sample_size=10000)

Does the same as process_corpus, but for a single file.

input_file_path - absolute or relative path to the input corpus file

is_ISO6393 - boolean indicating whether the name of the input corpus file (without the file extension) is an ISO 639-3 language code

output_file_path - absolute or relative path to the output file where the results will be stored; the freq folder will be placed in the same directory as the output file

sample_size - the size of the text sample to be taken, measured in tokens
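A minimal sketch along the same lines; the paths are again hypothetical:

from langdive import process_file

# Process a single UTF-8 file whose name is an ISO 639-3 code
process_file("./my_corpus/eng.txt", is_ISO6393=True, output_file_path="./eng.stats.tsv", sample_size=10000)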

LangDive

The class for working with the processed datasets.

constructor

LangDive(min = 1, max = 13, increment = 1, typological_index_binsize = 1)

min, max, increment - control the bin sizes used in the Jaccard measure based on mean word length (these also affect the result plots). The default values have been determined experimentally.

typological_index_binsize - controls the bin size for the typological indexes
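For illustration, an instance with non-default binning (the values below are arbitrary, not recommendations):

from langdive import LangDive

# Wider word-length range, coarser bins for both kinds of measures
lang = LangDive(min=1, max=20, increment=2, typological_index_binsize=2)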

jaccard_morphology

jaccard_morphology(dataset_path, reference_path, plot = True, scaled = False)

Returns the Jaccard score calculated by comparing the distributions of the mean word length between the given and the reference dataset.

dataset_path, reference_path - absolute or relative path to the processed corpus TSV file. One of the included datasets that has already been processed can be used by stating its library_id.

plot - boolean that determines whether a plot will be shown

scaled - boolean that determines whether the datasets should be scaled. Each dataset is normalized independently.
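The same comparison as in the Usage Example, but capturing the score without a plot; the dataset path is a placeholder for a file produced by process_corpus:

score = lang.jaccard_morphology("./RESULTS_my_corpus/my_corpus.10000.stats.tsv", "teddi", plot=False, scaled=True)
print(score)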

jaccard_syntax

jaccard_syntax(dataset_path, reference_path, plot = True, scaled = False)

Returns the Jaccard score calculated by comparing the values of 103 syntactic features from lang2vec between the given and the reference dataset.

dataset_path, reference_path - absolute or relative path to the processed corpus TSV file. One of the included datasets that has already been processed can be used by stating its library_id.

plot - boolean that determines whether a plot will be shown

scaled - boolean that determines whether the datasets should be scaled. Each dataset is normalized independently.
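A sketch mirroring the morphology example; note that the corpus must have been processed with is_ISO6393=True for the syntactic features to be available:

score = lang.jaccard_syntax("./RESULTS_my_corpus/my_corpus.10000.stats.tsv", "teddi", plot=False)
print(score)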

typological_index_syntactic_features

typological_index_syntactic_features(dataset_path)

Returns the typological index that uses the 103 syntactic features from lang2vec. The value ranges from 0 to 1 and values closer to 1 indicate higher diversity.

dataset_path - absolute or relative path to the processed corpus TSV file. One of the included datasets that has already been processed can be used by stating its library_id.
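For example, using the built-in TeDDi dataset via its library_id:

syntactic_diversity = lang.typological_index_syntactic_features("teddi")
print(syntactic_diversity)  # between 0 and 1; closer to 1 means more diverse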

typological_index_word_length

typological_index_word_length(dataset_path)

Returns the typological index adapted to use mean word length for calculations.

dataset_path - absolute or relative path to the processed corpus TSV file. One of the included datasets that has already been processed can be used by stating its library_id.
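And analogously for the word-length variant:

word_length_diversity = lang.typological_index_word_length("teddi")
print(word_length_diversity)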

get_l2v

get_l2v(dataset_df)

Returns the values of 103 syntactic features from lang2vec for the given set of languages.

dataset_df - pandas dataframe of a processed dataset, containing an ISO_6393 column
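A minimal sketch, assuming the stats TSV was produced with is_ISO6393=True so that it contains the ISO_6393 column; the path is a placeholder:

import pandas as pd

# Load a processed stats file and retrieve the lang2vec syntactic features
df = pd.read_csv("./RESULTS_my_corpus/my_corpus.10000.stats.tsv", sep="\t")
features = lang.get_l2v(df)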

get_dict

get_dict(dataset_df)

Returns a dataframe containing pairs of bins and dictionaries (region: number of languages) based on the provided processed dataset.

dataset_df - pandas dataframe of a processed dataset
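Continuing the sketch above with the same dataframe:

bins_and_regions = lang.get_dict(df)
# Each row pairs a bin with a dictionary mapping region to number of languages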

Acknowledgements

  • Polyglot - A part of the langdive library (the polyglot_tokenizer file) has been taken from the Polyglot project, because Polyglot is difficult to install on Windows and macOS. If the Polyglot library is updated, this file will be removed.

Authors and maintainers

This library has been developed and is maintained by members of the Natural Language Processing group at the Innovation Center of the School of Electrical Engineering in Belgrade.

This effort was made possible thanks to collaboration and consultations with Dr. Tanja Samardžić, University of Zurich.

License

GNU GPL 3
