PyLaDe - Language Detection tool written in Python.

PyLaDe

pylade is a lightweight language detection tool written in Python. The tool provides a ready-to-use command-line interface, along with more complex scaffolding for customized tasks.

The current version of pylade implements the Cavnar-Trenkle N-Gram-based approach. However, the tool can be further expanded with customized language identification implementations.

Installation

You can install using pip:

$ pip install pylade

Usage

For quick use, run the following command from a terminal:

$ pylade "Put text here"
en

Done!

If you want to go deeper and use more advanced features, keep reading. Note: you can get more information about each of the following commands with the --help flag.

Train a model on a training set

$ pylade_train \
    training_set.csv \
    --implementation CavnarTrenkleImpl \
    --corpus-reader TwitterCorpusReader \
    --output model.json \
    --train-args '{"limit": 5000, "verbose": "True"}'

--train-args is a dictionary of arguments to be passed to the train() method of the chosen implementation (CavnarTrenkleImpl in the example above). For an accurate description of the arguments please refer to the train() method docstring.
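
Because the value of --train-args has to be valid JSON, quoting mistakes are easy to make when typing it by hand. A minimal sketch, assuming you would rather generate the value from a Python dict: json.dumps produces the JSON text and shlex.quote wraps it as a single argument for a POSIX shell. The helper name format_cli_args is hypothetical and not part of pylade.

```python
import json
import shlex


def format_cli_args(options):
    """Serialize a dict to JSON and quote it for use as one shell argument."""
    return shlex.quote(json.dumps(options))


train_args = {"limit": 5000, "verbose": "True"}
print("--train-args", format_cli_args(train_args))
# prints: --train-args '{"limit": 5000, "verbose": "True"}'
```

The same helper works for the other JSON-valued options described in this README.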

NOTE: to build your own training set, follow the format of the file tests/test_files/training_set_example.csv.

Evaluate a model on a test set

$ pylade_eval \
    test_set.csv \
    --model model.json \
    --implementation CavnarTrenkleImpl \
    --corpus-reader TwitterCorpusReader \
    --output results.json \
    --eval-args '{"languages": ["it", "de"], "error_values": 8000}'

--eval-args is a dictionary of arguments to be passed to the evaluate() method of the chosen implementation (CavnarTrenkleImpl in the example above). For an accurate description of the arguments please refer to the evaluate() method docstring.

Detect language of a text using a trained model

$ pylade \
    "Put text here" \
    --model model.json \
    --implementation CavnarTrenkleImpl \
    --output detected_language.txt \
    --predict-args '{"error_value": 8000}'

--predict-args is a dictionary of arguments to be passed to the predict_language() method of the chosen implementation (CavnarTrenkleImpl in the example above). For an accurate description of the arguments please refer to the predict_language() method docstring.
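
To call the command-line tool from a Python script, one option is to wrap it with subprocess. The sketch below uses only flags shown in this README (--model, --implementation, --predict-args). The helper names build_pylade_command and detect_language are hypothetical, and the sketch assumes that pylade is on your PATH and prints the detected language code to standard output when --output is not given, as in the quick-use example.

```python
import json
import subprocess


def build_pylade_command(text, model=None, implementation=None, predict_args=None):
    """Assemble the argument list for the pylade CLI (a list needs no shell quoting)."""
    command = ["pylade", text]
    if model is not None:
        command += ["--model", model]
    if implementation is not None:
        command += ["--implementation", implementation]
    if predict_args is not None:
        command += ["--predict-args", json.dumps(predict_args)]
    return command


def detect_language(text, **options):
    """Run pylade and return whatever it prints, stripped of surrounding whitespace."""
    result = subprocess.run(
        build_pylade_command(text, **options),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if pylade exits with an error
    )
    return result.stdout.strip()


# Inspect the command without running it.
print(build_pylade_command("Put text here", model="model.json",
                           predict_args={"error_value": 8000}))
```

Passing the arguments as a list avoids shell interpretation of the input text, which matters when the text contains quotes or other special characters.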

Custom implementations and corpora

Different language detection approaches can be implemented by creating new classes that inherit from the Implementation class. Treat this class as an interface: its methods are meant to be implemented by the inheriting class.

Customized corpus readers can be created the same way, inheriting from the CorpusReader interface instead.
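
A minimal sketch of what such a subclass could look like. To stay self-contained, it defines a stand-in Implementation base class instead of importing pylade's. The method names train(), evaluate(), and predict_language() come from the sections above, but their signatures, the (text, language) pair format, and the toy MostFrequentLanguageImpl class are assumptions; check the real interface's docstrings before adapting it.

```python
from abc import ABC, abstractmethod
from collections import Counter


class Implementation(ABC):
    """Stand-in for pylade's Implementation interface (signatures are assumed)."""

    @abstractmethod
    def train(self, labeled_texts, **kwargs): ...

    @abstractmethod
    def evaluate(self, model, labeled_texts, **kwargs): ...

    @abstractmethod
    def predict_language(self, text, model, **kwargs): ...


class MostFrequentLanguageImpl(Implementation):
    """Toy baseline: always predicts the most common language seen in training."""

    def train(self, labeled_texts, **kwargs):
        counts = Counter(language for _, language in labeled_texts)
        return {"majority": counts.most_common(1)[0][0]}

    def evaluate(self, model, labeled_texts, **kwargs):
        pairs = list(labeled_texts)
        correct = sum(
            self.predict_language(text, model) == language for text, language in pairs
        )
        return {"accuracy": correct / len(pairs)}

    def predict_language(self, text, model, **kwargs):
        return model["majority"]


impl = MostFrequentLanguageImpl()
data = [("ciao a tutti", "it"), ("buongiorno", "it"), ("guten Tag", "de")]
model = impl.train(data)
print(impl.predict_language("hello", model))  # prints: it
print(impl.evaluate(model, data))
```

Returning the model as a plain dict keeps it easy to serialize to a JSON file such as the model.json used in the commands above.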

Development

Testing

You can install development requirements using Poetry (poetry install). This will also install requirements needed for testing.

To run tests, just run tox from the package root folder.

Generating documentation with Sphinx

PyLaDe's documentation is generated using Sphinx. If you want to update the docs, you can install the necessary dependencies with Poetry:

$ poetry install --with docs

Documentation files are generated automatically from the code docstrings. To rebuild the documentation after making changes, run the following:

$ cd docs
$ make html

Notes

The default model (data/model.json) was trained with limit = 5000, a value that provides a good balance between computational performance and accuracy. The best value may differ if you train a new model on your own data.

References

  • Cavnar, William B., and John M. Trenkle. "N-gram-based text categorization." Proceedings of SDAIR-94, 3rd Annual Symposium on Document Analysis and Information Retrieval (1994): 161-175.

