PyLaDe - Language Detection tool written in Python.

PyLaDe

pylade is a lightweight language detection tool written in Python. The tool provides a ready-to-use command-line interface, along with a more complex scaffolding for customized tasks.

The current version of pylade implements the Cavnar-Trenkle N-Gram-based approach. However, the tool can be further expanded with customized language identification implementations.
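For readers unfamiliar with the Cavnar-Trenkle approach, the following is a minimal, self-contained sketch of the general technique: build a ranked profile of the most frequent character n-grams for each language, then pick the language whose profile has the smallest "out-of-place" rank distance from the document's profile. This is not pylade's own code. The function names, the `missing_penalty` parameter, the profile size of 300, and the tiny training sentences are all assumptions chosen for illustration.

```python
from collections import Counter


def ngram_profile(text, max_n=5, limit=300):
    """Return the `limit` most frequent character n-grams, most frequent first."""
    counts = Counter()
    for token in text.lower().split():
        padded = f"_{token}_"  # pad each token so word boundaries show up in n-grams
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts, key=lambda gram: (-counts[gram], gram))
    return ranked[:limit]


def out_of_place(doc_profile, lang_profile, missing_penalty):
    """Sum rank differences; n-grams absent from lang_profile cost missing_penalty."""
    lang_rank = {gram: rank for rank, gram in enumerate(lang_profile)}
    return sum(
        abs(rank - lang_rank[gram]) if gram in lang_rank else missing_penalty
        for rank, gram in enumerate(doc_profile)
    )


def detect(text, lang_profiles, limit=300):
    """Return the language whose profile is closest to the text's profile."""
    doc = ngram_profile(text, limit=limit)
    return min(
        lang_profiles,
        key=lambda lang: out_of_place(doc, lang_profiles[lang], missing_penalty=limit),
    )


# Toy training data (illustrative only; real profiles need far more text).
samples = {
    "en": "the quick brown fox jumps over the lazy dog and the cat sleeps in the house",
    "it": "la volpe veloce salta sopra il cane pigro e il gatto dorme nella casa",
}
profiles = {lang: ngram_profile(text) for lang, text in samples.items()}
print(detect("the dog sleeps in the house", profiles))  # prints: en
```

Truncating each profile to a fixed number of n-grams keeps the comparison cheap, at the cost of ignoring rare n-grams.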

Requirements

  • Python 3.7 to 3.11 (inclusive)
  • nltk

Installation

Clone the repository and install it locally with pip:

$ git clone git@github.com:fievelk/PyLaDe.git
$ cd PyLaDe
$ pip install .

Usage

For quick use, run the following command from a terminal:

pylade "Put text here"
# en

Done!

If you want to go deeper and use the more advanced features, keep reading. Note: you can get more information about each of the following commands with the --help flag.

Train a model on a training set

pylade_train \
    training_set.csv \
    --implementation CavnarTrenkleImpl \
    --corpus-reader TwitterCorpusReader \
    --output model.json \
    --train-args '{"limit": 5000, "verbose": "True"}'

--train-args is a dictionary of arguments passed to the train() method of the chosen implementation (CavnarTrenkleImpl in the example above). For a full description of the arguments, refer to the train() method docstring.
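As a rough illustration of how a dictionary given on the command line can reach a method as keyword arguments, the sketch below parses the example string with the standard json module and unpacks it into a call. The train() function here is a hypothetical stand-in, not pylade's method, and pylade may parse the option differently.

```python
import json


def train(training_instances, limit=None, verbose=False):
    """Hypothetical stand-in for an implementation's train() method."""
    return {"n_instances": len(training_instances), "limit": limit, "verbose": verbose}


raw = '{"limit": 5000, "verbose": "True"}'  # the string passed to --train-args above
kwargs = json.loads(raw)  # parse the JSON object into a Python dict
print(kwargs)  # {'limit': 5000, 'verbose': 'True'}

# Unpack the dict so each key becomes a keyword argument.
result = train(["instance one", "instance two"], **kwargs)
print(result["limit"])  # 5000
```

Note that with plain JSON parsing, "True" in quotes arrives as a string rather than a boolean, so the receiving method decides how to interpret it.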

NOTE: to define a new training set, you can check the format of the file tests/test_files/training_set_example.csv.

Evaluate a model on a test set

pylade_eval \
    test_set.csv \
    --model model.json \
    --implementation CavnarTrenkleImpl \
    --corpus-reader TwitterCorpusReader \
    --output results.json \
    --eval-args '{"languages": ["it", "de"], "error_values": 8000}'

--eval-args is a dictionary of arguments passed to the evaluate() method of the chosen implementation (CavnarTrenkleImpl in the example above). For a full description of the arguments, refer to the evaluate() method docstring.

Detect language of a text using a trained model

pylade \
    "Put text here" \
    --model model.json \
    --implementation CavnarTrenkleImpl \
    --output detected_language.txt \
    --predict-args '{"error_value": 8000}'

--predict-args is a dictionary of arguments passed to the predict_language() method of the chosen implementation (CavnarTrenkleImpl in the example above). For a full description of the arguments, refer to the predict_language() method docstring.
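If you want to call the command-line tool from a Python script, one possible approach is to assemble the argument list and run it with subprocess. The sketch below uses only the flags shown in this README. The helper names build_pylade_command and detect_language are hypothetical, and detect_language assumes that pylade is on your PATH and prints the detected language to standard output when --output is not given, as the quick-use example suggests.

```python
import json
import subprocess


def build_pylade_command(text, model=None, implementation=None, output=None, predict_args=None):
    """Assemble the argument list for the pylade CLI from the flags shown above."""
    command = ["pylade", text]
    if model is not None:
        command += ["--model", model]
    if implementation is not None:
        command += ["--implementation", implementation]
    if output is not None:
        command += ["--output", output]
    if predict_args is not None:
        # Serialize the dict to the JSON string form used in the example above.
        command += ["--predict-args", json.dumps(predict_args)]
    return command


def detect_language(text, **options):
    """Run pylade and return whatever it wrote to standard output (assumed behavior)."""
    completed = subprocess.run(
        build_pylade_command(text, **options),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return completed.stdout.strip()


command = build_pylade_command(
    "Put text here", model="model.json", predict_args={"error_value": 8000}
)
print(command)
```

Passing the arguments as a list avoids shell quoting problems with the JSON string.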

Info

The default model (data/model.json) has been trained using limit = 5000. This value provides a good balance between computational performance and accuracy. Please note that this might change if you use your own data to train a new model.

Tests

Run tox from the package root to execute the tests.

Tests with tox require the following dependencies:

  • tox
  • pytest

Customization

Different language detection approaches can be implemented by creating new classes that inherit from the Implementation class. This class should be treated as an interface whose methods are meant to be implemented by the inheriting class.

Customized corpus readers can be created the same way, inheriting from the CorpusReader interface instead.
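The sketch below illustrates the general pattern of subclassing an interface. It defines its own stand-in Implementation base class so that it runs on its own. The real class's import path and method signatures are not shown in this README and may differ, so check the source before adapting it. CharFrequencyImpl and its character-overlap scoring are hypothetical and only meant as a toy example.

```python
from abc import ABC, abstractmethod
from collections import Counter


class Implementation(ABC):
    """Stand-in for pylade's Implementation interface (signatures are assumed)."""

    @abstractmethod
    def train(self, labeled_texts, **kwargs): ...

    @abstractmethod
    def predict_language(self, text, model, **kwargs): ...

    @abstractmethod
    def evaluate(self, model, labeled_texts, **kwargs): ...


class CharFrequencyImpl(Implementation):
    """Toy detector: picks the language whose character frequencies overlap most."""

    def train(self, labeled_texts, **kwargs):
        model = {}
        for text, language in labeled_texts:
            # Accumulate character counts per language.
            model.setdefault(language, Counter()).update(text.lower())
        return model

    def predict_language(self, text, model, **kwargs):
        chars = Counter(text.lower())

        def overlap(language):
            profile = model[language]
            total = sum(profile.values())
            # Weight each character in the text by its relative frequency in the profile.
            return sum(count * profile[ch] / total for ch, count in chars.items())

        return max(model, key=overlap)

    def evaluate(self, model, labeled_texts, **kwargs):
        correct = sum(
            self.predict_language(text, model) == language
            for text, language in labeled_texts
        )
        return {"accuracy": correct / len(labeled_texts)}


impl = CharFrequencyImpl()
model = impl.train([("the the the", "en"), ("il il il", "it")])
print(impl.predict_language("the", model))  # prints: en
print(impl.evaluate(model, [("the", "en"), ("il", "it")]))  # prints: {'accuracy': 1.0}
```

A custom corpus reader would follow the same pattern, with the CorpusReader interface as the base class.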

References

  • Cavnar, William B., and John M. Trenkle. "N-gram-based text categorization." Proceedings of SDAIR-94, 3rd Annual Symposium on Document Analysis and Information Retrieval (1994): 161-175.
