Language identification Toolkit
lidtk
lidtk, the language identification toolkit, was written to investigate the current state of language identification performance.
Installation
The recommended way to install lidtk is:
$ pip install lidtk --user
If you want the latest version:
$ git clone https://github.com/MartinThoma/lidtk.git; cd lidtk
$ pip install -e . --user
I recommend getting the WiLI-2018 dataset.
Usage
$ lidtk --help
Usage: lidtk [OPTIONS] COMMAND [ARGS]...

Options:
  --version  Show the version and exit.
  --help     Show this message and exit.

Commands:
  analyze-data           Utility function for the languages...
  analyze-unicode-block  Analyze how important a Unicode block is for...
  char-distrib           Use the character distribution language...
  cld2                   Use the CLD-2 language classifier.
  create-dataset         Create sharable dataset from downloaded...
  download               Download 1000 documents of each language.
  google-cloud           Use the CLD-2 language classifier.
  langdetect             Use the langdetect language classifier.
  langid                 Use the langid language classifier.
  map                    Map predictions to something known by WiLI
  nn                     Use a neural network classifier.
  textcat                Use the CLD-2 language classifier.
  tfidf_nn               Use the TfidfNNClassifier classifier.
For example:
$ lidtk cld2 predict --text 'This is a test.'
eng
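To call the classifier from a Python script, one option is to shell out to the command shown above. The following is a minimal sketch, assuming lidtk is installed and available on the PATH, and that the predicted language code is printed on standard output as in the example. The helper names build_predict_command and cld2_predict are hypothetical and not part of lidtk.

```python
import subprocess
from typing import List


def build_predict_command(text: str) -> List[str]:
    """Build the argument list for `lidtk cld2 predict --text <text>`."""
    # Passing a list instead of a shell string avoids quoting problems
    # when the text contains quotes or other special characters.
    return ["lidtk", "cld2", "predict", "--text", text]


def cld2_predict(text: str) -> str:
    """Run the lidtk CLI and return whatever it prints, stripped.

    Hypothetical helper: it only wraps the command line shown in this
    README and assumes `lidtk` can be found on the PATH.
    """
    result = subprocess.run(
        build_predict_command(text),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout.strip()
```

With lidtk installed, cld2_predict("This is a test.") would run the same command as the shell example. Wrapping the CLI keeps the sketch independent of lidtk's internal Python modules, which are not described here.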
The usual order is:
lidtk download: Please use WiLI-2018 instead of downloading the dataset on your own.
lidtk create-dataset: This step can be skipped if you use WiLI-2018.
lidtk analyze-unicode-block --start 0 --end 128
lidtk tfidf_nn train vectorizer --config lidtk/classifiers/config/tfidf_nn.yaml
lidtk tfidf_nn train vectorizer --config lidtk/classifiers/config/tfidf_nn.yaml
lidtk tfidf_nn wili --config lidtk/classifiers/config/tfidf_nn.yaml
Or to use one directly:
$ lidtk cld2 predict --text 'This text is written in some language.'
eng
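The usual order of commands listed earlier can also be scripted. The following is a minimal sketch under a few assumptions: WiLI-2018 is already available, so the download and create-dataset steps are left out; the repeated train vectorizer line is included only once; and lidtk is on the PATH. The names CONFIG, STEPS and run_steps are hypothetical. By default the function only prints the commands (dry run) and does not execute anything.

```python
import shlex
import subprocess
from typing import List

# Config path as given in this README.
CONFIG = "lidtk/classifiers/config/tfidf_nn.yaml"

# Steps copied from the "usual order" list; download and create-dataset
# are omitted on the assumption that WiLI-2018 is used instead.
STEPS = [
    "lidtk analyze-unicode-block --start 0 --end 128",
    f"lidtk tfidf_nn train vectorizer --config {CONFIG}",
    f"lidtk tfidf_nn wili --config {CONFIG}",
]


def run_steps(steps: List[str], dry_run: bool = True) -> List[List[str]]:
    """Print each step in order and, unless dry_run is set, execute it.

    Execution stops at the first failing command because check=True
    raises subprocess.CalledProcessError on a non-zero exit code.
    """
    commands = [shlex.split(step) for step in steps]
    for argv in commands:
        print("$", " ".join(argv))
        if not dry_run:
            subprocess.run(argv, check=True)
    return commands
```

Calling run_steps(STEPS) prints the three commands without running them; run_steps(STEPS, dry_run=False) would execute them one after another.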
Development
Check tests with tox.