Language identification Toolkit
lidtk
lidtk, the language identification toolkit, was written to investigate the current state of language identification performance.
Installation
The recommended way to install lidtk is:
$ pip install lidtk --user
If you want the latest version:
$ git clone https://github.com/MartinThoma/lidtk.git; cd lidtk
$ pip install -e . --user
I recommend getting the WiLI-2018 dataset.
Usage
$ lidtk --help
Usage: lidtk [OPTIONS] COMMAND [ARGS]...
Options:
--version Show the version and exit.
--help Show this message and exit.
Commands:
analyze-data Utility function for the languages...
analyze-unicode-block Analyze how important a Unicode block is for...
char-distrib Use the character distribution language...
cld2 Use the CLD-2 language classifier.
create-dataset Create sharable dataset from downloaded...
download Download 1000 documents of each language.
google-cloud Use the CLD-2 language classifier.
langdetect Use the langdetect language classifier.
langid Use the langid language classifier.
map Map predictions to something known by WiLI
nn Use a neural network classifier.
textcat Use the CLD-2 language classifier.
tfidf_nn Use the TfidfNNClassifier classifier.
For example:
$ lidtk cld2 predict --text 'This is a test.'
eng
The usual order is:
- lidtk download: Please use WiLI-2018 instead of downloading the dataset on your own.
- lidtk create-dataset: This step can be skipped if you use WiLI-2018.
- lidtk analyze-unicode-block --start 0 --end 128
- lidtk tfidf_nn train vectorizer --config lidtk/classifiers/config/tfidf_nn.yaml
- lidtk tfidf_nn wili --config lidtk/classifiers/config/tfidf_nn.yaml
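The analyze-unicode-block step takes a code point range (0 to 128 covers the ASCII range). One plausible reading is that it measures how much of a text falls inside that range. The sketch below illustrates only that general idea; it is not lidtk's implementation. `block_coverage` is a hypothetical helper, and treating the range as half-open (start included, end excluded) is an assumption.

```python
def block_coverage(text: str, start: int, end: int) -> float:
    """Return the fraction of characters whose code point lies in [start, end).

    Hypothetical helper for illustration; not part of lidtk.
    """
    if not text:
        return 0.0
    inside = sum(1 for char in text if start <= ord(char) < end)
    return inside / len(text)


# Two short samples: plain English, and a German word with two
# non-ASCII letters (o-umlaut U+00F6 and sharp s U+00DF).
samples = {
    "eng": "This is a test.",
    "deu": "Gr\u00f6\u00dfe",
}

for label, sample in samples.items():
    print(label, round(block_coverage(sample, 0, 128), 2))
```

A block that covers nearly all characters of every language separates languages poorly, while a block that is common in only a few languages is a strong signal for those languages.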
To use a classifier directly:
$ lidtk cld2 predict --text 'This text is written in some language.'
eng
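The same command can be called from a Python script. This is a minimal sketch that shells out to the lidtk command shown above rather than importing lidtk, since the README does not document a Python API. `build_predict_command` and `predict_language` are hypothetical helper names, and the assumption that classifiers other than cld2 accept the same `predict --text` form is untested here.

```python
import shlex
import subprocess
from typing import List


def build_predict_command(classifier: str, text: str) -> List[str]:
    """Build the argument list for `lidtk <classifier> predict --text <text>`."""
    return ["lidtk", classifier, "predict", "--text", text]


def predict_language(classifier: str, text: str) -> str:
    """Run the lidtk CLI and return its stripped standard output.

    Requires lidtk to be installed and on PATH; raises
    subprocess.CalledProcessError if the command exits with an error.
    """
    completed = subprocess.run(
        build_predict_command(classifier, text),
        capture_output=True,
        text=True,
        check=True,
    )
    return completed.stdout.strip()


# Show the shell command that would be executed, without running it.
command = build_predict_command("cld2", "This is a test.")
print(shlex.join(command))
```

Passing the arguments as a list avoids shell quoting problems when the text contains quotes or other special characters. With lidtk installed, `predict_language("cld2", "This is a test.")` would run the command from the example above.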
Development
Run the tests with tox.
File details
Details for the file lidtk-0.3.0.tar.gz.
File metadata
- Download URL: lidtk-0.3.0.tar.gz
- Upload date:
- Size: 38.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.48.0 CPython/3.8.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 29f277d41ba39648b446a78c25eaafdd6bc96374badd7518b6d6ed130e557fe8 |
| MD5 | e7849e262b236ff0c9175d966df6ac87 |
| BLAKE2b-256 | 7c8652f8e3acd4548e04a8904f94662db9646ddac4eda66b17b8b07210688e00 |
File details
Details for the file lidtk-0.3.0-py3-none-any.whl.
File metadata
- Download URL: lidtk-0.3.0-py3-none-any.whl
- Upload date:
- Size: 54.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.48.0 CPython/3.8.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 953560374940a4ad5f0fb4325e29a76a473ac16b7382ee38c586985ccc21d0c0 |
| MD5 | f8d0eb4e00a3dc2bebc4de60ab1e2b24 |
| BLAKE2b-256 | 6aab2bbace881056c7f2b0b999cafd8bb1d8dd3e68f10241b469475a6c55deda |