Language identification Toolkit

lidtk

lidtk, the language identification toolkit, was written to investigate the current state of language identification performance.

Installation

The recommended way to install lidtk is:

$ pip install lidtk --user

If you want the latest version:

$ git clone https://github.com/MartinThoma/lidtk.git; cd lidtk
$ pip install -e . --user

I recommend getting the WiLI-2018 dataset.

Usage

$ lidtk --help

Usage: lidtk [OPTIONS] COMMAND [ARGS]...

Options:
  --version  Show the version and exit.
  --help     Show this message and exit.

Commands:
  analyze-data           Utility function for the languages...
  analyze-unicode-block  Analyze how important a Unicode block is for...
  char-distrib           Use the character distribution language...
  cld2                   Use the CLD-2 language classifier.
  create-dataset         Create sharable dataset from downloaded...
  download               Download 1000 documents of each language.
  google-cloud           Use the CLD-2 language classifier.
  langdetect             Use the langdetect language classifier.
  langid                 Use the langid language classifier.
  map                    Map predictions to something known by WiLI
  nn                     Use a neural network classifier.
  textcat                Use the CLD-2 language classifier.
  tfidf_nn               Use the TfidfNNClassifier classifier.

For example:

$ lidtk cld2 predict --text 'This is a test.'
eng

The usual order is:

  1. lidtk download: Please use WiLI-2018 instead of downloading the dataset on your own.
  2. lidtk create-dataset: This step can be skipped if you use WiLI-2018
  3. lidtk analyze-unicode-block --start 0 --end 128
  4. lidtk tfidf_nn train vectorizer --config lidtk/classifiers/config/tfidf_nn.yaml
  5. lidtk tfidf_nn train vectorizer --config lidtk/classifiers/config/tfidf_nn.yaml
  6. lidtk tfidf_nn wili --config lidtk/classifiers/config/tfidf_nn.yaml
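
Step 3 asks how important a Unicode block is. One way to picture that question is to measure how many characters of a text fall inside a code point range. The sketch below is an illustration only and is not taken from lidtk's source: the function name `unicode_block_share`, the sample sentences, and the half-open interpretation of `--start 0 --end 128` (code points 0 to 127, i.e. ASCII) are all assumptions.

```python
from typing import Dict


def unicode_block_share(text: str, start: int, end: int) -> float:
    """Return the fraction of characters whose code point lies in [start, end).

    Hypothetical helper for illustration; lidtk may define the range
    and the statistic differently.
    """
    if not text:
        return 0.0
    in_block = sum(1 for char in text if start <= ord(char) < end)
    return in_block / len(text)


# Made-up sample sentences, keyed by ISO 639-3 code.
samples: Dict[str, str] = {
    "eng": "This is a test.",
    "deu": "Gr\u00fc\u00dfe aus K\u00f6ln",  # "Grüße aus Köln"
    "rus": "\u041f\u0440\u0438\u0432\u0435\u0442",  # "Привет"
}

# Share of characters in the range 0..127 for each sample.
shares = {lang: unicode_block_share(text, 0, 128) for lang, text in samples.items()}

for lang, share in shares.items():
    print(f"{lang}: {share:.2f}")
```

A block that covers nearly all characters of one language and almost none of another is a strong signal for telling the two apart, which is presumably why such an analysis comes before training a classifier.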

Or to use one directly:

$ lidtk cld2 predict --text 'This text is written in some language.'

eng
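
The same command can be called from a script. The following is a minimal sketch that wraps the command line shown above with Python's `subprocess` module. The helper names `build_predict_command` and `predict_language` are hypothetical, and it is an assumption that classifiers other than `cld2` accept the same `predict --text` form.

```python
import subprocess
from typing import List


def build_predict_command(classifier: str, text: str) -> List[str]:
    """Build the argument list for `lidtk <classifier> predict --text <text>`."""
    return ["lidtk", classifier, "predict", "--text", text]


def predict_language(classifier: str, text: str) -> str:
    """Run the lidtk CLI and return whatever it prints, stripped of whitespace.

    Requires lidtk (and the chosen classifier's dependencies) to be installed;
    raises subprocess.CalledProcessError if the command exits with an error.
    """
    result = subprocess.run(
        build_predict_command(classifier, text),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


# Example call (commented out because it needs a working lidtk installation):
# print(predict_language("cld2", "This text is written in some language."))
```

Passing the arguments as a list avoids shell quoting problems when the text contains quotes or other special characters.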

Development

Run the tests with tox.
