
Does language identification


whichlang

whichlang is a Python library for identifying the language of a given text.

Installation

Use the package manager pip to install whichlang.

pip install whichlang

Usage

from whichlang import which_lang

# Read the sample text; the explicit encoding assumes the file is UTF-8
with open('sample-test-files\\sample-hindi.txt', 'r', encoding='utf-8') as f:
    data = f.read()

# Returns a tuple of the 3 most probable languages, most probable first
print(which_lang(data))

# Output: ('Hindi', 'Marathi', 'Punjabi'), with Hindi being the most probable

# To train a language model, run the training script from a shell:
#   -f  path to the training data (here assamese.txt)
#   -l  name of the language model to create (here Assamese)

python train_lang_models.py -f train-data\as\assamese.txt -l Assamese
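
The internals of train_lang_models.py are not shown here. Purely as an illustration of the -f / -l interface used in the command above, a script of this shape could parse its arguments as sketched below. The long option names (--file, --language), the help texts, and the function name build_parser are assumptions made for this example and may not match the real script.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the -f / -l flags of the training command."""
    parser = argparse.ArgumentParser(description="Train a language model from a text file.")
    # -f and -l come from the command above; the long names are assumptions
    parser.add_argument("-f", "--file", required=True, help="path to the training text")
    parser.add_argument("-l", "--language", required=True, help="name of the model to create")
    return parser


# Parse the same arguments as the example command (path kept exactly as shown)
args = build_parser().parse_args(["-f", r"train-data\as\assamese.txt", "-l", "Assamese"])
print(args.file, args.language)
```

Marking both options as required makes the script fail early with a usage message when either the training file or the model name is missing.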

Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.

License

MIT

Available Languages

Hindi, Telugu, Tamil, Kannada, Malayalam, Punjabi, Marathi, Gujarati, Oriya, Assamese.

Acknowledgements

  1. We would like to thank the Leipzig Corpora Collection, from which we collected the data for training our models.

    Dirk Goldhahn, Thomas Eckart and Uwe Quasthoff (2012): Building Large Monolingual Dictionaries at the Leipzig Corpora Collection: From 100 to 200 Languages. In: Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), 2012

  2. whichlang is based on N-gram-based text categorization: Cavnar, William B., and John M. Trenkle. "N-gram-based text categorization." Proceedings of SDAIR-94, 3rd Annual Symposium on Document Analysis and Information Retrieval, pp. 161-175. 1994.

The same approach is used in the langdetect library. We found it quite effective and wanted to explore it for Indian languages. In whichlang, we train, optimize, and make models readily available for Indian languages, since these languages have been less explored.
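
For readers unfamiliar with the N-gram-based method cited above, here is a minimal sketch of its core idea: build a ranked profile of the most frequent character n-grams for each language, then rank languages by the "out-of-place" distance between the document profile and each language profile. This is not whichlang's actual code. The function names, the parameters (max_n, top_k, missing_penalty), and the tiny training sentences are all assumptions chosen for illustration.

```python
from collections import Counter


def ngram_profile(text, max_n=3, top_k=300):
    """Return the top_k most frequent character n-grams (n = 1..max_n), most frequent first."""
    counts = Counter()
    for word in text.split():
        padded = f"_{word}_"  # underscores mark word boundaries
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Sort by descending count, breaking ties alphabetically for determinism
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top_k]]


def out_of_place(doc_profile, lang_profile, missing_penalty=300):
    """Sum of rank differences; n-grams absent from lang_profile cost missing_penalty."""
    rank = {gram: i for i, gram in enumerate(lang_profile)}
    return sum(abs(i - rank[g]) if g in rank else missing_penalty
               for i, g in enumerate(doc_profile))


def rank_languages(text, lang_profiles, top=3):
    """Return up to `top` language names, smallest distance (most probable) first."""
    doc = ngram_profile(text)
    ordered = sorted(lang_profiles, key=lambda name: out_of_place(doc, lang_profiles[name]))
    return tuple(ordered[:top])


# Toy training data (hypothetical, and far too small for real use)
training = {
    "Hindi": "यह एक छोटा सा वाक्य है और यह हिंदी में लिखा गया है",
    "Tamil": "இது ஒரு சிறிய வாக்கியம் மற்றும் இது தமிழில் எழுதப்பட்டுள்ளது",
}
profiles = {name: ngram_profile(sample) for name, sample in training.items()}

# Hindi is ranked first for this Devanagari input
print(rank_languages("यह हिंदी में है", profiles))
```

Languages that share a script, such as Hindi and Marathi, have heavily overlapping n-gram profiles, so telling them apart in practice takes much larger training corpora than the toy sentences used here.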
