
Does language identification for Indian languages


whichlang

whichlang is a Python library for identifying the language of a given text.

Installation

Use the package manager pip to install whichlang.

pip install whichlang

Usage

from whichlang import which_lang

# Read the sample as UTF-8 so Indic scripts decode correctly on every platform
with open('sample-test-files/sample-hindi.txt', 'r', encoding='utf-8') as f:
    data = f.read()

# Returns a tuple of the top 3 probable languages, most probable first
print(which_lang(data))
# ('Hindi', 'Marathi', 'Punjabi')  # Hindi is the most probable
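The returned tuple can be processed like any other Python tuple. The following is a minimal sketch, assuming only what is shown above: that which_lang takes a string and returns language names ordered from most to least probable. The helper top_language_counts and the function stand_in_which_lang are hypothetical names introduced for illustration; the stand-in replaces which_lang so the snippet runs without the package installed.

```python
from collections import Counter

def top_language_counts(texts, identify):
    """Count how often each language is ranked first across several texts."""
    return Counter(identify(text)[0] for text in texts)

def stand_in_which_lang(text):
    # Hypothetical stand-in that mimics the tuple shape shown above.
    # In real use, pass whichlang's which_lang instead.
    return ("Hindi", "Marathi", "Punjabi")

counts = top_language_counts(["पहला वाक्य", "दूसरा वाक्य"], stand_in_which_lang)
print(counts.most_common(1))
# [('Hindi', 2)]
```

Passing the identifier function as an argument keeps the helper independent of how the models are loaded.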

# To train a language model:
# -f: path to the training data (here, assamese.txt)
# -l: name of the language model to create (here, Assamese)
python train_lang_models.py -f train-data\as\assamese.txt -l Assamese

Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.

License

MIT

Available Languages

Hindi, Telugu, Tamil, Kannada, Malayalam, Punjabi, Marathi, Gujarati, Oriya, Assamese.

Acknowledgements

  1. We would like to thank the Leipzig Corpora Collection, from which we collected the data used to train the models.

    Dirk Goldhahn, Thomas Eckart and Uwe Quasthoff (2012): Building Large Monolingual Dictionaries at the Leipzig Corpora Collection: From 100 to 200 Languages. In: Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), 2012.

  2. whichlang is based on N-gram-based text categorization: Cavnar, William B., and John M. Trenkle. "N-gram-based text categorization." Proceedings of SDAIR-94, 3rd Annual Symposium on Document Analysis and Information Retrieval, pp. 161-175. 1994.

The same approach was used in the langdetect library. We found it quite effective and wanted to explore it for Indian languages. In whichlang, we train, optimize, and make models readily available for Indian languages, since these languages have been less explored.
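To illustrate the idea behind the cited paper, here is a minimal sketch of rank-order n-gram classification in the style of Cavnar and Trenkle. It is not whichlang's actual implementation: the function names, the n-gram lengths (1 to 3), the profile cutoff (300), and the tiny training sentences are all assumptions chosen for illustration.

```python
from collections import Counter

def ngram_profile(text, max_n=3, top_k=300):
    """Map the most frequent character n-grams (length 1..max_n) to their rank."""
    counts = Counter()
    for token in text.split():
        padded = f"_{token}_"  # underscores mark word boundaries
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    ranked = [gram for gram, _ in counts.most_common(top_k)]
    return {gram: rank for rank, gram in enumerate(ranked)}

def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams missing from the language get a fixed penalty."""
    max_penalty = len(lang_profile)
    return sum(
        abs(rank - lang_profile[gram]) if gram in lang_profile else max_penalty
        for gram, rank in doc_profile.items()
    )

def rank_languages(text, lang_profiles, top=3):
    """Return up to `top` language names, closest profile first."""
    doc = ngram_profile(text)
    distances = {lang: out_of_place(doc, prof) for lang, prof in lang_profiles.items()}
    return tuple(sorted(distances, key=distances.get)[:top])

# Toy training data (assumption); real profiles need large corpora.
samples = {
    "Hindi": "यह एक छोटा सा परीक्षण वाक्य है",
    "Tamil": "இது ஒரு சிறிய சோதனை வாக்கியம்",
}
profiles = {lang: ngram_profile(text) for lang, text in samples.items()}
print(rank_languages("यह वाक्य है", profiles))
```

A smaller out-of-place distance means the text's most frequent n-grams appear at similar ranks in the language profile, so the languages are returned in ascending order of distance.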
