Does language identification for Indian languages

whichlang

whichlang is a Python library for identifying the language of a given text.

Installation

Use the package manager pip to install whichlang.

pip install whichlang

Usage

from whichlang import whichlang as wl

# Read the sample file as UTF-8; the with-block closes the file afterwards
with open('sample-test-files/sample-hindi.txt', 'r', encoding='utf-8') as f:
    data = f.read()

# Returns a tuple of the 3 most probable languages, most probable first
print(wl.which_lang(data))
# Output: ('Hindi', 'Marathi', 'Punjabi'), with Hindi being the most probable

# To train a language model:
#   -f  training data file (here, assamese.txt)
#   -l  name of the language model to create (here, Assamese)
python train_lang_models.py -f train-data\as\assamese.txt -l Assamese
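Building on the usage example above, here is a minimal sketch of how several files might be classified in one pass. The helper name identify_files and its identify parameter are hypothetical and not part of whichlang; the sketch only assumes that the classifier returns candidate languages ordered from most to least probable, as which_lang is described as doing.

```python
from pathlib import Path
from typing import Callable, Dict, Iterable, Sequence


def identify_files(
    paths: Iterable[str],
    identify: Callable[[str], Sequence[str]],
) -> Dict[str, str]:
    """Map each non-empty file to its most probable language.

    `identify` is any callable that takes text and returns candidate
    languages ordered most probable first (hypothetical helper, not
    part of whichlang itself).
    """
    results: Dict[str, str] = {}
    for path in paths:
        text = Path(path).read_text(encoding="utf-8")
        if not text.strip():
            # Skip empty or whitespace-only files: nothing to classify
            continue
        # Keep only the first (most probable) candidate
        results[str(path)] = identify(text)[0]
    return results
```

With whichlang installed, the classifier from the usage example could be passed in directly, for instance identify_files(['sample-test-files/sample-hindi.txt'], wl.which_lang). Taking the classifier as a parameter keeps the helper easy to try out with a stand-in function.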

Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.

License

MIT

Available Languages

Hindi, Telugu, Tamil, Kannada, Malayalam, Punjabi, Marathi, Gujarati, Oriya, Assamese.

Acknowledgements

  1. We would like to thank the Leipzig Corpora Collection, from which we collected the data used to train the models.

    Dirk Goldhahn, Thomas Eckart and Uwe Quasthoff (2012): Building Large Monolingual Dictionaries at the Leipzig Corpora Collection: From 100 to 200 Languages. In: Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), 2012.

  2. whichlang is based on N-gram-based text categorization: Cavnar, William B., and John M. Trenkle. "N-gram-based text categorization." Proceedings of SDAIR-94, 3rd Annual Symposium on Document Analysis and Information Retrieval, pp. 161-175. 1994.

The same approach is used in the langdetect library. We found it quite effective and wanted to explore it for Indian languages. In whichlang, we train, optimize, and publish ready-to-use models for Indian languages, since these languages have been less explored.
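To give a feel for the N-gram-based categorization idea cited above, the following is a simplified sketch of rank-order profile matching in the spirit of Cavnar and Trenkle. It is not whichlang's actual implementation: the function names, the n-gram lengths (1 to 3), and the profile size of 300 are all assumptions chosen for illustration.

```python
from collections import Counter
from typing import Dict, List


def ngram_profile(text: str, max_n: int = 3, top_k: int = 300) -> List[str]:
    """Return the top_k most frequent character n-grams, most frequent first."""
    counts: Counter = Counter()
    for token in text.split():
        padded = f"_{token}_"  # underscores mark word boundaries
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Sort by descending count, then alphabetically so ties are deterministic
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top_k]]


def out_of_place(doc_profile: List[str], lang_profile: List[str]) -> int:
    """Sum of rank differences between two profiles.

    N-grams missing from lang_profile receive the maximum penalty,
    taken here to be the length of lang_profile.
    """
    lang_rank = {gram: rank for rank, gram in enumerate(lang_profile)}
    max_penalty = len(lang_profile)
    return sum(
        abs(rank - lang_rank[gram]) if gram in lang_rank else max_penalty
        for rank, gram in enumerate(doc_profile)
    )


def rank_languages(text: str, profiles: Dict[str, List[str]]) -> List[str]:
    """Order language names from closest to farthest profile."""
    doc = ngram_profile(text)
    return sorted(profiles, key=lambda lang: out_of_place(doc, profiles[lang]))
```

In this sketch, "training" a language amounts to calling ngram_profile on a large sample of that language and storing the result under the language name; rank_languages then compares a new text against every stored profile and the first few entries play the role of the top candidates.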
