# lang-detect: a tool to detect language

Detects the language of a short piece of Unicode text without depending on any other libraries.

Currently we support detecting de, en, es, fr, it, ja, nl, pl, ru, zh-hans, zh-hant, and zh-yue.

In some simple testing, we found that results are better for longer sentences.

## Method

We focus on the Basic Multilingual Plane of Unicode. The current set of supported languages could be extended.

For each language, we use a normalized n-gram vector to represent the language itself. These vectors can be found in the data folder.

To detect the language of a text, we generate the normalized n-gram vector for that text and compare the cosine of the angle between the text vector and each language vector. The language whose vector gives the largest cosine is the closest match.
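The comparison described above can be sketched roughly as follows. This is a minimal illustration and not the project's actual code: the function names (`ngram_vector`, `cosine`, `detect`), the choice of character bigrams, the lowercasing step, and the toy language vectors are all assumptions made for the example.

```python
import math
from collections import Counter


def ngram_vector(text, n=2):
    """Return a unit-length character n-gram vector as a dict (hypothetical helper)."""
    # Keep only code points inside the Basic Multilingual Plane (U+0000..U+FFFF).
    chars = [c for c in text.lower() if ord(c) <= 0xFFFF]
    counts = Counter("".join(chars[i:i + n]) for i in range(len(chars) - n + 1))
    norm = math.sqrt(sum(v * v for v in counts.values()))
    if norm == 0:
        return {}
    # Dividing by the Euclidean norm makes the vector unit length.
    return {gram: v / norm for gram, v in counts.items()}


def cosine(a, b):
    """Cosine of the angle between two unit-length sparse vectors."""
    # For unit vectors the cosine is just the dot product.
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(gram, 0.0) for gram, w in a.items())


def detect(text, language_vectors, n=2):
    """Return the (language, cosine) pair with the largest cosine."""
    text_vec = ngram_vector(text, n)
    scores = {lang: cosine(text_vec, vec) for lang, vec in language_vectors.items()}
    return max(scores.items(), key=lambda kv: kv[1])


# Toy language vectors built from single sentences, for illustration only.
toy_vectors = {
    "en": ngram_vector("the quick brown fox jumps over the lazy dog and then the cat"),
    "de": ngram_vector("der schnelle braune fuchs springt über den faulen hund"),
}
lang, score = detect("the dog and the fox", toy_vectors)
print(lang, round(score, 3))
```

Because both vectors are normalized, the cosine reduces to a sparse dot product. A short input has few n-grams to match against, which fits the observation that longer sentences give better results.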

To build the language vectors, we use featured articles on Wikipedia as the corpus.
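As a rough sketch of how a language vector might be derived from such a corpus, the example below sums n-gram counts over a list of article texts and normalizes the result. The names `count_ngrams` and `build_language_vector`, the bigram size, and the optional `top_k` cutoff are assumptions for illustration. The project's real build step and the file format used in the data folder are not described here.

```python
import math
from collections import Counter


def count_ngrams(text, n=2):
    """Count character n-grams, keeping only Basic Multilingual Plane code points."""
    chars = [c for c in text.lower() if ord(c) <= 0xFFFF]
    return Counter("".join(chars[i:i + n]) for i in range(len(chars) - n + 1))


def build_language_vector(articles, n=2, top_k=None):
    """Sum n-gram counts over all articles and scale the result to unit length."""
    total = Counter()
    for article in articles:
        total.update(count_ngrams(article, n))
    if top_k is not None:
        # Hypothetical option: keep only the most frequent n-grams to limit size.
        total = Counter(dict(total.most_common(top_k)))
    norm = math.sqrt(sum(v * v for v in total.values()))
    if norm == 0:
        return {}
    return {gram: v / norm for gram, v in total.items()}


# Two tiny stand-in "articles"; a real corpus would be far larger.
vector = build_language_vector(["abab", "ab"])
print(vector)
```

Counting n-grams within each article separately avoids creating spurious n-grams that span article boundaries.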

## Usage

From the project root, run:

    bin/langdetect YOUR_SENTENCE_HERE


Built Distribution

lang_detect-0.0.1-py2.6.egg (1.4 kB)


Hashes for lang_detect-0.0.1-py2.6.egg
Algorithm Hash digest
SHA256 c1fa4a594eab61f1d2cbf9fece10f91cd5b507a7155410764c8b579c4c6e8a09
MD5 f80c43a3beb93cf25d7acfdde1d95603
BLAKE2b-256 ce63a28dd6e7a709c6d758d70cc6690b834dc769300960affd1621cbacd170c8
