CLD3 Python bindings

pycld3

Python bindings to the Compact Language Detector v3 (CLD3).

This package contains Python bindings (via Cython) to Google's CLD3 library.

Installation

Install via pip:

python -m pip install pycld3

Developers: see also Building from Source.

Usage

cld3 exports two module-level functions, get_language() and get_frequent_languages():

>>> import cld3

>>> cld3.get_language("影響包含對氣候的變化以及自然資源的枯竭程度")
LanguagePrediction(language='zh', probability=0.999969482421875, is_reliable=True, proportion=1.0)

>>> cld3.get_language("This is a test")
LanguagePrediction(language='en', probability=0.9999980926513672, is_reliable=True, proportion=1.0)

>>> for lang in cld3.get_frequent_languages(
...     "This piece of text is in English. Този текст е на Български.",
...     num_langs=3
... ):
...     print(lang)
...
LanguagePrediction(language='bg', probability=0.9173890948295593, is_reliable=True, proportion=0.5853658318519592)
LanguagePrediction(language='en', probability=0.9999790191650391, is_reliable=True, proportion=0.4146341383457184)
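Each prediction carries an is_reliable flag and a proportion alongside the language code, so a caller will often want to filter on them. The following is a minimal sketch of one way to do that, not part of the cld3 API. The Prediction class is a hypothetical stand-in that only mirrors the four fields printed above (so the snippet runs without cld3 installed), reliable_languages() is an illustrative helper, and the 0.7 probability cutoff is an arbitrary assumption.

```python
from typing import Iterable, List, NamedTuple


class Prediction(NamedTuple):
    """Hypothetical stand-in with the same four fields as the output shown above."""
    language: str
    probability: float
    is_reliable: bool
    proportion: float


def reliable_languages(predictions: Iterable[Prediction],
                       min_probability: float = 0.7) -> List[str]:
    """Return language codes of reliable predictions, largest proportion first.

    The min_probability cutoff is an assumption chosen for illustration.
    """
    kept = [
        p for p in predictions
        if p.is_reliable and p.probability >= min_probability
    ]
    kept.sort(key=lambda p: p.proportion, reverse=True)
    return [p.language for p in kept]


# Values copied from the get_frequent_languages() output above.
sample = [
    Prediction("bg", 0.9173890948295593, True, 0.5853658318519592),
    Prediction("en", 0.9999790191650391, True, 0.4146341383457184),
]
print(reliable_languages(sample))
```

Assuming the objects returned by cld3 expose these fields as attributes, the result of cld3.get_frequent_languages() could be passed to the helper in place of sample.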

FAQ

cld3 incorrectly detects my input. How can I fix this?

A first resort is to preprocess (clean) your input text based on conditions specific to your program.

A salient example is to remove URLs and email addresses from the input. CLD3 (unlike CLD2) does almost none of this cleaning for you, in the spirit of not penalizing other users with overhead that they may not need.

Here's such an example using a simplified URL regex from Regular Expressions Cookbook, 2nd ed.:

>>> import re
>>> import cld3

# cld3 does not ignore the URL components by default
>>> s = "Je veux que: https://site.english.com/this/is/a/url/path/component#fragment"
>>> cld3.get_language(s)
LanguagePrediction(language='en', probability=0.5319557189941406, is_reliable=False, proportion=1.0)

>>> url_re = r"\b(?:https?://|www\.)[a-z0-9-]+(\.[a-z0-9-]+)+(?:[/?].*)?"
>>> new_s = re.sub(url_re, "", s)
>>> new_s
'Je veux que: '
>>> cld3.get_language(new_s)
LanguagePrediction(language='fr', probability=0.9799421429634094, is_reliable=True, proportion=1.0)

Note: This URL regex aims for simplicity. It requires a domain name, and doesn't allow a username or password; it allows the scheme (http or https) to be omitted if it can be inferred from the subdomain (www). Source: Regular Expressions Cookbook, 2nd ed. - Goyvaerts & Levithan.
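The answer above mentions email addresses as well as URLs, so here is a rough sketch of a cleaning helper that strips both before detection. It makes several assumptions: clean_text() is a hypothetical name, the email pattern is a deliberately loose approximation rather than a full address grammar, and the URL pattern differs from the one above in that the path stops at the first whitespace (\S* instead of .*) so that text following a URL is kept.

```python
import re

# Variant of the simplified URL regex above; the path ends at whitespace
# (an assumption made here so that trailing text survives).
URL_RE = re.compile(
    r"\b(?:https?://|www\.)[a-z0-9-]+(?:\.[a-z0-9-]+)+(?:[/?]\S*)?",
    re.IGNORECASE,
)

# Loose email pattern, for illustration only.
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")


def clean_text(text: str) -> str:
    """Remove email addresses and URLs, then collapse runs of whitespace."""
    text = EMAIL_RE.sub(" ", text)
    text = URL_RE.sub(" ", text)
    return " ".join(text.split())


print(clean_text(
    "Je veux que: https://site.english.com/this/is/a/url/path/component#fragment merci"
))
```

The cleaned string can then be passed to cld3.get_language() as in the example above.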

In other cases, you cannot fix the incorrect detection. Language detection algorithms in general tend to perform poorly on very short inputs, so you should rarely trust the result for an input as short as "hi". Keep this limitation in mind regardless of which library you are using.
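One way to act on this limitation is to skip detection for very short inputs instead of trusting a guess. The sketch below assumes a hypothetical helper, detect_or_none(), and an arbitrary 20-character threshold. Neither comes from cld3, and a suitable threshold depends on your data. The detector is passed in as a callable so that the snippet runs without cld3 installed.

```python
from typing import Any, Callable, Optional


def detect_or_none(text: str,
                   detect: Callable[[str], Any],
                   min_chars: int = 20) -> Optional[Any]:
    """Call detect(text) only when the stripped text has at least min_chars characters.

    min_chars=20 is an illustrative assumption, not a recommendation from cld3.
    """
    stripped = text.strip()
    if len(stripped) < min_chars:
        return None
    return detect(stripped)


# A stub detector stands in for a real one in this demonstration.
def stub(s: str) -> str:
    return "detected"


print(detect_or_none("hi", stub))
print(detect_or_none("This piece of text is in English.", stub))
```

With cld3 installed, cld3.get_language could be passed as the detect argument.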

How do I fix an error telling me "The Protobuf compiler, protoc, could not be found"?

The Protobuf compiler, protoc, is required for building this package. (However, if you are installing from PyPI with pip, then the .h and .cc files generated with protoc will already be included.)

Below are some quick install commands, but please consult the official protobuf repository for information on installing Protobuf.

Ubuntu Linux:

sudo apt-get update
sudo apt-get install protobuf-compiler

macOS:

brew update && brew install protobuf
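After installing, you may want to confirm that protoc is actually visible on your PATH before retrying the build. The following sketch uses only the Python standard library; find_protoc() is a hypothetical helper name, and the check assumes the build locates protoc through PATH.

```python
import shutil
from typing import Optional


def find_protoc(name: str = "protoc") -> Optional[str]:
    """Return the full path of the named executable, or None if it is not on PATH."""
    return shutil.which(name)


path = find_protoc()
if path is None:
    print("protoc not found on PATH; install Protobuf before building from source")
else:
    print(f"protoc found at {path}")
```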

Authors

This repository contains a fork of google/cld3 at commit 06f695f. The license for google/cld3 can be found at LICENSES/CLD3_LICENSE.

This repository is a combination of changes introduced by various forks of google/cld3 by the following people:

For Developers: Building from Source

To build this extension from scratch, you will need:

  • Cython
  • Protobuf, including the protoc Protobuf compiler available as an executable

Building the extension does not require the Chromium repository.

With these installed, you can run the following from the project root:

python setup.py bdist_wheel
python setup.py build_ext --inplace

Testing:

make test
