pycld3

Python bindings to the Compact Language Detector v3 (CLD3).

This package contains Python bindings (via Cython) to Google's CLD3 library.

Installation

Note: If you're using CPython 3.7 on macOS, you can skip this section and simply run pip install pycld3, because the PyPI distribution includes cp37 wheels. It's on my to-do list to add wheels for other platforms and versions soon.

This package requires a bit more than a one-line pip install to get up and running. You'll also need the Protobuf compiler (the protoc executable), as well as the Protobuf development headers. Follow along below; I promise this will be painless:

Ubuntu Linux: protobuf-compiler installs protoc, while libprotobuf-dev contains the Protobuf development headers and static libraries.

sudo apt-get update
sudo apt-get install protobuf-compiler libprotobuf-dev

RHEL: Install from source.

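# -L follows the redirect that GitHub release downloads use
curl -sL -o protobuf-all-3.10.0.tar.gz \
    https://github.com/protocolbuffers/protobuf/releases/download/v3.10.0/protobuf-all-3.10.0.tar.gz
tar -xzf protobuf-all-3.10.0.tar.gz && rm -f protobuf-all-3.10.0.tar.gz
# The archive unpacks into protobuf-3.10.0/, not protobuf-all-3.10.0/
cd protobuf-3.10.0
# Run the install step as root (or prefix it with sudo) for the default prefix
./configure && make && make install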

Mac OS X: brew install protobuf both installs protoc and places the header files where they need to be (typically at /usr/local/Cellar/protobuf/x.y.z/include/).

brew update && brew install protobuf

The commands above are quick-start shortcuts; please consult the official protobuf repository for complete instructions on installing Protobuf.
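
Before moving on, it can help to confirm that protoc is actually reachable on your PATH. The following is a minimal sketch and not part of pycld3; find_protoc and protoc_version are hypothetical helper names, and the check says nothing about whether the Protobuf development headers are installed.

```python
import shutil
import subprocess


def find_protoc():
    """Return the path to the protoc executable, or None if it is not on PATH."""
    return shutil.which("protoc")


def protoc_version(path):
    """Return the text printed by `protoc --version`, or None if the call fails."""
    try:
        result = subprocess.run(
            [path, "--version"],
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return result.stdout.strip()


if __name__ == "__main__":
    protoc_path = find_protoc()
    if protoc_path is None:
        print("protoc not found on PATH; install Protobuf first")
    else:
        print(protoc_path, protoc_version(protoc_path))
```

If the script reports that protoc is missing, revisit the platform-specific steps above before running pip.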

Okay, now you're ready for the easy part. Install via pip:

python -m pip install pycld3

Usage

cld3 exports two module-level functions, get_language() and get_frequent_languages():

>>> import cld3

>>> cld3.get_language("影響包含對氣候的變化以及自然資源的枯竭程度")
LanguagePrediction(language='zh', probability=0.999969482421875, is_reliable=True, proportion=1.0)

>>> cld3.get_language("This is a test")
LanguagePrediction(language='en', probability=0.9999980926513672, is_reliable=True, proportion=1.0)

>>> for lang in cld3.get_frequent_languages(
...     "This piece of text is in English. Този текст е на Български.",
...     num_langs=3
... ):
...     print(lang)
...
LanguagePrediction(language='bg', probability=0.9173890948295593, is_reliable=True, proportion=0.5853658318519592)
LanguagePrediction(language='en', probability=0.9999790191650391, is_reliable=True, proportion=0.4146341383457184)
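
A common follow-up is to keep only the languages that are both reliable and make up a meaningful share of the text. Here is a minimal sketch of that filtering. It assumes the four fields shown in the output above are readable as attributes; Prediction is a stand-in type so the example runs without cld3 installed, and reliable_languages and the 0.2 threshold are hypothetical choices, not part of the cld3 API.

```python
from collections import namedtuple

# Stand-in with the same four fields as the LanguagePrediction results shown
# above, so this sketch runs even when cld3 is not installed.
Prediction = namedtuple("Prediction", "language probability is_reliable proportion")


def reliable_languages(predictions, min_proportion=0.2):
    """Return language codes for reliable predictions, largest proportion first."""
    kept = [
        p for p in predictions
        if p.is_reliable and p.proportion >= min_proportion
    ]
    kept.sort(key=lambda p: p.proportion, reverse=True)
    return [p.language for p in kept]


# Values copied from the get_frequent_languages() output above.
sample = [
    Prediction("bg", 0.9173890948295593, True, 0.5853658318519592),
    Prediction("en", 0.9999790191650391, True, 0.4146341383457184),
]
print(reliable_languages(sample))  # ['bg', 'en']
```

With cld3 installed, the same function should accept the result of cld3.get_frequent_languages(text, num_langs=3) directly, under the attribute-access assumption stated above.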

FAQ

cld3 incorrectly detects my input. How can I fix this?

A first resort is to preprocess (clean) your input text based on conditions specific to your program.

A salient example is to remove URLs and email addresses from the input. CLD3 (unlike CLD2) does almost none of this cleaning for you, in the spirit of not penalizing other users with overhead that they may not need.

Here's such an example using a simplified URL regex from Regular Expressions Cookbook, 2nd ed.:

>>> import re
>>> import cld3

# cld3 does not ignore the URL components by default
>>> s = "Je veux que: https://site.english.com/this/is/a/url/path/component#fragment"
>>> cld3.get_language(s)
LanguagePrediction(language='en', probability=0.5319557189941406, is_reliable=False, proportion=1.0)

>>> url_re = r"\b(?:https?://|www\.)[a-z0-9-]+(\.[a-z0-9-]+)+(?:[/?].*)?"
>>> new_s = re.sub(url_re, "", s)
>>> new_s
'Je veux que: '
>>> cld3.get_language(new_s)
LanguagePrediction(language='fr', probability=0.9799421429634094, is_reliable=True, proportion=1.0)

Note: This URL regex aims for simplicity. It requires a domain name, and doesn't allow a username or password; it allows the scheme (http or https) to be omitted if it can be inferred from the subdomain (www). Source: Regular Expressions Cookbook, 2nd ed. - Goyvaerts & Levithan.
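
The same idea extends to a small reusable cleaning helper that strips both URLs and email addresses before detection. This is a sketch under stated assumptions: clean_text is a hypothetical name, the URL pattern is a variant of the one above (the path part stops at whitespace so that text following a URL survives), and the email pattern is intentionally loose rather than standards-compliant.

```python
import re

# Variant of the URL pattern above: the inner group is non-capturing and the
# path part matches only non-whitespace, so words after the URL are kept.
URL_RE = re.compile(
    r"\b(?:https?://|www\.)[a-z0-9-]+(?:\.[a-z0-9-]+)+(?:[/?]\S*)?",
    re.IGNORECASE,
)

# Deliberately loose email pattern (an assumption; not RFC 5322 compliant).
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")


def clean_text(text):
    """Remove URLs and email addresses, then collapse runs of whitespace."""
    text = URL_RE.sub(" ", text)
    text = EMAIL_RE.sub(" ", text)
    return " ".join(text.split())


s = "Je veux que: https://site.english.com/this/is/a/url/path/component#fragment"
print(clean_text(s))  # Je veux que:
```

The cleaned string can then be passed to cld3.get_language() as in the example above.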

In some other cases, you cannot fix the incorrect detection. Language detection algorithms in general may perform poorly with very short inputs. Rarely should you trust the output of something like detect("hi"). Keep this limitation in mind regardless of what library you are using.
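
One way to act on this limitation is to refuse to report a language when the input is too short or the prediction is flagged as unreliable. The sketch below takes the detector as a parameter so it runs without cld3 installed; detect_or_none, the stub detector, and the 20-character cutoff are all hypothetical choices, not recommendations from the CLD3 authors.

```python
from collections import namedtuple

# Stand-in with the same fields as the LanguagePrediction results shown earlier.
Prediction = namedtuple("Prediction", "language probability is_reliable proportion")


def detect_or_none(text, detector, min_chars=20):
    """Return detector(text) only for long-enough input with a reliable result."""
    stripped = text.strip()
    if len(stripped) < min_chars:
        return None  # too short to trust any detector
    prediction = detector(stripped)
    # Defensive check: treat a missing result the same as an unreliable one.
    if prediction is None or not prediction.is_reliable:
        return None
    return prediction


def stub_detector(text):
    """Hypothetical detector used only to demonstrate the guard."""
    return Prediction("en", 0.99, True, 1.0)


print(detect_or_none("hi", stub_detector))  # None
print(detect_or_none("This sentence is long enough to check.", stub_detector).language)  # en
```

With cld3 installed, cld3.get_language can be passed as the detector argument in place of the stub.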

Authors

This repository contains a fork of google/cld3 at commit 06f695f. The license for google/cld3 can be found at LICENSES/CLD3_LICENSE.

This repository is a combination of changes introduced by various forks of google/cld3 by the following people:
