CLD3 Python bindings

pycld3

Python bindings to the Compact Language Detector v3 (CLD3).

Newer Alternative: gcld3

Note: Since pycld3 was first published, Google's cld3 authors have released the Python package gcld3, which provides official Python bindings built with pybind11. Please check that project out, as it is part of the canonical cld3 repository and will likely stay in closer lockstep with cld3 changes over time.

Overview

This package contains Python bindings (via Cython) to Google's CLD3 library.

>>> import cld3
>>> cld3.get_language("影響包含對氣候的變化以及自然資源的枯竭程度")
LanguagePrediction(language='zh', probability=0.999969482421875, is_reliable=True, proportion=1.0)

The library outputs BCP-47-style language codes. For some languages, output is differentiated by script. Language and script names are taken from Unicode CLDR. More than 100 languages/scripts are supported; see the full list in Google's CLD3 documentation.
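
Because some codes carry a script suffix, it can help to split the language from the script before comparing results. The following is a minimal sketch, assuming codes of the form `language` or `language-Script` (for example `bg-Latn`). The helper name `split_language_code` is hypothetical and is not part of cld3.

```python
def split_language_code(code):
    """Split a BCP-47-style code such as 'bg-Latn' into (language, script).

    The script part is None when the code has no suffix, e.g. 'zh'.
    """
    language, _, script = code.partition("-")
    return language, (script or None)


print(split_language_code("zh"))       # ('zh', None)
print(split_language_code("bg-Latn"))  # ('bg', 'Latn')
```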

Installing with Wheels: Supported Versions and Platforms

This project supports CPython versions 3.5 through 3.9.

We publish wheels for the following matrix:

  • macOS: CPython 3.5 through 3.9
  • Linux: CPython 3.5 through 3.9 (manylinux1)

The wheels for both macOS and manylinux1 bundle the external protobuf library, copied into the wheel itself via delocate or auditwheel respectively, so that you won't need to install any extra non-PyPI dependencies.

If you are installing on one of the variants listed above, you should not need to have protoc or libprotobuf installed:

python -m pip install -U pycld3
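
If you are unsure whether your environment falls inside the wheel matrix, a rough check along these lines may help. This is only a sketch: the function name `wheel_likely_available` is hypothetical, and the x86-64 restriction is an assumption based on the file names of the published 0.21 wheels. It does not detect cases such as Alpine Linux, which reports itself as Linux but cannot use manylinux wheels.

```python
import platform
import sys


def wheel_likely_available(version=None, system=None, machine=None):
    """Return True if the interpreter/platform falls inside the wheel matrix.

    Assumption: wheels exist only for x86-64, as in the 0.21 release files.
    """
    version = tuple(version or sys.version_info[:2])
    system = system or platform.system()  # 'Linux', 'Darwin', 'Windows', ...
    machine = (machine or platform.machine()).lower()
    return (
        (3, 5) <= version <= (3, 9)
        and system in ("Linux", "Darwin")
        and machine in ("x86_64", "amd64")
    )


if __name__ == "__main__":
    # With no arguments, the current interpreter and platform are inspected.
    print(wheel_likely_available())
```

If this prints False, expect pip to fall back to the source distribution, in which case the prerequisites in the next section apply.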

Installing from Source: Prerequisites

If you are not on a platform variant that is eligible to use a wheel, you may still be able to use pycld3 via its source distribution (tar.gz), but a bit more work is required to install. Namely, you'll also need:

  • the Protobuf compiler (the protoc executable)
  • the Protobuf development headers and the libprotobuf library
  • a compiler, preferably g++

Please consult the official protobuf repository for information on installing Protobuf. The project contains an Installation README that covers installation on Windows and Unix.

If for whatever reason you are on a Unix host but unable to use the wheels (for instance, if you have an i686 architecture), here is a quick-and-dirty guide to installing.

Debian/Ubuntu

sudo apt-get update -y
sudo apt-get install -y --no-install-recommends \
    g++ \
    protobuf-compiler \
    libprotobuf-dev
python -m pip install -U pycld3

Alpine Linux

Note: Alpine Linux does not support PyPI wheels as of April 2020. The steps below are mandatory on Alpine Linux because you will need to install from the source distribution. If the situation permits, using a Debian distro should be much easier (and faster).

apk --update add g++ protobuf protobuf-dev
python -m pip install -U pycld3

CentOS/RHEL

Install from source, as root/UID 0:

sudo su -
set -ex
pushd /opt
PROTOBUF_VERSION='3.11.4'
yum update -y
yum install -y autoconf automake gcc-c++ glibc-headers gzip libtool make python3-devel zlib-devel
curl -Lo /opt/protobuf.tar.gz \
    "https://github.com/protocolbuffers/protobuf/releases/download/v${PROTOBUF_VERSION}/protobuf-cpp-${PROTOBUF_VERSION}.tar.gz"
tar -xzvf protobuf.tar.gz
rm -f protobuf.tar.gz
pushd "protobuf-${PROTOBUF_VERSION}"
./configure --with-zlib --disable-debug && make && make install && ldconfig --verbose
popd && rm -rf "protobuf-${PROTOBUF_VERSION}" && popd && set +ex

python -m pip install -U pycld3

Note: the steps above are for CentOS 8. For earlier versions, you may need to replace:

  • gcc-c++ with g++
  • python3-devel with python-devel

macOS/Homebrew

brew update
brew upgrade protobuf || brew install -v protobuf
python -m pip install -U pycld3

Windows

Please consult Protobuf's C++ Installation - Windows section for help with installing Protobuf on Windows.

If you would like to help contribute Windows wheels (preferably as a job within the project's CI/CD pipelines), please file an issue.

Usage

cld3 exports two module-level functions, get_language() and get_frequent_languages():

>>> import cld3

>>> cld3.get_language("影響包含對氣候的變化以及自然資源的枯竭程度")
LanguagePrediction(language='zh', probability=0.999969482421875, is_reliable=True, proportion=1.0)

>>> cld3.get_language("This is a test")
LanguagePrediction(language='en', probability=0.9999980926513672, is_reliable=True, proportion=1.0)

>>> for lang in cld3.get_frequent_languages(
...     "This piece of text is in English. Този текст е на Български.",
...     num_langs=3
... ):
...     print(lang)
...
LanguagePrediction(language='bg', probability=0.9173890948295593, is_reliable=True, proportion=0.5853658318519592)
LanguagePrediction(language='en', probability=0.9999790191650391, is_reliable=True, proportion=0.4146341383457184)
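
In practice you may want to keep only the predictions that are flagged reliable and exceed some probability threshold. The sketch below shows one way to do that. It uses a local namedtuple as a stand-in with the same fields as the results shown above, so that it runs without cld3 installed; the helper name `reliable_languages` and the 0.9 default threshold are assumptions made for illustration.

```python
from collections import namedtuple

# Stand-in with the same fields as the LanguagePrediction results shown above,
# so that this sketch runs without cld3 installed.
Prediction = namedtuple("Prediction", "language probability is_reliable proportion")


def reliable_languages(predictions, min_probability=0.9):
    """Return language codes of predictions that are reliable and confident enough."""
    return [
        p.language
        for p in predictions
        if p.is_reliable and p.probability >= min_probability
    ]


sample = [
    Prediction("bg", 0.9173890948295593, True, 0.5853658318519592),
    Prediction("en", 0.9999790191650391, True, 0.4146341383457184),
]
print(reliable_languages(sample))        # ['bg', 'en']
print(reliable_languages(sample, 0.95))  # ['en']
```

Since the helper only reads attributes, it should accept the result of cld3.get_frequent_languages() directly when cld3 is installed.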

FAQ

cld3 incorrectly detects my input. How can I fix this?

A first resort is to preprocess (clean) your input text based on conditions specific to your program.

A salient example is to remove URLs and email addresses from the input. CLD3 (unlike CLD2) does almost none of this cleaning for you, in the spirit of not penalizing other users with overhead that they may not need.

Here's such an example using a simplified URL regex from Regular Expressions Cookbook, 2nd ed.:

>>> import re
>>> import cld3

# cld3 does not ignore the URL components by default
>>> s = "Je veux que: https://site.english.com/this/is/a/url/path/component#fragment"
>>> cld3.get_language(s)
LanguagePrediction(language='en', probability=0.5319557189941406, is_reliable=False, proportion=1.0)

>>> url_re = r"\b(?:https?://|www\.)[a-z0-9-]+(\.[a-z0-9-]+)+(?:[/?].*)?"
>>> new_s = re.sub(url_re, "", s)
>>> new_s
'Je veux que: '
>>> cld3.get_language(new_s)
LanguagePrediction(language='fr', probability=0.9799421429634094, is_reliable=True, proportion=1.0)

Note: This URL regex aims for simplicity. It requires a domain name, and doesn't allow a username or password; it allows the scheme (http or https) to be omitted if it can be inferred from the subdomain (www). Source: Regular Expressions Cookbook, 2nd ed. - Goyvaerts & Levithan.
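
The cleaning step can be wrapped in a small reusable function. The version below is a sketch under a few assumptions: it adapts the URL pattern above to use `\S*` rather than `.*` so that text following a URL is kept, it matches case-insensitively, and it adds a deliberately loose email pattern that is illustrative rather than RFC-compliant. The names `URL_RE`, `EMAIL_RE`, and `clean_text` are hypothetical.

```python
import re

# URL pattern adapted from the one above; `\S*` stops at whitespace so that
# any text after the URL survives.
URL_RE = re.compile(
    r"\b(?:https?://|www\.)[a-z0-9-]+(?:\.[a-z0-9-]+)+(?:[/?]\S*)?",
    re.IGNORECASE,
)
# Loose email pattern, for illustration only.
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")


def clean_text(text):
    """Remove email addresses and URLs, then collapse runs of whitespace."""
    text = EMAIL_RE.sub(" ", text)
    text = URL_RE.sub(" ", text)
    return " ".join(text.split())


print(clean_text("Je veux que: https://site.english.com/this/is/a/url/path/component#fragment merci"))
# Je veux que: merci
print(clean_text("Contact me at jane.doe@example.com please"))
# Contact me at please
```

The cleaned string can then be passed to cld3.get_language() as in the session above.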

In some other cases, you cannot fix the incorrect detection. Language detection algorithms in general may perform poorly with very short inputs. Rarely should you trust the output of something like detect("hi"). Keep this limitation in mind regardless of what library you are using.
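
One way to act on this limitation is to skip detection entirely when the input is too short. The guard below is a minimal sketch: the name `detect_or_none` and the 20-character threshold are arbitrary assumptions, and the detector is passed in as a callable so the example runs without cld3 installed.

```python
def detect_or_none(text, detector, min_chars=20):
    """Call detector(text) only if text has at least min_chars non-space characters."""
    meaningful = "".join(text.split())
    if len(meaningful) < min_chars:
        return None
    return detector(text)


# A fake detector keeps the example runnable without cld3 installed.
def fake_detector(text):
    return "detected"


print(detect_or_none("hi", fake_detector))  # None
print(detect_or_none("This sentence is long enough to try.", fake_detector))  # detected
```

With cld3 installed, you would pass cld3.get_language as the detector. The right threshold depends on your data and is worth tuning.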

Please remember that this project is just a Python wrapper around the CLD3 C++ library, which does the actual heavy lifting.

I'm seeing an error during pip installation. How can I fix this?

First, please make sure that you have read the installation section and that you have installed Protobuf if necessary.

If that doesn't help, please file an issue in this repository. The build process for this project is somewhat complex because it involves both Cython and Protobuf, but I do my best to make it work everywhere possible.

Protobuf is installed, but I'm still seeing "cannot open shared object file"

You may have installed Protobuf and still see an error such as:

ImportError: libprotobuf.so.22: cannot open shared object file: No such file or directory

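This likely means that the dynamic linker cannot find the libprotobuf shared object when Python imports cld3, possibly because ldconfig was not run or did not pick up the install directory. You may need to tell it where to look.

You can find where the Protobuf libraries sit by searching for libprotoc.so, which is normally installed in the same directory as libprotobuf: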

$ find /usr -name 'libprotoc.so' \( -type l -o -type f \)
/usr/local/lib/libprotoc.so

Then, you can add the directory containing this file to LD_LIBRARY_PATH:

export LD_LIBRARY_PATH="$(dirname $(find /usr -name 'libprotoc.so' \( -type l -o -type f \))):$LD_LIBRARY_PATH"

You can quickly test that this worked:

$ python -c 'import cld3; print(cld3.get_language("影響包含對氣候的變化以及自然資源的枯竭程度"))'
LanguagePrediction(language='zh', probability=0.999969482421875, is_reliable=True, proportion=1.0)
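
For a slightly more systematic check, the standard library can report whether the platform's library lookup finds libprotobuf at all. The sketch below pairs that with a small parser for the error message shown earlier. The function names are hypothetical, and `ctypes.util.find_library` uses its own search strategy, so treat its answer as a hint rather than a guarantee about what the import will do.

```python
import ctypes.util
import re


def missing_library(message):
    """Extract the library name from a 'cannot open shared object file' message."""
    match = re.match(r"(\S+): cannot open shared object file", message)
    return match.group(1) if match else None


def protobuf_visible_to_loader():
    """Return True if the platform's library lookup can locate libprotobuf."""
    return ctypes.util.find_library("protobuf") is not None


error_text = "libprotobuf.so.22: cannot open shared object file: No such file or directory"
print(missing_library(error_text))  # libprotobuf.so.22
print(protobuf_visible_to_loader())
```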

Authors

This repository contains a fork of google/cld3 at commit 06f695f. The license for google/cld3 can be found at LICENSES/CLD3_LICENSE.

This repository is a combination of changes introduced by various forks of google/cld3 by the following people:
