CLD3 Python bindings

pycld3

Python bindings to the Compact Language Detector v3 (CLD3).

Newer Alternative: gcld3

Note: Since the original publication of pycld3, Google's cld3 authors have published the Python package gcld3, which provides official Python bindings built with pybind. Please check that project out, as it is part of the canonical cld3 repository and will likely stay in better lockstep with any cld3 changes over time.

Overview

This package contains Python bindings (via Cython) to Google's CLD3 library.

>>> import cld3
>>> cld3.get_language("影響包含對氣候的變化以及自然資源的枯竭程度")
LanguagePrediction(language='zh', probability=0.999969482421875, is_reliable=True, proportion=1.0)

The library outputs BCP-47-style language codes. For some languages, output is differentiated by script. Language and script names come from Unicode CLDR. The library supports over 100 languages/scripts. See the full list of supported languages/scripts in Google's CLD3 documentation.
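Because some codes carry a script subtag, downstream code may want to separate the two parts. The following is a minimal sketch, assuming script variants appear as a language subtag followed by a hyphen and a script subtag (for example zh-Latn); the LanguageTag type and split_language_code helper are hypothetical names and not part of pycld3.

```python
from typing import NamedTuple, Optional


class LanguageTag(NamedTuple):
    """Hypothetical container for the two parts of a CLD3-style code."""
    language: str
    script: Optional[str]


def split_language_code(code: str) -> LanguageTag:
    """Split a code such as 'zh-Latn' into language and script subtags.

    Codes without a hyphen (e.g. 'zh') get script=None.
    """
    language, _, script = code.partition("-")
    return LanguageTag(language=language, script=script or None)


print(split_language_code("zh"))
print(split_language_code("zh-Latn"))
```

In practice the input would be the language attribute of a prediction, for example split_language_code(cld3.get_language(text).language).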

Installing with Wheels: Supported Versions and Platforms

This project supports CPython versions 3.6 through 3.9.

We publish wheels for the following matrix:

  • macOS: CPython 3.6 through 3.9
  • Linux (manylinux1): CPython 3.6 through 3.9

The wheels for both macOS and manylinux1 include the external protobuf library, copied into the wheel itself via auditwheel or delocate, so that you won't need to install any extra non-PyPI dependencies.

If you are installing on one of the variants listed above, you should not need to have protoc or libprotobuf installed:

python -m pip install -U pycld3
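To get a rough idea of whether pip will pick a wheel or fall back to the source distribution, the matrix above can be expressed as a small check. This is only a sketch: the helper name wheel_likely_available is hypothetical, the x86_64 requirement is an assumption based on the published wheel file names, and the check does not detect musl-based systems such as Alpine Linux.

```python
import platform
import sys


def wheel_likely_available(implementation, version, system, machine):
    """Return True if the interpreter falls inside the published wheel matrix.

    Assumptions: CPython 3.6 through 3.9, on macOS ("Darwin") or Linux,
    and an x86_64 machine. This does not distinguish glibc from musl.
    """
    return (
        implementation == "CPython"
        and (3, 6) <= tuple(version[:2]) <= (3, 9)
        and system in ("Darwin", "Linux")
        and machine == "x86_64"
    )


# Evaluate the check for the interpreter running this script.
print(
    wheel_likely_available(
        platform.python_implementation(),
        sys.version_info,
        platform.system(),
        platform.machine(),
    )
)
```

If this prints False, the prerequisites for installing from source most likely apply.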

Installing from Source: Prerequisites

If you are not on a platform variant that is eligible to use a wheel, you may still be able to use pycld3 via its source distribution (tar.gz), but a bit more work is required to install. Namely, you'll also need:

  • the Protobuf compiler (the protoc executable)
  • the Protobuf development headers and libprotoc library
  • a compiler, preferably g++

Please consult the official protobuf repository for information on installing Protobuf. The project contains an Installation README that covers installation on Windows and Unix.

If for whatever reason you are on a Unix host but unable to use the wheels (for instance, if you have an i686 architecture), here is a quick-and-dirty guide to installing.

Debian/Ubuntu

sudo apt-get update -y
sudo apt-get install -y --no-install-recommends \
    g++ \
    protobuf-compiler \
    libprotobuf-dev
python -m pip install -U pycld3

Alpine Linux

Note: Alpine Linux does not support PyPI wheels as of April 2020. The steps below are mandatory on Alpine Linux because you will need to install from the source distribution. If the situation permits, using a Debian distro should be much easier (and faster).

apk --update add g++ protobuf protobuf-dev
python -m pip install -U pycld3

CentOS/RHEL

Install from source, as root/UID 0:

sudo su -
set -ex
pushd /opt
PROTOBUF_VERSION='3.11.4'
yum update -y
yum install -y autoconf automake gcc-c++ glibc-headers gzip libtool make python3-devel zlib-devel
curl -Lo /opt/protobuf.tar.gz \
    "https://github.com/protocolbuffers/protobuf/releases/download/v${PROTOBUF_VERSION}/protobuf-cpp-${PROTOBUF_VERSION}.tar.gz"
tar -xzvf protobuf.tar.gz
rm -f protobuf.tar.gz
pushd "protobuf-${PROTOBUF_VERSION}"
./configure --with-zlib --disable-debug && make && make install && ldconfig --verbose
popd && rm -rf "protobuf-${PROTOBUF_VERSION}" && popd && set +ex

python -m pip install -U pycld3

Note: the steps above are for CentOS 8. For earlier versions, you may need to replace:

  • gcc-c++ with g++
  • python3-devel with python-devel

macOS/Homebrew

brew update
brew upgrade protobuf || brew install -v protobuf
python -m pip install -U pycld3

Windows

Please consult Protobuf's C++ Installation - Windows section for help with installing Protobuf on Windows.

If you would like to help contribute Windows wheels (preferably as a job within the project's CI/CD pipelines), please file an issue.

Usage

cld3 exports two module-level functions, get_language() and get_frequent_languages():

>>> import cld3

>>> cld3.get_language("影響包含對氣候的變化以及自然資源的枯竭程度")
LanguagePrediction(language='zh', probability=0.999969482421875, is_reliable=True, proportion=1.0)

>>> cld3.get_language("This is a test")
LanguagePrediction(language='en', probability=0.9999980926513672, is_reliable=True, proportion=1.0)

>>> for lang in cld3.get_frequent_languages(
...     "This piece of text is in English. Този текст е на Български.",
...     num_langs=3
... ):
...     print(lang)
...
LanguagePrediction(language='bg', probability=0.9173890948295593, is_reliable=True, proportion=0.5853658318519592)
LanguagePrediction(language='en', probability=0.9999790191650391, is_reliable=True, proportion=0.4146341383457184)
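A common follow-up is to keep only the predictions you are willing to trust. The sketch below assumes results expose the language, probability, is_reliable and proportion attributes seen in the output above; the Prediction namedtuple is a stand-in so the example runs without cld3 installed, and both reliable_languages and the 0.9 threshold are illustrative choices, not part of the library.

```python
from collections import namedtuple

# Stand-in with the same field names as the LanguagePrediction results above.
Prediction = namedtuple(
    "Prediction", ["language", "probability", "is_reliable", "proportion"]
)


def reliable_languages(predictions, min_probability=0.9):
    """Keep reliable predictions at or above a probability threshold,
    ordered by the share of the text they cover (largest first)."""
    kept = [
        p for p in predictions
        if p.is_reliable and p.probability >= min_probability
    ]
    return sorted(kept, key=lambda p: p.proportion, reverse=True)


# Values copied from the get_frequent_languages() output above.
sample = [
    Prediction("bg", 0.9173890948295593, True, 0.5853658318519592),
    Prediction("en", 0.9999790191650391, True, 0.4146341383457184),
]

for p in reliable_languages(sample):
    print(f"{p.language}: {p.proportion:.0%} of the text")
```

With cld3 installed, the same function should accept the result of cld3.get_frequent_languages(text, num_langs=3) directly, since it only reads attributes.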

FAQ

cld3 incorrectly detects my input. How can I fix this?

A first resort is to preprocess (clean) your input text based on conditions specific to your program.

A salient example is to remove URLs and email addresses from the input. CLD3 (unlike CLD2) does almost none of this cleaning for you, in the spirit of not penalizing other users with overhead that they may not need.

Here's such an example using a simplified URL regex from Regular Expressions Cookbook, 2nd ed.:

>>> import re
>>> import cld3

# cld3 does not ignore the URL components by default
>>> s = "Je veux que: https://site.english.com/this/is/a/url/path/component#fragment"
>>> cld3.get_language(s)
LanguagePrediction(language='en', probability=0.5319557189941406, is_reliable=False, proportion=1.0)

>>> url_re = r"\b(?:https?://|www\.)[a-z0-9-]+(\.[a-z0-9-]+)+(?:[/?].*)?"
>>> new_s = re.sub(url_re, "", s)
>>> new_s
'Je veux que: '
>>> cld3.get_language(new_s)
LanguagePrediction(language='fr', probability=0.9799421429634094, is_reliable=True, proportion=1.0)

Note: This URL regex aims for simplicity. It requires a domain name, and doesn't allow a username or password; it allows the scheme (http or https) to be omitted if it can be inferred from the subdomain (www). Source: Regular Expressions Cookbook, 2nd ed. - Goyvaerts & Levithan.
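The same idea extends to email addresses. The following is a minimal sketch of a combined cleaning step. The URL pattern is a variant of the one above that stops at whitespace (\S* instead of .*) so that text after a URL is kept, the email pattern is a deliberately loose assumption rather than a validator, and clean_text is a hypothetical helper name.

```python
import re

# Loose email pattern: an assumption for illustration, not a full validator.
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")

# Variant of the URL regex above: the path part stops at whitespace.
URL_RE = re.compile(
    r"\b(?:https?://|www\.)[a-z0-9-]+(?:\.[a-z0-9-]+)+(?:[/?]\S*)?",
    re.IGNORECASE,
)


def clean_text(text):
    """Remove email addresses and URLs, then collapse runs of whitespace."""
    text = EMAIL_RE.sub(" ", text)
    text = URL_RE.sub(" ", text)
    return " ".join(text.split())


print(clean_text(
    "Je veux que: https://site.english.com/this/is/a/url/path/component#fragment"
))
print(clean_text(
    "Écrivez à jane.doe@example.com ou visitez www.example.org/page merci"
))
```

The cleaned string can then be passed to cld3.get_language() in place of the raw input.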

In some other cases, you cannot fix the incorrect detection. Language detection algorithms in general may perform poorly with very short inputs. Rarely should you trust the output of something like detect("hi"). Keep this limitation in mind regardless of what library you are using.
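One way to act on this limitation is to decline to detect very short inputs at all. The sketch below is one possible guard, not something the library provides: detect_if_long_enough is a hypothetical helper, the detector argument is any callable such as cld3.get_language, and the default of 20 letters is an arbitrary assumption that you would tune for your data (scripts such as Chinese carry more information per character).

```python
def detect_if_long_enough(text, detector, min_letters=20):
    """Call detector(text) only if text has at least min_letters letters.

    Digits, punctuation and whitespace are not counted. Returns None
    when the input is considered too short to trust.
    """
    letters = sum(1 for ch in text if ch.isalpha())
    if letters < min_letters:
        return None
    return detector(text)


# Stub detector so the example runs without cld3 installed.
def fake_detector(text):
    return "detector was called"


print(detect_if_long_enough("hi", fake_detector))
print(detect_if_long_enough("This piece of text is in English.", fake_detector))
```

With cld3 installed, the call would look like detect_if_long_enough(text, cld3.get_language).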

Please remember that, at the end of the day, this project is just a Python wrapper around the CLD3 C++ library, which does the actual heavy lifting.

I'm seeing an error during pip installation. How can I fix this?

First, please make sure that you have read the installation section and that you have installed Protobuf if necessary.

If that doesn't help, please file an issue in this repository. The build process for this project is somewhat complex because it involves both Cython and Protobuf, but I do my best to make it work everywhere possible.

Protobuf is installed, but I'm still seeing "cannot open shared object file"

If you've installed Protobuf but still see an error such as:

ImportError: libprotobuf.so.22: cannot open shared object file: No such file or directory

then Python is most likely not finding the libprotobuf shared object, possibly because ldconfig didn't do what it was supposed to. You may need to tell the dynamic loader where to look.

You can find where the library sits via:

$ find /usr -name 'libprotoc.so' \( -type l -o -type f \)
/usr/local/lib/libprotoc.so

Then, you can add the directory containing this file to LD_LIBRARY_PATH:

export LD_LIBRARY_PATH="$(dirname "$(find /usr -name 'libprotoc.so' \( -type l -o -type f \) | head -n 1)"):$LD_LIBRARY_PATH"

You can quickly test that this worked:

$ python -c 'import cld3; print(cld3.get_language("影響包含對氣候的變化以及自然資源的枯竭程度"))'
LanguagePrediction(language='zh', probability=0.999969482421875, is_reliable=True, proportion=1.0)
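If the import still fails, it can help to check from Python whether a protobuf shared library is discoverable at all. This is a rough diagnostic sketch using the standard library's ctypes.util.find_library; protobuf_visible is a hypothetical helper name, and a True result does not guarantee that the specific version pycld3 was built against (for example libprotobuf.so.22) is the one found.

```python
import ctypes.util


def protobuf_visible():
    """Return True if a library named 'protobuf' can be located.

    On Linux, find_library consults the ldconfig cache and, failing that,
    the directories listed in LD_LIBRARY_PATH.
    """
    return ctypes.util.find_library("protobuf") is not None


print("libprotobuf discoverable:", protobuf_visible())
```

Note that LD_LIBRARY_PATH is read by the dynamic loader when the process starts, so it needs to be exported in the shell before launching Python; setting it from inside an already running interpreter does not affect later imports.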

Authors

This repository contains a fork of google/cld3 at commit 06f695f. The license for google/cld3 can be found at LICENSES/CLD3_LICENSE.

This repository is a combination of changes introduced by various forks of google/cld3 by the following people:
