CLD3 Python bindings
pycld3
Python bindings to the Compact Language Detector v3 (CLD3).
This package contains Python bindings (via Cython) to Google's CLD3 library.
Installation
Note: The PyPI package contains one platform wheel, for Mac OS X 10.14 / CPython 3.7. If this describes your platform and Python version, you can skip this section and simply pip install pycld3. It's on my to-do list to add wheels for other platforms/versions soon.
This package requires a bit more than a one-line pip install to get up and running. You'll also need the Protobuf compiler (the protoc executable), as well as the Protobuf development headers. Follow along below; I promise this will be painless:
Ubuntu Linux: protobuf-compiler installs protoc, while libprotobuf-dev contains the Protobuf development headers and static libraries.
sudo apt-get update
sudo apt-get install protobuf-compiler libprotobuf-dev
Alpine Linux: If you do Docker multi-stage builds, protobuf-dev is needed at compile time. The final stage meant for runtime needs only protobuf.
In build stage (compile time):
apk --update add protobuf protobuf-dev
In final stage (for runtime):
apk --update add protobuf
RHEL: Install from source.
# -L follows the redirect that GitHub issues for release downloads
curl -sL -o protobuf-all-3.10.0.tar.gz \
https://github.com/protocolbuffers/protobuf/releases/download/v3.10.0/protobuf-all-3.10.0.tar.gz
tar -xzf protobuf-all-3.10.0.tar.gz && rm -f protobuf-all-3.10.0.tar.gz
# the archive unpacks into protobuf-3.10.0/
cd protobuf-3.10.0
./configure && make && make install
Mac OS X: brew install protobuf will handle both installing protoc and placing the header files where they need to be (typically at /usr/local/Cellar/protobuf/x.y.z/include/).
brew update && brew install protobuf
Above are some quick install commands, but please consult the official protobuf repository for information on installing Protobuf.
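Before moving on, it can help to confirm that protoc is actually visible on your PATH. The check below is a minimal sketch using only the Python standard library. The helper name protoc_version is made up for this example and is not part of pycld3. It only checks the compiler, not the development headers.

```python
import shutil
import subprocess


def protoc_version():
    """Return the output of `protoc --version`, or None if protoc is not on PATH.

    Hypothetical helper for illustration; not part of pycld3.
    """
    protoc = shutil.which("protoc")
    if protoc is None:
        return None
    result = subprocess.run(
        [protoc, "--version"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
        check=False,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    version = protoc_version()
    if version is None:
        print("protoc not found; install Protobuf first")
    else:
        print("found", version)
```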
Okay, now you're ready for the easy part. Install via pip:
python -m pip install pycld3
Usage
cld3 exports two module-level functions, get_language() and get_frequent_languages():
>>> import cld3
>>> cld3.get_language("影響包含對氣候的變化以及自然資源的枯竭程度")
LanguagePrediction(language='zh', probability=0.999969482421875, is_reliable=True, proportion=1.0)
>>> cld3.get_language("This is a test")
LanguagePrediction(language='en', probability=0.9999980926513672, is_reliable=True, proportion=1.0)
>>> for lang in cld3.get_frequent_languages(
...     "This piece of text is in English. Този текст е на Български.",
...     num_langs=3
... ):
...     print(lang)
...
LanguagePrediction(language='bg', probability=0.9173890948295593, is_reliable=True, proportion=0.5853658318519592)
LanguagePrediction(language='en', probability=0.9999790191650391, is_reliable=True, proportion=0.4146341383457184)
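Each result exposes the four fields shown above: language, probability, is_reliable, and proportion. One way to act on them is to accept a prediction only when it is flagged reliable and clears a probability threshold. The sketch below assumes that pattern. The name accept_prediction and the 0.9 threshold are illustrative choices, not part of cld3, and the namedtuple is a stand-in for the real result so the example runs without the extension installed.

```python
from collections import namedtuple

# Stand-in with the same field names as the results printed above (assumption:
# only attribute access is needed, so any object with these fields works).
Prediction = namedtuple("Prediction", "language probability is_reliable proportion")


def accept_prediction(prediction, min_probability=0.9):
    """Return the language code if the prediction looks trustworthy, else None.

    Hypothetical helper; the threshold is arbitrary and should be tuned.
    """
    if prediction is None:  # defensive: treat "no prediction" as "unknown"
        return None
    if not prediction.is_reliable or prediction.probability < min_probability:
        return None
    return prediction.language


confident = Prediction("en", 0.9999980926513672, True, 1.0)
shaky = Prediction("en", 0.5319557189941406, False, 1.0)
print(accept_prediction(confident))  # en
print(accept_prediction(shaky))      # None
```

With the real library installed, the argument would be the value returned by cld3.get_language(text).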
FAQ
cld3 incorrectly detects my input. How can I fix this?
A first resort is to preprocess (clean) your input text based on conditions specific to your program.
A salient example is to remove URLs and email addresses from the input. CLD3 (unlike CLD2) does almost none of this cleaning for you, in the spirit of not penalizing other users with overhead that they may not need.
Here's such an example using a simplified URL regex from Regular Expressions Cookbook, 2nd ed.:
>>> import re
>>> import cld3
>>> # cld3 does not ignore the URL components by default
>>> s = "Je veux que: https://site.english.com/this/is/a/url/path/component#fragment"
>>> cld3.get_language(s)
LanguagePrediction(language='en', probability=0.5319557189941406, is_reliable=False, proportion=1.0)
>>> url_re = r"\b(?:https?://|www\.)[a-z0-9-]+(\.[a-z0-9-]+)+(?:[/?].*)?"
>>> new_s = re.sub(url_re, "", s)
>>> new_s
'Je veux que: '
>>> cld3.get_language(new_s)
LanguagePrediction(language='fr', probability=0.9799421429634094, is_reliable=True, proportion=1.0)
Note: This URL regex aims for simplicity. It requires a domain name, and doesn't allow a username or password; it allows the scheme (http or https) to be omitted if it can be inferred from the subdomain (www). Source: Regular Expressions Cookbook, 2nd ed. - Goyvaerts & Levithan.
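The cleaning step can be wrapped in a small reusable function that strips both URLs and email addresses. This is a sketch under stated assumptions, not something cld3 provides: clean_text is a hypothetical name, the email pattern is a deliberately loose one chosen for this example, and the URL pattern is a variant of the one above that stops at the next whitespace (instead of consuming the rest of the line) and ignores case.

```python
import re

# Variant of the URL regex above: the path part ends at the next whitespace,
# so text following a URL survives. This change is an assumption of the sketch.
URL_RE = re.compile(
    r"\b(?:https?://|www\.)[a-z0-9-]+(?:\.[a-z0-9-]+)+(?:[/?#]\S*)?",
    re.IGNORECASE,
)

# Loose email pattern for illustration only; it is not a full address validator.
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")


def clean_text(text):
    """Remove email addresses and URLs, then collapse runs of whitespace."""
    text = EMAIL_RE.sub(" ", text)
    text = URL_RE.sub(" ", text)
    return " ".join(text.split())


s = "Je veux que: https://site.english.com/this/is/a/url/path/component#fragment"
print(clean_text(s))  # Je veux que:
```

The cleaned string can then be passed to cld3.get_language() in place of the raw input.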
In some other cases, you cannot fix the incorrect detection.
Language detection algorithms in general may perform poorly with very short inputs.
Rarely should you trust the output of something like detect("hi"). Keep this limitation in mind regardless of what library you are using.
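One pragmatic guard is to skip detection entirely when the input is too short to be meaningful. The sketch below is only an illustration of that idea: long_enough is a hypothetical helper, and the default of 20 non-whitespace characters is an arbitrary cutoff that you would need to tune for your own data and languages.

```python
def long_enough(text, min_chars=20):
    """Return True if text has at least min_chars non-whitespace characters.

    Hypothetical helper; min_chars=20 is an arbitrary starting point.
    """
    return sum(1 for ch in text if not ch.isspace()) >= min_chars


print(long_enough("hi"))                                 # False
print(long_enough("This piece of text is in English."))  # True
```

Inputs that fail the check can be treated as "language unknown" rather than being passed to a detector.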
Authors
This repository contains a fork of google/cld3 at commit 06f695f. The license for google/cld3 can be found at LICENSES/CLD3_LICENSE.
This repository is a combination of changes introduced by various forks of google/cld3 by the following people:
- Johannes Baiter (@jbaiter)
- Elizabeth Myers (@Elizafox)
- Witold Bołt (@houp)
- Alfredo Luque (@iamthebot)
- WISESIGHT (@ThothMedia)
- RNogales (@RNogales94)
- Brad Solomon (@bsolomon1124)
File details
Details for the file pycld3-0.16.tar.gz.
File metadata
- Download URL: pycld3-0.16.tar.gz
- Upload date:
- Size: 721.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.1.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.39.0 CPython/3.7.5
File hashes
Algorithm | Hash digest
---|---
SHA256 | 0a19b0632b2fbce3532e24a8872e6d470282b55b4929a9801a9e4cbc3f879662
MD5 | df0e4576539c715844d307a18b985b23
BLAKE2b-256 | dff98a3c0b13e94102aec5fdb7a0b9873dd093987739f83081248ccea02b4038
File details
Details for the file pycld3-0.16-cp37-cp37m-macosx_10_15_x86_64.whl.
File metadata
- Download URL: pycld3-0.16-cp37-cp37m-macosx_10_15_x86_64.whl
- Upload date:
- Size: 516.8 kB
- Tags: CPython 3.7m, macOS 10.15+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.1.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.39.0 CPython/3.7.5
File hashes
Algorithm | Hash digest
---|---
SHA256 | 330e1e828e8315b37bde755f8bcfa64cb85e96028bb7c54c5728d1020ea82b90
MD5 | 658b202e3b09bf683f8c909f8e1397cf
BLAKE2b-256 | e00d00936a0b72d10da52d10aeceb83539fb453d1743c17c52fffd2980a314a7