py3langid

Fork of the language identification tool langid.py, featuring a modernized codebase and faster execution times.

py3langid is a fork of the standalone language identification tool langid.py by Marco Lui.

Original license: BSD-2-Clause. Fork license: BSD-3-Clause.

Changes in this fork

Execution speed has been improved and the code base has been optimized for Python 3.6+:

Import: Loading the package (import py3langid) is about 30% faster
Startup: Loading the default classification model is 25-30x faster
Execution: Language detection with langid.classify is 5-6x faster on paragraphs (less on longer texts)

For implementation details see this blog post: How to make language detection with langid.py faster.

For more information and older Python versions see changelog.

Usage

Drop-in replacement

Install the package:
- pip3 install py3langid (or pip where applicable)
Use it:
- with Python: import py3langid as langid
- on the command-line: langid

With Python

Basics:

>>> import py3langid as langid

>>> text = 'This text is in English.'
# identified language and unnormalized (log-probability) score
>>> langid.classify(text)
('en', -56.77429)
# unpack the result tuple in variables
>>> lang, prob = langid.classify(text)
# all potential languages
>>> langid.rank(text)

More options:

>>> from py3langid.langid import LanguageIdentifier, MODEL_FILE

# subset of target languages
>>> identifier = LanguageIdentifier.from_pickled_model(MODEL_FILE)
>>> identifier.set_languages(['de', 'en', 'fr'])
# this won't work well...
>>> identifier.classify('这样不好')
('en', -81.831665)

# normalization of probabilities to an interval between 0 and 1
>>> identifier = LanguageIdentifier.from_pickled_model(MODEL_FILE, norm_probs=True)
>>> identifier.classify('This should be enough text.')
('en', 1.0)
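
With normalized probabilities, a caller may want to reject low-confidence guesses. The following is a minimal sketch, not part of py3langid: `detect_or_none` and its `threshold` default are hypothetical names, and `classify` is assumed to be any callable returning a `(lang, prob)` pair with `prob` between 0 and 1, such as `identifier.classify` from an identifier created with `norm_probs=True`.

```python
def detect_or_none(text, classify, threshold=0.9):
    """Return the predicted language code, or None if confidence is too low.

    `classify` is assumed to return (lang, prob) with prob in [0, 1],
    e.g. the classify method of an identifier built with norm_probs=True.
    The 0.9 default is an arbitrary illustration, not a recommended value.
    """
    lang, prob = classify(text)
    return lang if prob >= threshold else None


# Intended use with py3langid (left as comments so the sketch runs without it):
# from py3langid.langid import LanguageIdentifier, MODEL_FILE
# identifier = LanguageIdentifier.from_pickled_model(MODEL_FILE, norm_probs=True)
# detect_or_none('This should be enough text.', identifier.classify)
```

Passing the classifier in as an argument keeps the helper independent of how the identifier was configured (language subset, model file).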

Note: the NumPy data type for the feature vector has been changed to optimize for speed. If results are inconsistent, try restoring the original setting:

>>> langid.classify(text, datatype='uint32')

On the command-line

# basic usage with probability normalization
$ echo "This should be enough text." | langid -n
('en', 1.0)

# define a subset of target languages
$ echo "This won't be recognized properly." | langid -n -l fr,it,tr
('it', 0.97038305)
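
When the command-line output is consumed by another script, each result line can be parsed back into a pair. This is a sketch under the assumption that every line has the tuple form shown above; `parse_langid_line` is a hypothetical helper, not something the package provides.

```python
import ast


def parse_langid_line(line):
    """Parse one output line such as "('en', 1.0)" into (lang, score).

    Assumes the line is a plain Python tuple literal as in the examples
    above; ast.literal_eval raises ValueError or SyntaxError otherwise.
    """
    lang, score = ast.literal_eval(line.strip())
    return str(lang), float(score)


# Example with a line copied from the transcript above.
result = parse_langid_line("('it', 0.97038305)\n")
```

`ast.literal_eval` only accepts literals, so unexpected output is rejected rather than executed.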

Legacy documentation

The docs below are provided for reference; only some of the functions are currently tested and maintained.

Introduction

langid.py is a standalone Language Identification (LangID) tool.

The design principles are as follows:

Fast
Pre-trained over a large number of languages (currently 97)
Not sensitive to domain-specific features (e.g. HTML/XML markup)
Single .py file with minimal dependencies
Deployable as a web service

All that is required to run langid.py is Python >= 3.6 and numpy.

The accompanying training tools are still Python2-only.

langid.py is WSGI-compliant. langid.py will use fapws3 as a web server if available, and default to wsgiref.simple_server otherwise.

langid.py comes pre-trained on 97 languages (ISO 639-1 codes given):

af, am, an, ar, as, az, be, bg, bn, br, bs, ca, cs, cy, da, de, dz, el, en, eo, es, et, eu, fa, fi, fo, fr, ga, gl, gu, he, hi, hr, ht, hu, hy, id, is, it, ja, jv, ka, kk, km, kn, ko, ku, ky, la, lb, lo, lt, lv, mg, mk, ml, mn, mr, ms, mt, nb, ne, nl, nn, no, oc, or, pa, pl, ps, pt, qu, ro, ru, rw, se, si, sk, sl, sq, sr, sv, sw, ta, te, th, tl, tr, ug, uk, ur, vi, vo, wa, xh, zh, zu
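
Since only these codes are available, it can help to validate a user-supplied language list before passing it to the -l flag or to set_languages. The sketch below copies the 97 codes from the list above; `SUPPORTED_LANGS` and `validate_langs` are hypothetical names introduced for illustration.

```python
# The 97 ISO 639-1 codes listed above, copied verbatim.
SUPPORTED_LANGS = frozenset("""
af am an ar as az be bg bn br bs ca cs cy da de dz el en eo es et eu fa fi
fo fr ga gl gu he hi hr ht hu hy id is it ja jv ka kk km kn ko ku ky la lb
lo lt lv mg mk ml mn mr ms mt nb ne nl nn no oc or pa pl ps pt qu ro ru rw
se si sk sl sq sr sv sw ta te th tl tr ug uk ur vi vo wa xh zh zu
""".split())


def validate_langs(spec):
    """Split a comma-separated spec such as 'en,de' and reject unknown codes."""
    codes = [code.strip() for code in spec.split(",") if code.strip()]
    unknown = [code for code in codes if code not in SUPPORTED_LANGS]
    if unknown:
        raise ValueError(f"unsupported language codes: {', '.join(unknown)}")
    return codes
```

The returned list can then be handed to set_languages, or joined with commas for the command line.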

The training data was drawn from 5 different sources:

JRC-Acquis
ClueWeb 09
Wikipedia
Reuters RCV2
Debian i18n

Usage

langid [options]

optional arguments:

-h, --help: show this help message and exit
-s, --serve: launch web service
--host=HOST: host/ip to bind to
--port=PORT: port to listen on
-v: increase verbosity (repeat for greater effect)
-m MODEL: load model from file
-l LANGS, --langs=LANGS: comma-separated set of target ISO639 language codes (e.g. en,de)
-r, --remote: auto-detect IP address for remote access
-b, --batch: specify a list of files on the command line
-d, --dist: show full distribution over languages
-u URL, --url=URL: langid of URL
--line: process pipes line-by-line rather than as a document
-n, --normalize: normalize confidence scores to probability values

The simplest way to use langid.py is as a command-line tool, invoked with python langid.py. If you installed langid.py as a Python module (e.g. via pip install langid), you can invoke langid instead of python langid.py (the two are equivalent). A prompt is then displayed. Enter the text to identify and press Enter:

>>> This is a test
('en', -54.41310358047485)
>>> Questa e una prova
('it', -35.41771221160889)

langid.py can also detect when the input is redirected (only tested under Linux), and in this case will process until EOF rather than until newline as in interactive mode:

python langid.py < README.rst
('en', -22552.496054649353)

The value returned is the unnormalized probability estimate for the language. Calculating the exact probability estimate is disabled by default, but can be enabled through a flag:

python langid.py -n < README.rst
('en', 1.0)

More details are provided in this README in the section on Probability Normalization.

You can also use langid.py as a Python library:

# python
Python 2.7.2+ (default, Oct  4 2011, 20:06:09)
[GCC 4.6.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import langid
>>> langid.classify("This is a test")
('en', -54.41310358047485)

Finally, langid.py can use Python’s built-in wsgiref.simple_server (or fapws3 if available) to provide language identification as a web service. To do this, launch python langid.py -s, and access http://localhost:9008/detect . The web service supports GET, POST and PUT. If GET is performed with no data, a simple HTML forms interface is displayed.

The response is generated in JSON. Here is an example:

{"responseData": {"confidence": -54.41310358047485, "language": "en"}, "responseDetails": null, "responseStatus": 200}
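
A client can unpack this response with the standard json module. The sketch below relies only on the field names visible in the example; `parse_detect_response` is a hypothetical helper, and treating any status other than 200 as an error is an assumption.

```python
import json


def parse_detect_response(payload):
    """Extract (language, confidence) from a /detect JSON response string."""
    data = json.loads(payload)
    # Assumption: anything other than status 200 signals a failed request.
    if data.get("responseStatus") != 200:
        raise ValueError(f"langid service error: {data.get('responseDetails')}")
    result = data["responseData"]
    return result["language"], result["confidence"]


# The example response shown above.
sample = (
    '{"responseData": {"confidence": -54.41310358047485, "language": "en"}, '
    '"responseDetails": null, "responseStatus": 200}'
)
lang, confidence = parse_detect_response(sample)
```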

A utility such as curl can be used to access the web service:

# curl -d "q=This is a test" localhost:9008/detect
{"responseData": {"confidence": -54.41310358047485, "language": "en"}, "responseDetails": null, "responseStatus": 200}

You can also use HTTP PUT:

# curl -T readme.rst localhost:9008/detect
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                               Dload  Upload   Total   Spent    Left  Speed
100  2871  100   119  100  2752    117   2723  0:00:01  0:00:01 --:--:--  2727
{"responseData": {"confidence": -22552.496054649353, "language": "en"}, "responseDetails": null, "responseStatus": 200}

If no “q=XXX” key-value pair is present in the HTTP POST payload, langid.py will interpret the entire file as a single query. This allows for redirection via curl:

# echo "This is a test" | curl -d @- localhost:9008/detect
{"responseData": {"confidence": -54.41310358047485, "language": "en"}, "responseDetails": null, "responseStatus": 200}

By default, langid.py attempts to discover the host IP address automatically. Often, this resolves to localhost (127.0.1.1), even though the machine has a different external IP address. To have langid.py attempt to discover the external IP address instead, start it with the -r flag.

langid.py supports constraining of the output language set using the -l flag and a comma-separated list of ISO639-1 language codes (the -n flag enables probability normalization):

# python langid.py -n -l it,fr
>>> Io non parlo italiano
('it', 0.99999999988965627)
>>> Je ne parle pas français
('fr', 1.0)
>>> I don't speak english
('it', 0.92210605672341062)

When using langid.py as a library, the set_languages method can be used to constrain the language set:

# python
Python 2.7.2+ (default, Oct  4 2011, 20:06:09)
[GCC 4.6.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import langid
>>> langid.classify("I do not speak english")
('en', 0.57133487679900674)
>>> langid.set_languages(['de','fr','it'])
>>> langid.classify("I do not speak english")
('it', 0.99999835791478453)
>>> langid.set_languages(['en','it'])
>>> langid.classify("I do not speak english")
('en', 0.99176190378750373)

Batch Mode

langid.py supports batch mode processing, which can be invoked with the -b flag. In this mode, langid.py reads a list of paths to files to classify as arguments. If no arguments are supplied, langid.py reads the list of paths from stdin. This is useful for combining langid.py with UNIX utilities such as find.

In batch mode, langid.py uses multiprocessing to invoke multiple instances of the classifier, utilizing all available CPUs to classify documents in parallel.
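
The same workflow can be approximated in library use. The following is a rough sketch, not the code behind the -b flag: `iter_paths`, `classify_file` and `classify_paths` are hypothetical helpers, `classify` is assumed to be a callable returning a `(lang, score)` pair, and the default `map_func` runs sequentially.

```python
from functools import partial


def iter_paths(stream):
    """Yield non-empty, stripped paths from a text stream such as sys.stdin."""
    for line in stream:
        path = line.strip()
        if path:
            yield path


def classify_file(path, classify):
    """Read one file as UTF-8 text and return (path, lang, score)."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        lang, score = classify(handle.read())
    return path, lang, score


def classify_paths(paths, classify, map_func=map):
    """Classify every path; map_func decides how the work is distributed."""
    return list(map_func(partial(classify_file, classify=classify), paths))
```

To spread the work over several processes, one option is to pass the `imap` method of a `multiprocessing.Pool` as `map_func`; in that case `classify` has to be picklable, for example a function defined at module level.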

Probability Normalization

The probabilistic model implemented by langid.py involves the multiplication of a large number of probabilities. For computational reasons, the actual calculations are implemented in the log-probability space (a common numerical technique for dealing with vanishingly small probabilities). One side-effect of this is that it is not necessary to compute a full probability in order to determine the most probable language in a set of candidate languages. However, users sometimes find it helpful to have a “confidence” score for the probability prediction. Thus, langid.py implements a re-normalization that produces an output in the 0-1 range.
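
To illustrate the idea, the sketch below turns a set of log-space scores into values between 0 and 1 that sum to 1. It shows the general technique (exponentiate and renormalize after subtracting the maximum) and is not necessarily how langid.py implements it; `normalize_log_probs` and the scores are made up for the example.

```python
import math


def normalize_log_probs(log_probs):
    """Map {lang: log-space score} to {lang: probability}, summing to 1."""
    peak = max(log_probs.values())
    # Subtracting the maximum first keeps exp() from underflowing to zero
    # for every entry when the scores are very negative.
    exps = {lang: math.exp(score - peak) for lang, score in log_probs.items()}
    total = sum(exps.values())
    return {lang: value / total for lang, value in exps.items()}


# Made-up scores for three candidate languages.
scores = {"en": -54.4, "it": -61.2, "fr": -70.9}
probs = normalize_log_probs(scores)

# The most probable language is the same before and after normalization.
best = max(probs, key=probs.get)
```

Because the ranking is unchanged, the normalization step is only needed when a confidence value is wanted, which matches the text above.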

langid.py disables probability normalization by default. For command-line usages of langid.py, it can be enabled by passing the -n flag. For probability normalization in library use, the user must instantiate their own LanguageIdentifier. An example of such usage is as follows:

>>> from py3langid.langid import LanguageIdentifier, MODEL_FILE
>>> identifier = LanguageIdentifier.from_pickled_model(MODEL_FILE, norm_probs=True)
>>> identifier.classify("This is a test")
('en', 0.9999999909903544)

Training a model

Training is so far supported on Python 2.7 only; see the original instructions.

langid.py is based on published research. [1] describes the LD feature selection technique in detail, and [2] provides more detail about the module langid.py itself.

[1] Lui, Marco and Timothy Baldwin (2011) Cross-domain Feature Selection for Language Identification, In Proceedings of the Fifth International Joint Conference on Natural Language Processing (IJCNLP 2011), Chiang Mai, Thailand, pp. 553-561. Available from http://www.aclweb.org/anthology/I11-1062

[2] Lui, Marco and Timothy Baldwin (2012) langid.py: An Off-the-shelf Language Identification Tool, In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (ACL 2012), Demo Session, Jeju, Republic of Korea. Available from http://www.aclweb.org/anthology/P12-3005

Release history

- 0.3.0 (this version): Jun 18, 2024
- 0.2.2: Jun 14, 2022
- 0.2.1: Mar 29, 2022
- 0.2.0: Nov 29, 2021
- 0.1.2: Nov 24, 2021
- 0.1.1: Nov 24, 2021
- 0.1.0: Nov 23, 2021
