
A drop-in replacement, written in C, for the language identification tool langid.py


langid.pyc

A modified version of langid.c with Python bindings: a drop-in replacement for langid.py that offers the same features but runs about 200 times faster.

Installation

pip install langid-pyc

Usage

Basic

from langid_pyc import (
    classify,
    rank,
)

classify("This is English text")
# ('en', 0.9999999239251556)

rank("This is English text")
# [('en', 0.9999999239251556),
#  ('la', 5.0319768731501096e-08),
#  ('br', 1.2684715402216825e-08),
#  ...]
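
Because classify returns a (language, probability) pair, a caller can treat low-probability results as undetermined. The following is a minimal sketch, not part of langid-pyc: classify_or_none, its min_prob parameter, and the fake_classify stub are hypothetical names, and the stub stands in for langid_pyc.classify so the snippet runs without the package installed.

```python
from typing import Callable, Optional, Tuple

Classifier = Callable[[str], Tuple[str, float]]


def classify_or_none(classify_fn: Classifier, text: str, min_prob: float = 0.9) -> Optional[str]:
    """Return the predicted language code, or None if its probability is below min_prob."""
    lang, prob = classify_fn(text)
    return lang if prob >= min_prob else None


def fake_classify(text: str) -> Tuple[str, float]:
    # Hypothetical stand-in for langid_pyc.classify: confident on non-empty text only.
    return ("en", 0.9999999239251556) if text else ("en", 0.5)


print(classify_or_none(fake_classify, "This is English text"))  # en
print(classify_or_none(fake_classify, ""))                      # None
```

With the package installed, langid_pyc.classify could be passed as classify_fn in place of the stub; a suitable threshold depends on the data and is left to the caller.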

Language set constraint

from langid_pyc import (
    classify,
    nb_classes,
    set_languages,
)

nb_classes()
# ['af',
#  'am',
#  'an',
#  ...]

len(nb_classes())
# 97

set_languages(["en", "ru"])
nb_classes()
# ['en', 'ru']

classify("This is English text")
# ('en', 1.0)

classify("А это текст на русском")
# ('ru', 1.0)

set_languages() # reset languages
len(nb_classes())
# 97
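
In the example above, the module-level set_languages appears to change state shared by all later classify calls, so forgetting the reset can affect unrelated code. One way to keep the restriction local is a context manager. This is a sketch under that assumption: restricted_languages and the fake_set_languages recorder are hypothetical names, and the recorder stands in for langid_pyc.set_languages so the snippet runs on its own.

```python
from contextlib import contextmanager
from typing import Callable, Iterator, Sequence


@contextmanager
def restricted_languages(set_languages_fn: Callable[..., None], langs: Sequence[str]) -> Iterator[None]:
    """Restrict the language set inside a with-block, then reset it."""
    set_languages_fn(list(langs))
    try:
        yield
    finally:
        # Calling set_languages() with no arguments resets to the full set.
        set_languages_fn()


# Hypothetical recorder standing in for langid_pyc.set_languages.
calls = []


def fake_set_languages(langs=None):
    calls.append(langs)


with restricted_languages(fake_set_languages, ["en", "ru"]):
    pass  # classify(...) calls here would only consider "en" and "ru"

print(calls)  # [['en', 'ru'], None]
```

The try/finally ensures the reset also happens when the body of the with-block raises.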

LanguageIdentifier class

from langid_pyc import LanguageIdentifier

identifier = LanguageIdentifier.from_modelpath("ldpy3.pmodel")  # default model

len(identifier.nb_classes)
# 97

identifier.classify("This is English text")
# ('en', 0.9999999239251556)

# identifier.rank(...)
# identifier.set_languages(...)
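
An identifier object can be created once and reused for many texts. As a rough illustration, the sketch below counts predicted languages over a batch. language_counts and FakeIdentifier are hypothetical names; the fake class only imitates the classify method shown above (it guesses by script) so the snippet runs without the package or a model file.

```python
from collections import Counter
from typing import Iterable


def language_counts(identifier, texts: Iterable[str]) -> Counter:
    """Count predicted languages, reusing a single identifier for every text."""
    return Counter(identifier.classify(text)[0] for text in texts)


class FakeIdentifier:
    # Hypothetical stand-in for LanguageIdentifier; it guesses by script only.
    def classify(self, text):
        has_cyrillic = any("\u0400" <= ch <= "\u04ff" for ch in text)
        return ("ru", 1.0) if has_cyrillic else ("en", 1.0)


texts = ["This is English text", "А это текст на русском", "More English"]
counts = language_counts(FakeIdentifier(), texts)
print(counts.most_common())  # [('en', 2), ('ru', 1)]
```

With the package installed, an object returned by LanguageIdentifier.from_modelpath could be passed in place of FakeIdentifier().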

How to build?

Install relevant protobuf packages

apt install protobuf-c-compiler libprotobuf-c-dev

Install dev python requirements

pip install -r requirements.txt

Run build

make build

See Makefile for more details.

How to add a new model?

Train a new model using the langid.py package. The training code writes the model file as follows:

# output the model (excerpt from the langid.py training code, Python 2 syntax)
output_path = os.path.join(model_dir, 'your_new_model.model')
model = nb_ptc, nb_pc, nb_classes, tk_nextmove, tk_output
string = base64.b64encode(bz2.compress(cPickle.dumps(model)))
with open(output_path, 'w') as f:
    f.write(string)
print "wrote model to %s (%d bytes)" % (output_path, len(string))

Move your_new_model.model to the models directory and run:

make your_new_model.model

You now have a your_new_model.pmodel file in the repository root, which can be fed to LanguageIdentifier.from_modelpath:

from langid_pyc import LanguageIdentifier

your_new_identifier = LanguageIdentifier.from_modelpath("your_new_model.pmodel")
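
The langid.py training excerpt above stores a model as a pickled tuple that is bz2-compressed and then base64-encoded. To inspect such a file before converting it, the encoding can be reversed. The sketch below is a Python 3 illustration of that round trip, not part of langid-pyc: dump_model, load_model, and the dummy values are hypothetical. It is also an assumption that a real model pickled under Python 2 may need pickle.loads(..., encoding="latin1") and NumPy installed.

```python
import base64
import bz2
import pickle


def dump_model(model) -> bytes:
    """Encode a model tuple the way the training excerpt does: pickle, bz2, base64."""
    return base64.b64encode(bz2.compress(pickle.dumps(model)))


def load_model(encoded: bytes):
    """Invert dump_model. Only unpickle data from a source you trust."""
    return pickle.loads(bz2.decompress(base64.b64decode(encoded)))


# Dummy placeholder values; a real model holds the trained parameters.
dummy_model = ([0.1, 0.2], [0.5, 0.5], ["en", "ru"], [0, 1], {0: (0,)})
nb_ptc, nb_pc, nb_classes, tk_nextmove, tk_output = load_model(dump_model(dummy_model))
print(nb_classes)  # ['en', 'ru']
```

This only reads the langid.py format; producing the .pmodel file still goes through the make target described above.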

Benchmark

The benchmark was run on a Mac M2 Max with 32 GB RAM and Python 3.8.18, and can be found here.

TL;DR: langid.pyc is ~200x faster than langid.py and ~1-1.5x faster than pycld2, especially on long texts.
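
To get a rough comparison on your own data, a small timing harness is enough. This is a minimal sketch and not the benchmark referenced above: seconds_per_call and fake_classify are hypothetical names, and the stub should be replaced with the real classify functions of the libraries being compared.

```python
import time
from typing import Callable, Iterable


def seconds_per_call(fn: Callable[[str], object], texts: Iterable[str], repeats: int = 3) -> float:
    """Return the best-of-N average wall-clock time per call of fn over texts."""
    texts = list(texts)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for text in texts:
            fn(text)
        best = min(best, time.perf_counter() - start)
    return best / len(texts)


def fake_classify(text: str):
    # Hypothetical stand-in; replace with the classify function under test.
    return ("en", 1.0)


sample = ["This is English text"] * 1000
print(f"{seconds_per_call(fake_classify, sample):.2e} s per call")
```

Taking the best of several repeats reduces noise from other processes; results will still vary with text length and hardware.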

Original README

langid.c readme

Introduction

langid.c is an experimental implementation of the language identifier described by [1] in pure C. It is largely based on the design of langid.py[2], and uses langid.py to train models.

Planned features

See TODO

Speed

Initial comparisons against Google's cld2[3] suggest that langid.c is about twice as fast.

(langid.c) @mlui langid.c git:[master] wc -l wikifiles 
28600 wikifiles
(langid.c) @mlui langid.c git:[master] time cat wikifiles | ./compact_lang_det_batch > xxx
cat wikifiles  0.00s user 0.00s system 0% cpu 7.989 total
./compact_lang_det_batch > xxx  7.77s user 0.60s system 98% cpu 8.479 total
(langid.c) @mlui langid.c git:[master] time cat wikifiles | ./langidOs -b > xxx           
cat wikifiles  0.00s user 0.00s system 0% cpu 3.577 total
./langidOs -b > xxx  3.44s user 0.24s system 97% cpu 3.759 total

(langid.c) @mlui langid.c git:[master] wc -l rcv2files 
20000 rcv2files
(langid.c) @mlui langid.c git:[master] time cat rcv2files | ./langidO2 -b > xxx     
cat rcv2files  0.00s user 0.00s system 0% cpu 31.702 total
./langidO2 -b > xxx  8.23s user 0.54s system 22% cpu 38.644 total
(langid.c) @mlui langid.c git:[master] time cat rcv2files | ./compact_lang_det_batch > xxx 
cat rcv2files  0.00s user 0.00s system 0% cpu 18.343 total
./compact_lang_det_batch > xxx  18.14s user 0.53s system 97% cpu 19.155 total
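
The "about twice as fast" figure can be checked against the CPU times (user + system) in the transcript above. The short calculation below only re-uses those numbers; cpu_seconds and cpu_ratio are illustrative helper names. Note that in the rcv2files run langid.c has the longer wall-clock total (38.644 s at 22% CPU), which presumably reflects time spent waiting on I/O rather than computing, so CPU time is the fairer comparison there.

```python
# (user, system) CPU seconds copied from the `time` output above.
runs = {
    "wikifiles": {"cld2": (7.77, 0.60), "langid.c": (3.44, 0.24)},
    "rcv2files": {"cld2": (18.14, 0.53), "langid.c": (8.23, 0.54)},
}


def cpu_seconds(times):
    """Total CPU time: user plus system seconds."""
    user, system = times
    return user + system


def cpu_ratio(tools):
    """How many times more CPU time cld2 used than langid.c."""
    return cpu_seconds(tools["cld2"]) / cpu_seconds(tools["langid.c"])


for name, tools in runs.items():
    print(f"{name}: cld2 used {cpu_ratio(tools):.2f}x the CPU time of langid.c")
# wikifiles: cld2 used 2.27x the CPU time of langid.c
# rcv2files: cld2 used 2.13x the CPU time of langid.c
```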

Model Training

Google's protocol buffers [4] are used to transfer models between languages. The Python program ldpy2ldc.py can convert a model produced by langid.py [2] into the protocol-buffer format, and also into the C source format used to compile a built-in model directly into the executable.

Dependencies

Protocol buffers [4]
protobuf-c [5]

Contact

Marco Lui saffsd@gmail.com

References

[1] http://aclweb.org/anthology-new/I/I11/I11-1062.pdf
[2] https://github.com/saffsd/langid.py
[3] https://code.google.com/p/cld2/
[4] https://github.com/google/protobuf/
[5] https://github.com/protobuf-c/protobuf-c

Download files


Source Distribution

langid-pyc-0.1.0.tar.gz (4.5 MB)

Uploaded Source

Built Distribution


langid_pyc-0.1.0-cp38-cp38-macosx_14_0_arm64.whl (1.7 MB)

Uploaded: CPython 3.8, macOS 14.0+ ARM64
