A drop-in replacement for the language identification tool langid.py, written in C

langid.pyc

A modified version of langid.c with Python bindings: a drop-in replacement for langid.py that offers the same features but runs about 200 times faster.

Installation

pip install langid-pyc

Usage

Basic

from langid_pyc import (
    classify,
    rank,
)

classify("This is English text")
# ('en', 0.9999999239251556)

rank("This is English text")
# [('en', 0.9999999239251556),
#  ('la', 5.0319768731501096e-08),
#  ('br', 1.2684715402216825e-08),
#  ...]
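
The probability returned by classify is often used as a confidence cut-off. The sketch below shows one possible way to do that. The helper names `classify_or_none` and `top_margin` and the 0.9 threshold are hypothetical and not part of langid_pyc; the classifier is passed in as a callable, so the helpers only assume the `(language, probability)` output shape shown above.

```python
from typing import Callable, List, Optional, Tuple

# Shape of a classifier such as langid_pyc.classify: text -> (language, probability).
Classifier = Callable[[str], Tuple[str, float]]


def classify_or_none(
    text: str,
    classify: Classifier,
    min_prob: float = 0.9,  # hypothetical cut-off, tune it for your data
) -> Optional[str]:
    """Return the predicted language code, or None if the probability is too low."""
    lang, prob = classify(text)
    return lang if prob >= min_prob else None


def top_margin(ranking: List[Tuple[str, float]]) -> float:
    """Gap between the best and second-best probability in rank()-style output.

    Expects (language, probability) pairs sorted by descending probability.
    """
    best = ranking[0][1] if ranking else 0.0
    runner_up = ranking[1][1] if len(ranking) > 1 else 0.0
    return best - runner_up
```

With the package installed, passing `classify` from `langid_pyc` as the second argument should give `'en'` for the English sentence above, since its reported probability is well over 0.9. A small `top_margin` value is a hint that the two leading candidates are hard to tell apart.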

Language set constraint

from langid_pyc import (
    classify,
    nb_classes,
    set_languages,
)

nb_classes()
# ['af',
#  'am',
#  'an',
#  ...]

len(nb_classes())
# 97

set_languages(["en", "ru"])
nb_classes()
# ['en', 'ru']

classify("This is English text")
# ('en', 1.0)

classify("А это текст на русском")
# ('ru', 1.0)

set_languages()  # reset languages
len(nb_classes())
# 97
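
Because `set_languages` changes module-level state, it is easy to forget the reset. One possible pattern, sketched here under the assumption that calling `set_languages()` without arguments restores the full set (as in the example above), is a small context manager. `restricted_languages` is a hypothetical helper, not part of langid_pyc, and it takes the setter as an argument rather than importing it.

```python
from contextlib import contextmanager
from typing import Callable, Iterable, Iterator


@contextmanager
def restricted_languages(
    set_languages: Callable[..., None],
    langs: Iterable[str],
) -> Iterator[None]:
    """Temporarily restrict the language set, then reset it on exit."""
    set_languages(list(langs))
    try:
        yield
    finally:
        # No arguments: reset to the full language set, even if the body raised.
        set_languages()
```

Usage would look like `with restricted_languages(set_languages, ["en", "ru"]): ...`, with `classify` calls inside the block. Note that this sketch always resets to the full set on exit; it does not restore an earlier restriction.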

`LanguageIdentifier` class

from langid_pyc import LanguageIdentifier

identifier = LanguageIdentifier.from_modelpath("ldpy3.pmodel")  # default model

len(identifier.nb_classes)
# 97

identifier.classify("This is English text")
# ('en', 0.9999999239251556)

# identifier.rank(...)
# identifier.set_languages(...)
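
As an illustration of working with an identifier instance, the following sketch buckets texts by predicted language. `group_by_language` is a hypothetical helper; it only assumes an object with a `classify(text)` method that returns a `(language, probability)` pair, as `LanguageIdentifier` does in the example above.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def group_by_language(identifier, texts: Iterable[str]) -> Dict[str, List[str]]:
    """Bucket texts by the language code returned by identifier.classify()."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for text in texts:
        lang, _prob = identifier.classify(text)  # probability is ignored here
        groups[lang].append(text)
    return dict(groups)
```

Passing the identifier explicitly keeps the helper usable with several identifiers at once, for example one per custom model.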

How to build?

Install relevant protobuf packages

apt install protobuf-c-compiler libprotobuf-c-dev

Install dev python requirements

pip install -r requirements.txt

Run build

make build

See Makefile for more details.

How to add a new model?

Train a new model using the langid.py package. You will get the model file as described here:

# output the model (Python 2 syntax: cPickle, print statement)
output_path = os.path.join(model_dir, 'your_new_model.model')
model = nb_ptc, nb_pc, nb_classes, tk_nextmove, tk_output
string = base64.b64encode(bz2.compress(cPickle.dumps(model)))
with open(output_path, 'w') as f:
    f.write(string)
print "wrote model to %s (%d bytes)" % (output_path, len(string))
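
The snippet above uses Python 2 syntax. As a hedged Python 3 sketch of the same three encoding layers (pickle, then bz2, then base64), the round trip could look like this. `dump_model` and `load_model` are hypothetical names, and a file pickled by Python 3 is not guaranteed to be accepted by tooling that expects langid.py's Python 2 output; the block only illustrates the file layout.

```python
import base64
import bz2
import pickle


def dump_model(model, path):
    """Pickle, bz2-compress and base64-encode `model`, then write it to `path`.

    Returns the number of bytes written.
    """
    data = base64.b64encode(bz2.compress(pickle.dumps(model)))
    with open(path, "wb") as f:  # b64encode returns bytes in Python 3
        f.write(data)
    return len(data)


def load_model(path):
    """Inverse of dump_model: base64-decode, decompress and unpickle.

    Only unpickle files you trust; pickle can execute arbitrary code.
    """
    with open(path, "rb") as f:
        return pickle.loads(bz2.decompress(base64.b64decode(f.read())))
```

Here `model` would be the five-element tuple from the snippet above; `load_model` can be used to inspect a model file before converting it.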

Move your_new_model.model to the models directory and run

make your_new_model.model

You now have a your_new_model.pmodel file in the root directory, which can be fed to LanguageIdentifier.from_modelpath

from langid_pyc import LanguageIdentifier

your_new_identifier = LanguageIdentifier.from_modelpath("your_new_model.pmodel")

Benchmark

The benchmark was run on a Mac with an M2 Max and 32 GB of RAM under Python 3.8.18 and can be found here.

TL;DR: langid.pyc is ~200x faster than langid.py and ~1-1.5x faster than pycld2, especially on long texts.
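
To check figures like these on your own texts, a minimal timing harness might look like the sketch below. It is not the benchmark referred to above; `time_classifier` and `speedup` are hypothetical helpers, and the classifier is any callable that takes a string, such as the `classify` function of langid_pyc, langid.py, or a pycld2 wrapper.

```python
import time
from typing import Callable, Iterable


def time_classifier(
    classify: Callable[[str], object],
    texts: Iterable[str],
    repeats: int = 3,
) -> float:
    """Return the best wall-clock time in seconds over `repeats` full passes."""
    texts = list(texts)  # materialise once so every pass sees the same data
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for text in texts:
            classify(text)
        best = min(best, time.perf_counter() - start)
    return best


def speedup(baseline_seconds: float, candidate_seconds: float) -> float:
    """How many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds
```

Taking the best of several passes reduces noise from warm-up and other processes. Timing short and long texts separately helps, since the text above notes that the gap depends on text length.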

Original README

`langid.c` readme

Introduction

langid.c is an experimental implementation of the language identifier described by [1] in pure C. It is largely based on the design of langid.py[2], and uses langid.py to train models.

Planned features

See TODO

Speed

Initial comparisons against Google's cld2[3] suggest that langid.c is about twice as fast.

(langid.c) @mlui langid.c git:[master] wc -l wikifiles 
28600 wikifiles
(langid.c) @mlui langid.c git:[master] time cat wikifiles | ./compact_lang_det_batch > xxx
cat wikifiles  0.00s user 0.00s system 0% cpu 7.989 total
./compact_lang_det_batch > xxx  7.77s user 0.60s system 98% cpu 8.479 total
(langid.c) @mlui langid.c git:[master] time cat wikifiles | ./langidOs -b > xxx           
cat wikifiles  0.00s user 0.00s system 0% cpu 3.577 total
./langidOs -b > xxx  3.44s user 0.24s system 97% cpu 3.759 total

(langid.c) @mlui langid.c git:[master] wc -l rcv2files 
20000 rcv2files
(langid.c) @mlui langid.c git:[master] time cat rcv2files | ./langidO2 -b > xxx     
cat rcv2files  0.00s user 0.00s system 0% cpu 31.702 total
./langidO2 -b > xxx  8.23s user 0.54s system 22% cpu 38.644 total
(langid.c) @mlui langid.c git:[master] time cat rcv2files | ./compact_lang_det_batch > xxx 
cat rcv2files  0.00s user 0.00s system 0% cpu 18.343 total
./compact_lang_det_batch > xxx  18.14s user 0.53s system 97% cpu 19.155 total
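
The "about twice as fast" figure can be checked from the transcript above with a little arithmetic. The sketch below copies the user and total times for each run and computes cld2-over-langid.c ratios; the dictionary layout and the `ratios` helper are illustrative only.

```python
# (user seconds, wall-clock "total" seconds) copied from the transcript above.
# wikifiles used the langidOs binary, rcv2files the langidO2 binary.
timings = {
    "wikifiles": {"cld2": (7.77, 8.479), "langid.c": (3.44, 3.759)},
    "rcv2files": {"cld2": (18.14, 19.155), "langid.c": (8.23, 38.644)},
}


def ratios(dataset):
    """Return (user-time ratio, wall-clock ratio) of cld2 over langid.c."""
    cld2_user, cld2_wall = timings[dataset]["cld2"]
    lid_user, lid_wall = timings[dataset]["langid.c"]
    return cld2_user / lid_user, cld2_wall / lid_wall


for name in timings:
    user_ratio, wall_ratio = ratios(name)
    print(f"{name}: user {user_ratio:.2f}x, wall {wall_ratio:.2f}x")
```

In user CPU time langid.c comes out a little over 2x faster on both file lists. In wall-clock time this holds for wikifiles only: the rcv2files run of langidO2 took longer overall at 22% CPU, which suggests that run was mostly waiting on I/O rather than computing.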

Model Training

Google's protocol buffers [4] are used to transfer models between languages. The Python program ldpy2ldc.py can convert a model produced by langid.py [2] into the protocol-buffer format, and also into the C source format used to compile a built-in model directly into the executable.

Dependencies

Protocol buffers [4]
protobuf-c [5]

Contact

Marco Lui saffsd@gmail.com

References

[1] http://aclweb.org/anthology-new/I/I11/I11-1062.pdf
[2] https://github.com/saffsd/langid.py
[3] https://code.google.com/p/cld2/
[4] https://github.com/google/protobuf/
[5] https://github.com/protobuf-c/protobuf-c

Release history

0.1.0 (this version), Apr 11, 2024

Download files

Source Distribution

langid-pyc-0.1.0.tar.gz (4.5 MB), uploaded Apr 11, 2024

Built Distribution

langid_pyc-0.1.0-cp38-cp38-macosx_14_0_arm64.whl (1.7 MB), uploaded Apr 11, 2024, for CPython 3.8 on macOS 14.0+ ARM64

File details

Details for the file langid-pyc-0.1.0.tar.gz.

File metadata

Download URL: langid-pyc-0.1.0.tar.gz
Upload date: Apr 11, 2024
Size: 4.5 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.2 CPython/3.8.18

File hashes

Hashes for langid-pyc-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`5581908cb83cdcc1c7e0eece336d9da7f563bc0a7a4b52d870609e312b45a745`
MD5	`f49f1119dfb6d1b843f68f7423dd6f2b`
BLAKE2b-256	`b7e3c5f36fc463e2eec50c054eb452f943d2e45e068420dc240a359c18c30168`


File details

Details for the file langid_pyc-0.1.0-cp38-cp38-macosx_14_0_arm64.whl.

File metadata

Download URL: langid_pyc-0.1.0-cp38-cp38-macosx_14_0_arm64.whl
Upload date: Apr 11, 2024
Size: 1.7 MB
Tags: CPython 3.8, macOS 14.0+ ARM64
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.2 CPython/3.8.18

File hashes

Hashes for langid_pyc-0.1.0-cp38-cp38-macosx_14_0_arm64.whl
Algorithm	Hash digest
SHA256	`c94aef19dd11b93d0b8140828b351ef8902e977dc8a9c68da8c417a99d2eed6e`
MD5	`2cb6058d0bba37f8bfe448f04c6202a4`
BLAKE2b-256	`a5fcfc2e885e35478281cd8a35fa4c0979907ca6305eaa2e4141c7f03bb1ec0f`
