A drop-in replacement for the language identification tool langid.py, written in C
langid.pyc
A modified version of langid.c with Python bindings: a straightforward replacement for langid.py that offers the same features but runs about 200 times faster.
Installation
pip install langid-pyc
Usage
Basic
from langid_pyc import (
    classify,
    rank,
)
classify("This is English text")
# ('en', 0.9999999239251556)
rank("This is English text")
# [('en', 0.9999999239251556),
# ('la', 5.0319768731501096e-08),
# ('br', 1.2684715402216825e-08),
# ...]
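Since classify returns a (language, probability) pair, a caller may want to discard weak predictions. The following is a minimal sketch, not part of langid-pyc: the helper name classify_or_none and the 0.9 threshold are hypothetical, and the classifier is passed in as an argument so the helper works with any function of the same shape.

```python
from typing import Callable, Optional, Tuple

# Any callable shaped like langid_pyc.classify: text -> (language, probability).
Classifier = Callable[[str], Tuple[str, float]]


def classify_or_none(
    text: str,
    classify_fn: Classifier,
    min_confidence: float = 0.9,  # hypothetical threshold, tune it for your data
) -> Optional[str]:
    """Return the predicted language code, or None when the prediction is weak."""
    lang, prob = classify_fn(text)
    return lang if prob >= min_confidence else None
```

With the package installed, this would be called as classify_or_none("This is English text", classify) after from langid_pyc import classify.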
Language set constraint
from langid_pyc import (
    classify,
    nb_classes,
    set_languages,
)
nb_classes()
# ['af',
# 'am',
# 'an',
# ...]
len(nb_classes())
# 97
set_languages(["en", "ru"])
nb_classes()
# ['en', 'ru']
classify("This is English text")
# ('en', 1.0)
classify("А это текст на русском")
# ('ru', 1.0)
set_languages() # reset languages
len(nb_classes())
# 97
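Because set_languages changes module-wide state, it is easy to forget the reset call. One possible way to scope the restriction is a context manager. This is a sketch under assumptions: restricted_languages is a hypothetical helper, not part of langid-pyc, and it relies only on the behavior shown above, where calling set_languages() with no arguments restores the full language set.

```python
from contextlib import contextmanager
from typing import Callable, Iterator, Sequence


@contextmanager
def restricted_languages(
    set_languages_fn: Callable[..., object],
    languages: Sequence[str],
) -> Iterator[None]:
    """Restrict the language set inside a `with` block, then reset it."""
    set_languages_fn(list(languages))
    try:
        yield
    finally:
        # A call with no arguments resets to the full set, as in the example above.
        set_languages_fn()
```

With the package installed, usage would look like: with restricted_languages(set_languages, ["en", "ru"]): classify(text). Note that the reset restores the full set, not any narrower set that was active before the block.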
LanguageIdentifier class
from langid_pyc import LanguageIdentifier
identifier = LanguageIdentifier.from_modelpath("ldpy3.pmodel") # default model
len(identifier.nb_classes)
# 97
identifier.classify("This is English text")
# ('en', 0.9999999239251556)
# identifier.rank(...)
# identifier.set_languages(...)
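An identifier object can be handed to other code instead of relying on the module-level functions. As an illustration, the sketch below counts predicted languages over a collection of texts. The function language_histogram is hypothetical and assumes only that the object has a classify(text) method returning a (language, probability) pair, as shown above.

```python
from collections import Counter
from typing import Iterable


def language_histogram(identifier, texts: Iterable[str]) -> Counter:
    """Count predicted languages over an iterable of texts.

    `identifier` is any object with a classify(text) -> (language, probability)
    method; only the language code of each prediction is used.
    """
    return Counter(identifier.classify(text)[0] for text in texts)
```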
How to build?
Install relevant protobuf packages
apt install protobuf-c-compiler libprotobuf-c-dev
Install dev python requirements
pip install -r requirements.txt
Run build
make build
See Makefile for more details.
How to add a new model?
Train a new model using the langid.py package. You will get a model file, written as described here:
# output the model
output_path = os.path.join(model_dir, 'your_new_model.model')
model = nb_ptc, nb_pc, nb_classes, tk_nextmove, tk_output
string = base64.b64encode(bz2.compress(cPickle.dumps(model)))
with open(output_path, 'w') as f:
    f.write(string)
print "wrote model to %s (%d bytes)" % (output_path, len(string))
Move your_new_model.model to the models directory and run
make your_new_model.model
Now you have a your_new_model.pmodel file in the root, which can be fed to LanguageIdentifier.from_modelpath
from langid_pyc import LanguageIdentifier
your_new_identifier = LanguageIdentifier.from_modelpath("your_new_model.pmodel")
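Before converting a trained model, it can be useful to inspect it, for example to check which languages it covers. The langid.py excerpt above writes the model as a pickled tuple that is bz2-compressed and then base64-encoded, so reading it back means reversing those three steps. The sketch below is a Python 3 round trip on a toy tuple. The function names are hypothetical, and encoding="latin1" is an assumption that is often needed for pickles written by Python 2.

```python
import base64
import bz2
import pickle


def encode_model(model: tuple) -> bytes:
    """Pack a model tuple as in the excerpt above: pickle, bz2, base64."""
    return base64.b64encode(bz2.compress(pickle.dumps(model)))


def decode_model(payload: bytes) -> tuple:
    """Reverse encode_model. Only use on files you trust: unpickling can run code."""
    # encoding="latin1" is an assumption; it helps when the pickle was written
    # by Python 2 and contains byte strings or NumPy arrays.
    return pickle.loads(bz2.decompress(base64.b64decode(payload)), encoding="latin1")


# Toy stand-in for (nb_ptc, nb_pc, nb_classes, tk_nextmove, tk_output).
toy_model = ([0.1, 0.2], [0.5, 0.5], ["en", "ru"], [0, 1], {0: (1,)})
restored = decode_model(encode_model(toy_model))
print(restored[2])  # the class list, third element of the tuple
```

For a real file, the payload would be read in binary mode, for example open("your_new_model.model", "rb").read().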
Benchmark
The benchmark was run on a Mac M2 Max with 32 GB RAM and Python 3.8.18; the results can be found here.
TL;DR: langid.pyc is ~200x faster than langid.py and ~1-1.5x faster than pycld2, especially on long texts.
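To repeat a comparison like this on your own texts, a small timing harness is enough. The following is a sketch, not the benchmark used for the numbers above: time_classifiers is a hypothetical helper that takes a mapping from a label to any callable accepting a string, and reports the best wall-clock time over a few repeats.

```python
import time
from typing import Callable, Dict, Iterable, List


def time_classifiers(
    classifiers: Dict[str, Callable[[str], object]],
    texts: Iterable[str],
    repeats: int = 3,
) -> Dict[str, float]:
    """Return the best wall-clock time in seconds per classifier over all texts."""
    corpus: List[str] = list(texts)  # materialize once so every run sees the same data
    results: Dict[str, float] = {}
    for name, fn in classifiers.items():
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            for text in corpus:
                fn(text)
            best = min(best, time.perf_counter() - start)
        results[name] = best
    return results
```

With both packages installed, the mapping could be something like {"langid.pyc": langid_pyc.classify, "langid.py": langid.classify}. Taking the minimum over repeats reduces noise from other processes on the machine.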
Original README
================
langid.c readme
Introduction
langid.c is an experimental implementation of the language identifier
described by [1] in pure C. It is largely based on the design of
langid.py[2], and uses langid.py to train models.
Planned features
See TODO
Speed
Initial comparisons against Google's cld2[3] suggest that langid.c is about
twice as fast.
(langid.c) @mlui langid.c git:[master] wc -l wikifiles
28600 wikifiles
(langid.c) @mlui langid.c git:[master] time cat wikifiles | ./compact_lang_det_batch > xxx
cat wikifiles 0.00s user 0.00s system 0% cpu 7.989 total
./compact_lang_det_batch > xxx 7.77s user 0.60s system 98% cpu 8.479 total
(langid.c) @mlui langid.c git:[master] time cat wikifiles | ./langidOs -b > xxx
cat wikifiles 0.00s user 0.00s system 0% cpu 3.577 total
./langidOs -b > xxx 3.44s user 0.24s system 97% cpu 3.759 total
(langid.c) @mlui langid.c git:[master] wc -l rcv2files
20000 rcv2files
(langid.c) @mlui langid.c git:[master] time cat rcv2files | ./langidO2 -b > xxx
cat rcv2files 0.00s user 0.00s system 0% cpu 31.702 total
./langidO2 -b > xxx 8.23s user 0.54s system 22% cpu 38.644 total
(langid.c) @mlui langid.c git:[master] time cat rcv2files | ./compact_lang_det_batch > xxx
cat rcv2files 0.00s user 0.00s system 0% cpu 18.343 total
./compact_lang_det_batch > xxx 18.14s user 0.53s system 97% cpu 19.155 total
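The "about twice as fast" figure can be checked against the timings above. The sketch below compares CPU time (user plus system) rather than wall-clock time, because the langidO2 run on rcv2files reports only 22% CPU, which suggests it was waiting on I/O for most of its 38.6 seconds. The numbers are copied from the output above; the dictionary layout and function name are only illustrative.

```python
# (user, system) CPU seconds copied from the `time` output above.
# "langid.c" is ./langidOs for wikifiles and ./langidO2 for rcv2files.
runs = {
    "wikifiles": {"cld2": (7.77, 0.60), "langid.c": (3.44, 0.24)},
    "rcv2files": {"cld2": (18.14, 0.53), "langid.c": (8.23, 0.54)},
}


def cpu_speedup(timings: dict) -> float:
    """Ratio of cld2 CPU time to langid.c CPU time (user + system)."""
    cld2 = sum(timings["cld2"])
    langid_c = sum(timings["langid.c"])
    return cld2 / langid_c


speedups = {name: round(cpu_speedup(t), 2) for name, t in runs.items()}
print(speedups)  # {'wikifiles': 2.27, 'rcv2files': 2.13}
```

Both data sets give a CPU-time ratio a little above 2, which is consistent with the claim.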
Model Training
Google's protocol buffers [4] are used to transfer models between languages. The
Python program ldpy2ldc.py can convert a model produced by langid.py [2] into
the protocol-buffer format, and also into the C source format used to compile a
built-in model directly into the executable.
Dependencies
Protocol buffers [4]
protobuf-c [5]
Contact
Marco Lui saffsd@gmail.com
References
[1] http://aclweb.org/anthology-new/I/I11/I11-1062.pdf
[2] https://github.com/saffsd/langid.py
[3] https://code.google.com/p/cld2/
[4] https://github.com/google/protobuf/
[5] https://github.com/protobuf-c/protobuf-c