Fast and accurate language identifier

heliport

A language identification tool that aims for both speed and accuracy. It is mostly an efficient port of HeLI-OTS to Rust, achieving a 25x speedup while maintaining the same accuracy.

Installation

From PyPI

Install it in your environment

pip install heliport

then download the binarized model

heliport download

From source

Install the requirements:

Clone the repo, build the package and binarize the model

git clone https://github.com/ZJaume/heliport
cd heliport
pip install .
heliport binarize

Usage

CLI

Just run the heliport identify command that reads lines from stdin

cat sentences.txt | heliport identify
eng_latn
cat_latn
rus_cyrl
...

The help output of heliport identify lists the available options:

Identify languages of input text

Usage: heliport identify [OPTIONS] [INPUT_FILE] [OUTPUT_FILE]

Arguments:
  [INPUT_FILE]   Input file, default: stdin
  [OUTPUT_FILE]  Output file, default: stdout

Options:
  -j, --threads <THREADS>                Number of parallel threads to use.
                                         0 means no multi-threading
                                         1 means running the identification in a separated thread
                                         >1 run multithreading [default: 0]
  -b, --batch-size <BATCH_SIZE>          Number of text segments to pre-load for parallel processing [default: 100000]
  -c, --ignore-confidence                Ignore confidence thresholds. Predictions under the thresholds will not be labeled as 'und'
  -s, --print-scores                     Print confidence score (higher is better) or raw score (lower is better) in case '-c' is provided
  -m, --model-dir <MODEL_DIR>            Model directory containing binarized model or plain text model. Default is Python module path or './LanguageModels' if relevant languages are requested
  -l, --relevant-langs <RELEVANT_LANGS>  Load only relevant languages. Specify a comma-separated list of language codes. Needs plain text model directory
  -h, --help                             Print help
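
When heliport is driven from a script, it can help to assemble the command line from the flags documented above. The following is a minimal sketch, not part of heliport itself: the helper names build_identify_cmd and identify_lines are hypothetical, and only the flags shown in the help text are used. identify_lines assumes that heliport is installed and that the model has been downloaded or binarized.

```python
import subprocess
from typing import List, Optional


def build_identify_cmd(
    threads: int = 0,
    batch_size: Optional[int] = None,
    print_scores: bool = False,
    ignore_confidence: bool = False,
    input_file: Optional[str] = None,
    output_file: Optional[str] = None,
) -> List[str]:
    """Assemble an argv list for `heliport identify` from the documented flags."""
    if output_file is not None and input_file is None:
        # OUTPUT_FILE is the second positional argument, so it needs INPUT_FILE.
        raise ValueError("output_file requires input_file")
    cmd = ["heliport", "identify", "--threads", str(threads)]
    if batch_size is not None:
        cmd += ["--batch-size", str(batch_size)]
    if print_scores:
        cmd.append("--print-scores")
    if ignore_confidence:
        cmd.append("--ignore-confidence")
    if input_file is not None:
        cmd.append(input_file)
    if output_file is not None:
        cmd.append(output_file)
    return cmd


def identify_lines(lines: List[str], **options) -> List[str]:
    """Pipe lines to the CLI on stdin and return its output lines.

    Hypothetical wrapper: requires heliport on PATH and a ready model.
    """
    cmd = build_identify_cmd(**options)
    result = subprocess.run(
        cmd,
        input="\n".join(lines) + "\n",
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.splitlines()
```

For example, build_identify_cmd(threads=4, print_scores=True) yields the argument list for a four-thread run that also prints scores.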

Python package

>>> from heliport import Identifier
>>> i = Identifier()
>>> i.identify("L'aigua clara")
'cat_latn'

Remember to download or binarize the model first!
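
A common follow-up is tallying the predicted labels over a whole file. The sketch below assumes only what the example above shows, namely that Identifier().identify takes a string and returns a label such as 'cat_latn'. The helper count_languages and the main function are hypothetical names introduced for illustration.

```python
from collections import Counter
from typing import Callable, Iterable


def count_languages(lines: Iterable[str], identify: Callable[[str], str]) -> Counter:
    """Count predicted labels over the given lines, skipping blank lines."""
    counts = Counter()
    for line in lines:
        text = line.strip()
        if text:
            counts[identify(text)] += 1
    return counts


def main(path: str) -> None:
    # Imported here so count_languages can be used without heliport installed.
    from heliport import Identifier

    identifier = Identifier()
    with open(path, encoding="utf-8") as handle:
        counts = count_languages(handle, identifier.identify)
    # Print the labels from most to least frequent.
    for label, n in counts.most_common():
        print(f"{label}\t{n}")
```

Passing the identify method as a callable keeps the counting logic independent of the model, so it can be exercised with a stub function.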

Rust crate

use std::path::PathBuf;
use heliport::identifier::Identifier;
use heliport::lang::Lang;

let identifier = Identifier::load(
    PathBuf::from("/path/to/model_dir"),
    None,
    );
// identify returns the predicted language together with its score
let (lang, score) = identifier.identify("L'aigua clara");
assert_eq!(lang, Lang::cat);

Differences with HeLI-OTS

Although heliport currently uses the same models as HeLI-OTS 2.0 and the identification algorithm is almost the same, there are a few differences (mainly during pre-processing) that may cause different results. However, in most cases these should not decrease accuracy and should not happen frequently.

Note: Both tools have a pre-processing step for each identified text to remove all non-alphabetic characters.

The implementation differences that can change results are:

  • During preprocessing, HeLI removes URLs and words beginning with @, while heliport does not.
  • Since version 1.5, HeLI repeats every word that does not start with a capital letter during preprocessing, probably to penalize proper nouns. In our tests, however, we have not found a significant improvement from this. To avoid almost doubling the cost of prediction, it has not been implemented. It may be added in the future if there is a need for it and it can be implemented efficiently.
  • Rust and Java sometimes differ in the least significant digits of a float, so the stored n-gram probabilities are not exactly the same. This is very unlikely to affect predicted labels.
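
The first two differences can be illustrated with a small sketch. This is not HeLI's or heliport's code. The token rules below are assumptions chosen for illustration: a URL is taken to be any whitespace-delimited token starting with http://, https:// or www., and both function names are hypothetical.

```python
def strip_urls_and_mentions(text: str) -> str:
    """Drop tokens that look like URLs or start with '@' (assumed rules)."""
    url_prefixes = ("http://", "https://", "www.")
    kept = [
        token
        for token in text.split()
        if not token.startswith("@") and not token.lower().startswith(url_prefixes)
    ]
    return " ".join(kept)


def repeat_non_capitalized(text: str) -> str:
    """Emit every word once, and a second time if it does not start with a capital."""
    out = []
    for word in text.split():
        out.append(word)
        if not word[0].isupper():
            out.append(word)
    return " ".join(out)
```

Because most words in running text are not capitalized, the second function nearly doubles the number of words to score, which is the cost mentioned above. Users who want the URL and mention filtering could apply a step like the first function to their text before passing it to heliport.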

Benchmarks

Speed benchmarks with 100k random sentences from OpenLID, all the tools running single-threaded:

tool                         time (s)
CLD2                             1.12
HeLI-OTS                        60.37
lingua all high preloaded       56.29
lingua all low preloaded        23.34
fasttext openlid193              8.44
heliport                         2.33
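
The relative speeds follow directly from the timings above. As a worked check, the sketch below divides each tool's time by heliport's. The dictionary simply restates the benchmark figures, and the function name time_relative_to is hypothetical.

```python
# Benchmark timings in seconds, copied from the table above.
TIMINGS = {
    "CLD2": 1.12,
    "HeLI-OTS": 60.37,
    "lingua all high preloaded": 56.29,
    "lingua all low preloaded": 23.34,
    "fasttext openlid193": 8.44,
    "heliport": 2.33,
}


def time_relative_to(tool: str, baseline: str = "heliport") -> float:
    """Return how many times longer `tool` took than `baseline`."""
    return TIMINGS[tool] / TIMINGS[baseline]


if __name__ == "__main__":
    for name in TIMINGS:
        print(f"{name}: {time_relative_to(name):.1f}x the time of heliport")
```

For HeLI-OTS the ratio is 60.37 / 2.33, roughly 25.9, which matches the 25x speedup quoted in the introduction. CLD2 is the only tool in the table with a ratio below 1, meaning it ran faster than heliport in this benchmark.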
