Fast and accurate language identifier

heliport

A language identification tool that aims for both speed and accuracy, with support for 220 languages (or add your own languages!).

This tool is an efficient Rust port of HeLI-OTS, achieving speedups of about 25x while producing almost identical output.

Installation

From PyPI

Install it in your environment:

pip install heliport

NOTE: Since version 0.8, models no longer need to be downloaded separately.
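
As a rough way to check which side of the 0.8 boundary an environment is on, the sketch below reads the installed version through the standard library's importlib.metadata. The helper names (installed_version, version_tuple, bundles_models) are hypothetical and not part of heliport, and the version parsing is deliberately simplistic: it only looks at the leading numeric components.

```python
from importlib import metadata
from typing import Optional, Tuple


def installed_version(package: str) -> Optional[str]:
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def version_tuple(version: str) -> Tuple[int, ...]:
    """Parse the leading numeric components of a version, e.g. '0.10.0' -> (0, 10, 0)."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def bundles_models(version: str) -> bool:
    """True for versions at or above 0.8, where no separate model download is needed."""
    return version_tuple(version) >= (0, 8)


if __name__ == "__main__":
    found = installed_version("heliport")
    if found is None:
        print("heliport is not installed")
    else:
        print(found, "bundles models:", bundles_models(found))
```

Comparing tuples rather than strings matters here: as plain strings, "0.10.0" would sort before "0.8".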

From source

Install the requirements:

Clone the repo, build the package and binarize the model:

git clone https://github.com/ZJaume/heliport
cd heliport
pip install .

Usage

CLI

Run the heliport identify command, which reads lines from stdin by default and prints one language code per line:

cat sentences.txt | heliport identify
eng
cat
rus
...

The full list of options, as printed by heliport identify --help:

Identify languages of input text

Usage: heliport identify [OPTIONS] [INPUT_FILE] [OUTPUT_FILE]

Arguments:
  [INPUT_FILE]   Input file, default: stdin
  [OUTPUT_FILE]  Output file, default: stdout

Options:
  -j, --threads <THREADS>                Number of parallel threads to use.
                                         0 means no multi-threading
                                         1 means running the identification in a separated thread
                                         >1 run multithreading [default: 0]
  -b, --batch-size <BATCH_SIZE>          Number of text segments to pre-load for parallel processing [default:
                                         100000]
  -c, --ignore-confidence                Ignore confidence thresholds. Predictions under the thresholds will not be
                                         labeled as 'und'
  -s, --print-scores                     Print confidence score (higher is better) or raw score (lower is better) in
                                         case '-c' is provided
  -n, --not-strict                       Do not be strict when loading confidence thresholds (do not fail if one
                                         language is missing)
  -p, --precision <PRECISION>            Number of decimals precision when printing scores [default: 4]
  -m, --model-dir <MODEL_DIR>            Model directory containing binarized model or plain text model. Default is
                                         Python module path or './LanguageModels' if relevant languages are requested
  -l, --relevant-langs <RELEVANT_LANGS>  Load only relevant languages. Specify a comma-separated list of language
                                         codes. Needs plain text model directory
  -h, --help                             Print help
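
When calling the CLI from a script, it can help to assemble the argument list programmatically. The following is a minimal sketch that uses only flags listed in the help text above; the function build_identify_command and its parameter names are hypothetical helpers, not part of heliport.

```python
import shlex
from typing import List, Optional


def build_identify_command(
    input_file: Optional[str] = None,
    output_file: Optional[str] = None,
    threads: int = 0,
    print_scores: bool = False,
    ignore_confidence: bool = False,
    precision: Optional[int] = None,
) -> List[str]:
    """Build an argument list for `heliport identify` from the documented options."""
    cmd = ["heliport", "identify"]
    if threads:
        cmd += ["--threads", str(threads)]
    if print_scores:
        cmd.append("--print-scores")
    if ignore_confidence:
        cmd.append("--ignore-confidence")
    if precision is not None:
        cmd += ["--precision", str(precision)]
    # Positional arguments: OUTPUT_FILE can only follow INPUT_FILE.
    if output_file and not input_file:
        raise ValueError("output_file requires input_file")
    if input_file:
        cmd.append(input_file)
    if output_file:
        cmd.append(output_file)
    return cmd


if __name__ == "__main__":
    cmd = build_identify_command("sentences.txt", "langs.txt", threads=4, print_scores=True)
    # Print the shell-quoted command; pass `cmd` to subprocess.run(cmd, check=True) to execute it.
    print(shlex.join(cmd))
```

Passing the list to subprocess.run avoids shell quoting problems with file names that contain spaces.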

Python package

>>> from heliport import Identifier
>>> i = Identifier()
>>> i.identify("L'aigua clara")
'cat'

For further information on the available functions and parameters, please take a look at the module docs:

>>> import heliport
>>> help(heliport)
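
To label many lines at once, the identify call shown above can be wrapped in a small loop. The sketch below relies only on the Identifier().identify(text) usage from the example; count_languages is a hypothetical helper that accepts any callable mapping a string to a language code, so it can be exercised without the package installed.

```python
from collections import Counter
from typing import Callable, Iterable


def count_languages(identify: Callable[[str], str], lines: Iterable[str]) -> Counter:
    """Count the language codes returned by `identify` over the non-empty lines."""
    counts: Counter = Counter()
    for line in lines:
        text = line.strip()
        if not text:
            continue  # skip blank lines instead of asking for a prediction
        counts[identify(text)] += 1
    return counts


if __name__ == "__main__":
    try:
        from heliport import Identifier
    except ImportError:
        print("heliport is not installed")
    else:
        identifier = Identifier()
        print(count_languages(identifier.identify, ["L'aigua clara", ""]))
```

Creating the Identifier once and reusing it for every line avoids loading the model repeatedly.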

Rust crate

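use std::path::PathBuf;
use heliport::identifier::Identifier;
use heliport::lang::Lang;

// Load the model from its directory; the second argument is left as `None`.
let identifier = Identifier::load(
    PathBuf::from("/path/to/model_dir"),
    None,
);
// `identify` returns the predicted language together with its score.
let (lang, score) = identifier.identify("L'aigua clara");
assert_eq!(lang, Lang::cat);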

Differences with HeLI-OTS

Although heliport currently uses the same models as HeLI-OTS 2.0 and the identification algorithm is almost the same, there are a few differences (mainly during pre-processing) that may cause different results. However, in most cases, these should not decrease accuracy and should not happen frequently.

Note: Both tools have a pre-processing step for each identified text to remove all non-alphabetic characters.
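
As an illustration of the shared pre-processing step described in this note, here is a minimal sketch. It is not the code used by either tool: replacing each non-alphabetic character with a space and collapsing whitespace is an assumption, and keep_alphabetic is a hypothetical name.

```python
def keep_alphabetic(text: str) -> str:
    """Replace every non-alphabetic character with a space, then collapse whitespace.

    Assumption: `str.isalpha` (Unicode-aware) stands in for whatever definition
    of "alphabetic" the real tools use.
    """
    replaced = "".join(ch if ch.isalpha() else " " for ch in text)
    return " ".join(replaced.split())


if __name__ == "__main__":
    print(keep_alphabetic("L'aigua clara, 100%!"))
```

Because str.isalpha works on Unicode letters, non-Latin scripts such as Cyrillic pass through unchanged while digits and punctuation are dropped.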

The implementation differences that can change results are:

  • During preprocessing, HeLI removes URLs and words beginning with @, while heliport does not.
  • Since 1.5, during preprocessing, HeLI repeats every word that does not start with a capital letter, probably to penalize proper nouns. In our tests, however, we have not found a significant improvement from this. Therefore, to avoid almost doubling the cost of prediction, it has not been implemented in heliport. It may be implemented in the future if there is a need for it and it can be done efficiently.
  • The Rust and Java implementations have small precision differences due to Rust accumulating probabilities with double-precision floats.
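
The first two differences in the list above can be sketched as follows. This is only an illustration of the behaviour described for HeLI, not its actual code: the URL pattern is a simplifying assumption, and both function names are hypothetical.

```python
import re

# Assumption: a "URL" is any whitespace-delimited token starting with http(s):// or www.
URL_RE = re.compile(r"(https?://|www\.)\S*")


def drop_urls_and_mentions(text: str) -> str:
    """Remove URL-like tokens and words beginning with '@' (HeLI-style step)."""
    kept = [
        word
        for word in text.split()
        if not word.startswith("@") and not URL_RE.match(word)
    ]
    return " ".join(kept)


def repeat_uncapitalized_words(text: str) -> str:
    """Emit every word that does not start with a capital letter twice."""
    out = []
    for word in text.split():
        out.append(word)
        if not word[0].isupper():
            out.append(word)  # lowercase-initial words get double weight
    return " ".join(out)


if __name__ == "__main__":
    print(drop_urls_and_mentions("hello @user see https://example.com now"))
    print(repeat_uncapitalized_words("Maria lives in Paris"))
```

In the second function, a text made mostly of lowercase words nearly doubles in length, which is the source of the almost 2x prediction cost mentioned in the list.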

Benchmarks

Speed benchmarks with 100k random sentences from OpenLID, all the tools running single-threaded:

tool                        time (s)
CLD2                            1.12
HeLI-OTS                       60.37
lingua all high preloaded      56.29
lingua all low preloaded       23.34
fasttext openlid193             8.44
heliport                        2.33
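
The relative speeds implied by these timings can be worked out with a few lines of arithmetic. The sketch below uses only the numbers from the table and the 100k sentence count stated above; the helper names are hypothetical.

```python
# Wall-clock times in seconds for 100k sentences, copied from the benchmark table.
TIMES_S = {
    "CLD2": 1.12,
    "HeLI-OTS": 60.37,
    "lingua all high preloaded": 56.29,
    "lingua all low preloaded": 23.34,
    "fasttext openlid193": 8.44,
    "heliport": 2.33,
}
SENTENCES = 100_000


def heliport_speedup_vs(tool: str) -> float:
    """How many times faster heliport ran than `tool` (values below 1 mean slower)."""
    return TIMES_S[tool] / TIMES_S["heliport"]


def throughput(tool: str) -> float:
    """Sentences processed per second by `tool` in this benchmark."""
    return SENTENCES / TIMES_S[tool]


if __name__ == "__main__":
    for tool in TIMES_S:
        print(f"{tool:27s} {heliport_speedup_vs(tool):6.2f}x {throughput(tool):10.0f} sent/s")
```

With these figures, heliport comes out roughly 25.9 times faster than HeLI-OTS, while CLD2 remains about twice as fast as heliport.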

Connecting Europe Facility

All documents and software contained in this repository reflect only the authors' view. The Innovation and Networks Executive Agency of the European Union is not responsible for any use that may be made of the information it contains.
