Language detection for news powered by fasttext

fastlangid

The only language identification package that covers Cantonese (廣東話) as well as Traditional and Simplified Chinese.

Why and who is this package for?

This is a language identification package focused on providing higher accuracy for Japanese, Korean, and Chinese than the original fastText model (lid.176.ftz). The package also identifies Cantonese, Simplified Chinese, and Traditional Chinese.

F1 score of the first-stage model, which is the same as the fastText language identification model:

Model	F1@1
lid.176.ftz	0.977

We achieve higher accuracy by adding a second language identification model that handles low-confidence predictions for Japanese, Korean, and Chinese. The table below shows F1 (k=1) scores for identifying these 3 languages. (The validation corpus has been updated and is much harder than in the first revision: shorter text and more recent news text.)

2nd-Stage Model	F1@1	Acc@1
version 1.0.0	0.826	0.744
master	0.801	0.894
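
The two-stage idea described above can be sketched as follows. This is only an illustration under stated assumptions, not the package's actual implementation: the `Model` type, the `0.8` threshold, the set of codes that trigger the second stage, and the stub models are all hypothetical.

```python
from typing import Callable, Tuple

# Hypothetical type: a model maps text to (language_code, confidence).
Model = Callable[[str], Tuple[str, float]]

# Codes assumed to be refined by the second stage.
CJK_CODES = {"ja", "ko", "zh"}


def two_stage_predict(
    text: str,
    first_stage: Model,
    second_stage: Model,
    threshold: float = 0.8,  # assumed cut-off, not the package's value
) -> str:
    """Return the first-stage label unless it is a low-confidence CJK guess."""
    code, confidence = first_stage(text)
    if code in CJK_CODES and confidence < threshold:
        # Defer to the specialised model for uncertain CJK predictions.
        code, _ = second_stage(text)
    return code


# Stub models stand in for real classifiers so the sketch runs on its own.
def stub_first(text: str) -> Tuple[str, float]:
    has_han = any("\u4e00" <= ch <= "\u9fff" for ch in text)
    return ("zh", 0.55) if has_han else ("en", 0.99)


def stub_second(text: str) -> Tuple[str, float]:
    return ("zh-hant", 0.9)


print(two_stage_predict("中文繁體", stub_first, stub_second))
print(two_stage_predict("This is a test", stub_first, stub_second))
```

With the stubs above, the uncertain Chinese prediction is passed to the second stage, while the confident English prediction is returned unchanged.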

The master version is also trained to identify Cantonese (zh-yue) using text from the Mozilla Common Voice corpus. The model is currently sensitive to non-Cantonese text mixed into a sentence, so please use it with care.

To use Cantonese prediction, we recommend forcing inference through the second-stage model:

lang_code = langid.predict('平嘢有冇好嘢?', force_second=True)

For more details on edge cases, see fasttext_issues.py.
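
One way to decide when to pass `force_second=True` is a simple character heuristic, sketched below. The marker list and both helper functions are assumptions made for illustration and are not part of fastlangid; `predict` is expected to accept a `force_second` keyword as `LID.predict` does in the example above.

```python
# Characters common in written Cantonese and rare in standard written
# Chinese. This list is an illustrative assumption and is not exhaustive.
CANTONESE_MARKERS = set("嘢冇咗嘅喺唔哋")


def looks_cantonese(text):
    """Return True if the text contains any Cantonese marker character."""
    return any(ch in CANTONESE_MARKERS for ch in text)


def predict_with_cantonese_hint(text, predict):
    """Call `predict`, forcing the second stage when markers are present."""
    if looks_cantonese(text):
        return predict(text, force_second=True)
    return predict(text)
```

With a real model this could be called as `predict_with_cantonese_hint('平嘢有冇好嘢?', langid.predict)`. The heuristic will miss Cantonese sentences that happen to contain none of the listed characters.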

The training data for the supplementary model was drawn from the Common Crawl corpus and the Currents API internal language dataset.

We wish to support the Cantonese language in the near future. Feel free to contact us if you would like to provide a related corpus.

Install

$ pip install fastlangid

Example

A single function call handles either one sentence or a list of sentences.

from fastlangid.langid import LID
langid = LID()
result = langid.predict('This is a test')
print(result)

from fastlangid.langid import LID
langid = LID()
examples = [
  '中文繁體',
  '中文简体',
  'Lorem Ipsum is simply dummy text of the printing and typesetting industry',
  'Lorem Ipsum adalah text contoh digunakan didalam industri pencetakan dan typesetting',
  'Le Lorem Ipsum est simplement du faux texte employé dans la composition et la mise en page avant impression'
]
results = langid.predict(examples)
print(results)
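
When many texts are classified at once, it is often useful to summarise the predictions. The sketch below assumes that calling `predict` with a list returns one language code per input, in order; that return shape is an assumption here, and `language_counts` is a hypothetical helper rather than part of the package.

```python
from collections import Counter
from typing import Callable, List, Sequence


def language_counts(
    texts: Sequence[str],
    predict: Callable[[List[str]], List[str]],
) -> Counter:
    """Count how often each predicted language code occurs in `texts`."""
    codes = predict(list(texts))
    return Counter(codes)
```

With a real model this could be called as `language_counts(examples, langid.predict)`.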

Supported Languages

Supports 177 languages. The ISO codes of the supported languages are listed below.

af als am an ar arz as ast av az azb ba bar bcl be bg bh bn bo bpy br bs bxr ca cbk
ce ceb ckb co cs cv cy da de diq dsb dty dv el eml en eo es et eu fa fi fr frr fy ga
gd gl gn gom gu gv he hi hif hr hsb ht hu hy ia id ie ilo io is it ja jbo jv ka kk km
kn ko krc ku kv kw ky la lb lez li lmo lo lrc lt lv mai mg mhr min mk ml mn mr mrj ms
mt mwl my myv mzn nah nap nds ne new nl nn no oc or os pa pam pfl pl pms pnb ps pt qu
rm ro ru rue sa sah sc scn sco sd sh si sk sl so sq sr su sv sw ta te tg th tk tl tr
tt tyv ug uk ur uz vec vep vi vls vo wa war wuu xal xmf yi yo yue zh-hans zh-hant zh-yue

Caveats

Bag-of-words methods do not work well for short-text classification, as noted in an article by Apple. We therefore recommend making sure the text is longer than 5 characters/words.

Cantonese language identification is trained on everyday conversational text, which may not be representative of article-style text. As a result, Cantonese may be confused with Traditional Chinese (zh-hant), since the two share the same characters.
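
The length recommendation above can be enforced with a small guard before calling the model. This is a sketch under assumptions: `predict_or_skip` is a hypothetical helper, it counts non-whitespace characters rather than words, and the default of 6 is one reading of "more than 5".

```python
from typing import Callable, Optional


def predict_or_skip(
    text: str,
    predict: Callable[[str], str],
    min_chars: int = 6,  # assumed reading of "more than 5 characters"
) -> Optional[str]:
    """Return a prediction only if the text is long enough, else None."""
    # Count non-whitespace characters so padding does not inflate the length.
    length = sum(1 for ch in text if not ch.isspace())
    if length < min_chars:
        return None
    return predict(text)
```

With a real model this could be called as `predict_or_skip('This is a test', langid.predict)`. For space-delimited languages, counting words instead of characters may be the better choice.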

Reference

Enriching Word Vectors with Subword Information

[1] P. Bojanowski*, E. Grave*, A. Joulin, T. Mikolov, Enriching Word Vectors with Subword Information

@article{bojanowski2016enriching,
  title={Enriching Word Vectors with Subword Information},
  author={Bojanowski, Piotr and Grave, Edouard and Joulin, Armand and Mikolov, Tomas},
  journal={arXiv preprint arXiv:1607.04606},
  year={2016}
}

Bag of Tricks for Efficient Text Classification

[2] A. Joulin, E. Grave, P. Bojanowski, T. Mikolov, Bag of Tricks for Efficient Text Classification

@article{joulin2016bag,
  title={Bag of Tricks for Efficient Text Classification},
  author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Mikolov, Tomas},
  journal={arXiv preprint arXiv:1607.01759},
  year={2016}
}

FastText.zip: Compressing text classification models

[3] A. Joulin, E. Grave, P. Bojanowski, M. Douze, H. Jégou, T. Mikolov, FastText.zip: Compressing text classification models

@article{joulin2016fasttext,
  title={FastText.zip: Compressing text classification models},
  author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Douze, Matthijs and J{\'e}gou, H{\'e}rve and Mikolov, Tomas},
  journal={arXiv preprint arXiv:1612.03651},
  year={2016}
}

License

Release history

Version	Release date
1.0.11	Dec 6, 2022
1.0.10	Dec 5, 2022
1.0.9	May 10, 2022
1.0.8	May 10, 2022
1.0.7	May 3, 2022
1.0.3	Jun 29, 2021
1.0.2	Jun 28, 2021
1.0.1	Jun 28, 2021
1.0.0	Mar 9, 2020

Download files

Source Distribution

fastlangid-1.0.11.tar.gz (1.2 MB)

Uploaded Dec 6, 2022 Source

Built Distribution

fastlangid-1.0.11-py2.py3-none-any.whl (1.2 MB)

Uploaded Dec 6, 2022 Python 2, Python 3

File details

Details for the file fastlangid-1.0.11.tar.gz.

File metadata

Download URL: fastlangid-1.0.11.tar.gz
Upload date: Dec 6, 2022
Size: 1.2 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.2 CPython/3.9.15

File hashes

Hashes for fastlangid-1.0.11.tar.gz
Algorithm	Hash digest
SHA256	`e19923245943714809e1ed283ae2fbc1223f64a0afee3e02ad66004edc119f50`
MD5	`d7c6fc104a6769d085e0d6fcc662fe5b`
BLAKE2b-256	`e1b71b17848fb522872f0923967d5c113338e23ba2be9d3c311d5215b88045ab`

File details

Details for the file fastlangid-1.0.11-py2.py3-none-any.whl.

File metadata

Download URL: fastlangid-1.0.11-py2.py3-none-any.whl
Upload date: Dec 6, 2022
Size: 1.2 MB
Tags: Python 2, Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.2 CPython/3.9.15

File hashes

Hashes for fastlangid-1.0.11-py2.py3-none-any.whl
Algorithm	Hash digest
SHA256	`feb53f9415f6556f5c29b98a6792f697d3780a970393a7ec6e44689257069992`
MD5	`b8702b758eb6ef4236b90a057527fbd7`
BLAKE2b-256	`c4aa11ed99e3592830829f7f8626738d27300f0939a1363864b1f8172bbf7b6f`
