Sensing the language of the text using Machine Learning

Project description

Luga

  • Blazing-fast language detection using fastText's language models.

Languages

Luga is a Swahili word for language. fastText provides a blazing-fast language detection tool. Lamentably, fastText's API is inelegant, and the documentation is a bit fuzzy. It is also awkward that we have to download and load the models manually.

This is where Luga comes in. It abstracts away the unnecessary steps and lets you do precisely one thing: detect the language of a text.

Cover image: Stand Still. Stay Silent - The relationships between Indo-European and Uralic languages, by Minna Sundberg.

Show, don't tell

Luga in Action

Installation

python -m pip install -U luga

Usage:

⚠️ Note: The first use downloads the model for you, so the first import takes a bit longer depending on your internet speed. The download happens only once.

from luga import language

print(language("the world ended yesterday"))

# Language(name='en', score=0.9804665446281433)
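
The printed result suggests `name` and `score` fields, which makes it straightforward to ignore low-confidence detections. A minimal sketch, assuming those two attributes; the `Language` class below is a local stand-in rather than Luga's own class, and `is_confident` and the 0.9 threshold are illustrative choices:

```python
from typing import NamedTuple


class Language(NamedTuple):
    """Stand-in mirroring the printed result above; not Luga's own class."""

    name: str
    score: float


def is_confident(result: Language, wanted: str, min_score: float = 0.9) -> bool:
    """True when the detected language matches and the score clears the bar."""
    return result.name == wanted and result.score >= min_score


result = Language(name="en", score=0.9804665446281433)
print(is_confident(result, "en"))        # True
print(is_confident(result, "en", 0.99))  # False
print(is_confident(result, "da"))        # False
```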

Given a list of texts, we can create a mask for a filtering pipeline that can be used, for example, with DataFrames:

from luga import languages
import pandas as pd

examples = ["Jeg har ikke en rød reje", "Det blæser en halv pelican", "We are not robots yet"]
languages(texts=examples, only_language=True, to_array=True) == "en"
# output
# array([False, False, True])

dataf = pd.DataFrame({"text": examples})
dataf.loc[lambda d: languages(texts=d["text"].to_list(), only_language=True, to_array=True) == "en", "text"]
# output
# 2    We are not robots yet
# Name: text, dtype: object

Without Luga:

Download the model

wget https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin -O /tmp/lid.176.bin
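
For scripted setups, the same download can be done from Python and skipped when the file is already present. A minimal sketch, assuming a local cache directory; `ensure_model` and its `fetch` parameter are hypothetical names, and this is not necessarily how Luga caches the model internally:

```python
from pathlib import Path
from typing import Callable
from urllib.request import urlretrieve

MODEL_URL = "https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin"


def _download(url: str, target: Path) -> None:
    """Default fetcher: save the file at `url` to `target`."""
    urlretrieve(url, str(target))


def ensure_model(
    cache_dir: Path,
    url: str = MODEL_URL,
    fetch: Callable[[str, Path], None] = _download,
) -> Path:
    """Return the local model path, downloading only if the file is missing."""
    target = cache_dir / url.rsplit("/", 1)[-1]
    if not target.exists():
        cache_dir.mkdir(parents=True, exist_ok=True)
        fetch(url, target)
    return target
```

Calling `ensure_model(Path("/tmp"))` would return `/tmp/lid.176.bin`, fetching it on the first call only. The `fetch` parameter exists so the network step can be swapped out, for example in tests.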

Load and use

import fasttext

PATH_TO_MODEL = '/tmp/lid.176.bin'
fmodel = fasttext.load_model(PATH_TO_MODEL)
fmodel.predict(["the world ended yesterday"])

# ([['__label__en']], [array([0.98046654], dtype=float32)])
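
The raw output still needs unpacking: each label carries a `__label__` prefix, and labels and scores arrive as parallel nested sequences. A minimal sketch of that post-processing, assuming the output shape shown above; `top_predictions` is a hypothetical helper, and plain lists stand in for the NumPy arrays:

```python
from typing import List, Sequence, Tuple

LABEL_PREFIX = "__label__"


def top_predictions(
    labels: Sequence[Sequence[str]], scores: Sequence[Sequence[float]]
) -> List[Tuple[str, float]]:
    """Pair each text's first label (prefix stripped) with its score."""
    results = []
    for text_labels, text_scores in zip(labels, scores):
        name = text_labels[0]
        if name.startswith(LABEL_PREFIX):
            name = name[len(LABEL_PREFIX):]
        results.append((name, float(text_scores[0])))
    return results


raw_labels, raw_scores = [["__label__en"]], [[0.98046654]]
print(top_predictions(raw_labels, raw_scores))
# [('en', 0.98046654)]
```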

Dev:

poetry run pre-commit install

Release Flow

# assumes git push is completed
git tag -l #  lists tags
git tag v*.*.* # Major.Minor.Fix
git push origin tag v*.*.*

# to delete tag:
git tag -d v*.*.* && git push origin tag -d v*.*.*

# update the version in pyproject.toml and __init__.py to match the new tag
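
Because the tag and the package version are edited by hand in separate places, a small check before pushing can catch a mismatch. A minimal sketch, assuming tags follow the `vMajor.Minor.Fix` pattern above; `tag_matches_version` is a hypothetical helper and not part of the project:

```python
import re

# "v" followed by three dot-separated integers, e.g. v0.2.7
TAG_PATTERN = re.compile(r"^v(\d+)\.(\d+)\.(\d+)$")


def tag_matches_version(tag: str, version: str) -> bool:
    """True when `tag` is well formed and equals "v" + `version`."""
    return TAG_PATTERN.match(tag) is not None and tag[1:] == version


print(tag_matches_version("v0.2.7", "0.2.7"))  # True
print(tag_matches_version("v0.2.8", "0.2.7"))  # False
print(tag_matches_version("0.2.7", "0.2.7"))   # False
```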

TODO:

  • refactor artifacts.py
  • auto checkers with pre-commit | invoke
  • write more tests
  • write github actions
  • create an intelligent data checker (a fast List[str] check; decide what to do with non-string values)
  • make it faster with Cython
  • get NDArray typing correctly
  • fix artifacts.py line 111 cast to List[str] that causes issues
  • remove nptyping when more packages move to numpy > 1.21
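
One possible shape for the data checker item above is to separate usable strings from everything else while remembering their positions, so results can be mapped back to the original rows. A minimal sketch under that assumption; `split_valid_texts` is a hypothetical name, and dropping invalid entries (rather than raising) is just one of several reasonable policies:

```python
from typing import Any, Iterable, List, Tuple


def split_valid_texts(values: Iterable[Any]) -> Tuple[List[int], List[str]]:
    """Return the positions and cleaned text of entries that are non-empty strings."""
    positions: List[int] = []
    texts: List[str] = []
    for index, value in enumerate(values):
        if isinstance(value, str) and value.strip():
            positions.append(index)
            # Collapse all whitespace, including newlines, into single spaces
            # so each text is a single line.
            texts.append(" ".join(value.split()))
    return positions, texts


values = ["We are not robots yet", None, "", 42, "two\nlines"]
print(split_valid_texts(values))
# ([0, 4], ['We are not robots yet', 'two lines'])
```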

Download files

Source Distribution

luga-0.2.7.tar.gz (5.5 kB)

Uploaded Source

Built Distribution

luga-0.2.7-py3-none-any.whl (5.7 kB)

Uploaded Python 3

File details

Details for the file luga-0.2.7.tar.gz.

File metadata

  • Download URL: luga-0.2.7.tar.gz
  • Size: 5.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.3.1 CPython/3.10.6 Linux/5.15.0-1024-azure

File hashes

Hashes for luga-0.2.7.tar.gz
Algorithm Hash digest
SHA256 f59a07dc9eaa6b72b8b88ddea69be292b9fbc4d1522cf3ed4a6f29fc5d7feaff
MD5 90512e900b10169ec7a8d3bacb97d52f
BLAKE2b-256 12e77bcb3cee8e1fd07eff6b29c071c2ed78761c2308c14b006fc11cf5295eaa

File details

Details for the file luga-0.2.7-py3-none-any.whl.

File metadata

  • Download URL: luga-0.2.7-py3-none-any.whl
  • Size: 5.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.3.1 CPython/3.10.6 Linux/5.15.0-1024-azure

File hashes

Hashes for luga-0.2.7-py3-none-any.whl
Algorithm Hash digest
SHA256 b4a6f39fb2e1d5dbec5fd1cc26646ac83ee0acfc6599732bbef1bd5ce4e35b94
MD5 cd8089ec6e9e371a521f8a28f625e439
BLAKE2b-256 860b201e6de09986764cf4ae15b3a3b6165389c764f40f0866540a7b04bddcd2
