A multilingual text analytics package.

Project description

Lingualytics: Easy code-mixed analytics

Lingualytics is a Python library for working with code-mixed text.
Lingualytics builds on PyTorch, Transformers, Texthero, NLTK and scikit-learn.

Features

  1. Preprocessing

    • Remove stopwords
    • Remove punctuation, with an option to add punctuation from your own language
    • Remove words shorter than a given character limit
  2. Representation

    • Find n-grams in a given text
  3. NLP

    • Classification using PyTorch
      • Train a classifier on your data to perform tasks like sentiment analysis
      • Evaluate the classifier with metrics like accuracy, F1 score, precision and recall
      • Use the trained tokenizer to tokenize text
    • Use pretrained Hugging Face models trained on code-mixed datasets

Installation

Use the package manager pip to install lingualytics.

pip install lingualytics

Usage

Preprocessing

from lingualytics.preprocessing import remove_lessthan, remove_punctuation, remove_stopwords
from lingualytics.stopwords import hi_stopwords, en_stopwords
from texthero.preprocessing import remove_digits
import pandas as pd
df = pd.read_csv(
   "https://github.com/lingualytics/py-lingualytics/raw/master/datasets/SAIL_2017/Processed_Data/Devanagari/validation.txt", header=None, sep='\t', names=['text','label']
)
# pd.set_option('display.max_colwidth', None)
df['clean_text'] = df['text'].pipe(remove_digits) \
                             .pipe(remove_punctuation) \
                             .pipe(remove_lessthan, length=3) \
                             .pipe(remove_stopwords, stopwords=en_stopwords.union(hi_stopwords))
print(df)
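For intuition, here is a pure-Python sketch of what this pipeline does, without pandas or texthero. The tiny `en_stop` / `hi_stop` sets are hypothetical stand-ins for the much larger `en_stopwords` / `hi_stopwords` sets shipped with lingualytics:

```python
import string

# Hypothetical, tiny stopword sets for illustration only.
en_stop = {"the", "is", "and"}
hi_stop = {"hai", "aur"}

def clean(text, stopwords, min_len=3):
    # Mirror the pipeline above: drop digits, strip punctuation,
    # then remove short words and stopwords.
    text = "".join(ch for ch in text if not ch.isdigit())
    text = text.translate(str.maketrans("", "", string.punctuation))
    words = [w for w in text.split()
             if len(w) >= min_len and w.lower() not in stopwords]
    return " ".join(words)

print(clean("The movie is great, aur acting bhi 10/10 hai!", en_stop | hi_stop))
# → movie great acting bhi
```

The real pipeline applies the same steps column-wise on a pandas Series via `.pipe`.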

Classification

The training data path should contain three files: train.txt, validation.txt and test.txt.

You can download datasets/SAIL_2017/Processed Data/Devanagari from the GitHub repository to try this out.
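The expected layout of the dataset directory, assuming it is named to match the `dataset` argument passed to the learner:

```
SAIL-2017/
├── train.txt
├── validation.txt
└── test.txt
```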

from lingualytics.learner import Learner

learner = Learner(model_type='bert',
                  model_name='bert-base-multilingual-cased',
                  dataset='SAIL-2017')
learner.fit()

Find top n-grams

from lingualytics.representation import get_ngrams
import pandas as pd
df = pd.read_csv(
   "https://github.com/jbesomi/texthero/raw/master/dataset/bbcsport.csv"
)

ngrams = get_ngrams(df['text'],n=2)

print(ngrams[:10])
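Conceptually, an n-gram counter like `get_ngrams` can be sketched in pure Python with `collections.Counter` (this is an illustration, not the lingualytics implementation):

```python
from collections import Counter

def top_ngrams(texts, n=2, k=5):
    # Count word n-grams across a collection of documents
    # and return the k most frequent.
    counts = Counter()
    for text in texts:
        words = text.lower().split()
        counts.update(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    return counts.most_common(k)

docs = ["the cup final", "the cup draw", "final whistle blows"]
print(top_ngrams(docs, n=2, k=3))
# → [(('the', 'cup'), 2), (('cup', 'final'), 1), (('cup', 'draw'), 1)]
```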

Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.

License

MIT
