A multilingual text analytics package.

Lingualytics: Easy code-mixed analytics

Lingualytics is a Python library for dealing with code-mixed text.
Lingualytics is powered by libraries like PyTorch, Transformers, Texthero, NLTK, and scikit-learn.

Features

  1. Preprocessing

    • Remove stopwords
    • Remove punctuation, with the option to add punctuation marks from your own language (see the sketch after this list)
    • Remove words shorter than a character limit
  2. Representation

    • Find n-grams in a given text
  3. NLP

    • Classification using PyTorch
      • Train a classifier on your data to perform tasks like sentiment analysis
      • Evaluate the classifier with metrics like accuracy, F1 score, precision, and recall
      • Use the trained tokenizer to tokenize text
    • Pretrained Hugging Face models trained on code-mixed datasets, ready to use
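
For instance, code-mixed Hindi-English text may contain the Devanagari danda (।), which ASCII-centric punctuation lists miss. A minimal sketch of the idea in plain pandas (not the library's API, whose exact keyword arguments this page does not document):

import re
import string
import pandas as pd

# Hypothetical example: extend the ASCII punctuation set with the
# Devanagari danda (।) and double danda (॥) before stripping.
custom_punct = string.punctuation + "।॥"
pattern = "[" + re.escape(custom_punct) + "]"

s = pd.Series(["यह वाक्य है। it has, some punctuation!"])
print(s.str.replace(pattern, " ", regex=True)[0])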

Installation

Use the package manager pip to install lingualytics.

pip install lingualytics

Usage

Preprocessing

from lingualytics.preprocessing import remove_lessthan, remove_punctuation, remove_stopwords
from lingualytics.stopwords import hi_stopwords, en_stopwords
from texthero.preprocessing import remove_digits
import pandas as pd

# Load the SAIL 2017 validation split: tab-separated text/label pairs.
df = pd.read_csv(
    "https://github.com/lingualytics/py-lingualytics/raw/master/datasets/SAIL_2017/Processed_Data/Devanagari/validation.txt",
    header=None, sep='\t', names=['text', 'label']
)
# pd.set_option('display.max_colwidth', None)

# Clean the text: drop digits and punctuation, remove words shorter than
# 3 characters, then strip English and Hindi stopwords.
df['clean_text'] = df['text'].pipe(remove_digits) \
                             .pipe(remove_punctuation) \
                             .pipe(remove_lessthan, length=3) \
                             .pipe(remove_stopwords, stopwords=en_stopwords.union(hi_stopwords))
print(df)

Classification

The train data directory should contain three files:

  • train.txt
  • validation.txt
  • test.txt

You can download datasets/SAIL_2017/Processed_Data/Devanagari from the GitHub repository to try this out.
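
Each file uses the same layout as the validation split loaded above: one example per line, with the text and its label separated by a tab. A made-up line for illustration:

yeh film bahut achhi thi	positive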

from lingualytics.learner import Learner

# Fine-tune a classifier on the dataset in data_dir; predictions and the
# trained model are written to output_dir.
learner = Learner(data_dir='<path-to-train-data>',
                  output_dir='<path-to-output-predictions-and-save-the-model>')
learner.fit()
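
The feature list above also mentions evaluating the classifier and reusing the trained tokenizer. A hedged sketch, assuming the Learner saves the fine-tuned model and tokenizer to output_dir in the standard Hugging Face save_pretrained layout (an assumption this page does not confirm):

from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Assumption: the Learner persisted the model and tokenizer to output_dir.
output_dir = '<path-to-output-predictions-and-save-the-model>'
tokenizer = AutoTokenizer.from_pretrained(output_dir)
model = AutoModelForSequenceClassification.from_pretrained(output_dir)

# Tokenize a (made-up) code-mixed sentence and predict its label.
batch = tokenizer(["yeh movie bahut achhi thi"], return_tensors="pt")
prediction = model(**batch).logits.argmax(dim=-1)
print(prediction)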

Find topmost n-grams

from lingualytics.representation import get_ngrams
import pandas as pd

# Load the BBC Sport dataset used in the Texthero documentation.
df = pd.read_csv(
    "https://github.com/jbesomi/texthero/raw/master/dataset/bbcsport.csv"
)

# Extract bigrams (n=2) from the text column.
ngrams = get_ngrams(df['text'], n=2)

print(ngrams[:10])
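
For intuition, the same "topmost n-grams" idea can be sketched with collections.Counter; this illustrates the concept, not the library's implementation:

from collections import Counter

import pandas as pd

# Count bigrams across a column of documents and report the most common.
texts = pd.Series(["the cat sat", "the cat ran", "a cat sat"])
counts = Counter()
for doc in texts:
    tokens = doc.split()
    counts.update(zip(tokens, tokens[1:]))

print(counts.most_common(2))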

Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.

License

MIT

