
Sklearn-based NLP models

Project description

Old Fashioned NLP


This package aims to bring old-fashioned NLP pipelines back into your modeling workflow, providing a baseline reference before you move on to a transformer model.

Installation

pip install git+https://github.com/ChenghaoMou/old-fashioned-nlp.git
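
The package is also published on PyPI (version 0.1.3 at the time of writing), so installing the released version should work as well:

pip install old-fashioned-nlp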

Usage

Classification

Currently, we have TfidfLinearSVC and TfidfLDALinearSVC.

from old_fashioned_nlp.classification import TfidfLinearSVC
from sklearn.datasets import fetch_20newsgroups

# Load the 20 newsgroups splits, stripping headers, footers, and quotes so
# the model cannot rely on metadata shortcuts.
data_train = fetch_20newsgroups(subset='train', categories=None,
                                shuffle=True, random_state=42,
                                remove=('headers', 'footers', 'quotes'))

data_test = fetch_20newsgroups(subset='test', categories=None,
                               shuffle=True, random_state=42,
                               remove=('headers', 'footers', 'quotes'))

# Fit the TF-IDF + LinearSVC baseline and report test accuracy.
m = TfidfLinearSVC()
m.fit(data_train.data, data_train.target)
m.score(data_test.data, data_test.target)
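
Because the classifiers expose the standard scikit-learn fit/score interface, they can be dropped into the usual model-selection utilities. A minimal sketch, reusing data_train from above and assuming the estimators also support cloning via get_params like regular scikit-learn estimators:

from sklearn.model_selection import cross_val_score
from old_fashioned_nlp.classification import TfidfLinearSVC

# 3-fold cross-validated accuracy on the raw training texts.
scores = cross_val_score(TfidfLinearSVC(), data_train.data, data_train.target, cv=3)
print(scores.mean())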

Sequence Tagging

We only have CharTfidfTagger right now.

import nltk
from old_fashioned_nlp.tagging import CharTfidfTagger

# CoNLL-2002 provides Spanish sentences annotated with POS and NER tags.
nltk.download('conll2002')

train_sents = list(nltk.corpus.conll2002.iob_sents('esp.train'))
test_sents = list(nltk.corpus.conll2002.iob_sents('esp.testb'))  # held-out test split

# Each sentence is a list of (token, pos, ner) triples; unzip them into
# parallel sequences of tokens, POS tags, and NER tags.
train_tokens, train_pos, train_ner = zip(*[zip(*e) for e in train_sents])
test_tokens, test_pos, test_ner = zip(*[zip(*e) for e in test_sents])

model = CharTfidfTagger()
model.fit(train_tokens, train_pos)
model.score(test_tokens, test_pos)
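
Assuming CharTfidfTagger also exposes a predict method in the usual scikit-learn style (not shown in the snippet above, so treat this as a sketch), tagging a new tokenized sentence would look like:

# Predict POS tags for a new tokenized sentence.
pred_tags = model.predict([["El", "gato", "come", "pescado"]])
print(pred_tags)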

Regression

Similar to classification, we have TfidfLinearSVR and TfidfLDALinearSVR.
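
A minimal sketch of the regression variant, assuming the import path mirrors old_fashioned_nlp.classification and that TfidfLinearSVR follows the same fit/score interface (the texts and targets below are toy data for illustration):

from old_fashioned_nlp.regression import TfidfLinearSVR  # import path assumed

texts = ["great value for the price", "terrible, would not buy again", "decent but overpriced"]
ratings = [4.5, 1.0, 3.0]

reg = TfidfLinearSVR()
reg.fit(texts, ratings)
reg.score(texts, ratings)  # R^2 on the (toy) training data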

Text Cleaning

CleanTextTransformer can be plugged into any sklearn pipeline.

# CleanTextTransformer ships with this package; the exact import path is not
# shown in the original snippet, so adjust it to match your installed version.
from old_fashioned_nlp import CleanTextTransformer

transformer = CleanTextTransformer(
    replace_dates_with='DATE',
    replace_times_with='TIME',
    replace_emails_with='EMAIL',
    replace_numbers_with='NUMBER',
    replace_percentages_with='PERCENT',
    replace_money_with='MONEY',
    replace_hashtags_with='HASHTAG',
    replace_handles_with='HANDLE',
    expand_contractions=True
)
# Replace dates, times, emails, numbers, percentages, money amounts, hashtags,
# and handles in the raw text with the placeholder tokens configured above.
transformer.transform(["#now @me I'll log 80% entries are due by January 4th, 2017at 8:00pm contact me at chenghao@armorblox.com send me $500.00 now 3,415"])
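
Since the transformer follows the scikit-learn transform interface, it can sit in front of any of the estimators above. A minimal sketch, reusing data_train and data_test from the classification example (the CleanTextTransformer import path is assumed, as noted above):

from sklearn.pipeline import Pipeline
from old_fashioned_nlp.classification import TfidfLinearSVC

# Clean the raw text first, then feed it to the TF-IDF + LinearSVC baseline.
pipeline = Pipeline([
    ('clean', CleanTextTransformer(replace_numbers_with='NUMBER',
                                   replace_emails_with='EMAIL')),
    ('clf', TfidfLinearSVC()),
])
pipeline.fit(data_train.data, data_train.target)
pipeline.score(data_test.data, data_test.target)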

Benchmarks

Classification

All scores are test-set scores on datasets loaded with Hugging Face's nlp library. See the benchmarks directory for details.

SOGOU

              precision    recall  f1-score   support

           0       0.96      0.95      0.95     12000
           1       0.93      0.95      0.94     12000
           2       0.95      0.97      0.96     12000
           3       0.95      0.96      0.96     12000
           4       0.96      0.92      0.94     12000

    accuracy                           0.95     60000
   macro avg       0.95      0.95      0.95     60000
weighted avg       0.95      0.95      0.95     60000

GLUE/COLA

              precision    recall  f1-score   support

           0       0.00      0.00      0.00       322
           1       0.69      1.00      0.82       721

    accuracy                           0.69      1043
   macro avg       0.35      0.50      0.41      1043
weighted avg       0.48      0.69      0.57      1043

GLUE/SST2

              precision    recall  f1-score   support

           0       0.84      0.77      0.80       428
           1       0.79      0.86      0.82       444

    accuracy                           0.81       872
   macro avg       0.82      0.81      0.81       872
weighted avg       0.82      0.81      0.81       872

Yelp

              precision    recall  f1-score   support

           0       0.94      0.94      0.94     19000
           1       0.94      0.94      0.94     19000

    accuracy                           0.94     38000
   macro avg       0.94      0.94      0.94     38000
weighted avg       0.94      0.94      0.94     38000

AG News

              precision    recall  f1-score   support

           0       0.94      0.91      0.92      1900
           1       0.96      0.98      0.97      1900
           2       0.90      0.89      0.89      1900
           3       0.89      0.91      0.90      1900

    accuracy                           0.92      7600
   macro avg       0.92      0.92      0.92      7600
weighted avg       0.92      0.92      0.92      7600

allocine

              precision    recall  f1-score   support

           0       0.93      0.93      0.93     10408
           1       0.92      0.93      0.92      9592

    accuracy                           0.93     20000
   macro avg       0.93      0.93      0.93     20000
weighted avg       0.93      0.93      0.93     20000

Tagging

Default CharTfidfTagger

CONLL POS score: 0.5835184323399495
CONLL NER score: 0.15840812513116917

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

old-fashioned-nlp-0.1.3.tar.gz (10.3 kB)

Uploaded Source

Built Distribution

old_fashioned_nlp-0.1.3-py3-none-any.whl (12.9 kB)

Uploaded Python 3

File details

Details for the file old-fashioned-nlp-0.1.3.tar.gz.

File metadata

  • Download URL: old-fashioned-nlp-0.1.3.tar.gz
  • Upload date:
  • Size: 10.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.6.1 requests/2.24.0 setuptools/47.1.0 requests-toolbelt/0.9.1 tqdm/4.51.0 CPython/3.7.9

File hashes

Hashes for old-fashioned-nlp-0.1.3.tar.gz
Algorithm Hash digest
SHA256 d832ac3ddea93b4f764a42e845a2fbcea43c6956a2a2975737197c752d3f2103
MD5 fdb4079e2495cb7f454b866250199cc0
BLAKE2b-256 42c46e481d2ffc5ad502c64722cb9529a66e853e395483031282329c2bb16042

See more details on using hashes here.

File details

Details for the file old_fashioned_nlp-0.1.3-py3-none-any.whl.

File metadata

  • Download URL: old_fashioned_nlp-0.1.3-py3-none-any.whl
  • Upload date:
  • Size: 12.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.6.1 requests/2.24.0 setuptools/47.1.0 requests-toolbelt/0.9.1 tqdm/4.51.0 CPython/3.7.9

File hashes

Hashes for old_fashioned_nlp-0.1.3-py3-none-any.whl
Algorithm Hash digest
SHA256 fbd88490778e2cce8544ec2c8a0104c7280719a9aa7d2873b95eb7fb917eda57
MD5 30f1ecb3887e40158100f499d9be6fe4
BLAKE2b-256 ffa009db31f6c3608d9034f67f8c177c31ae8b96779e84e2b8f35d4554652960

See more details on using hashes here.
