Sklearn-based NLP models
Old Fashioned NLP
This package aims to bring old-fashioned NLP pipelines back into your modeling workflow, providing a baseline reference before you move on to a transformer model.
Installation
pip install git+https://github.com/ChenghaoMou/old-fashioned-nlp.git
Usage
Classification
Currently, we have TfidfLinearSVC and TfidfLDALinearSVC.
from old_fashioned_nlp.classification import TfidfLinearSVC
from sklearn.datasets import fetch_20newsgroups

# Load the 20 newsgroups train and test splits without headers, footers, or quotes.
data_train = fetch_20newsgroups(subset='train', categories=None,
                                shuffle=True, random_state=42,
                                remove=('headers', 'footers', 'quotes'))
data_test = fetch_20newsgroups(subset='test', categories=None,
                               shuffle=True, random_state=42,
                               remove=('headers', 'footers', 'quotes'))

# Fit the baseline on the training split and score it on the test split.
m = TfidfLinearSVC()
m.fit(data_train.data, data_train.target)
m.score(data_test.data, data_test.target)
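For readers who want to see what such a baseline might look like in plain scikit-learn, here is a minimal sketch. It assumes, based only on the class name, that TfidfLinearSVC amounts to a TF-IDF vectorizer feeding a linear SVM; the package's actual implementation and defaults may differ. The toy texts and labels are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy data, only for illustration.
texts = [
    "the team won the match",
    "a great goal in the final game",
    "the striker scored twice in the match",
    "the new phone has a fast processor",
    "this laptop ships with more memory",
    "the processor and memory were upgraded",
]
labels = [0, 0, 0, 1, 1, 1]

# Assumed equivalent of the baseline: TF-IDF features into a linear SVM.
baseline = make_pipeline(TfidfVectorizer(), LinearSVC())
baseline.fit(texts, labels)

# predict returns one label per input text; score returns mean accuracy.
predictions = baseline.predict(["the game ended with a late goal", "a faster processor"])
train_accuracy = baseline.score(texts, labels)
```

Because the whole thing is a single scikit-learn pipeline, it can be passed to tools such as cross-validation or grid search like any other estimator.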
Sequence Tagging
We only have CharTfidfTagger right now.
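import nltk
from old_fashioned_nlp.tagging import CharTfidfTagger

nltk.download('conll2002')

# Each sentence is a list of (token, POS tag, NER tag) triples.
train_sents = list(nltk.corpus.conll2002.iob_sents('esp.train'))
# Assumption: evaluate on the 'esp.testb' split; the original snippet did not define the test data.
test_sents = list(nltk.corpus.conll2002.iob_sents('esp.testb'))
train_tokens, train_pos, train_ner = zip(*[zip(*e) for e in train_sents])
test_tokens, test_pos, test_ner = zip(*[zip(*e) for e in test_sents])

model = CharTfidfTagger()
model.fit(train_tokens, train_pos)
model.score(test_tokens, test_pos)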
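As a rough illustration of the idea the name suggests, the sketch below tags each token independently from character n-gram TF-IDF features. This is an assumption drawn from the class name, not the package's implementation; `CharNgramTaggerSketch`, the toy sentences, and the tag names are all hypothetical.

```python
from itertools import chain

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC


class CharNgramTaggerSketch:
    """Hypothetical stand-in: tags each token in isolation from its characters."""

    def __init__(self):
        self.pipeline = make_pipeline(
            TfidfVectorizer(analyzer="char_wb", ngram_range=(1, 3), lowercase=False),
            LinearSVC(),
        )

    def fit(self, sentences, tag_sequences):
        # Flatten the sentences so that every token becomes one training example.
        tokens = list(chain.from_iterable(sentences))
        tags = list(chain.from_iterable(tag_sequences))
        self.pipeline.fit(tokens, tags)
        return self

    def predict(self, sentences):
        # Predict per token, then regroup the predicted tags by sentence.
        return [list(self.pipeline.predict(list(sentence))) for sentence in sentences]


# Made-up toy data with made-up tag names.
sentences = [("El", "perro", "come"), ("La", "gata", "duerme")]
tags = [("DET", "NOUN", "VERB"), ("DET", "NOUN", "VERB")]

tagger = CharNgramTaggerSketch().fit(sentences, tags)
predicted = tagger.predict([("El", "gato", "come")])
```

A tagger of this shape sees no surrounding context, so it can only serve as a very rough baseline.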
Regression
Similar to classification, we have TfidfLinearSVR and TfidfLDALinearSVR.
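A comparable regression baseline can be sketched in plain scikit-learn. As with the classifier, this assumes from the name alone that TfidfLinearSVR pairs a TF-IDF vectorizer with a linear support vector regressor; the review texts and ratings below are invented.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVR

# Hypothetical toy data: short reviews with numeric ratings.
texts = [
    "great food and friendly staff",
    "really great service",
    "terrible food and rude staff",
    "awful slow service",
]
ratings = [5.0, 4.5, 1.0, 1.5]

# Assumed equivalent of the baseline: TF-IDF features into a linear SVR.
regressor = make_pipeline(TfidfVectorizer(), LinearSVR(random_state=0))
regressor.fit(texts, ratings)

# predict returns one continuous value per input text.
predicted = regressor.predict(["great food"])
```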
Text Cleaning
CleanTextTransformer can be plugged into any sklearn pipeline.
# CleanTextTransformer must be imported from old_fashioned_nlp first;
# the module path is not given here.
transformer = CleanTextTransformer(
    replace_dates_with='DATE',
    replace_times_with='TIME',
    replace_emails_with='EMAIL',
    replace_numbers_with='NUMBER',
    replace_percentages_with='PERCENT',
    replace_money_with='MONEY',
    replace_hashtags_with='HASHTAG',
    replace_handles_with='HANDLE',
    expand_contractions=True
)
transformer.transform(["#now @me I'll log 80% entries are due by January 4th, 2017at 8:00pm contact me at chenghao@armorblox.com send me $500.00 now 3,415"])
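To show what "plugged into any sklearn pipeline" means in practice, here is a small sketch using a hypothetical stand-in transformer. `RegexCleanerSketch` and its two regular expressions are illustrative only and are not the package's CleanTextTransformer; the point is that any object with `fit` and `transform` can sit in front of a vectorizer.

```python
import re

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline


class RegexCleanerSketch(BaseEstimator, TransformerMixin):
    """Hypothetical cleaning step; not the package's CleanTextTransformer."""

    EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")
    PERCENT = re.compile(r"\b\d+(?:\.\d+)?%")

    def __init__(self, replace_emails_with="EMAIL", replace_percentages_with="PERCENT"):
        self.replace_emails_with = replace_emails_with
        self.replace_percentages_with = replace_percentages_with

    def fit(self, X, y=None):
        # Stateless: there is nothing to learn from the data.
        return self

    def transform(self, X):
        cleaned = []
        for text in X:
            text = self.EMAIL.sub(self.replace_emails_with, text)
            text = self.PERCENT.sub(self.replace_percentages_with, text)
            cleaned.append(text)
        return cleaned


# Standalone use: replace e-mail addresses and percentages with placeholders.
cleaned = RegexCleanerSketch().transform(
    ["80% of entries are due, contact me at chenghao@armorblox.com"]
)

# Pipeline use: the cleaning step runs before TF-IDF vectorization.
pipeline = Pipeline([("clean", RegexCleanerSketch()), ("tfidf", TfidfVectorizer())])
features = pipeline.fit_transform(["mail a@b.com about 5%", "mail c@d.org about 7.5%"])
```

Normalizing such tokens before vectorization keeps the vocabulary small, since every address or percentage collapses to one placeholder.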
Benchmarks
Classification
All scores are test scores on datasets from the Hugging Face nlp library. See the benchmarks directory for details.
SOGOU
              precision    recall  f1-score   support
           0       0.96      0.95      0.95     12000
           1       0.93      0.95      0.94     12000
           2       0.95      0.97      0.96     12000
           3       0.95      0.96      0.96     12000
           4       0.96      0.92      0.94     12000
    accuracy                           0.95     60000
   macro avg       0.95      0.95      0.95     60000
weighted avg       0.95      0.95      0.95     60000
GLUE/COLA
              precision    recall  f1-score   support
           0       0.00      0.00      0.00       322
           1       0.69      1.00      0.82       721
    accuracy                           0.69      1043
   macro avg       0.35      0.50      0.41      1043
weighted avg       0.48      0.69      0.57      1043
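The GLUE/COLA figures are consistent with a model that predicts class 1 for every example. This is an inference from the numbers, not something stated in the benchmark itself. The short check below recomputes the reported values from the class supports under that assumption.

```python
# Supports taken from the GLUE/COLA report: 322 examples of class 0, 721 of class 1.
support = {0: 322, 1: 721}
total = sum(support.values())

# Assumption under test: the classifier predicts class 1 for every example.
# Class 0 is never predicted, so its precision is reported as 0.0 by convention.
precision = {0: 0.0, 1: support[1] / total}
recall = {0: 0.0, 1: 1.0}
f1 = {
    label: 0.0
    if precision[label] + recall[label] == 0
    else 2 * precision[label] * recall[label] / (precision[label] + recall[label])
    for label in support
}

accuracy = support[1] / total
macro_f1 = sum(f1.values()) / len(f1)
weighted_f1 = sum(f1[label] * support[label] for label in support) / total

print(round(accuracy, 2), round(f1[1], 2), round(macro_f1, 2), round(weighted_f1, 2))
# prints: 0.69 0.82 0.41 0.57
```

In other words, the 0.69 accuracy here equals the majority-class rate (721 / 1043), so this row is best read as a majority-class baseline.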
GLUE/SST2
              precision    recall  f1-score   support
           0       0.84      0.77      0.80       428
           1       0.79      0.86      0.82       444
    accuracy                           0.81       872
   macro avg       0.82      0.81      0.81       872
weighted avg       0.82      0.81      0.81       872
Yelp
              precision    recall  f1-score   support
           0       0.94      0.94      0.94     19000
           1       0.94      0.94      0.94     19000
    accuracy                           0.94     38000
   macro avg       0.94      0.94      0.94     38000
weighted avg       0.94      0.94      0.94     38000
AG News
              precision    recall  f1-score   support
           0       0.94      0.91      0.92      1900
           1       0.96      0.98      0.97      1900
           2       0.90      0.89      0.89      1900
           3       0.89      0.91      0.90      1900
    accuracy                           0.92      7600
   macro avg       0.92      0.92      0.92      7600
weighted avg       0.92      0.92      0.92      7600
allocine
              precision    recall  f1-score   support
           0       0.93      0.93      0.93     10408
           1       0.92      0.93      0.92      9592
    accuracy                           0.93     20000
   macro avg       0.93      0.93      0.93     20000
weighted avg       0.93      0.93      0.93     20000
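The reports above follow the layout of scikit-learn's `classification_report`. A minimal sketch of producing one for your own model is shown below; `y_true` and `y_pred` are made-up stand-ins for the test labels and the model's predictions, and obtaining predictions from the package's classifiers assumes they expose a scikit-learn style `predict` method.

```python
from sklearn.metrics import classification_report

# Hypothetical labels and predictions, only for illustration.
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]

# Text form, laid out like the benchmark tables.
report_text = classification_report(y_true, y_pred)
print(report_text)

# Dictionary form, convenient for logging or comparing runs programmatically.
report = classification_report(y_true, y_pred, output_dict=True)
```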
Tagging
Default CharTfidfTagger
CONLL POS score: 0.5835184323399495
CONLL NER score: 0.15840812513116917