
TF-IDF + LogReg baseline for text classification

Text Classification Baseline

Pipeline for quickly building text classification baselines with TF-IDF + LogReg.

Usage

Instead of writing custom code for a specific text classification task, you only need to:

  1. install the pipeline:
pip install text-classification-baseline
  2. run the pipeline:
  • either in terminal:
text-clf-train --path_to_config config.yaml
  • or in python:
import text_clf

model, target_names_mapping = text_clf.train(path_to_config="config.yaml")

NOTE: the config file is described in the Config section below.

No data preparation is needed: the pipeline only requires a CSV file with two raw columns (with arbitrary names):

  • text
  • target

The target can be in any format, including text; it does not have to be integers from 0 to n_classes-1.
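As an illustration of the expected input, the following sketch builds a tiny two-column dataset with the standard library. The column names text and target, the helper write_dataset, and the rows themselves are made up for the example; any column names work as long as the config points at them.

```python
import csv
import io

# Made-up rows for illustration: raw text plus a target in any format (here, strings).
rows = [
    {"text": "The delivery was fast and the product works great", "target": "positive"},
    {"text": "Broken on arrival, support never answered", "target": "negative"},
    {"text": "Does what it says", "target": "neutral"},
]


def write_dataset(rows, fileobj, sep=","):
    """Write rows as a two-column CSV with a header line."""
    writer = csv.DictWriter(fileobj, fieldnames=["text", "target"], delimiter=sep)
    writer.writeheader()
    writer.writerows(rows)


# An in-memory buffer keeps the example self-contained.
buffer = io.StringIO()
write_dataset(rows, buffer)
csv_text = buffer.getvalue()
```

To create a real file, pass a file object such as open("data/train.csv", "w", newline="", encoding="utf-8") instead of the buffer.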

Config

The user interface consists of two files:

  • config.yaml - general configuration with sklearn TF-IDF and LogReg parameters
  • hyperparams.py - sklearn GridSearchCV parameters

Edit config.yaml and hyperparams.py to create the desired configuration, then train a text classification model with one of the following commands:

  • terminal:
text-clf-train --path_to_config config.yaml
  • python:
import text_clf

model, target_names_mapping = text_clf.train(path_to_config="config.yaml")

Default config.yaml:

seed: 42
path_to_save_folder: models
experiment_name: model

# data
data:
  train_data_path: data/train.csv
  test_data_path: data/test.csv
  sep: ','
  text_column: text
  target_column: target_name_short

# preprocessing
# (included in resulting model pipeline, so preserved for inference)
preprocessing:
  lemmatization: null  # pymorphy2

# tf-idf
tf-idf:
  lowercase: true
  ngram_range: (1, 1)
  max_df: 1.0
  min_df: 1

# logreg
logreg:
  penalty: l2
  C: 1.0
  class_weight: balanced
  solver: saga
  n_jobs: -1

# grid-search
grid-search:
  do_grid_search: false
  grid_search_params_path: hyperparams.py

NOTE: grid search is disabled by default; to use it, set do_grid_search: true.
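The contents of hyperparams.py are not reproduced in this README. As a rough sketch only, a file that parameterizes sklearn GridSearchCV could look like the following. The variable name grid_search_params and the step prefixes tf-idf__ and logreg__ are assumptions made for illustration, so check the file shipped with the package for the real layout.

```python
# hyperparams.py (hypothetical layout, not copied from the package)

grid_search_params = {
    # GridSearchCV's param_grid: keys follow sklearn's "<step>__<parameter>" convention.
    # The step names "tf-idf" and "logreg" are assumed here.
    "param_grid": {
        "tf-idf__ngram_range": [(1, 1), (1, 2)],
        "tf-idf__min_df": [1, 2, 5],
        "logreg__C": [0.1, 1.0, 10.0],
    },
    # The remaining keys are regular GridSearchCV arguments.
    "scoring": "f1_weighted",
    "cv": 3,
    "n_jobs": -1,
}

# Number of parameter combinations the grid would evaluate (before cross-validation folds).
n_candidates = 1
for values in grid_search_params["param_grid"].values():
    n_candidates *= len(values)
```

Each combination is fitted once per cross-validation fold, so the number of fits grows quickly with the grid size; a small grid is usually enough for a baseline.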

NOTE: the tf-idf and logreg sections hold sklearn TfidfVectorizer and LogisticRegression parameters respectively, so you can parameterize instances of these classes however you want. The same applies to grid-search, which is sklearn GridSearchCV parameterized with hyperparams.py.
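One detail worth noting: YAML has no tuple type, so a value such as ngram_range: (1, 1) is typically read as the string "(1, 1)" by YAML loaders. The sketch below shows one way a config section could be turned into keyword arguments. The helper to_kwargs is hypothetical and is not claimed to be how the package performs the conversion.

```python
import ast

# The tf-idf and logreg sections of the default config,
# as a YAML loader would typically return them.
config = {
    "tf-idf": {"lowercase": True, "ngram_range": "(1, 1)", "max_df": 1.0, "min_df": 1},
    "logreg": {
        "penalty": "l2",
        "C": 1.0,
        "class_weight": "balanced",
        "solver": "saga",
        "n_jobs": -1,
    },
}


def to_kwargs(section):
    """Copy a config section, turning tuple-like strings such as "(1, 1)" into tuples."""
    kwargs = {}
    for name, value in section.items():
        if isinstance(value, str) and value.startswith("(") and value.endswith(")"):
            value = ast.literal_eval(value)  # parses Python literals only, no code execution
        kwargs[name] = value
    return kwargs


tfidf_kwargs = to_kwargs(config["tf-idf"])
logreg_kwargs = to_kwargs(config["logreg"])
```

The resulting dictionaries can be passed to sklearn directly, for example TfidfVectorizer(**tfidf_kwargs) and LogisticRegression(**logreg_kwargs).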

Output

After training the model, the pipeline saves the following files:

  • model.joblib - sklearn pipeline with TF-IDF and LogReg steps
  • target_names.json - mapping from encoded target labels (0 to n_classes-1) to their names
  • config.yaml - config that was used to train the model
  • hyperparams.py - grid-search parameters (if grid-search was used)
  • logging.txt - logging file
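For inference with the saved artifacts, encoded predictions have to be mapped back to the original target names. The sketch below assumes that target_names.json is a flat JSON object whose keys are the encoded labels (JSON object keys are always strings) and whose values are the names. The exact layout of the file is an assumption, the helper names are hypothetical, and the sample mapping is made up.

```python
import json

# Stand-in for the contents of target_names.json (assumed layout, made-up labels).
target_names_json = '{"0": "negative", "1": "neutral", "2": "positive"}'


def load_target_names(raw_json):
    """Parse the mapping and convert the string keys back to integer labels."""
    return {int(label): name for label, name in json.loads(raw_json).items()}


def decode_predictions(encoded_labels, target_names):
    """Translate encoded predictions (0 to n_classes-1) into target names."""
    return [target_names[int(label)] for label in encoded_labels]


target_names = load_target_names(target_names_json)
decoded = decode_predictions([2, 0, 0, 1], target_names)
```

With real files, the encoded labels would come from the saved sklearn pipeline, for example by loading model.joblib with joblib.load and calling predict on a list of texts.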

Additional functions

  • text_clf.token_frequency.get_token_frequency(path_to_config) -
    get token frequency of train dataset according to the config file parameters
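Conceptually, a token frequency table counts how often each token occurs in the training texts. The following is a simplified stand-in, not the package's implementation: it lowercases the text and keeps word tokens of two or more characters (the same pattern as sklearn's default token_pattern), whereas get_token_frequency follows the settings in the config file. The function name token_frequency and the sample texts are made up.

```python
import re
from collections import Counter


def token_frequency(texts):
    """Count lowercase word tokens of two or more characters across all texts."""
    counter = Counter()
    for text in texts:
        counter.update(re.findall(r"\b\w\w+\b", text.lower()))
    return counter


texts = ["Good product, good price", "Bad product"]
frequency = token_frequency(texts)

# The two most frequent tokens with their counts.
most_common = frequency.most_common(2)
```

Such a table is handy for choosing min_df and max_df, since it shows which tokens are too rare or too common to be informative.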

Only for binary classifiers:

  • text_clf.pr_roc_curve.get_precision_recall_curve(path_to_model_folder) -
    get precision and recall metrics for precision-recall curve
  • text_clf.pr_roc_curve.get_roc_curve(path_to_model_folder) -
    get false positive rate (fpr) and true positive rate (tpr) metrics for roc curve
  • text_clf.pr_roc_curve.plot_precision_recall_curve(precision, recall) -
    plot precision-recall curve
  • text_clf.pr_roc_curve.plot_roc_curve(fpr, tpr) -
    plot roc curve
  • text_clf.pr_roc_curve.plot_precision_recall_f1_curves_for_thresholds(precision, recall, thresholds) -
    plot precision, recall, f1-score curves for probability thresholds
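The F1-score at each threshold is the harmonic mean of precision and recall, F1 = 2PR / (P + R). A minimal sketch of that computation on made-up values follows; the helper name f1_scores is hypothetical and the numbers are not real results.

```python
def f1_scores(precision, recall):
    """Return the harmonic mean of precision and recall for each pair of values."""
    scores = []
    for p, r in zip(precision, recall):
        # Define F1 as 0 when both precision and recall are 0 to avoid division by zero.
        scores.append(2 * p * r / (p + r) if (p + r) > 0 else 0.0)
    return scores


# Made-up metrics for three probability thresholds.
thresholds = [0.3, 0.5, 0.7]
precision = [0.5, 0.8, 1.0]
recall = [1.0, 0.8, 0.0]

f1 = f1_scores(precision, recall)

# Threshold with the highest F1-score.
best_threshold = thresholds[max(range(len(f1)), key=f1.__getitem__)]
```

Plotting the three curves against the thresholds makes it easy to pick a decision threshold other than the default 0.5 for a binary classifier.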

Requirements

Python >= 3.6

Citation

If you use text-classification-baseline in a scientific publication, we would appreciate a reference to the following BibTeX entry:

@misc{dayyass2021textclf,
    author       = {El-Ayyass, Dani},
    title        = {Pipeline for training text classification baselines},
    howpublished = {\url{https://github.com/dayyass/text-classification-baseline}},
    year         = {2021}
}
