Skip to main content

TF-IDF + LogReg baseline for text classification

Project description

tests linter codecov

python 3.6 release (latest by date) license

pre-commit code style: black

pypi version pypi downloads

Text Classification Baseline

Pipeline for fast building text classification TF-IDF + LogReg baselines.

Usage

Instead of writing custom code for specific text classification task, you just need:

  1. install pipeline:
pip install text-classification-baseline
  1. run pipeline:
  • either in terminal:
text-clf-train
  • or in python:
import text_clf

text_clf.train()

No data preparation is needed, only a csv file with two raw columns (with arbitrary names):

  • text
  • target

NOTE: the target can be presented in any format, including text - not necessarily integers from 0 to n_classes-1.

Config

The user interface consists of only one file config.yaml.

Change config.yaml to create the desired configuration and train text classification model with the following command:

  • terminal:
text-clf-train --path_to_config config.yaml
  • python:
import text_clf

text_clf.train(path_to_config="config.yaml")

Default config.yaml:

seed: 42
verbose: true
path_to_save_folder: models

# data
data:
  train_data_path: data/train.csv
  valid_data_path: data/valid.csv
  sep: ','
  text_column: text
  target_column: target_name_short

# tf-idf
tf-idf:
  lowercase: true
  ngram_range: (1, 1)
  max_df: 1.0
  min_df: 0.0

# logreg
logreg:
  penalty: l2
  C: 1.0
  class_weight: balanced
  solver: saga
  multi_class: auto
  n_jobs: -1

NOTE: tf-idf and logreg are sklearn TfidfVectorizer and LogisticRegression parameters correspondingly, so you can parameterize instances of these classes however you want.

Output

After training the model, the pipeline will return the following files:

  • model.joblib - sklearn pipeline with TF-IDF and LogReg steps
  • target_names.json - mapping from encoded target labels from 0 to n_classes-1 to it names
  • config.yaml - config that was used to train the model
  • logging.txt - logging file

Requirements

Python >= 3.6

Citation

If you use text-classification-baseline in a scientific publication, we would appreciate references to the following BibTex entry:

@misc{dayyass2021textclf,
    author       = {El-Ayyass, Dani},
    title        = {Pipeline for training text classification baselines},
    howpublished = {\url{https://github.com/dayyass/text-classification-baseline}},
    year         = {2021}
}

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

text-classification-baseline-0.1.1.tar.gz (7.5 kB view details)

Uploaded Source

Built Distribution

File details

Details for the file text-classification-baseline-0.1.1.tar.gz.

File metadata

  • Download URL: text-classification-baseline-0.1.1.tar.gz
  • Upload date:
  • Size: 7.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.2 importlib_metadata/4.6.3 pkginfo/1.7.1 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.0 CPython/3.9.5

File hashes

Hashes for text-classification-baseline-0.1.1.tar.gz
Algorithm Hash digest
SHA256 c8a3a9361c4ebec85aa7b5ebaa8f674581d57eb0277334235adb88c403bbc2a7
MD5 95632d8dde6d85b53c56765e445eba1a
BLAKE2b-256 fce92d82325444a4dd15614dc08f6e653067bfabc079754b0d7868bdfba9009d

See more details on using hashes here.

File details

Details for the file text_classification_baseline-0.1.1-py3-none-any.whl.

File metadata

  • Download URL: text_classification_baseline-0.1.1-py3-none-any.whl
  • Upload date:
  • Size: 10.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.2 importlib_metadata/4.6.3 pkginfo/1.7.1 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.0 CPython/3.9.5

File hashes

Hashes for text_classification_baseline-0.1.1-py3-none-any.whl
Algorithm Hash digest
SHA256 8169479696908b332a06705fb8a7c543f08dc85e51fd2470d5352955a8c349e2
MD5 d4ae0ba733bc28f7609598fcc06d14fe
BLAKE2b-256 f2c302d78808ffe6d3a955e31bd6a33b12c7cddb13c92f7e774236b53c53bde1

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page