TF-IDF + LogReg baseline for text classification

Text Classification Baseline

Pipeline for building text classification TF-IDF + LogReg baselines using sklearn.

Usage

Instead of writing custom code for a specific text classification task, you only need to:

  1. install pipeline:
    pip install text-classification-baseline
  2. run pipeline:

    • either in terminal:
    text-clf --config config.yaml
    
    • or in python:
    import text_clf
    
    text_clf.train(path_to_config="config.yaml")
    

No data preparation is needed: the pipeline only requires a CSV file with two raw columns (with arbitrary names):

  • text
  • target

NOTE: the target can be presented in any format, including text - not necessarily integers from 0 to n_classes-1.
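
As an illustration of the expected input, here is a minimal sketch that writes such a two-column CSV file with the standard library. The helper name write_dataset, the sample rows, and the column names are hypothetical; the column names only have to match text_column and target_column in config.yaml.

```python
import csv
from pathlib import Path


def write_dataset(path, rows, text_column="text", target_column="target"):
    """Write (text, target) pairs to a CSV file with a header row.

    Hypothetical helper, not part of text-classification-baseline.
    """
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)  # quotes fields that contain the separator
        writer.writerow([text_column, target_column])
        writer.writerows(rows)
    return path


# Made-up sample data: raw text plus a target given as a plain string label.
rows = [
    ("great phone, battery lasts two days", "positive"),
    ("screen cracked after a week", "negative"),
    ("arrived on time, works as described", "positive"),
]

# Example call (commented out so the sketch has no side effects):
# write_dataset("data/train.csv", rows)
```

The targets stay as plain strings here, since the note above says they do not have to be integer-encoded.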

Config

The user interface consists of only one file config.yaml.

Change config.yaml to create the desired configuration and train text classification model.

Default config.yaml:

seed: 42
verbose: true
path_to_save_folder: models

# data
data:
  train_data_path: data/train.csv
  valid_data_path: data/valid.csv
  sep: ','
  text_column: text
  target_column: target_name_short

# tf-idf
tf-idf:
  lowercase: true
  ngram_range: (1, 1)
  max_df: 1.0
  min_df: 0.0

# logreg
logreg:
  penalty: l2
  C: 1.0
  class_weight: balanced
  solver: saga
  multi_class: auto
  n_jobs: -1
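
The tf-idf and logreg sections use the parameter names of sklearn's TfidfVectorizer and LogisticRegression. One detail worth noting: a YAML parser reads ngram_range: (1, 1) as the string "(1, 1)", not as a tuple. The sketch below shows one way to turn the parsed config into keyword arguments. It is an assumption about how this could be done, not the package's actual code; the function name sklearn_kwargs and the reuse of seed as random_state are hypothetical.

```python
import ast


def sklearn_kwargs(config):
    """Split a parsed config dict into kwargs for the two sklearn steps.

    Hypothetical helper; the real package may handle this differently.
    """
    tfidf = dict(config["tf-idf"])  # copy, so the input config is not mutated
    if isinstance(tfidf.get("ngram_range"), str):
        # "(1, 1)" -> (1, 1); literal_eval only accepts Python literals
        tfidf["ngram_range"] = ast.literal_eval(tfidf["ngram_range"])

    logreg = dict(config["logreg"])
    # Assumption: the top-level seed is reused as the solver's random_state.
    logreg.setdefault("random_state", config.get("seed"))
    return tfidf, logreg


# The default config as a YAML loader would return it (data section omitted).
config = {
    "seed": 42,
    "tf-idf": {"lowercase": True, "ngram_range": "(1, 1)", "max_df": 1.0, "min_df": 0.0},
    "logreg": {
        "penalty": "l2",
        "C": 1.0,
        "class_weight": "balanced",
        "solver": "saga",
        "multi_class": "auto",
        "n_jobs": -1,
    },
}

tfidf_kwargs, logreg_kwargs = sklearn_kwargs(config)
```

The resulting dicts could then be passed as TfidfVectorizer(**tfidf_kwargs) and LogisticRegression(**logreg_kwargs). Newer scikit-learn releases deprecate multi_class, so that key may need to be dropped there.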

Output

After training the model, the pipeline will return the following files:

  • model.joblib - sklearn pipeline with TF-IDF and LogReg steps
  • target_names.json - mapping from the encoded target labels (0 to n_classes-1) to their original names
  • config.yaml - config that was used to train the model
  • logging.txt - logging file
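
To show how the label mapping could be used, here is a minimal sketch that converts encoded predictions back to the original target names. The exact layout of target_names.json is an assumption: the sketch expects a JSON object keyed by the encoded label, such as {"0": "negative", "1": "positive"}. The function names are hypothetical.

```python
import json


def load_target_names(path):
    """Load the label mapping, assuming a JSON object keyed by encoded label."""
    with open(path, encoding="utf-8") as f:
        raw = json.load(f)
    # JSON object keys are always strings, so convert them back to integers.
    return {int(key): name for key, name in raw.items()}


def decode(predicted_ids, target_names):
    """Map encoded predictions (0 to n_classes-1) to their original names."""
    return [target_names[int(i)] for i in predicted_ids]
```

With the assumed mapping above, decode([1, 0], target_names) would give ["positive", "negative"]. The encoded predictions themselves would come from the pipeline stored in model.joblib.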

Requirements

Python >= 3.7

Citation

If you use text-classification-baseline in a scientific publication, we would appreciate a reference to the following BibTeX entry:

@misc{dayyass2021textclf,
    author       = {El-Ayyass, Dani},
    title        = {Pipeline for training text classification baselines},
    howpublished = {\url{https://github.com/dayyass/text-classification-baseline}},
    year         = {2021}
}
