TF-IDF + LogReg baseline for text classification
Text Classification Baseline
Pipeline for quickly building text classification baselines with TF-IDF + LogReg.
Usage
Instead of writing custom code for a specific text classification task, you just need to:
- install the pipeline:
  pip install text-classification-baseline
- run the pipeline:
  - either in terminal:
    text-clf-train --path_to_config config.yaml
  - or in python:
    import text_clf
    model, target_names_mapping = text_clf.train(path_to_config="config.yaml")
NOTE: see the Config section for more about the config file.
No data preparation is needed. A csv file with two raw columns (with arbitrary names) is enough:
- text
- target
The target can be in any format, including text; it does not have to be integers from 0 to n_classes-1.
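As a rough illustration of the expected input, the sketch below writes a tiny training file with Python's standard csv module. The column names text and target_name_short follow the default config.yaml; the helper write_dataset, the rows, and the temporary output folder are made up for this example.

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical example rows: raw text plus a target in any format (here, strings).
rows = [
    {"text": "The battery died after two days", "target_name_short": "negative"},
    {"text": "Fast delivery and great quality", "target_name_short": "positive"},
    {"text": "Works as described, nothing special", "target_name_short": "neutral"},
]


def write_dataset(path, rows, text_column="text", target_column="target_name_short"):
    """Write rows to a comma-separated file with one text and one target column."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=[text_column, target_column])
        writer.writeheader()
        writer.writerows(rows)
    return path


# A temporary folder keeps the example side-effect free;
# in a real project this would be something like data/train.csv.
out_dir = Path(tempfile.mkdtemp())
train_path = write_dataset(out_dir / "data" / "train.csv", rows)
```

The csv module quotes the text that contains a comma, so the file stays valid with the default sep: ','.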
Config
The user interface consists of two files:
- config.yaml - general configuration with sklearn TF-IDF and LogReg parameters
- hyperparams.py - sklearn GridSearchCV parameters
Change config.yaml and hyperparams.py to create the desired configuration, then train a text classification model with the following command:
- terminal:
  text-clf-train --path_to_config config.yaml
- python:
  import text_clf
  model, target_names_mapping = text_clf.train(path_to_config="config.yaml")
Default config.yaml:
seed: 42
path_to_save_folder: models
# data
data:
  train_data_path: data/train.csv
  test_data_path: data/test.csv
  sep: ','
  text_column: text
  target_column: target_name_short
# tf-idf
tf-idf:
  lowercase: true
  ngram_range: (1, 1)
  max_df: 1.0
  min_df: 1
# logreg
logreg:
  penalty: l2
  C: 1.0
  class_weight: balanced
  solver: saga
  n_jobs: -1
# grid-search
grid-search:
  do_grid_search: false
  grid_search_params_path: hyperparams.py
NOTE: grid search is disabled by default; to use it, set do_grid_search: true.
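For orientation, a hyperparams.py for grid search might look like the sketch below. This is an assumption, not the package's documented format: the variable names param_grid and grid_search_kwargs and the step prefixes tf-idf__ and logreg__ (sklearn's step__parameter convention for pipelines) are hypothetical, so check the hyperparams.py shipped with the project before relying on them.

```python
# hyperparams.py (hypothetical sketch; names and layout are assumptions)

# Candidate values for sklearn GridSearchCV, keyed as "<pipeline step>__<parameter>".
param_grid = {
    "tf-idf__ngram_range": [(1, 1), (1, 2)],
    "tf-idf__min_df": [1, 2],
    "logreg__C": [0.1, 1.0, 10.0],
}

# Extra keyword arguments one might forward to GridSearchCV itself.
grid_search_kwargs = {"scoring": "f1_macro", "cv": 3, "n_jobs": -1}

# Number of parameter combinations the search would evaluate per CV fold.
n_candidates = 1
for values in param_grid.values():
    n_candidates *= len(values)
```

The number of candidates is the product of the list lengths, so each added value multiplies the number of model fits.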
NOTE: tf-idf and logreg are sklearn TfidfVectorizer and LogisticRegression parameters respectively, so you can parameterize instances of these classes however you want. The same logic applies to grid-search, which is sklearn GridSearchCV parameterized with hyperparams.py.
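To make the mapping between config sections and sklearn classes concrete, here is a minimal sketch of how such a config could be turned into a pipeline. It is not the package's actual implementation: build_pipeline, the step names, and the ngram_range string handling are assumptions, and only a subset of the default logreg values is used.

```python
from ast import literal_eval

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Values mirroring part of the default config.yaml, as they might look after YAML loading.
config = {
    "seed": 42,
    "tf-idf": {"lowercase": True, "ngram_range": "(1, 1)", "max_df": 1.0, "min_df": 1},
    "logreg": {"C": 1.0, "class_weight": "balanced", "solver": "saga"},
}


def build_pipeline(config):
    """Build a TF-IDF + LogReg sklearn pipeline from config sections (hypothetical helper)."""
    tfidf_params = dict(config["tf-idf"])
    # A plain YAML scalar such as (1, 1) loads as a string; convert it to a tuple safely.
    if isinstance(tfidf_params.get("ngram_range"), str):
        tfidf_params["ngram_range"] = literal_eval(tfidf_params["ngram_range"])
    return Pipeline([
        ("tf-idf", TfidfVectorizer(**tfidf_params)),
        ("logreg", LogisticRegression(random_state=config["seed"], **config["logreg"])),
    ])


# Toy data, made up for the example.
pipe = build_pipeline(config)
texts = ["good movie", "bad movie", "great film", "awful film"]
labels = ["pos", "neg", "pos", "neg"]
pipe.fit(texts, labels)
predictions = pipe.predict(["good film"])
```

Because each section is passed through as keyword arguments, any parameter the sklearn class accepts can be added to the matching section without changing code.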
Output
After training the model, the pipeline produces the following files:
- model.joblib - sklearn pipeline with TF-IDF and LogReg steps
- target_names.json - mapping from encoded target labels (0 to n_classes-1) to their names
- config.yaml - config that was used to train the model
- hyperparams.py - grid-search parameters (if grid-search was used)
- logging.txt - logging file
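The label mapping can be used to turn encoded predictions back into readable names. The sketch below assumes target_names.json is a flat JSON object whose keys are the encoded labels (stored as strings, as JSON requires) and whose values are the names; that layout, the helper names, and the class names are assumptions for illustration.

```python
import json
import tempfile
from pathlib import Path


def load_target_names(path):
    """Load an {encoded label -> name} mapping; JSON keys are strings, so cast to int."""
    with open(path, encoding="utf-8") as f:
        raw = json.load(f)
    return {int(label): name for label, name in raw.items()}


def decode(encoded_predictions, target_names):
    """Translate encoded predictions (0..n_classes-1) back to target names."""
    return [target_names[int(label)] for label in encoded_predictions]


# Stand-in for a real target_names.json (assumed layout, made-up class names).
path = Path(tempfile.mkdtemp()) / "target_names.json"
path.write_text(
    json.dumps({"0": "negative", "1": "neutral", "2": "positive"}),
    encoding="utf-8",
)

target_names = load_target_names(path)
decoded = decode([2, 0, 1, 2], target_names)
```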
Requirements
Python >= 3.6
Citation
If you use text-classification-baseline in a scientific publication, we would appreciate a reference to the following BibTeX entry:
@misc{dayyass2021textclf,
    author       = {El-Ayyass, Dani},
    title        = {Pipeline for training text classification baselines},
    howpublished = {\url{https://github.com/dayyass/text-classification-baseline}},
    year         = {2021}
}