TF-IDF + LogReg baseline for text classification
Text Classification Baseline
Pipeline for building text classification TF-IDF + LogReg baselines using sklearn.
Usage
Instead of writing custom code for a specific text classification task, you just need to:
- install pipeline:
    pip install text-classification-baseline
- run pipeline:
  - either in terminal:
      text-clf --config config.yaml
  - or in python:
      import text_clf
      text_clf.train(path_to_config="config.yaml")
No data preparation is needed, only a CSV file with two raw columns (with arbitrary names):
- text
- target
NOTE: the target can be in any format, including text; it does not have to be integers from 0 to n_classes-1.
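As a minimal sketch of the expected input, the snippet below writes such a two-column CSV with Python's standard csv module. The file name train.csv, the helper functions, and the sample rows are illustrative assumptions, not part of the package; whatever column names you choose are the ones to set as text_column and target_column in the config.

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical sample rows: raw text plus a free-form (string) target.
rows = [
    {"text": "The battery lasts all day, great phone!", "target": "positive"},
    {"text": "Screen cracked after a week.", "target": "negative"},
    {"text": "Does the job, nothing special.", "target": "neutral"},
]


def write_dataset(path, rows):
    """Write rows to a comma-separated file with a header line."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["text", "target"])
        writer.writeheader()
        writer.writerows(rows)


def read_dataset(path):
    """Read the file back as a list of dicts keyed by column name."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


# Written to a temporary directory here; in a real project this would be
# the path referenced by train_data_path in the config.
csv_path = Path(tempfile.mkdtemp()) / "train.csv"
write_dataset(csv_path, rows)
loaded = read_dataset(csv_path)
```

The csv module quotes fields that contain the separator, so texts with commas survive the round trip unchanged.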
Config
The user interface consists of a single file, config.yaml.
Edit config.yaml to set the desired configuration, then train the text classification model.
Default config.yaml:
seed: 42
verbose: true
path_to_save_folder: models
# data
data:
  train_data_path: data/train.csv
  valid_data_path: data/valid.csv
  sep: ','
  text_column: text
  target_column: target_name_short
# tf-idf
tf-idf:
  lowercase: true
  ngram_range: (1, 1)
  max_df: 1.0
  min_df: 0.0
# logreg
logreg:
  penalty: l2
  C: 1.0
  class_weight: balanced
  solver: saga
  multi_class: auto
  n_jobs: -1
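To make the default values concrete, here is a rough sketch of the sklearn objects they correspond to. It is not the package's internal code: the step names, the use of seed as random_state, the max_iter value, and the toy data are assumptions of this example. penalty is left at sklearn's default (l2), and multi_class and n_jobs are omitted because multi_class is deprecated in recent sklearn releases and neither changes the outcome on a toy example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Values copied from the tf-idf and logreg sections of the default config.
tfidf_params = {
    "lowercase": True,
    "ngram_range": (1, 1),  # unigrams only
    "max_df": 1.0,          # keep terms that appear in up to 100% of documents
    "min_df": 0.0,          # no lower bound on document frequency
}
logreg_params = {
    "C": 1.0,
    "class_weight": "balanced",
    "solver": "saga",
}

pipeline = Pipeline(
    steps=[
        # Step names are hypothetical; they mirror the config section names.
        ("tf-idf", TfidfVectorizer(**tfidf_params)),
        # random_state=42 (from seed) and max_iter=1000 are assumptions.
        ("logreg", LogisticRegression(random_state=42, max_iter=1000, **logreg_params)),
    ]
)

# Toy data with string targets, purely for illustration.
texts = [
    "great movie, loved it",
    "wonderful acting and story",
    "terrible plot, awful pacing",
    "boring and far too long",
]
targets = ["pos", "pos", "neg", "neg"]

pipeline.fit(texts, targets)
predictions = pipeline.predict(["loved the story", "awful and boring"])
```

sklearn's LogisticRegression accepts string targets directly, so predictions contains the original label strings.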
Output
After training the model, the pipeline produces the following files:
- model.joblib - sklearn pipeline with TF-IDF and LogReg steps
- target_names.json - mapping from encoded target labels (0 to n_classes-1) to their names
- config.yaml - config that was used to train the model
- logging.txt - logging file
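The sketch below shows one way such artifacts could be written and consumed at inference time. To stay self-contained it first trains and saves a toy model. The use of LabelEncoder, the exact JSON layout (string keys "0", "1", ... mapping to label names), the temporary directory, and the sample data are all assumptions of this example; check the files produced by your own run before relying on this layout.

```python
import json
import tempfile
from pathlib import Path

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import LabelEncoder

save_dir = Path(tempfile.mkdtemp())  # stand-in for the real save folder

# --- training side (toy stand-in for the pipeline's output) ---
texts = [
    "cheap flights to rome",
    "new laptop review",
    "hotel deals in paris",
    "best gaming keyboard",
]
raw_targets = ["travel", "tech", "travel", "tech"]

encoder = LabelEncoder()
y = encoder.fit_transform(raw_targets)  # classes are sorted: tech -> 0, travel -> 1

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, y)

joblib.dump(model, save_dir / "model.joblib")
with open(save_dir / "target_names.json", "w", encoding="utf-8") as f:
    # JSON object keys are always strings, so 0 is stored as "0".
    json.dump({i: str(name) for i, name in enumerate(encoder.classes_)}, f)

# --- inference side ---
loaded_model = joblib.load(save_dir / "model.joblib")
with open(save_dir / "target_names.json", encoding="utf-8") as f:
    loaded_names = json.load(f)

encoded = loaded_model.predict(["hotel in paris"])
decoded = [loaded_names[str(label)] for label in encoded]
```

Keeping the label mapping next to the model lets predictions be reported in the original target format rather than as integers.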
Requirements
Python >= 3.7
Citation
If you use text-classification-baseline in a scientific publication, we would appreciate a reference to the following BibTeX entry:
@misc{dayyass2021textclf,
author = {El-Ayyass, Dani},
title = {Pipeline for training text classification baselines},
howpublished = {\url{https://github.com/dayyass/text-classification-baseline}},
year = {2021}
}