TF-IDF + LogReg baseline for text classification
Text Classification Baseline
Pipeline for building text classification TF-IDF + LogReg baselines using sklearn.
Usage
Instead of writing custom code for a specific text classification task, you just need to:
- install the pipeline:
  pip install text-classification-baseline
- run the pipeline:
  - either in terminal:
    text-clf --config config.yaml
  - or in Python:
    import text_clf
    text_clf.train(path_to_config="config.yaml")
No data preparation is needed, only a CSV file with two raw columns (with arbitrary names):
- text
- target
NOTE: the target can be presented in any format, including text - not necessarily integers from 0 to n_classes-1.
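As an illustration of the expected input, here is a minimal sketch that writes such a two-column CSV using only the standard library. The sample rows, the `write_dataset` helper, and the column names `text` and `target` are assumptions made for this example; any column names should work as long as they match `text_column` and `target_column` in the config.

```python
import csv
import io

# Hypothetical sample data: raw text plus a free-form (here: string) target.
rows = [
    ("the team won the match", "sports"),
    ("new laptop with a fast cpu", "tech"),
    ("great camera, poor battery", "tech"),
]


def write_dataset(rows, fileobj, text_column="text", target_column="target"):
    """Write (text, target) pairs as a CSV with a header row."""
    writer = csv.writer(fileobj)
    writer.writerow([text_column, target_column])
    writer.writerows(rows)


# Written to an in-memory buffer here; pass an open file to create train.csv.
buffer = io.StringIO()
write_dataset(rows, buffer)
csv_text = buffer.getvalue()
```

The `csv` module quotes fields that contain the separator, so texts with commas survive a round trip without manual escaping.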
Config
The user interface consists of a single file, config.yaml.
Edit config.yaml to set the desired configuration, then train the text classification model.
Default config.yaml:
seed: 42
verbose: true
path_to_save_folder: models

# data
data:
  train_data_path: data/train.csv
  valid_data_path: data/valid.csv
  sep: ','
  text_column: text
  target_column: target_name_short

# tf-idf
tf-idf:
  lowercase: true
  ngram_range: (1, 1)
  max_df: 1.0
  min_df: 0.0

# logreg
logreg:
  penalty: l2
  C: 1.0
  class_weight: balanced
  solver: saga
  multi_class: auto
  n_jobs: -1
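To make the config keys more concrete, the following is a rough sketch of an sklearn pipeline that the default values above would plausibly correspond to. It is not the package's actual implementation: the step names, the toy data, the use of `seed` as `random_state`, and the `max_iter` value are assumptions made for this example. `penalty: l2` is the sklearn default and is left implicit, and `multi_class` and `n_jobs` are omitted to keep the sketch portable across scikit-learn versions (`multi_class` is deprecated in recent releases).

```python
import ast

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Values copied from the "tf-idf" and "logreg" sections of the default config.
tfidf_params = {"lowercase": True, "ngram_range": "(1, 1)", "max_df": 1.0, "min_df": 0.0}
logreg_params = {"C": 1.0, "class_weight": "balanced", "solver": "saga"}

# YAML has no tuple type, so "(1, 1)" is read as a string; convert it to a tuple.
tfidf_params["ngram_range"] = ast.literal_eval(tfidf_params["ngram_range"])

# Step names are hypothetical. random_state=42 mirrors "seed: 42" (assumption);
# max_iter is not in the config and is raised only so the toy data converges.
pipeline = Pipeline(
    [
        ("tf-idf", TfidfVectorizer(**tfidf_params)),
        ("logreg", LogisticRegression(random_state=42, max_iter=1000, **logreg_params)),
    ]
)

# Toy training data with string targets; sklearn accepts them directly.
texts = [
    "the team won the match",
    "the striker scored a goal",
    "new laptop with a fast cpu",
    "the phone has a great camera",
]
targets = ["sports", "sports", "tech", "tech"]

pipeline.fit(texts, targets)
predictions = pipeline.predict(["the goalkeeper saved the match", "a cpu for the laptop"])
```

Keeping the vectorizer and the classifier in one `Pipeline` means raw text goes in and labels come out, which is what makes a single saved model file sufficient for inference.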
Output
After training the model, the pipeline will return the following files:
- model.joblib - sklearn pipeline with TF-IDF and LogReg steps
- target_names.json - mapping from encoded target labels (0 to n_classes-1) to their names
- config.yaml - config that was used to train the model
- logging.txt - logging file
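As a sketch of how the label mapping might be used, the snippet below decodes integer predictions back into target names. The exact layout of target_names.json is not documented here, so the JSON content shown (an object keyed by the encoded label) is an assumption, as are the helper names `load_target_names` and `decode`.

```python
import json

# Assumed content of target_names.json; JSON object keys are always strings.
target_names_json = '{"0": "sports", "1": "tech"}'


def load_target_names(raw_json):
    """Parse the mapping and convert the string keys back to integer ids."""
    return {int(label_id): name for label_id, name in json.loads(raw_json).items()}


def decode(predicted_ids, target_names):
    """Map encoded predictions (0 to n_classes-1) to their target names."""
    return [target_names[int(label_id)] for label_id in predicted_ids]


target_names = load_target_names(target_names_json)
decoded = decode([1, 0, 1], target_names)
```

The model itself is stored as a joblib file, so `joblib.load` on the path to model.joblib should restore the fitted sklearn pipeline.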
Requirements
Python >= 3.7
Citation
If you use text-classification-baseline in a scientific publication, we would appreciate a reference to the following BibTeX entry:
@misc{dayyass2021textclf,
author = {El-Ayyass, Dani},
title = {Pipeline for training text classification baselines},
howpublished = {\url{https://github.com/dayyass/text-classification-baseline}},
year = {2021}
}