TF-IDF + LogReg baseline for text classification
Text Classification Baseline
Pipeline for quickly building text classification baselines with TF-IDF + LogReg.
Usage
Instead of writing custom code for a specific text classification task, you just need to:
- install pipeline:
pip install text-classification-baseline
- run pipeline:
- either in terminal:
text-clf-train --path_to_config config.yaml
- or in python:
import text_clf
model, target_names_mapping = text_clf.train(path_to_config="config.yaml")
NOTE: see the Config section below for more about the config file.
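Once training returns, the fitted model can be used for inference. The helper below is a minimal sketch, assuming that model exposes the usual sklearn predict method and that target_names_mapping maps encoded integer labels to their original names. predict_names is a hypothetical helper written for illustration, not part of the package.

```python
from typing import Any, Dict, List, Sequence


def predict_names(
    model: Any,
    target_names_mapping: Dict[int, str],
    texts: Sequence[str],
) -> List[str]:
    """Predict encoded labels and translate them back to target names.

    Assumption: the mapping goes from encoded label (int) to target name (str).
    """
    encoded = model.predict(list(texts))
    # int() converts numpy integer labels to plain Python ints before lookup
    return [target_names_mapping[int(label)] for label in encoded]
```

With the objects returned by text_clf.train, a call would look like predict_names(model, target_names_mapping, ["some raw text"]). If the mapping turns out to go the other way (name to encoded label), invert it first.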
No data preparation is needed, only a csv file with two raw columns (with arbitrary names):
- text
- target
The target can be presented in any format, including text - not necessarily integers from 0 to n_classes-1.
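To make the expected input concrete, here is a small sketch that writes such a two-column csv file with the standard library. The helper name write_dataset, the example rows, and the column names are illustrative assumptions; whatever header names you choose must match text_column and target_column in config.yaml.

```python
import csv
from pathlib import Path
from typing import Iterable, Tuple

# Illustrative rows: raw text plus a free-form (non-integer) target label
EXAMPLE_ROWS = [
    ("The battery lasts two days", "positive"),
    ("Stopped working, would not buy again", "negative"),
    ("Does what it says", "neutral"),
]


def write_dataset(
    path: str,
    rows: Iterable[Tuple[str, str]],
    text_column: str = "text",
    target_column: str = "target",
) -> None:
    """Write (text, target) pairs to a comma-separated file with a header row."""
    file_path = Path(path)
    file_path.parent.mkdir(parents=True, exist_ok=True)
    with file_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)  # quotes fields that contain the separator
        writer.writerow([text_column, target_column])
        writer.writerows(rows)
```

Calling write_dataset("data/train.csv", EXAMPLE_ROWS) produces a file that matches the default sep: ',' setting. Texts containing commas are quoted automatically by the csv module.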
Config
The user interface consists of two files:
- config.yaml - general configuration with sklearn TF-IDF and LogReg parameters
- hyperparams.py - sklearn GridSearchCV parameters
Change config.yaml and hyperparams.py to create the desired configuration, then train a text classification model with the following command:
- terminal:
text-clf-train --path_to_config config.yaml
- python:
import text_clf
model, target_names_mapping = text_clf.train(path_to_config="config.yaml")
Default config.yaml:
seed: 42
path_to_save_folder: models
# data
data:
  train_data_path: data/train.csv
  test_data_path: data/test.csv
  sep: ','
  text_column: text
  target_column: target_name_short
# tf-idf
tf-idf:
  lowercase: true
  ngram_range: (1, 1)
  max_df: 1.0
  min_df: 1
# logreg
logreg:
  penalty: l2
  C: 1.0
  class_weight: balanced
  solver: saga
  n_jobs: -1
# grid-search
grid-search:
  do_grid_search: false
  grid_search_params_path: hyperparams.py
NOTE: grid search is disabled by default; to use it, set do_grid_search: true.
NOTE: tf-idf and logreg are sklearn TfidfVectorizer and LogisticRegression parameters respectively, so you can parameterize instances of these classes however you want. The same logic applies to grid-search, which is sklearn GridSearchCV parametrized with hyperparams.py.
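To illustrate how the tf-idf and logreg sections correspond to sklearn objects, the sketch below builds a plain sklearn pipeline from the default values shown above. It is an approximation, not the package's own code: the step names "tf-idf" and "logreg" simply mirror the config section names, and passing seed as random_state is an assumption.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Values copied from the default config.yaml sections
TFIDF_PARAMS = {"lowercase": True, "ngram_range": (1, 1), "max_df": 1.0, "min_df": 1}
LOGREG_PARAMS = {
    "penalty": "l2",
    "C": 1.0,
    "class_weight": "balanced",
    "solver": "saga",
    "n_jobs": -1,
}


def build_default_pipeline(seed: int = 42) -> Pipeline:
    """Build a TF-IDF + LogReg pipeline equivalent to the default config."""
    vectorizer = TfidfVectorizer(**TFIDF_PARAMS)
    # Assumption: the top-level seed is used as the classifier's random_state
    classifier = LogisticRegression(random_state=seed, **LOGREG_PARAMS)
    # Step names are illustrative; they mirror the config section names
    return Pipeline([("tf-idf", vectorizer), ("logreg", classifier)])
```

Any other keyword accepted by TfidfVectorizer or LogisticRegression can be added to the matching config section in the same way. Recent sklearn releases may emit deprecation warnings for some of these keywords.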
Output
After training the model, the pipeline produces the following files:
- model.joblib - sklearn pipeline with TF-IDF and LogReg steps
- target_names.json - mapping from encoded target labels (0 to n_classes-1) to their names
- config.yaml - config that was used to train the model
- hyperparams.py - grid-search parameters (if grid-search was used)
- logging.txt - logging file
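The saved files can be loaded back for inference in a separate process. The following is a minimal sketch, assuming target_names.json stores a JSON object whose keys are the encoded labels and whose values are the target names. load_artifacts is a hypothetical helper, and the folder argument should point at wherever the pipeline wrote its output.

```python
import json
from pathlib import Path
from typing import Any, Dict, Tuple

import joblib


def load_artifacts(folder: str) -> Tuple[Any, Dict[int, str]]:
    """Load the fitted pipeline and the label mapping from an output folder."""
    folder_path = Path(folder)
    model = joblib.load(folder_path / "model.joblib")
    with (folder_path / "target_names.json").open(encoding="utf-8") as f:
        raw = json.load(f)
    # JSON object keys are always strings, so convert them back to ints
    target_names = {int(key): name for key, name in raw.items()}
    return model, target_names
```

Because joblib.load unpickles arbitrary Python objects, only load model files from sources you trust.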
Requirements
Python >= 3.6
Citation
If you use text-classification-baseline in a scientific publication, we would appreciate a citation using the following BibTeX entry:
@misc{dayyass2021textclf,
author = {El-Ayyass, Dani},
title = {Pipeline for training text classification baselines},
howpublished = {\url{https://github.com/dayyass/text-classification-baseline}},
year = {2021}
}