A text classification toolkit
Bunruija
Bunruija is a text classification toolkit. It aims to enable pre-processing, training, and evaluation of text classification models with minimal coding effort. Bunruija focuses mainly on Japanese, though it is also applicable to other languages.
See the example to get a sense of how easy bunruija is to use.
Features
- Minimal coding: bunruija lets users train and evaluate their models from the command line. Because all experimental settings are stored in a YAML file, users do not have to write any code.
- Easy comparison of neural and non-neural models: because bunruija supports models based on scikit-learn and PyTorch in the same framework, users can easily compare the classification accuracy and prediction time of both kinds of model.
- Easy reproduction of training: because all hyperparameters of a model are stored in a YAML file, it is easy to reproduce the model.
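As a rough illustration of the accuracy and prediction-time comparison mentioned in the list above, the sketch below times two classifiers on the same inputs. It does not use bunruija itself: `evaluate`, `keyword_model`, and `constant_model` are hypothetical names, and the two toy functions merely stand in for trained models.

```python
import time
from typing import Callable, Sequence


def evaluate(
    predict: Callable[[Sequence[str]], Sequence[str]],
    texts: Sequence[str],
    gold: Sequence[str],
) -> dict[str, float]:
    """Return accuracy and wall-clock prediction time for one classifier."""
    start = time.perf_counter()
    predicted = predict(texts)
    elapsed = time.perf_counter() - start
    correct = sum(p == g for p, g in zip(predicted, gold))
    return {"accuracy": correct / len(gold), "seconds": elapsed}


# Two toy stand-ins for trained models (hypothetical, for illustration only).
def keyword_model(texts):
    return ["sports" if "goal" in t else "other" for t in texts]


def constant_model(texts):
    return ["other" for _ in texts]


texts = ["late goal wins the match", "stock prices fall", "goal of the season"]
gold = ["sports", "other", "sports"]

results = {
    "keyword": evaluate(keyword_model, texts, gold),
    "constant": evaluate(constant_model, texts, gold),
}
for name, scores in results.items():
    print(f"{name}: accuracy={scores['accuracy']:.2f}, time={scores['seconds']:.6f}s")
```

Any two callables that map a list of texts to a list of labels can be compared this way, whichever library trained them.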
Install
pip install bunruija
Example configs
Example of sklearn.svm.SVC
data:
  train: train.csv
  dev: dev.csv
  test: test.csv
bin_dir: models/svm-model
pipeline:
  - type: sklearn.feature_extraction.text.TfidfVectorizer
    args:
      tokenizer:
        type: bunruija.tokenizers.mecab_tokenizer.MeCabTokenizer
        args:
          lemmatize: true
          exclude_pos:
            - 助詞  # particles
            - 助動詞  # auxiliary verbs
      max_features: 10000
      min_df: 3
      ngram_range:
        - 1
        - 3
  - type: sklearn.svm.SVC
    args:
      verbose: false
      C: 10.
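Each pipeline entry above names a class by its dotted path in `type` and lists constructor arguments under `args`. The sketch below shows one way such an entry could be turned into an object with `importlib`. It is an assumption about the general technique, not bunruija's actual loader: `build` is a hypothetical helper, and standard-library classes stand in for the vectorizer and tokenizer.

```python
import importlib
from typing import Any


def build(spec: dict[str, Any]) -> Any:
    """Instantiate the class named by spec["type"] with spec["args"] as keyword arguments."""
    module_name, _, class_name = spec["type"].rpartition(".")
    cls = getattr(importlib.import_module(module_name), class_name)
    kwargs = {}
    for key, value in spec.get("args", {}).items():
        # A nested mapping that has its own "type" is built recursively,
        # like the tokenizer inside the TfidfVectorizer entry.
        if isinstance(value, dict) and "type" in value:
            value = build(value)
        kwargs[key] = value
    return cls(**kwargs)


# Standard-library classes stand in for the vectorizer and tokenizer so the
# sketch runs without scikit-learn or MeCab installed.
spec = {
    "type": "types.SimpleNamespace",
    "args": {
        "max_features": 10000,
        "tokenizer": {"type": "types.SimpleNamespace", "args": {"lemmatize": True}},
    },
}
component = build(spec)
print(component.max_features, component.tokenizer.lemmatize)
```

A real loader would likely need special cases, for example classes that are created through a factory method rather than their constructor.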
Example of BERT
data:
  train: train.csv
  dev: dev.csv
  test: test.csv
bin_dir: models/transformer-model
pipeline:
  - type: bunruija.feature_extraction.sequence.SequenceVectorizer
    args:
      tokenizer:
        type: transformers.AutoTokenizer
        args:
          pretrained_model_name_or_path: cl-tohoku/bert-base-japanese
  - type: bunruija.classifiers.transformer.TransformerClassifier
    args:
      device: cpu
      pretrained_model_name_or_path: cl-tohoku/bert-base-japanese
      optimizer: adamw
      lr: 3e-5
      max_epochs: 3
      weight_decay: 0.01
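Both example configs share the same top-level shape: a `data` mapping with `train`, `dev`, and `test`, a `bin_dir`, and a `pipeline` list whose entries each carry a `type`. The sketch below checks a loaded config for that shape before training. Which keys bunruija actually requires is an assumption drawn only from the two examples, and `check_config` is a hypothetical helper.

```python
from typing import Any


def check_config(config: dict[str, Any]) -> list[str]:
    """Return a list of problems found; an empty list means the shape looks right."""
    problems = []
    for key in ("data", "bin_dir", "pipeline"):
        if key not in config:
            problems.append(f"missing top-level key: {key}")
    for split in ("train", "dev", "test"):
        if split not in config.get("data", {}):
            problems.append(f"missing data.{split}")
    for i, step in enumerate(config.get("pipeline", [])):
        if "type" not in step:
            problems.append(f"pipeline[{i}] has no type")
    return problems


# The BERT example written as the dict a YAML loader would produce
# (args abbreviated for brevity).
config = {
    "data": {"train": "train.csv", "dev": "dev.csv", "test": "test.csv"},
    "bin_dir": "models/transformer-model",
    "pipeline": [
        {"type": "bunruija.feature_extraction.sequence.SequenceVectorizer", "args": {}},
        {
            "type": "bunruija.classifiers.transformer.TransformerClassifier",
            "args": {"device": "cpu", "lr": 3e-5, "max_epochs": 3},
        },
    ],
}
complete = check_config(config)
incomplete = check_config({"data": {"train": "train.csv"}})
print(complete)
print(incomplete)
```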
CLI
# Training a classifier
bunruija-train -y config.yaml
# Evaluating the trained classifier
bunruija-evaluate -y config.yaml
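When the two commands above need to run from a script, for instance once per config file, they can be wrapped with `subprocess`. This is a minimal sketch that uses only the `-y` flag shown above; `bunruija_commands` and `run_all` are hypothetical helper names, and `run_all` assumes bunruija is installed and on the `PATH`.

```python
import subprocess


def bunruija_commands(config_path: str) -> list[list[str]]:
    """Build the train and evaluate command lines for one config file."""
    return [
        ["bunruija-train", "-y", config_path],
        ["bunruija-evaluate", "-y", config_path],
    ]


def run_all(config_path: str) -> None:
    """Train, then evaluate; check=True stops at the first failing command."""
    for command in bunruija_commands(config_path):
        subprocess.run(command, check=True)


if __name__ == "__main__":
    # Print the commands instead of running them, so this is safe to try
    # without a config file or a bunruija installation.
    for command in bunruija_commands("config.yaml"):
        print(" ".join(command))
```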
Prediction using the trained classifier in Python code
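from bunruija import Predictor

# Path to the YAML config that was used for training
predictor = Predictor("config.yaml")
while True:
    text = input("Input:")
    # The predictor takes a list of texts and returns one label per text
    label: list[str] = predictor([text], return_label_type="str")
    print(label[0])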
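The predictor call above takes a list of texts, so the same call shape can presumably serve batches as well as single inputs. The sketch below counts predicted labels over a stream of texts, sending them in fixed-size batches. `label_counts` is a hypothetical helper, handling of multi-text lists is an assumption, and `fake_predictor` is a stand-in so the example runs without a trained model.

```python
from collections import Counter
from typing import Callable, Iterable


def label_counts(
    predictor: Callable[..., list[str]],
    texts: Iterable[str],
    batch_size: int = 32,
) -> Counter:
    """Classify texts in batches and count how often each label is predicted."""
    counts: Counter = Counter()
    batch: list[str] = []
    for text in texts:
        batch.append(text)
        if len(batch) == batch_size:
            counts.update(predictor(batch, return_label_type="str"))
            batch = []
    if batch:  # flush the final, possibly shorter batch
        counts.update(predictor(batch, return_label_type="str"))
    return counts


# Hypothetical stand-in with the same call shape as the Predictor usage above.
def fake_predictor(texts: list[str], return_label_type: str = "str") -> list[str]:
    return ["question" if t.endswith("?") else "statement" for t in texts]


counts = label_counts(fake_predictor, ["Is it open?", "It is open.", "Why?"], batch_size=2)
print(counts)
```

With a trained model, `fake_predictor` would be replaced by a `Predictor` instance built from the training config.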