A text classification toolkit

Bunruija

Bunruija is a text classification toolkit. It aims to enable pre-processing, training, and evaluation of text classification models with minimal coding effort. It focuses mainly on Japanese, though it is also applicable to other languages.

See the example configs to get a sense of how easy bunruija is to use.

Features

  • Minimal coding: bunruija lets users train and evaluate models from the command line. Because all experimental settings are stored in a YAML file, users do not have to write any code.
  • Easy comparison of neural and non-neural models: because bunruija supports scikit-learn and PyTorch models in the same framework, users can easily compare the classification accuracy and prediction time of neural and non-neural models.
  • Easy reproduction of model training: because all hyperparameters of a model are stored in a YAML file, the model is easy to reproduce.
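
The accuracy and prediction-time comparison mentioned in the list above can be illustrated with a small, framework-independent sketch. This is not bunruija's own evaluation code: `evaluate`, `keyword_model`, and `constant_model` are hypothetical names, and the two toy "models" are plain functions standing in for a trained non-neural and neural classifier.

```python
import time
from typing import Callable, Sequence


def evaluate(
    predict: Callable[[Sequence[str]], Sequence[str]],
    texts: Sequence[str],
    gold: Sequence[str],
) -> tuple[float, float]:
    """Return (accuracy, prediction time in seconds) for one model."""
    start = time.perf_counter()
    predicted = predict(texts)
    elapsed = time.perf_counter() - start
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold), elapsed


# Hypothetical stand-ins for two trained classifiers.
def keyword_model(texts: Sequence[str]) -> list[str]:
    return ["pos" if "good" in t else "neg" for t in texts]


def constant_model(texts: Sequence[str]) -> list[str]:
    return ["pos" for _ in texts]


texts = ["good movie", "bad movie", "good story", "dull plot"]
gold = ["pos", "neg", "pos", "neg"]

for name, model in [("keyword", keyword_model), ("constant", constant_model)]:
    accuracy, seconds = evaluate(model, texts, gold)
    print(f"{name}: accuracy={accuracy:.2f}, time={seconds:.6f}s")
```

Because both models are called through the same function on the same data, the two numbers are directly comparable, which is the point of keeping neural and non-neural models in one framework.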

Install

pip install bunruija

Example configs

Example of sklearn.svm.SVC

data:
  train: train.csv
  dev: dev.csv
  test: test.csv

bin_dir: models/svm-model

pipeline:
  - type: sklearn.feature_extraction.text.TfidfVectorizer
    args:
      tokenizer:
        type: bunruija.tokenizers.mecab_tokenizer.MeCabTokenizer
        args:
          lemmatize: true
          exclude_pos:
            - 助詞
            - 助動詞
      max_features: 10000
      min_df: 3
      ngram_range:
        - 1
        - 3
  - type: sklearn.svm.SVC
    args:
      verbose: false
      C: 10.

Example of BERT

data:
  train: train.csv
  dev: dev.csv
  test: test.csv

bin_dir: models/transformer-model

pipeline:
  - type: bunruija.feature_extraction.sequence.SequenceVectorizer
    args:
      tokenizer:
        type: transformers.AutoTokenizer
        args:
          pretrained_model_name_or_path: cl-tohoku/bert-base-japanese
  - type: bunruija.classifiers.transformer.TransformerClassifier
    args:
      device: cpu
      pretrained_model_name_or_path: cl-tohoku/bert-base-japanese
      optimizer: adamw
      lr: 3e-5
      max_epochs: 3
      weight_decay: 0.01
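
Both configs above describe each pipeline step as a `type` (a dotted import path) plus `args` (keyword arguments), and `args` may itself contain nested `type`/`args` entries such as the tokenizer. The sketch below shows one way such entries could be turned into Python objects. It is an assumption about the general pattern, not bunruija's actual loader: `resolve` and `build` are hypothetical helpers, and the demo spec uses standard-library classes so that it runs without scikit-learn or PyTorch installed.

```python
import importlib
from typing import Any


def resolve(dotted_path: str) -> Any:
    """Import `package.module.Name` and return the `Name` attribute."""
    module_name, _, attr = dotted_path.rpartition(".")
    module = importlib.import_module(module_name)
    return getattr(module, attr)


def build(spec: Any) -> Any:
    """Recursively instantiate {"type": ..., "args": {...}} entries."""
    if isinstance(spec, dict) and "type" in spec:
        # Nested entries (e.g. a tokenizer inside a vectorizer) are built first.
        kwargs = {k: build(v) for k, v in (spec.get("args") or {}).items()}
        return resolve(spec["type"])(**kwargs)
    if isinstance(spec, list):
        return [build(item) for item in spec]
    return spec  # plain values (numbers, strings, booleans) pass through


# Stand-in spec shaped like a pipeline entry, using only the standard library.
spec = {
    "type": "datetime.timezone",
    "args": {"offset": {"type": "datetime.timedelta", "args": {"hours": 9}}},
}
jst = build(spec)
print(jst)  # UTC+09:00
```

Under this scheme, an entry such as `type: sklearn.svm.SVC` with `args: {C: 10.}` would resolve in the same way, provided the named package is installed. Entries whose objects are not created by calling the class directly would need extra handling that this sketch omits.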

CLI

# Training a classifier
bunruija-train -y config.yaml

# Evaluating the trained classifier
bunruija-evaluate -y config.yaml

Prediction using the trained classifier in Python code

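import argparse

from bunruija import Predictor

parser = argparse.ArgumentParser()
parser.add_argument("-y", "--yaml", required=True, help="path to the YAML config")
args = parser.parse_args()

# Load the classifier trained with the settings in the YAML file
predictor = Predictor(args.yaml)
while True:
    text = input("Input:")
    label: list[str] = predictor([text], return_label_type="str")
    print(label[0])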
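
For non-interactive use, the same calling convention (a list of texts in, a list of string labels out) can be wrapped in a small batching helper. The following is a sketch under assumptions: `classify_in_batches` is a hypothetical helper, and `fake_predictor` is a stand-in that only mimics the call signature shown above so the example runs without a trained model. In real use, a `Predictor` instance would be passed instead.

```python
from typing import Callable, Iterable, Iterator


def classify_in_batches(
    predictor: Callable[..., list[str]],
    texts: Iterable[str],
    batch_size: int = 2,
) -> Iterator[tuple[str, str]]:
    """Yield (text, label) pairs, calling the predictor once per batch."""
    batch: list[str] = []
    for text in texts:
        batch.append(text)
        if len(batch) == batch_size:
            yield from zip(batch, predictor(batch, return_label_type="str"))
            batch = []
    if batch:  # flush the final, possibly smaller, batch
        yield from zip(batch, predictor(batch, return_label_type="str"))


# Hypothetical stand-in with the same call signature as the predictor above.
def fake_predictor(texts: list[str], return_label_type: str = "str") -> list[str]:
    return ["long" if len(t) > 10 else "short" for t in texts]


results = list(
    classify_in_batches(fake_predictor, ["hello", "a much longer sentence", "ok"])
)
for text, label in results:
    print(f"{label}\t{text}")
```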
