A text classification toolkit

Bunruija

Bunruija is a text classification toolkit. It aims to enable pre-processing, training, and evaluation of text classification models with minimal coding effort. Bunruija focuses mainly on Japanese, though it is also applicable to other languages.

See the example to understand how easy bunruija is to use.

Features

  • Minimal coding: bunruija lets users train and evaluate their models from the command line. Because all experimental settings are stored in a YAML file, users do not have to write any code.
  • Easy comparison of neural and non-neural models: because bunruija supports models based on scikit-learn and PyTorch in the same framework, users can easily compare the classification accuracy and prediction time of both kinds of model.
  • Easy reproduction of training: because all hyperparameters of a model are stored in a YAML file, the model is easy to reproduce.

Install

pip install bunruija

Example configs

Example of sklearn.svm.SVC

data:
  train: train.csv
  dev: dev.csv
  test: test.csv

bin_dir: models/svm-model

pipeline:
  - type: sklearn.feature_extraction.text.TfidfVectorizer
    args:
      tokenizer:
        type: bunruija.tokenizers.mecab_tokenizer.MeCabTokenizer
        args:
          lemmatize: true
          exclude_pos:
            - 助詞
            - 助動詞
      max_features: 10000
      min_df: 3
      ngram_range:
        - 1
        - 3
  - type: sklearn.svm.SVC
    args:
      verbose: false
      C: 10.

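For readers more familiar with scikit-learn than with the YAML format, the config above corresponds roughly to the following hand-written pipeline. This is a minimal sketch, not bunruija's internal code: `whitespace_tokenizer` is a hypothetical stand-in for `MeCabTokenizer` (so `lemmatize` and `exclude_pos` are not reproduced), the toy sentences are pre-segmented by hand, and `min_df` is lowered from 3 to 1 only because the toy corpus is tiny.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC


def whitespace_tokenizer(text: str) -> list[str]:
    # Hypothetical stand-in for MeCabTokenizer: splits on spaces only,
    # so the toy texts below are already segmented into words.
    return text.split()


pipeline = Pipeline([
    ("vectorizer", TfidfVectorizer(
        tokenizer=whitespace_tokenizer,
        token_pattern=None,  # unused when a custom tokenizer is given
        max_features=10000,
        min_df=1,            # the config uses 3; lowered for this toy corpus
        ngram_range=(1, 3),
    )),
    ("classifier", SVC(verbose=False, C=10.0)),
])

# Toy training data (invented for illustration only).
train_texts = [
    "この 映画 は 面白い",
    "とても 楽しい 映画",
    "この 映画 は つまらない",
    "とても 退屈 な 映画",
]
train_labels = ["positive", "positive", "negative", "negative"]

pipeline.fit(train_texts, train_labels)
predictions = pipeline.predict(["楽しい 映画"])
print(predictions[0])
```

With the YAML config, the same choices live in a file instead of in code, which is what makes runs easy to repeat and compare.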
Example of BERT

data:
  train: train.csv
  dev: dev.csv
  test: test.csv

bin_dir: models/transformer-model

pipeline:
  - type: bunruija.feature_extraction.sequence.SequenceVectorizer
    args:
      tokenizer:
        type: transformers.AutoTokenizer
        args:
          pretrained_model_name_or_path: cl-tohoku/bert-base-japanese
  - type: bunruija.classifiers.transformer.TransformerClassifier
    args:
      device: cpu
      pretrained_model_name_or_path: cl-tohoku/bert-base-japanese
      optimizer: adamw
      lr: 3e-5
      max_epochs: 3
      weight_decay: 0.01
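
Both example configs describe each pipeline step as a dotted `type` path plus an `args` mapping, and `args` may itself contain nested `type`/`args` entries (the tokenizer, for instance). One way such a description could be turned into Python objects is sketched below. The `build` helper is hypothetical and is not taken from bunruija; it only handles classes that are created by calling their constructor, so something like `transformers.AutoTokenizer`, which is normally loaded through `from_pretrained`, would need special handling. Standard-library classes are used so the sketch runs without extra dependencies.

```python
import importlib
from typing import Any


def build(spec: Any) -> Any:
    """Recursively instantiate {"type": "pkg.module.Class", "args": {...}}."""
    if isinstance(spec, list):
        return [build(item) for item in spec]
    if not isinstance(spec, dict):
        return spec  # plain value such as a number or string
    if "type" not in spec:
        return {key: build(value) for key, value in spec.items()}
    # Split "pkg.module.Class" into module path and class name.
    module_name, _, class_name = spec["type"].rpartition(".")
    cls = getattr(importlib.import_module(module_name), class_name)
    # Nested specs inside args are built first, then passed as keywords.
    kwargs = {key: build(value) for key, value in (spec.get("args") or {}).items()}
    return cls(**kwargs)


# A nested spec with the same shape as a pipeline entry in the configs.
spec = {
    "type": "datetime.timezone",
    "args": {
        "offset": {"type": "datetime.timedelta", "args": {"hours": 9}},
    },
}
tz = build(spec)
print(tz)
```

Keeping object construction data-driven like this is what allows a single YAML file to swap a scikit-learn model for a PyTorch one without touching any code.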

CLI

# Training a classifier
bunruija-train -y config.yaml

# Evaluating the trained classifier
bunruija-evaluate -y config.yaml

Prediction using the trained classifier in Python code

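from bunruija import Predictor

# Path to the YAML config that was used for training
config_file = "config.yaml"

predictor = Predictor(config_file)
while True:
    text = input("Input:")
    # The predictor takes a list of texts and returns one label per text
    label: list[str] = predictor([text], return_label_type="str")
    print(label[0])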
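
For non-interactive use, the same call can be wrapped in a small helper that labels several texts at once. This is a sketch under assumptions: `classify_batch` is a hypothetical helper, and it assumes the predictor accepts a list with more than one text and returns one label per text, as the list-in, list-out call above suggests. `fake_predictor` is a stand-in so the example runs without a trained model; in practice a `Predictor` instance would be passed instead.

```python
from typing import Callable, Sequence


def classify_batch(
    predictor: Callable[..., list[str]], texts: Sequence[str]
) -> list[tuple[str, str]]:
    """Label every text in one call and pair each text with its label."""
    labels = predictor(list(texts), return_label_type="str")
    return list(zip(texts, labels))


def fake_predictor(texts: list[str], return_label_type: str = "str") -> list[str]:
    # Stand-in for a trained bunruija Predictor (assumption): labels by length.
    return ["long" if len(text) > 10 else "short" for text in texts]


results = classify_batch(fake_predictor, ["こんにちは", "今日はとても良い天気ですね"])
for text, label in results:
    print(f"{label}\t{text}")
```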

