BOLT: Benchmarking Open-world Learning for Text classification

Project description

BOLT-Lab

BOLT-Lab is a self-contained Python package for benchmarking open-world learning (OWL) in text classification. It wraps 19 baseline methods (11 generalized category discovery (GCD) + 8 open-set) via subprocess calls and provides a unified grid experiment runner.


1. Installation

Requirements

  • Linux + NVIDIA GPU
  • Python 3.10
  • NVIDIA driver installed (nvidia-smi works)

Steps (run in order)

  1. Install bolt-lab
pip install bolt_lab
  2. Install PyTorch (for CUDA 12.6, use the cu126 wheel index)
pip install torch==2.8.0 torchvision==0.23.0 torchaudio==2.8.0 --index-url https://download.pytorch.org/whl/cu126
  3. Install NVCC (use conda only for this step)
conda install -c nvidia cuda-nvcc -y
  4. Install the remaining Python dependencies
pip install -r requirements.txt
  5. Install flash-attn (installed separately to avoid build failures)
mkdir -p ~/tmp/pip
TMPDIR=~/tmp/pip pip install --no-build-isolation --no-cache-dir flash-attn==2.8.3

Quick self-check

python -c "import torch; print(torch.__version__); print(torch.cuda.is_available())"
python -c "from bolt_lab.methods import list_methods; print(list_methods())"
bolt-grid --help

2. Environment Variables

| Variable | Description | Example |
|---|---|---|
| BOLT_DATA_DIR | Path to BOLT datasets | /path/to/bolt/data |
| BOLT_PRETRAINED_MODELS | Path to pretrained models directory | /path/to/pretrained_models |
| BOLT_INTEGRATION | Set to 1 to run integration tests | 1 |

Set them in your shell; the pretrained-models path can also be passed on the command line via --model-dir:

export BOLT_DATA_DIR=/path/to/bolt/data
export BOLT_PRETRAINED_MODELS=/path/to/pretrained_models
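As an illustration of how these variables are consumed, the sketch below resolves both paths from the environment and fails early with a clear message when one is missing. The helper name `resolve_bolt_paths` is hypothetical, not part of bolt_lab's public API:

```python
import os
from pathlib import Path

def resolve_bolt_paths():
    """Illustrative helper (not bolt_lab's API): read the BOLT_* environment
    variables and fail early if one is unset."""
    data_dir = os.environ.get("BOLT_DATA_DIR")
    model_dir = os.environ.get("BOLT_PRETRAINED_MODELS")
    if data_dir is None:
        raise RuntimeError("BOLT_DATA_DIR is not set")
    if model_dir is None:
        raise RuntimeError("BOLT_PRETRAINED_MODELS is not set")
    return Path(data_dir), Path(model_dir)
```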

3. Usage

Initialize workspace

bolt-grid --init-only --output-dir ./bolt_workspace --model-dir /path/to/pretrained_models

This creates the directory structure and copies editable configs to ./bolt_workspace/configs/.

Run experiments

bolt-grid --config grid_gcd.yaml --output-dir ./bolt_workspace --model-dir /path/to/pretrained_models

Arguments

| Argument | Description |
|---|---|
| --config | Grid config YAML. Bare names are resolved from output-dir/configs/, then package builtins. |
| --output-dir | Working directory for all outputs/results/logs. |
| --model-dir | Pretrained models directory (bert-base-uncased, etc.). |
| --init-only | Initialize workspace only, do not run experiments. |
| --overwrite-configs | Re-copy config files from package to output-dir. |

Typical workflow

# 1. Initialize and edit configs
bolt-grid --init-only --output-dir ./bolt_workspace --model-dir /path/to/pretrained_models
vim ./bolt_workspace/configs/grid_gcd.yaml

# 2. Run
bolt-grid --config grid_gcd.yaml --output-dir ./bolt_workspace --model-dir /path/to/pretrained_models

4. Grid Config Example

methods: [loop, glean, alup, geoid, sdc, dpn, deepaligned, tan]
datasets: [banking, clinc, stackoverflow]
result_file: summary_gcd

grid:
  known_cls_ratio: [0.25, 0.5, 0.75]
  labeled_ratio: [0.1, 0.5, 1.0]
  seeds: [2025]
  fold_types: [fold]
  fold_idxs: [0,1,2,3,4]
  fold_nums: [5]
  cluster_num_factor: [1.0]

run:
  gpus: [0,1,2,3]
  max_workers: 4
  num_pretrain_epochs: 100
  num_train_epochs: 50
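Assuming the grid section is expanded as a full Cartesian product over methods, datasets, and every grid axis (the usual semantics of a grid runner; the exact expansion is bolt_lab's internal detail), the config above defines 8 × 3 × 3 × 3 × 5 = 1080 runs. A minimal sketch of that expansion, with the YAML inlined as a dict:

```python
from itertools import product

# Mirrors the grid config above; Cartesian-product expansion is an assumption.
config = {
    "methods": ["loop", "glean", "alup", "geoid", "sdc", "dpn", "deepaligned", "tan"],
    "datasets": ["banking", "clinc", "stackoverflow"],
    "grid": {
        "known_cls_ratio": [0.25, 0.5, 0.75],
        "labeled_ratio": [0.1, 0.5, 1.0],
        "seeds": [2025],
        "fold_types": ["fold"],
        "fold_idxs": [0, 1, 2, 3, 4],
        "fold_nums": [5],
        "cluster_num_factor": [1.0],
    },
}

# One tuple per experiment: (method, dataset, known_cls_ratio, ...)
axes = [config["methods"], config["datasets"], *config["grid"].values()]
runs = list(product(*axes))
print(len(runs))  # 8 * 3 * 3 * 3 * 1 * 1 * 5 * 1 * 1 = 1080
```

With max_workers: 4 across gpus: [0,1,2,3], such a grid runs four experiments concurrently, one per GPU.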

5. Output Structure

After running with --output-dir ./bolt_workspace:

bolt_workspace/
├── configs/          # Editable YAML configs (safe to modify)
├── outputs/          # Training artifacts (models, predictions)
├── results/          # Result CSVs + _index.json (dedup index)
├── logs/             # Experiment logs
├── data -> ...       # Symlink to dataset directory
└── pretrained_models -> ...  # Symlink to model directory

Deduplication

Completed experiments are tracked in results/<task>/<method>/results.csv. Re-running the same grid config will automatically skip finished experiments based on matching method, dataset, known_cls_ratio, labeled_ratio, seed, and fold parameters.
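The dedup check can be pictured as building a set of completed-run keys from results.csv and skipping any grid point whose key is already present. The key fields below follow the list above, but the exact CSV column names are an assumption, not bolt_lab's actual schema:

```python
import csv

# Columns forming the dedup key (per the list above); illustrative schema.
KEY_FIELDS = ["method", "dataset", "known_cls_ratio", "labeled_ratio",
              "seed", "fold_type", "fold_idx", "fold_num"]

def completed_keys(results_csv_path):
    """Collect the keys of experiments already recorded in results.csv."""
    done = set()
    with open(results_csv_path, newline="") as f:
        for row in csv.DictReader(f):
            done.add(tuple(str(row[k]) for k in KEY_FIELDS))
    return done

def should_skip(run, done):
    """True if this grid point already has a recorded result."""
    return tuple(str(run[k]) for k in KEY_FIELDS) in done
```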


6. Methods

GCD (11 methods)

| Name | Description |
|---|---|
| loop | KNN + SupConLoss + MLM pretrain |
| glean | KNN + DistillLoss + LLM cluster characterization |
| alup | Active Learning with LLM labeling |
| geoid | GeoID clustering |
| sdc | Self-paced Deep Clustering |
| dpn | Deep Pairwise Network |
| deepaligned | DeepAligned Clustering |
| tan | TAN method |
| tlsa | TLSA method |
| plm_gcd | PLM-based GCD |
| llm4openssl | Llama-based GCD (SFTTrainer + LoRA) |

Open-set (8 methods)

| Name | Description |
|---|---|
| ab | Adaptive Boundary |
| adb | Adaptive Decision Boundary |
| doc | DOC method |
| deepunk | DeepUnk (TF/Keras) |
| scl | Supervised Contrastive Learning (TF/Keras) |
| dyen | Dynamic Ensemble |
| knncon | KNN-Contrastive |
| unllm | Llama-based open-set (SFTTrainer + LoRA) |

All methods are subprocess wrappers. Training source code is bundled in _builtin/_src/.
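A subprocess wrapper of this kind can be sketched as follows. The function name, script path, and flags are hypothetical; the sketch only shows the general pattern of launching a bundled training script in its own process, pinned to one GPU via CUDA_VISIBLE_DEVICES, with output redirected to a log file:

```python
import os
import subprocess
import sys

def run_method(script, args, gpu_id, log_path):
    """Illustrative wrapper (not bolt_lab's real entry point): run a bundled
    training script as a subprocess, pinned to one GPU, logging to a file."""
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu_id))
    cmd = [sys.executable, script, *args]
    with open(log_path, "w") as log:
        proc = subprocess.run(cmd, env=env, stdout=log, stderr=subprocess.STDOUT)
    return proc.returncode
```

Running each method in a subprocess isolates its dependencies (e.g. the TF/Keras methods) from the main process and lets the grid runner schedule one experiment per GPU.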


7. Notes

  • Do not point --output-dir to the outputs/ directory itself. Point it to an experiment root so outputs/results/logs/ are created as subdirectories.
  • data/ and pretrained_models/ under output-dir are symlinks. Do not edit them directly.
  • If flash-attn installation fails: check that torch.cuda.is_available() returns True, that the installed CUDA version matches the PyTorch build, and that TMPDIR has enough free disk space.

8. Updating the Package

cd /path/to/bolt-lab
# Edit version in pyproject.toml if needed
pip install -e .

Since bolt-lab is installed in editable mode (-e), code changes take effect immediately. Only re-run pip install -e . after changing pyproject.toml.

Download files

Download the file for your platform.

Source Distribution

bolt_lab-1.0.2.tar.gz (26.9 MB)

Built Distribution


bolt_lab-1.0.2-py3-none-any.whl (27.5 MB)

File details

Details for the file bolt_lab-1.0.2.tar.gz.

File metadata

  • Download URL: bolt_lab-1.0.2.tar.gz
  • Size: 26.9 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.11

File hashes

Hashes for bolt_lab-1.0.2.tar.gz

| Algorithm | Hash digest |
|---|---|
| SHA256 | fc87c6b121a78516f8d2ce6452a48683738fc273ecca90d5f44afa0d44df1d0a |
| MD5 | 86a43a1863d62b12ec6629d1dd6c8ac6 |
| BLAKE2b-256 | 8a854a97aeb2116f2306785045f969c9d2f5d69706fe03f7e09c5892e30f156e |


File details

Details for the file bolt_lab-1.0.2-py3-none-any.whl.

File metadata

  • Download URL: bolt_lab-1.0.2-py3-none-any.whl
  • Size: 27.5 MB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.11

File hashes

Hashes for bolt_lab-1.0.2-py3-none-any.whl

| Algorithm | Hash digest |
|---|---|
| SHA256 | 02138b84fc60f2bce4204ad8e90f601155501ee2faa99e495a59e2e6bcde2392 |
| MD5 | fb5d4d59a7fc05137ec52b65362b8189 |
| BLAKE2b-256 | cf137d1d901ad0fec82834a90f200dc08b5ae59578b1bd5633254526adad13d6 |

