
BOLT: Benchmarking Open-world Learning for Text classification


BOLT-Lab

BOLT-Lab is a self-contained Python package for benchmarking open-world learning (OWL) in text classification. It wraps 18 baseline methods (10 generalized category discovery (GCD) methods and 8 open-set methods) via subprocess calls and provides a unified grid experiment runner.


1. Installation

Requirements

  • Linux + NVIDIA GPU
  • Python 3.10
  • NVIDIA driver installed (nvidia-smi works)

Steps (run in order)

  1. Install bolt-lab
pip install bolt_lab
  2. Install PyTorch (for CUDA 12.6, use the cu126 wheel index)
pip install torch==2.8.0 torchvision==0.23.0 torchaudio==2.8.0 --index-url https://download.pytorch.org/whl/cu126
  3. Install NVCC (conda is needed only for this step)
conda install -c nvidia cuda-nvcc -y
  4. Install the remaining Python dependencies
pip install -r requirements.txt
  5. Install flash-attn (installed separately to avoid build failures)
mkdir -p ~/tmp/pip
TMPDIR=~/tmp/pip pip install --no-build-isolation --no-cache-dir flash-attn==2.8.3

Quick self-check

python -c "import torch; print(torch.__version__); print(torch.cuda.is_available())"
python -c "from bolt_lab.methods import list_methods; print(list_methods())"
bolt-grid --help

2. Environment Variables

| Variable | Description | Example |
| --- | --- | --- |
| BOLT_DATA_DIR | Path to BOLT datasets | /path/to/bolt/data |
| BOLT_PRETRAINED_MODELS | Path to pretrained models directory | /path/to/pretrained_models |
| BOLT_INTEGRATION | Set to 1 to run integration tests | 1 |

Set them in your shell; the pretrained-models path can also be passed on the command line via --model-dir:

export BOLT_DATA_DIR=/path/to/bolt/data
export BOLT_PRETRAINED_MODELS=/path/to/pretrained_models
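The same environment can also be prepared from Python, e.g. when driving bolt-grid from a launcher script. A minimal sketch (the paths are placeholders, and the launch line is left commented out):

```python
import os
import subprocess  # used when the commented launch line is enabled

# Build an environment for the child process; replace the placeholder
# paths with your actual dataset and model locations.
env = dict(os.environ)
env["BOLT_DATA_DIR"] = "/path/to/bolt/data"
env["BOLT_PRETRAINED_MODELS"] = "/path/to/pretrained_models"

cmd = ["bolt-grid", "--config", "grid_gcd.yaml", "--output-dir", "./bolt_workspace"]
# subprocess.run(cmd, env=env, check=True)  # uncomment to actually launch
print(cmd)
```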

3. Usage

Initialize workspace

bolt-grid --init-only --output-dir ./bolt_workspace --model-dir /path/to/pretrained_models

This creates the directory structure and copies editable configs to ./bolt_workspace/configs/.

Run experiments

bolt-grid --config grid_gcd.yaml --output-dir ./bolt_workspace --model-dir /path/to/pretrained_models

Arguments

| Argument | Description |
| --- | --- |
| --config | Grid config YAML. Bare names are resolved from output-dir/configs/ first, then from the package builtins. |
| --output-dir | Working directory for all outputs/results/logs. |
| --model-dir | Pretrained models directory (bert-base-uncased, etc.). |
| --init-only | Initialize the workspace only; do not run experiments. |
| --overwrite-configs | Re-copy config files from the package to output-dir. |

Typical workflow

# 1. Initialize and edit configs
bolt-grid --init-only --output-dir ./bolt_workspace --model-dir /path/to/pretrained_models
vim ./bolt_workspace/configs/grid_gcd.yaml

# 2. Run
bolt-grid --config grid_gcd.yaml --output-dir ./bolt_workspace --model-dir /path/to/pretrained_models

4. Grid Config Example

methods: [loop, glean, alup, geoid, sdc, dpn, deepaligned, tan]
datasets: [banking, clinc, stackoverflow]
result_file: summary_gcd

grid:
  known_cls_ratio: [0.25, 0.5, 0.75]
  labeled_ratio: [0.1, 0.5, 1.0]
  seeds: [2025]
  fold_types: [fold]
  fold_idxs: [0,1,2,3,4]
  fold_nums: [5]
  cluster_num_factor: [1.0]

run:
  gpus: [0,1,2,3]
  max_workers: 4
  num_pretrain_epochs: 100
  num_train_epochs: 50
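The grid above expands to a sizable number of runs. A quick back-of-the-envelope count, assuming the runner takes the full Cartesian product (it may prune or dedupe some combinations):

```python
from itertools import product

# Values copied from the grid config example above.
methods = ["loop", "glean", "alup", "geoid", "sdc", "dpn", "deepaligned", "tan"]
datasets = ["banking", "clinc", "stackoverflow"]
grid = {
    "known_cls_ratio": [0.25, 0.5, 0.75],
    "labeled_ratio": [0.1, 0.5, 1.0],
    "seeds": [2025],
    "fold_types": ["fold"],
    "fold_idxs": [0, 1, 2, 3, 4],
    "fold_nums": [5],
    "cluster_num_factor": [1.0],
}

# One experiment per (method, dataset, grid point).
combos = list(product(methods, datasets, *grid.values()))
print(len(combos))  # 8 * 3 * 3 * 3 * 5 = 1080
```

With max_workers: 4 across 4 GPUs, this is why deduplication (next section) matters: interrupted sweeps can be resumed without redoing finished runs.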

5. Output Structure

After running with --output-dir ./bolt_workspace:

bolt_workspace/
├── configs/          # Editable YAML configs (safe to modify)
├── outputs/          # Training artifacts (models, predictions)
├── results/          # Result CSVs + _index.json (dedup index)
├── logs/             # Experiment logs
├── data -> ...       # Symlink to dataset directory
└── pretrained_models -> ...  # Symlink to model directory

Deduplication

Completed experiments are tracked in results/_index.json. Re-running the same grid config will automatically skip finished experiments. To re-run a specific experiment, remove its entry from _index.json.
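Pruning entries can be scripted. The sketch below uses a stand-in file, since the actual schema of _index.json is not documented here; it is modeled as a JSON object keyed by an experiment identifier, with hypothetical key names. Inspect your own results/_index.json for the real format before adapting this:

```python
import json
import tempfile
from pathlib import Path

# Demo stand-in for results/_index.json with hypothetical keys.
workdir = Path(tempfile.mkdtemp())
index_path = workdir / "_index.json"
index_path.write_text(json.dumps({
    "loop__banking__kcr0.25__lr0.1__seed2025__fold0": "done",
    "loop__banking__kcr0.25__lr0.1__seed2025__fold1": "done",
    "glean__clinc__kcr0.5__lr0.5__seed2025__fold0": "done",
}))

# Drop every entry matching a fragment so those runs are re-executed
# on the next bolt-grid invocation.
index = json.loads(index_path.read_text())
index = {k: v for k, v in index.items() if "loop__banking" not in k}
index_path.write_text(json.dumps(index, indent=2))
print(sorted(index))  # only the glean entry remains
```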


6. Methods

GCD (10 methods)

| Name | Module | Description |
| --- | --- | --- |
| loop | _builtin/loop.py | KNN + SupConLoss + MLM pretraining |
| glean | _builtin/glean.py | KNN + DistillLoss + LLM cluster characterization |
| alup | _builtin/alup.py | Active learning with LLM labeling |
| geoid | _builtin/geoid.py | GeoID clustering |
| sdc | _builtin/sdc.py | Self-paced Deep Clustering |
| dpn | _builtin/dpn.py | Deep Pairwise Network |
| deepaligned | _builtin/dal.py | DeepAligned clustering |
| tan | _builtin/tan.py | TAN method |
| tlsa | _builtin/tlsa.py | TLSA method |
| llm4openssl | _builtin/llm4openssl.py | Llama-based GCD (SFTTrainer + DeepSpeed) |

Open-set (8 methods)

| Name | Module | Description |
| --- | --- | --- |
| ab | _builtin/ab.py | Adaptive Boundary |
| adb | _builtin/adb.py | Adaptive Decision Boundary |
| doc | _builtin/doc.py | DOC method |
| deepunk | _builtin/deepunk.py | DeepUnk (TF/Keras) |
| scl | _builtin/scl.py | Supervised Contrastive Learning (TF/Keras) |
| dyen | _builtin/dyen.py | Dynamic Ensemble |
| knncon | _builtin/knncon.py | KNN-Contrastive |
| unllm | _builtin/unllm.py | Llama-based open-set (SFTTrainer + DeepSpeed) |

All methods are subprocess wrappers. Training source code is bundled in _builtin/_src/.
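Conceptually, such a wrapper builds a command line for the bundled training script and launches it as a child process. The sketch below illustrates the pattern only; the script path, flag names, and directory layout are assumptions, not bolt_lab's actual internal API:

```python
import os
import subprocess  # used when the commented launch line is enabled
import sys

# Illustrative subprocess-wrapper sketch (hypothetical layout and flags).
def run_method(method: str, dataset: str, known_cls_ratio: float,
               gpu: int, output_dir: str) -> list[str]:
    script = f"_builtin/_src/{method}/train.py"  # hypothetical script path
    cmd = [
        sys.executable, script,
        "--dataset", dataset,
        "--known_cls_ratio", str(known_cls_ratio),
        "--output_dir", output_dir,
    ]
    env = {**os.environ, "CUDA_VISIBLE_DEVICES": str(gpu)}  # pin one GPU per worker
    # subprocess.run(cmd, env=env, check=True)  # uncomment to actually launch
    return cmd

print(run_method("loop", "banking", 0.25, gpu=0, output_dir="./bolt_workspace/outputs"))
```

Isolating each method in its own process is what lets TF/Keras baselines (deepunk, scl) and DeepSpeed-based LLM methods coexist in one runner despite conflicting dependencies.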


7. Notes

  • Do not point --output-dir to the outputs/ directory itself. Point it to an experiment root so outputs/results/logs/ are created as subdirectories.
  • data/ and pretrained_models/ under output-dir are symlinks. Do not edit them directly.
  • If flash-attn installation fails: check torch.cuda.is_available(), CUDA version match, and disk space.

8. Updating the Package

cd /path/to/bolt-lab
# Edit version in pyproject.toml if needed
pip install -e .

Since bolt-lab is installed in editable mode (-e), code changes take effect immediately. Only re-run pip install -e . after changing pyproject.toml.
