
BOLT: Benchmarking Open-world Learning for Text classification


BOLT-Lab

BOLT-Lab is a self-contained Python package for benchmarking open-world learning (OWL) in text classification. It wraps 19 baseline methods (11 generalized category discovery (GCD) methods and 8 open-set methods) via subprocess calls and provides a unified grid-experiment runner.


1. Installation

Requirements

  • Linux + NVIDIA GPU
  • Python 3.10
  • NVIDIA driver installed (nvidia-smi works)

Steps (run in order)

  1. Install bolt-lab
pip install bolt_lab
  2. Install PyTorch (the cu126 index matches CUDA 12.6)
pip install torch==2.8.0 torchvision==0.23.0 torchaudio==2.8.0 --index-url https://download.pytorch.org/whl/cu126
  3. Install NVCC (use conda only for this step)
conda install -c nvidia cuda-nvcc -y
  4. Install the remaining Python dependencies
pip install -r requirements.txt
  5. Install flash-attn (install it separately to avoid build failures)
mkdir -p ~/tmp/pip
TMPDIR=~/tmp/pip pip install --no-build-isolation --no-cache-dir flash-attn==2.8.3

Quick self-check

python -c "import torch; print(torch.__version__); print(torch.cuda.is_available())"
python -c "from bolt_lab.methods import list_methods; print(list_methods())"
bolt-grid --help

2. Environment Variables

Variable                Description                          Example
BOLT_DATA_DIR           Path to BOLT datasets                /path/to/bolt/data
BOLT_PRETRAINED_MODELS  Path to pretrained models directory  /path/to/pretrained_models
BOLT_INTEGRATION        Set to 1 to run integration tests    1

Set them in your shell; BOLT_PRETRAINED_MODELS can alternatively be supplied via the --model-dir flag:

export BOLT_DATA_DIR=/path/to/bolt/data
export BOLT_PRETRAINED_MODELS=/path/to/pretrained_models

3. Usage

Initialize workspace

bolt-grid --init-only --output-dir ./bolt_workspace --model-dir /path/to/pretrained_models

This creates the directory structure and copies editable configs to ./bolt_workspace/configs/.

Run experiments

bolt-grid --config grid_gcd.yaml --output-dir ./bolt_workspace --model-dir /path/to/pretrained_models

Arguments

Argument             Description
--config             Grid config YAML. Bare names are resolved from output-dir/configs/ first, then from the package builtins.
--output-dir         Working directory for all outputs, results, and logs.
--model-dir          Directory containing pretrained models (bert-base-uncased, etc.).
--init-only          Initialize the workspace only; do not run experiments.
--overwrite-configs  Re-copy the config files from the package into output-dir.
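The --config resolution order described above can be sketched as follows. This is illustrative pseudologic under stated assumptions: PACKAGE_CONFIG_DIR and resolve_config are hypothetical names, and the real resolver inside bolt-lab may differ in detail:

```python
from pathlib import Path

# Hypothetical location of the package's built-in configs (illustrative path).
PACKAGE_CONFIG_DIR = Path("/usr/lib/bolt_lab/configs")

def resolve_config(name: str, output_dir: Path) -> Path:
    """Resolve --config: explicit paths are used as-is; bare names are looked
    up in output_dir/configs/ first, then in the package builtins."""
    candidate = Path(name)
    if candidate.is_absolute() or candidate.exists():
        return candidate                  # an explicit path wins outright
    workspace_copy = output_dir / "configs" / name
    if workspace_copy.exists():
        return workspace_copy             # user-edited workspace copy
    return PACKAGE_CONFIG_DIR / name      # fall back to the builtin
```

This ordering is why editing ./bolt_workspace/configs/grid_gcd.yaml takes effect without touching the installed package.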

Typical workflow

# 1. Initialize and edit configs
bolt-grid --init-only --output-dir ./bolt_workspace --model-dir /path/to/pretrained_models
vim ./bolt_workspace/configs/grid_gcd.yaml

# 2. Run
bolt-grid --config grid_gcd.yaml --output-dir ./bolt_workspace --model-dir /path/to/pretrained_models

4. Grid Config Example

methods: [loop, glean, alup, geoid, sdc, dpn, deepaligned, tan]
datasets: [banking, clinc, stackoverflow]
result_file: summary_gcd

grid:
  known_cls_ratio: [0.25, 0.5, 0.75]
  labeled_ratio: [0.1, 0.5, 1.0]
  seeds: [2025]
  fold_types: [fold]
  fold_idxs: [0,1,2,3,4]
  fold_nums: [5]
  cluster_num_factor: [1.0]

run:
  gpus: [0,1,2,3]
  max_workers: 4
  num_pretrain_epochs: 100
  num_train_epochs: 50
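The grid section expands into a full cross-product over methods, datasets, and every listed value. A quick sketch of that bookkeeping (axes with a single value, such as fold_types, fold_nums, and cluster_num_factor, multiply the count by 1 and are omitted here; how the runner actually enumerates combinations is an assumption):

```python
from itertools import product

methods  = ["loop", "glean", "alup", "geoid", "sdc", "dpn", "deepaligned", "tan"]
datasets = ["banking", "clinc", "stackoverflow"]
grid = {
    "known_cls_ratio": [0.25, 0.5, 0.75],
    "labeled_ratio":   [0.1, 0.5, 1.0],
    "seeds":           [2025],
    "fold_idxs":       [0, 1, 2, 3, 4],
}

# Every combination becomes one experiment the runner has to schedule.
combos = list(product(methods, datasets, *grid.values()))
print(len(combos))  # 8 methods * 3 datasets * 3 * 3 * 1 * 5 = 1080 experiments
```

With max_workers: 4 across gpus: [0,1,2,3], those 1080 runs are dispatched four at a time, which is worth keeping in mind before launching a large grid.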

5. Output Structure

After running with --output-dir ./bolt_workspace:

bolt_workspace/
├── configs/          # Editable YAML configs (safe to modify)
├── outputs/          # Training artifacts (models, predictions)
├── results/          # Result CSVs + _index.json (dedup index)
├── logs/             # Experiment logs
├── data -> ...       # Symlink to dataset directory
└── pretrained_models -> ...  # Symlink to model directory

Deduplication

Completed experiments are tracked in results/<task>/<method>/results.csv. Re-running the same grid config will automatically skip finished experiments based on matching method, dataset, known_cls_ratio, labeled_ratio, seed, and fold parameters.
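The skip logic can be pictured as a set-membership test over those key fields. The sketch below is an assumption about the mechanism, not bolt-lab's actual code: DEDUP_KEYS and the column names are hypothetical, and the real index also involves results/_index.json:

```python
import csv
from pathlib import Path

# Hypothetical key columns; mirrors the fields the docs say are matched.
DEDUP_KEYS = ("method", "dataset", "known_cls_ratio", "labeled_ratio", "seed", "fold_idx")

def load_completed(results_csv: Path) -> set:
    """Collect one key tuple per finished run recorded in results.csv."""
    if not results_csv.exists():
        return set()
    with results_csv.open(newline="") as f:
        return {tuple(row[k] for k in DEDUP_KEYS) for row in csv.DictReader(f)}

def should_skip(run: dict, completed: set) -> bool:
    """CSV cells are strings, so stringify the run's values before comparing."""
    return tuple(str(run[k]) for k in DEDUP_KEYS) in completed
```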


6. Methods

GCD (11 methods)

Name         Description
loop         KNN + SupConLoss + MLM pretraining
glean        KNN + DistillLoss + LLM cluster characterization
alup         Active learning with LLM labeling
geoid        GeoID clustering
sdc          Self-paced Deep Clustering
dpn          Deep Pairwise Network
deepaligned  DeepAligned clustering
tan          TAN
tlsa         TLSA
plm_gcd      PLM-based GCD
llm4openssl  Llama-based GCD (SFTTrainer + LoRA)

Open-set (8 methods)

Name     Description
ab       Adaptive Boundary
adb      Adaptive Decision Boundary
doc      DOC
deepunk  DeepUnk (TF/Keras)
scl      Supervised Contrastive Learning (TF/Keras)
dyen     Dynamic Ensemble
knncon   KNN-Contrastive
unllm    Llama-based open-set classification (SFTTrainer + LoRA)

All methods are subprocess wrappers. Training source code is bundled in _builtin/_src/.
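A subprocess wrapper of this kind can be sketched roughly as below. The run.py filename and the flag-per-argument convention are assumptions for illustration, not bolt-lab's real layout; only the _builtin/_src/ location comes from the text above:

```python
import subprocess
import sys
from pathlib import Path

def build_command(src_root: Path, method: str, args: dict) -> list:
    """Assemble the child-process command line for one method's training
    script (script name and flag names are illustrative)."""
    script = src_root / "_builtin" / "_src" / method / "run.py"
    cmd = [sys.executable, str(script)]
    for key, value in args.items():
        cmd += [f"--{key}", str(value)]
    return cmd

def run_method(src_root: Path, method: str, args: dict) -> int:
    """Launch the script and surface its exit code to the grid runner."""
    return subprocess.run(build_command(src_root, method, args)).returncode
```

Wrapping each method as a separate process keeps mutually incompatible dependencies (e.g. the TF/Keras methods versus the PyTorch ones) from colliding in a single interpreter.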


7. Notes

  • Do not point --output-dir to the outputs/ directory itself. Point it to an experiment root so outputs/results/logs/ are created as subdirectories.
  • data/ and pretrained_models/ under output-dir are symlinks. Do not edit them directly.
  • If flash-attn installation fails: check torch.cuda.is_available(), CUDA version match, and disk space.

8. Updating the Package

cd /path/to/bolt-lab
# Edit version in pyproject.toml if needed
pip install -e .

Since bolt-lab is installed in editable mode (-e), code changes take effect immediately. Only re-run pip install -e . after changing pyproject.toml.

