BOLT: Benchmarking Open-world Learning for Text classification
BOLT-Lab
BOLT-Lab is a self-contained Python package for benchmarking open-world learning (OWL) in text classification. It wraps 19 baseline methods (11 GCD + 8 Open-set) via subprocess calls and provides a unified grid experiment runner.
1. Installation
Requirements
- Linux + NVIDIA GPU
- Python 3.10
- NVIDIA driver installed (`nvidia-smi` works)
Steps (run in order)
- Install bolt-lab
pip install bolt_lab
- Install PyTorch (CUDA 12.6 uses cu126)
pip install torch==2.8.0 torchvision==0.23.0 torchaudio==2.8.0 --index-url https://download.pytorch.org/whl/cu126
- Install NVCC (use conda only for this step)
conda install -c nvidia cuda-nvcc -y
- Install the remaining Python dependencies
pip install -r requirements.txt
- Install flash-attn (install separately to avoid build failures)
mkdir -p ~/tmp/pip
TMPDIR=~/tmp/pip pip install --no-build-isolation --no-cache-dir flash-attn==2.8.3
Quick self-check
python -c "import torch; print(torch.__version__); print(torch.cuda.is_available())"
python -c "from bolt_lab.methods import list_methods; print(list_methods())"
bolt-grid --help
2. Environment Variables
| Variable | Description | Example |
|---|---|---|
| BOLT_DATA_DIR | Path to BOLT datasets | /path/to/bolt/data |
| BOLT_PRETRAINED_MODELS | Path to pretrained models directory | /path/to/pretrained_models |
| BOLT_INTEGRATION | Set to 1 to run integration tests | 1 |
Set them in your shell (the pretrained-models path can also be supplied via --model-dir):
export BOLT_DATA_DIR=/path/to/bolt/data
export BOLT_PRETRAINED_MODELS=/path/to/pretrained_models
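A script consuming these variables might resolve them as in the following sketch. The function name and the CLI-over-environment precedence are illustrative assumptions, not documented behavior of bolt-lab:

```python
import os

def resolve_model_dir(cli_model_dir=None):
    """Return the pretrained-models directory: an explicit --model-dir value
    wins, otherwise fall back to BOLT_PRETRAINED_MODELS (assumed precedence)."""
    value = cli_model_dir or os.environ.get("BOLT_PRETRAINED_MODELS")
    if value is None:
        raise RuntimeError("set BOLT_PRETRAINED_MODELS or pass --model-dir")
    return value
```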
3. Usage
Initialize workspace
bolt-grid --init-only --output-dir ./bolt_workspace --model-dir /path/to/pretrained_models
This creates the directory structure and copies editable configs to ./bolt_workspace/configs/.
Run experiments
bolt-grid --config grid_gcd.yaml --output-dir ./bolt_workspace --model-dir /path/to/pretrained_models
Arguments
| Argument | Description |
|---|---|
| --config | Grid config YAML. Bare names are resolved from output-dir/configs/, then package builtins. |
| --output-dir | Working directory for all outputs/results/logs. |
| --model-dir | Pretrained models directory (bert-base-uncased, etc.). |
| --init-only | Initialize workspace only; do not run experiments. |
| --overwrite-configs | Re-copy config files from the package to output-dir. |
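The bare-name resolution order described for --config (workspace configs first, then package builtins) can be sketched as follows; the function name and exact lookup logic are illustrative, not the package's actual implementation:

```python
from pathlib import Path

def resolve_config(name, output_dir, builtin_dir):
    """Resolve a config argument: an existing path wins, then
    output_dir/configs/<name>, then the package's builtin configs."""
    candidate = Path(name)
    if candidate.exists():
        return candidate
    for root in (Path(output_dir) / "configs", Path(builtin_dir)):
        if (root / name).exists():
            return root / name
    raise FileNotFoundError(f"config not found: {name}")
```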
Typical workflow
# 1. Initialize and edit configs
bolt-grid --init-only --output-dir ./bolt_workspace --model-dir /path/to/pretrained_models
vim ./bolt_workspace/configs/grid_gcd.yaml
# 2. Run
bolt-grid --config grid_gcd.yaml --output-dir ./bolt_workspace --model-dir /path/to/pretrained_models
4. Grid Config Example
methods: [loop, glean, alup, geoid, sdc, dpn, deepaligned, tan]
datasets: [banking, clinc, stackoverflow]
result_file: summary_gcd
grid:
  known_cls_ratio: [0.25, 0.5, 0.75]
  labeled_ratio: [0.1, 0.5, 1.0]
  seeds: [2025]
  fold_types: [fold]
  fold_idxs: [0, 1, 2, 3, 4]
  fold_nums: [5]
  cluster_num_factor: [1.0]
run:
  gpus: [0, 1, 2, 3]
  max_workers: 4
  num_pretrain_epochs: 100
  num_train_epochs: 50
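Each experiment is one point in the Cartesian product of the grid lists, multiplied by methods and datasets; the fold_types, fold_nums, and cluster_num_factor axes above are singletons, so they do not grow the count. A quick way to estimate the size of the example grid (assuming full-product semantics, which the config layout suggests):

```python
from itertools import product

methods = ["loop", "glean", "alup", "geoid", "sdc", "dpn", "deepaligned", "tan"]
datasets = ["banking", "clinc", "stackoverflow"]
grid = {
    "known_cls_ratio": [0.25, 0.5, 0.75],
    "labeled_ratio": [0.1, 0.5, 1.0],
    "seeds": [2025],
    "fold_idxs": [0, 1, 2, 3, 4],
}

# One experiment per (method, dataset, grid point) combination.
runs = list(product(methods, datasets, *grid.values()))
print(len(runs))  # 8 * 3 * 3 * 3 * 1 * 5 = 1080
```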
5. Output Structure
After running with --output-dir ./bolt_workspace:
bolt_workspace/
├── configs/ # Editable YAML configs (safe to modify)
├── outputs/ # Training artifacts (models, predictions)
├── results/ # Result CSVs + _index.json (dedup index)
├── logs/ # Experiment logs
├── data -> ... # Symlink to dataset directory
└── pretrained_models -> ... # Symlink to model directory
Deduplication
Completed experiments are tracked in results/<task>/<method>/results.csv. Re-running the same grid config will automatically skip finished experiments based on matching method, dataset, known_cls_ratio, labeled_ratio, seed, and fold parameters.
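A minimal sketch of that skip logic; the CSV column names are assumptions based on the key fields listed above, not the package's actual schema:

```python
import csv
from pathlib import Path

# Fields assumed to identify one experiment in results.csv.
DEDUP_KEYS = ("method", "dataset", "known_cls_ratio", "labeled_ratio", "seed", "fold_idx")

def finished_keys(results_csv):
    """Load the dedup keys of already-completed experiments."""
    path = Path(results_csv)
    if not path.exists():
        return set()
    with path.open(newline="") as f:
        return {tuple(row[k] for k in DEDUP_KEYS) for row in csv.DictReader(f)}

def should_skip(run, done):
    """True if this run's key already appears in the results index."""
    return tuple(str(run[k]) for k in DEDUP_KEYS) in done
```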
6. Methods
GCD (11 methods)
| Name | Description |
|---|---|
| loop | KNN + SupConLoss + MLM pretrain |
| glean | KNN + DistillLoss + LLM cluster characterization |
| alup | Active Learning with LLM labeling |
| geoid | GeoID clustering |
| sdc | Self-paced Deep Clustering |
| dpn | Deep Pairwise Network |
| deepaligned | DeepAligned Clustering |
| tan | TAN method |
| tlsa | TLSA method |
| plm_gcd | PLM-based GCD |
| llm4openssl | Llama-based GCD (SFTTrainer + LoRA) |
Open-set (8 methods)
| Name | Description |
|---|---|
| ab | Adaptive Boundary |
| adb | Adaptive Decision Boundary |
| doc | DOC method |
| deepunk | DeepUnk (TF/Keras) |
| scl | Supervised Contrastive Learning (TF/Keras) |
| dyen | Dynamic Ensemble |
| knncon | KNN-Contrastive |
| unllm | Llama-based open-set (SFTTrainer + LoRA) |
All methods are subprocess wrappers. Training source code is bundled in _builtin/_src/.
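Dispatch to a bundled training script presumably resembles the following sketch; the script path and flag format are hypothetical, for illustration only:

```python
import subprocess
import sys

def run_method(script, hparams):
    """Launch one baseline's training script in a child process
    and return its exit code (illustrative, not the real wrapper)."""
    cmd = [sys.executable, script] + [f"--{k}={v}" for k, v in hparams.items()]
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.returncode
```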
7. Notes
- Do not point --output-dir at the outputs/ directory itself. Point it at an experiment root so outputs/, results/, and logs/ are created as subdirectories.
- data/ and pretrained_models/ under the output directory are symlinks; do not edit them directly.
- If flash-attn installation fails: check that torch.cuda.is_available() returns True, that the CUDA versions match, and that there is enough disk space.
8. Updating the Package
cd /path/to/bolt-lab
# Edit version in pyproject.toml if needed
pip install -e .
Since bolt-lab is installed in editable mode (-e), code changes take effect immediately. Only re-run pip install -e . after changing pyproject.toml.
File details
Details for the file bolt_lab-1.0.2.tar.gz.
File metadata
- Download URL: bolt_lab-1.0.2.tar.gz
- Upload date:
- Size: 26.9 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.11
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | fc87c6b121a78516f8d2ce6452a48683738fc273ecca90d5f44afa0d44df1d0a |
| MD5 | 86a43a1863d62b12ec6629d1dd6c8ac6 |
| BLAKE2b-256 | 8a854a97aeb2116f2306785045f969c9d2f5d69706fe03f7e09c5892e30f156e |
File details
Details for the file bolt_lab-1.0.2-py3-none-any.whl.
File metadata
- Download URL: bolt_lab-1.0.2-py3-none-any.whl
- Upload date:
- Size: 27.5 MB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.11
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 02138b84fc60f2bce4204ad8e90f601155501ee2faa99e495a59e2e6bcde2392 |
| MD5 | fb5d4d59a7fc05137ec52b65362b8189 |
| BLAKE2b-256 | cf137d1d901ad0fec82834a90f200dc08b5ae59578b1bd5633254526adad13d6 |