BCLM-PyTorch Documentation
bclm-pytorch is the official Python library for loading and running inference with BCLM language models.
Installation
pip install bclm-pytorch
The following dependencies are installed automatically: torch, safetensors, tokenmonster, and huggingface_hub.
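As a quick smoke test, the imports below should succeed once installation completes (nothing here is specific to bclm-pytorch beyond the package name):

import bclm
import torch

# If both imports succeed, the core stack is in place.
print("torch", torch.__version__)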
Quick Start
import bclm
# Load a model from Hugging Face
model = bclm.load("bclm-1-small-preview")
# Multi-turn chat
chat = model.chat(system_prompt="You are a helpful assistant.")
print(chat.send("What is 2+2?"))
print(chat.send("Why is that the case?"))
# Streaming chat (real-time token output)
for chunk in chat.send_stream("Tell me a short story about a cat."):
print(chunk, end="", flush=True)
print()
Loading Models
bclm.load() is the single entry point for all model sources.
From Hugging Face
# Short form — resolves to huggingface.co/bclm/bclm-1-small-preview
model = bclm.load("bclm-1-small-preview")
# Explicit repo ID
model = bclm.load("bclm/bclm-1-small-preview")
# Any HF repo
model = bclm.load("your-org/your-model")
From a Local Directory
Point to a directory containing config.json and model.safetensors:
model = bclm.load("/path/to/model/directory")
From a URL
Point to a directory served over HTTPS that contains config.json and model.safetensors:
model = bclm.load("https://example.com/models/bclm-1-small/")
Options
model = bclm.load(
    "bclm-1-small-preview",
    device="cuda",           # "cpu", "cuda", "cuda:0", etc. (default: auto)
    dtype=torch.bfloat16,    # torch.float16, torch.float32 (default: bf16 on GPU, f32 on CPU)
    compile=False,           # enable torch.compile for inference (default: False)
)
| Parameter | Default | Description |
|---|---|---|
| device | auto | "cpu", "cuda", or a specific device string. Auto-selects CUDA if available. |
| dtype | auto | torch.bfloat16 on CUDA, torch.float32 on CPU. Weights ship as float16 but were trained in bfloat16. |
| compile | False | Wrap the model with torch.compile. Off by default to avoid warmup latency. |
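For example, to pin a model to the CPU in float32, using only the documented options above:

import torch
import bclm

# Explicit placement; on a CUDA machine the defaults would pick the GPU.
model = bclm.load("bclm-1-small-preview", device="cpu", dtype=torch.float32)
print(model.device, model.dtype)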
Chat Interface
The chat interface supports multi-turn conversations with automatic history management.
Basic Usage
chat = model.chat(
    system_prompt="You are a helpful assistant.",  # optional
    max_new_tokens=512,
    temperature=1.0,
    top_k=50,
    top_p=0.9,
    repetition_penalty=1.15,
    frequency_penalty=0.1,
)
# Each call appends to the conversation history
response = chat.send("Hello, who are you?")
print(response)
follow_up = chat.send("Can you elaborate?")
print(follow_up)
Streaming
for chunk in chat.send_stream("Write a poem about the ocean."):
print(chunk, end="", flush=True)
print()
Per-Message Overrides
Override generation parameters for a single message without changing the session defaults:
response = chat.send(
    "Give me a one-word answer: is the sky blue?",
    max_new_tokens=10,
    temperature=0.1,
    repetition_penalty=1.0,  # disable for this message
    frequency_penalty=0.0,
)
History Management
# View conversation history
for msg in chat.messages:
    print(f"{msg['role']}: {msg['content'][:80]}")
# Clear history (keeps system prompt)
chat.clear()
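Because chat.messages is a plain list of {"role": ..., "content": ...} dicts, a transcript can be persisted directly; a minimal sketch, assuming the history contains only JSON-serializable values:

import json

# Save the conversation for later inspection.
with open("transcript.json", "w") as f:
    json.dump(chat.messages, f, indent=2)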
Interactive CLI
Launch a blocking interactive chat loop in the terminal:
chat.interactive()
Commands inside the interactive loop:
- /clear — reset conversation
- /history — print message list
- /help — show commands
- Ctrl-C — exit
Text Completion
For non-chat use cases (continuing a text prompt):
text = model.complete("Once upon a time", max_new_tokens=200, temperature=0.8)
print(text)
Streaming Completion
for chunk in model.complete_stream("The quick brown fox", max_new_tokens=100):
    print(chunk, end="", flush=True)
print()
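The streamed chunks should concatenate to the same text a blocking call returns, so they can also be collected and joined:

chunks = list(model.complete_stream("The quick brown fox", max_new_tokens=100))
print("".join(chunks))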
Generation Parameters
All generation methods accept these parameters:
| Parameter | Default | Description |
|---|---|---|
| max_new_tokens | 512 | Maximum number of tokens to generate. |
| temperature | 1.0 | Sampling temperature. 0 = greedy, higher = more random. |
| top_k | 50 | Restrict sampling to the top-k most likely tokens. None to disable. |
| top_p | 0.9 | Nucleus sampling threshold. None to disable. |
| repetition_penalty | 1.15 | Multiplicative penalty applied to every token already present in the context. Positive logits are divided by the penalty; negative logits are multiplied, so both directions make the token less likely. 1.0 disables. Defaults are tuned for small language models, which are more prone to degenerate repetition. |
| frequency_penalty | 0.1 | Additive penalty proportional to how many times each token has appeared. Logits are reduced by frequency_penalty × count, discouraging high-frequency tokens more strongly than low-frequency ones. 0.0 disables. |
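To make the two penalties concrete, here is a minimal standalone sketch of the logit adjustment described in the table (this is not the library's internal code; the vocabulary and values are illustrative):

import torch

def apply_penalties(logits, generated_ids, repetition_penalty=1.15, frequency_penalty=0.1):
    # Tokens already generated, with their occurrence counts.
    ids, counts = torch.unique(torch.tensor(generated_ids), return_counts=True)
    seen = logits[ids]
    # Repetition penalty: divide positive logits, multiply negative ones.
    # Both directions make the token less likely; 1.0 is a no-op.
    logits[ids] = torch.where(seen > 0, seen / repetition_penalty, seen * repetition_penalty)
    # Frequency penalty: subtract penalty * count, so frequent tokens
    # are pushed down harder; 0.0 is a no-op.
    logits[ids] -= frequency_penalty * counts.to(logits.dtype)
    return logits

# Toy vocabulary of 5 tokens; token 2 has appeared twice, token 4 once.
logits = torch.tensor([1.0, -0.5, 2.0, 0.3, -1.0])
print(apply_penalties(logits, [2, 4, 2]))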
Model Information
model = bclm.load("bclm-1-small-preview")
# Architecture name (e.g., "BCLM1Model")
print(model.architecture)
# Model config object
print(model.config)
# Parameter count
print(f"{model.num_parameters / 1e6:.1f}M parameters")
# Device and dtype
print(model.device, model.dtype)
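One back-of-envelope use of these attributes is estimating weight memory (a sketch; it counts parameters only, not activations or any KV cache):

import torch

bytes_per_param = torch.finfo(model.dtype).bits // 8
print(f"~{model.num_parameters * bytes_per_param / 1e6:.0f} MB of weights")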
Advanced: Direct Access
For advanced use cases, you can access the underlying PyTorch modules:
model = bclm.load("bclm-1-small-preview")
# Raw nn.Module (e.g., BCLM1Model)
raw = model.raw_model
# Inference wrapper (e.g., BCLM1ForGeneration)
gen = model.generator
# Tokenizer
tok = model.tokenizer
Tokenizer
The current tokenizer backend is TokenMonster. The tokenizer spec is embedded in each model's config.json (e.g., "tokenmonster:english-32000-consistent-v1"), so the correct tokenizer is loaded automatically.
tok = model.tokenizer
ids = tok.encode("Hello world")
text = tok.decode(ids)
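One practical use is counting prompt tokens before generation. The round trip below also sanity-checks the vocab; TokenMonster's "consistent" vocabularies are designed to decode back to the original text:

prompt = "The quick brown fox jumps over the lazy dog."
ids = tok.encode(prompt)
print(len(ids), "tokens")
print(tok.decode(ids) == prompt)  # expected: True for a lossless round trip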
Environment Variables
| Variable | Description |
|---|---|
| BCLM_TOKENIZER | Override the tokenizer spec (e.g., tokenmonster:english-32000-consistent-v1). |
| BCLM_TOKENMONSTER_DIR | Custom cache directory for TokenMonster vocab files. |
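Set these in the environment before loading; a sketch, assuming the variables are read at load time (the cache path is illustrative):

import os

os.environ["BCLM_TOKENIZER"] = "tokenmonster:english-32000-consistent-v1"
os.environ["BCLM_TOKENMONSTER_DIR"] = "/tmp/tokenmonster-cache"

import bclm
model = bclm.load("bclm-1-small-preview")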
Config Format
Model directories must contain a config.json:
{
  "architecture": "BCLM1Model",
  "model": {
    "vocab_size": 32768,
    "tokenizer": "tokenmonster:english-32000-consistent-v1",
    "embed_dim": 384,
    "n_layers": 12,
    "max_seq_len": 16384,
    "dropout": 0.0,
    "attn_heads": 6,
    "attn_kv_heads": 2,
    "local_attn_layers": [1, 5, 7, 11],
    "global_attn_layers": [3, 9],
    "attn_window_size": 1024,
    "conv_kernel_size": 4,
    "osc_n_pairs": 1,
    "osc_n_real": 16,
    "osc_clamp_min_decay": 1e-05,
    "bigram_table_factor": 5
  }
}
The "architecture" field determines which model class is instantiated. Weights should be in model.safetensors (safetensors format).
Error Handling
import bclm
try:
    model = bclm.load("nonexistent-model")
except FileNotFoundError:
    print("Model not found")
except ValueError as e:
    print(f"Invalid model: {e}")
except ImportError as e:
    print(f"Missing dependency: {e}")
Requirements
- Python ≥ 3.9
- PyTorch ≥ 2.1
- safetensors
- tokenmonster
- huggingface_hub
Optional:
- xformers — enables optimized attention kernels on CUDA