Wiola13M: A 12.9M parameter decoder-only language model featuring Gated Spiral Attention, Spiral RoPE, and Butterfly MLP.
Wiola
Gated Spiral Attention — a small language model built for the 10–100M parameter regime
Spiral RoPE · content-adaptive attention gating · Butterfly MLP
Wiola is a decoder-only small language model whose novelty lives entirely in two sub-components of every layer. It is designed to run on a laptop, train on a single consumer GPU in hours, and publish to the Hugging Face Hub — yet be architecturally distinct enough to serve as a real experimental baseline.
| Variant | d | L | H | d_inner | Params |
|---|---|---|---|---|---|
| Nano | 256 | 6 | 8 | 512 | ~12.9M |
| Micro | 384 | 8 | 12 | 768 | ~40M |
| Small | 512 | 12 | 16 | 1024 | ~90M |
What's novel
- Spiral Rotary Positional Encoding. Standard RoPE frequencies are perturbed by a sqrt-growing, per-dimension-pair factor so phase trajectories fan outward instead of staying collinear, improving long-range discrimination at zero added parameters. Setting spiral_alpha=0.0 recovers standard RoPE exactly.
- Gated Spiral Attention (GSA). A per-head, content-adaptive scalar gate, derived causally from a cumulative mean of the query projections, modulates attention scores before softmax. Heads that don't help suppress themselves, which gives implicit soft head pruning with no sparsity loss. The gate adds 2·H·d_h + H² params (a few hundred for Nano) and is fully KV-cache compatible.
- Butterfly MLP. A multiplicative feed-forward block, SiLU(a) ⊙ b, plus an intra-block bypass W_bypass·x. With d_inner = 2d it matches a GeLU 4× FFN in parameter count while providing SwiGLU-class gating and steadier gradients in shallow stacks.
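The exact Spiral RoPE perturbation is defined in docs/ARCHITECTURE.md and is not reproduced in this README. As a rough illustration only, the sketch below assumes one possible reading of the description: each pair frequency is scaled by 1 + alpha·sqrt(i / n_pairs), where i is the dimension-pair index. The function name spiral_frequencies, the base of 10000, and the scaling formula itself are assumptions, not the package's implementation. The only property it is meant to show is the one stated above: alpha = 0 falls back to the standard RoPE frequencies.

```python
import math


def spiral_frequencies(head_dim: int, spiral_alpha: float, base: float = 10000.0) -> list[float]:
    """Hypothetical per-pair rotation frequencies (assumed form, not Wiola's code)."""
    if head_dim % 2:
        raise ValueError("head_dim must be even")
    n_pairs = head_dim // 2
    freqs = []
    for i in range(n_pairs):
        # Standard RoPE frequency for dimension pair i.
        standard = base ** (-2.0 * i / head_dim)
        # ASSUMED sqrt-growing, per-pair factor; equals 1.0 when spiral_alpha is 0.
        factor = 1.0 + spiral_alpha * math.sqrt(i / n_pairs)
        freqs.append(standard * factor)
    return freqs


standard = spiral_frequencies(32, spiral_alpha=0.0)
spiral = spiral_frequencies(32, spiral_alpha=0.1)
```

Because the factor is applied to the frequencies and not stored as weights, a scheme of this shape adds no parameters, which matches the claim in the list above.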
See docs/ARCHITECTURE.md for the full math.
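The parameter claims above can be checked with a few lines of arithmetic. The sketch below evaluates the gate formula 2·H·d_h + H² for each variant in the table and compares a Butterfly MLP at d_inner = 2d with a 4× GeLU FFN. The helper names are hypothetical, d_h is taken to be d / H, and all counts ignore biases. The Butterfly count assumes that W_a, W_b, and W_bypass each map d to d_inner, with a single output projection back to d; the README does not spell out those shapes, so treat that as a guess that happens to reproduce the stated parity.

```python
def gate_params(n_heads: int, d_model: int) -> int:
    """Gate parameter count from 2*H*d_h + H^2, ASSUMING d_h = d / H."""
    d_head = d_model // n_heads
    return 2 * n_heads * d_head + n_heads ** 2


def butterfly_mlp_params(d_model: int, d_inner: int) -> int:
    """Bias-free count, ASSUMING W_a, W_b, W_bypass: d -> d_inner and W_out: d_inner -> d."""
    return 3 * d_model * d_inner + d_inner * d_model


def gelu_ffn_params(d_model: int, expansion: int = 4) -> int:
    """Bias-free count of a two-matrix FFN with the given expansion factor."""
    return 2 * d_model * expansion * d_model


# (d, L, H, d_inner) copied from the variant table.
variants = {
    "Nano": (256, 6, 8, 512),
    "Micro": (384, 8, 12, 768),
    "Small": (512, 12, 16, 1024),
}

for name, (d, n_layers, n_heads, d_inner) in variants.items():
    print(name, gate_params(n_heads, d), butterfly_mlp_params(d, d_inner), gelu_ffn_params(d))

# Rough Nano total, ASSUMING a 32,000-token tied embedding and four d x d
# attention projections per layer; norms and gate weights are ignored.
d, n_layers, n_heads, d_inner = variants["Nano"]
nano_estimate = 32000 * d + n_layers * (4 * d * d + butterfly_mlp_params(d, d_inner))
print("Nano estimate:", nano_estimate)
```

Under these assumptions the gate formula gives 576 for Nano, and the rough Nano total comes to 12,910,592, in line with the ~12.9M listed in the table.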
Install
# from source (recommended while pre-1.0)
git clone https://github.com/wiola-project/wiola.git
cd wiola
pip install -e .
# with training / hub extras
pip install -e ".[train,hub]"
From PyPI once published:
pip install wiola
Version note: the model uses the modern transformers Cache API. The dependency is pinned to transformers>=4.40,<4.46, the range this release is tested against.
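If you want to confirm that an environment falls inside that range before debugging anything else, a check along these lines may help. It is a minimal sketch using only the standard library; the function names are hypothetical and the comparison looks only at the major and minor version numbers.

```python
from importlib import metadata


def in_tested_range(version: str) -> bool:
    """True if a version string satisfies >=4.40,<4.46 (major.minor only)."""
    major, minor = (int(part) for part in version.split(".")[:2])
    return (4, 40) <= (major, minor) < (4, 46)


def installed_transformers_ok() -> bool:
    """Check the installed transformers package, if there is one."""
    try:
        return in_tested_range(metadata.version("transformers"))
    except metadata.PackageNotFoundError:
        return False


print(in_tested_range("4.45.2"), in_tested_range("4.46.0"))
```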
Quickstart
import torch
from wiola13m import WiolaConfig, WiolaForCausalLM
model = WiolaForCausalLM(WiolaConfig()) # Wiola Nano, random init
ids = torch.randint(0, 32000, (1, 16))
out = model(input_ids=ids, labels=ids) # forward + LM loss
out.loss.backward() # gradients flow
model.eval()
gen = model.generate(ids[:, :4], max_new_tokens=20, do_sample=False)
Or run the bundled smoke test:
python scripts/quickstart.py
Train on TinyStories
# 1) get a 32k tokenizer (reuse a LLaMA tokenizer, or train your own)
python examples/create_tokenizer.py reuse --source meta-llama/Llama-2-7b-hf --out ./wiola-tokenizer
# 2) pre-train Nano (~2h/epoch on an RTX 3090)
python examples/train_tinystories.py \
--tokenizer ./wiola-tokenizer \
--output-dir ./wiola-nano-tinystories \
--max-steps 20000
# 3) generate
python examples/generate.py --model ./wiola-nano-tinystories --prompt "Once upon a time"
Publish to the Hugging Face Hub
Wiola ships with auto_map support, so anyone can load your model without
installing this package:
huggingface-cli login
python examples/push_to_hub.py --model-dir ./wiola-nano-tinystories --repo-id your-name/wiola-nano
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("your-name/wiola-nano", trust_remote_code=True)
tok = AutoTokenizer.from_pretrained("your-name/wiola-nano")
If the wiola package is installed, the "wiola" architecture is auto-registered
and you don't even need trust_remote_code=True.
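For context, auto_map is the transformers convention that lets from_pretrained(..., trust_remote_code=True) locate custom classes inside the Hub repo. A published config.json might contain an entry like the one sketched below. The exact contents written by push_to_hub.py are an assumption here, inferred from the file and class names that appear in this README.

```python
import json

# ASSUMED config.json fragment. The module names come from the project layout
# (configuration_wiola.py, modeling_wiola.py) and the class names from the
# Quickstart; this is not copied from a real export.
config_fragment = {
    "model_type": "wiola",
    "auto_map": {
        "AutoConfig": "configuration_wiola.WiolaConfig",
        "AutoModelForCausalLM": "modeling_wiola.WiolaForCausalLM",
    },
}

print(json.dumps(config_fragment, indent=2))
```

With the package installed, registration of this kind is usually done through AutoConfig.register and AutoModelForCausalLM.register; whether wiola's __init__.py does exactly that is not confirmed here.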
Design decision: gate input
The design doc's figure feeds the gate from post-RoPE queries, while the prose
describes it as content-adaptive. Wiola defaults to computing the gate from the
pre-RoPE query projections (gate_pre_rope=True) — position-independent and
numerically stable — and exposes gate_pre_rope=False to match the figure. Both are
causally correct and KV-cache safe.
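To see why a gate driven by a cumulative mean can be both causal and cache-friendly, consider the toy sketch below. It is pure Python on small lists, not the package's tensor code, and it deliberately stops at the mean: how Wiola turns that mean into a per-head gate is left to docs/ARCHITECTURE.md. The names prefix_means and RunningMean are hypothetical.

```python
def prefix_means(queries: list[list[float]]) -> list[list[float]]:
    """Full-sequence pass: mean of q_1..q_t for every position t."""
    out = []
    running = [0.0] * len(queries[0])
    for t, q in enumerate(queries, start=1):
        running = [r + x for r, x in zip(running, q)]
        out.append([r / t for r in running])
    return out


class RunningMean:
    """Cached decoding: keeps only a running sum and a step counter."""

    def __init__(self, dim: int) -> None:
        self.total = [0.0] * dim
        self.steps = 0

    def update(self, q: list[float]) -> list[float]:
        self.total = [s + x for s, x in zip(self.total, q)]
        self.steps += 1
        return [s / self.steps for s in self.total]


queries = [[1.0, 2.0], [3.0, 0.0], [5.0, 4.0]]
full = prefix_means(queries)

cache = RunningMean(dim=2)
stepwise = [cache.update(q) for q in queries]
print(full == stepwise)
```

The cached path needs only one vector and one counter per head, and position t never reads anything after t, which is the sense in which both gate inputs described above stay causal.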
Tests
pip install -e ".[dev]"
pytest
The suite verifies output shapes, weight tying, strict causality (no future-token leakage), exact equivalence between cached step-by-step decoding and a single full-sequence forward (with and without the gate), save/reload round-trips, and greedy/sampling/batched/beam generation.
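The causality check in particular follows a simple pattern: perturb one position and confirm that no earlier output moves. The sketch below shows that pattern on plain Python sequences. It is an illustration of the idea, not the code in tests/, and is_causal, prefix_sum, and centered are hypothetical names.

```python
from typing import Callable, Sequence


def is_causal(
    fn: Callable[[Sequence[float]], Sequence[float]],
    seq: Sequence[float],
    bump: float = 1.0,
) -> bool:
    """Return True if perturbing position t never changes outputs before t."""
    base = list(fn(seq))
    for t in range(len(seq)):
        changed = list(seq)
        changed[t] += bump
        if list(fn(changed))[:t] != base[:t]:
            return False
    return True


def prefix_sum(xs: Sequence[float]) -> list[float]:
    """Causal toy function: output t depends only on inputs up to t."""
    out, total = [], 0.0
    for x in xs:
        total += x
        out.append(total)
    return out


def centered(xs: Sequence[float]) -> list[float]:
    """Non-causal toy function: every output depends on the full-sequence mean."""
    mean = sum(xs) / len(xs)
    return [x - mean for x in xs]


print(is_causal(prefix_sum, [1.0, 2.0, 3.0]), is_causal(centered, [1.0, 2.0, 3.0]))
```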
Project layout
wiola/
├── src/wiola/
│ ├── configuration_wiola.py # WiolaConfig
│ ├── modeling_wiola.py # Spiral RoPE, GSA, Butterfly MLP, decoder, CausalLM
│ └── __init__.py # Auto* registration
├── examples/ # train / generate / tokenizer / push_to_hub
├── scripts/quickstart.py
├── tests/
└── docs/ARCHITECTURE.md
Citation
If you use Wiola, please cite it (see CITATION.cff).
License
Apache 2.0 — see LICENSE.