Wiola13M: A 12.9M parameter decoder-only language model featuring Gated Spiral Attention, Spiral RoPE, and Butterfly MLP.

Wiola

Gated Spiral Attention — a small language model built for the 10–100M parameter regime
Spiral RoPE · content-adaptive attention gating · Butterfly MLP

Wiola is a decoder-only small language model whose novelty lives entirely in two sub-components of every layer: the attention block (Spiral RoPE plus a content-adaptive gate) and the feed-forward block (Butterfly MLP). It is designed to run on a laptop, train on a single consumer GPU in hours, and publish to the Hugging Face Hub, while remaining architecturally distinct enough to serve as a real experimental baseline.

Variant   d (hidden size)   L (layers)   H (heads)   d_inner   Params
Nano      256               6            8           512       ~12.9M
Micro     384               8            12          768       ~40M
Small     512               12           16          1024      ~90M

What's novel

  1. Spiral Rotary Positional Encoding. Standard RoPE frequencies are perturbed by a sqrt-growing, per-dimension-pair factor so phase trajectories fan outward instead of staying collinear — improving long-range discrimination at zero added parameters. Setting spiral_alpha=0.0 recovers standard RoPE exactly.

  2. Gated Spiral Attention (GSA). A per-head, content-adaptive scalar gate, derived causally from a cumulative mean of the query projections, modulates attention scores before softmax. Heads that don't help self-suppress — implicit soft head pruning with no sparsity loss. The gate adds 2·H·d_h + H² params (a few hundred for Nano) and is fully KV-cache compatible.

  3. Butterfly MLP. A multiplicative feed-forward block, SiLU(a) ⊙ b, plus an intra-block bypass W_bypass·x. With d_inner = 2d it matches a GeLU 4× FFN in parameter count while providing SwiGLU-class gating and steadier gradients in shallow stacks.
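As a rough illustration of item 1, the sketch below computes rotary angles in plain Python. The exact perturbation Wiola uses is defined in docs/ARCHITECTURE.md and is not reproduced here. The factor `1 + alpha * sqrt(pos) * (i + 1) / n_pairs` is an assumption, chosen only because it grows with the square root of the position and differs for each dimension pair. The names `spiral_rope_angles`, `rotate_pair`, and `base` are hypothetical. The one property taken from the text is that `alpha = 0.0` gives standard RoPE.

```python
import math

def spiral_rope_angles(pos, head_dim, alpha=0.0, base=10000.0):
    """Rotation angle for each dimension pair at position `pos`.

    ASSUMPTION: the perturbation 1 + alpha * sqrt(pos) * (i + 1) / n_pairs is
    illustrative only; Wiola's actual formula may differ.
    """
    n_pairs = head_dim // 2
    angles = []
    for i in range(n_pairs):
        freq = base ** (-2.0 * i / head_dim)        # standard RoPE frequency
        spiral = 1.0 + alpha * math.sqrt(pos) * (i + 1) / n_pairs
        angles.append(pos * freq * spiral)
    return angles

def rotate_pair(x0, x1, angle):
    """Apply a 2-D rotation to one (x0, x1) dimension pair."""
    c, s = math.cos(angle), math.sin(angle)
    return x0 * c - x1 * s, x0 * s + x1 * c

# With alpha = 0.0 the spiral factor is exactly 1, so the angles are plain RoPE.
plain = spiral_rope_angles(5, head_dim=8, alpha=0.0)
spiral = spiral_rope_angles(5, head_dim=8, alpha=0.1)
print(plain)
print(spiral)
```

With a positive `alpha`, each pair's angle pulls away from its standard-RoPE value as the position grows. This is one way to read "fan outward" in the description above.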

See docs/ARCHITECTURE.md for the full math.
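The parameter-count statements in items 2 and 3 can be checked with simple arithmetic. The sketch below rests on several assumptions that the text does not confirm: that d_h = d / H, that the gate formula counts one attention layer, and that the Butterfly MLP consists of four bias-free d × d_inner matrices (the a and b branches, the bypass, and the output projection). The function names are hypothetical.

```python
def gate_params(d, n_heads):
    """Gate parameters from the formula 2*H*d_h + H^2.

    ASSUMPTION: d_h = d / H and the count applies to a single layer.
    """
    d_h = d // n_heads
    return 2 * n_heads * d_h + n_heads ** 2

def gelu_ffn_params(d, mult=4):
    """Bias-free up- and down-projection of a standard FFN with a 4x hidden size."""
    return 2 * d * (mult * d)

def butterfly_mlp_params(d, d_inner):
    """ASSUMPTION: four bias-free d x d_inner matrices
    (branch a, branch b, the bypass, and the output projection)."""
    return 4 * d * d_inner

# (d, H, d_inner) for each variant, taken from the table above.
variants = {"Nano": (256, 8, 512), "Micro": (384, 12, 768), "Small": (512, 16, 1024)}
for name, (d, h, d_inner) in variants.items():
    print(name, gate_params(d, h), butterfly_mlp_params(d, d_inner), gelu_ffn_params(d))
```

Under these assumptions Nano has 576 gate parameters per layer, in line with "a few hundred", and the Butterfly MLP with d_inner = 2d has the same weight count as the 4× GeLU FFN.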

Install

# from source (recommended while pre-1.0)
git clone https://github.com/wiola-project/wiola.git
cd wiola
pip install -e .

# with training / hub extras
pip install -e ".[train,hub]"

From PyPI:

pip install wiola13m

Version note: the model uses the modern transformers Cache API, so the dependency is pinned to transformers>=4.40,<4.46, the range this release is tested against.
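If you want to confirm that an existing environment falls inside that range before installing, a small check along these lines may help. It is a minimal sketch: `in_tested_range` is a hypothetical helper, not part of the package, and it compares only the major and minor components, so pre-release suffixes are ignored.

```python
from importlib.metadata import PackageNotFoundError, version

def in_tested_range(ver, low=(4, 40), high=(4, 46)):
    """True if `ver` satisfies >=4.40,<4.46, comparing major.minor only."""
    major, minor = (int(part) for part in ver.split(".")[:2])
    return low <= (major, minor) < high

try:
    installed = version("transformers")
    print(installed, "in tested range:", in_tested_range(installed))
except PackageNotFoundError:
    print("transformers is not installed")
```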

Quickstart

import torch
from wiola13m import WiolaConfig, WiolaForCausalLM

model = WiolaForCausalLM(WiolaConfig())          # Wiola Nano, random init
ids = torch.randint(0, 32000, (1, 16))

out = model(input_ids=ids, labels=ids)           # forward + LM loss
out.loss.backward()                              # gradients flow

model.eval()
gen = model.generate(ids[:, :4], max_new_tokens=20, do_sample=False)

Or run the bundled smoke test:

python scripts/quickstart.py

Train on TinyStories

# 1) get a 32k tokenizer (reuse a LLaMA tokenizer, or train your own)
python examples/create_tokenizer.py reuse --source meta-llama/Llama-2-7b-hf --out ./wiola-tokenizer

# 2) pre-train Nano (~2h/epoch on an RTX 3090)
python examples/train_tinystories.py \
    --tokenizer ./wiola-tokenizer \
    --output-dir ./wiola-nano-tinystories \
    --max-steps 20000

# 3) generate
python examples/generate.py --model ./wiola-nano-tinystories --prompt "Once upon a time"

Publish to the Hugging Face Hub

Wiola ships with auto_map support, so anyone can load your model without installing this package:

huggingface-cli login
python examples/push_to_hub.py --model-dir ./wiola-nano-tinystories --repo-id your-name/wiola-nano

Then load it with transformers:

from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("your-name/wiola-nano", trust_remote_code=True)
tok = AutoTokenizer.from_pretrained("your-name/wiola-nano")

If the wiola package is installed, the "wiola" architecture is auto-registered and you don't even need trust_remote_code=True.

Design decision: gate input

The design doc's figure feeds the gate from post-RoPE queries, while the prose describes it as content-adaptive. Wiola defaults to computing the gate from the pre-RoPE query projections (gate_pre_rope=True) — position-independent and numerically stable — and exposes gate_pre_rope=False to match the figure. Both are causally correct and KV-cache safe.
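To see why a gate driven by a cumulative mean is both causal and cache-friendly, consider the toy sketch below. It covers only the cumulative-mean step, using plain Python lists. The gate's projection and nonlinearity are left out, and the names `cumulative_mean` and `RunningMean` are hypothetical. Whether Wiola keeps a running sum in its cache this way is an assumption.

```python
def cumulative_mean(queries):
    """Full-sequence pass: mean of queries[0..t] for every position t.

    `queries` is a list of equal-length vectors, one per position.
    """
    running = [0.0] * len(queries[0])
    means = []
    for t, q in enumerate(queries, start=1):
        running = [r + x for r, x in zip(running, q)]
        means.append([r / t for r in running])
    return means

class RunningMean:
    """Incremental version: the state a cache could carry between decode steps."""

    def __init__(self, dim):
        self.total = [0.0] * dim
        self.count = 0

    def update(self, q):
        self.total = [r + x for r, x in zip(self.total, q)]
        self.count += 1
        return [r / self.count for r in self.total]

qs = [[1.0, 2.0], [3.0, 0.0], [5.0, 4.0]]
full = cumulative_mean(qs)
state = RunningMean(dim=2)
stepwise = [state.update(q) for q in qs]

assert full == stepwise                     # step-by-step decoding matches the full pass
assert cumulative_mean(qs[:2]) == full[:2]  # later tokens never change earlier means
```

The mean at position t depends only on positions up to t, so a step-by-step decode that carries the running sum and count reproduces the full-sequence result.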

Tests

pip install -e ".[dev]"
pytest

The suite verifies output shapes, weight tying, strict causality (no future-token leakage), exact equivalence between cached step-by-step decoding and a single full-sequence forward (with and without the gate), save/reload round-trips, and greedy/sampling/batched/beam generation.

Project layout

wiola/
├── src/wiola/
│   ├── configuration_wiola.py   # WiolaConfig
│   ├── modeling_wiola.py        # Spiral RoPE, GSA, Butterfly MLP, decoder, CausalLM
│   └── __init__.py              # Auto* registration
├── examples/                    # train / generate / tokenizer / push_to_hub
├── scripts/quickstart.py
├── tests/
└── docs/ARCHITECTURE.md

Citation

If you use Wiola, please cite it (see CITATION.cff).

License

Apache 2.0 — see LICENSE.

Download files

Source Distribution

wiola13m-1.0.0.tar.gz (24.3 kB)

  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.9.13

Hashes for wiola13m-1.0.0.tar.gz
Algorithm Hash digest
SHA256 4c6e43a9d68da5044241d826bbb565208068f022c6d28779742e98b512f34da6
MD5 7e4d182bf7030f1a3a355d8351f413a5
BLAKE2b-256 9047b56ec999351c945448c25893544896abe278a282b14c786f52b6ef5a948b

Built Distribution

wiola13m-1.0.0-py3-none-any.whl (20.0 kB)

  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.9.13

Hashes for wiola13m-1.0.0-py3-none-any.whl
Algorithm Hash digest
SHA256 b9e7677d2ad49377d0d42ea13c01214b84269b45c0065cbb7aebfdc35ecf9e25
MD5 4d5e436da6a1753a3fb80cf828780d59
BLAKE2b-256 b1ee478610a77707efb38bcf38d53bf52b8c28dad7e6f6c35136c15948e8ec53
