Genesis architecture (PyTorch) and utilities for inference/benchmarking.
license: apache-2.0
language:
- en
pipeline_tag: text-generation
tags:
- pytorch
- safetensors
- text-generation
- small-llm
- custom-architecture
- linear-attention
- gated-deltanet
- test-time-training
- hybrid-attention
- research
library_name: genesis-llm
datasets:
- HuggingFaceTB/smol-smoltalk
base_model: []
Table of Contents
- Overview
- Model Summary
- Architecture Deep Dive
- Comparison with Other Architectures
- Training Details
- Usage
- Benchmarks
- Limitations
- Citation
- License
Overview
Genesis-152M-Instruct is an experimental small language model that combines recent advances in efficient attention mechanisms into a single architecture. It serves as a research platform for exploring:
- Hybrid attention: Mixing O(n) linear attention with O(n²) softmax attention
- Efficient inference: Sub-quadratic complexity for most layers
- Adaptive computation: Test-time training for dynamic model adaptation
⚠️ Experimental Model: This is a research artifact, not a production-ready model. It demonstrates architectural innovations but has limitations typical of small models.
Model Summary
| Property | Value |
|---|---|
| Parameters | 151.8M total (~122.8M non-embedding) |
| Architecture | Hybrid GLA + FoX Attention |
| Context Length | 2,048 tokens |
| Vocab Size | 50,279 (GPT-NeoX + ChatML tokens) |
| Pre-training Data | 2B tokens |
| SFT Dataset | smol-smoltalk |
| License | Apache 2.0 |
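The non-embedding figure can be checked from the vocabulary size above and the hidden size of 576 listed under Core Configuration, given that the embedding matrix is tied with the LM head and so counted once. A quick arithmetic check (variable names are illustrative, not part of the library):

```python
# Sanity check of the parameter split, using values stated in this model card.
VOCAB_SIZE = 50_279
HIDDEN_SIZE = 576        # from the Core Configuration table
TOTAL_PARAMS = 151.8e6   # rounded total reported in the summary table

# Tied embeddings: a single vocab x hidden matrix shared by input and output.
embedding_params = VOCAB_SIZE * HIDDEN_SIZE
non_embedding_params = TOTAL_PARAMS - embedding_params

print(f"embedding:     {embedding_params / 1e6:.1f}M")
print(f"non-embedding: {non_embedding_params / 1e6:.1f}M")
```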
Files in this Repository
├── genesis_152m_instruct.safetensors # Model weights
├── README.md # This model card
└── LICENSE # Apache 2.0
Architecture Deep Dive
Genesis follows a "deep-and-thin" design philosophy inspired by SmolLM2 and MobileLLM, which has proven effective for small language models.
Core Configuration
| Component | Value | Rationale |
|---|---|---|
| Layers | 30 | Deep architecture for better representation |
| Hidden Size | 576 | Optimal width for 150M scale |
| Attention Heads | 9 | Query heads |
| KV Heads | 3 | 3:1 GQA ratio for memory efficiency |
| Head Dimension | 64 | Standard for efficient attention |
| FFN Size | 1,440 | 2.5× expansion (SwiGLU-efficient) |
| Weight Tying | ✓ | Embeddings tied with LM head |
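The values in this table are tied together by a few simple relations. The sketch below records them in a small dataclass and checks them; `CoreConfig` is a hypothetical container for illustration and is not the library's `GenesisConfig`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CoreConfig:
    """Illustrative container for the table above (not the library's config class)."""
    n_layers: int = 30
    hidden_size: int = 576
    n_heads: int = 9
    n_kv_heads: int = 3
    head_dim: int = 64
    ffn_size: int = 1440

    def check(self) -> None:
        # Query heads times head dimension must reproduce the hidden size.
        assert self.n_heads * self.head_dim == self.hidden_size
        # GQA needs a whole number of query heads per KV head.
        assert self.n_heads % self.n_kv_heads == 0


cfg = CoreConfig()
cfg.check()
gqa_group = cfg.n_heads // cfg.n_kv_heads        # query heads sharing one KV head
ffn_expansion = cfg.ffn_size / cfg.hidden_size   # FFN width relative to hidden size
print(gqa_group, ffn_expansion)
```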
Hybrid Attention Layout
Genesis employs a hybrid attention layout inspired by Qwen3-Next, alternating between linear and full attention:
Layer Distribution (30 layers):
├── 23 layers: GLA (Gated DeltaNet) - O(n) linear attention
└── 7 layers: FoX (Forgetting Attention) - O(n²) softmax with forget gate
Ratio: ~77% linear / ~23% full attention (23:7, close to the nominal 75/25 split)
Why hybrid? Pure linear attention struggles with precise retrieval tasks (e.g., copying, in-context learning). Interleaving full attention layers restores this capability while maintaining overall efficiency.
📖 Reference: The hybrid approach is validated by Qwen3-Next (2025) and research showing that 3:1 to 6:1 linear-to-full ratios optimize the efficiency-quality tradeoff.
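This card gives the layer counts but not the exact positions of the full attention layers. As a minimal sketch, assuming one FoX layer closes every group of four (an assumption made here only for illustration), the stated 23/7 split falls out directly:

```python
N_LAYERS = 30
FULL_ATTENTION_EVERY = 4  # assumption: every 4th layer uses full (FoX) attention


def build_layout(n_layers: int = N_LAYERS, every: int = FULL_ATTENTION_EVERY) -> list:
    """Return a per-layer list of attention types, 'gla' or 'fox'."""
    return ["fox" if (i + 1) % every == 0 else "gla" for i in range(n_layers)]


layout = build_layout()
n_fox = layout.count("fox")
n_gla = layout.count("gla")
print(n_gla, n_fox, f"{n_gla / len(layout):.0%} linear")
```

Any placement with the same counts gives the same overall cost profile; the even spacing only keeps a full attention layer within a few layers of every linear one.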
Gated DeltaNet (GLA)
The primary attention mechanism (75% of layers) is Gated DeltaNet, a state-of-the-art O(n) linear attention mechanism from NVIDIA.
Key Features
| Feature | Description | Paper Reference |
|---|---|---|
| Delta Rule | Online learning rule for recurrent state updates | Schlag et al., 2021 |
| Gated Forget | Mamba-style data-dependent forgetting | Gu & Dao, 2023 |
| Short Convolution | 1D conv on Q, K, V for local context | Gu et al., 2022 |
| L2 QK-Norm | Stabilizes attention scores | Standard practice |
Mathematical Formulation
The delta rule update enables the model to selectively write to and erase from a recurrent state:
S_t = α_t * S_{t-1} + β_t * (v_t ⊗ k_t - S_{t-1} @ k_t ⊗ k_t)
o_t = S_t @ q_t
Where:
- S_t: Recurrent state matrix
- α_t: Forget gate (data-dependent)
- β_t: Learning rate gate (per-token)
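To make the write-and-erase behaviour concrete, here is a small pure-Python sketch of one recurrent step. It reads the update as S_t = α_t S_{t-1} + β_t (v_t - S_{t-1} k_t) k_tᵀ with S of shape (d_v, d_k); that reading, and all function names, are assumptions for illustration and not the chunked kernel used in practice.

```python
def matvec(S, x):
    """Multiply matrix S (list of rows) by vector x."""
    return [sum(s * xj for s, xj in zip(row, x)) for row in S]


def gated_delta_step(S, k, v, alpha, beta):
    """One delta-rule update: decay the old state, then correct its prediction for k."""
    pred = matvec(S, k)                      # what the old state returns for key k
    err = [vi - pi for vi, pi in zip(v, pred)]
    return [[alpha * s + beta * ei * kj for s, kj in zip(row, k)]
            for row, ei in zip(S, err)]


def read(S, q):
    """Output o_t = S_t @ q_t."""
    return matvec(S, q)


d_k, d_v = 2, 3
S = [[0.0] * d_k for _ in range(d_v)]
key = [1.0, 0.0]                             # unit-norm key
S = gated_delta_step(S, key, [1.0, 2.0, 3.0], alpha=1.0, beta=1.0)
S = gated_delta_step(S, key, [4.0, 5.0, 6.0], alpha=1.0, beta=1.0)
out = read(S, key)                           # second write replaces the first
print(out)
```

With β = 1 and a unit-norm key, writing a new value under an existing key overwrites the old one instead of adding to it, which is the property that separates the delta rule from purely additive linear attention.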
📖 Paper: Gated Delta Networks: Improving Mamba2 with Delta Rule (ICLR 2025)
📦 Code: NVlabs/GatedDeltaNet
Configuration in Genesis
gla_expand_k: 0.75 # Key expansion ratio
gla_expand_v: 1.5 # Value expansion ratio (asymmetric)
gla_gate_fn: "swish" # Gating activation
gla_use_short_conv: True
gla_conv_size: 4
gla_chunk_size: 64 # For chunked parallel training
gla_use_delta_rule: True
gla_qk_norm: "l2"
gla_use_mamba_gate: True
Forgetting Attention (FoX)
The full attention layers (25%) use FoX (Forgetting Transformer), which augments standard softmax attention with a learnable forget gate.
Why FoX over Standard Attention?
| Aspect | Standard Attention | FoX |
|---|---|---|
| Position Encoding | Requires RoPE/ALiBi | NoPE (implicit via forget gate) |
| Long-range Decay | Uniform attention | Data-dependent decay |
| Length Extrapolation | Poor | Better generalization |
Mechanism
FoX modifies attention scores with cumulative forget gates:
attn[i,j] = softmax_j(q_i @ k_j / √d + Σ_{l=j+1}^{i} log(f_l))
Where f_l = sigmoid(W_f @ x_l) is a learned forget gate with values in (0, 1). Each log(f_l) is negative, so the bias grows more negative with distance and naturally down-weights distant tokens.
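The cumulative bias can be computed from one prefix sum of log forget gates. The sketch below takes the bias for a pair (i, j) as the sum of log gates at positions after j up to i, the convention used in the Forgetting Transformer paper; function and variable names are illustrative.

```python
import math
from itertools import accumulate


def fox_attention_weights(scores, forget_gates):
    """Causal softmax over scores[i][j] plus a cumulative log-forget bias.

    scores[i][j] stands for q_i . k_j / sqrt(d); forget_gates[t] lies in (0, 1].
    """
    n = len(scores)
    c = list(accumulate(math.log(f) for f in forget_gates))  # c[t] = sum_{l<=t} log f_l
    weights = []
    for i in range(n):
        # c[i] - c[j] equals the sum of log f_l for l = j+1 .. i (zero when j == i).
        logits = [scores[i][j] + c[i] - c[j] for j in range(i + 1)]
        m = max(logits)                                      # for numerical stability
        exps = [math.exp(x - m) for x in logits]
        z = sum(exps)
        weights.append([e / z for e in exps] + [0.0] * (n - i - 1))
    return weights


n = 4
flat_scores = [[0.0] * n for _ in range(n)]
no_forget = fox_attention_weights(flat_scores, [1.0] * n)    # plain causal softmax
with_forget = fox_attention_weights(flat_scores, [0.5] * n)  # recent tokens favoured
print(no_forget[-1], with_forget[-1])
```

With all gates equal to 1 the bias vanishes and the result is ordinary causal softmax attention; smaller gates shift weight toward recent positions.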
📖 Paper: Forgetting Transformer: Softmax Attention with a Forget Gate (ICLR 2025)
FoX "Pro" Design
Genesis uses the enhanced "Pro" block design:
| Component | Purpose |
|---|---|
| Output Gate | Controls information flow (like GLA) |
| QK-Norm | Training stability |
| Short Convolution | Local context on K, V |
| FusedRMSNormSwishGate | Efficient fused operations |
Test-Time Training (TTT)
Genesis includes an experimental TTT metacognition layer that adapts the model during inference.
Concept
Traditional models have fixed weights at inference. TTT layers have a small set of fast weights that update based on the input sequence, allowing the model to "learn" from context.
Standard: y = f(x; θ_fixed)
TTT: y = f(x; θ_fixed, θ_fast(x))
Implementation Details
| Parameter | Value | Description |
|---|---|---|
| ttt_rank | 4 | Low-rank adaptation dimension |
| ttt_inner_lr | 0.01 | Learning rate for fast weights |
| ttt_mode | "dual" | Parallel dual-form computation |
| ttt_chunk_size | 64 | Chunking for efficiency |
The "dual form" enables fully parallel gradient computation:
# Instead of sequential updates:
# W_1 = W_0 - lr * grad_0
# W_2 = W_1 - lr * grad_1
# ...
# Dual form computes all at once:
# W_t = W_0 - lr * Σ_{i<t} grad_i (via cumsum)
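The equivalence in the comments above can be checked on a toy scalar weight. This sketch takes the per-token gradients as given (in the dual form they would all be evaluated at the chunk's starting weight W_0, which is an assumption about how the chunking works) and compares a sequential loop with an exclusive prefix sum. The function names are hypothetical.

```python
from itertools import accumulate


def sequential_updates(w0, grads, lr):
    """Return the weight seen at each step when updates are applied one by one."""
    seen, w = [], w0
    for g in grads:
        seen.append(w)       # W_t is used at step t, before grad_t is applied
        w = w - lr * g
    return seen


def dual_form(w0, grads, lr):
    """Same weights via an exclusive cumulative sum: W_t = W_0 - lr * sum_{i<t} grad_i."""
    prefix = list(accumulate(grads, initial=0.0))[:-1]
    return [w0 - lr * p for p in prefix]


w0, lr = 0.5, 0.01
# Toy loss 0.5 * (w * x - y)^2 with every gradient taken at w0.
data = [(1.0, 2.0), (2.0, 1.0), (-1.0, 0.5), (3.0, 3.0)]
grads = [(w0 * x - y) * x for x, y in data]

seq = sequential_updates(w0, grads, lr)
dual = dual_form(w0, grads, lr)
print(seq, dual)
```

The prefix-sum form has no loop-carried dependency, which is what allows a whole chunk to be processed in parallel.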
📖 Paper: Learning to (Learn at Test Time): RNNs with Expressive Hidden States (ICML 2024)
When TTT Activates
TTT is designed for inference-time adaptation and runs only when the model is in evaluation mode (after calling model.eval()). During training it is disabled to avoid overhead.
Selective Activation
The FFN layers use SwiGLU with optional top-k sparsity masking.
SwiGLU FFN
FFN(x) = W_down @ (Swish(W_gate @ x) ⊙ (W_up @ x))
📖 Paper: GLU Variants Improve Transformer (Shazeer, 2020)
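A minimal pure-Python sketch of the SwiGLU feed-forward computation, with each weight matrix applied as a matrix-vector product (down-projection last). The tiny shapes are for illustration only; Genesis itself uses hidden size 576 and FFN size 1,440.

```python
import math


def swish(z):
    """Swish / SiLU activation: z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xj for w, xj in zip(row, x)) for row in W]


def swiglu_ffn(x, W_gate, W_up, W_down):
    """Gate branch through Swish, multiplied elementwise with the up branch, then projected down."""
    gate = [swish(z) for z in matvec(W_gate, x)]
    up = matvec(W_up, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(W_down, hidden)


# Toy shapes: hidden size 2, FFN size 3.
W_gate = [[1.0, 0.0]] * 3
W_up = [[2.0, 0.0]] * 3
W_down = [[1.0, 1.0, 1.0], [0.0, 0.0, 0.0]]

out = swiglu_ffn([1.0, 0.0], W_gate, W_up, W_down)
zero_out = swiglu_ffn([0.0, 0.0], W_gate, W_up, W_down)
print(out, zero_out)
```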
Selective Activation (Experimental)
| Parameter | Value |
|---|---|
| selective_k_ratio | 0.85 (keeps top 85%) |
| selective_use_soft_mask | True |
Important: This is a regularization technique, not a speedup mechanism. Real sparse acceleration requires specialized kernels (e.g., Triton sparse GEMM).
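The card does not spell out how the mask is computed. One plausible reading, sketched below under that assumption, keeps the activations with the largest magnitudes and either zeroes the rest (hard mask) or scales each activation by a sigmoid centred on the same threshold (soft mask). Both functions and the `temperature` parameter are hypothetical and expect a non-empty input.

```python
import math


def topk_hard_mask(acts, k_ratio=0.85):
    """Zero every activation whose magnitude falls below the top-k threshold."""
    n_keep = max(1, round(k_ratio * len(acts)))
    threshold = sorted((abs(a) for a in acts), reverse=True)[n_keep - 1]
    # Ties at the threshold are all kept, so slightly more than k values may survive.
    return [a if abs(a) >= threshold else 0.0 for a in acts], threshold


def topk_soft_mask(acts, k_ratio=0.85, temperature=0.1):
    """Hypothetical soft variant: sigmoid gate centred on the hard-mask threshold."""
    _, threshold = topk_hard_mask(acts, k_ratio)
    return [a / (1.0 + math.exp(-(abs(a) - threshold) / temperature)) for a in acts]


acts = [float(i) for i in range(1, 21)]
hard, threshold = topk_hard_mask(acts)
soft = topk_soft_mask(acts)
print(hard.count(0.0), threshold)
```

Because the dense matrix multiplications still run in full, masking of this kind changes what the network learns but not how long a forward pass takes.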
📖 Related: ReLU Strikes Back (Apple, ICLR 2024) shows natural activation sparsity can be exploited for inference.
Additional Components
Grouped Query Attention (GQA)
Genesis uses 3:1 GQA (9 query heads, 3 KV heads) for memory efficiency during inference.
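The sketch below shows how nine query heads can share three KV heads and what that means for the per-token KV cache of one full attention layer. Grouping consecutive query heads is a common convention and is assumed here; the helper names are illustrative.

```python
N_Q_HEADS, N_KV_HEADS, HEAD_DIM = 9, 3, 64


def kv_head_for(q_head, n_q=N_Q_HEADS, n_kv=N_KV_HEADS):
    """Map a query head index to the KV head it reads from (consecutive grouping)."""
    group_size = n_q // n_kv
    return q_head // group_size


def kv_cache_values_per_token(n_kv_heads, head_dim=HEAD_DIM):
    """Cached values per token for one attention layer: keys plus values."""
    return 2 * n_kv_heads * head_dim


mapping = [kv_head_for(h) for h in range(N_Q_HEADS)]
gqa_cache = kv_cache_values_per_token(N_KV_HEADS)
mha_cache = kv_cache_values_per_token(N_Q_HEADS)   # if every query head had its own KV
print(mapping, gqa_cache, mha_cache)
```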
📖 Paper: GQA: Training Generalized Multi-Query Transformer Models (Google, 2023)
Rotary Position Embeddings (RoPE)
Partial RoPE (50% rotation) is applied in GLA layers for position awareness.
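Partial RoPE rotates only a fraction of each head's dimensions and passes the rest through unchanged. The sketch below assumes interleaved pairing and a frequency base of 10000, both common defaults that this card does not confirm.

```python
import math


def partial_rope(x, position, rotary_fraction=0.5, base=10000.0):
    """Rotate the first `rotary_fraction` of x in 2-D pairs; leave the tail untouched."""
    rot_dim = int(len(x) * rotary_fraction)
    rot_dim -= rot_dim % 2                       # rotation works on pairs
    out = list(x)
    for i in range(0, rot_dim, 2):
        theta = position * base ** (-i / rot_dim)
        c, s = math.cos(theta), math.sin(theta)
        out[i] = x[i] * c - x[i + 1] * s
        out[i + 1] = x[i] * s + x[i + 1] * c
    return out


x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
rotated = partial_rope(x, position=5)
unrotated = partial_rope(x, position=0)
print(rotated)
```

Rotations preserve vector norm, so the operation changes only the relative phase between query and key in the rotated half.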
📖 Paper: RoFormer: Enhanced Transformer with Rotary Position Embedding (Su et al., 2021)
µP (Maximal Update Parametrization)
Hyperparameters were tuned using µP for potential scaling transfer.
📖 Paper: Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer (Yang et al., 2022)
📖 Guide: The Practitioner's Guide to µP (Cerebras)
Zero-Centered RMSNorm
Used throughout for better weight decay compatibility with µP.
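The card does not define the zero-centered variant. A common reading, assumed here, is that the learned gain is stored as an offset from one, so the effective scale is (1 + g) with g initialised at zero; weight decay then pulls the scale toward 1 instead of 0.

```python
import math


def zero_centered_rmsnorm(x, gain, eps=1e-6):
    """RMS-normalise x, then scale by (1 + gain) so that gain = 0 is the identity scale."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms * (1.0 + g) for v, g in zip(x, gain)]


x = [3.0, -4.0, 0.0, 5.0]
at_init = zero_centered_rmsnorm(x, gain=[0.0] * 4)   # unit RMS output
doubled = zero_centered_rmsnorm(x, gain=[1.0] * 4)   # effective scale of 2
print(at_init, doubled)
```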
Comparison with Other Architectures
vs. SmolLM2-135M (HuggingFace)
| Aspect | Genesis-152M | SmolLM2-135M |
|---|---|---|
| Attention | Hybrid GLA + FoX | Standard softmax attention (GQA) |
| Complexity | O(n) for 75% layers | O(n²) all layers |
| Position Encoding | RoPE (GLA) / NoPE (FoX) | RoPE |
| TTT | ✓ Experimental | ✗ |
| Pre-training | 2B tokens | 2T tokens |
| Architecture | 30L × 576 | 30L × 576 |
SmolLM2 uses 1000× more training tokens, making direct benchmark comparison unfair. Genesis demonstrates architectural innovations, not data scaling.
vs. Qwen3-Next
| Aspect | Genesis-152M | Qwen3-Next-80B-A3B |
|---|---|---|
| Scale | 152M | 80B (3B active) |
| Linear Attention | Gated DeltaNet (GLA) | Gated DeltaNet (same) |
| Full Attention | FoX | Standard |
| Hybrid Ratio | 75/25 | Similar |
| MoE | ✗ | ✓ |
Genesis can be seen as a miniature research version of the hybrid attention approach that Qwen3-Next uses at scale.
vs. Mamba / Mamba-2
| Aspect | Genesis-152M | Mamba-2 |
|---|---|---|
| Architecture | Hybrid (Linear + Softmax) | Pure SSM |
| Retrieval | Strong (FoX layers) | Limited |
| Implementation | PyTorch + Optional Triton | Requires CUDA |
| Flexibility | Modular | Monolithic |
Training Details
Pre-training
| Parameter | Value |
|---|---|
| Tokens | 2 billion |
| Dataset Mix | FineWeb-Edu (51%), DCLM (22%), FineMath (12%), Stack-Edu (8%), Cosmopedia (5%), Synth (2%) |
| Context Length | 2,048 |
| Batch Size | 128 |
| Learning Rate | 1e-3 (WSD schedule) |
| Optimizer | AdamW (β₁=0.9, β₂=0.95) |
| Weight Decay | 0.1 |
| Warmup | 5% of steps |
| Hardware | Single A100 80GB |
Learning Rate Schedule
WSD (Warmup-Stable-Decay):
- Warmup: 5% of training (linear ramp)
- Stable: 85% of training (constant LR)
- Decay: 10% of training (cosine to min_lr)
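The three phases can be written as a single function of the step index. The peak learning rate and phase fractions come from this card; the `min_lr` value is a placeholder, since the card does not state it.

```python
import math


def wsd_lr(step, total_steps, peak_lr=1e-3, min_lr=1e-4,
           warmup_frac=0.05, decay_frac=0.10):
    """Warmup-Stable-Decay schedule; min_lr is an assumed placeholder value."""
    warmup_steps = round(total_steps * warmup_frac)
    decay_steps = round(total_steps * decay_frac)
    decay_start = total_steps - decay_steps
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps           # linear ramp
    if step < decay_start:
        return peak_lr                                       # constant plateau
    progress = min(1.0, (step - decay_start) / max(1, decay_steps))
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


total = 1000
schedule = [wsd_lr(s, total) for s in range(total + 1)]
print(schedule[0], schedule[500], schedule[-1])
```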
Supervised Fine-Tuning (SFT)
| Parameter | Value |
|---|---|
| Dataset | smol-smoltalk |
| Samples | ~485K conversations |
| Epochs | 1 |
| Learning Rate | 1e-3 |
| Batch Size | 32 (effective: 128 with grad accum) |
smol-smoltalk Composition
The SFT dataset is the same used to train SmolLM2-135M-Instruct:
| Subset | Purpose |
|---|---|
| smol-magpie-ultra-short | Instruction following |
| everyday-conversations | Multi-turn dialogue |
| smol-rewrite | Text editing |
| smol-summarize | Summarization |
| openhermes-100k | Knowledge & reasoning |
| systemchats-30k | System prompt following |
This dataset was specifically curated for small models (<1B params) and avoids issues like <think> tags from reasoning models.
Usage
Installation
pip install genesis-llm
Download Weights
pip install "huggingface-hub>=0.20"
huggingface-cli download guiferrarib/genesis-152m-instruct genesis_152m_instruct.safetensors --local-dir .
Interactive Chat
genesis --model ./genesis_152m_instruct.safetensors
Python API
from genesis import Genesis, GenesisConfig
from genesis.tokenizer import GenesisTokenizer
# Load model
model = Genesis.from_pretrained("./genesis_152m_instruct.safetensors")
tokenizer = GenesisTokenizer()
# ChatML format
prompt = """<|im_start|>system
You are a helpful assistant.
<|im_end|>
<|im_start|>user
Explain what linear attention is in simple terms.
<|im_end|>
<|im_start|>assistant
"""
# Generate
input_ids = tokenizer.encode(prompt, return_tensors="pt")
output = model.generate(input_ids, max_new_tokens=256, temperature=0.7)
print(tokenizer.decode(output[0]))
Prompt Format
Genesis uses ChatML format:
<|im_start|>system
{system_message}
<|im_end|>
<|im_start|>user
{user_message}
<|im_end|>
<|im_start|>assistant
{assistant_response}<|im_end|>
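For multi-turn use, the template above can be filled in programmatically. `build_chatml_prompt` is a hypothetical helper written for this example, not part of the `genesis` package; it follows the layout shown above, including the end marker on its own line for system and user turns.

```python
def build_chatml_prompt(messages, add_generation_prompt=True):
    """Render a list of {'role', 'content'} dicts in the ChatML layout shown above."""
    parts = []
    for m in messages:
        if m["role"] == "assistant":
            # Assistant turns close on the same line, as in the template.
            parts.append(f"<|im_start|>assistant\n{m['content']}<|im_end|>\n")
        else:
            parts.append(f"<|im_start|>{m['role']}\n{m['content']}\n<|im_end|>\n")
    if add_generation_prompt:
        parts.append("<|im_start|>assistant\n")   # cue the model to answer
    return "".join(parts)


prompt = build_chatml_prompt([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain what linear attention is in simple terms."},
])
print(prompt)
```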
Benchmarks
Evaluated using LightEval on MPS (Apple Silicon).
Results
| Task | Metric | Score | Stderr |
|---|---|---|---|
| ARC-Easy (25-shot) | acc_norm | 44.02% | ±1.02 |
| ARC-Challenge (25-shot) | acc_norm | 24.66% | ±1.26 |
| BoolQ (0-shot) | acc_norm | 56.30% | ±0.87 |
| HellaSwag (10-shot) | acc_norm | 30.19% | ±0.46 |
| Winogrande (5-shot) | acc | 49.09% | ±1.41 |
| CommonsenseQA (0-shot) | acc_norm | 29.16% | ±1.30 |
| OpenBookQA (0-shot) | acc_norm | 28.60% | ±2.02 |
| SciQ (0-shot) | acc_norm | 46.80% | ±1.58 |
Interpretation
| Task | Random Baseline | Genesis | Signal |
|---|---|---|---|
| ARC-Easy | 25% | 44% | ✅ Strong |
| BoolQ | 50% | 56% | ✅ Learning |
| HellaSwag | ~25% | 30% | ✅ Learning |
| Winogrande | 50% | 49% | ⚠️ At baseline |
| ARC-Challenge | ~25% | 25% | ⚠️ Too hard for size |
Note: With only 2B pre-training tokens (vs. 2T for SmolLM2), benchmarks primarily reflect architectural capacity rather than world knowledge.
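The "Signal" column can be made quantitative by measuring how many standard errors each score sits above its random baseline. The sketch below uses the scores and standard errors from the results table; the cutoff of two standard errors is a conventional choice assumed here, not one stated in this card.

```python
# (score %, stderr %, random baseline %) taken from the tables above.
RESULTS = {
    "ARC-Easy": (44.02, 1.02, 25.0),
    "ARC-Challenge": (24.66, 1.26, 25.0),
    "BoolQ": (56.30, 0.87, 50.0),
    "HellaSwag": (30.19, 0.46, 25.0),
    "Winogrande": (49.09, 1.41, 50.0),
}
THRESHOLD = 2.0  # assumed cutoff, in standard errors


def z_score(score, stderr, baseline):
    """Distance of a score from the random baseline, in units of standard error."""
    return (score - baseline) / stderr


z = {task: z_score(*vals) for task, vals in RESULTS.items()}
above_chance = sorted(task for task, value in z.items() if value > THRESHOLD)

for task, value in z.items():
    print(f"{task:14s} z = {value:+.1f}")
```

On this measure ARC-Easy, BoolQ, and HellaSwag sit well above chance, while Winogrande and ARC-Challenge are statistically indistinguishable from it.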
Limitations
Known Issues
- Hallucinations: Frequent factual errors due to limited pre-training data
- Math: Unreliable arithmetic and multi-step reasoning
- Instruction Following: Can be brittle with strict constraints
- TTT Overhead: Metacognition layer adds latency (can be disabled)
Not Suitable For
- Production deployments requiring reliability
- Tasks requiring factual accuracy
- Complex multi-step reasoning
- Safety-critical applications
Best Use Cases
- Architecture research and ablation studies
- Efficient attention mechanism exploration
- Small model behavior analysis
- Educational purposes
Citation
If you use Genesis in your research, please cite:
@misc{genesis2024,
title={Genesis: A Hybrid Linear Attention Architecture for Small Language Models},
author={Ferrari Brescia, Guilherme},
year={2024},
url={https://huggingface.co/guiferrarib/genesis-152m-instruct}
}
Related Papers
@inproceedings{yang2024gated,
title={Gated Delta Networks: Improving Mamba2 with Delta Rule},
author={Yang, Songlin and Kautz, Jan and Hatamizadeh, Ali},
booktitle={ICLR},
year={2025}
}
@inproceedings{lin2025forgetting,
title={Forgetting Transformer: Softmax Attention with a Forget Gate},
author={Lin, Zhixuan and others},
booktitle={ICLR},
year={2025}
}
@inproceedings{sun2024learning,
title={Learning to (Learn at Test Time): RNNs with Expressive Hidden States},
author={Sun, Yu and others},
booktitle={ICML},
year={2024}
}
@article{allal2025smollm2,
title={SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model},
author={Allal, Loubna Ben and others},
journal={arXiv preprint arXiv:2502.02737},
year={2025}
}
License
| Component | License |
|---|---|
| Model Weights | Apache 2.0 |
| Code | Apache 2.0 |
| Training Data | Various (see dataset cards) |
Built with 🧬 by the Orch-Mind team