RCA (Recurrent Cross Attention) Architecture - A hybrid sequence modeling architecture

These details have not been verified by PyPI

Project links

Project description

RCA 1.3.0 — Recurrent Cross Attention Architecture

RCA-Mythos (v1.3.0): A hybrid Recurrent-Depth Architecture combining Mamba SSM, Gated Linear Attention (GLA), and Sliding Window Attention across specialized cognitive zones — heavily inspired by OpenMythos and Parcae scaling laws.

RCA v1.3.0 introduces the Mythos architecture, which replaces deep parameter stacking with a recurrent depth loop. A single block of parameters is looped multiple times at inference, allowing the model to "think deeper" about hard problems without consuming extra memory.

Why RCA?

Feature	Transformer	RCA v1.0/v2.0	RCA-Mythos v1.3.0
Training complexity	O(N²)	O(N)	O(N)
Generation memory	O(N) KV cache	O(1)	O(1)
Reasoning Depth	Fixed (depth = layers)	Fixed	Dynamic (Depth Extrapolation)
Parameter Efficiency	1x	1x	~5x (via recurrent looping)
Generation speed	Slows with context	Constant	Adaptive (ACT Halting)

Version History & Upgrade Guide

v1.0 / v2.0 (Ultra-Reasoning): Introduced the 3-zone architecture (SSM $\rightarrow$ GLA $\rightarrow$ Reasoning). This is a "flat" architecture where every layer has unique weights. Best for standard workflows where deterministic layer-by-layer execution is desired.
v1.3.0 (RCA-Mythos): Introduced the Recurrent-Depth architecture. The GLA zone is replaced by a RecurrentCore that loops a single set of weights $T$ times. This introduces Depth Extrapolation (the model can think deeper at inference by looping more times) and massive parameter efficiency.

Architecture: The 3-Stage Recurrent Pipeline (Mythos)

RCA-Mythos divides processing into three stages, running the GLA zone inside a recurrent loop. The Open-Mythos "thinking" mechanism is globally integrated as the central Recurrent Core of the model, enhancing complex reasoning for all tokens.

┌──────────────────────────────────────────────────────────────┐
│   PRELUDE                 │ Stream of Consciousness         │
│   SSM Blocks (run once)   │ Encodes input context → O(1)    │
├──────────────────────────────────────────────────────────────┤
│   RECURRENT CORE          │ Working Memory & Depth          │
│   GLA Block (run T times) │ Associative recall              │
│                           │ LTI-stable injection            │
│                           │ LoRA depth-wise adaptation      │
│                           │ ACT early-halting per-token     │
├──────────────────────────────────────────────────────────────┤
│   CODA                    │ Focus & Precision               │
│   Reasoning (run once)    │ Sliding Window + Memory Tokens  │
└──────────────────────────────────────────────────────────────┘

Installation

# Core (CPU/GPU)
pip install rca-arch

# With GPU acceleration (Triton kernels)
pip install rca-arch[gpu]

# With export support (safetensors)
pip install rca-arch[export]

# With training utilities
pip install rca-arch[training]

# Everything
pip install rca-arch[all]

Requirements: Python ≥ 3.9, PyTorch ≥ 2.0.0

Quick Start

1. Create a Model from Presets

Both the classic RCAModel and the new RCAMythosModel are available.

from rca import RCAConfig, RCAModel, RCAMythosModel

# --- Option A: Classic RCA (v1/v2 flat architecture) ---
classic_config = RCAConfig.rca_100m()
classic_model = RCAModel(classic_config)

# --- Option B: RCA-Mythos (v1.3.0 recurrent architecture) ---
mythos_config = RCAConfig.rca_mythos_100m()
mythos_config.vocab_size = 32000  # match your tokenizer

model = RCAMythosModel(mythos_config)
print(f"Parameters: {model.count_parameters():,}")
# → Parameters: ~100,000,000 (but performs like a 130M+ model)

# Inspect the architecture zones
print(model.get_architecture_summary())

2. Forward Pass

import torch

x = torch.randint(0, 32000, (2, 4096))  # [batch, seq_len]
output = model(x)

print(output.logits.shape)  # [2, 4096, 32000]
print(output.loss)          # None (no labels provided)

3. Forward Pass with Loss

x = torch.randint(0, 32000, (2, 4096))
labels = x.clone()

output = model(x, labels=labels)
print(f"Loss: {output.loss.item():.4f}")

4. Generate Text (with Depth Extrapolation)

With RCAMythosModel, you can dynamically increase the reasoning depth at inference time by passing a higher n_loops argument.

prompt = torch.randint(0, 32000, (1, 64))  # [1, prompt_len]

# Generate with test-time depth extrapolation (e.g. 16 loops)
generated = model.generate(
    prompt,
    max_new_tokens=200,
    n_loops=16,          # Force deeper reasoning than training default
    temperature=0.8,
    top_k=50,
    top_p=0.9,
)
print(generated.shape)  # [1, 264]

Model Presets

RCA-Mythos Presets (v1.3.0)

All Mythos presets use the Recurrent-Depth architecture. A 770M parameter Mythos model reaches the quality of a 1.3B flat-depth model.

Preset	Params	Equiv. Flat	Loops	Prelude	Coda	Hardware
`rca_mythos_100m()`	~100M	~130M	4	4	2	T4 / P100
`rca_mythos_500m()`	~500M	~700M	8	6	3	T4 / P100 (ckpt)
`rca_mythos_1b()`	~1B	~1.5B	12	8	4	A100
`rca_mythos_3b()`	~3B	~5B	16	10	6	Multi-GPU / A100 80G

Estimated Training Budget (7-hour window)

Optimal Loop scaling law follows Parcae: μ_rec ∝ C^0.40

Preset	T4 (16GB)	P100 (16GB)	Settings
`rca_mythos_100m`	~700M tokens	~1.1B tokens	batch=8, grad_accum=4, fp16
`rca_mythos_500m`	~250M tokens	~400M tokens	batch=2, grad_accum=16, fp16, grad_ckpt
`rca_mythos_1b`	~100M tokens	~250M tokens	batch=1, grad_accum=32, fp16, grad_ckpt

In-Depth: How RCA-Mythos Works

The "Thinking" part of the architecture is fully globally integrated into the model, processing every single token. It is not an add-on; it is the core engine.

The Prelude (SSM): Fast, $O(1)$ recurrent scan that rapidly ingests the input context and compresses it into a high-density vector $e$.
The Recurrent Core (GLA Loop): Instead of stacking 20 different layers, we take one Gated Linear Attention layer and loop it $T$ times (e.g., $T=8$).
- LTI Stable Injection: At each loop, the context $e$ is injected into the state. We strictly enforce $\rho(A) < 1$ (Linear Time-Invariant stability) so the gradients never explode, even if you loop it 100 times.
- Depth LoRA: A tiny parameter-efficient adapter tells the shared weights which loop iteration it is currently on, allowing the layer to act differently at loop 1 vs loop 8.
- ACT Halting (Adaptive Computation): "Easy" tokens (like "the", "and") accumulate high confidence quickly and exit the loop early. "Hard" tokens (complex math, logic) stay in the loop for all $T$ iterations to "think deeper".
The Coda (Reasoning): A final Sliding Window Attention pass that grounds the refined abstract thoughts back into precise token predictions.

Training RCA-Mythos

Training the Mythos architecture is identical to the classic architecture, but with a few built-in advantages. By training with a random number of loops (Parcae strategy), the model learns to extrapolate depth at inference time.

Training on a Single GPU (e.g., T4, P100, RTX 4050)

from rca import RCAConfig, RCAMythosModel, RCATrainer, TrainingArguments

# 1. Config: Use a Mythos preset
config = RCAConfig.rca_mythos_500m()
config.vocab_size = 32000

# 2. Model: Instantiate the Recurrent-Depth model
model = RCAMythosModel(config)

# 3. Dataset
from torch.utils.data import Dataset
class TextDataset(Dataset):
    def __init__(self, data, seq_len=4096):
        self.data = data
        self.seq_len = seq_len
    def __len__(self): return len(self.data) // self.seq_len
    def __getitem__(self, idx):
        start = idx * self.seq_len
        chunk = self.data[start : start + self.seq_len + 1]
        return {"input_ids": chunk[:-1], "labels": chunk[1:]}

# 4. Training args (Gradient Checkpointing is enabled by default in 500M)
args = TrainingArguments(
    output_dir="./checkpoints",
    num_train_epochs=1,
    per_device_train_batch_size=2,      # Small batch size to fit in VRAM
    gradient_accumulation_steps=16,     # Accumulate to get effective batch=32
    learning_rate=3e-4,
    warmup_steps=200,
    fp16=True,                          # Essential for 6GB VRAM
    logging_steps=10,
)

# 5. Train
trainer = RCATrainer(model=model, args=args, train_dataset=train_dataset)
trainer.train()

Hardware Estimates: Laptop GPU (RTX 4050 6GB)

Scenario A: 100M Model at Chinchilla Optimal Limit (2B Tokens)

The Chinchilla scaling law dictates that optimal training requires ~20 tokens per parameter. For a 100M model, this is 2 Billion tokens.

Compute Required: A 100M Mythos model effectively computes like a 130M flat model.
- $FLOPs \approx 6 \times 130,000,000 \times 2,000,000,000 \approx 1.56 \text{ ExaFLOPs}$
Hardware Speed: RTX 4050 Laptop (~30 effective TFLOPs in mixed precision).
Time Estimate: $1.56 \times 10^{18} / 30 \times 10^{12} \approx 52,000 \text{ seconds}$.
$\approx \mathbf{14.5 \text{ hours}}$. (You can easily train this overnight on a laptop!)

Scenario B: 500M Model on 10B Tokens

If you scale up to the 500M RCA-Mythos model on a dataset of 10 Billion tokens (world knowledge, math, coding):

Compute Required: Acts like a 700M flat model.
- $FLOPs \approx 6 \times 700,000,000 \times 10,000,000,000 \approx 42 \text{ ExaFLOPs}$
Time Estimate: $4.2 \times 10^{16} / 30 \text{ TFLOPs} \approx 1.4 \text{ million seconds}$.
$\approx \mathbf{388 \text{ hours}}$ (or about 16 days of continuous 24/7 training).

VRAM Note: To fit these models in 6GB of VRAM, you MUST use:

fp16=True or bf16=True
per_device_train_batch_size=1 or 2 (use gradient_accumulation_steps to compensate)
gradient_checkpointing=True
An 8-bit optimizer (like bitsandbytes.optim.AdamW8bit) to save optimizer state memory.

Multi-GPU Training (DDP)

torchrun --nproc_per_node=4 train.py

args = TrainingArguments(
    output_dir="./checkpoints",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    fp16=True,
    use_ddp=True,
    # ...
)

Large Model Training (FSDP — 5B+)

torchrun --nproc_per_node=8 train.py

config = RCAConfig.rca_5b()
config.vocab_size = 32000

args = TrainingArguments(
    output_dir="./checkpoints",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=64,
    bf16=True,
    use_fsdp=True,
    # ...
)

TPU Training (XLA)

args = TrainingArguments(
    output_dir="./checkpoints",
    per_device_train_batch_size=8,
    use_xla=True,
    bf16=True,
    # ...
)

Custom Architecture

Full control over every parameter:

from rca import RCAConfig, RCAModel

config = RCAConfig(
    vocab_size=50257,
    state_dim=768,
    n_layers=24,
    n_heads=12,

    # Ultra-Reasoning zones
    use_ultra_reasoning=True,
    use_glu_ffn=True,
    ssm_zone_end=0.6,        # First 60% = SSM (stream of consciousness)
    gla_zone_end=0.85,       # Next 25%  = GLA (working memory)
    # Remaining 15%          = Reasoning (focus)

    # GLA settings
    gla_heads=12,
    gla_expand_k=1.0,
    gla_expand_v=2.0,

    # Reasoning settings
    sliding_window_size=512,
    num_memory_tokens=32,

    # Performance
    gradient_checkpointing=True,
    max_seq_len=4096,
    dropout=0.1,
)

model = RCAModel(config)
print(model.get_layer_zones())

Key Configuration Parameters

Parameter	Default	Description
`use_ultra_reasoning`	`False`	Enable 3-zone architecture
`use_glu_ffn`	`False`	SwiGLU FFN instead of standard GELU
`ssm_zone_end`	`0.6`	Fraction of layers for SSM zone
`gla_zone_end`	`0.85`	Fraction of layers for SSM + GLA zones
`gradient_checkpointing`	`False`	Trade compute for memory savings
`sliding_window_size`	`512`	Local attention window in reasoning zone
`num_memory_tokens`	`32`	Global context bookmarks in reasoning zone
`use_mqa`	`False`	Multi-Query Attention for KV savings

Model Export

Safetensors (recommended for fast loading)

from rca import export_safetensors, load_safetensors, RCAModel

# Export
export_safetensors(model, "./my_model_safetensors/")

# Load
model = load_safetensors(RCAModel, "./my_model_safetensors/")

GGUF (for llama.cpp / edge inference)

from rca import export_gguf

# Full precision
export_gguf(model, "./my_model.gguf", quantization="f16")

# Quantized (smaller, faster on CPU)
export_gguf(model, "./my_model_q8.gguf", quantization="q8_0")
export_gguf(model, "./my_model_q4.gguf", quantization="q4_0")

Quantization options:

Format	Size vs f32	Quality	Use Case
`f32`	1×	Lossless	Research / debugging
`f16`	0.5×	Near-lossless	GPU inference
`q8_0`	0.25×	Minimal loss	CPU / edge inference
`q4_0`	0.125×	Some loss	Mobile / embedded

PyTorch Native Save/Load

# Save
model.save_pretrained("./my_model/")

# Load
model = RCAModel.from_pretrained("./my_model/")

Performance Features

Gradient Checkpointing

Trades ~30% compute for ~60% memory savings. Enabled by default for 500M+ presets.

config = RCAConfig.rca_500m()
# config.gradient_checkpointing is already True

# Or enable manually:
config.gradient_checkpointing = True

Triton-Accelerated Parallel Scan

The SSM parallel scan automatically uses Triton kernels on NVIDIA GPUs:

from rca import TRITON_AVAILABLE
print(f"Triton available: {TRITON_AVAILABLE}")
# Automatic — no code changes needed

torch.compile

Fuses operations for additional speedup:

args = TrainingArguments(
    use_torch_compile=True,
    compile_mode="reduce-overhead",  # or "max-autotune"
    # ...
)

Fused RMSNorm

All normalization layers use an optimized rsqrt(mean(x²)) implementation that is both faster and compatible with torch.compile kernel fusion.

Kaggle / Colab Quick Training

Complete training script for free-tier GPUs:

# Install
# !pip install rca-arch[gpu]

import torch
from rca import RCAConfig, RCAModel, RCATrainer, TrainingArguments

# Use 100M preset for T4
config = RCAConfig.rca_100m()
config.vocab_size = 32000

model = RCAModel(config)
print(f"Model: {model.count_parameters():,} params")
print(f"Zones: {model.get_layer_zones()}")

# Create a simple dataset (replace with your data)
from torch.utils.data import TensorDataset
data = torch.randint(0, 32000, (1000, 4097))
dataset = TensorDataset(data[:, :-1], data[:, 1:])

# Train
args = TrainingArguments(
    output_dir="/kaggle/working/rca_output",
    num_train_epochs=1,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    learning_rate=3e-4,
    warmup_steps=100,
    fp16=True,
    logging_steps=5,
    save_steps=200,
)

trainer = RCATrainer(model=model, args=args, train_dataset=dataset)
trainer.train()

# Export
from rca import export_safetensors
export_safetensors(model, "/kaggle/working/rca_safetensors/")

Running Tests

# Run the full test suite
python run_tests.py

# Run a standalone integration test
python run_single_test.py

Project Structure

src/rca/
├── __init__.py          # Public API
├── config.py            # RCAConfig with presets
├── modeling/
│   ├── rca_model.py     # RCAModel (SSM/GLA/Reasoning blocks)
│   └── outputs.py       # Output dataclasses
├── layers/
│   ├── ssm.py           # Selective State Space Model
│   ├── gla.py           # Gated Linear Attention (vectorized)
│   ├── sliding_attention.py  # Sliding Window + Memory Tokens
│   ├── attention.py     # Efficient Attention (MQA/Rotary)
│   ├── scan.py          # Parallel scan (PyTorch/Triton/XLA)
│   ├── norm.py          # Fused RMSNorm, DeepNorm
│   └── positions.py     # ALiBi, Rotary embeddings
├── trainer.py           # RCATrainer (DDP/FSDP/XLA/compile)
├── converter.py         # Safetensors + GGUF export
├── generator.py         # Text generation utilities
└── utils/
    ├── benchmark.py     # Performance benchmarking
    └── export.py        # ONNX export, save/load

Citation

@software{rca2024,
  title={RCA: Recursive Compression Architecture},
  author={Rajaaditya, R.},
  year={2024},
  url={https://github.com/rajaaditya/rca-arch}
}

License

MIT License — see LICENSE for details.

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

This version

1.3.0

Apr 25, 2026

1.0.2

Mar 22, 2026

1.0.1

Mar 22, 2026

1.0.0

Mar 20, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

rca_arch-1.3.0.tar.gz (59.4 kB view details)

Uploaded Apr 25, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

rca_arch-1.3.0-py3-none-any.whl (59.1 kB view details)

Uploaded Apr 25, 2026 Python 3

File details

Details for the file rca_arch-1.3.0.tar.gz.

File metadata

Download URL: rca_arch-1.3.0.tar.gz
Upload date: Apr 25, 2026
Size: 59.4 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.2

File hashes

Hashes for rca_arch-1.3.0.tar.gz
Algorithm	Hash digest
SHA256	`c83c9b7cb385831edce962155994a8cba694173ddb6671d944c38529df936d57`
MD5	`3310b9c183f45e9bb836690e7a1765fe`
BLAKE2b-256	`f9d009bc7f89eeec7ad140479e3b3d1c75635aeaef9d5c02424d69a60a8a00df`

See more details on using hashes here.

File details

Details for the file rca_arch-1.3.0-py3-none-any.whl.

File metadata

Download URL: rca_arch-1.3.0-py3-none-any.whl
Upload date: Apr 25, 2026
Size: 59.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.2

File hashes

Hashes for rca_arch-1.3.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`4a42da5e6a4e78a2430ea7dced63e6ea2c6e21429555df7370d27434c2d39b31`
MD5	`2fa284cccc367c86c0f4e24ba8d53c22`
BLAKE2b-256	`595907bd76b07ac943c8bdef74b956ff6cd141550a30b02e215f43a0d32fb2cd`

See more details on using hashes here.

rca-arch 1.3.0

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

RCA 1.3.0 — Recurrent Cross Attention Architecture

Why RCA?

Version History & Upgrade Guide

Architecture: The 3-Stage Recurrent Pipeline (Mythos)

Installation

Quick Start

1. Create a Model from Presets

2. Forward Pass

3. Forward Pass with Loss

4. Generate Text (with Depth Extrapolation)

Model Presets

RCA-Mythos Presets (v1.3.0)

Estimated Training Budget (7-hour window)

In-Depth: How RCA-Mythos Works

Training RCA-Mythos

Training on a Single GPU (e.g., T4, P100, RTX 4050)

Hardware Estimates: Laptop GPU (RTX 4050 6GB)

Scenario A: 100M Model at Chinchilla Optimal Limit (2B Tokens)

Scenario B: 500M Model on 10B Tokens

Multi-GPU Training (DDP)

Large Model Training (FSDP — 5B+)

TPU Training (XLA)

Custom Architecture

Key Configuration Parameters

Model Export

Safetensors (recommended for fast loading)

GGUF (for llama.cpp / edge inference)

PyTorch Native Save/Load

Performance Features

Gradient Checkpointing

Triton-Accelerated Parallel Scan

torch.compile

Fused RMSNorm

Kaggle / Colab Quick Training

Running Tests

Project Structure

Citation

License

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes