HALO-S: Sparse Attention Language Model Framework with O(N×K) complexity

Maintainers

neo_bueorm

🌀 HALO-S

Hierarchical Attention with Local Offsets — Sparse

A linear-complexity language model framework that replaces quadratic attention with a structured sparse connectivity graph.

What if attention didn't have to be quadratic?

Transformer-based language models pay a steep price for long sequences: standard self-attention scales as O(N²) in sequence length N, so attention compute and memory grow quadratically as the context window grows. HALO-S takes a different path. By constructing a fixed-degree sparse connectivity graph (combining local windows, dilated connections, learned global tokens, and random edges), each token attends to only K neighbors regardless of sequence length. The result is O(N×K) complexity with K=76 by default, a theoretical ~52.5× reduction in attention operations at N=4096.

HALO-S is implemented as a clean, research-ready PyTorch framework. No custom CUDA kernels. No required dependencies beyond PyTorch and NumPy (tqdm and SentencePiece are optional extras). Just gather-based sparse attention that runs on any hardware PyTorch supports.

⚠️ Honest disclaimer: HALO-S is a promising architectural exploration. The theoretical complexity advantages are mathematically sound, but large-scale empirical validation against established models on standard benchmarks is still in progress. Use it for research, experimentation, and learning. The numbers in this README reflect theoretical analysis and small-scale experiments, not production-validated results.

Key Features
Architecture Overview
Performance Analysis (Theoretical)
Installation
Quick Start
Advanced Usage
Project Structure
Why HALO-S?
License
🇪🇸 Versión en Español

Key Features

Feature	Description
Linear Attention Complexity	O(N×K) instead of O(N²) — scales to long sequences efficiently
Gather-Based Sparse Attention	No custom CUDA kernels needed; runs on CPU and GPU
Learned Global Tokens	Shared memory parameters that attend to the full sequence
Dilated Connections	Exponentially expanding receptive field across layers
Random Edges	Small-world graph properties for information propagation
Grouped Query Attention (GQA)	Reduced KV memory with configurable head ratios
Rotary Position Embeddings (RoPE)	Relative position encoding without learned parameters
Mixed Precision Training	Native AMP support with GradScaler
Gradient Accumulation	Train with effective large batches on limited hardware
Checkpoint Save/Load	Full training state persistence and resumption
Streaming Datasets	Train on data larger than RAM with buffer shuffling
Autoregressive Generation	Top-k, top-p, and temperature sampling built-in
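
The RoPE row above refers to rotary position embeddings, which rotate pairs of feature dimensions by a position-dependent angle so that a query-key dot product depends only on the relative offset between the two positions. The following is a minimal pure-Python sketch of that property; the function name `rope_rotate` and the base of 10000 are illustrative assumptions and are not taken from the pyhalos source.

```python
import math

def rope_rotate(vec, position, base=10000.0):
    """Rotate consecutive pairs of `vec` by position-dependent angles (RoPE)."""
    dim = len(vec)
    assert dim % 2 == 0, "RoPE needs an even feature dimension"
    out = []
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)   # rotation angle for this pair
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [0.3, -1.2, 0.7, 0.5]
k = [1.0, 0.4, -0.6, 0.9]

# Same relative offset (3) at two different absolute positions.
score_a = dot(rope_rotate(q, 5), rope_rotate(k, 2))
score_b = dot(rope_rotate(q, 105), rope_rotate(k, 102))
```

Both scores agree up to floating-point error, which is why no learned position parameters are needed.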

Architecture Overview

HALO-S replaces dense self-attention with a structured sparse graph where each token connects to a fixed set of K neighbors:

┌─────────────────────────────────────────────────────────────────┐
│                        HaloSModel                                │
│                                                                  │
│  ┌──────────────┐   ┌──────────────────────────────────┐        │
│  │ token_emb    │   │ global_memory (nn.Parameter)      │        │
│  │ (Embedding)  │   │ shape: (num_globals, hidden_size) │        │
│  └──────┬───────┘   └──────────────┬───────────────────┘        │
│         │                          │                             │
│         └──────────┬───────────────┘                             │
│                    ▼                                              │
│         ┌──────────────────┐                                     │
│         │ cat([globals, x]) │  → (B, G+N, H)                    │
│         └────────┬─────────┘                                     │
│                  ▼                                                │
│         ┌──────────────────┐                                     │
│         │ RoPE (cos, sin)  │                                     │
│         └────────┬─────────┘                                     │
│                  ▼                                                │
│  ┌───────────────────────────────────────────────────┐           │
│  │              HaloBlock × num_layers                │           │
│  │                                                    │           │
│  │  ┌─────────────┐                                  │           │
│  │  │ LayerNorm 1 │                                  │           │
│  │  └──────┬──────┘                                  │           │
│  │         │                                          │           │
│  │    ┌────┴────────────────────────┐                │           │
│  │    ▼                             ▼                │           │
│  │ ┌────────────────┐   ┌─────────────────────┐     │           │
│  │ │GlobalFullAttn  │   │ HaloSparseAttention │     │           │
│  │ │(dense, G×N)    │   │ (gather, N×K)       │     │           │
│  │ └───────┬────────┘   └──────────┬──────────┘     │           │
│  │         │                       │                  │           │
│  │         └───────────┬───────────┘                  │           │
│  │                     ▼                              │           │
│  │           cat([globals_out, tokens_out])            │           │
│  │                     │ + residual                    │           │
│  │                     ▼                              │           │
│  │  ┌─────────────┐  ┌────────────┐                 │           │
│  │  │ LayerNorm 2 │→ │ FeedForward│ + residual      │           │
│  │  └─────────────┘  └────────────┘                 │           │
│  └───────────────────────────────────────────────────┘           │
│                  ▼                                                │
│         ┌──────────────────┐                                     │
│         │ LayerNorm final  │                                     │
│         └────────┬─────────┘                                     │
│                  ▼                                                │
│         ┌──────────────────┐                                     │
│         │ lm_head (Linear) │  → (B, N, vocab_size)              │
│         └──────────────────┘                                     │
└─────────────────────────────────────────────────────────────────┘

Connectivity Components

Each token's neighbor list is composed of:

Component	Neighbors	Purpose
Global Tokens (G)	2	Learned parameters attending to full sequence — shared memory
Local Window (w)	64	Captures sequential/syntactic dependencies
Dilated Connections (2d)	8	Exponentially expanding receptive field
Random Edges (r)	2	Intended to give the graph small-world properties
Total (K)	76	Fixed budget per token regardless of N
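
To make the budget in the table above concrete, here is a rough sketch of how a fixed-size neighbor table with 2 + 64 + 8 + 2 = 76 slots per token could be built. Everything about the layout is an assumption made for illustration: the causal (look-back only) direction, the dilation offsets of window × 2^j, the use of -1 for empty slots, and the name `build_neighbors`. The package's own `graph.py` may differ on all of these.

```python
import random

def build_neighbors(seq_len, num_globals=2, window=64, num_dilated=8,
                    num_random=2, seed=0):
    """Return a seq_len x K table of neighbor indices (-1 marks an empty slot).

    Indices refer to the concatenated sequence [globals, tokens], so token i
    lives at index num_globals + i. The causal layout is an assumption.
    """
    rng = random.Random(seed)
    k_total = num_globals + window + num_dilated + num_random
    table = []
    for i in range(seq_len):
        # Global tokens are visible to every position.
        row = list(range(num_globals))
        # Local window: the token itself and the window-1 positions before it.
        local = [i - o for o in range(window)]
        # Dilated: exponentially growing look-back distances beyond the window.
        dilated = [i - window * 2 ** j for j in range(num_dilated)]
        # Random: uniformly sampled earlier positions (may repeat other slots).
        rand = [rng.randint(0, i) for _ in range(num_random)]
        for pos in local + dilated + rand:
            row.append(num_globals + pos if pos >= 0 else -1)
        assert len(row) == k_total
        table.append(row)
    return table

neighbors = build_neighbors(seq_len=4096)
```

Each row has exactly K entries no matter how long the sequence is, which is what keeps the per-token cost constant.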

Mathematical Formulation

Given a batch of token IDs x ∈ {0,…,V−1}^(B×N), the forward pass is:

Embed: e = Embedding(x) ∈ ℝ^(B×N×H)
Prepend globals: x̂ = [g₁,...,g_G ; e₁,...,e_N] ∈ ℝ^(B×(G+N)×H)
Per layer: Pre-norm → Split attention (global dense + token sparse) → Residual → Pre-norm → FFN → Residual
Output: logits = W_lm · LN_f(x̂^(L)_{G:}) ∈ ℝ^(B×N×V)
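
The sparse token attention in the per-layer step can be written as ordinary scaled dot-product attention restricted to each token's neighbor set. The formula below is a sketch under that assumption; the symbols 𝒩(i) (the K neighbors of token i) and d_h (the per-head dimension) are notation introduced here and do not appear in the original text.

```latex
\[
\mathrm{out}_i \;=\; \sum_{j \in \mathcal{N}(i)}
  \frac{\exp\!\left(q_i^{\top} k_j / \sqrt{d_h}\right)}
       {\sum_{m \in \mathcal{N}(i)} \exp\!\left(q_i^{\top} k_m / \sqrt{d_h}\right)}\, v_j,
\qquad |\mathcal{N}(i)| = K .
\]
```

Because the softmax runs over K terms instead of N, the cost per token is independent of sequence length.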

Performance Analysis (Theoretical)

⚠️ All performance data below is THEORETICAL, derived from complexity analysis. Large-scale empirical benchmarks are in progress.

Attention Operation Reduction

At sequence length N=4096 with K=76 neighbors per token and G=2 global tokens:

Dense Transformer attention operations:  N²       = 16,777,216
HALO-S sparse token attention:           N×K      =    311,296
HALO-S dense global attention:           G×N      =      8,192
HALO-S total:                            N×(K+G)  =    319,488

Reduction factor: 16,777,216 / 319,488 ≈ 52.5×

Scaling Comparison (Attention FLOPs)

Sequence Length (N)	Dense Transformer (N²)	HALO-S (N×76)	Theoretical Speedup
512	262,144	38,912	6.7×
1,024	1,048,576	77,824	13.5×
2,048	4,194,304	155,648	26.9×
4,096	16,777,216	311,296	53.9×
8,192	67,108,864	622,592	107.8×
16,384	268,435,456	1,245,184	215.6×

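The speedup grows linearly with N because dense attention is O(N²) while HALO-S is O(N×K), so the ratio is N/K. This table counts only the sparse token attention (N×76); including the G×N global attention term gives the slightly lower ≈52.5× figure quoted for N=4096.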
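
The table rows can be reproduced with a few lines of arithmetic. This is only a counting sketch; `attention_ops` is a hypothetical helper, and it counts attention score entries rather than measured FLOPs.

```python
def attention_ops(seq_len, k=76, num_globals=2):
    """Count attention score entries for dense and sparse attention."""
    dense = seq_len * seq_len              # every token scores every token
    sparse = seq_len * k                   # every token scores K neighbors
    global_dense = num_globals * seq_len   # each global token scores all tokens
    return dense, sparse, global_dense

for n in (512, 1024, 2048, 4096, 8192, 16384):
    dense, sparse, _ = attention_ops(n)
    print(f"{n:>6,} {dense:>12,} {sparse:>10,} {dense / sparse:>6.1f}x")

dense, sparse, global_dense = attention_ops(4096)
ratio_sparse_only = dense / sparse                      # about 53.9
ratio_with_globals = dense / (sparse + global_dense)    # about 52.5
```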

Theoretical Comparison with Other Architectures

⚠️ THEORETICAL COMPARISON — based on published complexity analyses, not head-to-head benchmarks.

Model	Attention Complexity	Memory (Scores)	Global Context	Dilated	Random Edges	GQA
Dense Transformer	O(N²·d)	O(N²)	Full (implicit)	✗	✗	Optional
Longformer	O(N·w·d)	O(N·w)	✓ (fixed)	✓	✗	✗
BigBird	O(N·(w+g+r)·d)	O(N·(w+g+r))	✓ (fixed)	✗	✓	✗
Mamba (SSM)	O(N·d²)	O(d²)	Implicit (state)	✗	✗	N/A
HALO-S	O(N·K·d)	O(N·K)	✓ (learned)	✓	✓	✓

Memory Efficiency

Component	Dense Transformer	HALO-S	Advantage
Attention scores (B=1, N=4096)	512 MB	9.5 MB	54× less
KV cache (GQA effect)	16 MB	4 MB	4× less
Crossover point	—	N > 9,728	Total memory advantage
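
The figures in this table are not derived in the text. One set of assumptions that reproduces them is: 8 attention heads, 2 KV heads, a head dimension of 64, 32-bit floats, a single layer, and MB meaning 2^20 bytes. The sketch below works through the arithmetic under those assumptions; the helper names are hypothetical, and the crossover derivation (gathered K and V versus the dense score matrix) is an inference rather than something the README states.

```python
MIB = 1024 ** 2

def score_bytes(seq_len, keys_per_query, num_heads=8, bytes_per_el=4, batch=1):
    """Bytes for one layer's attention score tensor."""
    return batch * num_heads * seq_len * keys_per_query * bytes_per_el

def kv_cache_bytes(seq_len, kv_heads, head_dim=64, bytes_per_el=4, batch=1):
    """Bytes for one layer's cached keys and values."""
    return 2 * batch * kv_heads * seq_len * head_dim * bytes_per_el

n, k, head_dim = 4096, 76, 64
dense_scores_mib = score_bytes(n, n) / MIB            # 512.0
sparse_scores_mib = score_bytes(n, k) / MIB           # 9.5
mha_cache_mib = kv_cache_bytes(n, kv_heads=8) / MIB   # 16.0
gqa_cache_mib = kv_cache_bytes(n, kv_heads=2) / MIB   # 4.0

# Gathered K and V hold 2 * N * K * head_dim elements per head, while dense
# scores hold N * N. Setting the two equal gives the crossover length.
crossover_n = 2 * k * head_dim                        # 9728
```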

Qualitative Comparison (THEORETICAL)

Capability	Transformer	Mamba	Longformer	HALO-S
Long-range dependencies	★★★★★	★★★☆☆	★★★☆☆	★★★★☆ (theoretical)
Training efficiency	★★☆☆☆	★★★★★	★★★★☆	★★★★☆ (theoretical)
Inference speed	★★☆☆☆	★★★★★	★★★☆☆	★★★★☆ (theoretical)
Hardware compatibility	★★★★★	★★★☆☆	★★★☆☆	★★★★★
Implementation simplicity	★★★★★	★★☆☆☆	★★★☆☆	★★★★☆

Installation

From PyPI

# Core installation (PyTorch + NumPy only); the package is published as "pyhalos" and imported as "halo"
pip install pyhalos

# Full installation (includes tqdm progress bars + SentencePiece tokenizer)
pip install "pyhalos[full]"

From Source

git clone https://github.com/bueormnew/pyhalo.git
cd pyhalo
pip install -e ".[full,dev]"

Requirements

Python ≥ 3.10
PyTorch ≥ 2.1.0
NumPy ≥ 1.24.0
(Optional) tqdm, sentencepiece

Quick Start

Minimal Example

from halo import HaloConfig, HaloSModel, set_seed

set_seed(42)

# Configure a small model
config = HaloConfig(
    vocab_size=256,
    hidden_size=512,
    num_layers=6,
    num_heads=8,
    num_kv_heads=2,       # GQA: 4:1 ratio
    num_globals=2,
    local_window=64,
    max_seq_len=4096,
)

# Instantiate
model = HaloSModel(config)

# Inspect
print(model.summary())
print(f"Parameters: {model.count_parameters():,}")
print(f"FLOPs (N=1024): {model.estimate_flops(seq_len=1024)['total_gflops']:.2f} GFLOPs")
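
The `num_heads=8, num_kv_heads=2` setting in the example means that, under grouped query attention, each key/value head is shared by a group of four query heads. A small sketch of that mapping follows; `kv_head_for` is a hypothetical helper, and the contiguous grouping it uses is the common convention rather than something confirmed from the pyhalos source.

```python
def kv_head_for(query_head, num_heads=8, num_kv_heads=2):
    """Map a query head index to the KV head it shares under GQA."""
    assert num_heads % num_kv_heads == 0, "heads must divide evenly"
    group_size = num_heads // num_kv_heads
    return query_head // group_size

mapping = [kv_head_for(h) for h in range(8)]
```

Only the two KV heads need to be stored, which is where the 4× KV memory saving comes from.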

Text Generation (String API)

from halo import HaloConfig, HaloSModel, CharacterTokenizer, set_seed

set_seed(42)

config = HaloConfig(vocab_size=256, hidden_size=256, num_layers=4, num_heads=4)
model = HaloSModel(config)
tok = CharacterTokenizer()

# Generate from a text prompt (returns string)
output = model.generate(
    "Hello world",
    tokenizer=tok,
    max_new_tokens=50,
    temperature=0.8,
    top_k=40,
)
print(output)

Tensor Generation

import torch
from halo import HaloConfig, HaloSModel

config = HaloConfig(vocab_size=256, hidden_size=256, num_layers=4, num_heads=4)
model = HaloSModel(config)

# Generate from tensor input (returns tensor)
input_ids = torch.randint(0, 256, (1, 20))
output_ids = model.generate(
    input_ids,
    max_new_tokens=100,
    temperature=1.0,
    top_p=0.9,
)
print(f"Input: {input_ids.shape} → Output: {output_ids.shape}")
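
The `temperature`, `top_k`, and `top_p` arguments used in the generation examples correspond to standard sampling filters. The sketch below shows one common way to combine them on a plain list of logits. It is an illustration in pure Python, and `sample_token` is a hypothetical name; the package's `samplers.py` may order or implement the steps differently.

```python
import math
import random

def sample_token(logits, temperature=1.0, top_k=None, top_p=None, rng=random):
    """Sample an index from `logits` with temperature, top-k and top-p filters."""
    scaled = [x / temperature for x in logits]
    # Softmax with max-subtraction for numerical stability.
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]
    total = sum(weights)
    probs = sorted(((w / total, i) for i, w in enumerate(weights)), reverse=True)
    if top_k is not None:
        probs = probs[:top_k]                 # keep the k most likely tokens
    if top_p is not None:
        kept, cumulative = [], 0.0
        for p, i in probs:                    # smallest prefix with mass >= top_p
            kept.append((p, i))
            cumulative += p
            if cumulative >= top_p:
                break
        probs = kept
    # random.choices treats the surviving weights as relative, so they are
    # effectively renormalized.
    return rng.choices([i for _, i in probs], weights=[p for p, _ in probs])[0]

rng = random.Random(0)
logits = [2.0, 1.0, 0.5, -1.0, -3.0]
samples = {sample_token(logits, temperature=0.8, top_k=3, rng=rng) for _ in range(200)}
```

Lower temperatures sharpen the distribution, while `top_k` and `top_p` cut off the unlikely tail before sampling.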

Advanced Usage

Training with Mixed Precision & Gradient Accumulation

from halo import HaloConfig, HaloSModel, Trainer, CharacterTokenizer, set_seed
from halo.datasets import JSONLDataset

set_seed(42)

# Model
config = HaloConfig(
    vocab_size=256,
    hidden_size=512,
    num_layers=6,
    num_heads=8,
    num_kv_heads=2,
    max_seq_len=2048,
)
model = HaloSModel(config)

# Dataset
tok = CharacterTokenizer()
dataset = JSONLDataset(
    file_path="data/train.jsonl",
    tokenizer=tok,
    max_seq_len=2048,
    text_field="text",
)

# Trainer with full features
trainer = Trainer(
    model=model,
    learning_rate=3e-4,
    mixed_precision=True,              # FP16/BF16 automatic mixed precision
    gradient_accumulation_steps=4,     # Effective batch = 4 × batch_size
    max_grad_norm=1.0,                 # Gradient clipping
    checkpoint_dir="./checkpoints",
    log_every=10,
)

# Train
history = trainer.fit(
    dataset=dataset,
    epochs=10,
    batch_size=8,
    save_every=2,  # Checkpoint every 2 epochs
)

# Access training history
for epoch_data in history:
    print(f"Epoch {epoch_data['epoch']}: loss={epoch_data['train_loss']:.4f}")
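
The `gradient_accumulation_steps=4` setting with `batch_size=8` gives an effective batch of 32. The reason accumulation is equivalent to a larger batch can be seen on a toy problem: averaging the gradients of equally sized micro-batches gives the full-batch gradient. The sketch below uses a one-parameter model in pure Python; whether the pyhalos `Trainer` scales each micro-batch in exactly this way is an assumption.

```python
def mse_grad(w, batch):
    """Gradient of mean squared error for the model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

data = [(float(x), 3.0 * x + 1.0) for x in range(1, 9)]   # 8 samples
w, accumulation_steps, micro = 0.5, 4, 2

full_grad = mse_grad(w, data)

accumulated = 0.0
for step in range(accumulation_steps):
    micro_batch = data[step * micro:(step + 1) * micro]
    # Scale each micro-batch gradient so the sum is an average over all steps.
    accumulated += mse_grad(w, micro_batch) / accumulation_steps
```

After the loop, `accumulated` matches `full_grad` up to floating-point error, so a single optimizer step can use it in place of a large-batch gradient.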

Checkpoint Save & Resume

# Assumes `trainer` and `dataset` from the training example above
# Save checkpoint manually
trainer.save_checkpoint(path="my_checkpoint.pt")

# Resume training from checkpoint
trainer.load_checkpoint("my_checkpoint.pt")
# Continue training...
trainer.fit(dataset=dataset, epochs=5, batch_size=8)

Streaming Dataset (Files Larger Than RAM)

from halo import HaloConfig, HaloSModel, Trainer, CharacterTokenizer
from halo.datasets import StreamingDataset

tok = CharacterTokenizer()

# StreamingDataset reads files lazily with buffer shuffling
stream_dataset = StreamingDataset(
    file_paths=["data/shard_01.jsonl", "data/shard_02.jsonl"],
    tokenizer=tok,
    max_seq_len=2048,
    buffer_size=10000,     # Local shuffle buffer
    text_field="text",
    file_format="jsonl",   # or "txt"
)

# Use with DataLoader (IterableDataset compatible)
from torch.utils.data import DataLoader
loader = DataLoader(stream_dataset, batch_size=4)
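
The `buffer_size` argument above refers to a local shuffle buffer, a common technique for approximately shuffling a stream without loading it all into memory. A generic sketch of the idea follows; `buffer_shuffle` is a hypothetical function and is not the implementation used by `StreamingDataset`.

```python
import random

def buffer_shuffle(stream, buffer_size, seed=0):
    """Yield items from `stream` in locally shuffled order using a fixed buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in stream:
        if len(buffer) < buffer_size:
            buffer.append(item)          # fill phase
            continue
        slot = rng.randrange(buffer_size)
        yield buffer[slot]               # emit a random buffered item
        buffer[slot] = item              # and replace it with the new one
    rng.shuffle(buffer)                  # drain what is left at end of stream
    yield from buffer

shuffled = list(buffer_shuffle(range(100), buffer_size=10))
```

Memory use is bounded by the buffer size, at the cost of only shuffling items that are close together in the stream. A larger buffer gives a better approximation of a full shuffle.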

Benchmarking

from halo import HaloConfig, HaloSModel
from halo.utils.benchmarks import benchmark_speed, benchmark_generation, estimate_flops

config = HaloConfig(vocab_size=256, hidden_size=512, num_layers=6, num_heads=8)
model = HaloSModel(config)

# Latency benchmark across sequence lengths
speed_results = benchmark_speed(
    model, config,
    seq_lengths=[128, 512, 1024, 2048, 4096],
    batch_size=1,
    warmup_runs=3,
    timed_runs=10,
)
for r in speed_results:
    print(f"  N={r['seq_len']:>5} | {r['avg_ms']:.2f} ms | {r['tokens_per_sec']:,.0f} tok/s")

# Generation throughput
gen_results = benchmark_generation(
    model, config,
    prompt_len=10,
    max_new_tokens=200,
    num_runs=5,
)
print(f"Generation: {gen_results['tokens_per_sec']:.1f} tokens/sec")

# Theoretical FLOPs (no model instantiation needed)
flops = estimate_flops(config, seq_len=4096)
print(f"Total: {flops['total_gflops']:.2f} GFLOPs")
print(f"  Sparse attention: {flops['attention_flops']/1e9:.2f} G")
print(f"  Global attention: {flops['global_flops']/1e9:.2f} G")
print(f"  FFN:              {flops['ffn_flops']/1e9:.2f} G")
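
The `warmup_runs` and `timed_runs` parameters follow the usual pattern for latency measurement: run the workload a few times untimed, then average over several timed runs. A generic sketch of that pattern is below. `time_callable` is a hypothetical helper and says nothing about how `benchmark_speed` is implemented internally.

```python
import time

def time_callable(fn, warmup_runs=3, timed_runs=10):
    """Return the average wall-clock milliseconds of `fn` over `timed_runs`."""
    for _ in range(warmup_runs):
        fn()                              # warm caches and allocators
    start = time.perf_counter()
    for _ in range(timed_runs):
        fn()
    elapsed = time.perf_counter() - start
    return elapsed * 1000.0 / timed_runs

avg_ms = time_callable(lambda: sum(i * i for i in range(10_000)))
tokens_per_sec = 1024 / (avg_ms / 1000.0)   # if each call processed 1024 tokens
```

When timing GPU work by hand, call `torch.cuda.synchronize()` before reading the clock, since CUDA kernels launch asynchronously.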

Model Introspection

from halo import HaloConfig, HaloSModel, count_parameters

config = HaloConfig(vocab_size=256, hidden_size=512, num_layers=6, num_heads=8)
model = HaloSModel(config)

# Summary with architecture details and memory estimate
print(model.summary())

# Parameter count
print(f"Trainable params: {model.count_parameters():,}")

# Standalone parameter counter (works on any nn.Module)
print(f"Via utility: {count_parameters(model):,}")

# FLOPs breakdown
flops = model.estimate_flops(seq_len=2048)
for key, value in flops.items():
    print(f"  {key}: {value}")

Word-Level Tokenizer

from halo import WordTokenizer

tok = WordTokenizer()
tok.build_vocab(["The cat sat on the mat.", "Hello world!"], min_freq=1)

encoded = tok.encode("The cat sat")
decoded = tok.decode(encoded)
print(f"Vocab size: {tok.vocab_size}")
print(f"Encoded: {encoded}")
print(f"Decoded: {decoded}")
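
For readers curious what `build_vocab` with `min_freq` typically involves, here is a minimal whitespace tokenizer sketch. The class `TinyWordTokenizer`, its `<unk>` handling, and its attribute names are all assumptions made for illustration; the real `WordTokenizer` may normalize text or handle special tokens differently.

```python
from collections import Counter

class TinyWordTokenizer:
    """Whitespace tokenizer with a frequency-thresholded vocabulary (sketch)."""

    def __init__(self):
        self.token_to_id = {"<unk>": 0}
        self.id_to_token = ["<unk>"]

    def build_vocab(self, texts, min_freq=1):
        counts = Counter(word for text in texts for word in text.split())
        for word, freq in counts.most_common():
            if freq >= min_freq and word not in self.token_to_id:
                self.token_to_id[word] = len(self.id_to_token)
                self.id_to_token.append(word)

    def encode(self, text):
        # Unknown words map to id 0 (<unk>).
        return [self.token_to_id.get(word, 0) for word in text.split()]

    def decode(self, ids):
        return " ".join(self.id_to_token[i] for i in ids)

tok = TinyWordTokenizer()
tok.build_vocab(["The cat sat on the mat.", "Hello world!"], min_freq=1)
ids = tok.encode("The cat sat")
```

Note that plain whitespace splitting keeps punctuation attached ("mat." is one token) and is case-sensitive ("The" and "the" are different entries).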

Project Structure

pyhalo/
├── halo/                          # Main package
│   ├── __init__.py                # Public API exports
│   ├── attention/
│   │   ├── global_attention.py    # Dense attention for global tokens
│   │   ├── graph.py              # Neighbor list generation (local + dilated + random)
│   │   └── halo_attention.py     # Gather-based sparse attention
│   ├── core/
│   │   ├── config.py             # HaloConfig dataclass
│   │   └── logging.py           # Structured logging utilities
│   ├── datasets/
│   │   ├── jsonl.py             # JSONLDataset for structured data
│   │   ├── streaming.py         # StreamingDataset (IterableDataset, infinite)
│   │   ├── synthetic.py         # CopyDataset, NeedleDataset for testing
│   │   └── text.py             # Plain text dataset
│   ├── generation/
│   │   └── samplers.py          # Top-k, top-p, temperature sampling
│   ├── models/
│   │   ├── halo_model.py        # HaloSModel (main model)
│   │   └── baseline_model.py    # Dense baseline for comparison
│   ├── nn/
│   │   ├── feed_forward.py      # SwiGLU / GELU feed-forward
│   │   ├── halo_block.py        # HaloBlock (attention + FFN + residual)
│   │   └── rope.py             # Rotary Positional Embeddings
│   ├── tokenizers/
│   │   ├── base.py             # BaseTokenizer abstract class
│   │   ├── char.py             # CharacterTokenizer (byte-level)
│   │   ├── word.py             # WordTokenizer (whitespace-based)
│   │   └── sentencepiece.py    # SentencePiece wrapper
│   ├── training/
│   │   └── trainer.py          # Trainer with AMP, accumulation, checkpoints
│   └── utils/
│       ├── benchmarks.py        # Speed, generation, memory, FLOPs benchmarks
│       ├── metrics.py          # Parameter counting, memory estimation
│       └── random.py           # Seed management
├── docs/
│   ├── architecture.md          # Full architecture documentation
│   ├── complexity.md           # Complexity analysis and proofs
│   ├── local_attention.md      # Local window mechanism
│   ├── dilated_connections.md  # Dilated connection strategy
│   ├── global_tokens.md        # Global token design
│   ├── sparse_attention.md     # Sparse attention implementation
│   ├── gqa.md                  # Grouped Query Attention
│   ├── rope.md                 # RoPE implementation details
│   └── flash_attention.md      # Flash attention compatibility notes
├── tests/                       # 55 tests covering all components
│   ├── test_attention.py
│   ├── test_model.py
│   ├── test_training.py
│   ├── test_generation.py
│   ├── test_tokenizers.py
│   ├── test_shapes.py
│   ├── test_gradients.py
│   ├── test_memory.py
│   ├── test_checkpoint.py
│   ├── test_config.py
│   └── test_graph.py
├── benchmarks/                  # Benchmark scripts
│   ├── benchmark_speed.py
│   └── benchmark_graph.py
├── scripts/                     # Experiment scripts
│   ├── exp1_baseline.py
│   ├── exp2_ablation.py
│   └── exp3_long_context.py
├── pyproject.toml              # Package configuration
├── LICENSE                     # Custom license (research free, commercial paid)
└── README.md                   # This file

Why HALO-S?

Philosophy

HALO-S was born from a simple question: Can we get most of the representational power of dense attention while paying only a fraction of the computational cost?

The approach is grounded in graph theory. Instead of letting every token attend to every other token (a complete graph), HALO-S constructs a sparse connectivity graph with properties borrowed from network science:

Local clustering (window attention): nearby tokens form tightly connected neighborhoods, capturing syntax and local semantics
Long-range shortcuts (dilated connections): exponentially spaced connections reduce information bottlenecks across distance
Small-world properties (random edges): a few random connections are intended to keep the graph diameter small (ideally logarithmic in N), so information can propagate in roughly O(log N) hops
Shared memory (global tokens): learned parameters that act as a broadcast channel, available to every token in every layer

This combination is inspired by how efficient real-world networks (neural, social, transportation) achieve both local efficiency and global connectivity.
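
The effect of long-range shortcuts can be checked on a toy graph. The sketch below counts, by breadth-first search, how many hops it takes to get from the last token back to the first when each token can look back either only within a local window, or within the window plus a few exponentially spaced offsets. The specific offsets and the look-back-only direction are assumptions for illustration, not the package's actual graph.

```python
from collections import deque

def hops_to_start(seq_len, window, dilations):
    """BFS hop count from the last token back to token 0 over look-back edges."""
    offsets = list(range(1, window + 1)) + list(dilations)
    dist = {seq_len - 1: 0}
    queue = deque([seq_len - 1])
    while queue:
        node = queue.popleft()
        if node == 0:
            return dist[node]
        for off in offsets:
            nxt = node - off
            if nxt >= 0 and nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return None

n = 4096
local_only = hops_to_start(n, window=64, dilations=[])
with_dilated = hops_to_start(n, window=64, dilations=[128, 256, 512, 1024, 2048])
```

With the local window alone the hop count grows linearly with N, while the dilated offsets bring it down to a handful of hops. In a stacked model, each hop corresponds roughly to one layer of propagation.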

Honest Assessment

What HALO-S does well (demonstrated):

✅ Clean, modular PyTorch implementation with no exotic dependencies
✅ Mathematically sound complexity reduction (O(N×K) vs O(N²))
✅ Runs on any hardware — CPU, single GPU, no custom kernels required
✅ All 55 tests pass — correctness of gradients, shapes, generation, and checkpoints verified
✅ Training loop works end-to-end with AMP, gradient accumulation, and streaming data

What remains to be proven:

⏳ Perplexity parity with dense Transformers at equivalent parameter count on standard benchmarks (WikiText-103, C4, etc.)
⏳ Scaling behavior at 100M+ parameters
⏳ Actual wall-clock speedup vs optimized dense attention (FlashAttention v2)
⏳ Performance on downstream NLP tasks (summarization, QA, etc.)
⏳ Comparison with Mamba/SSM architectures on actual generation quality

The gather-based approach has a known trade-off: it avoids custom CUDA kernels, but the torch.gather operations materialize intermediate gathered key/value tensors that can be memory-intensive. For sequences shorter than ~9,728 tokens (the crossover point listed under Memory Efficiency), these gathered KV tensors may use more memory than dense attention. The advantage becomes clear at longer sequences.

Running Tests

# Run all 55 tests
pytest tests/ -v

# Run with coverage
pytest tests/ --cov=halo --cov-report=term-missing

# Run specific test module
pytest tests/test_attention.py -v

Running Experiments

# Experiment 1: Baseline comparison (HALO-S vs Dense)
python scripts/exp1_baseline.py

# Experiment 2: Ablation study (contribution of each connectivity component)
python scripts/exp2_ablation.py

# Experiment 3: Long context scaling behavior
python scripts/exp3_long_context.py

Citation

If you use HALO-S in your research, please cite:

@software{halo_s_2024,
  author = {BUEORM},
  title = {HALO-S: Hierarchical Attention with Local Offsets — Sparse},
  year = {2024},
  url = {https://github.com/bueormnew/pyhalo},
}

License

HALO-S Framework License — Custom dual-use license:

Use Case	Permission	Conditions
Education & Research	✅ Free	Must credit "HALO-S" in any derivative work
Personal projects & experimentation	✅ Free	Must include copyright notice
Commercial / Production use	❌ Requires license	Contact for commercial licensing

For commercial licensing inquiries: dalusx64@gmail.com

See LICENSE for full terms.

Author

BUEORM

📧 dalusx64@gmail.com
🐙 github.com/bueormnew/pyhalo

🇪🇸 Versión en Español

🌀 HALO-S

Atención Jerárquica con Offsets Locales — Disperso

Un framework de modelos de lenguaje con complejidad lineal que reemplaza la atención cuadrática con un grafo de conectividad dispersa estructurado.

¿Y si la atención no tuviera que ser cuadrática?

Todo modelo de lenguaje moderno paga un precio elevado por secuencias largas: la auto-atención estándar del Transformer escala como O(N²), haciendo que ventanas de contexto mayores a 4K tokens sean prohibitivamente costosas. HALO-S toma un camino diferente. Al construir un grafo de conectividad dispersa de grado fijo — combinando ventanas locales, conexiones dilatadas, tokens globales aprendidos y aristas aleatorias — cada token atiende solo a K vecinos sin importar la longitud de la secuencia. El resultado es complejidad O(N×K) con K=76 por defecto, logrando una reducción teórica de ~52.5× en operaciones de atención para N=4096.

HALO-S está implementado como un framework limpio de PyTorch listo para investigación. Sin kernels CUDA personalizados. Sin dependencias externas más allá de PyTorch y NumPy. Solo atención dispersa basada en gather que funciona en cualquier hardware.

⚠️ Aviso honesto: HALO-S es una exploración arquitectónica prometedora. Las ventajas teóricas de complejidad son matemáticamente sólidas, pero la validación empírica a gran escala contra modelos establecidos en benchmarks estándar aún está en progreso. Úsalo para investigación, experimentación y aprendizaje.

Características Principales

Característica	Descripción
Complejidad de Atención Lineal	O(N×K) en lugar de O(N²) — escala eficientemente a secuencias largas
Atención Dispersa basada en Gather	Sin kernels CUDA personalizados; funciona en CPU y GPU
Tokens Globales Aprendidos	Parámetros de memoria compartida que atienden la secuencia completa
Conexiones Dilatadas	Campo receptivo exponencialmente expansivo entre capas
Aristas Aleatorias	Propiedades de grafo de mundo pequeño para propagación de información
Grouped Query Attention (GQA)	Memoria KV reducida con ratios de cabezas configurables
RoPE (Rotary Position Embeddings)	Codificación posicional relativa sin parámetros aprendidos
Entrenamiento con Precisión Mixta	Soporte nativo de AMP con GradScaler
Acumulación de Gradientes	Entrena con batches efectivos grandes en hardware limitado
Guardado/Carga de Checkpoints	Persistencia y reanudación completa del estado de entrenamiento
Datasets de Streaming	Entrena con datos mayores a la RAM con shuffling por buffer
Generación Autoregresiva	Muestreo top-k, top-p y temperatura integrados

Análisis de Rendimiento (Teórico)

⚠️ Todos los datos de rendimiento son TEÓRICOS, derivados del análisis de complejidad.

Reducción de Operaciones de Atención

Con longitud de secuencia N=4096 y K=76 vecinos por token:

Operaciones de atención Transformer denso:  N²      = 16,777,216
Operaciones de atención HALO-S:             N×(K+G) =    319,488

Factor de reducción: 16,777,216 / 319,488 ≈ 52.5×

Tabla de Escalado (FLOPs de Atención)

Longitud (N)	Transformer Denso (N²)	HALO-S (N×76)	Speedup Teórico
512	262,144	38,912	6.7×
1,024	1,048,576	77,824	13.5×
2,048	4,194,304	155,648	26.9×
4,096	16,777,216	311,296	53.9×
8,192	67,108,864	622,592	107.8×
16,384	268,435,456	1,245,184	215.6×

Comparación Teórica con Otras Arquitecturas

⚠️ COMPARACIÓN TEÓRICA — basada en análisis de complejidad publicados, no en benchmarks directos.

Modelo	Complejidad Atención	Memoria (Scores)	Contexto Global	Dilatación	Aristas Aleatorias	GQA
Transformer Denso	O(N²·d)	O(N²)	Completo (implícito)	✗	✗	Opcional
Longformer	O(N·w·d)	O(N·w)	✓ (fijos)	✓	✗	✗
BigBird	O(N·(w+g+r)·d)	O(N·(w+g+r))	✓ (fijos)	✗	✓	✗
Mamba (SSM)	O(N·d²)	O(d²)	Implícito (estado)	✗	✗	N/A
HALO-S	O(N·K·d)	O(N·K)	✓ (aprendidos)	✓	✓	✓

Instalación

# Instalación básica (solo PyTorch + NumPy)
pip install pyhalos

# Instalación completa (incluye tqdm + SentencePiece)
pip install "pyhalos[full]"

# Desde código fuente
git clone https://github.com/bueormnew/pyhalo.git
cd pyhalo
pip install -e ".[full,dev]"

Requisitos: Python ≥ 3.10, PyTorch ≥ 2.1.0, NumPy ≥ 1.24.0

Inicio Rápido

from halo import HaloConfig, HaloSModel, set_seed

set_seed(42)

# Configurar modelo
config = HaloConfig(
    vocab_size=256,
    hidden_size=512,
    num_layers=6,
    num_heads=8,
    num_kv_heads=2,       # GQA: ratio 4:1
    num_globals=2,
    local_window=64,
    max_seq_len=4096,
)

model = HaloSModel(config)
print(model.summary())
print(f"Parámetros: {model.count_parameters():,}")

Generación de Texto

from halo import HaloConfig, HaloSModel, CharacterTokenizer, set_seed

set_seed(42)
config = HaloConfig(vocab_size=256, hidden_size=256, num_layers=4, num_heads=4)
model = HaloSModel(config)
tok = CharacterTokenizer()

# Generar desde texto (retorna string)
output = model.generate(
    "Hola mundo",
    tokenizer=tok,
    max_new_tokens=50,
    temperature=0.8,
    top_k=40,
)
print(output)

Entrenamiento Completo

from halo import HaloConfig, HaloSModel, Trainer, CharacterTokenizer, set_seed
from halo.datasets import JSONLDataset

set_seed(42)

config = HaloConfig(vocab_size=256, hidden_size=512, num_layers=6, num_heads=8, num_kv_heads=2)
model = HaloSModel(config)
tok = CharacterTokenizer()

dataset = JSONLDataset(
    file_path="datos/train.jsonl",
    tokenizer=tok,
    max_seq_len=2048,
    text_field="text",
)

trainer = Trainer(
    model=model,
    learning_rate=3e-4,
    mixed_precision=True,
    gradient_accumulation_steps=4,
    max_grad_norm=1.0,
    checkpoint_dir="./checkpoints",
)

history = trainer.fit(dataset=dataset, epochs=10, batch_size=8, save_every=2)

Benchmarks

from halo import HaloConfig, HaloSModel
from halo.utils.benchmarks import benchmark_speed, benchmark_generation, estimate_flops

config = HaloConfig(vocab_size=256, hidden_size=512, num_layers=6, num_heads=8)
model = HaloSModel(config)

# Benchmark de latencia
resultados = benchmark_speed(model, config, seq_lengths=[512, 1024, 2048, 4096])
for r in resultados:
    print(f"  N={r['seq_len']:>5} | {r['avg_ms']:.2f} ms | {r['tokens_per_sec']:,.0f} tok/s")

# FLOPs teóricos
flops = estimate_flops(config, seq_len=4096)
print(f"Total: {flops['total_gflops']:.2f} GFLOPs")

¿Por Qué HALO-S?

Filosofía

HALO-S nació de una pregunta simple: ¿Podemos obtener la mayor parte del poder representacional de la atención densa pagando solo una fracción del costo computacional?

El enfoque se basa en teoría de grafos. En lugar de permitir que cada token atienda a todos los demás (un grafo completo), HALO-S construye un grafo de conectividad dispersa con propiedades de ciencia de redes:

Clustering local (atención de ventana) — tokens cercanos forman vecindarios densamente conectados
Atajos de largo alcance (conexiones dilatadas) — conexiones espaciadas exponencialmente previenen cuellos de botella
Propiedades de mundo pequeño (aristas aleatorias) — garantizan que el diámetro del grafo sea logarítmico
Memoria compartida (tokens globales) — parámetros aprendidos que actúan como canal de broadcast

Evaluación Honesta

Lo que HALO-S hace bien (demostrado):

✅ Implementación limpia y modular en PyTorch sin dependencias exóticas
✅ Reducción de complejidad matemáticamente sólida (O(N×K) vs O(N²))
✅ Funciona en cualquier hardware — CPU, GPU, sin kernels personalizados
✅ 55 tests pasan — correctitud de gradientes, formas, generación y checkpoints
✅ Loop de entrenamiento funciona end-to-end con AMP y streaming

Lo que queda por demostrar:

⏳ Paridad de perplejidad con Transformers densos a parámetros equivalentes
⏳ Comportamiento de escalado a 100M+ parámetros
⏳ Speedup real de wall-clock vs atención densa optimizada (FlashAttention v2)
⏳ Rendimiento en tareas NLP downstream
⏳ Comparación con Mamba/SSM en calidad de generación

Licencia

Licencia HALO-S Framework — Licencia dual personalizada:

Caso de Uso	Permiso	Condiciones
Educación e Investigación	✅ Gratis	Debe acreditar "HALO-S" en cualquier trabajo derivado
Proyectos personales	✅ Gratis	Debe incluir aviso de copyright
Uso Comercial / Producción	❌ Requiere licencia	Contactar para licencia comercial

Para consultas de licencia comercial: dalusx64@gmail.com

Autor

BUEORM

📧 dalusx64@gmail.com
🐙 github.com/bueormnew/pyhalo

Built with a focus on clarity, correctness, and computational efficiency.

Release history

Version	Released
2.2.1	Jun 26, 2026
2.1.0	Jun 26, 2026
2.0.0 (this version)	Jun 25, 2026
1.0.3	Jun 25, 2026

Download files

Source Distribution

pyhalos-2.0.0.tar.gz (75.5 kB), uploaded Jun 25, 2026, Source

Built Distribution

pyhalos-2.0.0-py3-none-any.whl (55.5 kB), uploaded Jun 25, 2026, Python 3

File details

Details for the file pyhalos-2.0.0.tar.gz.

File metadata

Download URL: pyhalos-2.0.0.tar.gz
Upload date: Jun 25, 2026
Size: 75.5 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for pyhalos-2.0.0.tar.gz
Algorithm	Hash digest
SHA256	`40554a3a4d382b5466ef90e300f506e76fbd84f6a6a9b065d40f8736bc38c12b`
MD5	`b3431c401fc7aa19eb7c199a2e20aab3`
BLAKE2b-256	`a8a839f7093878ec16d78cfa23684ea043b77598f0e7baa1824c6d804c8a9a1f`

Provenance

The following attestation bundles were made for pyhalos-2.0.0.tar.gz:

Publisher: publish.yml on bueormnew/pyhalo

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: pyhalos-2.0.0.tar.gz
- Subject digest: 40554a3a4d382b5466ef90e300f506e76fbd84f6a6a9b065d40f8736bc38c12b
- Sigstore transparency entry: 1959747140
- Sigstore integration time: Jun 25, 2026
Source repository:
- Permalink: bueormnew/pyhalo@ce429db05d3e7337f84b937deb7e17531c419400
- Branch / Tag: refs/tags/v2.0.0
- Owner: https://github.com/bueormnew
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@ce429db05d3e7337f84b937deb7e17531c419400
- Trigger Event: release

File details

Details for the file pyhalos-2.0.0-py3-none-any.whl.

File metadata

Download URL: pyhalos-2.0.0-py3-none-any.whl
Upload date: Jun 25, 2026
Size: 55.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for pyhalos-2.0.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`6798938f14a3531d5660bdf02b082373903d4c3dfb1383c23cbdd8a7ae0e08e8`
MD5	`c8f22aaab2c4d0fbc50d35e070cd9ca2`
BLAKE2b-256	`c6759f17a08fd719ace36636065d09c10d6970cffa2bdd26fdfc3d12c406c5af`

Provenance

The following attestation bundles were made for pyhalos-2.0.0-py3-none-any.whl:

Publisher: publish.yml on bueormnew/pyhalo

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: pyhalos-2.0.0-py3-none-any.whl
- Subject digest: 6798938f14a3531d5660bdf02b082373903d4c3dfb1383c23cbdd8a7ae0e08e8
- Sigstore transparency entry: 1959747267
- Sigstore integration time: Jun 25, 2026
Source repository:
- Permalink: bueormnew/pyhalo@ce429db05d3e7337f84b937deb7e17531c419400
- Branch / Tag: refs/tags/v2.0.0
- Owner: https://github.com/bueormnew
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@ce429db05d3e7337f84b937deb7e17531c419400
- Trigger Event: release
