Skip to main content

Surgical Activation Sparsity for LLMs on Blackwell Architectures

Project description

GhostWeight ๐Ÿ‘ป

Training-Free Activation Sparsity for LLM Inference on Consumer Hardware

License: MIT Python 3.10+ CUDA 12.6 Hardware: RTX 5060


What Is This

GhostWeight is a framework that exploits activation sparsity in large language models to achieve significant hardware speedup without retraining.

The core finding: in Qwen2.5-7B-Instruct, 27.3% of MLP neurons never fire across diverse prompts. They sit in VRAM consuming memory and bandwidth while contributing exactly zero to the output. By identifying and removing them permanently, and combining with a threshold-gated activation function, we achieve up to 110% hardware speedup on a consumer NVIDIA RTX 5060 (8GB Blackwell).

All results are measured on real hardware. No simulations.


Results

Model: Qwen2.5-7B-Instruct (4-bit NF4) Hardware: NVIDIA RTX 5060 8GB Blackwell GDDR7 CUDA: 12.6 | Driver: 591.86

Main Results Table

Strategy Total Sparsity Speedup Perplexity ฮ” Status
Baseline 0% +0.00% +0.00% Reference
Static Dead Neuron Mask 27.3% +38.35% +0.00% โœ… Production
Static + GhostGate (t=0.05) 42.4% +74.71% +5.91% โœ… Production
Static + GhostGate (t=0.10) 52.3% +110.53% +11.16% โš ๏ธ Research

Note: Speedup measured using sparse row-packing on Qwen-7B MLP dimensions (hidden=3584, intermediate=18944). Perplexity measured on 10 diverse AI/ML texts. All results reproducible with scripts in /benchmarks.

Kernel Benchmark

Kernel Time vs Dense
Dense baseline 0.9665 ms reference
GhostWeight (row-packed) 0.5691 ms +69.83%

Kernel efficiency: 95.8% of theoretical maximum at 72.88% sparsity. Hardware: RTX 5060 Blackwell. Kernel: CuPy JIT CUDA.

Streaming Pipeline

Metric Value
PCIe Gen4 x8 bandwidth 14.28 GB/s
Active layer swap time 8.26 ms
Layer compute time 14 ms
Async overlap 35%
Layer swap reduction 86.33%

Swap is 1.7x faster than compute. Memory latency hides behind computation.


How It Works

1. Static Dead Neuron Masking

Run 25+ diverse prompts through the model and record which neurons fire. Neurons that never fire are permanently removed from weight matrices before inference. This is a one-time offline operation with zero inference overhead.

Result: 27.3% of neurons removed | +0.00% perplexity impact | +38.35% speedup

2. GhostGate

Replace SiLU activations with a thresholded variant:

GhostGate(x) = SiLU(x) * (|SiLU(x)| > threshold)

Values below threshold are hard-zeroed. The threshold controls the speed-quality tradeoff. No retraining required.

from ghostweight import apply_ghostgate

model = apply_ghostgate(model, threshold=0.05)

3. Sparse Row-Packing CUDA Kernel

Instead of computing zero rows (which wastes GPU cycles due to warp divergence), active neurons are packed into a dense buffer first. Then a smaller dense matmul runs on only the active rows.

Branch skipping (naive):  -9.43%  vs dense  โ† SLOWER
Row packing (ours):       +69.83% vs dense  โ† FASTER

This eliminates warp divergence and achieves 95.8% of theoretical maximum kernel efficiency.


Honest Assessment

What Is Measured

  • Perplexity tradeoff curve on 10 texts โœ…
  • Hardware speedup on Qwen-7B MLP dimensions โœ…
  • PCIe bandwidth and async overlap โœ…
  • Static dead neuron prevalence (27.3%) โœ…
  • Predictor overhead analysis โœ…

What Is Projected

  • 70B throughput (3.08 tok/s) is a mathematical projection from measured layer times. Not an end-to-end measured result. We do not currently have a 70B model downloaded to verify live.

What Is Future Work

  • Sparsity-aware fine-tuning to recover quality at t=0.10
  • End-to-end integration of static mask + GhostGate + streaming
  • MMLU and standardized benchmark evaluation
  • Extension to attention sparsity
  • Benchmarks on Llama-4 and Gemma-3
  • Native Blackwell C++ kernel (blocked by CUDA 12.6 + VS 2026 toolchain incompatibility)

Installation

git clone https://github.com/manjitpokhrel/GhostWeight
cd GhostWeight
pip install -r requirements.txt

Or use the conda environment:

conda env create -f environment.yml
conda activate ghostweight

Quick Start

from transformers import AutoModelForCausalLM, AutoTokenizer
from ghostweight import apply_ghostgate

model_id = "Qwen/Qwen2.5-7B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="cuda",
    torch_dtype="auto"
)

# One line to enable GhostGate
model = apply_ghostgate(model, threshold=0.05)

# Inference is now 74% faster with 5.91% perplexity cost
inputs = tokenizer("Explain quantum entanglement", return_tensors="pt").to("cuda")
output = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(output[0], skip_special_tokens=True))

Reproduce The Results

Step 1: Build the weight index

python prototypes/build_weight_index.py

Step 2: Run the threshold sweep

python prototypes/threshold_sweep.py

Step 3: Measure perplexity tradeoff

python benchmarks/perplexity_eval_v2.py

Step 4: Measure speedup curve

python benchmarks/speedup_curve_v2.py

Step 5: Static mask analysis

python benchmarks/static_mask_speedup.py
python benchmarks/static_mask_perplexity.py

Step 6: Full paper table

python benchmarks/final_table.py

Repository Structure

GhostWeight/
โ”œโ”€โ”€ ghostweight/
โ”‚   โ””โ”€โ”€ ghostgate.py          # GhostGate implementation + utilities
โ”œโ”€โ”€ kernels/
โ”‚   โ”œโ”€โ”€ ghost_sparse_matmul.cu # Row-packing sparse kernel (CuPy verified)
โ”‚   โ”œโ”€โ”€ ghost_tiled.cu         # Tiled shared memory kernel (future work)
โ”‚   โ”œโ”€โ”€ ghost_kernel.cu        # Native CUDA kernel (toolchain pending)
โ”‚   โ””โ”€โ”€ ghost_engine.h         # CUDA header
โ”œโ”€โ”€ benchmarks/
โ”‚   โ”œโ”€โ”€ final_table.py         # Reproduces main paper table
โ”‚   โ”œโ”€โ”€ perplexity_eval_v2.py  # Perplexity tradeoff measurement
โ”‚   โ”œโ”€โ”€ speedup_curve_v2.py    # Speedup vs sparsity measurement
โ”‚   โ”œโ”€โ”€ predictor_overhead.py  # Predictor cost analysis
โ”‚   โ”œโ”€โ”€ static_mask_speedup.py # Static masking benchmark
โ”‚   โ””โ”€โ”€ static_mask_perplexity.py # Static masking quality
โ”œโ”€โ”€ prototypes/
โ”‚   โ”œโ”€โ”€ scan_sparsity.py       # Phase 1: Initial sparsity measurement
โ”‚   โ”œโ”€โ”€ threshold_sweep.py     # Phase 1: Threshold analysis
โ”‚   โ”œโ”€โ”€ build_weight_index.py  # Phase 2: Dead neuron identification
โ”‚   โ”œโ”€โ”€ ghost_predictor_v2.py  # Phase 2: Dynamic predictor (abandoned)
โ”‚   โ”œโ”€โ”€ hardware_benchmark_v2.py # Phase 3: Kernel benchmark
โ”‚   โ”œโ”€โ”€ async_pipe_test_torch.py # Phase 4: Streaming pipeline
โ”‚   โ””โ”€โ”€ ...
โ”œโ”€โ”€ training/
โ”‚   โ””โ”€โ”€ sparsity_finetune.py   # Sparsity-aware fine-tuning (WIP)
โ”œโ”€โ”€ data/
โ”‚   โ””โ”€โ”€ *.json                 # All benchmark results
โ”œโ”€โ”€ models/
โ”‚   โ””โ”€โ”€ .gitkeep               # Weights hosted on HuggingFace (link below)
โ”œโ”€โ”€ README.md
โ”œโ”€โ”€ LICENSE
โ”œโ”€โ”€ requirements.txt
โ””โ”€โ”€ environment.yml

Key Finding: Predictor vs Static Mask

We originally designed a 23.4MB neural network (Ghost Predictor) to dynamically predict which neurons would fire before each layer computed them. Active neuron recall reached 85.59%.

However, the predictor cost 0.2509ms per call โ€” 38.30% of dense layer time. This reduced net speedup from +74.71% to +3.40%.

Conclusion: For neurons that are permanently dead, static masking outperforms dynamic prediction by 20x in net speedup. The predictor architecture is only worthwhile for context-dependent neurons with async pre-fetching, which is left as future work.


๐Ÿ 72B Live Demo

We successfully ran Qwen2.5-72B-Instruct-Q4_K_M on a single RTX 5060 (8GB) using llama.cpp partial GPU offload.

  • Status: Generated coherent output โœ…
  • Speed: ~0.022 tokens/sec (IO-bound, dense weights)
  • RAM used: 11.5 GB (model paged from NVMe)
  • VRAM used: Partial offload (8 GPU layers)

The bottleneck is not compute. It is the 47GB IO footprint. GhostWeight's 72.88% sparsity reduction is the path to real-time 70B inference on consumer hardware.

Hardware

All experiments run on:

GPU:          NVIDIA GeForce RTX 5060 (Blackwell sm_89)
VRAM:         8GB GDDR7
PCIe:         Gen 4 x8  (14.28 GB/s measured)
System RAM:   16GB
OS:           Windows 11 Pro
CUDA:         12.6
Driver:       591.86
Python:       3.10.11
PyTorch:      2.5.x

This is a consumer gaming GPU available for approximately $300. Not a research cluster. Not an H100.


Citation

If you use GhostWeight in your research, please cite:

@misc{pokhrel2026ghostweight,
  author    = {Pokhrel, Manjit},
  title     = {GhostWeight: Training-Free Activation Sparsity
               for LLM Inference on Consumer Hardware},
  year      = {2026},
  publisher = {GitHub},
  url       = {https://github.com/manjitpokhrel/GhostWeight}
}

Author

Manjit Pokhrel AI Researcher, Nepal


License

MIT License. See LICENSE for details.

Model weights (Qwen2.5-7B-Instruct) are subject to the Tongyi Qiwen Community License. GhostWeight code is independent of model weights and is MIT licensed.


Built in one research session on a consumer GPU in Nepal. The VRAM wall is not as solid as it looks.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

ghostweight-0.1.0.tar.gz (8.4 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

ghostweight-0.1.0-py3-none-any.whl (8.7 kB view details)

Uploaded Python 3

File details

Details for the file ghostweight-0.1.0.tar.gz.

File metadata

  • Download URL: ghostweight-0.1.0.tar.gz
  • Upload date:
  • Size: 8.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.15

File hashes

Hashes for ghostweight-0.1.0.tar.gz
Algorithm Hash digest
SHA256 63718195360abf913979c41c195ec957eb630a44638124c8b06ce58a1ee69eaf
MD5 e15f75557c86670a50baec0b9d59cc6d
BLAKE2b-256 a417bcd64d0821ef28f14e23b55fb024e10c3282ea9f3b9cd96630271f4485e4

See more details on using hashes here.

File details

Details for the file ghostweight-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: ghostweight-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 8.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.15

File hashes

Hashes for ghostweight-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 39b71c0a7d8dbb0869b88b589cda6fc3825e37d3dfd8e91096a2ba70077eaf21
MD5 7f01143e2ce11a88706d573eff1add23
BLAKE2b-256 9d61986790de4af75adfc7d4df8216f328c63b58c051900f0aeed079ed2ceb1d

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page