Surgical Activation Sparsity for LLMs on Blackwell Architectures
Project description
GhostWeight ๐ป
Training-Free Activation Sparsity for LLM Inference on Consumer Hardware
What Is This
GhostWeight is a framework that exploits activation sparsity in large language models to achieve significant hardware speedup without retraining.
The core finding: in Qwen2.5-7B-Instruct, 27.3% of MLP neurons never fire across diverse prompts. They sit in VRAM consuming memory and bandwidth while contributing exactly zero to the output. By identifying and removing them permanently, and combining with a threshold-gated activation function, we achieve up to 110% hardware speedup on a consumer NVIDIA RTX 5060 (8GB Blackwell).
All results are measured on real hardware. No simulations.
Results
Model: Qwen2.5-7B-Instruct (4-bit NF4) Hardware: NVIDIA RTX 5060 8GB Blackwell GDDR7 CUDA: 12.6 | Driver: 591.86
Main Results Table
| Strategy | Total Sparsity | Speedup | Perplexity ฮ | Status |
|---|---|---|---|---|
| Baseline | 0% | +0.00% | +0.00% | Reference |
| Static Dead Neuron Mask | 27.3% | +38.35% | +0.00% | โ Production |
| Static + GhostGate (t=0.05) | 42.4% | +74.71% | +5.91% | โ Production |
| Static + GhostGate (t=0.10) | 52.3% | +110.53% | +11.16% | โ ๏ธ Research |
Note: Speedup measured using sparse row-packing on Qwen-7B MLP dimensions (hidden=3584, intermediate=18944). Perplexity measured on 10 diverse AI/ML texts. All results reproducible with scripts in
/benchmarks.
Kernel Benchmark
| Kernel | Time | vs Dense |
|---|---|---|
| Dense baseline | 0.9665 ms | reference |
| GhostWeight (row-packed) | 0.5691 ms | +69.83% |
Kernel efficiency: 95.8% of theoretical maximum at 72.88% sparsity. Hardware: RTX 5060 Blackwell. Kernel: CuPy JIT CUDA.
Streaming Pipeline
| Metric | Value |
|---|---|
| PCIe Gen4 x8 bandwidth | 14.28 GB/s |
| Active layer swap time | 8.26 ms |
| Layer compute time | 14 ms |
| Async overlap | 35% |
| Layer swap reduction | 86.33% |
Swap is 1.7x faster than compute. Memory latency hides behind computation.
How It Works
1. Static Dead Neuron Masking
Run 25+ diverse prompts through the model and record which neurons fire. Neurons that never fire are permanently removed from weight matrices before inference. This is a one-time offline operation with zero inference overhead.
Result: 27.3% of neurons removed | +0.00% perplexity impact | +38.35% speedup
2. GhostGate
Replace SiLU activations with a thresholded variant:
GhostGate(x) = SiLU(x) * (|SiLU(x)| > threshold)
Values below threshold are hard-zeroed. The threshold controls the speed-quality tradeoff. No retraining required.
from ghostweight import apply_ghostgate
model = apply_ghostgate(model, threshold=0.05)
3. Sparse Row-Packing CUDA Kernel
Instead of computing zero rows (which wastes GPU cycles due to warp divergence), active neurons are packed into a dense buffer first. Then a smaller dense matmul runs on only the active rows.
Branch skipping (naive): -9.43% vs dense โ SLOWER
Row packing (ours): +69.83% vs dense โ FASTER
This eliminates warp divergence and achieves 95.8% of theoretical maximum kernel efficiency.
Honest Assessment
What Is Measured
- Perplexity tradeoff curve on 10 texts โ
- Hardware speedup on Qwen-7B MLP dimensions โ
- PCIe bandwidth and async overlap โ
- Static dead neuron prevalence (27.3%) โ
- Predictor overhead analysis โ
What Is Projected
- 70B throughput (3.08 tok/s) is a mathematical projection from measured layer times. Not an end-to-end measured result. We do not currently have a 70B model downloaded to verify live.
What Is Future Work
- Sparsity-aware fine-tuning to recover quality at t=0.10
- End-to-end integration of static mask + GhostGate + streaming
- MMLU and standardized benchmark evaluation
- Extension to attention sparsity
- Benchmarks on Llama-4 and Gemma-3
- Native Blackwell C++ kernel (blocked by CUDA 12.6 + VS 2026 toolchain incompatibility)
Installation
git clone https://github.com/manjitpokhrel/GhostWeight
cd GhostWeight
pip install -r requirements.txt
Or use the conda environment:
conda env create -f environment.yml
conda activate ghostweight
Quick Start
from transformers import AutoModelForCausalLM, AutoTokenizer
from ghostweight import apply_ghostgate
model_id = "Qwen/Qwen2.5-7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
model_id,
device_map="cuda",
torch_dtype="auto"
)
# One line to enable GhostGate
model = apply_ghostgate(model, threshold=0.05)
# Inference is now 74% faster with 5.91% perplexity cost
inputs = tokenizer("Explain quantum entanglement", return_tensors="pt").to("cuda")
output = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(output[0], skip_special_tokens=True))
Reproduce The Results
Step 1: Build the weight index
python prototypes/build_weight_index.py
Step 2: Run the threshold sweep
python prototypes/threshold_sweep.py
Step 3: Measure perplexity tradeoff
python benchmarks/perplexity_eval_v2.py
Step 4: Measure speedup curve
python benchmarks/speedup_curve_v2.py
Step 5: Static mask analysis
python benchmarks/static_mask_speedup.py
python benchmarks/static_mask_perplexity.py
Step 6: Full paper table
python benchmarks/final_table.py
Repository Structure
GhostWeight/
โโโ ghostweight/
โ โโโ ghostgate.py # GhostGate implementation + utilities
โโโ kernels/
โ โโโ ghost_sparse_matmul.cu # Row-packing sparse kernel (CuPy verified)
โ โโโ ghost_tiled.cu # Tiled shared memory kernel (future work)
โ โโโ ghost_kernel.cu # Native CUDA kernel (toolchain pending)
โ โโโ ghost_engine.h # CUDA header
โโโ benchmarks/
โ โโโ final_table.py # Reproduces main paper table
โ โโโ perplexity_eval_v2.py # Perplexity tradeoff measurement
โ โโโ speedup_curve_v2.py # Speedup vs sparsity measurement
โ โโโ predictor_overhead.py # Predictor cost analysis
โ โโโ static_mask_speedup.py # Static masking benchmark
โ โโโ static_mask_perplexity.py # Static masking quality
โโโ prototypes/
โ โโโ scan_sparsity.py # Phase 1: Initial sparsity measurement
โ โโโ threshold_sweep.py # Phase 1: Threshold analysis
โ โโโ build_weight_index.py # Phase 2: Dead neuron identification
โ โโโ ghost_predictor_v2.py # Phase 2: Dynamic predictor (abandoned)
โ โโโ hardware_benchmark_v2.py # Phase 3: Kernel benchmark
โ โโโ async_pipe_test_torch.py # Phase 4: Streaming pipeline
โ โโโ ...
โโโ training/
โ โโโ sparsity_finetune.py # Sparsity-aware fine-tuning (WIP)
โโโ data/
โ โโโ *.json # All benchmark results
โโโ models/
โ โโโ .gitkeep # Weights hosted on HuggingFace (link below)
โโโ README.md
โโโ LICENSE
โโโ requirements.txt
โโโ environment.yml
Key Finding: Predictor vs Static Mask
We originally designed a 23.4MB neural network (Ghost Predictor) to dynamically predict which neurons would fire before each layer computed them. Active neuron recall reached 85.59%.
However, the predictor cost 0.2509ms per call โ 38.30% of dense layer time. This reduced net speedup from +74.71% to +3.40%.
Conclusion: For neurons that are permanently dead, static masking outperforms dynamic prediction by 20x in net speedup. The predictor architecture is only worthwhile for context-dependent neurons with async pre-fetching, which is left as future work.
๐ 72B Live Demo
We successfully ran Qwen2.5-72B-Instruct-Q4_K_M on a single RTX 5060 (8GB) using llama.cpp partial GPU offload.
- Status: Generated coherent output โ
- Speed: ~0.022 tokens/sec (IO-bound, dense weights)
- RAM used: 11.5 GB (model paged from NVMe)
- VRAM used: Partial offload (8 GPU layers)
The bottleneck is not compute. It is the 47GB IO footprint. GhostWeight's 72.88% sparsity reduction is the path to real-time 70B inference on consumer hardware.
Hardware
All experiments run on:
GPU: NVIDIA GeForce RTX 5060 (Blackwell sm_89)
VRAM: 8GB GDDR7
PCIe: Gen 4 x8 (14.28 GB/s measured)
System RAM: 16GB
OS: Windows 11 Pro
CUDA: 12.6
Driver: 591.86
Python: 3.10.11
PyTorch: 2.5.x
This is a consumer gaming GPU available for approximately $300. Not a research cluster. Not an H100.
Citation
If you use GhostWeight in your research, please cite:
@misc{pokhrel2026ghostweight,
author = {Pokhrel, Manjit},
title = {GhostWeight: Training-Free Activation Sparsity
for LLM Inference on Consumer Hardware},
year = {2026},
publisher = {GitHub},
url = {https://github.com/manjitpokhrel/GhostWeight}
}
Author
Manjit Pokhrel AI Researcher, Nepal
- GitHub: manjitpokhrel
- X/Twitter: @manjitpokhrel_
- LinkedIn: manjitpokhrel
License
MIT License. See LICENSE for details.
Model weights (Qwen2.5-7B-Instruct) are subject to the Tongyi Qiwen Community License. GhostWeight code is independent of model weights and is MIT licensed.
Built in one research session on a consumer GPU in Nepal. The VRAM wall is not as solid as it looks.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file ghostweight-0.1.0.tar.gz.
File metadata
- Download URL: ghostweight-0.1.0.tar.gz
- Upload date:
- Size: 8.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.15
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
63718195360abf913979c41c195ec957eb630a44638124c8b06ce58a1ee69eaf
|
|
| MD5 |
e15f75557c86670a50baec0b9d59cc6d
|
|
| BLAKE2b-256 |
a417bcd64d0821ef28f14e23b55fb024e10c3282ea9f3b9cd96630271f4485e4
|
File details
Details for the file ghostweight-0.1.0-py3-none-any.whl.
File metadata
- Download URL: ghostweight-0.1.0-py3-none-any.whl
- Upload date:
- Size: 8.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.15
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
39b71c0a7d8dbb0869b88b589cda6fc3825e37d3dfd8e91096a2ba70077eaf21
|
|
| MD5 |
7f01143e2ce11a88706d573eff1add23
|
|
| BLAKE2b-256 |
9d61986790de4af75adfc7d4df8216f328c63b58c051900f0aeed079ed2ceb1d
|