Surgical Activation Sparsity for LLMs on Blackwell Architectures

Project description

GhostWeight 👻

Training-Free Activation Sparsity for LLM Inference on Consumer Hardware

What Is This

GhostWeight is a framework that exploits activation sparsity in large language models to achieve significant hardware speedup without retraining.

The core finding: in Qwen2.5-7B-Instruct, 27.3% of MLP neurons never fire across diverse prompts. They sit in VRAM consuming memory and bandwidth while contributing exactly zero to the output. By identifying and removing them permanently, and combining with a threshold-gated activation function, we achieve up to 110% hardware speedup on a consumer NVIDIA RTX 5060 (8GB Blackwell).

All results are measured on real hardware. No simulations.

Results

Model: Qwen2.5-7B-Instruct (4-bit NF4) Hardware: NVIDIA RTX 5060 8GB Blackwell GDDR7 CUDA: 12.6 | Driver: 591.86

Main Results Table

Strategy	Total Sparsity	Speedup	Perplexity Δ	Status
Baseline	0%	+0.00%	+0.00%	Reference
Static Dead Neuron Mask	27.3%	+38.35%	+0.00%	✅ Production
Static + GhostGate (t=0.05)	42.4%	+74.71%	+5.91%	✅ Production
Static + GhostGate (t=0.10)	52.3%	+110.53%	+11.16%	⚠️ Research

Note: Speedup measured using sparse row-packing on Qwen-7B MLP dimensions (hidden=3584, intermediate=18944). Perplexity measured on 10 diverse AI/ML texts. All results reproducible with scripts in /benchmarks.

Kernel Benchmark

Kernel	Time	vs Dense
Dense baseline	0.9665 ms	reference
GhostWeight (row-packed)	0.5691 ms	+69.83%

Kernel efficiency: 95.8% of theoretical maximum at 72.88% sparsity. Hardware: RTX 5060 Blackwell. Kernel: CuPy JIT CUDA.

Streaming Pipeline

Metric	Value
PCIe Gen4 x8 bandwidth	14.28 GB/s
Active layer swap time	8.26 ms
Layer compute time	14 ms
Async overlap	35%
Layer swap reduction	86.33%

Swap is 1.7x faster than compute. Memory latency hides behind computation.

How It Works

1. Static Dead Neuron Masking

Run 25+ diverse prompts through the model and record which neurons fire. Neurons that never fire are permanently removed from weight matrices before inference. This is a one-time offline operation with zero inference overhead.

Result: 27.3% of neurons removed | +0.00% perplexity impact | +38.35% speedup

2. GhostGate

Replace SiLU activations with a thresholded variant:

GhostGate(x) = SiLU(x) * (|SiLU(x)| > threshold)

Values below threshold are hard-zeroed. The threshold controls the speed-quality tradeoff. No retraining required.

from ghostweight import apply_ghostgate

model = apply_ghostgate(model, threshold=0.05)

3. Sparse Row-Packing CUDA Kernel

Instead of computing zero rows (which wastes GPU cycles due to warp divergence), active neurons are packed into a dense buffer first. Then a smaller dense matmul runs on only the active rows.

Branch skipping (naive):  -9.43%  vs dense  ← SLOWER
Row packing (ours):       +69.83% vs dense  ← FASTER

This eliminates warp divergence and achieves 95.8% of theoretical maximum kernel efficiency.

Honest Assessment

What Is Measured

Perplexity tradeoff curve on 10 texts ✅
Hardware speedup on Qwen-7B MLP dimensions ✅
PCIe bandwidth and async overlap ✅
Static dead neuron prevalence (27.3%) ✅
Predictor overhead analysis ✅

What Is Projected

70B throughput (3.08 tok/s) is a mathematical projection from measured layer times. Not an end-to-end measured result. We do not currently have a 70B model downloaded to verify live.

What Is Future Work

Sparsity-aware fine-tuning to recover quality at t=0.10
End-to-end integration of static mask + GhostGate + streaming
MMLU and standardized benchmark evaluation
Extension to attention sparsity
Benchmarks on Llama-4 and Gemma-3
Native Blackwell C++ kernel (blocked by CUDA 12.6 + VS 2026 toolchain incompatibility)

Installation

git clone https://github.com/manjitpokhrel/GhostWeight
cd GhostWeight
pip install -r requirements.txt

Or use the conda environment:

conda env create -f environment.yml
conda activate ghostweight

Quick Start

from transformers import AutoModelForCausalLM, AutoTokenizer
from ghostweight import apply_ghostgate

model_id = "Qwen/Qwen2.5-7B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="cuda",
    torch_dtype="auto"
)

# One line to enable GhostGate
model = apply_ghostgate(model, threshold=0.05)

# Inference is now 74% faster with 5.91% perplexity cost
inputs = tokenizer("Explain quantum entanglement", return_tensors="pt").to("cuda")
output = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(output[0], skip_special_tokens=True))

Reproduce The Results

Step 1: Build the weight index

python prototypes/build_weight_index.py

Step 2: Run the threshold sweep

python prototypes/threshold_sweep.py

Step 3: Measure perplexity tradeoff

python benchmarks/perplexity_eval_v2.py

Step 4: Measure speedup curve

python benchmarks/speedup_curve_v2.py

Step 5: Static mask analysis

python benchmarks/static_mask_speedup.py
python benchmarks/static_mask_perplexity.py

Step 6: Full paper table

python benchmarks/final_table.py

Repository Structure

GhostWeight/
├── ghostweight/
│   └── ghostgate.py          # GhostGate implementation + utilities
├── kernels/
│   ├── ghost_sparse_matmul.cu # Row-packing sparse kernel (CuPy verified)
│   ├── ghost_tiled.cu         # Tiled shared memory kernel (future work)
│   ├── ghost_kernel.cu        # Native CUDA kernel (toolchain pending)
│   └── ghost_engine.h         # CUDA header
├── benchmarks/
│   ├── final_table.py         # Reproduces main paper table
│   ├── perplexity_eval_v2.py  # Perplexity tradeoff measurement
│   ├── speedup_curve_v2.py    # Speedup vs sparsity measurement
│   ├── predictor_overhead.py  # Predictor cost analysis
│   ├── static_mask_speedup.py # Static masking benchmark
│   └── static_mask_perplexity.py # Static masking quality
├── prototypes/
│   ├── scan_sparsity.py       # Phase 1: Initial sparsity measurement
│   ├── threshold_sweep.py     # Phase 1: Threshold analysis
│   ├── build_weight_index.py  # Phase 2: Dead neuron identification
│   ├── ghost_predictor_v2.py  # Phase 2: Dynamic predictor (abandoned)
│   ├── hardware_benchmark_v2.py # Phase 3: Kernel benchmark
│   ├── async_pipe_test_torch.py # Phase 4: Streaming pipeline
│   └── ...
├── training/
│   └── sparsity_finetune.py   # Sparsity-aware fine-tuning (WIP)
├── data/
│   └── *.json                 # All benchmark results
├── models/
│   └── .gitkeep               # Weights hosted on HuggingFace (link below)
├── README.md
├── LICENSE
├── requirements.txt
└── environment.yml

Key Finding: Predictor vs Static Mask

We originally designed a 23.4MB neural network (Ghost Predictor) to dynamically predict which neurons would fire before each layer computed them. Active neuron recall reached 85.59%.

However, the predictor cost 0.2509ms per call — 38.30% of dense layer time. This reduced net speedup from +74.71% to +3.40%.

Conclusion: For neurons that are permanently dead, static masking outperforms dynamic prediction by 20x in net speedup. The predictor architecture is only worthwhile for context-dependent neurons with async pre-fetching, which is left as future work.

🏁 72B Live Demo

We successfully ran Qwen2.5-72B-Instruct-Q4_K_M on a single RTX 5060 (8GB) using llama.cpp partial GPU offload.

Status: Generated coherent output ✅
Speed: ~0.022 tokens/sec (IO-bound, dense weights)
RAM used: 11.5 GB (model paged from NVMe)
VRAM used: Partial offload (8 GPU layers)

The bottleneck is not compute. It is the 47GB IO footprint. GhostWeight's 72.88% sparsity reduction is the path to real-time 70B inference on consumer hardware.

Hardware

All experiments run on:

GPU:          NVIDIA GeForce RTX 5060 (Blackwell sm_89)
VRAM:         8GB GDDR7
PCIe:         Gen 4 x8  (14.28 GB/s measured)
System RAM:   16GB
OS:           Windows 11 Pro
CUDA:         12.6
Driver:       591.86
Python:       3.10.11
PyTorch:      2.5.x

This is a consumer gaming GPU available for approximately $300. Not a research cluster. Not an H100.

Citation

If you use GhostWeight in your research, please cite:

@misc{pokhrel2026ghostweight,
  author    = {Pokhrel, Manjit},
  title     = {GhostWeight: Training-Free Activation Sparsity
               for LLM Inference on Consumer Hardware},
  year      = {2026},
  publisher = {GitHub},
  url       = {https://github.com/manjitpokhrel/GhostWeight}
}

Author

Manjit Pokhrel AI Researcher, Nepal

GitHub: manjitpokhrel
X/Twitter: @manjitpokhrel_
LinkedIn: manjitpokhrel

License

MIT License. See LICENSE for details.

Model weights (Qwen2.5-7B-Instruct) are subject to the Tongyi Qiwen Community License. GhostWeight code is independent of model weights and is MIT licensed.

Built in one research session on a consumer GPU in Nepal. The VRAM wall is not as solid as it looks.

Project details

Release history Release notifications | RSS feed

This version

0.1.0

May 1, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

ghostweight-0.1.0.tar.gz (8.4 kB view details)

Uploaded May 1, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

ghostweight-0.1.0-py3-none-any.whl (8.7 kB view details)

Uploaded May 1, 2026 Python 3

File details

Details for the file ghostweight-0.1.0.tar.gz.

File metadata

Download URL: ghostweight-0.1.0.tar.gz
Upload date: May 1, 2026
Size: 8.4 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.15

File hashes

Hashes for ghostweight-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`63718195360abf913979c41c195ec957eb630a44638124c8b06ce58a1ee69eaf`
MD5	`e15f75557c86670a50baec0b9d59cc6d`
BLAKE2b-256	`a417bcd64d0821ef28f14e23b55fb024e10c3282ea9f3b9cd96630271f4485e4`

See more details on using hashes here.

File details

Details for the file ghostweight-0.1.0-py3-none-any.whl.

File metadata

Download URL: ghostweight-0.1.0-py3-none-any.whl
Upload date: May 1, 2026
Size: 8.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.15

File hashes

Hashes for ghostweight-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`39b71c0a7d8dbb0869b88b589cda6fc3825e37d3dfd8e91096a2ba70077eaf21`
MD5	`7f01143e2ce11a88706d573eff1add23`
BLAKE2b-256	`9d61986790de4af75adfc7d4df8216f328c63b58c051900f0aeed079ed2ceb1d`

See more details on using hashes here.

ghostweight 0.1.0

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Project description

GhostWeight 👻

What Is This

Results

Main Results Table

Kernel Benchmark

Streaming Pipeline

How It Works

1. Static Dead Neuron Masking

2. GhostGate

3. Sparse Row-Packing CUDA Kernel

Honest Assessment

What Is Measured

What Is Projected

What Is Future Work

Installation

Quick Start

Reproduce The Results

Step 1: Build the weight index

Step 2: Run the threshold sweep

Step 3: Measure perplexity tradeoff

Step 4: Measure speedup curve

Step 5: Static mask analysis

Step 6: Full paper table

Repository Structure

Key Finding: Predictor vs Static Mask

🏁 72B Live Demo

Hardware

Citation

Author

License

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes