Jang Adaptive N-bit Grading
Mixed-Precision Quantization for Apple Silicon
The GGUF equivalent for MLX — models stay quantized in GPU memory at full Metal speed.
Open-source quantization format + tools + inference engine.
Website • Pre-quantized Models • Format Spec • Quantization Tools • Research & Experiments
What is JANG?
JANG (Jang Adaptive N-bit Grading) is an open-source quantization format and toolkit that makes large language models run on Apple Silicon at 2-bit precision while staying coherent.
Unlike uniform quantization (where every weight gets the same bits), JANG classifies tensors by sensitivity and gives critical layers (attention) more bits while aggressively compressing the bulk (MLP/experts). The result: a 122B model fits in 46 GB of GPU memory and answers questions correctly — where MLX uniform 2-bit produces garbage.
Key features:
- Models stay quantized in GPU memory (like GGUF) — no float16 expansion
- Uses MLX native Metal kernels (quantized_matmul, gather_qmm) at full speed
- Supports every architecture: MoE, Mamba, MLA, VL, hybrid SSM, dense transformers
- One command to quantize any HuggingFace model
- 11 profiles from extreme 2-bit to near-lossless 6-bit
- Works with FP8 source models (MiniMax, DeepSeek)
Results
MMLU Benchmark — 122B MoE at 2-bit
200 questions, 10 subjects. Qwen3.5-122B-A10B on M4 Max 128 GB.
| Method | Size | GPU | MMLU |
|---|---|---|---|
| JANG_1L (2.24b) | 51 GB | 46 GB | 73.0% |
| MLX mixed_2_6 | 44 GB | 45 GB | 46.0% |
| MLX uniform 2-bit | 36 GB | 36 GB | 56.0% |
JANG scores 27 points higher than MLX's mixed_2_6 mode and 17 points higher than uniform 2-bit, and wins every subject except one.
Free-Form Quality — 35B MoE at 2-bit
| Prompt | JANG_2L (15 GB) | MLX mixed_2_6 (13 GB) | MLX uniform (10 GB) |
|---|---|---|---|
| What is 2+2? | "2+2 equals 4" ✅ | Loops ❌ | Number spam ❌ |
| Photosynthesis | "convert light energy into chemical energy" ✅ | "I cannot respond" ❌ | Garbage ❌ |
| Three planets | "Jupiter, Saturn, Uranus" ✅ | "Antina" loops ❌ | Number spam ❌ |
| Capital of France | "Paris" with details ✅ | Never answers ❌ | Partial ⚠️ |
Over the full six-prompt set (four prompts shown above): JANG 4/6 correct, MLX mixed 0/6, MLX uniform 0/6.
Why JANG Wins on MoE
MLX mixed_2_6 only protects v_proj + down_proj in select layers — a strategy designed for dense models. JANG protects all attention everywhere, including:
- GatedDeltaNet linear attention (Qwen3.5)
- MoE expert routing gates
- MLA latent projections (DeepSeek)
On MoE models, 94-98% of parameters are expert MLP. Protecting the other 2-6% at 8-bit costs almost nothing but makes the difference between 73% and 46% MMLU.
Note: JANG is designed for MoE/hybrid models. For dense models (Llama, Mistral), MLX uniform quantization is recommended.
How It Works
JANG protects the small fraction of weights that control output quality while compressing everything else.
CRITICAL (attention, output head) → 6-8 bit → Controls coherence
IMPORTANT (embeddings, routers) → 4-8 bit → Moderate sensitivity
COMPRESS (MLP, MoE experts) → 2-3 bit → Bulk of parameters
On a 122B MoE model, 98% of parameters are expert MLP. Giving the other 2% more bits costs almost nothing — but makes the difference between working and broken.
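As an illustration, the tiering above amounts to a name-based classifier over tensors. The sketch below is a toy version, not the jang_tools implementation: the name patterns and example tensor names are illustrative, and the bit widths follow the JANG_2L profile (8 / 6 / 2).
# Toy sensitivity classifier; real tensor names vary by architecture.
def classify(tensor_name: str) -> tuple[str, int]:
    name = tensor_name.lower()
    if "attn" in name or "attention" in name or "lm_head" in name:
        return "CRITICAL", 8   # attention and output head control coherence
    if "embed" in name or "router" in name:
        return "IMPORTANT", 6  # embeddings and MoE routing
    return "COMPRESS", 2       # bulk MLP / expert weights
print(classify("model.layers.0.self_attn.q_proj.weight"))      # ('CRITICAL', 8)
print(classify("model.layers.0.mlp.experts.0.up_proj.weight")) # ('COMPRESS', 2)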
Install
pip install jang
For inference on Apple Silicon:
pip install "jang[mlx]"
Or install from source:
pip install git+https://github.com/jjang-ai/jangq.git#subdirectory=jang-tools
Quick Start
Convert any model
# Simple: pick 1-8 for target bits
jang convert path/to/model -p 2
# Specific profile
jang convert path/to/model -p JANG_1L
# From HuggingFace
jang convert Qwen/Qwen3.5-35B-A3B -p 2
Run inference
# Load and generate (requires: pip install mlx mlx-lm)
from jang_tools.loader import load_jang_model
from mlx_lm.sample_utils import make_sampler
from mlx_lm.generate import generate_step
import mlx.core as mx
model, tokenizer = load_jang_model("path/to/jang-model")
sampler = make_sampler(temp=0.7)
tokens = tokenizer.encode("What is photosynthesis?")
for tok, _ in generate_step(prompt=mx.array(tokens), model=model, max_tokens=200, sampler=sampler):
    print(tokenizer.decode([tok.item()]), end="", flush=True)
    if tok.item() == tokenizer.eos_token_id:
        break
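For instruction-tuned checkpoints you will usually want to run the prompt through the model's chat template first. A minimal sketch, assuming the tokenizer returned by load_jang_model wraps a standard Hugging Face tokenizer:
# Assumption: apply_chat_template is exposed by the underlying HF tokenizer.
messages = [{"role": "user", "content": "What is photosynthesis?"}]
tokens = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
# Pass these tokens to the generate_step loop above in place of the plain-encoded prompt.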
JANG models also work with any OpenAI-compatible server that supports MLX (e.g., vMLX, MLX Studio).
Python API
from jang_tools import convert_model, JANG_PROFILES, load_jang_model
# Convert any HuggingFace model
convert_model("Qwen/Qwen3.5-35B-A3B", "output-JANG_2L", profile="JANG_2L")
# Inspect
model = load_jang_model("output-JANG_2L")
print(model.summary())
# Estimate size before converting
from jang_tools import estimate_size_gb
print(estimate_size_gb(122_000_000_000, "JANG_1L"))
# → {'total_gb': 36.9, 'avg_bits_approx': 2.1, ...}
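To compare profiles before committing to a conversion, the same estimator can be looped over every registered profile. A sketch, assuming JANG_PROFILES is keyed by profile name as the import above suggests:
from jang_tools import JANG_PROFILES, estimate_size_gb
# Rough footprint of a 122B-parameter model under each profile.
for name in JANG_PROFILES:
    est = estimate_size_gb(122_000_000_000, name)
    print(f"{name}: ~{est['total_gb']:.1f} GB at ~{est['avg_bits_approx']:.2f} bits/weight")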
MMLU Benchmark
200-question MMLU (10 subjects, 20 per subject). Apple M4 Max 128 GB. All quantized in GPU memory.
Qwen3.5-122B MoE — JANG 73% vs MLX 46%
| Method | Size | GPU | MMLU Score |
|---|---|---|---|
| JANG_1L (2.24b) | 51 GB | 46 GB | 73.0% |
| MLX mixed_2_6 | 44 GB | 45 GB | 46.0% |
| MLX uniform 2-bit | 36 GB | 36 GB | 56.0% |
+27 points over MLX mixed. JANG wins every subject except one.
Qwen3.5-35B MoE — JANG 4/6 vs MLX 0/6
| Method | Size | Speed | Free-form Score |
|---|---|---|---|
| JANG_2L (2.28b) | 15 GB | 100 tok/s | 4/6 correct |
| MLX mixed_2_6 | 13 GB | 120 tok/s | 0/6 correct |
| MLX uniform 2-bit | 10 GB | 128 tok/s | 0/6 correct |
Small Dense Models — 65 Wins at the Breaking Point
On dense models (1-7B), JANG wins at the degradation boundary — the exact bit level where MLX uniform starts producing garbage:
| Model | JANG | MLX Uniform | Result |
|---|---|---|---|
| Phi-2 (2.7B) at 2-bit | Correct scientific answer | Empty output | JANG wins |
| SmolLM2 (1.7B) at 3-bit | "8 legs" (correct) | Number spam | JANG wins |
| Mistral-7B at 3-bit | Correct explanation | Number garbage | JANG wins |
65 wins, 0 losses across 7 models. At the breaking point, attention protection prevents catastrophic failure.
When JANG Helps vs When It Doesn't
| Scenario | Attention % of params | JANG Overhead | Benefit | Verdict |
|---|---|---|---|---|
| MoE at any bit level | 1-2% | ~2% bigger | Always better attention | JANG wins |
| Dense at breaking point (2-3 bit) | ~12% | ~12% bigger | Coherent vs garbage | JANG wins |
| Dense at 4-bit+ | ~12% | ~12% bigger | Already works fine | MLX wins |
Why: On MoE models, expert MLP is 94-98% of parameters. Boosting the other 2% costs almost nothing. On dense models at 4-bit, attention already has enough precision — the 12% overhead for 8-bit attention isn't justified.
Recommendation:
- MoE models (Qwen3.5 MoE, MiniMax, DeepSeek, Mixtral): Use JANG at any bit level
- Dense models at extreme compression (2-3 bit): Use JANG — it's the difference between working and broken
- Dense models at 4-bit+ (Llama, Mistral, Gemma): Use MLX uniform — JANG overhead isn't worth it
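The recommendations above boil down to a simple decision rule. The helper below is illustrative only and not part of the jang API:
def pick_quantizer(is_moe: bool, target_bits: int) -> str:
    """Illustrative summary of the recommendations above."""
    if is_moe:
        return "JANG"          # attention protection is nearly free on MoE
    if target_bits <= 3:
        return "JANG"          # dense at the breaking point: coherent vs garbage
    return "MLX uniform"       # dense at 4-bit and above: uniform already works fine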
Profiles
| # | Profile | CRITICAL | IMPORTANT | COMPRESS | Best for |
|---|---|---|---|---|---|
| 1 | JANG_1L | 8 | 8 | 2 | Maximum quality ~2-bit |
| 2 | JANG_2L | 8 | 6 | 2 | Balanced 2-bit |
| 3 | JANG_3M | 8 | 3 | 3 | 3-bit with 8-bit attention |
| 4 | JANG_4M | 8 | 4 | 4 | The standard: same as MLX 4-bit + 8-bit attention |
| 5 | JANG_4L | 8 | 6 | 4 | High quality 4-bit |
| 6 | JANG_6M | 8 | 6 | 6 | Near-lossless |
Use -p 2 as shorthand for JANG_2L, -p 3 for JANG_3M, etc.
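For example, these two invocations are equivalent under that shorthand:
jang convert path/to/model -p 2        # shorthand
jang convert path/to/model -p JANG_2L  # explicit profile name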
Supported Architectures
| Architecture | Examples | Tested |
|---|---|---|
| Dense Transformer | Llama, Qwen, Gemma, Phi, Mistral | ✅ |
| Mixture of Experts | Mixtral, Qwen3.5 MoE, DeepSeek, MiniMax | ✅ |
| Hybrid SSM + Attention | Jamba, Zamba, Nemotron-H | ✅ |
| Linear Attention | Qwen3.5 GatedDeltaNet | ✅ |
| Multi-head Latent Attention | DeepSeek-V3/R1 | ✅ |
| Vision-Language | Qwen-VL, LLaVA, Pixtral | ✅ |
| Pure SSM | Mamba, Mamba2 | ✅ |
| FP8 Source Models | MiniMax-M2.5, DeepSeek FP8 | ✅ |
Pre-quantized Models
Available on HuggingFace:
| Model | Profile | Score | Download |
|---|---|---|---|
| Qwen3.5-122B-A10B | JANG_1L | 6/6 | JANGQ-AI/Qwen3.5-122B-A10B-JANG_1L |
| Qwen3.5-35B-A3B | JANG_2L | 4/6 | JANGQ-AI/Qwen3.5-35B-A3B-JANG_2L |
| Qwen3.5-27B | JANG_1L | 4/6 | JANGQ-AI/Qwen3.5-27B-JANG_1L |
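To use one of these without converting locally, the repo can be pulled with huggingface_hub and handed to the loader. A sketch, assuming the listed repos follow the standard Hugging Face layout:
from huggingface_hub import snapshot_download
from jang_tools.loader import load_jang_model
# Download the 35B JANG_2L checkpoint listed above into the local HF cache.
local_dir = snapshot_download("JANGQ-AI/Qwen3.5-35B-A3B-JANG_2L")
model, tokenizer = load_jang_model(local_dir)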
Format
JANG v1.1 uses .jang.safetensors — standard safetensors with per-tensor quantized weights. See FORMAT.md for the complete specification.
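Because the container is standard safetensors, a converted shard can be inspected with MLX alone. A sketch; the filename is hypothetical and the per-tensor key layout is defined in FORMAT.md:
import mlx.core as mx
# Load a converted shard as a flat dict of tensors (quantized weights plus scales/biases).
weights = mx.load("model.jang.safetensors")
for name, arr in weights.items():
    print(name, arr.shape, arr.dtype)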
License
Apache 2.0
Author
Created by Jinho Jang