
JANG — Adaptive Mixed-Precision Quantization for Apple Silicon. The GGUF equivalent for MLX.


JANG

Jang Adaptive N-bit Grading

Mixed-Precision Quantization for Apple Silicon

The GGUF equivalent for MLX — models stay quantized in GPU memory at full Metal speed.
Open-source quantization format + tools + inference engine.

Website · Pre-quantized Models · Format Spec · Quantization Tools · Research & Experiments



What is JANG?

JANG (Jang Adaptive N-bit Grading) is an open-source quantization format and toolkit that makes large language models run on Apple Silicon at 2-bit precision while staying coherent.

Unlike uniform quantization (where every weight gets the same bits), JANG classifies tensors by sensitivity and gives critical layers (attention) more bits while aggressively compressing the bulk (MLP/experts). The result: a 122B model fits in 46 GB of GPU memory and answers questions correctly — where MLX uniform 2-bit produces garbage.

Key features:

  • Models stay quantized in GPU memory (like GGUF) — no float16 expansion
  • Uses MLX native Metal kernels (quantized_matmul, gather_qmm) — full speed (see the sketch after this list)
  • Supports every architecture: MoE, Mamba, MLA, VL, hybrid SSM, dense transformers
  • One command to quantize any HuggingFace model
  • 11 profiles from extreme 2-bit to near-lossless 6-bit
  • Works with FP8 source models (MiniMax, DeepSeek)
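
The "stays quantized in GPU memory" point can be pictured with MLX's public kernel API. The snippet below only illustrates that native path, not JANG internals: the shapes, group size, and 4-bit packing are arbitrary choices for the example.

# Illustration of the MLX native path (not JANG code): the weight stays packed
# on the GPU and the matmul runs directly on the quantized representation.
import mlx.core as mx

w = mx.random.normal((4096, 4096))                               # full-precision weight
w_q, scales, biases = mx.quantize(w, group_size=64, bits=4)      # pack to 4-bit + scales/biases

x = mx.random.normal((1, 4096))                                  # one activation row
y = mx.quantized_matmul(x, w_q, scales, biases,
                        transpose=True, group_size=64, bits=4)   # no float16 expansion
mx.eval(y)
print(y.shape)  # (1, 4096)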

Results

MMLU Benchmark — 122B MoE at 2-bit

200 questions, 10 subjects. Qwen3.5-122B-A10B on M4 Max 128 GB.

Method               Size    GPU     MMLU
JANG_1L (2.24b)      51 GB   46 GB   73.0%
MLX mixed_2_6        44 GB   45 GB   46.0%
MLX uniform 2-bit    36 GB   36 GB   56.0%

JANG scores +27 points over MLX's best mixed-precision mode. Wins every subject except one.

Free-Form Quality — 35B MoE at 2-bit

Prompt              JANG_2L (15 GB)                                  MLX mixed_2_6 (13 GB)    MLX uniform (10 GB)
What is 2+2?        "2+2 equals 4" ✅                                 Loops ❌                  Number spam ❌
Photosynthesis      "convert light energy into chemical energy" ✅    "I cannot respond" ❌     Garbage ❌
Three planets       "Jupiter, Saturn, Uranus" ✅                      "Antina" loops ❌         Number spam ❌
Capital of France   "Paris" with details ✅                           Never answers ❌          Partial ⚠️

JANG 4/6. MLX mixed 0/6. MLX uniform 0/6.

Why JANG Wins on MoE

MLX mixed_2_6 only protects v_proj + down_proj in select layers — a strategy designed for dense models. JANG protects all attention everywhere, including:

  • GatedDeltaNet linear attention (Qwen3.5)
  • MoE expert routing gates
  • MLA latent projections (DeepSeek)

On MoE models, 94-98% of parameters are expert MLP. Protecting the other 2-6% at 8-bit costs almost nothing but makes the difference between 73% and 46% MMLU.

Note: JANG is designed for MoE/hybrid models. For dense models (Llama, Mistral) at 4-bit and above, MLX uniform quantization is recommended.

How It Works

JANG protects the small fraction of weights that control output quality while compressing everything else.

CRITICAL  (attention, output head)     →  6-8 bit  →  Controls coherence
IMPORTANT (embeddings, routers)        →  4-8 bit  →  Moderate sensitivity
COMPRESS  (MLP, MoE experts)           →  2-3 bit  →  Bulk of parameters

On a 122B MoE model, 98% of parameters are expert MLP. Giving the other 2% more bits costs almost nothing — but makes the difference between working and broken.
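
As a mental model, this tiering amounts to a name-based bit assignment. The sketch below is illustrative only: the name patterns and exact bit choices are assumptions for the example, not JANG's actual sensitivity rules.

# Illustrative only: a toy tier classifier in the spirit of the table above.
# The name patterns and bit widths are assumptions, not JANG's real rules.
def assign_bits(tensor_name: str) -> int:
    critical = ("q_proj", "k_proj", "v_proj", "o_proj", "lm_head")   # attention, output head
    important = ("embed_tokens", "router", "gate")                   # embeddings, routers
    if any(key in tensor_name for key in critical):
        return 8   # CRITICAL
    if any(key in tensor_name for key in important):
        return 6   # IMPORTANT
    return 2       # COMPRESS

print(assign_bits("model.layers.0.self_attn.q_proj.weight"))        # -> 8
print(assign_bits("model.layers.0.mlp.experts.17.up_proj.weight"))  # -> 2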

Install

pip install jang

For inference on Apple Silicon:

pip install "jang[mlx]"

Or install from source:

pip install git+https://github.com/jjang-ai/jangq.git#subdirectory=jang-tools

Quick Start

Convert any model

# Simple: pick 1-8 for target bits
jang convert path/to/model -p 2

# Specific profile
jang convert path/to/model -p JANG_1L

# From HuggingFace
jang convert Qwen/Qwen3.5-35B-A3B -p 2

Run inference

# Load and generate (requires: pip install mlx mlx-lm)
from jang_tools.loader import load_jang_model
from mlx_lm.sample_utils import make_sampler
from mlx_lm.generate import generate_step
import mlx.core as mx

model, tokenizer = load_jang_model("path/to/jang-model")
sampler = make_sampler(temp=0.7)

tokens = tokenizer.encode("What is photosynthesis?")
for tok, _ in generate_step(prompt=mx.array(tokens), model=model, max_tokens=200, sampler=sampler):
    print(tokenizer.decode([tok.item()]), end="", flush=True)
    if tok.item() == tokenizer.eos_token_id:
        break

JANG models also work with any OpenAI-compatible server that supports MLX (e.g., vMLX, MLX Studio).
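
As a rough illustration, querying such a server looks like any other OpenAI-compatible endpoint. The base URL, port, and model name below are placeholders for whatever your server exposes.

# Sketch: query a local OpenAI-compatible server hosting a JANG model.
# base_url, api_key, and the model name are placeholders, not JANG-specific values.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
resp = client.chat.completions.create(
    model="jang-model",
    messages=[{"role": "user", "content": "What is photosynthesis?"}],
    max_tokens=200,
)
print(resp.choices[0].message.content)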

Python API

from jang_tools import convert_model, JANG_PROFILES, load_jang_model

# Convert any HuggingFace model
convert_model("Qwen/Qwen3.5-35B-A3B", "output-JANG_2L", profile="JANG_2L")

# Inspect
model = load_jang_model("output-JANG_2L")
print(model.summary())

# Estimate size before converting
from jang_tools import estimate_size_gb
print(estimate_size_gb(122_000_000_000, "JANG_1L"))
# → {'total_gb': 36.9, 'avg_bits_approx': 2.1, ...}

MMLU Benchmark

200-question MMLU (10 subjects, 20 per subject). Apple M4 Max 128 GB. All quantized in GPU memory.

Qwen3.5-122B MoE — JANG 73% vs MLX 46%

Method               Size    GPU     MMLU Score
JANG_1L (2.24b)      51 GB   46 GB   73.0%
MLX mixed_2_6        44 GB   45 GB   46.0%
MLX uniform 2-bit    36 GB   36 GB   56.0%

+27 points over MLX mixed. JANG wins every subject except one.

Qwen3.5-35B MoE — JANG 4/6 vs MLX 0/6

Method               Size    Speed       Free-form Score
JANG_2L (2.28b)      15 GB   100 tok/s   4/6 correct
MLX mixed_2_6        13 GB   120 tok/s   0/6 correct
MLX uniform 2-bit    10 GB   128 tok/s   0/6 correct

Small Dense Models — 65 Wins at the Breaking Point

On dense models (1-7B), JANG wins at the degradation boundary — the exact bit level where MLX uniform starts producing garbage:

Model                     JANG                        MLX Uniform      Result
Phi-2 (2.7B) at 2-bit     Correct scientific answer   Empty output     JANG wins
SmolLM2 (1.7B) at 3-bit   "8 legs" (correct)          Number spam      JANG wins
Mistral-7B at 3-bit       Correct explanation         Number garbage   JANG wins

65 wins, 0 losses across 7 models. At the breaking point, attention protection prevents catastrophic failure.

When JANG Helps vs When It Doesn't

Scenario                            Attention % of params   JANG Overhead   Benefit                   Verdict
MoE at any bit level                1-2%                    ~2% bigger      Always better attention   JANG wins
Dense at breaking point (2-3 bit)   ~12%                    ~12% bigger     Coherent vs garbage       JANG wins
Dense at 4-bit+                     ~12%                    ~12% bigger     Already works fine        MLX wins

Why: On MoE models, expert MLP is 94-98% of parameters. Boosting the remaining 2-6% costs almost nothing. On dense models at 4-bit, attention already has enough precision — the 12% overhead for 8-bit attention isn't justified.
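
The asymmetry is plain bit accounting. The parameter splits below (2% and 12% attention share) are assumptions taken from the table above, so the resulting effective bit rates are indicative rather than measured.

# Back-of-envelope bit accounting with assumed parameter splits (not measured).
def effective_bits(attn_frac, attn_bits, bulk_bits):
    return attn_frac * attn_bits + (1 - attn_frac) * bulk_bits

print(effective_bits(0.02, 8, 2))   # MoE:   ~2.12 effective bits vs 2.0 uniform
print(effective_bits(0.12, 8, 4))   # dense: ~4.48 effective bits vs 4.0 uniform (~12% larger)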

Recommendation:

  • MoE models (Qwen3.5 MoE, MiniMax, DeepSeek, Mixtral): Use JANG at any bit level
  • Dense models at extreme compression (2-3 bit): Use JANG — it's the difference between working and broken
  • Dense models at 4-bit+ (Llama, Mistral, Gemma): Use MLX uniform — JANG overhead isn't worth it

Profiles

#   Profile   CRITICAL   IMPORTANT   COMPRESS   Best for
1   JANG_1L   8          8           2          Maximum quality ~2-bit
2   JANG_2L   8          6           2          Balanced 2-bit
3   JANG_3M   8          3           3          3-bit with 8-bit attention
4   JANG_4M   8          4           4          The standard — same as MLX 4-bit + 8-bit attention
5   JANG_4L   8          6           4          High quality 4-bit
6   JANG_6M   8          6           6          Near-lossless

Use -p 2 as shorthand for JANG_2L, -p 3 for JANG_3M, etc.

Supported Architectures

Architecture                  Examples
Dense Transformer             Llama, Qwen, Gemma, Phi, Mistral
Mixture of Experts            Mixtral, Qwen3.5 MoE, DeepSeek, MiniMax
Hybrid SSM + Attention        Jamba, Zamba, Nemotron-H
Linear Attention              Qwen3.5 GatedDeltaNet
Multi-head Latent Attention   DeepSeek-V3/R1
Vision-Language               Qwen-VL, LLaVA, Pixtral
Pure SSM                      Mamba, Mamba2
FP8 Source Models             MiniMax-M2.5, DeepSeek FP8

Pre-quantized Models

Available on HuggingFace:

Model               Profile   Score   Download
Qwen3.5-122B-A10B   JANG_1L   6/6     JANGQ-AI/Qwen3.5-122B-A10B-JANG_1L
Qwen3.5-35B-A3B     JANG_2L   4/6     JANGQ-AI/Qwen3.5-35B-A3B-JANG_2L
Qwen3.5-27B         JANG_1L   4/6     JANGQ-AI/Qwen3.5-27B-JANG_1L
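
One way to fetch and load one of these, assuming huggingface_hub is installed (the repo id is copied from the table; the loading step mirrors Quick Start):

# Sketch: download a pre-quantized repo from the Hub and load it.
from huggingface_hub import snapshot_download
from jang_tools.loader import load_jang_model

path = snapshot_download("JANGQ-AI/Qwen3.5-35B-A3B-JANG_2L")  # repo id from the table above
model, tokenizer = load_jang_model(path)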

Format

JANG v1.1 uses .jang.safetensors — standard safetensors with per-tensor quantized weights. See FORMAT.md for the complete specification.
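
Because the container is plain safetensors, a file can be inspected with the standard safetensors API. In the sketch below the file name is a placeholder, and FORMAT.md remains the authoritative reference for the key layout.

# Sketch: peek inside a .jang.safetensors file with the standard safetensors API.
# The file name is a placeholder; FORMAT.md defines the actual key layout.
from safetensors import safe_open

with safe_open("model.jang.safetensors", framework="numpy") as f:
    print(f.metadata())                              # header metadata, if present
    for name in list(f.keys())[:5]:
        print(name, f.get_slice(name).get_shape())   # tensor names and shapes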

License

Apache 2.0

Author

Created by Jinho Jang

jangq.ai · GitHub · HuggingFace

