
JANG — Adaptive Mixed-Precision Quantization for Apple Silicon. The GGUF equivalent for MLX.


JANG

Jang Adaptive N-bit Grading

Mixed-Precision Quantization for Apple Silicon

The GGUF equivalent for MLX — models stay quantized in GPU memory at full Metal speed.
Open-source quantization format + tools + inference engine.

Website · Pre-quantized Models · Format Spec · Quantization Tools · Research & Experiments



What is JANG?

JANG (Jang Adaptive N-bit Grading) is an open-source quantization format and toolkit that makes large language models run on Apple Silicon at 2-bit precision while staying coherent.

Unlike uniform quantization (where every weight gets the same bits), JANG classifies tensors by sensitivity and gives critical layers (attention) more bits while aggressively compressing the bulk (MLP/experts). The result: a 122B model fits in 46 GB of GPU memory and answers questions correctly — where MLX uniform 2-bit produces garbage.

Key features:

  • Models stay quantized in GPU memory (like GGUF) — no float16 expansion
  • Uses MLX native Metal kernels (quantized_matmul, gather_qmm) — full speed (see the sketch after this list)
  • Supports every architecture: MoE, Mamba, MLA, VL, hybrid SSM, dense transformers
  • One command to quantize any HuggingFace model
  • 11 profiles from extreme 2-bit to near-lossless 6-bit
  • Works with FP8 source models (MiniMax, DeepSeek)
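
The "stays quantized in GPU memory" point can be pictured with MLX's public kernel API. The snippet below only illustrates that native path, not JANG internals: the shapes, group size, and 4-bit packing are arbitrary choices for the example.

# Illustration of the MLX native path (not JANG code): the weight stays packed
# on the GPU and the matmul runs directly on the quantized representation.
import mlx.core as mx

w = mx.random.normal((4096, 4096))                               # full-precision weight
w_q, scales, biases = mx.quantize(w, group_size=64, bits=4)      # pack to 4-bit + scales/biases

x = mx.random.normal((1, 4096))                                  # one activation row
y = mx.quantized_matmul(x, w_q, scales, biases,
                        transpose=True, group_size=64, bits=4)   # no float16 expansion
mx.eval(y)
print(y.shape)  # (1, 4096)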

Results

MMLU Benchmark — 122B MoE at 2-bit

200 questions, 10 subjects. Qwen3.5-122B-A10B on M4 Max 128 GB.

Method               Size    GPU     MMLU
JANG_1L (2.24b)      51 GB   46 GB   73.0%
MLX mixed_2_6        44 GB   45 GB   46.0%
MLX uniform 2-bit    36 GB   36 GB   56.0%

JANG scores +27 points over MLX's best mixed-precision mode. Wins every subject except one.

Free-Form Quality — 35B MoE at 2-bit

Prompt              JANG_2L (15 GB)                                  MLX mixed_2_6 (13 GB)    MLX uniform (10 GB)
What is 2+2?        "2+2 equals 4" ✅                                 Loops ❌                  Number spam ❌
Photosynthesis      "convert light energy into chemical energy" ✅    "I cannot respond" ❌     Garbage ❌
Three planets       "Jupiter, Saturn, Uranus" ✅                      "Antina" loops ❌         Number spam ❌
Capital of France   "Paris" with details ✅                           Never answers ❌          Partial ⚠️

JANG 4/6. MLX mixed 0/6. MLX uniform 0/6.

Why JANG Wins on MoE

MLX mixed_2_6 only protects v_proj + down_proj in select layers — a strategy designed for dense models. JANG protects all attention everywhere, including:

  • GatedDeltaNet linear attention (Qwen3.5)
  • MoE expert routing gates
  • MLA latent projections (DeepSeek)

On MoE models, 94-98% of parameters are expert MLP. Protecting the other 2-6% at 8-bit costs almost nothing but makes the difference between 73% and 46% MMLU.

Note: JANG is designed for MoE/hybrid models. For dense models (Llama, Mistral) at 4-bit and above, MLX uniform quantization is recommended.

How It Works

JANG protects the small fraction of weights that control output quality while compressing everything else.

CRITICAL  (attention, output head)     →  6-8 bit  →  Controls coherence
IMPORTANT (embeddings, routers)        →  4-8 bit  →  Moderate sensitivity
COMPRESS  (MLP, MoE experts)           →  2-3 bit  →  Bulk of parameters

On a 122B MoE model, 98% of parameters are expert MLP. Giving the other 2% more bits costs almost nothing — but makes the difference between working and broken.
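
As a mental model, this tiering amounts to a name-based bit assignment. The sketch below is illustrative only: the name patterns and exact bit choices are assumptions for the example, not JANG's actual sensitivity rules.

# Illustrative only: a toy tier classifier in the spirit of the table above.
# The name patterns and bit widths are assumptions, not JANG's real rules.
def assign_bits(tensor_name: str) -> int:
    critical = ("q_proj", "k_proj", "v_proj", "o_proj", "lm_head")   # attention, output head
    important = ("embed_tokens", "router", "gate")                   # embeddings, routers
    if any(key in tensor_name for key in critical):
        return 8   # CRITICAL
    if any(key in tensor_name for key in important):
        return 6   # IMPORTANT
    return 2       # COMPRESS

print(assign_bits("model.layers.0.self_attn.q_proj.weight"))        # -> 8
print(assign_bits("model.layers.0.mlp.experts.17.up_proj.weight"))  # -> 2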

Install

pip install jang

For inference on Apple Silicon:

pip install "jang[mlx]"

Or install from source:

pip install git+https://github.com/jjang-ai/jangq.git#subdirectory=jang-tools

Quick Start

Convert any model

# Simple: pick 1-8 for target bits
jang convert path/to/model -p 2

# Specific profile
jang convert path/to/model -p JANG_1L

# From HuggingFace
jang convert Qwen/Qwen3.5-35B-A3B -p 2

Run inference

# Load and generate (requires: pip install mlx mlx-lm)
from jang_tools.loader import load_jang_model
from mlx_lm.sample_utils import make_sampler
from mlx_lm.generate import generate_step
import mlx.core as mx

model, tokenizer = load_jang_model("path/to/jang-model")
sampler = make_sampler(temp=0.7)

tokens = tokenizer.encode("What is photosynthesis?")
for tok, _ in generate_step(prompt=mx.array(tokens), model=model, max_tokens=200, sampler=sampler):
    print(tokenizer.decode([tok.item()]), end="", flush=True)
    if tok.item() == tokenizer.eos_token_id:
        break

JANG models also work with any OpenAI-compatible server that supports MLX (e.g., vMLX, MLX Studio).
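
As a rough illustration, querying such a server looks like any other OpenAI-compatible endpoint. The base URL, port, and model name below are placeholders for whatever your server exposes.

# Sketch: query a local OpenAI-compatible server hosting a JANG model.
# base_url, api_key, and the model name are placeholders, not JANG-specific values.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
resp = client.chat.completions.create(
    model="jang-model",
    messages=[{"role": "user", "content": "What is photosynthesis?"}],
    max_tokens=200,
)
print(resp.choices[0].message.content)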

Python API

from jang_tools import convert_model, JANG_PROFILES, load_jang_model

# Convert any HuggingFace model
convert_model("Qwen/Qwen3.5-35B-A3B", "output-JANG_2L", profile="JANG_2L")

# Inspect
model = load_jang_model("output-JANG_2L")
print(model.summary())

# Estimate size before converting
from jang_tools import estimate_size_gb
print(estimate_size_gb(122_000_000_000, "JANG_1L"))
# → {'total_gb': 36.9, 'avg_bits_approx': 2.1, ...}

MMLU Benchmark

200-question MMLU (10 subjects, 20 per subject). Apple M4 Max 128 GB. All quantized in GPU memory.

Qwen3.5-122B MoE — JANG 73% vs MLX 46%

Method               Size    GPU     MMLU Score
JANG_1L (2.24b)      51 GB   46 GB   73.0%
MLX mixed_2_6        44 GB   45 GB   46.0%
MLX uniform 2-bit    36 GB   36 GB   56.0%

+27 points over MLX mixed. JANG wins every subject except one.

Qwen3.5-35B MoE — JANG 4/6 vs MLX 0/6

Method               Size    Speed       Free-form Score
JANG_2L (2.28b)      15 GB   100 tok/s   4/6 correct
MLX mixed_2_6        13 GB   120 tok/s   0/6 correct
MLX uniform 2-bit    10 GB   128 tok/s   0/6 correct

Small Dense Models — 65 Wins at the Breaking Point

On dense models (1-7B), JANG wins at the degradation boundary — the exact bit level where MLX uniform starts producing garbage:

Model                     JANG                        MLX Uniform      Result
Phi-2 (2.7B) at 2-bit     Correct scientific answer   Empty output     JANG wins
SmolLM2 (1.7B) at 3-bit   "8 legs" (correct)          Number spam      JANG wins
Mistral-7B at 3-bit       Correct explanation         Number garbage   JANG wins

65 wins, 0 losses across 7 models. At the breaking point, attention protection prevents catastrophic failure.

When JANG Helps vs When It Doesn't

Scenario                            Attention % of params   JANG Overhead   Benefit                   Verdict
MoE at any bit level                1-2%                    ~2% bigger      Always better attention   JANG wins
Dense at breaking point (2-3 bit)   ~12%                    ~12% bigger     Coherent vs garbage       JANG wins
Dense at 4-bit+                     ~12%                    ~12% bigger     Already works fine        MLX wins

Why: On MoE models, expert MLP is 94-98% of parameters. Boosting the remaining 2-6% costs almost nothing. On dense models at 4-bit, attention already has enough precision — the 12% overhead for 8-bit attention isn't justified.
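
The asymmetry is plain bit accounting. The parameter splits below (2% and 12% attention share) are assumptions taken from the table above, so the resulting effective bit rates are indicative rather than measured.

# Back-of-envelope bit accounting with assumed parameter splits (not measured).
def effective_bits(attn_frac, attn_bits, bulk_bits):
    return attn_frac * attn_bits + (1 - attn_frac) * bulk_bits

print(effective_bits(0.02, 8, 2))   # MoE:   ~2.12 effective bits vs 2.0 uniform
print(effective_bits(0.12, 8, 4))   # dense: ~4.48 effective bits vs 4.0 uniform (~12% larger)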

Recommendation:

  • MoE models (Qwen3.5 MoE, MiniMax, DeepSeek, Mixtral): Use JANG at any bit level
  • Dense models at extreme compression (2-3 bit): Use JANG — it's the difference between working and broken
  • Dense models at 4-bit+ (Llama, Mistral, Gemma): Use MLX uniform — JANG overhead isn't worth it

Profiles

#   Profile   CRITICAL   IMPORTANT   COMPRESS   Best for
1   JANG_1L   8          8           2          Maximum quality ~2-bit
2   JANG_2L   8          6           2          Balanced 2-bit
3   JANG_3M   8          3           3          3-bit with 8-bit attention
4   JANG_4M   8          4           4          The standard — same as MLX 4-bit + 8-bit attention
5   JANG_4L   8          6           4          High quality 4-bit
6   JANG_6M   8          6           6          Near-lossless

Use -p 2 as shorthand for JANG_2L, -p 3 for JANG_3M, etc.

Supported Architectures

Architecture                  Examples
Dense Transformer             Llama, Qwen, Gemma, Phi, Mistral
Mixture of Experts            Mixtral, Qwen3.5 MoE, DeepSeek, MiniMax
Hybrid SSM + Attention        Jamba, Zamba, Nemotron-H
Linear Attention              Qwen3.5 GatedDeltaNet
Multi-head Latent Attention   DeepSeek-V3/R1
Vision-Language               Qwen-VL, LLaVA, Pixtral
Pure SSM                      Mamba, Mamba2
FP8 Source Models             MiniMax-M2.5, DeepSeek FP8

Pre-quantized Models

Available on HuggingFace:

Model               Profile   Score   Download
Qwen3.5-122B-A10B   JANG_1L   6/6     JANGQ-AI/Qwen3.5-122B-A10B-JANG_1L
Qwen3.5-35B-A3B     JANG_2L   4/6     JANGQ-AI/Qwen3.5-35B-A3B-JANG_2L
Qwen3.5-27B         JANG_1L   4/6     JANGQ-AI/Qwen3.5-27B-JANG_1L
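
One way to fetch and load one of these, assuming huggingface_hub is installed (the repo id is copied from the table; the loading step mirrors Quick Start):

# Sketch: download a pre-quantized repo from the Hub and load it.
from huggingface_hub import snapshot_download
from jang_tools.loader import load_jang_model

path = snapshot_download("JANGQ-AI/Qwen3.5-35B-A3B-JANG_2L")  # repo id from the table above
model, tokenizer = load_jang_model(path)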

Format

JANG v1.1 uses .jang.safetensors — standard safetensors with per-tensor quantized weights. See FORMAT.md for the complete specification.
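
Because the container is plain safetensors, a file can be inspected with the standard safetensors API. In the sketch below the file name is a placeholder, and FORMAT.md remains the authoritative reference for the key layout.

# Sketch: peek inside a .jang.safetensors file with the standard safetensors API.
# The file name is a placeholder; FORMAT.md defines the actual key layout.
from safetensors import safe_open

with safe_open("model.jang.safetensors", framework="numpy") as f:
    print(f.metadata())                              # header metadata, if present
    for name in list(f.keys())[:5]:
        print(name, f.get_slice(name).get_shape())   # tensor names and shapes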

License

Apache 2.0

Author

Created by Jinho Jang

jangq.ai · GitHub · HuggingFace

