Portable mixed-precision math, linear-algebra, & retrieval library with 2000+ SIMD kernels for x86, Arm, RISC-V, LoongArch, Power, & WebAssembly

Maintainers

ashvardanian

NumKong for Python

NumKong for Python is the main high-level SDK in the project. It targets the gap between NumPy and low-level native kernels: you keep buffer-protocol interoperability and shape-aware outputs, but you stop giving up mixed precision, widened accumulators, packed reuse, and backend-specific optimizations every time you leave float64. It combines NumPy-friendly buffers with native mixed-precision kernels, zero-copy tensor views, packed and symmetric matrix operations, sparse helpers, geometric mesh alignment, and MaxSim. The API feels NumPy-shaped with familiar scalar, batched, and all-pairs entrypoints, while Tensor keeps shape, dtype, and strides visible through a memoryview-backed container. Low-precision dtypes (BFloat16, Float8, Float6, packed bits) flow through the same API, and dense, packed, and symmetric kernels release the GIL around native work.

Ecosystem Comparison

Feature	NumKong	NumPy/SciPy	PyTorch
Operation families	dots, distances, binary, probability, geospatial, curved, mesh, sparse, MaxSim, elementwise, reductions, cast, trig	dots, distances, elementwise, reductions, some probability via `cdist`	dots, distances, elementwise, reductions
Precision	BFloat16 through sub-byte — Float8, Float6, Int4, packed bits; automatic widening; Kahan summation; 0 ULP in Float32/Float64	Float16, partial BFloat16; no auto-widening; standard accuracy	Float16, BFloat16, partial Float8; explicit AMP required; standard accuracy
Runtime SIMD dispatch	auto-selects best ISA per-thread at runtime on x86, ARM, RISC-V	compile-time only	CPU: compile-time; CUDA: runtime
Packed matrix, GEMM-like	pack once, reuse across query batches	`np.dot`/`@` — no persistent packing	`torch.mm` — no persistent distance-oriented packing
Symmetric kernels, SYRK-like	skip duplicate pairs, up to 2x speedup for self-distance	`pdist` computes one triangle; `cdist` recomputes both	`X @ X.T` recomputes both triangles
Output parameter `out=`	Yes — all major entrypoints	Yes — most `ufunc`s and functions; SciPy: some functions only	Yes for `torch.mm`, `torch.matmul`; No for `torch.cdist`
Fast CPython calling convention	Yes — direct `METH_FASTCALL`	Yes — `vectorcall` in 2.0+	No — tensor dispatch overhead
GIL release	batched, packed, and symmetric kernels	some ops only	most ops

Quickstart

import numpy as np
import numkong as nk

a, b = np.random.randn(1536).astype(np.float32), np.random.randn(1536).astype(np.float32)
dot = nk.dot(a, b)  # widened accumulation, not same-dtype
print(dot)

Installation

From PyPI:

python -m pip install numkong

From a local checkout:

python -m pip install .

Quick runtime check:

python -c "import numkong as nk; print(nk.get_capabilities())"

Wheel Compatibility and Building from Source

Pre-built wheels are available on PyPI for Linux (x86_64, aarch64, riscv64, plus i686, ppc64le, s390x), macOS (x86_64, arm64), and Windows (AMD64, ARM64). Python 3.9 through 3.14 is supported, including free-threading variants (3.13t, 3.14t). Every wheel is built with NK_DYNAMIC_DISPATCH=1, so a single wheel covers all CPU generations on a given architecture.

When building from source, the compiler requirements depend on the platform. On macOS x86, only AVX2 is available; on macOS Arm, NEON is always present, but SME requires Apple M4+ with Xcode 16+ (AppleClang 16+). RISC-V builds require Clang and LLD because GCC lacks zvfh, zvfbfwma, and zvbb support. On Windows, MSVC 19.44+ (Visual Studio 2022 17.14+) is recommended for full AVX-512 with FP16/BF16/VNNI. Build parallelism is controlled by NK_BUILD_PARALLEL, which defaults to min(cpu_count, 4) and should be lowered in memory-constrained containers. There is no OpenMP dependency; Python-side parallelism uses concurrent.futures with GIL-free kernels.

NK_BUILD_PARALLEL=2 pip install . --no-build-isolation

Dot Products

Dot products are their own family because storage type, conjugation rules, and output widening matter.

import numpy as np
import numkong as nk

a = (np.random.randn(256) + 1j * np.random.randn(256)).astype(np.complex64)
b = (np.random.randn(256) + 1j * np.random.randn(256)).astype(np.complex64)

dot = nk.dot(a, b)   # numpy.dot(a, b)
vdot = nk.vdot(a, b) # numpy.vdot(a, b)

print(dot, vdot)

Real low-precision inputs can also be routed through explicit dtype tags when the storage buffer itself is raw bytes.
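
To make the "raw bytes plus dtype tag" idea concrete, here is a minimal pure-Python sketch of what reinterpreting a 16-bit word as bfloat16 means. It relies only on the fact that bfloat16 is the upper half of an IEEE float32. The helper names `bf16_bits_to_float` and `float_to_bf16_bits` are hypothetical and are not part of NumKong.

```python
import struct

def bf16_bits_to_float(bits: int) -> float:
    """Decode one bfloat16 bit pattern (stored as uint16) into a Python float."""
    # bfloat16 keeps the sign, the 8 exponent bits, and the top 7 mantissa bits
    # of float32, so appending 16 zero bits yields a valid float32 pattern.
    return struct.unpack("<f", struct.pack("<I", (bits & 0xFFFF) << 16))[0]

def float_to_bf16_bits(value: float) -> int:
    """Encode by truncation (no rounding), which is enough for illustration."""
    return struct.unpack("<I", struct.pack("<f", value))[0] >> 16

raw_words = [0x3F80, 0xC000, 0x4049]  # 1.0, -2.0, and pi truncated to bf16
decoded = [bf16_bits_to_float(word) for word in raw_words]
print(decoded)  # [1.0, -2.0, 3.140625]
```

A buffer of such words carries no type information of its own, which is why the dtype tag has to be supplied by the caller.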

Dense Distances

The dense distance entrypoints cover sqeuclidean, euclidean, and angular. The first important difference from NumPy or SciPy is that the accumulator policy is not forced to match the storage dtype.

import numpy as np
import numkong as nk

a = np.random.randn(768).astype(np.float16)
b = np.random.randn(768).astype(np.float16)

sqeuclidean = nk.sqeuclidean(a, b)
euclidean = nk.euclidean(a, b)
angular = nk.angular(a, b)

For float16, a naive same-dtype implementation is exactly the kind of path that loses precision or widens too late. NumKong's API makes the widening policy part of the kernel contract.
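
The effect of a same-dtype accumulator can be reproduced without any native code. The sketch below emulates float16 rounding with the standard `struct` module and compares it against accumulation in Python's float64. It illustrates the failure mode only; it is not NumKong's kernel, and the function names are hypothetical.

```python
import struct

def round_to_f16(value: float) -> float:
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", value))[0]

def sqeuclidean_same_dtype(a, b):
    """Accumulate in float16: every partial sum is rounded back to half precision."""
    total = 0.0
    for x, y in zip(a, b):
        diff = round_to_f16(x - y)
        total = round_to_f16(total + round_to_f16(diff * diff))
    return total

def sqeuclidean_widened(a, b):
    """Accumulate in a wider type (Python floats are float64)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

a = [1.0] * 4096
b = [0.0] * 4096
narrow = sqeuclidean_same_dtype(a, b)
wide = sqeuclidean_widened(a, b)
print(narrow, wide)  # 2048.0 4096.0: the float16 accumulator stops growing at 2048
```

Above 2048, float16 values are spaced 2 apart, so adding 1.0 rounds back to the same value and the sum stalls at half the true result.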

Output Control: `out=`, `dtype=`, and `out_dtype=`

Most distance and dot-product entrypoints accept out=, dtype=, and out_dtype= keyword arguments. Passing them avoids dynamic memory allocations for temporary objects.

import numpy as np
import numkong as nk

queries = np.random.randn(100, 768).astype(np.float32)
database = np.random.randn(100, 768).astype(np.float32)

# Pre-allocated output with out=
out = nk.zeros((100,), dtype="float32")
nk.sqeuclidean(queries, database[:100], out=out)  # writes in-place, returns None

# Explicit input dtype for raw byte buffers
some_bytes = bytes(2 * 768)  # placeholder payload: 768 bfloat16 values, all zero bits
raw = np.frombuffer(some_bytes, dtype=np.uint16)
nk.dot(raw, raw, dtype=nk.bfloat16)  # reinterpret uint16 as bf16

# Output dtype override
nk.euclidean(queries[0], database[0], out_dtype="float32")  # accumulate in f64, downcast result

When out= is provided, the function writes results in-place and returns None. The out array must be pre-allocated with the correct shape and a supported dtype. For custom float types (bfloat16, float16, float8_e4m3, float8_e5m2, float6_e2m3, float6_e3m2), type objects are preferred over strings — they are faster to dispatch and provide IDE autocomplete:

nk.dot(a, b, dtype=nk.bfloat16) # works faster
nk.dot(a, b, dtype="bfloat16")  # works a bit slower

Set Similarity

Packed-binary metrics operate on packed bits. That is why the right NumPy equivalent uses np.packbits, not bool arrays fed to scalar Python code.

import numpy as np
import numkong as nk

a_bits = np.random.randint(0, 2, size=256, dtype=np.uint8)
b_bits = np.random.randint(0, 2, size=256, dtype=np.uint8)
a, b = np.packbits(a_bits), np.packbits(b_bits)

hamming = nk.hamming(a, b, dtype="uint1")
jaccard = nk.jaccard(a, b, dtype="uint1")
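
As a reference for what the packed-bit metrics compute, here is a small pure-Python sketch built on XOR, AND, OR, and popcount. The helper names are hypothetical. It returns the Jaccard similarity |A ∩ B| / |A ∪ B|; whether the library reports the similarity or its complement for packed inputs is not stated here, so treat that as an assumption.

```python
def pack_bits(bits):
    """Pack 0/1 values into bytes, most significant bit first (like np.packbits)."""
    padded = list(bits) + [0] * (-len(bits) % 8)
    return bytes(
        sum(bit << (7 - offset) for offset, bit in enumerate(padded[start:start + 8]))
        for start in range(0, len(padded), 8)
    )

def popcount(byte: int) -> int:
    return bin(byte).count("1")

def hamming_packed(a: bytes, b: bytes) -> int:
    """Number of differing bits: popcount of the XOR."""
    return sum(popcount(x ^ y) for x, y in zip(a, b))

def jaccard_packed(a: bytes, b: bytes) -> float:
    """Jaccard similarity over set bits: popcount(AND) / popcount(OR)."""
    intersection = sum(popcount(x & y) for x, y in zip(a, b))
    union = sum(popcount(x | y) for x, y in zip(a, b))
    return intersection / union if union else 1.0

a_packed = pack_bits([1, 1, 0, 0, 1, 0, 1, 0])  # 0xCA
b_packed = pack_bits([1, 0, 0, 1, 1, 0, 0, 0])  # 0x98
print(hamming_packed(a_packed, b_packed))  # 3
print(jaccard_packed(a_packed, b_packed))  # 0.4 (2 shared bits out of 5 set overall)
```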

Integer set Jaccard works on sorted ascending arrays of integer identifiers. Both inputs must be sorted in ascending order for correct results.

set_a = np.array([1, 3, 5, 7, 9], dtype=np.uint32)  # must be sorted ascending
set_b = np.array([3, 5, 8, 9, 10], dtype=np.uint32)  # must be sorted ascending
jaccard_sets = nk.jaccard(set_a, set_b) # |A ∩ B| / |A ∪ B|
assert 0.0 < jaccard_sets < 1.0, "|A ∩ B| / |A ∪ B| should be in (0, 1)"

Probability Metrics

Probability divergences get their own section because they are not just "one more distance": Kullback-Leibler divergence is asymmetric, while Jensen-Shannon divergence is symmetric.

import numpy as np
import numkong as nk

p = np.array([0.2, 0.3, 0.5], dtype=np.float32)
q = np.array([0.1, 0.3, 0.6], dtype=np.float32)

kl_forward, kl_reverse = nk.kullbackleibler(p, q), nk.kullbackleibler(q, p)
assert kl_forward != kl_reverse, "KLD is asymmetric"

js_forward, js_reverse = nk.jensenshannon(p, q), nk.jensenshannon(q, p)
np.testing.assert_allclose(js_forward, js_reverse, atol=1e-6)  # JSD is symmetric
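
For readers who want the definitions spelled out, the following pure-Python sketch uses the textbook formulas with natural logarithms. The library's exact conventions (log base, or whether Jensen-Shannon is returned as a divergence or as its square root) are not specified in this section, so the absolute values may differ from what `nk.kullbackleibler` and `nk.jensenshannon` return.

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence sum(p_i * log(p_i / q_i)), natural log."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: mean KL of p and q to their midpoint m."""
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

p = [0.2, 0.3, 0.5]
q = [0.1, 0.3, 0.6]
kl_pq, kl_qp = kl_divergence(p, q), kl_divergence(q, p)
js_pq, js_qp = js_divergence(p, q), js_divergence(q, p)
print(kl_pq, kl_qp)  # two different values: KL depends on argument order
print(js_pq, js_qp)  # the same value: JS does not
```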

Geospatial Metrics

Geospatial kernels take four coordinate arrays. Inputs are in radians. Outputs are in meters.

import numpy as np
import numkong as nk

# Statue of Liberty (40.6892°N, 74.0445°W) → Big Ben (51.5007°N, 0.1246°W)
liberty_lat, liberty_lon = np.array([0.7101605100], dtype=np.float64), np.array([-1.2923203180], dtype=np.float64)
big_ben_lat, big_ben_lon = np.array([0.8988567821], dtype=np.float64), np.array([-0.0021746802], dtype=np.float64)

vincenty = nk.vincenty(liberty_lat, liberty_lon, big_ben_lat, big_ben_lon)    # ≈ 5,589,857 m (ellipsoidal, baseline)
haversine = nk.haversine(liberty_lat, liberty_lon, big_ben_lat, big_ben_lon)  # ≈ 5,543,723 m (spherical, ~46 km less)

# Vincenty in f32 — drifts ~2 m from f64
liberty_lat32 = liberty_lat.astype(np.float32)
liberty_lon32 = liberty_lon.astype(np.float32)
big_ben_lat32 = big_ben_lat.astype(np.float32)
big_ben_lon32 = big_ben_lon.astype(np.float32)
vincenty_f32 = nk.vincenty(liberty_lat32, liberty_lon32, big_ben_lat32, big_ben_lon32)  # ≈ 5,589,859 m (+2 m drift)
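
The spherical model behind haversine fits in a few lines. The sketch below is a pure-Python reference under one loud assumption: it uses a mean Earth radius of 6,371,008.8 m, which may not be the constant NumKong uses, so its absolute output need not match the figures quoted in the comments above.

```python
import math

EARTH_MEAN_RADIUS_M = 6_371_008.8  # assumption: mean radius; the library's constant may differ

def haversine(lat1, lon1, lat2, lon2, radius=EARTH_MEAN_RADIUS_M):
    """Great-circle distance in meters between two points given in radians."""
    dlat, dlon = lat2 - lat1, lon2 - lon1
    h = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2.0 * radius * math.asin(math.sqrt(h))

# Same landmarks as above, converted from degrees to radians.
liberty = (math.radians(40.6892), math.radians(-74.0445))
big_ben = (math.radians(51.5007), math.radians(-0.1246))
distance_m = haversine(*liberty, *big_ben)
print(round(distance_m))
```

Vincenty replaces the sphere with an ellipsoid and iterates to convergence, which is why the two methods disagree by tens of kilometers on a transatlantic route.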

Curved Metrics

Curved-space kernels use an extra metric tensor or inverse covariance and should not be mixed into the Euclidean section.

import numpy as np
import numkong as nk

# Complex bilinear form: aᴴ M b
a = (np.ones(16) + 1j * np.zeros(16)).astype(np.complex64)
b = (np.zeros(16) + 1j * np.ones(16)).astype(np.complex64)
m = np.eye(16, dtype=np.complex64)
bilinear = nk.bilinear(a, b, m)

# Real Mahalanobis distance: √((a−b)ᵀ M⁻¹ (a−b))
x = np.ones(32, dtype=np.float32)
y = np.full(32, 2.0, dtype=np.float32)
inv_cov = np.eye(32, dtype=np.float32)
mahalanobis = nk.mahalanobis(x, y, inv_cov)
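
The Mahalanobis formula in the comment above can be checked by hand. This pure-Python sketch (hypothetical helper, not the library's code) evaluates √(dᵀ M d) with d = x − y, where M is the matrix passed as `inv_cov`. With an identity matrix it reduces to the ordinary Euclidean distance.

```python
import math

def mahalanobis(x, y, inv_cov):
    """sqrt(d^T * inv_cov * d) with d = x - y; inv_cov is a list of rows."""
    d = [xi - yi for xi, yi in zip(x, y)]
    projected = [sum(row[j] * d[j] for j in range(len(d))) for row in inv_cov]
    return math.sqrt(sum(di * pi for di, pi in zip(d, projected)))

n = 32
identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
x, y = [1.0] * n, [2.0] * n
distance = mahalanobis(x, y, identity)
print(distance, math.sqrt(n))  # equal: 32 unit differences under an identity metric
```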

Scalar Types and Low-Precision Formats

NumKong exposes two different low-precision stories in Python: Python scalar objects for a few formats, and tensor dtypes for the broader buffer-oriented path.

The six scalar types have stable payload sizes, even though Python object header sizes are not stable:

Type	Bits	Bytes	Range	Inf	NaN
`nk.float16`	1+5+10	2	±65504	yes	yes
`nk.bfloat16`	1+8+7	2	±3.4×10³⁸	yes	yes
`nk.float8_e4m3`	1+4+3	1	±448	no	yes
`nk.float8_e5m2`	1+5+2	1	±57344	yes	yes
`nk.float6_e2m3`	1+2+3	1	±7.5	no	no
`nk.float6_e3m2`	1+3+2	1	±28	no	no

The Bits column shows sign + exponent + mantissa bit counts. The Bytes column is the stable payload size; float8_* and float6_* both store 1 byte because the sub-byte formats are padded to byte alignment.
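
The Range column follows from the bit layout. The sketch below derives the largest finite value from the exponent and mantissa widths, assuming an IEEE-style bias of 2^(e−1) − 1. The three `special` cases are my reading of the Inf and NaN columns (both reserved, NaN only, neither), not something taken from NumKong's source.

```python
def max_finite(exponent_bits: int, mantissa_bits: int, special: str) -> float:
    """Largest finite value of a sign+exponent+mantissa format.

    special = "ieee": the top exponent is reserved for Inf and NaN.
    special = "nan_only": only the all-ones mantissa of the top exponent is NaN.
    special = "none": every bit pattern is a finite number.
    """
    bias = 2 ** (exponent_bits - 1) - 1
    top_exponent = 2 ** exponent_bits - 1
    if special == "ieee":
        top_exponent -= 1
    steps = 2 ** mantissa_bits - (2 if special == "nan_only" else 1)
    significand = 1.0 + steps / 2 ** mantissa_bits
    return significand * 2.0 ** (top_exponent - bias)

formats = {
    "float16": (5, 10, "ieee"),
    "float8_e4m3": (4, 3, "nan_only"),
    "float8_e5m2": (5, 2, "ieee"),
    "float6_e2m3": (2, 3, "none"),
    "float6_e3m2": (3, 2, "none"),
}
limits = {name: max_finite(*layout) for name, layout in formats.items()}
print(limits)
```

The results reproduce the table: 65504, 448, 57344, 7.5, and 28.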

The full object footprint is interpreter-dependent. Use sys.getsizeof(nk.float16(1.0)) if you need the heap footprint of the Python wrapper object itself. Use Tensor.itemsize and Tensor.nbytes for the stable payload sizes of array storage.

ml_dtypes matters here because NumKong explicitly interoperates with the formats that NumPy still does not model well. The test suite compares bfloat16, float8_e4m3, float8_e5m2, float6_e2m3, and float6_e3m2 behavior against ml_dtypes where that comparison is meaningful.

Promotion is intentional. Mixed exotic floats are routed through wider compute types rather than pretending a same-width accumulator is good enough.

ml_dtypes Interoperability

NumKong accepts ml_dtypes arrays directly — no .view(np.uint8) workaround needed:

import ml_dtypes
a = np.random.randn(100, 768).astype(np.float32).astype(ml_dtypes.bfloat16)
b = np.random.randn(100, 768).astype(np.float32).astype(ml_dtypes.bfloat16)
result = nk.cdist(a, b, "dot")  # just works

NumKong scalars also work as NumPy dtype specifiers:

arr = np.array([1.0, 2.0, 3.0], dtype=nk.bfloat16)
float(arr[0])  # → 1.0

Type name mapping between the two libraries:

ml_dtypes	NumKong	Status
`ml_dtypes.bfloat16`	`nk.bfloat16` / `"bfloat16"`	Identical format
`ml_dtypes.float8_e4m3`	`nk.float8_e4m3` / `"e4m3"`	Identical (IEEE E4M3)
`ml_dtypes.float8_e4m3fn`	`nk.float8_e4m3` / `"e4m3"`	Identical (E4M3FN = no inf)
`ml_dtypes.float8_e5m2`	`nk.float8_e5m2` / `"e5m2"`	Identical format
`ml_dtypes.float6_e2m3fn`	`nk.float6_e2m3` / `"e2m3"`	Identical (MX E2M3)
`ml_dtypes.float6_e3m2fn`	`nk.float6_e3m2` / `"e3m2"`	Identical (MX E3M2)
`ml_dtypes.float8_e4m3fnuz`	—	Rejected: different bias, NaN, and zero
`ml_dtypes.float8_e5m2fnuz`	—	Rejected: different NaN and zero encoding
`ml_dtypes.float8_e4m3b11fnuz`	—	Rejected: bias=11, incompatible encoding
`ml_dtypes.float8_e8m0fnu`	—	Not supported: exponent-only MX scale format
`ml_dtypes.float8_e3m4`	—	Not supported: no NumKong kernel
`ml_dtypes.float4_e2m1fn`	—	Not supported: 4-bit MX float
`ml_dtypes.int4`	`"int4"`	Compatible via buffer protocol
`ml_dtypes.uint4`	`"uint4"`	Compatible via buffer protocol
`ml_dtypes.int2`	—	Not supported
`ml_dtypes.uint2`	—	Not supported

Tensor Objects and Buffer Interop

Tensor is a memoryview-backed object with NumPy-like metadata. It is the central container for strided views, transpose, reshape, flatten, and axis reductions.

import numpy as np
import numkong as nk

t = nk.Tensor(np.arange(12, dtype=np.float32).reshape(3, 4))

print(t.shape, t.dtype, t.ndim, t.strides, t.itemsize, t.nbytes)
print(np.asarray(t))      # zero-copy array view when layout allows it
print(t.T.shape)          # transposed Tensor view
print(t.reshape(2, 6).shape)
print(t.flatten().shape)

# Slicing — row, column, and scalar access
row0 = t[0, :]            # first row, shape (4,)
col2 = t[:, 2]            # third column, strided view, shape (3,)
val  = t[1, 2]            # scalar element access → 6.0

# Reductions compose with sliced views
idx = col2.argmin()        # index of the minimum in the third column
mn, i0, mx, i1 = col2.minmax()

The important layout rules are:

Tensor preserves shape and byte strides.
Transpose and slicing can produce non-contiguous views.
General reductions accept those views.
Matrix-style packed kernels require row-contiguous left operands.
Packed and symmetric outputs require C-contiguous out buffers.

Memory Layout Requirements

API family	Input requirement	Output requirement
Dense distances (`dot`, `euclidean`, etc.)	Rows must be contiguous (`strides[last] <= itemsize`). Strided rows (sliced columns) are rejected.	`out=` can have any stride along dim 0, but inner dim must be contiguous.
`cdist`	Same as dense distances	`out=` must be rank-2 with shape `(a.count, b.count)`
Elementwise (`scale`, `blend`, `fma`)	Arbitrary strides (strided views are supported)	`out=` must match input shape; strides are preserved
Packed matrix (`dots_packed`)	Left operand: rank-2, contiguous rows, no negative strides	Output: C-contiguous with expected dtype
Symmetric (`dots_symmetric`)	Contiguous rows	`out=`: C-contiguous square matrix
Tensor reductions (`sum`, `min`, `argmin`, etc.)	Arbitrary strides (strided views supported)	N/A (returns scalar or reduced tensor)
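
To see why a sliced column or a transpose fails the "contiguous rows" requirement, it helps to compute the byte strides by hand. This is a minimal sketch with hypothetical helper names. It tests the last-axis stride for equality with the item size, which is the usual definition of a contiguous row and may be stricter or looser than the library's actual check.

```python
def c_contiguous_strides(shape, itemsize):
    """Byte strides of a row-major (C-contiguous) array with the given shape."""
    strides, step = [], itemsize
    for extent in reversed(shape):
        strides.append(step)
        step *= extent
    return tuple(reversed(strides))

def has_contiguous_rows(strides, itemsize):
    """True when elements along the last axis sit next to each other in memory."""
    return strides[-1] == itemsize

itemsize = 4                                             # float32
matrix_strides = c_contiguous_strides((3, 4), itemsize)  # (16, 4)
column_strides = (matrix_strides[0],)                    # t[:, 2]: one element per row
transposed_strides = matrix_strides[::-1]                # t.T: (4, 16)

print(matrix_strides, has_contiguous_rows(matrix_strides, itemsize))          # (16, 4) True
print(column_strides, has_contiguous_rows(column_strides, itemsize))          # (16,) False
print(transposed_strides, has_contiguous_rows(transposed_strides, itemsize))  # (4, 16) False
```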

All-Pairs APIs and cdist

cdist is the NumPy/SciPy-shaped all-pairs entrypoint. It handles rectangular matrix pairs and symmetric self-distance cases.

import numpy as np
import numkong as nk

queries = np.random.randn(100, 768).astype(np.float32)
database = np.random.randn(10_000, 768).astype(np.float32)

pairwise = nk.angular(queries, database[:100])             # rectangular broadcasted pairwise call
all_pairs = nk.cdist(queries, database, metric="angular")  # scipy.spatial.distance.cdist analogue

assert np.asarray(pairwise).shape == (100, 100)
assert np.asarray(all_pairs).shape == (100, 10_000)

The intended large-scale parallel model for packed and symmetric kernels is external partitioning with row ranges, not a hidden threads= argument.

Elementwise Operations

Elementwise arithmetic and fused operations are their own family. They share the tensor infrastructure but should not be collapsed into the reduction or matrix sections.

import numpy as np
import numkong as nk

a = np.arange(8, dtype=np.float32)
b = np.arange(8, dtype=np.float32)[::-1].copy()

scaled = nk.scale(a, alpha=2.0, beta=1.0)     # 2 * a + 1
blended = nk.blend(a, b, alpha=0.25, beta=0.75)
fused = nk.fma(a, b, a, alpha=1.0, beta=1.0)  # a * b + a

assert np.asarray(scaled).shape == (8,)
assert np.asarray(fused).shape == (8,)

Moments Reductions

Moments reductions return (sum, sum_of_squares). The key property is that NumKong does not force you into same-storage accumulation.

import numpy as np
import numkong as nk

x = np.full(4096, 255, dtype=np.uint8)

nk_sum, nk_sumsq = nk.moments(nk.Tensor(x))
naive_sum = np.sum(x, dtype=np.uint8)      # overflows immediately
naive_sumsq = np.sum(x * x, dtype=np.uint8) # also overflows

print(nk_sum, nk_sumsq, naive_sum, naive_sumsq)
assert nk_sum > int(naive_sum)
assert nk_sumsq > int(naive_sumsq)

Same-width accumulation is a bad default for low-precision storage.
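
The wrap-around is easy to verify with plain integers. This sketch emulates an 8-bit accumulator by masking every partial sum and compares it with Python's arbitrary-precision integers. It mirrors the uint8 example above and makes no claim about which accumulator width NumKong picks.

```python
values = [255] * 4096  # same data as the uint8 example above

def sum_same_width(data, bits=8):
    """Accumulate in the storage type: every partial sum wraps modulo 2**bits."""
    total, mask = 0, (1 << bits) - 1
    for value in data:
        total = (total + value) & mask
    return total

wrapped_sum = sum_same_width(values)
wrapped_sumsq = sum_same_width([(v * v) & 0xFF for v in values])
wide_sum = sum(values)                   # Python ints never overflow
wide_sumsq = sum(v * v for v in values)

print(wrapped_sum, wrapped_sumsq)  # 0 0
print(wide_sum, wide_sumsq)        # 1044480 266342400
```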

Min/Max Reductions

Min/max reductions are in a separate section because they cover strided reduction cases. NumKong provides SIMD-accelerated strided reductions that are not common in other libraries.

import numpy as np
import numkong as nk

matrix = nk.Tensor(np.array([
    [ 3.0,  0.0, 7.0],
    [ 1.0,  2.0, 5.0],
    [ 4.0, -1.0, 6.0],
], dtype=np.float32))

second_column = matrix[:, 1]  # strided view into a row-major Nx3 tensor

idx = second_column.argmin()
mn, i0, mx, i1 = second_column.minmax()

assert idx == 2
assert int(i0) == 2
assert float(np.asarray(mn)) == -1.0

On an Apple M2 Pro, np.argmin(matrix[:, 1]) on a row-major 2,000,000 x 3 float32 array took about 1.63 ms median, while the equivalent NumKong Tensor(...)[:, 1].argmin() took about 0.67 ms median. That is about 2.4x faster on this strided reduction case.

Sparse Operations and Intersections

Sparse helpers cover both sorted-index intersections and weighted sparse dot products.

import numpy as np
import numkong as nk

idx_a, idx_b = np.array([1, 3, 5, 7], dtype=np.uint32), np.array([3, 4, 5, 8], dtype=np.uint32)
intersection_size = nk.intersect(idx_a, idx_b) # len(np.intersect1d(idx_a, idx_b))
assert intersection_size == 2, "indices 3 and 5"

val_a, val_b = np.array([1.0, 2.0, 3.0, 4.0], dtype=np.float32), np.array([5.0, 6.0, 7.0, 8.0], dtype=np.float32)
sparse_dot = nk.sparse_dot(idx_a, val_a, idx_b, val_b)
assert sparse_dot > 0, "weighted dot over shared indices"
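
Both helpers rest on the same two-pointer merge over ascending indices. The pure-Python sketch below is a scalar reference for that idea, using the same inputs as above. It says nothing about how the SIMD kernels are actually organized.

```python
def intersect_size(idx_a, idx_b):
    """Count shared entries of two ascending index lists with a two-pointer merge."""
    i = j = count = 0
    while i < len(idx_a) and j < len(idx_b):
        if idx_a[i] == idx_b[j]:
            count, i, j = count + 1, i + 1, j + 1
        elif idx_a[i] < idx_b[j]:
            i += 1
        else:
            j += 1
    return count

def sparse_dot(idx_a, val_a, idx_b, val_b):
    """Sum val_a * val_b over indices present in both sorted index lists."""
    i = j = 0
    total = 0.0
    while i < len(idx_a) and j < len(idx_b):
        if idx_a[i] == idx_b[j]:
            total += val_a[i] * val_b[j]
            i, j = i + 1, j + 1
        elif idx_a[i] < idx_b[j]:
            i += 1
        else:
            j += 1
    return total

idx_a, idx_b = [1, 3, 5, 7], [3, 4, 5, 8]
val_a, val_b = [1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]
print(intersect_size(idx_a, idx_b))            # 2 (indices 3 and 5)
print(sparse_dot(idx_a, val_a, idx_b, val_b))  # 31.0 = 2*5 + 3*7
```

The merge only works when both index arrays are sorted ascending, which is the same precondition the set Jaccard entrypoint states.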

Packed Matrix Kernels for GEMM-Like Workloads

Packed matrix kernels are the right tool when the right-hand side is reused across many query batches. This is the GEMM-like story.

import numpy as np
import numkong as nk

left = np.random.randn(128, 768).astype(np.float32)
right = np.random.randn(10_000, 768).astype(np.float32)

right_packed = nk.dots_pack(right, dtype="float32")  # pack once, reuse many times
scores = nk.dots_packed(left, right_packed)          # equivalent to left @ right.T

assert scores.shape == (128, 10_000)
assert right_packed.nbytes == nk.PackedMatrix.packed_size(10_000, 768, dtype="float32")

Important runtime rules from the current implementation:

the left operand (a) must be rank-2
the left operand must have contiguous rows
negative strides are rejected for these matrix kernels
out, when provided, must be C-contiguous with the expected dtype
start_row and end_row split the left operand rows

The arithmetic advantages are:

one-time packing of B
one-time internal layout conversion and depth padding
norm reuse for angulars_packed and euclideans_packed
no repeated scan of the original right-hand-side layout
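
The "norm reuse" item is worth a worked example. Once dot products and squared norms are available, Euclidean and cosine-style distances follow from arithmetic alone. The sketch below is a pure-Python illustration under the assumption that "angular" means cosine distance 1 − q·d / (‖q‖‖d‖); the helper names are hypothetical.

```python
import math

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

queries = [[1.0, 2.0, 2.0], [0.0, 3.0, 4.0]]
database = [[2.0, 0.0, 1.0], [1.0, 1.0, 1.0], [0.0, 3.0, 4.0]]

# Cached once per matrix, then reused for every (query, row) pair.
query_sq_norms = [dot(q, q) for q in queries]
database_sq_norms = [dot(d, d) for d in database]

def euclidean_from_dot(i, j):
    """||q - d||^2 = ||q||^2 + ||d||^2 - 2 q.d, clamped at zero against rounding."""
    sq = query_sq_norms[i] + database_sq_norms[j] - 2.0 * dot(queries[i], database[j])
    return math.sqrt(max(sq, 0.0))

def angular_from_dot(i, j):
    """Cosine distance 1 - q.d / (||q|| ||d||) using the same cached norms."""
    denom = math.sqrt(query_sq_norms[i] * database_sq_norms[j])
    return 1.0 - dot(queries[i], database[j]) / denom

print(euclidean_from_dot(0, 0))  # sqrt(6): 9 + 5 - 2*4
print(angular_from_dot(1, 2))    # 0.0: identical vectors
```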

Packing itself does not require aligned caller buffers. The packed object owns its internal payload and handles the layout under the hood.

Tensor @ PackedMatrix is also supported and maps to the same packed dot-product path.

Symmetric Kernels for SYRK-Like Workloads

Symmetric kernels solve a different problem from packed cross-matrix kernels. They compute self-similarity or self-distance matrices. This is the SYRK-like story.

import numpy as np
import numkong as nk

vectors = np.random.randn(1024, 768).astype(np.float32)
out = nk.zeros((1024, 1024), dtype="float64")

nk.dots_symmetric(vectors, out=out, start_row=0, end_row=256)
nk.dots_symmetric(vectors, out=out, start_row=256, end_row=512)

assert out.shape == (1024, 1024)

This family has different economics from packed GEMM-like work. It avoids duplicate (i, j) and (j, i) evaluations. It is naturally partitioned by row windows of one square output.

angulars_symmetric and euclideans_symmetric also benefit from reuse of dot-product-derived work inside the symmetric sweep. That is why these APIs are faster than a nested Python loop over angular(a[i], a[j]).
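
The saving from skipping duplicate pairs can be counted directly. This pure-Python sketch fills a Gram matrix one row window at a time and evaluates each unordered pair once. Mirroring the result into the lower triangle is a choice made for this illustration; how NumKong fills the output is not described here.

```python
def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def dots_symmetric_rows(vectors, out, start_row, end_row):
    """Fill rows [start_row, end_row) of the Gram matrix, computing each pair once."""
    evaluations = 0
    for i in range(start_row, end_row):
        for j in range(i, len(vectors)):   # upper triangle only
            value = dot(vectors[i], vectors[j])
            out[i][j] = out[j][i] = value  # mirror into the lower triangle
            evaluations += 1
    return evaluations

vectors = [[1.0, 0.0], [0.0, 2.0], [3.0, 4.0], [1.0, 1.0]]
n = len(vectors)
out = [[0.0] * n for _ in range(n)]

# Two row windows, as an external scheduler might hand them to two workers.
evaluations = dots_symmetric_rows(vectors, out, 0, 2) + dots_symmetric_rows(vectors, out, 2, 4)
print(evaluations, n * n)  # 10 16: n(n+1)/2 pair evaluations instead of n*n
```

Note that equal-sized row windows do not carry equal work in a triangular sweep: the first window here performs 7 evaluations and the second only 3.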

Geometric Mesh Alignment

Mesh alignment returns a structured result object. The current implementation exposes rotation, scale, rmsd, a_centroid, and b_centroid.

import numpy as np
import numkong as nk

source = np.array(
    [[0.0, 0.0, 0.0],
     [1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]],
    dtype=np.float32,
)

result = nk.kabsch(source, source.copy())
assert np.asarray(result.rotation).shape == (3, 3)
assert float(np.asarray(result.scale)) == 1.0

# Umeyama with known 2x scaling
target = source * 2.0
result = nk.umeyama(source, target)
assert float(np.asarray(result.rmsd)) < 1e-6, "umeyama should recover exact alignment"
assert abs(float(np.asarray(result.scale)) - 2.0) < 0.01, "umeyama should recover 2x scale"

That field-level check is the right style for this API family. It tells the reader exactly what the result object owns.

MaxSim and ColBERT-Style Late Interaction

MaxSim is the late-interaction primitive used by systems such as ColBERT. It is not generic matrix multiplication.

import numpy as np
import numkong as nk

queries = np.random.randn(32, 128).astype(np.float32)
documents = np.random.randn(192, 128).astype(np.float32)

q = nk.maxsim_pack(queries, dtype="float32")
d = nk.maxsim_pack(documents, dtype="float32")
score = nk.maxsim_packed(q, d)

assert np.isfinite(score)
assert q.nbytes == nk.MaxSimPackedMatrix.packed_size(32, 128, dtype="float32")
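
For reference, the ColBERT-style MaxSim score is the sum, over query tokens, of the best similarity against any document token. The sketch below uses plain dot products as the similarity. The sign, normalization, and internal quantization of `nk.maxsim_packed` are not described in this section, so its value may differ from this textbook definition.

```python
def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def maxsim(query_tokens, document_tokens):
    """Sum over query tokens of the best dot product against any document token."""
    return sum(max(dot(q, d) for d in document_tokens) for q in query_tokens)

query_tokens = [[1.0, 0.0], [0.0, 1.0]]
document_tokens = [[0.5, 0.5], [2.0, -1.0], [0.0, 3.0]]
score = maxsim(query_tokens, document_tokens)
print(score)  # 5.0: best match 2.0 for the first token, 3.0 for the second
```

Unlike a matrix product, only one number per query token survives the inner maximum, which is why the result is a scalar rather than a 32 x 192 matrix.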

Capabilities, GIL Behavior, and Parallel Partitioning

Capability detection is explicit:

import numkong as nk

caps = nk.get_capabilities()
print({k: v for k, v in caps.items() if v})

The current implementation releases the GIL around the native dense metric calls and around the packed and symmetric matrix kernels. The repository also has threading tests for packed and symmetric row-range partitioning.

GEMM-like packed work and SYRK-like symmetric work are partitioned differently. Packed work splits the rows of the left operand against one shared packed right-hand side:

import concurrent.futures
import numpy as np
import numkong as nk

left = np.random.randn(4096, 768).astype(np.float32)
right = np.random.randn(8192, 768).astype(np.float32)
packed = nk.dots_pack(right, dtype="float32")
out = nk.zeros((4096, 8192), dtype="float64")  # out must be pre-allocated with correct shape and dtype

def packed_chunk(start, end):
    nk.dots_packed(left, packed, out=out, start_row=start, end_row=end) # split left rows against one shared packed RHS

with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(packed_chunk, start, min(start + 1024, 4096))
               for start in range(0, 4096, 1024)]
    for future in futures:
        future.result()  # re-raise any exception from a worker thread

Symmetric work instead splits row windows of one square output:

import concurrent.futures
import numpy as np
import numkong as nk

vectors = np.random.randn(4096, 768).astype(np.float32)
out = nk.zeros((4096, 4096), dtype="float64")  # out must be pre-allocated with correct shape and dtype

def symmetric_chunk(start, end):
    nk.dots_symmetric(vectors, out=out, start_row=start, end_row=end) # split row windows of one square output

with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(symmetric_chunk, start, min(start + 1024, 4096))
               for start in range(0, 4096, 1024)]
    for future in futures:
        future.result()  # re-raise any exception from a worker thread

OpenMP and other native schedulers still matter in lower layers. For Python, the intended user-facing story is external partitioning around the GIL-free kernels you actually use.

Addressing External Memory

NumKong implements the Python buffer protocol for zero-copy interop with NumPy, PyTorch, and other buffer-aware libraries. Two additional primitives cover pointer-level workflows: data_ptr reads the integer address out of any Tensor, and from_pointer() wraps any integer address back into one.

data_ptr returns the raw address, suitable for passing into ctypes, CUDA, or any FFI boundary. from_pointer(address, shape, dtype, *, strides=None, owner=None) creates a non-owning Tensor view. The optional owner keeps the source object alive for the lifetime of the view.

import numpy as np
import numkong as nk

# Round-trip through an integer address
matrix = nk.zeros((3, 4), dtype='float32')
address = matrix.data_ptr
matrix_view = nk.from_pointer(address, (3, 4), 'float32', owner=matrix)

# Wrap a NumPy array with zero copies
embeddings = np.random.randn(1024).astype(np.float32)
embeddings_view = nk.from_pointer(embeddings.ctypes.data, (1024,), 'float32', owner=embeddings)
nk.dot(embeddings, embeddings_view)  # same underlying data

PyTorch tensors already implement the buffer protocol, so most functions accept them directly. For explicit pointer-level control, or to go the other direction, the same primitives apply:

import torch

query = torch.randn(512)
nk.dot(query, query)  # buffer protocol, zero copy

# Explicit pointer wrap
query_view = nk.from_pointer(query.data_ptr(), tuple(query.shape), 'float32', owner=query)

# NumKong → PyTorch: 1D via buffer protocol, N-D via numpy bridge
flat = torch.frombuffer(memoryview(nk_tensor), dtype=torch.float32)
shaped = torch.as_tensor(np.asarray(nk_tensor))

CUDA unified memory, pinned buffers, and mmap'd files all work the same way — any CPU-accessible pointer is valid.

import ctypes, mmap

# CUDA unified memory (ensure CPU accessibility first)
cudart = ctypes.CDLL("libcudart.so")
unified_ptr = ctypes.c_void_p()
cudart.cudaMallocManaged(ctypes.byref(unified_ptr), 4096, 1)
cudart.cudaDeviceSynchronize()
unified = nk.from_pointer(unified_ptr.value, (1024,), 'float32')

# Memory-mapped file
with open("data.bin", "r+b") as f:
    mapping = mmap.mmap(f.fileno(), 0)
    mapped = nk.from_pointer(ctypes.addressof(
        ctypes.c_char.from_buffer(mapping)),
        (1024,), 'float32', owner=mapping)

NumKong: Mixed Precision for All

Portable mixed-precision math, linear-algebra, & retrieval library with 2'000+ SIMD kernels for x86, Arm, RISC-V, LoongArch, Power, & WebAssembly, leveraging rare algebraic transforms with both 1D & 2D registers like AMX & SME, covering 15+ numeric types from 4-bit integers & 6-bit floats to 128-bit complex numbers, validated against 118-bit extended-precision baselines with saturation, casting, & rounding edge-case coverage, in a 5-100x smaller binary than other BLAS-like alternatives, co-designed with Tensor abstractions in C++, Python, Rust, JavaScript, GoLang, & Swift.

Latency, Throughput, & Numerical Stability

Most libraries return dot products in the same type as the input: Float16 × Float16 → Float16, Int8 × Int8 → Int8. This leads to quiet overflow: a 2048-dimensional i8 dot product can reach about ±33 million (2048 × 128²), but i8 maxes out at 127. NumKong promotes to wider accumulators (Float16 → Float32, BFloat16 → Float32, Int8 → Int32, Float32 → Float64), so results stay in range.

Single 2048-d dot product on Intel Sapphire Rapids, single-threaded. Each cell shows gso/s, mean relative error vs higher-precision reference. gso/s = Giga Scalar Operations per Second — a more suitable name than GFLOP/s when counting both integer and floating-point work. NumPy 2.4, PyTorch 2.10, JAX 0.9.

Input	NumPy + OpenBLAS	PyTorch + MKL	JAX	NumKong
`f64`	2.0 gso/s, 1e-15 err	0.6 gso/s, 1e-15 err	0.4 gso/s, 1e-14 err	5.8 gso/s, 1e-16 err
`f32`	1.5 gso/s, 2e-6 err	0.6 gso/s, 2e-6 err	0.4 gso/s, 5e-6 err	7.1 gso/s, 2e-7 err
`bf16`	—	0.5 gso/s, 1.9% err	0.5 gso/s, 1.9% err	9.7 gso/s, 1.8% err
`f16`	0.2 gso/s, 0.25% err	0.5 gso/s, 0.25% err	0.4 gso/s, 0.25% err	11.5 gso/s, 0.24% err
`e5m2`	—	0.7 gso/s, 4.6% err	0.5 gso/s, 4.6% err	7.1 gso/s, 0% err
`i8`	1.1 gso/s, overflow	0.5 gso/s, overflow	0.5 gso/s, overflow	14.8 gso/s, 0% err

A fair objection: PyTorch and JAX are designed for throughput, not single-call latency. They lower execution graphs through XLA or vendored BLAS libraries like Intel MKL and Nvidia cuBLAS. So here's the same comparison on a throughput-oriented workload — matrix multiplication:

Matrix multiplication (2048 × 2048) × (2048 × 2048) on Intel Sapphire Rapids, single-threaded. gso/s = Giga Scalar Operations per Second, same format. NumPy 2.4, PyTorch 2.10, JAX 0.9, same versions.

Input	NumPy + OpenBLAS	PyTorch + MKL	JAX	NumKong
`f64`	65.5 gso/s, 1e-15 err	68.2 gso/s, 1e-15 err	~14.3 gso/s, 1e-15 err	8.6 gso/s, 1e-16 err
`f32`	140 gso/s, 9e-7 err	145 gso/s, 1e-6 err	~60.5 gso/s, 1e-6 err	37.7 gso/s, 4e-7 err
`bf16`	—	851 gso/s, 1.8% err	~25.8 gso/s, 3.4% err	458 gso/s, 3.6% err
`f16`	0.3 gso/s, 0.25% err	140 gso/s, 0.37% err	~26.1 gso/s, 0.35% err	103 gso/s, 0.26% err
`e5m2`	—	0.4 gso/s, 4.6% err	~26.4 gso/s, 4.6% err	398 gso/s, 0% err
`i8`	0.4 gso/s, overflow	50.0 gso/s, overflow	~0.0 gso/s, overflow	1279 gso/s, 0% err

For f64, compensated "Dot2" summation reduces error by 10–50× compared to naive Float64 accumulation, depending on vector length. For f32, widening to Float64 gives 5–10× lower error. The library ships as a relatively small binary:
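
To give a feel for what compensation buys, here is a small pure-Python sketch of a Neumaier-style compensated dot product. It is a simpler relative of Dot2, not the library's kernel: it recovers the rounding error of each addition but, unlike Dot2, does not recover the rounding error of each product.

```python
import math

def dot_naive(a, b):
    total = 0.0
    for x, y in zip(a, b):
        total += x * y
    return total

def dot_compensated(a, b):
    """Neumaier-style compensation: track the rounding error of each addition."""
    total = compensation = 0.0
    for x, y in zip(a, b):
        product = x * y
        new_total = total + product
        if abs(total) >= abs(product):
            compensation += (total - new_total) + product  # low bits lost from product
        else:
            compensation += (product - new_total) + total  # low bits lost from total
        total = new_total
    return total + compensation

a = [1e16, 1.0, -1e16]
b = [1.0, 1.0, 1.0]
exact = math.fsum(x * y for x, y in zip(a, b))
print(dot_naive(a, b), dot_compensated(a, b), exact)  # 0.0 1.0 1.0
```

The naive loop loses the 1.0 entirely because it is below the spacing of float64 values near 1e16; the compensated loop carries it in a second accumulator.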

Package	Size	Parallelism & Memory	Available For
PyTorch + MKL	705 MB	Vector & Tile SIMD, OpenMP Threads, Hidden Allocs	Python, C++, Java
JAX + jaxlib	357 MB	Vector SIMD, XLA Threads, Hidden Allocs	Python
NumPy + OpenBLAS	30 MB	Vector SIMD, Built-in Threads, Hidden Allocs	Python
mathjs	9 MB	No SIMD, No Threads, Many Allocs	JS
NumKong	5 MB	Vector & Tile SIMD, Your Threads, Your Allocs	7 languages

Every kernel is validated against 118-bit extended-precision baselines with per-type ULP budgets across log-normal, uniform, and Cauchy input distributions. Tests check triangle inequality, Cauchy-Schwarz bounds, NaN propagation, overflow detection, and probability-simplex constraints for each ISA variant. Results are cross-validated against OpenBLAS, Intel MKL, and Apple Accelerate. A broader throughput comparison is maintained in NumWars.

Quick Start

Language	Install	Compatible with	Guide
C / C++	CMake, headers, & prebuilt	Linux, macOS, Windows, Android	include/README.md
Python	`pip install`	Linux, macOS, Windows	python/README.md
Rust	`cargo add`	Linux, macOS, Windows	rust/README.md
JS	`npm install` & `import`	Node.js, Bun, Deno & browsers	javascript/README.md
Swift	Swift Package Manager	macOS, iOS, tvOS, watchOS	swift/README.md
Go	`go get`	Linux, macOS, Windows via cGo	golang/README.md

What's Inside

NumKong covers 17 numeric types, from 6-bit floats to 128-bit complex numbers, across dozens of operations and 30+ SIMD backends, with hardware-aware defaults: Arm prioritizes f16, x86 prioritizes bf16.

Operations × Backend

Backend	dot	dots	spatial	spatials	set	sets	cast	reduce	trig	maxsim	mesh
x86
Haswell	●	●	●	●	●	●	●	●	●	●	●
Skylake	●	●	●	●	·	·	●	●	●	·	●
Ice Lake	●	●	●	●	●	●	●	●	·	●	·
Genoa	●	●	●	●	·	·	·	●	·	●	·
Sapphire	●	·	●	·	·	·	●	·	·	·	·
Sapphire AMX	·	●	·	●	·	·	·	·	·	●	·
Diamond	●	●	●	●	·	·	·	·	·	·	·
Alder Lake	●	●	●	●	·	·	·	●	·	●	·
Sierra Forest	●	●	●	●	·	·	·	●	·	·	·
Turin	·	·	·	·	·	·	·	·	·	·	·
Arm
NEON	●	●	●	●	●	●	●	●	●	·	●
NEON Half	●	●	●	●	·	·	·	●	·	·	●
NEON FHM	●	●	·	●	·	·	·	●	·	·	·
NEON BF16	●	●	●	●	·	·	·	●	·	·	●
NEON SDot	●	●	●	●	·	·	·	●	·	●	·
NEON FP8	●	●	●	●	·	·	·	·	·	·	·
SVE	●	·	●	·	●	·	·	·	·	·	·
SVE Half	●	·	●	·	·	·	·	·	·	·	·
SVE BF16	●	·	●	·	·	·	·	·	·	·	·
SVE SDot	·	·	·	·	·	·	·	·	·	·	·
SVE2	·	·	·	·	·	·	·	·	·	·	·
SME	·	●	·	●	·	·	·	·	·	●	·
SME F64	·	●	·	●	·	·	·	·	·	·	·
SME BI32	·	●	·	·	·	●	·	·	·	·	·
Other
Power VSX	●	●	●	●	●	●	·	·	·	·	·
LoongArch LASX	●	●	●	●	●	●	·	·	·	·	·
RVV	●	●	●	●	●	·	●	●	●	·	●
RVV Half	●	·	●	·	·	·	·	·	·	·	·
RVV BF16	●	·	●	·	·	·	·	·	·	·	·
RVV BB	●	·	·	·	●	·	·	·	·	·	·
WASM V128	●	●	●	●	●	●	·	●	●	●	●
Serial	●	●	●	●	●	●	●	●	●	●	●

Numeric Types × Backend

Backend	f64	f32	bf16	f16	e5m2	e4m3	e3m2	e2m3	i8	u8	i4	u4	u1	f64c	f32c	bf16c	f16c
x86
Haswell	●	●	●	●	●	●	●	●	●	●	●	●	●	●	●	●	●
Skylake	●	●	●	●	●	●	●	●	●	●	●	●	●	●	●	·	·
Ice Lake	·	●	·	●	·	·	●	●	●	●	●	●	●	·	·	·	·
Genoa	·	·	●	·	●	●	·	·	·	·	·	·	·	·	·	●	·
Sapphire	·	·	·	●	·	●	●	●	●	●	·	·	·	·	·	·	·
Sapphire AMX	·	●	●	●	●	●	●	●	●	●	·	·	·	·	·	·	·
Diamond	·	·	·	●	●	●	·	·	·	·	·	·	·	·	·	·	·
Alder Lake	·	●	●	●	·	·	●	●	●	●	·	·	·	·	·	·	·
Sierra Forest	·	·	·	·	·	·	●	●	●	●	·	·	·	·	·	·	·
Turin	·	●	●	·	·	·	·	·	·	·	·	·	·	·	·	·	·
Arm
NEON	●	●	·	·	●	●	●	●	●	●	·	·	●	●	●	·	·
NEON Half	·	·	·	●	·	·	·	·	●	●	·	·	·	·	·	·	●
NEON FHM	·	·	·	●	●	●	·	·	·	·	·	·	·	·	·	·	●
NEON BF16	·	·	●	·	●	●	·	·	·	·	·	·	·	·	·	●	·
NEON SDot	·	●	●	●	·	·	●	●	●	●	●	●	·	·	·	·	·
NEON FP8	·	·	·	·	●	●	●	●	·	·	·	·	·	·	·	·	·
SVE	●	●	●	●	·	·	·	·	●	●	·	·	●	●	●	·	·
SVE Half	·	·	·	●	·	·	·	·	·	·	·	·	·	·	·	·	●
SVE BF16	·	·	●	·	·	·	·	·	·	·	·	·	·	·	·	·	·
SVE SDot	·	·	·	·	·	·	·	·	·	·	·	·	·	·	·	·	·
SVE2	·	●	●	·	·	·	·	·	·	·	·	·	·	·	·	·	·
SME	·	●	●	●	●	●	●	●	●	●	●	●	·	·	·	·	·
SME F64	●	●	·	·	·	·	·	·	·	·	·	·	·	●	●	·	·
SME BI32	·	·	·	·	·	·	·	·	●	·	·	·	●	·	·	·	·
Other
Power VSX	●	●	●	●	·	·	·	·	●	●	·	·	●	·	·	·	·
LoongArch LASX	●	●	●	●	·	·	·	·	●	●	·	·	●	·	·	·	·
RVV	●	●	●	●	●	●	●	●	●	●	●	●	●	●	●	·	·
RVV Half	·	·	·	●	●	●	·	·	·	·	·	·	·	·	·	·	·
RVV BF16	·	·	●	·	●	●	·	·	·	·	·	·	·	·	·	·	·
RVV BB	·	·	·	·	·	·	·	·	·	·	·	·	●	·	·	·	·
WASM V128	●	●	●	●	●	●	●	●	●	●	●	●	●	·	·	·	·

Language Bindings

Operation	C/C++	Python	Rust	JavaScript	Swift	Go
dot	●	●	●	●	●	●
dots	●	●	●	●	●	●
spatial	●	●	●	●	●	●
spatials	●	●	●	●	●	●
set	●	●	●	●	●	●
sets	●	●	●	·	●	●
cast	●	●	●	●	·	·
reduce	●	●	●	·	·	·
trig	●	●	●	·	·	·
geospatial	●	●	●	·	●	●
maxsim	●	●	●	·	●	●
mesh	●	●	●	·	·	·

Not every combination is implemented — only the ones that unlock real performance gains. The icelake level doesn't get a dot_bf16 variant, for example, and falls through to dot_bf16_skylake. Every operation has a serial fallback, but even types no CPU supports today get optimized via lookup tables and bit-twiddling hacks rather than scalar loops. For details on compile-time and run-time dispatch, see the contributor guide.

Design Decisions

Avoid loop unrolling and scalar tails.
Don't manage threads and be compatible with any parallelism model.
Don't manage memory and be compatible with arbitrary allocators & alignment.
Don't constrain ourselves to traditional BLAS-like Matrix Multiplication APIs.
Don't throw exceptions and pass values by pointers.
Prefer saturated arithmetic and avoid overflows, where needed.
Cover most modern CPUs with flexible dispatch and wait for them to converge with GPUs.

The rest of this document unpacks the functionality and the logic behind the design decisions.

Auto-Vectorization & Loop Unrolling

Most "optimized SIMD code" is a 2–4x unrolled data-parallel for-loop over f32 arrays with a serial scalar tail for the last few elements:

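#include <immintrin.h>
#include <stddef.h>

float boring_dot_product_f32(float const *a, float const *b, size_t n) {
    __m256 sum0 = _mm256_setzero_ps(), sum1 = _mm256_setzero_ps();
    size_t i = 0;
    for (; i + 16 <= n; i += 16) { // 2x unrolled body with two independent accumulators
        sum0 = _mm256_fmadd_ps(_mm256_loadu_ps(a + i), _mm256_loadu_ps(b + i), sum0);
        sum1 = _mm256_fmadd_ps(_mm256_loadu_ps(a + i + 8), _mm256_loadu_ps(b + i + 8), sum1);
    }
    // AVX2 has no `_mm256_reduce_add_ps`, so fold the 8 lanes down manually
    __m256 sum = _mm256_add_ps(sum0, sum1);
    __m128 half = _mm_add_ps(_mm256_castps256_ps128(sum), _mm256_extractf128_ps(sum, 1));
    half = _mm_add_ps(half, _mm_movehl_ps(half, half));
    half = _mm_add_ss(half, _mm_shuffle_ps(half, half, 1));
    float result = _mm_cvtss_f32(half);
    for (; i < n; i++) result += a[i] * b[i]; // serial tail
    return result;
}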

This kind of unrolling has been a common request for NumKong, but the library avoids it by design.

Modern CPUs already "unroll" in hardware. Out-of-order engines with reorder buffers of 320–630 entries (Zen 4: 320, Golden Cove: 512, Apple Firestorm: ~630) can keep a dozen loop iterations in flight simultaneously. The physical register file is much larger than the set of ISA-visible architectural registers: Skylake has ~180 physical integer registers behind 16 architectural GPRs, and ~168 physical vector registers behind 32 architectural ZMMs. The register renaming unit maps the same zmm0 in iteration N and iteration N+1 to different physical registers, extracting cross-iteration parallelism automatically, which is exactly the benefit that source-level unrolling was historically supposed to provide.

Unrolling works against NumKong's goals. Every unrolled copy is a distinct instruction in the binary. With 1,500+ kernel endpoints across 30+ backends, even 2x unrolling would inflate the .text section by megabytes, directly impacting install size for Python wheels, NPM packages, and Rust crates. Larger loop bodies also increase instruction-cache and micro-op-cache pressure; Agner Fog recommends:

"avoid loop unrolling where possible in order to economize the use of the micro-op cache".

A loop that spills out of the uop cache falls back to the slower legacy decoder, making the "optimized" version slower than the compact original. For a header-only library, unrolling also compounds compilation time: register allocation is NP-hard (reducible to graph coloring), and unrolling multiplies the number of simultaneously live ranges the allocator must consider, increasing compile time super-linearly across every translation unit that includes the headers.

Serial tails are a correctness hazard. The leftover elements after the last full SIMD chunk run through a scalar loop that silently drops FMA fusion, compensated accumulation, and saturating arithmetic, producing results with different numerical properties than the SIMD body. NumKong often uses masked loads instead (_mm512_maskz_loadu_ps on AVX-512, predicated svld1_f32 on SVE), processing every element through the same arithmetic path regardless of the array length or alignment. Masking is not strictly orthogonal to loop unrolling, but it favors a compact kernel layout with a single loop body and no separate tail.
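To make the masked-tail idea concrete, here is a minimal pure-Python emulation, assuming 16 Float32 lanes per register as on AVX-512. The helper names (`tail_mask`, `masked_load`, `dot_masked`) are hypothetical and only model the control flow; they are not NumKong functions.

```python
LANES = 16  # assumption: one AVX-512 register holds 16 Float32 lanes

def tail_mask(remaining: int, lanes: int = LANES) -> int:
    """Bitmask with one set bit per valid lane in the current chunk."""
    return (1 << min(remaining, lanes)) - 1

def masked_load(values, offset, mask, lanes=LANES):
    """Emulate a zero-masking load: lanes whose mask bit is clear read as 0.0."""
    return [values[offset + lane] if (mask >> lane) & 1 else 0.0 for lane in range(lanes)]

def dot_masked(a, b, lanes=LANES):
    """Dot product where the tail uses the same per-lane path as the body."""
    acc = [0.0] * lanes
    for offset in range(0, len(a), lanes):
        mask = tail_mask(len(a) - offset, lanes)
        va = masked_load(a, offset, mask, lanes)
        vb = masked_load(b, offset, mask, lanes)
        acc = [s + x * y for s, x, y in zip(acc, va, vb)]
    return sum(acc)  # single horizontal reduction at the very end
```

The last partial chunk is handled by the same multiply-accumulate expression as every full chunk; only the mask changes.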

The gains come from elsewhere. On Intel Sapphire Rapids, NumKong was benchmarked against auto-vectorized code compiled with GCC 12. GCC handles single-precision float well, but struggles with _Float16 and other mixed-precision paths:

Kind	GCC 12 `f32`	GCC 12 `f16`	NumKong `f16`	`f16` improvement
Inner Product	3,810 K/s	192 K/s	5,990 K/s	31 x
Cosine Distance	3,280 K/s	336 K/s	6,880 K/s	20 x
Euclidean Distance ²	4,620 K/s	147 K/s	5,320 K/s	36 x
Jensen-Shannon Divergence	1,180 K/s	18 K/s	2,140 K/s	118 x

NumKong's f16 kernels are faster than GCC's f32 output — not because of unrolling, but because they use F16C conversion instructions, widening FMA pipelines, and compensated accumulation that compilers do not synthesize from a plain for loop. The same story repeats for bf16, e4m3, i8, and i4: these types require algorithmic transformations — lookup tables, algebraic domain shifts, asymmetric VNNI tricks — that live beyond the reach of auto-vectorization.

Parallelism & Multi-Threading

BLAS libraries traditionally manage their own thread pools. OpenBLAS spawns threads controlled by OPENBLAS_NUM_THREADS, Intel MKL forks its own OpenMP runtime via MKL_NUM_THREADS, and Apple Accelerate delegates to GCD (Grand Central Dispatch). This works in isolation — but the moment your application adds its own parallelism (joblib, std::thread, Tokio, GCD, OpenMP), you get thread oversubscription: MKL spawns 8 threads inside each of your 8 joblib workers, producing 64 threads on 8 cores, thrashing caches and stalling on context switches. The Python ecosystem has built entire libraries just to work around this problem, and scikit-learn's documentation devotes a full page to managing the interaction between joblib parallelism and BLAS thread pools.

NumKong takes a different position: the numerics layer should not own threads. Modern hardware makes the "spawn N threads and split evenly" model increasingly untenable:

Server-grade CPUs have hundreds of cores split across sockets, chiplets, and tiles, resulting in dozens of physical NUMA domains with vastly different memory access latencies. A thread pool that ignores NUMA topology will spend more time on remote memory stalls than on arithmetic.
Consumer-grade CPUs pack heterogeneous Quality-of-Service core types on the same die — Intel P-cores and E-cores run at different frequencies and sometimes support different ISA extensions. A naive work-split gives equal chunks to fast and slow cores, and the whole task stalls waiting for the slowest partition.
Real-time operating systems in robotics and edge AI cannot afford to yield the main thread to a BLAS-managed pool. These systems need deterministic latency, not maximum throughput.

Instead, NumKong exposes row-range parameters that let the caller partition work across any threading model. For GEMM-shaped dots_packed, this is straightforward — pass a slice of A's rows and the full packed B to compute the corresponding slice of C. For SYRK-shaped dots_symmetric, explicit start_row / end_row parameters control which rows of the symmetric output matrix a given thread computes. The GIL (Global Interpreter Lock) is released around every kernel call, making NumKong compatible with concurrent.futures, multiprocessing, or any other parallelism model:

import concurrent.futures, numkong as nk, numpy as np

vectors, num_threads = np.random.randn(1000, 768).astype(np.float32), 4
output = nk.zeros((1000, 1000), dtype="float32")

def compute_slice(t):
    start = t * (len(vectors) // num_threads)
    end = start + len(vectors) // num_threads if t < num_threads - 1 else len(vectors)
    nk.dots_symmetric(vectors, out=output, start_row=start, end_row=end)

with concurrent.futures.ThreadPoolExecutor(max_workers=num_threads) as pool:
    list(pool.map(compute_slice, range(num_threads)))

For users who want a ready-made low-latency thread pool without the oversubscription baggage of OpenMP, we built ForkUnion — a minimalist fork-join library for C, C++, and Rust that avoids mutexes, CAS atomics, and dynamic allocations on the critical path, with optional NUMA pinning on Linux.

Memory Allocation & Management

BLAS libraries typically allocate internal buffers during GEMM — OpenBLAS packs matrices into L2/L3-sized panels via per-thread buffer pools backed by mmap or shmget. This hidden allocation has caused real problems: 14 lock/unlock pairs per small GEMM call throttling 12-thread scaling to 2x, silently incorrect results from thread-unsafe allocation in np.dot, and deadlocks after fork() due to mutex state not being reset in child processes. The BLASFEO library was created specifically for embedded model-predictive control where malloc during computation is unacceptable.

NumKong never allocates memory. Following the same philosophy as Intel MKL's packed GEMM API (cblas_sgemm_pack_get_size → cblas_sgemm_pack → cblas_sgemm_compute), NumKong exposes typed three-phase interfaces — nk_dots_packed_size_* → nk_dots_pack_* → nk_dots_packed_* — where the caller owns the buffer and NumKong only fills it.

The reason GEMM libraries repack matrices at all is that every hardware target has a different preferred layout — Intel AMX expects B in a VNNI-interleaved tile format (pairs of BFloat16 values packed into DWORDs across the K dimension), while Arm SME wants column vectors for its FMOPA outer-product instructions. Since GEMM is $O(N^3)$ and repacking is $O(N^2)$, the cost is asymptotically free — but the allocation and locking overhead is not.

NumKong's nk_dots_pack_* family performs five transformations beyond simple reordering:

Type pre-conversion — mini-floats (E4M3, BFloat16, etc.) are upcast to the compute type once during packing, not on every GEMM call. This amortizes the conversion cost across all rows of A that will be multiplied against the packed B.
SIMD depth padding — rows are zero-padded to the SIMD vector width (16 for AVX-512 Float32, 64 for AVX-512 Int8), allowing inner loops to load without boundary checks.
Per-column norm precomputation — squared norms ($|b_j|^2$) are computed and stored alongside the packed data, so distance kernels (angulars_packed, euclideans_packed) can reuse them without a separate pass.
ISA-specific tile layout — AMX packing interleaves BFloat16 pairs into 16×32 tiles matching TDPBF16PS expectations; SME packing arranges vectors at SVE granularity for FMOPA outer products; generic backends use simple column-major with depth padding.
Power-of-2 stride breaking — when the padded row stride is a power of 2, one extra SIMD step of padding is added. Power-of-2 strides cause cache set aliasing where consecutive rows map to the same cache sets, effectively shrinking usable L1/L2 capacity — stride-256 traversals can be ~10x slower than stride-257.
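The padding and stride-breaking rules from the list above can be sketched in a few lines. This is a simplified model under stated assumptions: the stride is counted in elements rather than bytes, and `padded_stride` is a hypothetical helper, not part of the NumKong API.

```python
def padded_stride(depth: int, simd_width: int) -> int:
    """Round `depth` up to a multiple of `simd_width`, then avoid power-of-2 strides."""
    stride = -(-depth // simd_width) * simd_width  # ceiling division, then scale back
    if (stride & (stride - 1)) == 0:  # exactly one bit set means a power of two
        stride += simd_width          # one extra SIMD step breaks cache-set aliasing
    return stride
```

For example, a depth of 1000 with 16 lanes pads to 1008, while a depth of 1024 is already aligned but is bumped to 1040 so that consecutive rows stop mapping to the same cache sets.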

import numkong as nk, numpy as np

right_matrix = np.random.randn(1000, 768).astype(np.float16)
right_packed = nk.dots_pack(right_matrix, dtype=nk.float16)  # pack once
# Stand-in for an incoming stream of query batches
stream = (np.random.randn(32, 768).astype(np.float16) for _ in range(10))
for query_batch in stream:
    results = nk.dots_packed(query_batch, right_packed)  # reuse many times

Why Not Just GEMM? The Evolution of Matrix Multiplication APIs

The classic BLAS GEMM computes $C = \alpha A B + \beta C$ for Float32/Float64 matrices. It covers many use cases, but LLM inference, vector search, and quantum simulation expose three ways in which the traditional interface falls short.

Frozen weights justify separating packing from computation. During LLM inference, a very large share of GEMM calls use a static weight matrix — weights don't change after loading. This makes offline repacking a one-time cost amortized over the entire serving lifetime: NVIDIA's TurboMind explicitly splits GEMM into offline weight packing (hardware-aware layout conversion) and online mixed-precision computation, and Intel MKL's packed GEMM API exposes the same two-phase pattern. NumKong's nk_dots_pack_* → nk_dots_packed_* path follows this philosophy — pack the weight matrix once, reuse it across all queries.

Mixed precision demands more than an epilogue addition. Modern transformer layers operate in a precision sandwich: weights stored in BFloat16/Float8, GEMM accumulated in Float32, output downcast back to BFloat16 for the next layer. Between GEMM calls, LayerNorm or RMSNorm re-normalizes hidden states, so the next layer is often much closer to an angular or normalized similarity computation than to a plain raw dot product. nGPT takes this to its logical conclusion: all vectors live on the unit hypersphere, and every matrix-vector product is a pure angular distance. This means many "GEMM" workloads in production are semantically closer to many-to-many angular distance computation — which is exactly what NumKong's angulars_packed and euclideans_packed kernels compute directly, fusing norm handling and type conversion into a single pass.

The GEMM-for-distances trick has real costs. A common shortcut in vector search is to decompose pairwise Euclidean distance as $|a - b|^2 = |a|^2 + |b|^2 - 2 \langle a, b \rangle$, precompute norms, and call sgemm for the inner-product matrix. Both FAISS and scikit-learn use this approach — and both document its limitations. Scikit-learn's docs warn of "catastrophic cancellation" in the subtraction; this has caused real bugs with ~37% error on near-identical Float32 vectors. The $O(N^2)$ postprocessing pass (adding norms, square roots, divisions) is not free either — NVIDIA's RAFT measured a 20–25% speedup from fusing it into the GEMM epilogue. Even FAISS switches to direct SIMD when the query count drops below 20. The standard BLAS interface was never designed for sub-byte types either — no vendor supports Int4, and sub-byte types cannot even be strided without bit-level repacking.
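The cancellation problem is easy to reproduce without any library. The sketch below emulates Float32 rounding with the standard `struct` module and compares the direct sum of squared differences against the norm-expansion formula. The helper names and the two test vectors are illustrative choices, not taken from FAISS, scikit-learn, or NumKong.

```python
import struct

def f32(x: float) -> float:
    """Round a Python float (Float64) to the nearest Float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def dot_f32(u, v) -> float:
    """Dot product with every product and partial sum rounded to Float32."""
    total = 0.0
    for x, y in zip(u, v):
        total = f32(total + f32(x * y))
    return total

def sqeuclidean_direct(u, v) -> float:
    """Sum of squared differences, accumulated in Float32."""
    total = 0.0
    for x, y in zip(u, v):
        d = f32(x - y)
        total = f32(total + f32(d * d))
    return total

def sqeuclidean_expanded(u, v) -> float:
    """|u|^2 + |v|^2 - 2<u, v>, the decomposition used with a GEMM call."""
    return f32(f32(dot_f32(u, u) + dot_f32(v, v)) - f32(2.0 * dot_f32(u, v)))

a = [1000.0, 1000.0]
b = [1000.0, f32(1000.0 + 2.0**-10)]  # differs from `a` by 2^-10 in one coordinate
direct = sqeuclidean_direct(a, b)      # the true squared distance is 2^-20
expanded = sqeuclidean_expanded(a, b)  # built from terms near 4e6, spaced 0.25 apart
```

The direct form recovers the true squared distance of 2⁻²⁰, while the expanded form subtracts two numbers near 4×10⁶ whose Float32 spacing is 0.25, so its error is at least as large as the distance itself.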

Some operations need more than GEMM + postprocessing. NumKong implements several GEMM-shaped operations where the "epilogue" is too complex for a simple addition:

Bilinear forms ($a^T C b$) in quantum computing compute a scalar expectation value — the naive approach materializes an $N$-dimensional intermediate vector $Cb$, but NumKong's typed nk_bilinear_* kernels stream through rows of $C$ with nested compensated dot products, never allocating beyond registers. For complex-valued quantum states, where the intermediate would be a 2N-element complex vector, the savings double.
MaxSim scoring for ColBERT-style late-interaction retrieval computes $\sum_i \min_j \text{angular}(q_i, d_j)$ — a sum-of-min-distances across token pairs. A GEMM would produce the full $M \times N$ similarity matrix, but NumKong's typed nk_maxsim_packed_* kernels fuse a coarse Int8-quantized screening with full-precision angular refinement on winning pairs only, packing both query and document matrices to use all 4 SME tiles as accumulators. PLAID and maxsim-cpu have independently shown that dedicated MaxSim kernels can outperform the GEMM decomposition by 5–10x.
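As a reference point for the MaxSim formula above, here is a naive scalar version that evaluates $\sum_i \min_j \text{angular}(q_i, d_j)$ directly. It assumes angular distance means one minus cosine similarity and that no token vector is all zeros. The function names are hypothetical, and the sketch does none of the Int8 screening or packing described in the text.

```python
import math

def angular_distance(u, v) -> float:
    """One minus cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norms = math.sqrt(sum(x * x for x in u) * sum(y * y for y in v))
    return 1.0 - dot / norms

def maxsim_reference(query_tokens, document_tokens) -> float:
    """Sum over query tokens of the distance to the closest document token."""
    return sum(
        min(angular_distance(q, d) for d in document_tokens)
        for q in query_tokens
    )
```

Only one running minimum per query token is kept, so the full $M \times N$ distance matrix is never stored, which is the property a fused kernel exploits.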

NumKong treats these as first-class operations — dots_packed, euclideans_packed, angulars_packed, typed nk_bilinear_* kernels, and typed nk_maxsim_packed_* kernels — rather than decomposing everything into GEMM + postprocessing.

Precision by Design: Saturation, Rounding, & Float6 Over Float8

Floating-point arithmetic on computers is not associative: $(a + b) + c \neq a + (b + c)$ in general, and upcasting to wider types is not always sufficient. NumKong makes operation-specific decisions about where to spend precision and where to economize, rather than applying one rule uniformly.

Saturation depends on the operation. A reduction over a 4 GB array of i8 values contains ~4 billion elements — but Int32 wrapping overflow occurs after just ~17 million Int8 summands ($127 \times 16.9\text{M} > 2^{31}$). Reductions in NumKong use saturating arithmetic because the input can be arbitrarily long. Matrix multiplications don't need saturation because GEMM depth rarely exceeds tens of thousands — well within Int32 range. x86 provides no saturating 32-bit SIMD add (only byte/word variants), so NumKong implements saturation via overflow detection with XOR-based unsigned comparison on platforms that lack native support.
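The overflow arithmetic and the detect-then-clamp pattern can be checked with plain integers. The sketch below uses the classic sign-bit XOR test for signed overflow; it is one common formulation and may differ from the exact comparison NumKong uses. All names are hypothetical.

```python
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1

def wrap_i32(x: int) -> int:
    """Reinterpret an unbounded Python int as a wrapping two's-complement Int32."""
    return (x + 2**31) % 2**32 - 2**31

def saturating_add_i32(a: int, b: int) -> int:
    """Wrapping add plus XOR-based overflow detection, clamped to the Int32 range."""
    wrapped = wrap_i32(a + b)
    # Overflow happened iff both inputs have a sign that the wrapped sum lacks
    overflowed = ((a ^ wrapped) & (b ^ wrapped)) < 0
    if not overflowed:
        return wrapped
    return INT32_MAX if a >= 0 else INT32_MIN

# Number of maximal Int8 summands (127) needed to exceed the Int32 range
summands_until_overflow = INT32_MAX // 127 + 1
```

`summands_until_overflow` evaluates to 16,909,321, matching the "~17 million" figure above.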

Square roots & special math ops are platform-specific. Angular distance requires $1/\sqrt{|a|^2 \cdot |b|^2}$ — but the cost of computing this normalization varies dramatically across hardware. x86 VSQRTPS takes ~12 cycles, followed by VDIVPS at ~11 cycles — totalling ~23 cycles for a precise 1/sqrt(x). The VRSQRT14PS alternative starts with a 14-bit estimate in ~4 cycles, then one Newton-Raphson iteration ($y = y \cdot (1.5 - 0.5 x y^2)$, ~4 more cycles) reaches full Float32 precision — roughly 3x faster. ARM's FRSQRTE provides only ~8 bits, requiring two Newton-Raphson iterations to match. NumKong selects the iteration count per platform so the final ULP bound is consistent across ISAs, rather than exposing different precision to different users.
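The iteration counts quoted above follow from the quadratic convergence of Newton-Raphson: each step roughly doubles the number of correct bits. The sketch below models a hardware estimate as the exact value perturbed by a relative error of 2⁻¹⁴ or 2⁻⁸; these perturbed seeds are an assumption standing in for VRSQRT14PS and FRSQRTE, which behave less uniformly in practice.

```python
import math

def refine_rsqrt(x: float, estimate: float, iterations: int) -> float:
    """Apply y = y * (1.5 - 0.5 * x * y^2) the requested number of times."""
    y = estimate
    for _ in range(iterations):
        y = y * (1.5 - 0.5 * x * y * y)
    return y

def relative_error(x: float, y: float) -> float:
    """Relative deviation of y from the exact 1/sqrt(x)."""
    return abs(y * math.sqrt(x) - 1.0)

x = 3.0
exact = 1.0 / math.sqrt(x)
seed_14_bit = exact * (1.0 + 2.0**-14)  # stand-in for a ~14-bit estimate
seed_8_bit = exact * (1.0 + 2.0**-8)    # stand-in for a ~8-bit estimate

FLOAT32_EPS = 2.0**-24  # half the spacing of Float32 values just above 1.0
one_step_from_14 = relative_error(x, refine_rsqrt(x, seed_14_bit, 1))
one_step_from_8 = relative_error(x, refine_rsqrt(x, seed_8_bit, 1))
two_steps_from_8 = relative_error(x, refine_rsqrt(x, seed_8_bit, 2))
```

With these seeds, one step from 14 bits and two steps from 8 bits both land below `FLOAT32_EPS`, while a single step from 8 bits does not.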

E2M3 and E3M2 can outperform E4M3 and E5M2. 6-bit MX formats can be scaled to exact integers, enabling integer accumulation that avoids E5M2's catastrophic cancellation risk. This works because E2M3's narrower exponent range means every representable value maps to an integer after a fixed shift — no rounding, no cancellation. See Mini-Floats for a worked example.

Every such decision — saturation thresholds, Newton-Raphson iteration counts, integer vs floating-point paths — is documented per operation and per type in the module-specific READMEs.

Calling Convention & Error Handling

NumKong never throws exceptions, never sets errno, and never calls setjmp/longjmp — exceptions bloat call sites with unwind tables and are invisible to C, Python, Rust, Swift, Go, and JavaScript FFI; errno is thread-local state whose storage model varies across C runtimes. Instead, every function takes inputs as const pointers, writes outputs through caller-provided pointers, and returns void:

void nk_dot_f32(nk_f32_t const *a, nk_f32_t const *b, nk_size_t n, nk_f64_t *result);
void nk_dot_bf16(nk_bf16_t const *a, nk_bf16_t const *b, nk_size_t n, nk_f32_t *result);

Pointers eliminate implicit casts for types with platform-dependent storage — this is why they matter for half-precision types. nk_f16_t and nk_bf16_t resolve to native __fp16 / __bf16 when available but fall back to unsigned short otherwise — if passed by value, the compiler would silently apply integer promotion instead of preserving the bit pattern. Passing by pointer keeps the representation opaque: kernels read raw and convert explicitly when needed, so the same binary works regardless of whether the compiler understands _Float16.

The only place that requires error signaling is dynamic dispatch — looking up the best kernel for the current CPU at runtime. When no kernel matches, the dispatcher sets the capabilities mask to zero and fills the function pointer with a family-specific error stub such as nk_error_dense_ from c/dispatch.h and c/numkong.c that writes 0xFF into the output — NaN for floats, −1 for signed integers, TYPE_MAX for unsigned.
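Why an all-ones fill works as a universal error marker can be verified with the standard `struct` module, which reinterprets raw bytes under a given format. The helper name `decode_error_fill` is hypothetical.

```python
import math
import struct

def decode_error_fill(fmt: str):
    """Interpret a buffer of 0xFF bytes under the given struct format."""
    size = struct.calcsize(fmt)
    return struct.unpack(fmt, b"\xff" * size)[0]

as_f16 = decode_error_fill("<e")  # all exponent bits set, non-zero mantissa
as_f32 = decode_error_fill("<f")
as_f64 = decode_error_fill("<d")
as_i32 = decode_error_fill("<i")  # two's complement: every bit set is -1
as_u32 = decode_error_fill("<I")  # unsigned: every bit set is the maximum
```

The three floating-point reads are NaN, the signed read is −1, and the unsigned read is 2³² − 1, so one byte pattern yields a recognizable sentinel for every output type.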

Compile-Time and Run-Time Dispatch

NumKong provides two dispatch mechanisms. Compile-time dispatch selects the fastest kernel supported by the target platform at build time — thinner binaries, no indirection overhead, but requires knowing your deployment hardware. Run-time dispatch compiles every supported kernel into the binary and picks the best one on the target machine via nk_capabilities() — one pointer indirection per call, but a single binary runs everywhere. The run-time path is common in DBMS products (ClickHouse), web browsers (Chromium), and other upstream projects that ship to heterogeneous fleets.

All kernel names follow the pattern nk_{operation}_{type}_{backend}. If you need to resolve the best kernel manually, use nk_find_kernel_punned with a nk_kernel_kind_t, nk_dtype_t, and a viable capabilities mask:

nk_metric_dense_punned_t angular = 0;
nk_capability_t used = nk_cap_serial_k;
nk_find_kernel_punned(
    nk_kernel_angular_k, nk_f32_k,            // what functionality? for which input type?
    nk_capabilities(),                        // which capabilities are viable?
    (nk_kernel_punned_t *)&angular, &used);   // the kernel found and capabilities used!

The first call to nk_capabilities() initializes the dispatch table; all subsequent calls are lock-free.
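The lookup logic can be pictured as a walk over candidates ordered from most to least specialized, returning the first one whose required capability bit is present. The following toy model is only a sketch of that idea: the capability constants, the registry, and `find_kernel` are hypothetical, and the real `nk_find_kernel_punned` works on C function pointers rather than names.

```python
# Hypothetical capability bits, one per backend
CAP_SERIAL, CAP_HASWELL, CAP_SKYLAKE, CAP_ICELAKE = 1, 2, 4, 8

# Hypothetical registry: candidates listed from most to least specialized.
# There is no Ice Lake entry for bf16 dot products, so such CPUs use Skylake's.
KERNELS = {
    ("dot", "bf16"): [
        (CAP_SKYLAKE, "dot_bf16_skylake"),
        (CAP_HASWELL, "dot_bf16_haswell"),
        (CAP_SERIAL, "dot_bf16_serial"),
    ],
}

def find_kernel(operation: str, dtype: str, capabilities: int):
    """Return (kernel name, capability used), or (None, 0) when nothing matches."""
    for required, name in KERNELS.get((operation, dtype), []):
        if capabilities & required:
            return name, required
    return None, 0

ice_lake_cpu = CAP_SERIAL | CAP_HASWELL | CAP_SKYLAKE | CAP_ICELAKE
chosen, used = find_kernel("dot", "bf16", ice_lake_cpu)
```

On the simulated Ice Lake machine the lookup settles on the Skylake variant, and a mask containing only the serial bit falls back to the serial kernel.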

Numeric Types

Float64 & Float32: IEEE Precision

Float64 — NumKong uses compensated summation that tracks numerical errors separately. On serial paths, we use Neumaier's algorithm (1974), an improvement over Kahan-Babuška that correctly handles cases where added terms are larger than the running sum, achieving $O(1)$ error growth instead of $O(n)$. On SIMD paths with FMA support, we implement the Dot2 algorithm (Ogita-Rump-Oishi, 2005), maintaining separate error compensators for both multiplication and accumulation via TwoProd and TwoSum operations. The accuracy differences are visible in the benchmark tables above — compensated Float64 suits scientific computing where numerical stability matters more than raw speed.

Float32 — SIMD implementations load Float32 values, upcast to Float64 for full-precision multiplication and accumulation, then downcast only during finalization. This avoids catastrophic cancellation at minimal cost since modern CPUs have dedicated Float64 vector units operating at nearly the same throughput as Float32. The same compensated accumulation strategy applies to Mahalanobis distance, bilinear forms, and KL/JS divergences.

// Dot2 TwoProd: capture the multiplication rounding error
double h = a * b;
double r = fma(a, b, -h);  // exact rounding error of the product

// Dot2 TwoSum: capture the addition rounding error for operands of any magnitude
double t = sum + h;
double z = t - sum;
double e = (sum - (t - z)) + (h - z);  // compensator term
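The difference between Kahan's original scheme and Neumaier's variant shows up on a four-element input. Both functions below are textbook formulations written for illustration; they are not NumKong's kernels.

```python
def kahan_sum(values) -> float:
    """Kahan summation: assumes the running total dominates each new term."""
    total = compensation = 0.0
    for value in values:
        y = value - compensation
        t = total + y
        compensation = (t - total) - y
        total = t
    return total

def neumaier_sum(values) -> float:
    """Neumaier's variant: picks the larger operand before extracting the error."""
    total = compensation = 0.0
    for value in values:
        t = total + value
        if abs(total) >= abs(value):
            compensation += (total - t) + value  # low bits of `value` were lost
        else:
            compensation += (value - t) + total  # low bits of `total` were lost
        total = t
    return total + compensation

values = [1.0, 1e100, 1.0, -1e100]  # the exact sum is 2.0
```

On this input Kahan summation returns 0.0 because the added term is larger than the running sum, while Neumaier's variant returns the exact 2.0.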

BFloat16 & Float16: Half Precision

BFloat16 — not an IEEE 754 standard type, but widely adopted for AI workloads. BFloat16 shares Float32's 8-bit exponent but truncates the mantissa to 7 bits, prioritizing dynamic range over precision (±3.4×10³⁸ with coarser granularity). On old CPUs, upcasting BFloat16 to Float32 requires just an unpack and left-shift by 16 bits (essentially free); on newer CPUs, both Arm and x86 provide widening mixed-precision dot products via DPBF16PS (AVX-512 on Genoa/Sapphire Rapids) and BFDOT (NEON on ARMv8.6-A Graviton 3+). NumKong's Float8 types (E4M3/E5M2) upcast to BFloat16 before using DPBF16PS, creating a three-tier precision hierarchy: Float8 for storage, BFloat16 for compute, Float32 for accumulation.

Float16 — IEEE 754 half-precision with 1 sign bit, 5 exponent bits (bias=15), and 10 mantissa bits, giving a range of ±65504. Float16 prioritizes precision over range (10 vs 7 mantissa bits), making it better suited for values near zero and gradients during training. On x86, older CPUs use F16C extensions (Ivy Bridge+) for fast Float16 → Float32 conversion; Sapphire Rapids+ adds native AVX-512-FP16 with dedicated Float16 arithmetic. On Arm, ARMv8.4-A adds FMLAL/FMLAL2 instructions for fused Float16 → Float32 widening multiply-accumulate, reducing the total latency from 7 cycles to 4 cycles and achieving 20–48% speedup over the separate convert-then-FMA path.

Platform	BFloat16 Path	Elem/Op	Float16 Path	Elem/Op
x86
Sapphire Rapids (2023)	↓ Genoa	32	↓ Skylake	16
Genoa (2022)	`VDPBF16PS` widening dot	32	↓ Skylake	16
Skylake (2015)	`SLLI` + `VFMADD`	16	`VCVTPH2PS` + `VFMADD`	16
Haswell (2013)	`SLLI` + `VFMADD`	8	`VCVTPH2PS` + `VFMADD`	8
Arm
Graviton 3 (2021)	`SVBFDOT` widening dot	4–32	`SVCVT` → `SVFMLA`	4–32
Apple M2+ (2022)	`BFDOT` widening dot	8	↓ FP16FML	8
Apple M1 (2020)	↓ NEON	8	`FMLAL` widening FMA	8
Graviton 2 (2019)	↓ NEON	8	`FCVTL` + `FMLA`	4
Graviton 1 (2018)	`SHLL` + `FMLA`	8	bit-manip → `FMLA`	8

BFloat16 shares Float32's 8-bit exponent, so upcasting is a 16-bit left shift (SLLI on x86, SHLL on Arm) that zero-pads the truncated mantissa — essentially free. Float16 has a different exponent width (5 vs 8 bits), requiring a dedicated convert: VCVTPH2PS (x86 F16C) or FCVTL (Arm NEON). Widening dot products (VDPBF16PS, BFDOT, FMLAL) fuse the conversion and multiply-accumulate into one instruction. Sapphire Rapids has native VFMADDPH for Float16 arithmetic, but NumKong does not use it for general dot products — Float16 accumulation loses precision. It is only used for mini-float (E2M3/E3M2) paths where periodic flush-to-Float32 windows keep error bounded.
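The asymmetry between the two upcasts can be shown on raw bit patterns with the standard `struct` module. The function names are hypothetical, and the downcast shown here truncates instead of rounding, which is a simplification.

```python
import struct

def bf16_bits_to_float(bits: int) -> float:
    """Upcast BFloat16 by placing its 16 bits in the top half of a Float32 word."""
    return struct.unpack("<f", struct.pack("<I", (bits & 0xFFFF) << 16))[0]

def float_to_bf16_bits(value: float) -> int:
    """Truncating downcast (no rounding): keep the top 16 bits of the Float32 pattern."""
    return struct.unpack("<I", struct.pack("<f", value))[0] >> 16

def f16_bits_to_float(bits: int) -> float:
    """Float16 needs a real conversion because its exponent is 5 bits wide, not 8."""
    return struct.unpack("<e", struct.pack("<H", bits & 0xFFFF))[0]
```

For instance, the BFloat16 pattern `0x4049` decodes to 3.140625, and the Float16 pattern `0x7BFF` decodes to the format's maximum of 65504.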

Mini-Floats: E4M3, E5M2, E3M2, & E2M3

Format	Bits	Range	NumKong Promotion Rules	Support in GPUs
E5M2FN	8	±57344	BFloat16 → Float32	H100+, MI300+
E4M3FN	8	±448	BFloat16 → Float32	H100+, MI300+
E3M2FN	6 → 8	±28	BFloat16 & Float16 → Float32, Int16 → Int32	only block-scaled
E2M3FN	6 → 8	±7.5	BFloat16 & Float16 → Float32, Int8 → Int32	only block-scaled
Block-scaled NVFP4	4	±6	—	B200+
Block-scaled MXFP4 / E2M1	4	±6	—	B200+, MI325+

Block scaling. NumKong does not implement block-scaled variants (MXFP4, NVFP4, or block-scaled E3M2/E2M3). Block scaling couples elements through a shared exponent per block, introducing structural bias into a fundamentally uniform operation. NumKong treats each element independently; block-scaled inputs should be dequantized before processing.

FNUZ variants. AMD MI300 (CDNA 3) uses FNUZ encoding (negative-zero-is-NaN) rather than the OCP standard. MI350+ and NVIDIA H100/B200 both use OCP-standard E4M3FN/E5M2FN. NumKong follows the OCP convention; FNUZ inputs require conversion before processing.

8-bit floats (E4M3 & E5M2) follow the OCP FP8 standard. E4M3FN (no infinities, NaN only) is preferred for training where precision near zero matters; E5M2FN (with infinities) provides wider dynamic range for inference. On x86 Genoa/Sapphire Rapids, E4M3/E5M2 values upcast to BFloat16 via lookup tables, then use native DPBF16PS for 2-per-lane dot products accumulating to Float32. On Arm Graviton 3+, the same BFloat16 upcast happens via NEON table lookups, then BFDOT instructions complete the computation.

6-bit floats (E3M2 & E2M3) follow the OCP MX v1.0 standard. Their smaller range allows scaling to exact integers that fit in i8/i16, enabling integer VPDPBUSD/SDOT accumulation instead of the floating-point pipeline. Float16 can also serve as an accumulator, accurately representing ~50 products of E3M2FN pairs or ~20 products of E2M3FN pairs before overflow. On Arm, NEON FHM extensions bring widening FMLAL dot-products for Float16 — both faster and more widely available than BFDOT for BFloat16.
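The "scale to exact integers" property can be verified by enumerating all 64 E2M3 codes. The decoder below assumes the OCP MX layout of 1 sign bit, 2 exponent bits with bias 1, and 3 mantissa bits, with no NaN or infinity encodings. The function names and the scale constant are illustrative, not NumKong's internals.

```python
def decode_e2m3(bits: int) -> float:
    """Decode a 6-bit E2M3 value: 1 sign, 2 exponent (bias 1), 3 mantissa bits."""
    sign = -1.0 if bits & 0x20 else 1.0
    exponent, mantissa = (bits >> 3) & 0x3, bits & 0x7
    if exponent == 0:  # subnormal: no implicit leading one
        return sign * mantissa / 8.0
    return sign * (1.0 + mantissa / 8.0) * 2.0 ** (exponent - 1)

SCALE = 8  # every E2M3 value is a multiple of 1/8

def dot_e2m3_integer(a_bits, b_bits) -> float:
    """Dot product through exact integer accumulation, rescaled once at the end."""
    scaled = sum(
        int(decode_e2m3(a) * SCALE) * int(decode_e2m3(b) * SCALE)
        for a, b in zip(a_bits, b_bits)
    )
    return scaled / (SCALE * SCALE)

all_values = [decode_e2m3(code) for code in range(64)]
largest_scaled = max(int(v * SCALE) for v in all_values)
```

Every scaled value is an integer of magnitude at most 60, which fits in Int8, so products accumulate exactly in Int32 with no rounding until the final division.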

E4M3 and E5M2 cannot use the integer path. E4M3 scaled by 16 reaches 7,680 — too large for Int8, barely fitting Int16 with a 128-entry table. E5M2's range (±57,344) makes the scaled product exceed Int32 entirely. Without the integer path, E5M2 falls back to Float32 accumulation — where its 2-bit mantissa (only 4 values per binade) creates a catastrophic cancellation risk that E2M3's integer path avoids completely:

	i = 0	i = 1	i = 2	i = 3	i = 4	i = 5	i = 6
aᵢ	0.00122	20480	−0.00122	1.5	−3072	−640	0.00146
bᵢ	−40	320	−1280	−7.63×10⁻⁵	0.000427	10240	−4.58×10⁻⁵
aᵢ·bᵢ	−0.04883	6553600	1.5625	−0.000114	−1.3125	−6553600	≈ 0

Why Float32 accumulation fails here. The accurate sum of these 7 products is ≈ 0.201. A vfmaq_f32 call accumulates 4 lanes at a time; the first batch already carries values around ±6.5 M. At that magnitude the Float32 ULP is 0.5, so the small meaningful terms (−0.049, 1.563, −1.313, −0.0001) are either below one ULP or only a few ULPs wide, and they get absorbed or rounded away once they meet the large partial sums. The large terms then cancel exactly to zero, and the information is gone. Final Float32 result: ≈ 0.0 instead of 0.201.
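The failure can be replayed in pure Python by rounding a running sum to Float32 after every step. Two assumptions are baked in: the exact binary values below are a reading of the rounded decimals in the table as E5M2-representable numbers, and accumulation is strictly sequential in index order, which simplifies the lane-wise picture above.

```python
import struct
from fractions import Fraction

def f32(x: float) -> float:
    """Round a Python float (Float64) to the nearest Float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

# Assumed exact E5M2 values behind the rounded table entries
a = [1.25 * 2**-10, 20480.0, -1.25 * 2**-10, 1.5, -3072.0, -640.0, 1.5 * 2**-10]
b = [-40.0, 320.0, -1280.0, -1.25 * 2**-14, 1.75 * 2**-12, 10240.0, -1.5 * 2**-15]

# Reference: exact rational arithmetic, no rounding anywhere
exact = float(sum(Fraction(x) * Fraction(y) for x, y in zip(a, b)))

# Emulated fused multiply-add chain: exact product, one Float32 rounding per step
running = 0.0
for x, y in zip(a, b):
    running = f32(running + x * y)
```

`exact` comes out near 0.201, while `running` ends up within rounding noise of zero: the small terms are rounded away while the partial sum sits near 6.5 M, and the two large products then cancel.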

Int8 & Int4: Integer Types

Both signed and unsigned 8-bit and 4-bit integers are supported with Int32 accumulation to prevent overflow. A notable optimization is the VNNI algebraic transform: on Ice Lake+ with AVX-512 VNNI, the native DPBUSD instruction is asymmetric (unsigned × signed → signed), but NumKong uses it for both Int8×Int8 and UInt8×UInt8. For signed Int8×Int8, we convert the signed operand to unsigned via XOR with 0x80, compute DPBUSD(a⊕0x80, b) = (a+128)×b, then subtract a correction term 128×sum(b) to recover the true result. For unsigned UInt8×UInt8, we XOR the second operand to make it signed, compute DPBUSD(a, b⊕0x80) = a×(b-128), then add correction 128×sum(a) via the fast SAD instruction.

Int4 values pack two nibbles per byte, requiring bitmask extraction: low nibbles (byte & 0x0F) and high nibbles (byte >> 4). For signed Int4, the transformation (nibble ⊕ 8) - 8 maps the unsigned range [0,15] to signed range [−8,7]. Separate accumulators for low and high nibbles avoid expensive nibble-interleaving and allow SIMD lanes to work in parallel.

// Asymmetric transform for i8×i8 using DPBUSD (unsigned×signed)
a_unsigned = a XOR 0x80;           // Convert signed→unsigned
result = DPBUSD(a_unsigned, b);    // Computes (a+128)×b
correction = 128 * sum(b);         // Parallel on different port
final = result - correction;       // True a×b value
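Both algebraic corrections, along with the Int4 nibble decoding, can be checked against plain integer arithmetic. In this sketch `dpbusd` is a scalar model of the instruction's semantics (unsigned × signed, summed), and all function names are hypothetical.

```python
def dpbusd(unsigned_bytes, signed_bytes) -> int:
    """Scalar model of the asymmetric dot product: unsigned 8-bit x signed 8-bit."""
    assert all(0 <= u <= 255 for u in unsigned_bytes)
    assert all(-128 <= s <= 127 for s in signed_bytes)
    return sum(u * s for u, s in zip(unsigned_bytes, signed_bytes))

def as_signed(byte: int) -> int:
    """Reinterpret an unsigned byte as two's-complement signed."""
    return byte - 256 if byte >= 128 else byte

def dot_i8(a, b) -> int:
    """Signed x signed: bias `a` into unsigned range, then subtract 128 * sum(b)."""
    a_unsigned = [(x & 0xFF) ^ 0x80 for x in a]  # equals x + 128
    return dpbusd(a_unsigned, b) - 128 * sum(b)

def dot_u8(a, b) -> int:
    """Unsigned x unsigned: bias `b` into signed range, then add 128 * sum(a)."""
    b_signed = [as_signed(y ^ 0x80) for y in b]  # equals y - 128
    return dpbusd(a, b_signed) + 128 * sum(a)

def unpack_i4(byte: int):
    """Split one byte into (low, high) signed 4-bit values via (nibble ^ 8) - 8."""
    low, high = byte & 0x0F, byte >> 4
    return (low ^ 8) - 8, (high ^ 8) - 8
```

The extreme inputs (−128, 127, 0, 255) are the interesting cases, since they are where a wrong bias or a missing correction term would show up first.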

Binary: Packed Bits

The u1x8 type packs 8 binary values per byte, enabling Hamming distance and Jaccard similarity via population-count instructions. On x86, VPOPCNTDQ (Ice Lake+) counts set bits in 512-bit registers directly; on Arm, CNT (NEON) operates on 8-bit lanes with a horizontal add. Results accumulate into u32 — sufficient for vectors up to 4 billion bits. Binary representations are the most compact option for locality-sensitive hashing and binary neural network inference.
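A scalar reference for the packed-bit metrics is short enough to read in full. It counts bits one byte at a time, where the SIMD kernels would process whole registers. The function names are hypothetical, and treating the similarity of two all-zero inputs as 1.0 is an arbitrary convention chosen here.

```python
def popcount(byte: int) -> int:
    """Number of set bits in a small non-negative integer."""
    return bin(byte).count("1")

def hamming_packed(a: bytes, b: bytes) -> int:
    """Count differing bits between two equally long packed bit-vectors."""
    return sum(popcount(x ^ y) for x, y in zip(a, b))

def jaccard_similarity_packed(a: bytes, b: bytes) -> float:
    """Intersection over union of the set bits."""
    intersection = sum(popcount(x & y) for x, y in zip(a, b))
    union = sum(popcount(x | y) for x, y in zip(a, b))
    return intersection / union if union else 1.0

first = bytes([0b11110000, 0b00001111])
second = bytes([0b11000000, 0b00001111])
```

These two 16-bit vectors differ in 2 bits and share 6 of the 8 bits set in either, giving a Jaccard similarity of 0.75.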

Complex Types

NumKong supports four complex types — f16c, bf16c, f32c, and f64c — stored as interleaved real/imaginary pairs. Complex types are essential in quantum simulation (state vectors, density matrices), signal processing (FFT coefficients, filter design), and electromagnetic modeling. The dot operation computes the unconjugated dot product $\sum a_k b_k$, while vdot computes the conjugated inner product $\sum \bar{a}_k b_k$ standard in physics and signal processing.

For complex dot products, NumKong defers sign flips until after the accumulation loop: instead of using separate FMA and FMS (fused multiply-subtract) instructions for the real component, we compute $a_r b_r + a_i b_i$ treating all products as positive, then apply a single bitwise XOR with 0x80000000 to flip the sign bits. This avoids execution port contention between FMA and FMS, letting dual FMA units stay occupied.

for (...) { // Complex multiply optimization: XOR sign flip after the loop
    sum_real = fma(a, b, sum_real);   // No sign flip in loop
    sum_imag = fma(a, b_swapped, sum_imag);
}
sum_real = xor(sum_real, 0x80000000);  // Single XOR after loop
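A scalar sketch of the deferred sign flip follows, assuming interleaved `[re, im, re, im, ...]` storage. The two kinds of real-part products are accumulated separately as positive values, and the subtraction happens once after the loop, playing the role of the single XOR. The function name is hypothetical, and the lane layout of a real SIMD kernel is not modeled.

```python
def complex_dot_deferred(a, b) -> complex:
    """Unconjugated dot product over interleaved [re, im, re, im, ...] sequences."""
    sum_rr = sum_ii = sum_imag = 0.0
    for k in range(0, len(a), 2):
        a_re, a_im, b_re, b_im = a[k], a[k + 1], b[k], b[k + 1]
        sum_rr += a_re * b_re  # accumulated as positive inside the loop
        sum_ii += a_im * b_im  # also positive: its sign is applied later
        sum_imag += a_re * b_im + a_im * b_re
    return complex(sum_rr - sum_ii, sum_imag)  # one deferred sign flip

result = complex_dot_deferred([1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0])
```

For the inputs (1+2i, 3+4i) and (5+6i, 7+8i), `result` matches the sum of the two complex products, −18+68i.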

Reading Materials

Beyond the READMEs in this repository, there are several standalone articles covering different evolution steps and features of this library.

License

Feel free to use the project under Apache 2.0 or the Three-clause BSD license at your preference.

Release history

Version	Release date
7.6.0	Apr 20, 2026
7.5.0	Apr 14, 2026
7.4.5	Apr 6, 2026
7.4.4	Apr 6, 2026
7.4.3	Apr 5, 2026
7.4.2	Apr 5, 2026
7.4.1	Apr 5, 2026
7.4.0	Apr 5, 2026
7.3.0 (this version)	Apr 3, 2026
7.2.4	Mar 29, 2026
7.2.2	Mar 28, 2026
7.2.1	Mar 28, 2026
7.2.0	Mar 28, 2026
7.1.1	Mar 22, 2026
7.1.0	Mar 21, 2026
7.0.0	Mar 17, 2026

Download files

Source Distribution

numkong-7.3.0.tar.gz (1.1 MB), uploaded Apr 3, 2026

Built Distributions

File	Size	Uploaded	Python	Platform
numkong-7.3.0-cp314-cp314t-win_arm64.whl	499.2 kB	Apr 3, 2026	CPython 3.14t	Windows ARM64
numkong-7.3.0-cp314-cp314t-win_amd64.whl	579.5 kB	Apr 3, 2026	CPython 3.14t	Windows x86-64
numkong-7.3.0-cp314-cp314t-musllinux_1_2_x86_64.whl	10.8 MB	Apr 3, 2026	CPython 3.14t	musllinux: musl 1.2+ x86-64
numkong-7.3.0-cp314-cp314t-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl	11.0 MB	Apr 3, 2026	CPython 3.14t	manylinux: glibc 2.17+ x86-64; manylinux: glibc 2.28+ x86-64
numkong-7.3.0-cp314-cp314t-macosx_11_0_arm64.whl	1.1 MB	Apr 3, 2026	CPython 3.14t	macOS 11.0+ ARM64
numkong-7.3.0-cp314-cp314t-macosx_10_15_x86_64.whl	755.0 kB	Apr 3, 2026	CPython 3.14t	macOS 10.15+ x86-64
numkong-7.3.0-cp314-cp314-win_arm64.whl	498.4 kB	Apr 3, 2026	CPython 3.14	Windows ARM64
numkong-7.3.0-cp314-cp314-win_amd64.whl	576.3 kB	Apr 3, 2026	CPython 3.14	Windows x86-64
numkong-7.3.0-cp314-cp314-musllinux_1_2_x86_64.whl	10.7 MB	Apr 3, 2026	CPython 3.14	musllinux: musl 1.2+ x86-64
numkong-7.3.0-cp314-cp314-musllinux_1_2_s390x.whl	2.9 MB	Apr 3, 2026	CPython 3.14	musllinux: musl 1.2+ s390x
numkong-7.3.0-cp314-cp314-musllinux_1_2_ppc64le.whl	3.3 MB	Apr 3, 2026	CPython 3.14	musllinux: musl 1.2+ ppc64le
numkong-7.3.0-cp314-cp314-musllinux_1_2_i686.whl	2.9 MB	Apr 3, 2026	CPython 3.14	musllinux: musl 1.2+ i686
numkong-7.3.0-cp314-cp314-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl	10.9 MB	Apr 3, 2026	CPython 3.14	manylinux: glibc 2.17+ x86-64; manylinux: glibc 2.28+ x86-64
numkong-7.3.0-cp314-cp314-manylinux2014_s390x.manylinux_2_17_s390x.manylinux_2_28_s390x.whl	3.2 MB	Apr 3, 2026	CPython 3.14	manylinux: glibc 2.17+ s390x; manylinux: glibc 2.28+ s390x
numkong-7.3.0-cp314-cp314-manylinux2014_ppc64le.manylinux_2_17_ppc64le.manylinux_2_28_ppc64le.whl	3.3 MB	Apr 3, 2026	CPython 3.14	manylinux: glibc 2.17+ ppc64le; manylinux: glibc 2.28+ ppc64le
numkong-7.3.0-cp314-cp314-manylinux1_i686.manylinux_2_28_i686.manylinux_2_5_i686.whl	2.8 MB	Apr 3, 2026	CPython 3.14	manylinux: glibc 2.28+ i686; manylinux: glibc 2.5+ i686
numkong-7.3.0-cp314-cp314-macosx_11_0_arm64.whl	1.1 MB	Apr 3, 2026	CPython 3.14	macOS 11.0+ ARM64
numkong-7.3.0-cp314-cp314-macosx_10_15_x86_64.whl	753.1 kB	Apr 3, 2026	CPython 3.14	macOS 10.15+ x86-64
numkong-7.3.0-cp313-cp313t-win_arm64.whl	475.9 kB	Apr 3, 2026	CPython 3.13t	Windows ARM64
numkong-7.3.0-cp313-cp313t-win_amd64.whl	559.9 kB	Apr 3, 2026	CPython 3.13t	Windows x86-64
numkong-7.3.0-cp313-cp313t-musllinux_1_2_x86_64.whl	10.8 MB	Apr 3, 2026	CPython 3.13t	musllinux: musl 1.2+ x86-64
numkong-7.3.0-cp313-cp313t-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl	11.0 MB	Apr 3, 2026	CPython 3.13t	manylinux: glibc 2.17+ x86-64; manylinux: glibc 2.28+ x86-64
numkong-7.3.0-cp313-cp313t-macosx_11_0_arm64.whl	1.1 MB	Apr 3, 2026	CPython 3.13t	macOS 11.0+ ARM64
numkong-7.3.0-cp313-cp313t-macosx_10_13_x86_64.whl	754.9 kB	Apr 3, 2026	CPython 3.13t	macOS 10.13+ x86-64
numkong-7.3.0-cp313-cp313-win_arm64.whl	475.4 kB	Apr 3, 2026	CPython 3.13	Windows ARM64
numkong-7.3.0-cp313-cp313-win_amd64.whl	557.4 kB	Apr 3, 2026	CPython 3.13	Windows x86-64
numkong-7.3.0-cp313-cp313-musllinux_1_2_x86_64.whl	10.7 MB	Apr 3, 2026	CPython 3.13	musllinux: musl 1.2+ x86-64
numkong-7.3.0-cp313-cp313-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl	10.9 MB	Apr 3, 2026	CPython 3.13	manylinux: glibc 2.17+ x86-64; manylinux: glibc 2.28+ x86-64
numkong-7.3.0-cp313-cp313-macosx_11_0_arm64.whl	1.1 MB	Apr 3, 2026	CPython 3.13	macOS 11.0+ ARM64
numkong-7.3.0-cp313-cp313-macosx_10_13_x86_64.whl	752.9 kB	Apr 3, 2026	CPython 3.13	macOS 10.13+ x86-64
numkong-7.3.0-cp312-cp312-win_arm64.whl	475.3 kB	Apr 3, 2026	CPython 3.12	Windows ARM64
numkong-7.3.0-cp312-cp312-win_amd64.whl	557.4 kB	Apr 3, 2026	CPython 3.12	Windows x86-64
numkong-7.3.0-cp312-cp312-musllinux_1_2_x86_64.whl	10.7 MB	Apr 3, 2026	CPython 3.12	musllinux: musl 1.2+ x86-64
numkong-7.3.0-cp312-cp312-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl	10.9 MB	Apr 3, 2026	CPython 3.12	manylinux: glibc 2.17+ x86-64; manylinux: glibc 2.28+ x86-64
numkong-7.3.0-cp312-cp312-macosx_11_0_arm64.whl	1.1 MB	Apr 3, 2026	CPython 3.12	macOS 11.0+ ARM64
numkong-7.3.0-cp312-cp312-macosx_10_13_x86_64.whl	752.9 kB	Apr 3, 2026	CPython 3.12	macOS 10.13+ x86-64
numkong-7.3.0-cp311-cp311-win_arm64.whl	475.2 kB	Apr 3, 2026	CPython 3.11	Windows ARM64
numkong-7.3.0-cp311-cp311-win_amd64.whl	556.9 kB	Apr 3, 2026	CPython 3.11	Windows x86-64
numkong-7.3.0-cp311-cp311-musllinux_1_2_x86_64.whl	10.7 MB	Apr 3, 2026	CPython 3.11	musllinux: musl 1.2+ x86-64
numkong-7.3.0-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl	11.0 MB	Apr 3, 2026	CPython 3.11	manylinux: glibc 2.17+ x86-64; manylinux: glibc 2.28+ x86-64
numkong-7.3.0-cp311-cp311-macosx_11_0_arm64.whl	1.1 MB	Apr 3, 2026	CPython 3.11	macOS 11.0+ ARM64
numkong-7.3.0-cp311-cp311-macosx_10_9_x86_64.whl	758.1 kB	Apr 3, 2026	CPython 3.11	macOS 10.9+ x86-64
numkong-7.3.0-cp310-cp310-win_arm64.whl	475.3 kB	Apr 3, 2026	CPython 3.10	Windows ARM64
numkong-7.3.0-cp310-cp310-win_amd64.whl	557.0 kB	Apr 3, 2026	CPython 3.10	Windows x86-64
numkong-7.3.0-cp310-cp310-musllinux_1_2_x86_64.whl	10.7 MB	Apr 3, 2026	CPython 3.10	musllinux: musl 1.2+ x86-64
numkong-7.3.0-cp310-cp310-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl	10.9 MB	Apr 3, 2026	CPython 3.10	manylinux: glibc 2.17+ x86-64; manylinux: glibc 2.28+ x86-64
numkong-7.3.0-cp310-cp310-macosx_11_0_arm64.whl	1.1 MB	Apr 3, 2026	CPython 3.10	macOS 11.0+ ARM64
numkong-7.3.0-cp310-cp310-macosx_10_9_x86_64.whl	758.2 kB	Apr 3, 2026	CPython 3.10	macOS 10.9+ x86-64

File details

Details for the file numkong-7.3.0.tar.gz.

File metadata

Download URL: numkong-7.3.0.tar.gz
Upload date: Apr 3, 2026
Size: 1.1 MB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for numkong-7.3.0.tar.gz
Algorithm	Hash digest
SHA256	`c1c92bf190a5282d34fdff68be12f9ab283c830755ffeb4dbcd5809913156907`
MD5	`8e4654900d91456bc58eb064aa5a7953`
BLAKE2b-256	`90979e37c3d28bb151c20e8e0f21a3003c5010a5396d8eb5047bfa08663a95e8`
