A Python package for Llama CPP.

Llama CPP

This is a Python package for Llama CPP (https://github.com/ggml-org/llama.cpp).

Installation

You can install the pre-built wheel from the releases page or build it from source.

pip install llama-cpp-pydist

Usage

This section provides a basic overview of how to use the llama_cpp_pydist library.

Deploying Windows Binaries

If you are on Windows, the package attempts to automatically deploy pre-compiled binaries. You can also manually trigger this process.

from llama_cpp import deploy_windows_binary

# Specify the target directory for the binaries
# This is typically within your Python environment's site-packages
# or a custom location if you prefer.
target_dir = "./my_llama_cpp_binaries" 

if deploy_windows_binary(target_dir):
    print(f"Windows binaries deployed successfully to {target_dir}")
else:
    print("Failed to deploy Windows binaries, or no binaries were found for your system.")

# Once deployed, you would typically add the directory containing llama.dll (or similar)
# to your system's PATH or ensure your application can find it.
# For example, if llama.dll is in target_dir/bin:
# import os
# os.environ["PATH"] += os.pathsep + os.path.join(target_dir, "bin")
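Since Python 3.8, Windows no longer consults PATH when resolving DLL dependencies of extension modules or of libraries loaded through ctypes, so extending PATH alone may not be enough. The sketch below shows one way to register the directory for both lookups. `register_binary_dir` is a hypothetical helper, not part of this package, and the `bin` subdirectory is the same assumption made in the comment above.

```python
import os
import sys


def register_binary_dir(target_dir, subdir="bin"):
    """Make target_dir/subdir visible to DLL lookups in this process.

    Hypothetical helper: the "bin" layout is an assumption, adjust as needed.
    """
    bin_dir = os.path.abspath(os.path.join(target_dir, subdir))

    # Prepend to PATH so child processes and PATH-based lookups can find the DLLs.
    entries = [p for p in os.environ.get("PATH", "").split(os.pathsep) if p]
    if bin_dir not in entries:
        os.environ["PATH"] = os.pathsep.join([bin_dir] + entries)

    # os.add_dll_directory exists only on Windows (Python 3.8+) and
    # requires the directory to exist.
    if sys.platform == "win32" and os.path.isdir(bin_dir):
        os.add_dll_directory(bin_dir)

    return bin_dir


if __name__ == "__main__":
    print(register_binary_dir("./my_llama_cpp_binaries"))
```

On other platforms the function only updates PATH, so it is safe to call unconditionally.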

Conversion Library Installation

To perform Hugging Face to GGUF model conversions, you need to install additional Python libraries. You can install them via pip:

pip install transformers numpy torch safetensors sentencepiece

Alternatively, you can install them programmatically in Python:

from llama_cpp.install_conversion_libs import install_conversion_libs

if install_conversion_libs():
    print("Conversion libraries installed successfully.")
else:
    print("Failed to install conversion libraries.")
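Before installing anything, it can help to see which of these libraries are already importable. The following is a minimal sketch using only the standard library; `missing_conversion_libs` is a hypothetical helper and not part of this package. It assumes that each pip package name listed above matches its import name.

```python
import importlib.util

# Import names of the libraries listed in the pip command above.
CONVERSION_MODULES = ("transformers", "numpy", "torch", "safetensors", "sentencepiece")


def missing_conversion_libs(modules=CONVERSION_MODULES):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_conversion_libs()
    if missing:
        print("Missing conversion libraries: " + ", ".join(missing))
    else:
        print("All conversion libraries are available.")
```

The check only looks up each module without importing it, so it stays fast even for heavy packages such as torch.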

Converting Hugging Face Models to GGUF

This package provides a utility to convert Hugging Face models (including those using Safetensors) into the GGUF format, which is used by llama.cpp. This process leverages the conversion scripts from the underlying llama.cpp submodule.

1. Install Conversion Libraries:

Before converting models, ensure you have the necessary Python libraries. You can install them using a helper function:

from llama_cpp import install_conversion_libs

if install_conversion_libs():
    print("Conversion libraries installed successfully.")
else:
    print("Failed to install conversion libraries. Please check the output for errors.")

2. Convert the Model:

Once the dependencies are installed, you can use the convert_hf_to_gguf function:

from llama_cpp import convert_hf_to_gguf

# Specify the Hugging Face model name or local path
model_name_or_path = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # Example: A small model from Hugging Face Hub
# Or, a local path: model_name_or_path = "/path/to/your/hf_model_directory"

output_directory = "./converted_gguf_models" # Directory to save the GGUF file
output_filename = "tinyllama_1.1b_chat_q8_0.gguf" # Optional: specify a filename
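# Example: 8-bit quantization. The upstream convert_hf_to_gguf.py script accepts
# "f32", "f16", "bf16", "q8_0", "tq1_0", "tq2_0", and "auto"; K-quants such as
# "q4_K_M" normally need a separate llama-quantize step.
quantization_type = "q8_0"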

print(f"Starting conversion for model: {model_name_or_path}")
success, result_message = convert_hf_to_gguf(
    model_path_or_name=model_name_or_path,
    output_dir=output_directory,
    output_filename=output_filename, # Can be None to auto-generate
    outtype=quantization_type
)

if success:
    print(f"Model converted successfully! GGUF file saved at: {result_message}")
else:
    print(f"Model conversion failed: {result_message}")

# The `result_message` will contain the path to the GGUF file on success,
# or an error message on failure.

This function will download the model from Hugging Face Hub if a model name is provided and it's not already cached locally by Hugging Face transformers. It then invokes the convert_hf_to_gguf.py script from llama.cpp.
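After a conversion, a quick sanity check is to confirm that the output really starts with a GGUF header. A GGUF file begins with the four magic bytes `GGUF` followed by a 32-bit format version. The sketch below reads just those eight bytes; `read_gguf_version` is a hypothetical helper, not part of this package, and it assumes a little-endian file.

```python
import struct


def read_gguf_version(path):
    """Return the GGUF format version, or raise ValueError if the magic is wrong."""
    with open(path, "rb") as f:
        header = f.read(8)
    if len(header) < 8 or header[:4] != b"GGUF":
        raise ValueError(f"{path} does not look like a GGUF file")
    # Assumption: little-endian layout, so the version is an unsigned 32-bit LE int.
    (version,) = struct.unpack("<I", header[4:8])
    return version
```

It could be called on the path returned in `result_message` when `success` is true. This only validates the header, not the tensors.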

For more detailed examples and advanced usage, please refer to the documentation of the underlying llama.cpp project and explore the examples provided there.

Building and Development

For instructions on how to build the package from source, update the llama.cpp submodule, or other development-related tasks, please see BUILDING.md.

Changelog

2026-04-03: Update to llama.cpp b8646

Summary

Updated llama.cpp from b8635 to b8646, incorporating 10 upstream commits with new features and performance improvements.

Notable Changes

🆕 New Features

  • b8635: Relax prefill parser to allow space. (#21240)
    • As in title.
    • The prefill parser strictly required the reasoning marker at the very start of the message, which interfered with models that like to insert, e.g., a newline there.
  • b8639: ggml-webgpu: add vectorized flash attention (#20709)
    • This PR adds a vectorized WebGPU path for FLASH_ATTN_EXT.
    • The implementation follows a split pipeline:
    • blk: optional mask tile classification
  • b8642: [HIP] Bump ROCm version to 7.2.1 (#21066)
    • Bumps the ROCm version from 7.2 to 7.2.1 across all CI/CD workflows and the ROCm Dockerfile, and adds the missing gfx1102 GPU target to the fat-build architecture list.
  • b8646: rpc : reuse compute graph buffers (#21299)
    • Reuse the buffer for the ggml context which is used for creating the compute graph on the server side. This partially addresses a memory leak created by the CUDA backend due to using buffer addresses as cache keys.
    • ref: #21265

🚀 Performance Improvements

  • b8638: tests: allow exporting graph ops from HF file without downloading weights (#21182)
    • This expands the export-graph-ops binary to also allow using --hf-repo instead of --model. It uses the HF metadata loader from #19796 to set up a dummy model graph without loading weights and parses the cgraph from that, which allows running test-backend-ops on tensors from models without downloading them. That should make checking if a backend works correctly for a specific model/quant much easier, and also allows performance benchmark comparisons without downloads.
    • I tried to keep the changes to disable actually downloading the model minimal, but let me know if you can see a better way to do this.

🐛 Bug Fixes

  • b8641: Gemma 4 template parser fixes (#21326)
    • As in topic
    • Quick fixes for some observed discrepancies + refactoring of the parser architecture for the dict format

Additional Changes

4 minor improvements: 2 documentation, 1 example, 1 maintenance.

  • b8640: Add unit test coverage for llama_tensor_get_type (#20112)
    • This is part of a larger goal of reworking or replacing the llama_tensor_get_type function
    • Before major work starts in that area, I want to capture the current existing behaviour thoroughly, so that any accidental changes are easy to spot, and any purposeful changes are easy to document
    • To that end, this PR introduces unit test coverage for the function itself
  • b8645: chat : avoid including json in chat.h (#21306)
  • b8637: model: support gemma 4 (vision + moe, no audio) (#21309)
    • Fix a bug where models with both vision and audio could not be converted properly
  • b8644: (revert) kv-cache : do not quantize SWA KV cache (#21332)
    • revert #21277
    • In some cases the SWA cache actually takes a significant portion of memory, so it's not always a good idea to keep it in full precision. It could be controlled via a flag, but that is probably not worth the extra logic.

Full Commit Range


2026-04-02: Update to llama.cpp b8635

Summary

Updated llama.cpp from b8635 to b8635, incorporating 1 upstream commit with new features.

Notable Changes

🆕 New Features

  • b8635: Relax prefill parser to allow space. (#21240)
    • As in title.
    • The prefill parser strictly required the reasoning marker at the very start of the message, which interfered with models that like to insert, e.g., a newline there.

Full Commit Range


2026-03-27: Update to llama.cpp b8555

Summary

Updated llama.cpp from b8507 to b8555, incorporating 25 upstream commits with new features and performance improvements.

Notable Changes

🆕 New Features

  • b8517: llama: fix llama-model-saver (#20503)
    • This PR fixes llama-model-saver and makes the --output argument of test-llama-archs functional (the models themselves are still broken though because they lack tokenizers).
    • The first issue fixed in this PR is that llama-model-saver is simply unmaintained: a lot of new KV values were added since I implemented it and those were not being saved correctly. I simply went through the KV values again, added the missing ones and checked where the corresponding information can be extracted from.
    • The second issue fixed in this PR is that on master several archs have broken tensor names: typically what happens is that in llama_model::load_tensors tensors are being created without a corresponding entry in llm_get_tensor_names. As a consequence LLM_TN_IMPL::str then doesn't use the provided arguments to format the tensor name with e.g. the layer index. So you end up with multiple, different tensors that have names like blk.%d.attn_q. Since a GGUF context is populated by tensor name this leads to conflicts and the model cannot be saved correctly. To me it is not clear why we have llm_get_tensor_names in the first place. I think it would make more sense to just check in LLM_TN_IMPL::str() whether suffix, bid, and/or xid are set and to use them in those cases. Also add a warning in cases where the tensor name template and the provided arguments don't match. I would implement this refactor in this PR.
  • b8525: model : allow causal_attn and pooling_type on all architectures (#20973)
    • Change all architectures to read the causal_attn and pooling_type hyperparameters.
    • Transformers has introduced a change that enables all decoder-only models to function as encoders too (see the previous PR #20746). Rather than adding support for each model individually, I thought it would be better to allow all models to be used as embedding models.
  • b8532: CUDA & CPU: support F32 kernel type for CONV_TRANSPOSE_2D (#17094)
    • also updated test case in test-backend-ops.
    • But since F32 kernel type is not supported on CPU, only GGML_TYPE_F16 is kept and GGML_TYPE_F32 can be uncommented back in the future.
  • b8545: hip: use fnuz fp8 for conversion on CDNA3 (#21040)
    • HIP supports the fp8 types e4m3_fnuz and e4m3_ocp, the difference being that fnuz doesn't support inf. GFX942 (uniquely) supports only e4m3_fnuz in hardware. Due to what looks like an oversight in ROCm, the combination of e4m3_ocp on devices with native fp8 support but no ocp support is not implemented.
    • Use native fnuz here to avoid this.
  • b8552: rpc : proper handling of data pointers to CPU buffers (#21030)
    • The compute graph may contain tensors pointing to CPU buffers. In these cases the buffer address is serialized as 0 and sent over the wire. However, the data pointer is serialized as-is and this prevents proper validation on the server side. This patch fixes this by serializing the data pointer as 0 for non-RPC buffers and doing proper validation on the server side.
    • closes: #21006

🚀 Performance Improvements

  • b8507: ggml-backend: re-enable graph reuse with pipeline parallelism (#20927)
    • Fix #20835. This is a sufficient fix but might not be the most performant one. At least restores performance for multi-GPU setups.

🐛 Bug Fixes

  • b8508: models : move the token embedding norms to the first layer (#20943)
    • We were keeping the token embedding norms on the input layer buffers. This results in the operations being performed on the CPU:
    • make -j && GGML_SCHED_DEBUG=2 ./bin/llama-embedding -hf ggml-org/bge-m3-Q8_0-GGUF -p "Hello world" -lv 5
  • b8513: [SYCL] fix wrong variable check by assert (#20903)
    • Fix the issue: https://github.com/ggml-org/llama.cpp/pull/19920#issuecomment-4107430630
    • Correct the variable to be checked by assert.
  • b8514: fix-pointer-dangling (#20974)
    • In the JNI layer of the sample Android program, when calling processUserInput, the pointer of user_prompt is freed before being referenced, and if the memory is overwritten during this period, it will not be possible to correctly retrieve the input.
  • b8519: jinja: fix macro with kwargs (#20960)
    • Fix this case: {% macro my_func(a, b=False) %}{% if b %}{{ a }}{% else %}nope{% endif %}{% endmacro %}{{ my_func(1, b=True) }}
    • With the master branch version, this case fails with an error.
  • b8528: common : fix gguf selection in common_list_cached_models (#20996)
    • Fix regression that makes common_list_cached_models() show all files
    • Related to #20994
  • b8529: common : fix verbosity setup (#20989)
    • The verbosity threshold was set at the end of common_params_parse_ex(), after doing many things (like downloading files) so -v and LLAMA_LOG_VERBOSITY were useless during this function.
  • b8546: fix: mtmd "v.patch_embd" quant and unsupported im2col ops on Metal for deepseek-ocr (#21027)
    • This PR fixes two issues affecting vision models:
      1. Quantization of v.patch_embd
      2. Unsupported im2col (bf16) ops on Metal for DeepSeek-OCR
  • b8548: metal: Fix dimension constraint violation in matmul2d descriptor (#21048)
    • Updates Metal tensor API test probes to fix the dimension constraint violation in the matmul2d descriptor (at least one value must be a multiple of 16).
    • Some investigation detailed here https://github.com/ggml-org/llama.cpp/pull/16634#issuecomment-4138042074 indicated that the test probes for the Metal tensor API fail to compile successfully on macOS 26.4, leading to the tensor support in the Metal backend being disabled erroneously. This is due to a change in the Apple APIs between the time https://github.com/ggml-org/llama.cpp/pull/16634 was tested and merged by @ggerganov and today. They now require that at least one of the dimensions M and N be a multiple of 16.
    • Notably, the actual kernels used already respect this constraint (obviously, as they are compiling successfully today), and it is only these test probes which violate it.
  • b8551: fix: session_tokens insert range in completion tool (no-op → correct) (#20917)
    • The embd.begin(), embd.begin() range is empty and inserts nothing, so session_tokens never gets updated after decoding. It should be embd.begin(), embd.end().
    • Introduced in commit 2b6dfe8.
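
The empty-range bug described in b8551 is easy to reproduce outside C++. The sketch below is a Python analogue, not the upstream code: a slice from 0 to 0 plays the role of the embd.begin(), embd.begin() iterator pair, and `append_decoded` is a hypothetical name used only for illustration.

```python
def append_decoded(session_tokens, embd, buggy=False):
    """Return session_tokens extended with the tokens in embd.

    With buggy=True the slice is [0:0], mirroring a begin/begin iterator
    range: it is empty, so nothing is appended.
    """
    end = 0 if buggy else len(embd)
    return session_tokens + embd[0:end]


if __name__ == "__main__":
    print(append_decoded([1, 2], [3, 4, 5], buggy=True))
    print(append_decoded([1, 2], [3, 4, 5]))
```

Because an empty range is valid input, the buggy version runs without any error, which is why this kind of no-op can go unnoticed.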

Additional Changes

10 minor improvements: 2 documentation, 6 examples, 2 maintenance.

Full Commit Range


2026-03-24: Update to llama.cpp b8505

Summary

Updated llama.cpp from b8505 to b8505, incorporating 1 upstream commit.

Notable Changes

🐛 Bug Fixes

Full Commit Range


2026-03-18: Update to llama.cpp b8405

Summary

Updated llama.cpp from b8394 to b8405, incorporating 6 upstream commits with breaking changes and new features.

Notable Changes

⚠️ Breaking Changes

  • b8399: vulkan: disable mmvq on Intel Windows driver (#20672)
    • Fixes #17628
    • @savvadesogle This disables MMVQ entirely on Intel Windows, that should remove the need to use the env var. Please try it.
  • b8405: common : rework gpt-oss parser (#20393)
    • Rework the gpt-oss parser.
    • Tighten up the grammar; gpt-oss is very good at following its own Harmony spec.
    • Allow any sequence of analysis/preamble.

🆕 New Features

  • b8398: ggml blas: set mkl threads from thread context (#20602)
    • Commit 1: Set number of threads for MKL
    • Commit 2: Add way to run blas builds through local CI.
  • b8400: hexagon: add neg, exp, sigmoid, softplus ops, cont, repeat ops (#20701)
    • Add element-wise unary ops needed by Qwen 3.5's DeltaNet linear attention layers. These ops follow the existing unary-ops pattern with VTCM DMA double-buffering.
    • neg: negate via scale by -1.0
    • exp: uses existing hvx_exp_f32 HVX intrinsics

🐛 Bug Fixes

  • b8394: vulkan: async and event fixes (#20518)
    • I noticed incoherence with my multi-GPU setup as well when investigating issues like #20462. I found that they can be fixed by disabling cpy_tensor_async, so the problem is with the async path. I narrowed it down to these problems:
    • events were set, but the wait command was never submitted to the queue, so the event_wait function didn't do anything
    • events were resetting command buffers that had long since been reused, because they didn't track that. This was causing validation errors and perhaps driver issues/crashes
  • b8401: Reset graph on control vector change (#20381)
    • This PR makes an existing context pick up a change to its control vector configuration via llama_context::set_adapter_cvec.
    • The issue in short:
    • Initial call to set_adapter_cvec works, steering vector applies to generation.

Full Commit Range


2026-03-17: Update to llama.cpp b8392

Summary

Updated llama.cpp from b8338 to b8392, incorporating 32 upstream commits with breaking changes, new features, and performance improvements.

Notable Changes

⚠️ Breaking Changes

  • b8358: ci : split build.yml + server.yml (#20546)
    • cont #20540
    • Split build.yml + server.yml into parts and move some of the workflows in the new parts
    • Continue to run build.yml + server.yml on all PRs and master branch
  • b8363: ggml: avoid creating CUDA context during device init (#20595)
    • ggml_cuda_init() calls cudaSetDevice() on every GPU just to query free VRAM for logging. This triggers the creation of a CUDA primary context (120-550 MB depending on GPU), which is irreversible for the lifetime of the process. Every process that loads the backend pays this cost, even if it never uses the GPU (router mode).
    • This PR removes cudaSetDevice + cudaMemGetInfo from device init. The log loses the free VRAM part but still shows total VRAM via cudaGetDeviceProperties (no context needed). Free VRAM is queried later by FIT through its own cudaSetDevice path, so the context creation is simply deferred to first real use.

🆕 New Features

  • b8340: ggml : add native AVX512-FP16 support for F16 operations (#20529)
    • The overall benchmark speed remains almost the same because the CPU is now calculating faster than the RAM can deliver the data. (See perf stat results below showing 2.7 billion fewer instructions).
    • Also note that this path will be only enabled for native build or with custom flags.
    • now:
  • b8350: ci : move self-hosted workflows to separate files (#20540)
  • b8351: metal : add FA specialization for HSK = 320, HSV = 256 (#20549)
    • Add Metal kernels
    • Add test-backend-ops tests
  • b8355: cuda : add RDNA4-specific MMVQ parameter table for bs=1 decode (#19478)
    • Add a dedicated MMVQ_PARAMETERS_RDNA4 entry separate from RDNA2/RDNA3. RDNA4 (gfx1201) is wave32-only and has a different memory subsystem, so it benefits from a different MMVQ configuration than RDNA2/RDNA3.
    • For bs=1 decode on RDNA4, optimal config is nwarps=8, rows_per_block=1:
    • 8 warps × 32 threads = 256 threads per block
  • b8372: model : wire up Nemotron-H tensors for NVFP4 support (#20561)
    • prep #20539
  • b8388: model: mistral small 4 support (#20649)
    • Ref upstream PR: https://github.com/huggingface/transformers/pull/44760
    • The model is the same as Mistral Large 3 (deepseek2 arch with llama4 scaling), but I'm moving it to a new arch mistral4 to be aligned with transformers code
    • Disclosure: this PR is made possible with the help from Mistral team. Kudos to @juliendenize for the coordination!
  • b8392: kleidiai : fix MUL_MAT support for batched (3D) inputs (#20620)
    • The supports_op() check incorrectly rejected MUL_MAT operations with 3D inputs (ne[2] > 1), but the actual compute_forward_qx() implementation handles batched inputs correctly via a loop over ne12.
    • This caused models with Q4_0/Q8_0 weights to crash during graph scheduling when n_seq_max > 1, because weights were placed in KLEIDIAI buffers during loading (tested with 2D inputs) but the runtime used 3D inputs.
    • Also relax the buffer check to allow supports_op() to be called during weight loading when src[0]->buffer is NULL.

🚀 Performance Improvements

  • b8348: ci: try to optimize some jobs (#20521)
    • I tried to switch some jobs to arm or ubuntu-slim as per my comment in #20446 for builds where it really doesn't matter. Most jobs didn't fit in the 15 minute ubuntu-slim time limit and some like the sanitizer or android straight up failed on arm. If a job doesn't have ccache set up I also made it work on both x86 and arm so it would pick the first available machine.
    • I'm not sure how much this really helps, but it does reduce the number of x86 machines that we're using at any given time.
    • run in my fork with those jobs forced to run on arm: https://github.com/netrunnereve/llama.cpp/actions/runs/23031702820
  • b8364: CUDA: limit number of FA stream-k CUDA blocks (#20586)
    • On master the CUDA mma FA kernel can launch superfluous CUDA blocks that do not do any useful work but cause overhead. This can happen when running small models on GPUs with many streaming multiprocessors at low batch sizes. This PR fixes this by limiting the number of CUDA blocks to the number that can do useful work.
    • Performance changes

🐛 Bug Fixes

  • b8347: hexagon: Q4_0 and MXFP4 repack fixes (#20527)
    • Turns out our repack logic has a bug where tensors with row sizes that are not a multiple of 256 are getting corrupted.
    • Basically, I made the wrong assumption that we can use 0:128,1:129,... INT4 element packing for all blocks of 256
    • This was causing the scales to partially override some of the tail quants (in Hexagon backend we repack the rows into all-quants followed by all-scales format).
  • b8352: llama: Wire up Qwen3.5/Qwen3.5MoE tensors for NVFP4 support (#20506)
    • PR https://github.com/ggml-org/llama.cpp/pull/20505 fixes the conversion errors for making Qwen3.5 NVFP4 GGUF files and properly reorders the Qwen3.5 linear attention layers, but without this update, those models will not load.
    • This update wires up the Qwen3.5 tensors so they are properly loaded from Qwen3.5 NVFP4 gguf files and follows the same design intent using build_lora_mm:
    • This links up the:
  • b8353: Read the persisted llama_kv_cell_ext for n_pos_per_embd > 1 on state_read for all sequence ids (#20273)
    • cont #20132
    • Attempting to call llama_kv_cache::state_read fails when n_pos_per_embd is greater than 1, since llama_kv_cell_ext data is serialised in state_save but not read back in state_read, leading to deserialisation failure since the cell_ext data is being parsed as a seq_id.
    • I assume the attached fix is correct -- kv cache persistence to host memory is now working as expected.
  • b8354: vulkan: use graphics queue on AMD (#20551)
    • I'm not sure why, but the graphics queue is slightly faster in tg on AMD than the compute queue, and this also fixes the partial offload issue I fixed in #19976, so the second queue no longer has to be enabled by default. I got the idea from @zedbytes reporting that tg goes up when running with RADV_DEBUG=nocompute.
    • AMD RX 9070 XT
  • b8356: Guard against sumq2 being 0 in IQ4_NL resulting in nan values (#20460)
    • With IQ4_NL on several recent models, NaN blocks have been found during quantization, which crashes the quantization run
    • It seems to be stemming from a scenario where sumq2 is 0 for a given block, likely from not having imatrix data for some obscure expert, or the weights themselves being 0 as we've seen with some recent Qwen models
    • This change guards against dividing by 0, instead setting d to 0, which would then just set the block of weights to 0, which seems appropriate
  • b8360: fix: prevent nullptr dereference (#20552)
    • When encountering an unsupported template (e.g. translategemma), the code currently dereferences a nullptr and causes the program to crash.
    • With this fix, a proper exception will be thrown from common_chat_templates_apply_jinja instead.
  • b8361: ggml/hip: fix APU compatibility - soft error handling for hipMemAdviseSetCoarseGrain (#20536)
    • Description:
    • On AMD APU/iGPU devices (unified memory architecture, e.g. AMD Strix Halo gfx1151), hipMemAdviseSetCoarseGrain returns hipErrorInvalidValue because this hint is not applicable to UMA systems. The current code wraps this call in CUDA_CHECK().
  • b8366: sycl : fix for untransposed GDA recurrent state (#20583)
    • cont #20443
  • b8370: tests: Fix invalid iterator::end() dereference in common_regex (#20445)
    • When compiling with VS2026 18.4 I noticed test-regex-partial crashes immediately with debug build.
    • I tracked this down to an iterator::end() dereference in the following test case which was occurring here.
  • b8373: vulkan: fix flash attention dot product precision (#20589)
    • The Q*K^T dot product was done in float16, but it should have been using ACC_TYPE. This fixes the GLM4 incoherence.
    • Fixes #20555
  • b8391: vulkan: allow graphics queue only through env var (#20599)
    • Improve #20551 to fix the reported issues. Only use graphics queue on RADV on larger GPUs.
    • Fixes #20597
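
The divide-by-zero guard from b8356 can be illustrated with a short sketch. This is not the upstream implementation: it assumes the block scale d is computed as a ratio whose denominator is sumq2, and the numerator name `sumqx` and both function names are hypothetical.

```python
def block_scale(sumqx, sumq2):
    """Return the block scale d, falling back to 0.0 when sumq2 is 0.

    Without the guard, 0/0 in floating-point C code yields NaN; here the
    scale is set to 0 instead.
    """
    return sumqx / sumq2 if sumq2 > 0 else 0.0


def dequantize(d, quants):
    """Reconstruct weights from a scale and integer quant values."""
    return [d * q for q in quants]


if __name__ == "__main__":
    d = block_scale(0.0, 0.0)
    print(d, dequantize(d, [1, -2, 3]))
```

With d set to 0, every weight in the block dequantizes to zero, which matches the behaviour the changelog entry describes as appropriate.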

Additional Changes

10 minor improvements: 3 documentation, 2 examples, 5 maintenance.

Full Commit Range


2026-03-14: Update to llama.cpp b8329

Summary

Updated llama.cpp from b8287 to b8329, incorporating 29 upstream commits with new features.

Notable Changes

🆕 New Features

  • b8291: metal : add env var to trigger graph capture (#20398)
    • QoL for capturing execution of Metal graphs for profiling purposes.
    • Usage:
  • b8295: llama : add support for Nemotron 3 Super (#20411)
    • This commit adds support for the Nemotron 3 Super model (120B.A12B) enabling this model to be converted to GGUF format and run in llama.cpp.
  • b8299: llama : enable chunked fused GDN path (#20340)
    • cont #19504
    • Backends can now implement the chunked version of the fused GDN operator.
    • Implementations:
  • b8299: metal : add GDN kernel (#20361)
    • target #20340
    • cont #20244
    • Add fused GDN recurrent kernel. Use both for BS == 1 and BS > 1.
  • b8299: ggml: add GATED_DELTA_NET op (#19504)
    • Add CPU/CUDA impl for GATED_DELTA_NET used in qwen3next and a lot of upcoming recent attention models. This is a basic vector impl and not the chunking impl, although this should work for n_tokens > 1 as a reference implementation. I tested this vs build_delta_net_autoregressive and the results were good. I plan to add the chunked implementation for CPU and CUDA.
    • master:
    • | model | size | params | backend | threads | fa | test | t/s |
  • b8299: CUDA: AR gated delta net improvements (#20391)
    • I profiled the AR gated delta net, and improved perf by:
      1. Adding fastdiv/fastrem for s64 int (do we even need this arithmetic to happen in 64-bit?)
      2. Sharding a column across a full warp instead of using only a single thread. We don't fill SMs (at least on higher-tier GPUs) with existing launch-config (saw 16-32 CTAs with low thread-counts vs. 80+ SMs for e.g. 5080), so that was some free perf while reducing register-pressure in the case where S_v = 128 (saw some spill there)
  • b8304: tool parser: add GigaChatV3/3.1 models support in PEG format (#19931)
    • I have recreated the PR of https://github.com/ggml-org/llama.cpp/pull/17924 for cleaner commits and no merge conflicts
  • b8315: vulkan: fix SSM_CONV PP scaling with large ubatch sizes (#20379)
    • Fixes #18725
    • The SSM_CONV shader dispatched one token per Y workgroup, each doing only nc (typically 4) multiply-adds. At ubatch=2048 this meant 2048 workgroups in Y with almost no work per launch, so workgroup dispatch overhead dominated.
    • Changes:
  • b8317: llama : enable chunked fused GDN path (#20340)
    • cont #19504
    • Backends can now implement the chunked version of the fused GDN operator.
    • Implementations:
  • b8329: ggml-cpu: add RVV vec dot kernels for quantization types (#18859)
    • This PR adds RVV vector dot kernels for a number of quantization types.
    • Added the following RVV kernels:
    • | Kernel | VLEN |

🐛 Bug Fixes

  • b8292: metal : fix q5_k mul_mv register spill (#20399)
    • cont #20398
    • Noticed too high register pressure in the q5_k vec kernel:
  • b8301: common : fix --n-cpu-moe, --cpu-moe for models with fused gate + up (#20416)
    • Changed the regex that matches conditional experts from:
    • </code></pre>
      </li>
      <li>const char * const LLM_FFN_EXPS_REGEX = "\.ffn_(up|down|gate)_(ch|)exps";</li>
      </ul>
      </li>
      <li><strong>b8308</strong>: vulkan: Fix ErrorOutOfHostMemory on Intel GPU when loading large models with --no-mmap (<a href="https://github.com/ggml-org/llama.cpp/pull/20059">#20059</a>)
      <ul>
      <li>Fixes #19420.</li>
      <li>We were hitting an internal maximum number (16383) of command buffers for Intel's Windows GPU driver causing ErrorOutOfHostMemory when loading large models (1MB per transfer * 16383 == approx 16GB or more weight). This PR attempts to fix this by reusing command buffers that are done transferring data.</li>
      <li><code>llama-cli.exe -m Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf --no-mmap</code> no longer crashes on either the Intel iGPU or the NVIDIA dGPU, and chat results are correct as well.</li>
      </ul>
      </li>
      <li><strong>b8309</strong>: vulkan: fix OOB check in flash_attn_mask_opt (<a href="https://github.com/ggml-org/llama.cpp/pull/20296">#20296</a>)
      <ul>
      <li>Fixes #19955.</li>
      <li>I saw a few percent slowdown with pp512 (which is too small to hit the aligned path on my system after this change) so I tweaked the use_mask_opt logic to hide it. I should look into spreading the work across more workgroups, but I don't have time for that today.</li>
      <li>@el95149 this is different enough from the test change that it's probably worth retesting.</li>
      </ul>
      </li>
      <li><strong>b8310</strong>: vulkan: fix l2_norm epsilon handling (<a href="https://github.com/ggml-org/llama.cpp/pull/20350">#20350</a>)
      <ul>
      <li>This is the only "real" bug I could find in test-llama-archs. I see some other failures but they may be driver/compiler bugs.</li>
      </ul>
      </li>
      <li><strong>b8318</strong>: grammar : Fix grammar root symbol check (<a href="https://github.com/ggml-org/llama.cpp/pull/19761">#19761</a>)
      <ul>
      <li>Constructing a GBNF grammar allows the programmer to select a <code>grammar_root</code>, the symbol to start the grammar from.</li>
      <li>The <code>llama_grammar_init_impl</code> function included a check to see whether the grammar contains a rule for a symbol named literally "root", instead of checking for a symbol with the name passed in as <code>grammar_root</code>. This causes valid grammars with non-"root" root symbols to fail the check, while invalid grammars that have a rule named "root" but a different chosen <code>grammar_root</code> symbol pass the check and then immediately fail hard (see failure case in Tests section).</li>
      <li>Check whether there is a rule for a symbol with the name passed in as <code>grammar_root</code>, not literally <code>"root"</code>.</li>
      </ul>
      </li>
      <li><strong>b8323</strong>: llama : disable graph reuse with pipeline parallelism (<a href="https://github.com/ggml-org/llama.cpp/pull/20463">#20463</a>)
      <ul>
      <li>The following repro demonstrates the issue:</li>
      <li>
      <pre lang="bash"><code>
      
    • make -j && ./bin/llama-perplexity -hf ggml-org/Qwen3-0.6B-GGUF -f wiki.test.raw --chunks 16 -ngl 99 -ub 512 -b 2048
  • b8325: metal : fix l2 norm scale (#20493)
    • Bug revealed from recently added tests.
  • b8328: ggml : fix typo gmml (#20512)

Additional Changes

10 minor improvements: 3 documentation, 5 examples, 2 maintenance.

Full Commit Range


2026-03-08: Update to llama.cpp b8234

Summary

Updated llama.cpp from b8233 to b8234, incorporating 2 upstream commits with new features.

Notable Changes

🆕 New Features

  • b8233: ggml: add GATED_DELTA_NET op (#19504)
    • Add CPU/CUDA impl for GATED_DELTA_NET, used in qwen3next and many recent and upcoming attention models. This is a basic vector impl and not the chunking impl, although it should work for n_tokens > 1 as a reference implementation. I tested this against build_delta_net_autoregressive and the results were good. I plan to add the chunked implementation for CPU and CUDA.
    • master:
    • | model | size | params | backend | threads | fa | test | t/s |

Additional Changes

1 minor improvement: 1 documentation.

  • b8234: [SYCL] support Flash Attention for fp32/fp16/Q4/Q5/Q8 (#20190)
    • Support Flash Attention for fp32/fp16/Q4/Q5/Q8.
    • All supported Flash Attention unit test (UT) cases pass.
    • Flash Attention can be enabled or disabled with the environment variable GGML_SYCL_ENABLE_FLASH_ATTN.

Full Commit Range


2026-03-07: Update to llama.cpp b8229

Summary

Updated llama.cpp from b8229 to b8229, incorporating 1 upstream commit.

Notable Changes

🐛 Bug Fixes

  • b8229: [ggml-quants] Add memsets and other fixes for IQ quants (#19861)
    • While trying to stop my Qwen3.5 quants from getting a ton of "Oops: found point X not on grid ..." errors, I (and Claude) came across a potentially big issue.
    • Using gdb, it seems that L often starts out as non-zero (uninitialized) memory, so when it is read it contains garbage data that makes the quantization go awry when no candidates are found during the search.
    • With this change, with Qwen3.5, I no longer saw ANY "Oops: found point.." errors, and the PPL seems totally as expected

Full Commit Range


2026-03-05: Update to llama.cpp b8204

Summary

Updated llama.cpp from b8185 to b8204, incorporating 16 upstream commits with breaking changes, new features, and performance improvements.

Notable Changes

⚠️ Breaking Changes

  • b8189: Clean up per-thread parameter buffer pool and job submission logic (#19772)
    • After splitting per-thread state and execution, this is the final cleanup diff.
    • We allow the buffer pool to grow in case of multiple kernels in a command requiring more buffers, remove the inflight_threads logic, and replace it with num_kernels to decide when to submit a batch of commands.
  • b8201: [WebGPU] Fix wait logic for inflight jobs (#20096)
    • Fix WebGPU wait logic incorrectly removing futures. WaitAny returns when any future completes, but the previous implementation erased the entire submission entry (aka a vector of futures). Flatten the nested futures structure to a single vector and remove only the futures that are completed.
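
The b8201 fix above can be illustrated outside WebGPU. The following is a minimal Python analogy, not the backend's C++ code: `concurrent.futures.wait` with `FIRST_COMPLETED` stands in for WaitAny, and the names `drop_completed`, `submissions`, and `inflight` are hypothetical.

```python
from concurrent.futures import FIRST_COMPLETED, Future, wait


def drop_completed(inflight: list[Future]) -> list[Future]:
    """Block until at least one future finishes, then keep only the unfinished ones."""
    if not inflight:
        return []
    done, _ = wait(inflight, return_when=FIRST_COMPLETED)
    # Remove only the futures that actually completed; the rest stay in flight.
    return [f for f in inflight if f not in done]


# Two submissions with two futures each, flattened into a single list.
# (The buggy pattern erased a whole submission as soon as any one future completed.)
submissions = [[Future(), Future()], [Future(), Future()]]
inflight = [f for group in submissions for f in group]

inflight[0].set_result("kernel 0 done")  # only one future has completed
remaining = drop_completed(inflight)
print(len(remaining))
```

With the flattened list, the sibling future from the same submission is still tracked after the call, which is the behaviour the entry describes.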

🆕 New Features

  • b8188: ggml-webgpu: Support non-contiguous src0 and overlapping src0/src1 in binary ops (#19850)
    • Hello. This PR improves the handling of binary operations in the WebGPU backend, adding support for patterns required by #16857 (MoE expert reduce).
    • The changes are as follows:
    • The index is now calculated based on stride to support cases where src0 is a non-contiguous tensor.
  • b8190: ggml webgpu: fix workgroup dispatch limit for large batch sizes (#19965)
    • WebGPU limits workgroup counts to 65535 per dimension. MUL_MAT operations with batch sizes exceeding this limit would fail or corrupt memory.
    • This PR implements 2D workgroup dispatch to handle arbitrary batch sizes:
    • Adds compute_2d_workgroups() helper to split workgroups across X/Y dimensions when exceeding the 65535 limit
  • b8191: opencl: add optimized q4_1 mm kernel for adreno (#19840)
    • This PR adds optimized OpenCL kernels for Q4_1 GEMM and GEMV operations on Adreno GPUs.
  • b8192: kleidiai : add sme fp16 compute path for q4_0 gemm on aarch64 (#20043)
    • This patch introduces an SME2-based FP16 compute path for Q4_0 GEMM to improve performance on AARCH64.
    • Benchmark result for Llama-3.2-1B-Instruct-Q4_0 — pp512 (t/s) (Mac M4 Pro, GGML_KLEIDIAI_SME=1)
    • | Threads | w/o fp16q4 | w/ fp16q4 | Improvement |
  • b8203: opencl: add set, i32 for cpy (#20101)
    • Add set and support i32 for cpy. Also some minor refactoring for cpy host code.
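
The 2D dispatch idea from b8190 above can be sketched in a few lines. This is an illustrative Python version under assumptions: only the name `compute_2d_workgroups` and the 65535 per-dimension limit come from the entry; the splitting strategy and the error handling are guesses, not the upstream implementation.

```python
MAX_WG_PER_DIM = 65535  # per-dimension workgroup limit quoted in the entry above


def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division."""
    return (a + b - 1) // b


def compute_2d_workgroups(total: int, limit: int = MAX_WG_PER_DIM) -> tuple[int, int]:
    """Split `total` workgroups into an (x, y) grid with both dimensions <= limit."""
    if total <= 0:
        raise ValueError("total must be positive")
    if total <= limit:
        return total, 1
    y = ceil_div(total, limit)
    if y > limit:
        raise ValueError("total exceeds limit * limit workgroups")
    x = ceil_div(total, y)
    return x, y


x, y = compute_2d_workgroups(100_000)
print(x, y, x * y >= 100_000)
```

Because `x * y` can exceed `total`, a shader using such a grid would typically rebuild a linear index as `y_id * x + x_id` and return early when it is out of range.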

🚀 Performance Improvements

  • b8185: ggml-cpu: optimise s390x multiply extend instructions (#20032)
    • This PR optimizes the multiply extend vector instructions for Q4_0, Q4_K, Q5_K, and Q6_K quantizations by using the fused multiply-add instruction instead of separating them into multiple instruction calls. We notice a performance improvement of about 28.77% and 16.35% for Prompt Processing and Token Generation respectively.
    • Old Instruction Set
    • </code></pre>
      </li>
      </ul>
      </li>
      <li><strong>b8187</strong>: vulkan: tune MMVQ for Intel Windows (<a href="https://github.com/ggml-org/llama.cpp/pull/19988">#19988</a>)
      <ul>
      <li>Tune MMVQ use for Intel Windows according to <a href="https://github.com/ggml-org/llama.cpp/issues/17628#issuecomment-3897132360">https://github.com/ggml-org/llama.cpp/issues/17628#issuecomment-3897132360</a></li>
      <li>@savvadesogle Please try it and see if performance is good.</li>
      </ul>
      </li>
      <li><strong>b8197</strong>: ggml : use a simple std::thread in AMX without OpenMP (<a href="https://github.com/ggml-org/llama.cpp/pull/20074">#20074</a>)
      <ul>
      <li>Disabling OpenMP generally provides better inference performance (at least in my testing) but the loading becomes slightly slower.</li>
      <li>Benchmark results for <code>convert_B_packed_format()</code>:</li>
      <li>Before this commit:</li>
      </ul>
      </li>
      <li><strong>b8204</strong>: hexagon: Flash Attention optimizations (dma, mpyacc, multi-row) and MatMul updates (<a href="https://github.com/ggml-org/llama.cpp/pull/20118">#20118</a>)
      <ul>
      <li>Further updates on top of #19780 by @chraac</li>
      <li>Improved DMA pipelining in FA</li>
      <li>Reduced FA block size from 128 to 64 to improve DMA prefetch (128 is too big for most models)</li>
      </ul>
      </li>
      </ul>
      <h4>🐛 Bug Fixes</h4>
      <ul>
      <li><strong>b8196</strong>: impl : use 6 digits for tensor dims (<a href="https://github.com/ggml-org/llama.cpp/pull/20094">#20094</a>)
      <ul>
      <li>Many models have vocabulary sizes, and thus tensor shapes, with more than 5 digits (ex: Gemma 3's vocab size is 262,208).</li>
      <li>I already fixed this for <code>llama_format_tensor_shape</code> (tensor) but missed it for <code>llama_format_tensor_shape</code> (vector) until now. Oops.</li>
      </ul>
      </li>
      <li><strong>b8198</strong>: ggml: fix ggml_is_contiguous_n for ne == 1 (<a href="https://github.com/ggml-org/llama.cpp/pull/20092">#20092</a>)
      <ul>
      <li>While debugging a test failure for <a href="https://github.com/ggml-org/llama.cpp/pull/19802">https://github.com/ggml-org/llama.cpp/pull/19802</a> I found what I believe to be a bug in <code>ggml_is_contiguous_n</code>. A test case using the new fused experts from <a href="https://github.com/ggml-org/llama.cpp/pull/19139">https://github.com/ggml-org/llama.cpp/pull/19139</a> fails on an assert like <code>GGML_ASSERT(ggml_is_contiguous_1(a))</code>. This assertion failure happens specifically because the test case uses only a single expert vs. the real models using >1 experts. So the test case gets a tensor like this: <code>ne = {192, 1, 128, 1}, nb = {4, 1536, 1536, 196608}</code>. This should be contiguous in dimensions 1, 2, and 3 but it is not according to <code>ggml_is_contiguous_1</code>. The reason is that the code on master entirely skips dimensions that have a size of 1. But this then also skips the fix for <code>next_nb</code> if a dimension does not need to be contiguous. This PR adjusts the logic to skip only the check for whether or not the tensor is contiguous if a dimension is equal to 1.</li>
      </ul>
      </li>
      </ul>
      <h3>Additional Changes</h3>
      <p>3 minor improvements: 1 documentation, 2 examples.</p>
      <ul>
      <li><strong>b8200</strong>: ggml-webgpu: Add the support of <code>GGML_OP_CONCAT</code> (<a href="https://github.com/ggml-org/llama.cpp/pull/20068">#20068</a>)
      <ul>
      <li>Hello. This PR adds <code>GGML_OP_CONCAT</code> support to the WebGPU backend. This op is used by models such as DeepSeek-V2.</li>
      <li>This change supports two types <code>F32</code>, <code>I32</code> to match the types covered by <code>test_concat</code> in <code>test-backend-ops</code>.</li>
      </ul>
      </li>
      <li><strong>b8194</strong>: completion : Fix a typo in warning message (<a href="https://github.com/ggml-org/llama.cpp/pull/20082">#20082</a>)
      <ul>
      <li>resuse -> reuse</li>
      </ul>
      </li>
      <li><strong>b8195</strong>: Fix locale-dependent float printing in GGUF metadata (<a href="https://github.com/ggml-org/llama.cpp/pull/17331">#17331</a>)
      <ul>
      <li>I was running some llama.cpp examples on a system with a German locale (de_DE) and noticed something odd - when llama-cli printed out the model metadata, all the float values had commas as decimal separators (like "0,000000") instead of periods. But when I ran llama-perplexity on the same model, it used periods normally.</li>
      <li>After some digging, I found the issue was in the gguf_data_to_str() function in llama-impl.cpp. It was using std::to_string() to format floats, which respects the system's LC_NUMERIC locale setting. So depending on which tool you used and what locale it was running with, you'd get different formatting.</li>
      <li>I've changed it to use std::ostringstream with std::locale::classic() instead, which always formats floats with a period as the decimal separator, regardless of the system locale. This should make the output consistent across all tools and locales.</li>
      </ul>
      </li>
      </ul>
      <h3>Full Commit Range</h3>
      <ul>
      <li>b8185 to b8204 (16 commits)</li>
      <li>Upstream releases: <a href="https://github.com/ggml-org/llama.cpp/compare/b8185...b8204">https://github.com/ggml-org/llama.cpp/compare/b8185...b8204</a></li>
      </ul>
      <hr />
      <h2>2026-03-02: Update to llama.cpp b8185</h2>
      <h3>Summary</h3>
      <p>Updated llama.cpp from b8182 to b8185, incorporating 4 upstream commits with performance improvements.</p>
      <h3>Notable Changes</h3>
      <h4>🚀 Performance Improvements</h4>
      <ul>
      <li><strong>b8184</strong>: vulkan: improve partial offloading performance on AMD (<a href="https://github.com/ggml-org/llama.cpp/pull/19976">#19976</a>)
      <ul>
      <li>I saw a big difference between Vulkan and ROCm performance in partial offloads. I narrowed it down to transfer speeds for weight transfer from CPU to GPU with offloaded ops. One possible explanation is that using the dedicated transfer queue on AMD may be faster than using a compute queue, so I implemented using a transfer queue for async transfers as well and synchronizing transfers using a timeline semaphore. This does improve performance.</li>
      <li>Then I checked and found that the dedicated transfer queue on AMD is not exposed by the Linux driver by default, so it's not actually being used. The difference comes from using a second queue (the graphics queue) for transfers, so I assume the issue was the compute queue being congested with other work.</li>
      <li>This helps on AMD RDNA4, but not on GCN and not on Nvidia. I couldn't test Intel because the Linux driver only exposes a single queue.</li>
      </ul>
      </li>
      <li><strong>b8185</strong>: ggml-cpu: optimise s390x multiply extend instructions (<a href="https://github.com/ggml-org/llama.cpp/pull/20032">#20032</a>)
      <ul>
      <li>This PR optimizes the multiply extend vector instructions for Q4_0, Q4_K, Q5_K, and Q6_K quantizations by using the fused multiply-add instruction instead of separating them into multiple instruction calls. We notice a performance improvement of about 28.77% and 16.35% for Prompt Processing and Token Generation respectively.</li>
      <li>Old Instruction Set</li>
      <li>
      <pre lang="assembly"><code>
      

🐛 Bug Fixes

  • b8182: vendors: update miniaudio library to 0.11.24 (#19914)
  • b8183: cuda: fix grid.y overflow in non-contiguous dequantize/convert kernels (#19999)
    • The dequantize_block and convert_unary kernels pass ne01 directly as the CUDA grid y-dimension, but grid.y is limited to 65535. When ne01 exceeds this, the kernel launch fails with cudaErrorInvalidConfiguration.
    • This happens when using llama-server with flash attention, quantized KV cache, multiple parallel slots, and long context. With multiple slots the KV caches are non-contiguous, so the NC dequantization path is taken, and ne01 (the KV cache length) ends up as grid.y.
    • The grid.z dimension was already capped at 65535 with a grid-stride loop. This applies the same pattern to grid.y.
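
The grid-stride pattern mentioned for b8183 can be simulated on the CPU. The sketch below is a Python model of the idea, assuming one block per row up to the cap; `rows_for_block` and `launch` are hypothetical names, and this is not the CUDA kernel code.

```python
MAX_GRID_Y = 65535  # CUDA limit on grid.y quoted in the entry above


def rows_for_block(block_y: int, grid_y: int, n_rows: int) -> range:
    """Rows a single block visits in a grid-stride loop over the y dimension."""
    return range(block_y, n_rows, grid_y)


def launch(n_rows: int) -> list[int]:
    """Count how often each row is processed when grid.y is capped."""
    grid_y = min(n_rows, MAX_GRID_Y)  # cap instead of passing the row count directly
    visits = [0] * n_rows
    for block_y in range(grid_y):
        for row in rows_for_block(block_y, grid_y, n_rows):
            visits[row] += 1
    return visits


visits = launch(200_000)
print(all(v == 1 for v in visits))
```

Every row is visited exactly once even though the row count is larger than the grid dimension, which is the property the fix relies on.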

Full Commit Range


2026-03-01: Update to llama.cpp b8182

Summary

Updated llama.cpp from b8087 to b8182, incorporating 76 upstream commits with breaking changes, new features, and performance improvements.

Notable Changes

⚠️ Breaking Changes

  • b8098: models : dedup qwen35 graphs (#19660)
    • cont #19597
    • Use the new struct llm_build_delta_net_base to deduplicate the delta net graphs from Qwen35 models.
    • TODO:
  • b8101: llama : use output_resolve_row() in get_logits_ith/get_embeddings_ith (#19663)
    • This commit updates get_logits_ith(), and get_embeddings_ith() to use output_resolve_row() to resolve the batch index to output row index.
    • The motivation for this is to remove some code duplication between these functions.
  • b8140: hexagon refactor all Ops to use local context struct (#19819)
    • This PR completes the refactoring of all Hexagon Ops to use a local context structure. This allows each Op to precompute and cache more state. The refactoring also removes redundant function wrappers and unnecessary boilerplate.
    • Most Ops now use DMA for fetching inputs and writing back outputs.
    • The main loops of RoPE and Unary Ops have been completely rewritten for better DMA pipelining.
  • b8146: ggml/gguf : prevent integer overflows (#19856)
    • Strengthen integer overflow validation in ggml/gguf
    • Impose max limits for string length and array elements of GGUF metadata
    • Remove deprecated ggml_type_sizef()
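
As a rough illustration of the overflow validation described for b8146, the sketch below shows overflow-checked size arithmetic in Python. It assumes a 64-bit `size_t`; the helper names are hypothetical, and the actual limits imposed upstream are not reproduced here.

```python
SIZE_MAX = 2**64 - 1  # assumption: 64-bit size_t


def checked_mul(a: int, b: int, limit: int = SIZE_MAX) -> int:
    """Multiply two non-negative sizes, raising instead of wrapping around."""
    if a < 0 or b < 0:
        raise ValueError("sizes must be non-negative")
    if a != 0 and b > limit // a:
        raise OverflowError(f"{a} * {b} exceeds {limit}")
    return a * b


def checked_add(a: int, b: int, limit: int = SIZE_MAX) -> int:
    """Add two non-negative sizes, raising instead of wrapping around."""
    if a < 0 or b < 0:
        raise ValueError("sizes must be non-negative")
    if b > limit - a:
        raise OverflowError(f"{a} + {b} exceeds {limit}")
    return a + b


def array_nbytes(n_elements: int, element_size: int) -> int:
    """Byte size of a metadata array, validated before any allocation."""
    return checked_mul(n_elements, element_size)


print(array_nbytes(4, 8))
```

The check is done with a division before the multiplication, so it also works in languages where the product itself would silently wrap.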

🆕 New Features

  • b8091: ggml webgpu: shader library organization (#19530)
    • We've been converting many of the existing WGSL shaders into a format that allows for efficient just-in-time compilation of variants used in specific model graphs, as well as sets them up for better performance tuning down the road. This PR makes a pretty large organizational change, moving the shader preprocessing, compilation, and caching into a new ggml_webgpu_shader_lib structure. As part of this, the existing matrix multiplication shaders were also converted in to the JIT compilation format (using the wgsl preprocessor), along with get_rows and scale.
    • This new shader library class also opens up the opportunity for tons of interesting specialization in the WebGPU backend. For example, if you have a shader specialized for a particular GPU vendor/architecture in WGSL, it should be pretty easy to hook it into the logic for choosing the right shader/pipeline.
    • It's always nice to have a PR that removes more lines of code than it adds too :)
  • b8091: Add oneliner for batch quantization (#17)
  • b8100: full modern bert support (#18330)
    • Adds support for HF-to-GGUF conversion and execution in llama.cpp. This PR continues from my recent granite-embd-support PR (https://github.com/ggml-org/llama.cpp/pull/15641), which covers a ModernBERT-based model, and adds some tweaks. I have run cosine similarity tests with this script:
    • from sentence_transformers import SentenceTransformer
  • b8102: model : Add tokenizer from LFM2.5-Audio-1.5B (#19687)
  • b8106: model: add JAIS-2 architecture support (#19488)
  • b8106: CUDA: fix padding of GQA to power of 2 in FA (#19115)
    • Fixes https://github.com/ggml-org/llama.cpp/issues/19112 , the issue was introduced with https://github.com/ggml-org/llama.cpp/pull/19092 .
    • The MMA CUDA FlashAttention kernel uses a stream-k decomposition to treat the four-dimensional input tensors as one continuous dimension to split across streaming multiprocessors. However, in conjunction with the GQA-specific optimizations in the MMA kernel this is only correct if the number of Q columns per CUDA block exactly divides n_gqa. Otherwise the wrong Q and K/V heads will be associated and the result will be wrong (if there is only a single K/V head this doesn't matter, so it was not detected in testing).
    • This PR extends the 4D space on master to a 5D space by splitting the "z" dimension with the number of Q heads into one dimension for the number of K/V heads and another dimension for the number of Q heads per K/V head. This then makes it possible to simply pad the Q columns per CUDA block to a power of 2.
  • b8116: ggml-quants : weighted rounding algorithms with cumulative search (#12557)
    • This adds proper imatrix support to TQ1_0 and TQ2_0, in addition to improving the rounding algorithm used for Q3_K, IQ4_NL, IQ4_XS (both with and without imatrix), as well as when using imatrix with Q4_0 and Q5_0.
    • This is backward and forward compatible with other versions of llama.cpp.
    • Since this doesn't change the format of the types, only how the values are rounded when quantized, even previous (or current) versions of llama.cpp can use quants made with this PR.
  • b8117: ggml-cpu: add RVV vec dot kernels for quantization types (#18784)
    • This PR adds RVV vector dot kernels for a number of quantization types.
    • Added the following RVV kernels:
    • | Kernel | VLEN |
  • b8118: common : merge qwen3-coder and nemotron nano 3 parsers (#19765)
    • Users are experiencing several issues with Qwen3-Coder-Next. Until #18675 is merged in, this PR serves as a stop-gap by replacing the existing Qwen3-Coder parsing with the Nemotron Nano 3 PEG parsing variant already present.
    • This PR also adds parallel tool calling and fixes JSON schema support.
    • fixes #19382
  • b8123: Add a build target to generate ROCm artifacts using ROCm 7.2 (#19433)
    • This builds the following targets:
    • gfx1151
    • gfx1150
  • b8128: model: Add Kanana-2 model support (#19803)
  • b8131: jinja: correct stats for tojson and string filters (#19785)
  • b8142: vulkan: fix coopmat1 without bf16 support (#19793)
    • This should fix the CI failure on lavapipe. lavapipe added coopmat1 support recently, but does not have bf16 support, so it falls back to the scalar path. This fallback didn't have quite the same tile size logic for subgroupsize=8 as when going through the scalar path directly.
  • b8143: Vulkan Scalar Flash Attention Refactor (#19625)
    • This started out as an attempt to go through the scalar FA version and add proper float16 support to improve AMD and Intel performance and went quite a bit further. @jeffbolznv Sorry about the amount of changes, let me know if there's something I can do to make the review easier. Please also let me know if you have architectural concerns. Flash Attention has so many dimensions and making it work well on so much hardware and models is pretty hard. I had to spend quite a lot of time figuring out and fixing regressions on specific configurations.
    • AI-generated summary of changes
  • b8149: gguf : fix ftell/fseek for Windows (#19870)
    • Regression introduced in #19856.
    • This changes the ftell/fseek calls to use _ftelli64/_fseeki64 on Windows, and ftello/fseeko for POSIX systems.
    • long on Windows is always 32-bit. Since that would cause an overflow on large files, ftell/fseek fails and nbytes_remain() returns 0.
  • b8155: common : add more aliases for sampler CLI params (#19797)
    • Adds two CLI argument aliases for sampler parameters:
    • --top-n-sigma (for existing --top-nsigma)
    • --temperature (for existing --temp)
  • b8161: jinja : correct default size for string slices (#19913)
    • As of b8157, when trying to use string slices in a chat template, and the slice does not specify end index (e.g. content[1 : ]), no output will be emitted since the default end index is calculated only for arrays, and remains 0 for strings. This PR adds handling for strings, and should be complete for currently supported data types.
  • b8164: llama: Add option to merge gate and exp weights (#19139)
    • Continuing on #18740 and #18866, add option --fuse_gate_up_exps to convert_hf_to_gguf.py.
    • I've just added the gate_up tracking for deepseek2 (GLM 4.7 flash) and gpt-oss - although for gpt-oss we need even more changes (it goes through the generate_extra_tensors for generating expert weights). This PR is not complete as we would need to add this check in all MoE models and their tensors, but putting it out there in any case.
    • on 5090:
  • b8165: kv-cache : fix can_shift() check to take into account M-RoPE (#19928)
    • fix #19915
    • KV cache shift is not supported with M-RoPE (yet).
  • b8169: ggml : fix AMX and add batched support (#19925)
    • llama-perplexity -hf ggml-org/Qwen3-0.6B-GGUF:Q4_0 -f wikitext-2-raw/wiki.test.raw -c 2048 -b 2048 --chunks 2
    • before this commit:
  • b8175: ggml-cpu: add repack for mxfp4 (#19738)
    • This is just a faithful copy of the iq4_nl quant to mxfp4 with just the scale loading changed. Tested on AVX2 only, would appreciate tests on ARM and AVX512. Perplexity is already high for gpt-oss-20b but I see it is the same between master and this branch
    • | Model | Test | t/s master | t/s mxfp4-repack-cpu | Speedup |
    • |:----------------------|:-------|-------------:|-----------------------:|----------:|
  • b8179: CUDA: add CDNA3 MFMA support for flash attention MMA kernel (#19806)
    • Adds MI300X (gfx942) MFMA tensor core flash attention to fattn-mma-f16.cuh. MI300X now routes to BEST_FATTN_KERNEL_MMA_F16 instead of the tile-based fallback.
    • Uses v_mfma_f32_16x16x16_f16 (FP16 inputs, FP32 accumulate) with wavefront64
    • Supports head sizes 64, 80, 96, 112, 128 via MMA; others fall back to VEC
  • b8180: Add model metadata loading from huggingface for use with tests requiring real model data (#19796)
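
The string-slice bug described under b8161 above comes down to how a missing end index is defaulted. The following Python sketch models both behaviours; `slice_value` and its `fixed` flag are hypothetical and only mirror the description, not the template engine's actual code.

```python
def slice_value(value, start=None, end=None, *, fixed=True):
    """Slice a list or string, defaulting a missing end index.

    With fixed=False the default end is only computed for lists and stays 0
    for strings, which is the behaviour the entry describes as the bug.
    """
    if start is None:
        start = 0
    if end is None:
        is_array = isinstance(value, list)
        end = len(value) if (is_array or fixed) else 0
    return value[start:end]


print(repr(slice_value("hello", 1, fixed=False)))  # old behaviour: empty output
print(repr(slice_value("hello", 1)))               # defaulting end to the string length
print(slice_value([1, 2, 3], 1, fixed=False))      # lists were already handled
```

This corresponds to a template expression such as `content[1 : ]`, where only the start index is given.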

🚀 Performance Improvements

  • b8087: opencl: refactor expm1 and softplus (#19404)
    • This PR refactors the EXPM1 and Softplus OpenCL operators to improve code clarity and reduce duplication.
  • b8099: powerpc: add FP16 MMA path for Q4/Q8 matmul (#19709)
    • Avoid xvi8ger4pp signed→unsigned bias correction by dequantizing Q4/Q8 inputs to FP16 and using FP16×FP16→FP32 MMA. This removes post-processing overhead and improves performance.
    • Performance Impact:
    • 1.5 ~ 2x improvement in PP_Speed for Q4 and Q8 Models, measured with llama-bench and llama-batched-bench. Q8 Model: granite-4.0-h-micro-Q8_0.gguf (from huggingface) Q4 Model: Meta-Llama3-8b Q4 model (generated with llama-quantize from f32 model)
  • b8121: Improve CUDA graph capture (#19754)
    • Currently, CUDA graphs are eagerly enabled on the first call to ggml_backend_cuda_graph_compute. If the graph properties keep changing (4+ consecutive updates), the graph is permanently disabled. This is suboptimal because:
    • The first call always incurs CUDA graph capture overhead even if the graph is unstable
    • Once permanently disabled, CUDA graphs never re-enable even after the graph stabilizes (e.g., switching from prompt processing to decode)

🐛 Bug Fixes

  • b8088: common : make small string helpers as inline functions (#19693)
    • Also use string_view when it makes sense and fix some corner cases.
  • b8089: vulkan: split mul_mat into multiple dispatches to avoid overflow (#19509)
    • The batch dimensions can be greater than the max workgroup count limit, in which case we need to split into multiple dispatches and pass the base index through a push constant.
    • Fall back for the less common p021 and nc variants.
    • Fixes #19471.
  • b8095: ggml webgpu: Fix bug in dispatching large matrix-vector multiplication (#19535)
  • b8105: CUDA: fix kernel selection logic for tile FA (#19686)
  • b8109: vulkan: fix MMQ shader push constants and multi-dispatch (#19732)
    • We forgot to update the mul_mmq shader in #19509. This should fix #19710.
  • b8112: common : fix gpt-oss Jinja error with content and thinking on tool-call messages (#19704)
    • Erase the content from the adjusted message after copying reasoning_content to thinking.
    • Regression from #16937
    • Fixes #19703.
  • b8113: common : fix Step-3.5-Flash format detection and thinking support (#19635)
    • Step-3.5-Flash (196B MoE) uses the same XML tool call output format as Qwen3-Coder and Nemotron 3 Nano (`<tool_call><function=...><parameter=...>`), but its template lacks the bare `` and plural `` markers in the tool enumeration section. The previous detection logic required all five XML markers, so Step-3.5-Flash fell through to Hermes 2 Pro, which doesn't call `func_args_not_string()`. Tool arguments stayed as JSON strings and templates using `arguments|items` crashed.
    • Reported by multiple users in #19283:
    • Leaked tool tokens with Codex (@tarruda)
  • b8115: test: mul_mat tests with huge batch size (#19519)
    • tests for #19471.
    • vulkan fix is in #19509.
  • b8119: hexagon : fix build release (#19444) (#19587)
    • fixes: #19444
    • cc: @max-krasnyansky
  • b8130: common : fix improper trimming in XML parser on complete message (#19805)
    • Fix courtesy of @julio75012. Although his use case has already been fixed, I'm submitting this PR to address other models that exhibit similar behavior.
    • The issue is that the XML parser trims partially matched tags. The reason > was trimmed from Seed-OSS is because tool_sep = >, and the reason a trailing " is trimmed from MiniMax/Kimi-K2 is because tool_sep = ">. This trimming should only happen when the message is still partial. Once the full message has been received, no trimming should occur.
    • Fixes #19795
  • b8141: vulkan: fix data race in mul_mat_id shader (#19790)
    • I've been working on automated data race detection (see https://github.com/KhronosGroup/Vulkan-ValidationLayers/pull/11717), and it found a data race in the mul_mat_id shaders. All invocations in a subgroup were storing the same value to shared memory, but this is still technically a data race. Just store on the first invocation.
  • b8148: models : fix graph splits (#19866)
    • fix #19860
    • fix #19864
    • Ensure the node order of Qwen 3.5 graphs is suitable for multi-GPU systems.
  • b8156: vulkan: check for memory overlap before doing fusion (#19768)
    • This fixes a class of potential fusion bugs where the destination could overwrite a source tensor while other elements of the same op still need those source values. Add some logic to compare the memory ranges and disable fusion if the bad case is detected. Some operations contribute to the destination in an elementwise fashion and can do a more relaxed check where exact overlap is allowed.
    • In practice, I see this disabling TOPK_MOE fusion in some models (gpt-oss, qwen3) when there's more than one row, and this does appear to be a latent bug.
  • b8157: [SYCL] Fix binbcast.cpp:200: GGML_ASSERT(s10 == 1) failed of Qwen3-Coder-Next-Q3_K_M.gguf (#19889)
  • b8159: gguf : avoid too many file size calls (#19919)
    • cont #19856
    • fix #19912
    • No need to do file calls on each read. Instead, determine the remaining bytes once at the start and after that update the value on each read.
  • b8168: vulkan: fix fp16 Flash Attention on Windows AMD RDNA2 and below (#19921)
    • For some reason a f16vec4 subgroupShuffleXor is broken on RDNA2 and lower. I found a workaround by shuffling vec4 instead. This also fixes fp16 Flash Attention on AMD GCN, so I removed the fp32 fallback.
    • Fixes #19881 and also the issue reported here: https://github.com/ggml-org/llama.cpp/pull/19625#issuecomment-3940674420
    • @masamaru-san @DeryabinIvan Please try this fix and let me know if it works for you.
  • b8171: [SYCL] Replace the magic number 768 by max work group size to support iGPU (#19920)
  • b8172: [CMake] Enable test-chat out of tree build (#19558)
    • The test-chat binary relies on model files that it tries to find. However, when configuring the build directory to be parallel to the source tree those heuristics fail.
    • This sets the working directory for the test executable to the source tree, which resolves the issue.
    • I validated locally with a build parallel to the source tree and nested inside the source tree.
  • b8182: vendors: update miniaudio library to 0.11.24 (#19914)
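
The memory-overlap check described in the b8156 entry above can be illustrated with a small Python sketch. This is not the Vulkan backend's code: `MemRange`, `ranges_overlap`, and `can_fuse` are hypothetical names, and modelling each tensor as a single half-open byte range in one buffer is a simplifying assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemRange:
    """Half-open byte range [start, start + size) in one buffer (hypothetical model)."""
    start: int
    size: int

    @property
    def end(self) -> int:
        return self.start + self.size


def ranges_overlap(a: MemRange, b: MemRange) -> bool:
    """True if the two byte ranges share at least one byte."""
    return a.start < b.end and b.start < a.end


def can_fuse(dst: MemRange, srcs: list, elementwise: bool) -> bool:
    """Allow fusion only if writing dst cannot clobber a source that is still needed.

    For elementwise ops, an exact overlap (dst is the same range as a source)
    is tolerated; any other overlap disables fusion.
    """
    for src in srcs:
        if not ranges_overlap(dst, src):
            continue
        if elementwise and src == dst:
            continue  # relaxed check: exact overlap is allowed
        return False
    return True
```

With this model, a destination that partially overlaps a source always blocks fusion, while a destination that aliases a source exactly is accepted only for elementwise operations.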

Additional Changes

27 minor improvements: 3 documentation, 19 examples, 5 maintenance.

Full Commit Range


2026-02-18: Update to llama.cpp b8087

Summary

Updated llama.cpp from b8053 to b8087, incorporating 28 upstream commits with breaking changes, new features, and performance improvements.

Notable Changes

⚠️ Breaking Changes

  • b8057: ggml-cpu: FA add GEMM microkernel (#19422)
    • This PR contains the following improvements for the tiled FA kernel
    • Add a simd gemm for float32 in the tiled FA kernel.
    • Tune tile sizes for larger context
  • b8075: Remove annoying warnings (unused functions) (#18639)
    • When using common.h as a library, these functions produce annoying warnings about not being used.
    • Using "static" linking for these also doesn't make much sense because it potentially increases executable size with no gains.

🆕 New Features

  • b8059: ggml : avoid UB in gemm ukernel + tests (#19642)
    • cont #19422
    • Reword the GEMM ukernel to not trip the compiler's aggressive loop optimization warnings. It's better to avoid the global pragma as it might be useful for other static analysis
    • Add test-backend-ops with BS=75 to exercise the new tiled SIMD implementation
  • b8061: cmake : check if KleidiAI API has been fetched (#19640)
    • This commit addresses a build issue with the KleidiAI backend when building multiple CPU backends. Commit 3a00c98584e42a20675b6569d81beadb282b0952 ("cmake : fix KleidiAI install target failure with EXCLUDE_FROM_ALL") introduced a change where FetchContent_Populate is called instead of FetchContent_MakeAvailable; the latter handles this case because it is idempotent, whereas FetchContent_Populate is not.
    • I missed this during my review and I should not have committed without verifying the CI failure, sorry about that.
  • b8068: ggml: aarch64: Implement SVE in Gemm q4_k 8x8 q8_k Kernel (#19132)
    • This PR introduces support for SVE (Scalable Vector Extensions) kernels for the q4_K_q8_K gemm using i8mm and vector instructions. ARM Neon support for this kernel was added in PR #16739.
  • b8070: models : deduplicate delta-net graphs for Qwen family (#19597)
    • cont #19375
    • Add llm_build_delta_net_base for common delta net builds. Currently used only by qwen3next
    • Rename llm_graph_context_mamba -> llm_build_mamba_base
  • b8071: Adjust workaround for ROCWMMA_FATTN/GFX9 to only newer ROCm versions (#19591)
  • b8073: Add support for Tiny Aya Models (#19611)
    • This PR adds native support for the CohereLabs/tiny-aya family of models in llama.cpp. These models use a distinct BPE pre-tokenizer (tiny_aya) with a custom digit-grouping regex.
    • Tagging @ngxson for visibility.
  • b8076: feat: add proper batching to perplexity (#19661)
    • This PR updates llama-perplexity to allow for batching, similar to how llama-imatrix works. The idea is that you can increase --batch-size / --ubatch-size to process multiple context chunks in a batch. This has limited application in VRAM-rich environments (e.g., if you're running the entire model in VRAM), but it makes a huge difference when using models in a mixed CPU/GPU setup, as it saves n_seq trips from CPU RAM to GPU VRAM per batch.
    • I've double-checked the before and after to make sure the resulting PPL and KLD look correct still.
  • b8077: convert_hf_to_gguf: add JoyAI-LLM-Flash tokenizer hash mapping to deepseek-v3 (#19651)
    • adding hash for jdopensource/JoyAI-LLM-Flash mapping to existing deepseek-v3
    • DeepseekV3ForCausalLM architecture already supported
    • moved GLM-4.7-Flash entry together with the other glm entries

🚀 Performance Improvements

  • b8053: models : optimizing qwen3next graph (#19375)
    • Rewording the ggml compute graph to avoid too many unnecessary copies.
    • M2 Ultra:
    • | Model | Test | t/s b7946 | t/s gg/qwen3-next-opt | Speedup |
  • b8058: ggml-cpu: optimize ggml_vec_dot_bf16 for s390x (#19399)
    • Similar to #18837, this pull request integrates the SIMD instruction set for BF16 on the s390x platform. We notice a 154.86% performance improvement for Prompt Processing. No performance difference was noticed for Token Generation.
    • | model | size | params | backend | threads | mmap | test | t/s |
    • | ------------------------------ | ---------: | ---------: | ---------- | ------: | ---: | --------------: | -------------------: |
  • b8064: cuda: optimize iq2xxs/iq2xs/iq3xxs dequantization (#19624)
    • While looking over quantizations I believe I found a few optimizations for iq2xxs/iq2xs/iq3xxs. With these changes, I get a 5-10% increase in flops in test-backend-ops for small n, and a few extra flops otherwise:
    • load all 8 int8 for a grid position in one load
    • calculate signs via popcnt instead of fetching from ksigns table
  • b8086: opencl: optimize mean and sum_row kernels (#19614)
    • This PR optimizes the mean op and sum_rows op for the OpenCL backend.
  • b8087: opencl: refactor expm1 and softplus (#19404)
    • This PR refactors the EXPM1 and Softplus OpenCL operators to improve code clarity and reduce duplication.
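
The b8064 entry above mentions computing signs via popcnt instead of a table lookup. The Python sketch below shows the general idea under one assumption that the entry does not state: seven sign bits are stored, and the eighth is chosen so that the total number of set bits is even. The function names are hypothetical and this is not the CUDA kernel itself.

```python
def signs_from_index(idx7: int) -> int:
    """Expand a 7-bit sign index to 8 sign bits.

    Assumption: bit 7 is the parity of the low 7 bits, so the result always
    has an even number of set bits. The parity comes from a population
    count rather than from a lookup table.
    """
    if not 0 <= idx7 < 128:
        raise ValueError("idx7 must fit in 7 bits")
    parity = bin(idx7).count("1") & 1  # popcount, then keep the low bit
    return idx7 | (parity << 7)


def apply_signs(values, signs: int):
    """Negate values[j] wherever bit j of signs is set."""
    return [-v if (signs >> j) & 1 else v for j, v in enumerate(values)]


# The equivalent lookup table, shown only for comparison with the computed form.
SIGN_TABLE = [signs_from_index(i) for i in range(128)]
```

Computing the parity bit on the fly trades one memory fetch for a couple of arithmetic instructions, which is the kind of trade the entry describes.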

🐛 Bug Fixes

  • b8056: cmake: fix KleidiAI install target failure with EXCLUDE_FROM_ALL (#19581)
    • Fix for the bug #19501 by adding EXCLUDE_FROM_ALL to the FetchContent_Declare call for KleidiAI. This properly excludes the KleidiAI library from both the all and install targets, preventing CMake install failures when building with GGML_CPU_KLEIDIAI=ON. The KleidiAI source files are still compiled directly into libggml-cpu.so, so functionality is preserved.
  • b8060: context : fix output reorder with backend sampling (#19638)
    • fix #19629
    • Some of the sampling arrays could remain in invalid state after a sequence of enabling/disabling samplers.
  • b8069: graph : fix KQ mask, lora, cvec reuse checks (#19644)
    • cont #14482
    • Graph reuse was never triggered for parallel decoding with non-unified KV cache due to incorrect check of the KQ mask shape.
    • Also fix the checks for reusing lora and control vectors.
  • b8071: Add a workaround for compilation with ROCWMMA_FATTN and gfx9 (#19461)
    • There is an upstream problem [1] with AMD's LLVM 22 fork and rocWMMA 2.2.0 causing compilation issues on devices without native fp16 support (CDNA devices).
    • The specialized types aren't resolved properly:
    • </code></pre>
      </li>
      </ul>
      </li>
      <li><strong>b8083</strong>: ggml: ggml-cpu: force-no-lto-for-cpu-feats (<a href="https://github.com/ggml-org/llama.cpp/pull/19609">#19609</a>)
      <ul>
      <li>When LTO is enabled in the build environment, it is applied to every build. The feature detection logic is fragile, however, and LTO causes illegal-instruction errors in it. This change disables LTO for the feature detection code to prevent cross-module optimization from inlining architecture-specific instructions into the score function. Without this, LTO can cause SIGILL when loading backends on older CPUs (e.g., loading the power10 backend on power9 crashes before the feature check runs).</li>
      <li>Please also see <a href="https://salsa.debian.org/deeplearning-team/ggml/-/merge_requests/6">https://salsa.debian.org/deeplearning-team/ggml/-/merge_requests/6</a> for more information about the issue we saw on ppc64el builds with LTO enabled on Ubuntu.</li>
      </ul>
      </li>
      </ul>
      <h3>Additional Changes</h3>
      <p>8 minor improvements: 1 documentation, 3 examples, 4 maintenance.</p>
      <h3>Full Commit Range</h3>
      <ul>
      <li>b8053 to b8087 (28 commits)</li>
      <li>Upstream releases: <a href="https://github.com/ggml-org/llama.cpp/compare/b8053...b8087">https://github.com/ggml-org/llama.cpp/compare/b8053...b8087</a></li>
      </ul>
      <hr />
      <h2>2026-02-14: Update to llama.cpp b8040</h2>
      <h3>Summary</h3>
      <p>Updated llama.cpp from b8027 to b8040, incorporating 11 upstream commits with breaking changes, new features, and performance improvements.</p>
      <h3>Notable Changes</h3>
      <h4>⚠️ Breaking Changes</h4>
      <ul>
      <li><strong>b8027</strong>: llama : remove deprecated codecvt (<a href="https://github.com/ggml-org/llama.cpp/pull/19565">#19565</a>)
      <ul>
      <li>Using the same conversion function ensures a consistent matching between the regex pattern and the text</li>
      </ul>
      </li>
      <li><strong>b8037</strong>: common : update download code (<a href="https://github.com/ggml-org/llama.cpp/pull/19573">#19573</a>)
      <ul>
      <li>This PR removes the legacy migration code for etag and forces a download if no etag file is found.</li>
      </ul>
      </li>
      </ul>
      <h4>🆕 New Features</h4>
      <ul>
      <li><strong>b8028</strong>: Kimi Linear fix conv state update (<a href="https://github.com/ggml-org/llama.cpp/pull/19531">#19531</a>)
      <ul>
      <li>The current implementation updates the conv state incorrectly, which corrupts state when running in parallel in llama-server. This PR fixes that.</li>
      <li>
      <pre><code>
      
  • b8030: CUDA: Do not mutate cgraph for fused ADDs (#19566)
    • We should try to minimize in-place changes to the incoming ggml_cgraph where possible (those should happen in a backend's graph_optimize function).
    • Modifying in-place leads to an additional, unnecessary graph capture step, as the CUDA backend stores the properties before modifying the graph in-place: we hit ggml_cuda_graph_node_set_properties via ggml_cuda_graph_update_required before entering ggml_cuda_graph_evaluate_and_capture.
    • Isolated from #19521
  • b8036: model: support GLM MoE DSA arch (NOTE: indexer is not yet supported) (#19460)
    • Ref upstream vllm PR: https://github.com/vllm-project/vllm/pull/34124
    • Important: this PR allows converting safetensors to GGUF while keeping the indexer tensors (for DeepSeek sparse attention), but they are left unused by the C++ code, so the quality will be suboptimal.

🚀 Performance Improvements

  • b8038: vulkan: restore -inf check in FA shaders (#19582)
    • For #19523.
    • I verified the performance is restored with llama-batched-bench.
  • b8040: hexagon: further optimizations and refactoring for flash attention (#19583)
    • The PR includes some more refactoring and optimizations for flash attention op/kernel:
    • Local fa_context that stores all precomputed values
    • More HVX usage (hvx_vec_expf, ...)

🐛 Bug Fixes

  • b8034: fix vulkan ggml_acc only works in 3d but not 4d (#19426)
  • b8035: ggml-cpu: arm64: Fix wrong memcpy length for q4_K block_interleave == 4 (#19575)
    • https://github.com/ggml-org/llama.cpp/issues/19561 reports issues with the stack for Q4_K.
    • I can't reproduce the issue locally, but the make_block_q4_Kx8 function would write 4 bytes past the end of the buffer, which could be the cause.
    • @taronaeo, since you found the problem, are you able to check if this patch fixes it?

Additional Changes

2 minor improvements: 1 examples, 1 maintenance.

  • b8033: cli : support --verbose-prompt (#19576)
    • Useful when debugging templates.
  • b8032: CUDA: loop over ne2*ne3 in case it overflows (#19538)

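Full Commit Range

  • b8027 to b8040 (11 commits)
  • Upstream releases: https://github.com/ggml-org/llama.cpp/compare/b8027...b8040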

2026-02-13: Update to llama.cpp b8018

Summary

Updated llama.cpp from b7958 to b8018, incorporating 44 upstream commits with breaking changes and new features.

Notable Changes

⚠️ Breaking Changes

  • b8004: common : remove unused token util functions (#19506)
    • This commit removes two unused functions, common_lcp and common_lcs. Their last usage was removed in commit 33eff4024084d1f0c8441b79f7208a52fad79858 ("server : vision support via libmtmd"), and they are no longer used anywhere in the codebase.
  • b8007: common : replace deprecated codecvt using parse_utf8_codepoint (#19517)

🆕 New Features

  • b7964: Support Step3.5-Flash (#19283)
  • b7966: metal : consolidate bin kernels (#19390)
    • Refactor and consolidate the implementation of the binary Metal kernels.
    • | Model | Test | t/s master | t/s gg/metal-bin-opt | Speedup |
    • |:-------------------------|:-------|-------------:|-----------------------:|----------:|
  • b7972: CUDA: Fix non-contig rope (#19338)
  • b7973: [Model] Qwen3.5 dense and MoE support (no vision) (#19435)
    • I've gotten a bit tired of Llama.cpp missing all the zero-day releases, so this time I decided to make (or, more precisely, instructed Opus 4.6 to make, based on reference implementations and my guidelines for model adaptation) a conversion based on the Transformers PR ( https://github.com/huggingface/transformers/pull/43830/changes ). It's mostly based on Qwen3Next, but it's rebased on the common-delta-net PR ( #19125 ).
    • Here are the mock models I generated to test it: https://huggingface.co/ilintar/qwen35_testing/tree/main
    • Here are the conversion results from causal-verify-logits:
  • b7974: cmake: add variable to skip installing tests (#19370)
    • When packaging downstream, there's usually little point in installing tests. The default behaviour remains the same.
  • b7976: revert : "[Model] Qwen3.5 dense and MoE support (no vision) (#19435)" (#19453)
    • cont #19435
    • Taking a step back to implement support for Qwen3.5 properly.
  • b7981: chat: fix case where template accepts type content only (#19419)
  • b7982: cuda : extend GGML_OP_PAD to work with non-cont src0 (#19429)
    • Extend CUDA support
    • Remove redundant assert in CPU implementation
    • Add permuted PAD tests
  • b7983: CANN: Support MUL_MAT_ID in ACL graph (#19228)
    • Implement the ggml_cann_mul_mat_id_quant function to support quantized matrix multiplication for Mixture of Experts (MoE) architectures on the CANN backend.
  • b7988: ggml-cpu: arm64: q6_K repack gemm and gemv (and generic) implementations (dotprod) (#19360)
  • b7991: [WebGPU] Plug memory leaks and free resources on shutdown (#19315)
    • This diff destroys wgpu::Buffers and buffer pools on shutdown. It also fixes memory leaks on the heap, where we allocate backend, backend_ctx, buffer_ctx, and decisions on the heap but never delete them. These are either explicitly deleted or changed to be smart pointers.
    • We implement destructors for our buffer pool structs, webgpu_context struct and webgpu_global_context struct. Since webgpu_global_context is a refcounted smart pointer, it will destruct automatically when all thread contexts have been destroyed.
  • b7992: CUDA: Update CCCL-tag for 3.2 to final release from RC (#19486)
  • b7994: metal : consolidate unary ops (#19490)
    • cont #19390
    • Common implementation of the unary kernels
    • Extend support for non-cont src0
  • b7995: ggml : extend bin bcast for permuted src1 (#19484)
    • Remove CPU asserts preventing src1 from being permuted
    • Update CUDA kernels to support permuted src1
    • Add tests to exercise src1 permutation
  • b7998: hexagon: Add ARGSORT, DIV, SQR, SQRT, SUM_ROWS, GEGLU (#19406)
    • Catching up on the Op coverage for the Hexagon backend.
    • This PR improves Op coverage for Gemma-3N, LFM2 and other models.
    • All new Ops pass test-backend-ops (mostly in f32).
  • b8001: metal : extend l2_norm support for non-cont src0 (#19502)
    • Support non-cont src0
    • Support ne00 non-multiple of 4
  • b8005: ggml : unary ops support non-cont src0 + metal F16 unary ops (#19511)
    • cont #19490
  • b8006: opencl: add general Q6_K mm and Q4_K mv (#19347)
    • Although still slow, this should make Q4_K_M a bit more usable. Q4_K mv is not flattened yet. More specialized Q6_K and Q4_K mm and mv using transposed layouts will be added in follow up PRs.
  • b8008: hexagon: further optimization and tuning of matmul and dot kernels (#19407)
    • This PR adds support for computing 2x2 (2 rows x 2 cols) dot products in parallel.
    • This mostly helps prompt processing, which shows gains of 10+ T/S for most models.
    • Here are some numbers with Qwen3.
  • b8012: metal : update sum_rows kernel to support float4 (#19524)
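
The b8008 entry above describes computing 2x2 (2 rows x 2 cols) dot products in parallel. A scalar Python sketch of that tiling idea follows; `dot_2x2` is a hypothetical name, and the Hexagon kernels themselves use vector instructions that are not modelled here.

```python
def dot_2x2(a0, a1, b0, b1):
    """Compute the four dot products of rows (a0, a1) with columns (b0, b1).

    Each element is read once per step and reused for two products, which is
    the benefit of processing a 2x2 tile instead of four separate dot products.
    """
    n = len(a0)
    if not (len(a1) == len(b0) == len(b1) == n):
        raise ValueError("all vectors must have the same length")
    c00 = c01 = c10 = c11 = 0.0
    for x0, x1, y0, y1 in zip(a0, a1, b0, b1):
        c00 += x0 * y0
        c01 += x0 * y1
        c10 += x1 * y0
        c11 += x1 * y1
    return c00, c01, c10, c11
```

For example, `dot_2x2([1, 2], [3, 4], [5, 6], [7, 8])` yields the four entries of the 2x2 output tile in row-major order.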

🐛 Bug Fixes

  • b7958: MSVC regex fix (#19340)
    • Fix MSVC regex error:
    • Regex error: regex_error(error_stack): There was insufficient memory to determine whether the regular expression could match the specified character sequence.
  • b7965: metal : fix event synchronization in cpy_tensor_async (#19402)
  • b7987: ggml: use noexcept overload for is_regular_file in backend registration (#19452)
    • Using the noexcept std::filesystem::directory_entry::is_regular_file overload prevents abnormal termination when an error occurs (as caused by symlinks to nonexistent folders on Linux)
    • fixes issue #18560
  • b7989: test: fix IMROPE perf test case (#19465)
  • b7997: fix: correct typos 'occured' and 'occurences' (#19414)
    • Fixes minor spelling typos in comments:
    • occurred (1 instance in llama.h)
    • occurrences (3 instances in ngram-map.h and ngram-map.cpp)
  • b7999: common : improve download error reporting (#19491)
    • While debugging the new cpp-httplib, the current errors were unusable...
    • Here is a small patch to make life easier for the next person dealing with HTTP issues :)
  • b8011: Add a workaround for compilation with ROCWMMA_FATTN and gfx9 (#19461)
    • There is an upstream problem [1] with AMD's LLVM 22 fork and rocWMMA 2.2.0 causing compilation issues on devices without native fp16 support (CDNA devices).
    • The specialized types aren't resolved properly:
    • </code></pre>
      </li>
      </ul>
      </li>
      <li><strong>b8018</strong>: vendor : update cpp-httplib (<a href="https://github.com/ggml-org/llama.cpp/pull/19537">#19537</a>)
      <ul>
      <li>The 0.32 version had important bug fixes, but it wasn't working for us. We need the latest patches.</li>
      </ul>
      </li>
      </ul>
      <h3>Additional Changes</h3>
      <p>13 minor improvements: 3 documentation, 7 examples, 3 maintenance.</p>
      <h3>Full Commit Range</h3>
      <ul>
      <li>b7958 to b8018 (44 commits)</li>
      <li>Upstream releases: <a href="https://github.com/ggml-org/llama.cpp/compare/b7958...b8018">https://github.com/ggml-org/llama.cpp/compare/b7958...b8018</a></li>
      </ul>
      <hr />
      <h2>2026-02-06: Update to llama.cpp b7955</h2>
      <h3>Summary</h3>
      <p>Updated llama.cpp from b7926 to b7955, incorporating 24 upstream commits with breaking changes, new features, and performance improvements.</p>
      <h3>Notable Changes</h3>
      <h4>⚠️ Breaking Changes</h4>
      <ul>
      <li><strong>b7931</strong>: ggml-virtgpu: make the code thread safe (<a href="https://github.com/ggml-org/llama.cpp/pull/19204">#19204</a>)
      <ul>
      <li>This PR improves the code of the ggml-virtgpu backend to make it thread safe, by using mutex for accessing the host<>guest shared memory buffers, and by pre-caching, during the initialization, the constant values queried from the backend.</li>
      <li>The unused <code>buffer_type_is_host</code> method is also deprecated.</li>
      </ul>
      </li>
      <li><strong>b7933</strong>: spec : fix the check-rate logic of ngram-simple (<a href="https://github.com/ggml-org/llama.cpp/pull/19261">#19261</a>)
      <ul>
      <li>fix #19231</li>
      <li>For the <code>spec-simple</code> method, we don't need to keep track of the last length to rate-limit the generations. We can simply use an incremental counter. This makes the speculator work with "Regenerate" of last message or branching the conversation from previous messages.</li>
      <li>Also, removed <code>struct common_ngram_simple_state</code> - seemed a bit redundant.</li>
      </ul>
      </li>
      </ul>
      <h4>🆕 New Features</h4>
      <ul>
      <li><strong>b7928</strong>: ci : add sanitizer runs for server (<a href="https://github.com/ggml-org/llama.cpp/pull/19291">#19291</a>)
      <ul>
      <li>Reenable the server sanitizer builds + runs. The thread sanitizer is quite slow, so remains disabled for now.</li>
      <li><a href="https://github.com/ggerganov/tmp2/actions/runs/21629674042">https://github.com/ggerganov/tmp2/actions/runs/21629674042</a></li>
      </ul>
      </li>
      <li><strong>b7929</strong>: metal : add solve_tri (<a href="https://github.com/ggml-org/llama.cpp/pull/19302">#19302</a>)
      <ul>
      <li>Add <code>GGML_OP_SOLVE_TRI</code> implementation for Metal.</li>
      <li>| Model                  | Test   |   t/s master |   t/s gg/metal-solve-tri |   Speedup |</li>
      <li>|:-----------------------|:-------|-------------:|-------------------------:|----------:|</li>
      </ul>
      </li>
      <li><strong>b7935</strong>: tests : add non-cont, inplace rope tests (<a href="https://github.com/ggml-org/llama.cpp/pull/19296">#19296</a>)
      <ul>
      <li>ref <a href="https://github.com/ggml-org/llama.cpp/pull/18986#issuecomment-3841942982">https://github.com/ggml-org/llama.cpp/pull/18986#issuecomment-3841942982</a></li>
      <li>ref <a href="https://github.com/ggml-org/llama.cpp/issues/19128#issuecomment-3807441909">https://github.com/ggml-org/llama.cpp/issues/19128#issuecomment-3807441909</a></li>
      <li>ref <a href="https://github.com/ggml-org/llama.cpp/issues/19292">https://github.com/ggml-org/llama.cpp/issues/19292</a></li>
      </ul>
      </li>
      <li><strong>b7941</strong>: vendor : add missing llama_add_compile_flags (<a href="https://github.com/ggml-org/llama.cpp/pull/19322">#19322</a>)
      <ul>
      <li>Ensure <code>httplib</code> and <code>boringssl</code>/<code>libressl</code> are built with sanitizer options, see <a href="https://github.com/ggml-org/llama.cpp/pull/19291#discussion_r2761613566">https://github.com/ggml-org/llama.cpp/pull/19291#discussion_r2761613566</a></li>
      </ul>
      </li>
      <li><strong>b7946</strong>: metal : add diag (<a href="https://github.com/ggml-org/llama.cpp/pull/19330">#19330</a>)
      <ul>
      <li>Add implementation for GGML_OP_DIAG for the Metal backend</li>
      </ul>
      </li>
      </ul>
      <h4>🚀 Performance Improvements</h4>
      <ul>
      <li><strong>b7930</strong>: ggml-cpu: use LUT for converting e8->f32 scales on x86 (<a href="https://github.com/ggml-org/llama.cpp/pull/19288">#19288</a>)
      <ul>
      <li><code>perf</code> showed the e8m0->f32 function as a bottleneck. Use a LUT instead. Tested only on x86</li>
      <li>| Model                 | Test   |   t/s topk-cuda-refactor |   t/s mxfp4-cpu-scale |   Speedup |</li>
      <li>|:----------------------|:-------|-------------------------:|----------------------:|----------:|</li>
      </ul>
      </li>
      <li><strong>b7951</strong>: metal : adaptive CPU/GPU interleave based on number of nodes (<a href="https://github.com/ggml-org/llama.cpp/pull/19369">#19369</a>)
      <ul>
      <li>Put a bit more work on the main thread when encoding the graph. This helps to interleave better the CPU/GPU work, especially for larger graphs.</li>
      <li>| Model                    | Test   |   t/s master |   t/s gg/metal-adaptive-cpu-interleave |   Speedup |</li>
      <li>|:-------------------------|:-------|-------------:|---------------------------------------:|----------:|</li>
      </ul>
      </li>
      <li><strong>b7954</strong>: metal : skip loading all-zero mask (<a href="https://github.com/ggml-org/llama.cpp/pull/19337">#19337</a>)
      <ul>
      <li>Similar optimization as in #19281 to skip loading the all-zero mask blocks.</li>
      <li>| Model                 | Test    |   t/s master |   t/s gg/metal-fa-mask-zero-opt |   Speedup |</li>
      <li>|:----------------------|:--------|-------------:|--------------------------------:|----------:|</li>
      </ul>
      </li>
      </ul>
      <h4>🐛 Bug Fixes</h4>
      <ul>
      <li><strong>b7926</strong>: vulkan: disable coopmat1 flash attention on Nvidia Turing (<a href="https://github.com/ggml-org/llama.cpp/pull/19290">#19290</a>)
      <ul>
      <li>See <a href="https://github.com/ggml-org/llama.cpp/pull/19075#issuecomment-3820716090">https://github.com/ggml-org/llama.cpp/pull/19075#issuecomment-3820716090</a></li>
      </ul>
      </li>
      <li><strong>b7927</strong>: sampling : delegate input allocation to the scheduler (<a href="https://github.com/ggml-org/llama.cpp/pull/19266">#19266</a>)
      <ul>
      <li>fix #18622</li>
      <li>alt #18636</li>
      <li>Merge the sampler inputs into the main graph. This way the backend scheduler is responsible for allocating the memory which makes backend sampling compatible with pipeline parallelism</li>
      </ul>
      </li>
      <li><strong>b7936</strong>: model: (qwen3next) correct vectorized key_gdiff calculation (<a href="https://github.com/ggml-org/llama.cpp/pull/19324">#19324</a>)
      <ul>
      <li>Testing with the provided prompt from <a href="https://github.com/ggml-org/llama.cpp/issues/19305">https://github.com/ggml-org/llama.cpp/issues/19305</a></li>
      </ul>
      </li>
      <li><strong>b7938</strong>: debug: make common_debug_print_tensor readable (<a href="https://github.com/ggml-org/llama.cpp/pull/19331">#19331</a>)
      <ul>
      <li>Now using 4-space indentation</li>
      <li>The log is output to stdout, so that I can do <code>llama-eval-callback ... > debug.log</code></li>
      <li>
      <pre><code>
      
  • b7940: vendor: update cpp-httplib version (#19313)
    • ref: #19017
    • Sync the cpp-httplib library to fix #19017.
  • b7942: Fix missing includes in metal build (#19348)
  • b7943: vulkan: fix non-contig rope (#19299)
    • For #19296.
  • b7945: vulkan: fix GPU deduplication logic. (#19222)
    • As reported in https://github.com/ggml-org/llama.cpp/issues/19221, the (same uuid, same driver) logic is problematic for Windows with an Intel iGPU.
    • Let's just avoid filtering for MoltenVK, which is Apple-specific, and keep the logic the same as before 88d23ad5: just dedup based on UUID.
    • Verified that macOS + 4xVega still reports 4 GPUs with this version.
  • b7952: cuda : cuda graphs now compare all node params (#19383)

Additional Changes

5 minor improvements: 1 examples, 4 maintenance.

  • b7932: completion : simplify batch (embd) processing (#19286)
    • This commit simplifies the processing of embd by removing the for loop that currently exists which uses params.n_batch as its increment. This commit also removes the clamping of n_eval as the size of embd is always at most the size of params.n_batch.
    • The motivation is to clarify the code as it is currently a little confusing when looking at this for loop in isolation and thinking that it can process multiple batches.
  • b7944: vulkan: Set k_load_shmem to false when K is too large (#19301)
  • b7947: vendor : update BoringSSL to 0.20260204.0 (#19333)
  • b7950: vulkan: Preprocess FA mask to detect all-neg-inf and all-zero. (#19281)
    • Write out a 2-bit code per block and avoid loading the mask when it matches these two common cases.
    • Apply this optimization when the mask is relatively large (i.e. prompt processing).
  • b7955: vulkan: make FA mask/softcap enables spec constants (#19309)
    • This is stacked on #19281 (merged).
    • This allows the compiler to do a bit better at overlapping loads and math (e.g. loading V can start while computing Q*K^t is still happening). Worth a couple percent for coopmat2, less for coopmat1/scalar.
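
The mask preprocessing described in the b7950 entry above (a 2-bit code per block, with the mask load skipped for the two common cases) can be sketched in Python. The code values, the function names, and the one-dimensional block layout are all assumptions made for illustration; the actual shader works on tiles of the attention mask.

```python
import math

# Hypothetical 2-bit codes; the real encoding is not given in the entry.
MIXED, ALL_ZERO, ALL_NEG_INF = 0, 1, 2


def classify_block(block):
    """Return the code for one block of mask values."""
    if all(v == 0.0 for v in block):
        return ALL_ZERO
    if all(v == -math.inf for v in block):
        return ALL_NEG_INF
    return MIXED


def classify_mask(mask, block_size):
    """Split a flat mask into blocks and classify each one."""
    return [
        classify_block(mask[i:i + block_size])
        for i in range(0, len(mask), block_size)
    ]
```

A consumer can then skip loading any block coded `ALL_ZERO` (the mask adds nothing) or `ALL_NEG_INF` (the block is fully masked out) and only read the mask for `MIXED` blocks.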

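Full Commit Range

  • b7926 to b7955 (24 commits)
  • Upstream releases: https://github.com/ggml-org/llama.cpp/compare/b7926...b7955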

2026-02-03: Update to llama.cpp b7921

Summary

Updated llama.cpp from b7907 to b7921, incorporating 11 upstream commits with new features.

Notable Changes

🆕 New Features

  • b7907: ggml-backend: fix async set/get fallback sync (#19179)
    • While working on an implementation for backend-agnostic tensor parallelism I found what I believe to be a bug in the ggml backend code. For a minimal implementation I did at first not implement set_tensor_async and get_tensor_async assuming that I could just rely on the synchronous fallback and implement those later. However, set_tensor_async and get_tensor_async do not call ggml_backend_synchronize for their fallback so I got incorrect results. This PR adds the corresponding calls.
  • b7909: metal : support virtual devices (#18919)
    • Support virtual Metal devices. Allows simulating multi-GPU environments on Mac using the new GGML_METAL_DEVICES environment variable.
    • </code></pre>
      </li>
      <li>GGML_METAL_DEVICES=4 ./bin/llama-completion -m [model.gguf]</li>
      </ul>
      </li>
      <li><strong>b7919</strong>: support infill for Falcon-H1-Tiny-Coder (<a href="https://github.com/ggml-org/llama.cpp/pull/19249">#19249</a>)
      <ul>
      <li>Added FIM tokens used in Falcon-H1-Tiny-Coder (see <a href="https://tiiuae-tiny-h1-blogpost.hf.space/#fim-format">https://tiiuae-tiny-h1-blogpost.hf.space/#fim-format</a>, <a href="https://huggingface.co/tiiuae/Falcon-H1-Tiny-Coder-90M/blob/main/tokenizer_config.json#L1843">https://huggingface.co/tiiuae/Falcon-H1-Tiny-Coder-90M/blob/main/tokenizer_config.json#L1843</a>) to make the llama-server <code>POST /infill</code> handle work.</li>
      </ul>
      </li>
      <li><strong>b7921</strong>: ggml: added cleanups in ggml_quantize_free (<a href="https://github.com/ggml-org/llama.cpp/pull/19278">#19278</a>)
      <ul>
      <li>Add missing cleanup calls for IQ2_S, IQ1_M quantization types and IQ3XS with 512 blocks during quantization cleanup.</li>
      </ul>
      </li>
      </ul>
      <h4>🐛 Bug Fixes</h4>
      <ul>
      <li><strong>b7917</strong>: opencl: refactor some ops, concat, repeat, tanh and scale (<a href="https://github.com/ggml-org/llama.cpp/pull/19226">#19226</a>)
      <ul>
      <li>Gemma-3n-E2B and Gemma-3n-E4B have been producing weird (not really gibberish but apparently not correct) output. Ended up refactoring these ops and the issue is now fixed. In addition, this refactor also improves perf a bit.</li>
      <li>On X Elite,</li>
      <li><code>gemma-3n-E2B-it-Q8_0</code>,</li>
      </ul>
      </li>
      </ul>
      <h3>Additional Changes</h3>
      <p>6 minor improvements: 4 documentation, 1 examples, 1 maintenance.</p>
      <h3>Full Commit Range</h3>
      <ul>
      <li>b7907 to b7921 (11 commits)</li>
      <li>Upstream releases: <a href="https://github.com/ggml-org/llama.cpp/compare/b7907...b7921">https://github.com/ggml-org/llama.cpp/compare/b7907...b7921</a></li>
      </ul>
      <hr />
      <h2>2026-02-02: Update to llama.cpp b7907</h2>
      <h3>Summary</h3>
      <p>Updated llama.cpp from b7885 to b7907, incorporating 14 upstream commits with breaking changes and new features.</p>
      <h3>Notable Changes</h3>
      <h4>⚠️ Breaking Changes</h4>
      <ul>
      <li><strong>b7903</strong>: Remove pipeline cache mutexes (<a href="https://github.com/ggml-org/llama.cpp/pull/19195">#19195</a>)
      <ul>
      <li>Now that <code>webgpu_context</code> is per-thread, we can remove mutexes from pipeline caches. We cannot remove mutexes from <code>webgpu_buf_pool</code> since they are allocated and freed in callback threads, and we cannot remove the mutex from the memset buffer pool since it is shared by all ggml buffers.</li>
      </ul>
      </li>
      </ul>
      <h4>🆕 New Features</h4>
      <ul>
      <li><strong>b7885</strong>: tests : add GQA=20 FA test (<a href="https://github.com/ggml-org/llama.cpp/pull/19095">#19095</a>)
      <ul>
      <li>Might be a good idea to have a test that exercises GQA=20 in order to catch any potential regressions.</li>
      </ul>
      </li>
      <li><strong>b7895</strong>: lookahead : add example for lookahead decoding (<a href="https://github.com/ggml-org/llama.cpp/pull/4207">#4207</a>)
      <ul>
      <li>ref #4157</li>
      <li>Think this should implement the approach from: <a href="https://lmsys.org/blog/2023-11-21-lookahead-decoding/">https://lmsys.org/blog/2023-11-21-lookahead-decoding/</a></li>
      <li>The approach requires large batches to be decoded, which in turn requires a lot of FLOPS even for single stream</li>
      </ul>
      </li>
      <li><strong>b7895</strong>: Prompt lookup decoding (<a href="https://github.com/ggml-org/llama.cpp/pull/4484">#4484</a>)
      <ul>
      <li>ref #4226</li>
      <li>This example implements the "Prompt Lookup Decoding" technique:</li>
      <li><a href="https://github.com/apoorvumang/prompt-lookup-decoding">https://github.com/apoorvumang/prompt-lookup-decoding</a></li>
      </ul>
      </li>
      <li><strong>b7898</strong>: ggml-hexagon: flash-attention and reduce-sum optimizations (<a href="https://github.com/ggml-org/llama.cpp/pull/19141">#19141</a>)
      <ul>
      <li>Further to the discussion in <a href="https://github.com/ggml-org/llama.cpp/pull/19025">PR #19025</a>, this implements the dual row dot product for flash attention.</li>
      <li>Added <code>hvx_vec_reduce_sum_qf32x2</code>, a helper function for efficiently reducing and accumulating two HVX vectors of qf32 values, and refactored several places in the codebase to use this function for dual-accumulation scenarios.</li>
      <li>Introduced new "rx2" (dual accumulation) versions of dot product functions for both f32-f16 and f16-f16 cases (<code>hvx_dot_f32_f16_aa_rx2</code>, <code>hvx_dot_f16_f16_aa_rx2</code>), improving performance by processing two accumulations in parallel.</li>
      </ul>
      </li>
      <li><strong>b7907</strong>: ggml-backend: fix async set/get fallback sync (<a href="https://github.com/ggml-org/llama.cpp/pull/19179">#19179</a>)
      <ul>
      <li>While working on an implementation for backend-agnostic tensor parallelism I found what I believe to be a bug in the ggml backend code. For a minimal implementation I did at first not implement <code>set_tensor_async</code> and <code>get_tensor_async</code> assuming that I could just rely on the synchronous fallback and implement those later. However, <code>set_tensor_async</code> and <code>get_tensor_async</code> do not call <code>ggml_backend_synchronize</code> for their fallback so I got incorrect results. This PR adds the corresponding calls.</li>
      </ul>
      </li>
      </ul>
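      <p>The "Prompt Lookup Decoding" idea referenced for b7895 can be sketched in a few lines of Python: take the last few tokens, look for an earlier occurrence of the same n-gram in the token history, and propose the tokens that followed it as a draft. This is a minimal sketch of the general technique, not the llama.cpp example itself; the function name, the parameter names, and the default values are assumptions.</p>

```python
def find_draft_tokens(tokens, max_ngram=3, num_draft=5):
    """Return draft tokens copied from an earlier occurrence of the current suffix.

    Tries the longest n-gram first and falls back to shorter ones.
    Returns an empty list when no earlier match exists.
    """
    for n in range(min(max_ngram, len(tokens) - 1), 0, -1):
        suffix = tokens[-n:]
        # Scan earlier positions, most recent first, excluding the suffix itself.
        for start in range(len(tokens) - n - 1, -1, -1):
            if tokens[start:start + n] == suffix:
                return tokens[start + n:start + n + num_draft]
    return []


history = [1, 2, 3, 4, 5, 1, 2, 3]
print(find_draft_tokens(history, num_draft=2))
```

      <p>The draft tokens would then be verified by the target model in a single batch, which is where the potential speedup comes from when the output repeats parts of the prompt.</p>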
      <h4>🐛 Bug Fixes</h4>
      <ul>
      <li><strong>b7895</strong>: llama : adjust default context size + print warnings (<a href="https://github.com/ggml-org/llama.cpp/pull/10136">#10136</a>)
      <ul>
      <li>fix #8817, <a href="https://github.com/ggerganov/llama.cpp/issues/9563#issuecomment-2452727620">https://github.com/ggerganov/llama.cpp/issues/9563#issuecomment-2452727620</a></li>
      <li>By default, the examples will use a context size of 4096, instead of the training context of the model. In a lot of cases, the default training context can be very big - 32k to 128k tokens, which causes enormous KV cache allocation and failures for regular hardware.</li>
      <li>Also, add warning logs when the specified context size per sequence does not match the training context.</li>
      </ul>
      </li>
      </ul>
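      <p>To see why a 4096-token default matters, the following back-of-the-envelope calculation estimates KV cache size as a function of context length. The model dimensions (32 layers, 8 KV heads, head size 128, 2-byte elements) are hypothetical values chosen for illustration and do not describe any particular model; the formula ignores any backend-specific padding or quantized cache types.</p>

```python
def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Approximate KV cache size: K and V each hold
    n_ctx * n_kv_heads * head_dim elements per layer."""
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elem


# Hypothetical model shape, for illustration only.
HYPOTHETICAL = dict(n_layers=32, n_kv_heads=8, head_dim=128)

for n_ctx in (4096, 32768, 131072):
    gib = kv_cache_bytes(n_ctx, **HYPOTHETICAL) / 1024**3
    print(f"n_ctx={n_ctx:>6}: {gib:.1f} GiB")
```

      <p>Under these assumptions the cache grows linearly from 0.5 GiB at 4096 tokens to 16 GiB at 128k tokens, which is consistent with the allocation failures described above for regular hardware.</p>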
      <h3>Additional Changes</h3>
      <p>7 minor improvements: 3 documentation, 4 examples.</p>
      <h3>Full Commit Range</h3>
      <ul>
      <li>b7885 to b7907 (14 commits)</li>
      <li>Upstream releases: <a href="https://github.com/ggml-org/llama.cpp/compare/b7885...b7907">https://github.com/ggml-org/llama.cpp/compare/b7885...b7907</a></li>
      </ul>
      <hr />
      <h2>2026-01-30: Update to llama.cpp b7885</h2>
      <h3>Summary</h3>
      <p>Updated llama.cpp from b7871 to b7885, incorporating 9 upstream commits with breaking changes and new features.</p>
      <h3>Notable Changes</h3>
      <h4>⚠️ Breaking Changes</h4>
      <ul>
      <li><strong>b7872</strong>: jinja : do not pass empty tools and add some none filters (<a href="https://github.com/ggml-org/llama.cpp/pull/19176">#19176</a>)
      <ul>
      <li>Passing empty or null <code>tools</code> breaks many templates so avoid that.</li>
      <li>Added several filters to <code>none</code> that are accepted by <code>jinja2</code>, fixes some templates that will try to use them (like <code>Functionary</code>).</li>
      <li>Fixes #19155</li>
      </ul>
      </li>
      <li><strong>b7883</strong>: memory : remove unused tmp_buf (<a href="https://github.com/ggml-org/llama.cpp/pull/19199">#19199</a>)
      <ul>
      <li>This commit removes the unused tmp_buf variable from llama-kv-cache.cpp and llama-memory-recurrent.cpp.</li>
      <li>The tmp_buf variable was declared but never used. Because it has a non-trivial constructor/destructor, the compiler did not emit an unused-variable warning for it.</li>
      </ul>
      </li>
      </ul>
      <h4>🆕 New Features</h4>
      <ul>
      <li><strong>b7871</strong>: HIP: add mmf for CDNA (<a href="https://github.com/ggml-org/llama.cpp/pull/18896">#18896</a>)
      <ul>
      <li>Add mmf for CDNA. CDNA3 passes; it would be very helpful if anyone could test it on CDNA2 and CDNA1.</li>
      <li>Refactor mmf to make rows_per_block an input parameter.</li>
      <li>Pass MUL_MAT and MUL_MAT_ID.</li>
      </ul>
      </li>
      <li><strong>b7881</strong>: add tensor type checking as part of cuda graph properties (<a href="https://github.com/ggml-org/llama.cpp/pull/19186">#19186</a>)
      <ul>
      <li>Motivated by <a href="https://github.com/ggml-org/llama.cpp/pull/15805#issuecomment-3818986820">https://github.com/ggml-org/llama.cpp/pull/15805#issuecomment-3818986820</a></li>
      </ul>
      </li>
      <li><strong>b7885</strong>: tests : add GQA=20 FA test (<a href="https://github.com/ggml-org/llama.cpp/pull/19095">#19095</a>)
      <ul>
      <li>Might be a good idea to have a test that exercises GQA=20 in order to catch any potential regressions.</li>
      </ul>
      </li>
      </ul>
      <h4>🐛 Bug Fixes</h4>
      <ul>
      <li><strong>b7875</strong>: cuda : fix nkvo, offload and cuda graph node properties matching (<a href="https://github.com/ggml-org/llama.cpp/pull/19165">#19165</a>)
      <ul>
      <li>fix #19158</li>
      <li>fix #19169</li>
      <li>cont #19105</li>
      </ul>
      </li>
      </ul>
      <h3>Additional Changes</h3>
      <p>3 minor improvements: 3 documentation.</p>
      <ul>
      <li><strong>b7876</strong>: hexagon: enable offloading to Hexagon on Windows on Snapdragon (<a href="https://github.com/ggml-org/llama.cpp/pull/19150">#19150</a>)
      <ul>
      <li>GGML Hexagon backend updates to support Windows on Snapdragon.</li>
      <li>Features:</li>
      <li>Support for building and offloading to NPU on WoS.</li>
      </ul>
      </li>
      <li><strong>b7879</strong>: sycl: implement GGML_OP_TRI (<a href="https://github.com/ggml-org/llama.cpp/pull/19089">#19089</a>)
      <ul>
      <li>Implements GGML_OP_TRI for the SYCL backend (F32).</li>
      <li>The implementation matches CPU semantics for all ggml_tri_type values (lower/upper, with and without diagonal).</li>
      </ul>
      </li>
      <li><strong>b7880</strong>: sycl: implement GGML_UNARY_OP_SOFTPLUS (<a href="https://github.com/ggml-org/llama.cpp/pull/19114">#19114</a>)
      <ul>
      <li>Implements GGML_UNARY_OP_SOFTPLUS for the SYCL backend.</li>
      <li>Adds an element-wise softplus kernel integrated through the generic SYCL unary dispatch path.</li>
      <li>Numerical behavior matches the CPU backend implementation.</li>
      </ul>
      </li>
      </ul>
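      <p>For readers unfamiliar with the softplus operation added to the SYCL backend in b7880, softplus is defined as log(1 + exp(x)). The sketch below is a generic, numerically stable formulation in Python; it is an assumption-level illustration of the math and is not taken from the ggml CPU or SYCL kernels.</p>

```python
import math


def softplus(x):
    """Element-wise softplus, log(1 + exp(x)).

    Rewritten as max(x, 0) + log1p(exp(-|x|)) so that exp() never
    receives a large positive argument (math.exp(1000.0) would overflow).
    """
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


for value in (-1000.0, 0.0, 1000.0):
    print(value, softplus(value))
```

      <p>For large positive inputs the result approaches x, and for large negative inputs it approaches 0, which is the smooth-ReLU behavior the operation is used for.</p>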
      <h3>Full Commit Range</h3>
      <ul>
      <li>b7871 to b7885 (9 commits)</li>
      <li>Upstream releases: <a href="https://github.com/ggml-org/llama.cpp/compare/b7871...b7885">https://github.com/ggml-org/llama.cpp/compare/b7871...b7885</a></li>
      </ul>
      <hr />
      <h2>2026-01-29: Update to llama.cpp b7871</h2>
      <h3>Summary</h3>
      <p>Updated llama.cpp from b7847 to b7871, incorporating 22 upstream commits with breaking changes, new features, and performance improvements.</p>
      <h3>Notable Changes</h3>
      <h4>⚠️ Breaking Changes</h4>
      <ul>
      <li><strong>b7850</strong>: ggml-zendnn : update ZenDNN git tag to main branch (<a href="https://github.com/ggml-org/llama.cpp/pull/19133">#19133</a>)
      <ul>
      <li>ZenDNN removed their zendnnl branch and moved all the code to main.</li>
      <li>Our code was still looking for the old zendnnl branch, which no longer exists, so builds broke.</li>
      <li>This PR fixes it by pointing to the new main branch instead.</li>
      </ul>
      </li>
      <li><strong>b7852</strong>: sampling : remove sampling branching in output_reserve (<a href="https://github.com/ggml-org/llama.cpp/pull/18811">#18811</a>)
      <ul>
      <li>This commit updates output_reserve in llama-context.cpp to always allocate sampling buffers regardless of whether sampling is needed for the current batch.</li>
      <li>The motivation for this is to avoid reallocations and branching based on the sampling requirements of the batch.</li>
      </ul>
      </li>
      <li><strong>b7862</strong>: ggml-sycl: remove unused syclcompat header (<a href="https://github.com/ggml-org/llama.cpp/pull/19140">#19140</a>)
      <ul>
      <li>The <code>syclcompat/math.hpp</code> is not used anymore. The change that introduced it was successfully reverted (<a href="https://github.com/ggml-org/llama.cpp/pull/17826">https://github.com/ggml-org/llama.cpp/pull/17826</a>). This include path will become obsolete and dropped in oneAPI 2026.0 effectively breaking <code>ggml-sycl</code> builds.</li>
      </ul>
      </li>
      <li><strong>b7868</strong>: CUDA: refactor topk-moe to enable more models (GLM 4.7, Nemotron etc.) (<a href="https://github.com/ggml-org/llama.cpp/pull/19126">#19126</a>)
      <ul>
      <li>Refactor topk-moe to enable various combinations of topk-moe, which should hopefully cover most models. Some templates were removed from the code; only the bias template was kept because it has an extra warp shuffle, and the rest of the template code does not provide any significant speedup.</li>
      <li>3090</li>
      </ul>
      </li>
      </ul>
      <h4>🆕 New Features</h4>
      <ul>
      <li><strong>b7849</strong>: jinja : implement mixed type object keys (<a href="https://github.com/ggml-org/llama.cpp/pull/18955">#18955</a>)
      <ul>
      <li>Allow all hashable types as object keys, taking care to replicate special python/jinja behavior between <code>int</code>/<code>float</code>/<code>bool</code>.</li>
      <li>Fixed array/object output with <code>string</code> filter.</li>
      <li>Fixed object <code>tojson</code> output (did not properly escape key string).</li>
      </ul>
      </li>
      <li><strong>b7860</strong>: CUDA: use mul_mat_q kernels by default (<a href="https://github.com/ggml-org/llama.cpp/pull/2683">#2683</a>)
      <ul>
      <li>There seem to have been no further reports of problems with the mul_mat_q kernels so I think it's fine to use them by default. This PR does just that and replaces the <code>-mmq</code>/<code>--mul-mat-q</code> CLI argument with <code>-nommq</code>/<code>--no-mul-mat-q</code>. Unless I'm mistaken the long-term plan is to also add equivalent CPU kernels for matrix matrix multiplications. Ideally I think the same CLI argument should then be used for switching the algorithm. So if you think that "mul_mat_q" is a bad name for matrix multiplications using quantized data now would be a good time to tell me.</li>
      </ul>
      </li>
      <li><strong>b7870</strong>: arg : add -kvu to llama-batched-bench (<a href="https://github.com/ggml-org/llama.cpp/pull/19172">#19172</a>)</li>
      <li><strong>b7871</strong>: HIP: add mmf for CDNA (<a href="https://github.com/ggml-org/llama.cpp/pull/18896">#18896</a>)
      <ul>
      <li>Add mmf for CDNA. CDNA3 passes; it would be very helpful if anyone could test it on CDNA2 and CDNA1.</li>
      <li>Refactor mmf to make rows_per_block an input parameter.</li>
      <li>Pass MUL_MAT and MUL_MAT_ID.</li>
      </ul>
      </li>
      </ul>
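      <p>The "special python/jinja behavior between <code>int</code>/<code>float</code>/<code>bool</code>" mentioned for b7849 presumably refers to how Python treats numerically equal keys as the same dictionary key. The snippet below shows that plain Python behavior; it is an illustration only and does not exercise the llama.cpp Jinja engine.</p>

```python
d = {}
d[1] = "int"
d[1.0] = "float"  # 1.0 == 1 and hash(1.0) == hash(1), so this hits the same key
d[True] = "bool"  # True == 1 as well
d["1"] = "str"    # a string is a distinct key

# The first inserted key object (the int 1) is kept; the last value wins.
print(d)
```

      <p>A template engine that wants to match Jinja2 output needs to reproduce this key collapsing when objects mix <code>int</code>, <code>float</code>, and <code>bool</code> keys.</p>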
      <h4>🚀 Performance Improvements</h4>
      <ul>
      <li><strong>b7847</strong>: CUDA: tune GLM 4.7 Flash FA kernel selection logic (<a href="https://github.com/ggml-org/llama.cpp/pull/19097">#19097</a>)
      <ul>
      <li>Follow-up to <a href="https://github.com/ggml-org/llama.cpp/pull/19092">https://github.com/ggml-org/llama.cpp/pull/19092</a> .</li>
      <li>Adjusts the kernel selection logic as a function of context depth to squeeze out a few more % on Ampere/Blackwell.</li>
      </ul>
      </li>
      <li><strong>b7858</strong>: ggml: new backend for Virglrenderer API Remoting acceleration (v2) (<a href="https://github.com/ggml-org/llama.cpp/pull/18718">#18718</a>)
      <ul>
      <li>This is a follow up of <a href="https://github.com/ggml-org/llama.cpp/pull/17072">https://github.com/ggml-org/llama.cpp/pull/17072</a></li>
      <li>The API Remoting backend/frontend allow escaping the VM isolation, with the help of the <code>virt-gpu</code> paravirtualization (and the <code>virglrenderer</code> library on the host side).</li>
      <li><code>ggml-remotingfrontend</code> is a GGML API implementation, which intercepts the GGML API calls and forwards them to the <code>virt-gpu</code> virtual device</li>
      </ul>
      </li>
      <li><strong>b7865</strong>: Vulkan Flash Attention Coopmat1 Refactor (<a href="https://github.com/ggml-org/llama.cpp/pull/19075">#19075</a>)
      <ul>
      <li>I finally had the time to go through Jeff's Flash Attention shaders in detail and used the chance to refactor the Coopmat1 for AMD. It started out as an attempt to use Coopmats for the Softmax * V matrix multiplication as well and then escalated into a refactor of the whole shader structure.</li>
      <li>It now uses coopmats for the Softmax result * V matrix multiplication, and I vectorized some variables, changed how shared memory is used, load K and V directly from global memory if possible, otherwise streamed through a shared memory cache.</li>
      <li>Tests are passing. Performance is up significantly on AMD RX 8060S (Strix Halo). Draft because there is a regression on Nvidia. Let me know if you see anything obvious @jeffbolznv. More tuning is likely required.</li>
      </ul>
      </li>
      </ul>
      <h4>🐛 Bug Fixes</h4>
      <ul>
      <li><strong>b7851</strong>: Split shared state (webgpu_context) into global state and per-thread state (<a href="https://github.com/ggml-org/llama.cpp/pull/18976">#18976</a>)
      <ul>
      <li>Right now, the WebGPU backend has a global <code>webgpu_context</code> struct with all the information required to instantiate and run a WebGPU graph.</li>
      <li>We want to split up the <code>webgpu_context</code> struct as follows:</li>
      <li>Move <code>get_tensor_sharing_buf</code> to global state, along with the <code>mutex</code></li>
      </ul>
      </li>
      <li><strong>b7853</strong>: llama : disable Direct IO by default (<a href="https://github.com/ggml-org/llama.cpp/pull/19109">#19109</a>)
      <ul>
      <li>ref <a href="https://github.com/ggml-org/llama.cpp/issues/19035#issuecomment-3798971944">https://github.com/ggml-org/llama.cpp/issues/19035#issuecomment-3798971944</a></li>
      <li>cont #18012</li>
      <li>Update <code>llama_model_params::use_direct_io == false</code> by default</li>
      </ul>
      </li>
      <li><strong>b7856</strong>: cuda : fix "V is K view" check for non-unified KV cache (<a href="https://github.com/ggml-org/llama.cpp/pull/19145">#19145</a>)
      <ul>
      <li>We weren't handling the case where both V and K are views of the same data with the same offset different from 0. This happens with split KV cache (e.g. <code>--parallel 4 --no-kv-unified</code>) and causes the flash attention to fall back to the CPU in such cases.</li>
      </ul>
      </li>
      <li><strong>b7860</strong>: vulkan: handle device dedup on MacOS + Vega II Duo cards (<a href="https://github.com/ggml-org/llama.cpp/pull/19058">#19058</a>)
      <ul>
      <li>Deduplication here relied on the fact that Vulkan would return a unique UUID for different physical GPUs. That is currently not always the case. On a Mac Pro 2019 running macOS with 2 Vega II Duo cards (so, 4 GPUs total), MoltenVK would assign the same UUID to pairs of GPUs, unless they are connected with Infinity Fabric.</li>
      <li>See more details here: KhronosGroup/MoltenVK#2683.</li>
      <li>The right way is to fix that in MoltenVK, but until it is fixed, llama.cpp would only recognize 2 of 4 GPUs in such a configuration.</li>
      </ul>
      </li>
      <li><strong>b7861</strong>: jinja : undefined should be treated as sequence/iterable (return string/array) by filters/tests (<a href="https://github.com/ggml-org/llama.cpp/pull/19147">#19147</a>)
      <ul>
      <li>Fixes #19130</li>
      </ul>
      </li>
      <li><strong>b7869</strong>: ggml-zendnn : resolve ZenDNN backend cross-module symbol dependency (<a href="https://github.com/ggml-org/llama.cpp/pull/19159">#19159</a>)
      <ul>
      <li>This PR fixes the ZenDNN backend failing to load when <code>GGML_BACKEND_DL=ON</code></li>
      <li>The issue occurs because MODULE libs cannot access symbols from other MODULE libs. The ZenDNN backend was attempting to call <code>ggml_get_type_traits_cpu()</code> from ggml-cpu, resulting in an undefined symbol error for <code>GGML_BACKEND_DL=ON</code>.</li>
      <li>This fix uses <code>ggml_get_type_traits()</code> from ggml-base instead, eliminating the dependency on ggml-cpu</li>
      </ul>
      </li>
      </ul>
      <h3>Additional Changes</h3>
      <p>5 minor improvements: 3 documentation, 2 maintenance.</p>
      <ul>
      <li><strong>b7864</strong>: Add self‑speculative decoding (no draft model required) (<a href="https://github.com/ggml-org/llama.cpp/pull/18471">#18471</a>)
      <ul>
      <li>This PR introduces self-speculative decoding: instead of using a dedicated draft model (which is good, if available, see #18039), the current token history is used to predict future tokens. This can provide a speedup in cases where the output contains repeated parts of the prompt. A typical example is making many small changes in a large source file.</li>
      <li><strong>Example 1</strong> (<code>gpt-oss-120b</code> in VRAM): Translation of a few comments in a Python script (chosen as a favorable case).</li>
      </ul>
      </li>
      <li><strong>b7867</strong>: [SYCL] fix norm kernels: l2_norm, group_norm, rms_norm by remove assert (<a href="https://github.com/ggml-org/llama.cpp/pull/19154">#19154</a>)
      <ul>
      <li>Fix the norm kernels l2_norm, group_norm, and rms_norm by removing the assert.</li>
      <li>All unit test cases for norm pass.</li>
      <li>No unit test case crashes.</li>
      </ul>
      </li>
      <li><strong>b7855</strong>: CUDA: tune GLM 4.7 Flash FA kernel selection logic (DGX Spark) (<a href="https://github.com/ggml-org/llama.cpp/pull/19142">#19142</a>)
      <ul>
      <li>cont #19097</li>
      <li>This is similar to #19097, but for DGX Spark. I used only the Q8_0 model for the measurements.</li>
      </ul>
      </li>
      <li><strong>b7857</strong>: ggml-cpu: arm64: Q4_K repack (i8mm) scale unroll and vectorization (<a href="https://github.com/ggml-org/llama.cpp/pull/19108">#19108</a>)
      <ul>
      <li>While working on <a href="https://github.com/ggml-org/llama.cpp/pull/18860">https://github.com/ggml-org/llama.cpp/pull/18860</a> I found out a small perf optimization when loading the subblock scales.</li>
      <li>Behavior unchanged, it's a manual unroll + vectorization.</li>
      <li>Llama-bench:</li>
      </ul>
      </li>
      </ul>
      <h3>Full Commit Range</h3>
      <ul>
      <li>b7847 to b7871 (22 commits)</li>
      <li>Upstream releases: <a href="https://github.com/ggml-org/llama.cpp/compare/b7847...b7871">https://github.com/ggml-org/llama.cpp/compare/b7847...b7871</a></li>
      </ul>
      <hr />
      <h2>2026-01-27: Update to llama.cpp b7845</h2>
      <h3>Summary</h3>
      <p>Updated llama.cpp from b7837 to b7845, incorporating 8 upstream commits with breaking changes, new features, and performance improvements.</p>
      <h3>Notable Changes</h3>
      <h4>⚠️ Breaking Changes</h4>
      <ul>
      <li><strong>b7839</strong>: graph : fix nkvo offload with FA (<a href="https://github.com/ggml-org/llama.cpp/pull/19105">#19105</a>)
      <ul>
      <li>fix #19096</li>
      <li>The <code>ggml_flash_attn_ext</code> was not being offloaded to the CPU when <code>-nkvo</code> is specified.</li>
      <li>Also remove obsolete <code>strcmp(name, "kqv_merged_cont")</code> check in the graph callback.</li>
      </ul>
      </li>
      </ul>
      <h4>🆕 New Features</h4>
      <ul>
      <li><strong>b7837</strong>: model : add correct type for GLM 4.7 Flash (<a href="https://github.com/ggml-org/llama.cpp/pull/19106">#19106</a>)
      <ul>
      <li>Fix the displayed model type in the logs:</li>
      <li>deepseek2 ?B Q8_0</li>
      </ul>
      </li>
      <li><strong>b7843</strong>: common : clarify HTTPS build options in error message (<a href="https://github.com/ggml-org/llama.cpp/pull/19103">#19103</a>)
      <ul>
      <li>This commit updates the HTTPS error message to provide clearer instructions for users who encounter the "HTTPS is not supported" error.</li>
      <li>The motivation for this is that it might not be clear to users that only one of these options is needed to enable HTTPS support. The LLAMA_OPENSSL option is also added to the message to cover all possible build configurations.</li>
      </ul>
      </li>
      <li><strong>b7845</strong>: ggml-cpu: aarm64: q5_K repack gemm and gemv (and generic) implementations (i8mm) (<a href="https://github.com/ggml-org/llama.cpp/pull/18860">#18860</a>)
      <ul>
      <li>This PR implements the REPACK version of q5_K, following most of the existing design used for q4_K, since Q5_K only differs from q4_K in having the qh field with the additional bit.</li>
      <li>Most of the code is shared, but I didn't know how to abstract the common patterns without creating a convoluted mess of functions. Since only Q4_K and Q5_K share the same 6-bit scales and mins decode, I opted to duplicate the code.</li>
      <li>I also moved around some declarations for Q2_K because the structure seemed weird (it's inverted with what I've seen in quants.c). The Q2_K function declarations were left where they were to avoid polluting the diff and messing up the blame. If you want me to revert it, just say so.</li>
      </ul>
      </li>
      <li><strong>b7845</strong>: ggml-cpu: aarm64: q6_K repack gemm and gemv (and generic) implementations (i8mm) #18860 (<a href="https://github.com/ggml-org/llama.cpp/pull/18888">#18888</a>)
      <ul>
      <li>Continuation of repack work for ARM, since q4_K_M and q5_K_M quantizations spend ~20% of compute time on q6_K layers.</li>
      <li>Still pending rebasing on top of #18860 if that gets merged.</li>
      <li>Same testing practices as the other repack implementations.</li>
      </ul>
      </li>
      </ul>
      <h4>🚀 Performance Improvements</h4>
      <ul>
      <li><strong>b7841</strong>: opencl: add flattened q6_K mv (<a href="https://github.com/ggml-org/llama.cpp/pull/19054">#19054</a>)
      <ul>
      <li>This PR adds flattened q6_K mv and renames the existing q6_K mv kernel file to better reflect what the kernel does. There should be no performance improvement, but it will enable further optimizations.</li>
      </ul>
      </li>
      <li><strong>b7842</strong>: ggml-cpu: Enable FP16 MMA kernels on PPC (<a href="https://github.com/ggml-org/llama.cpp/pull/19060">#19060</a>)
      <ul>
      <li>This change introduces a unified FP16/BF16 MMA kernel selection via mma_instr, allowing FP16 models to leverage Power MMA instructions instead of falling back to scalar/vector paths.</li>
      <li>Performance impact (Power10, 10 threads, Mistral-7B FP16, llama-batched-bench):</li>
      </ul>
      </li>
      </ul>
      <h3>Additional Changes</h3>
      <p>1 minor improvement: 1 documentation.</p>
      <ul>
      <li><strong>b7844</strong>: [CUDA] Reduce CPU-side stalls due to the CUDA command buffer being full (<a href="https://github.com/ggml-org/llama.cpp/pull/19042">#19042</a>)
      <ul>
      <li>With pipeline parallelism, during prompt processing, the CPU-side CUDA command buffer gets full, stalling the CPU. Due to this, enough work doesn't get submitted to the GPU, resulting in bubbles in the GPU timeline. This PR fixes this by setting the CUDA environment variable CUDA_SCALE_LAUNCH_QUEUES to 4x to increase the command buffer size.</li>
      <li>The NSight profile below shows the issue in more detail:</li>
      </ul>
      </li>
      </ul>
      <h3>Full Commit Range</h3>
      <ul>
      <li>b7837 to b7845 (8 commits)</li>
      <li>Upstream releases: <a href="https://github.com/ggml-org/llama.cpp/compare/b7837...b7845">https://github.com/ggml-org/llama.cpp/compare/b7837...b7845</a></li>
      </ul>
      <hr />
      <h2>2026-01-26: Update to llama.cpp b7837</h2>
      <h3>Summary</h3>
      <p>Updated llama.cpp from b7837 to b7837, incorporating 1 upstream commit with new features.</p>
      <h3>Notable Changes</h3>
      <h4>🆕 New Features</h4>
      <ul>
      <li><strong>b7837</strong>: model : add correct type for GLM 4.7 Flash (<a href="https://github.com/ggml-org/llama.cpp/pull/19106">#19106</a>)
      <ul>
      <li>Fix the displayed model type in the logs:</li>
      <li>deepseek2 ?B Q8_0</li>
      </ul>
      </li>
      </ul>
      <h3>Full Commit Range</h3>
      <ul>
      <li>b7837 to b7837 (1 commit)</li>
      <li>Upstream releases: <a href="https://github.com/ggml-org/llama.cpp/compare/b7837...b7837">https://github.com/ggml-org/llama.cpp/compare/b7837...b7837</a></li>
      </ul>
      <hr />
      <h2>2026-01-26: Update to llama.cpp b7836</h2>
      <h3>Summary</h3>
      <p>Updated llama.cpp from b7836 to b7836, incorporating 1 upstream commit with performance improvements.</p>
      <h3>Notable Changes</h3>
      <h4>🚀 Performance Improvements</h4>
      <ul>
      <li><strong>b7836</strong>: CUDA: faster FA for GQA > 1 but not power of 2 (<a href="https://github.com/ggml-org/llama.cpp/pull/19092">#19092</a>)
      <ul>
      <li>This PR generalizes the CUDA MMA FlashAttention kernel to enable the GQA optimizations for models where the ratio between the number of Q heads and the number of K/V heads is not a power of 2. This is done by simply padding the Q columns per CUDA block to the next higher power of 2. This wastes a bit of compute but particularly for small batch sizes the kernel is I/O-bound anyways.</li>
      <li>On Ampere or newer this improves performance of GLM 4.7 Flash as well as some random models like Granite 3.0 with a GQA ratio of 3. On Volta the new code path is slower than master so it's disabled. On RDNA4 it seems to be faster but as of right now the performance of the MMA kernel is bad on RDNA for head sizes > 128 so there is no benefit for GLM 4.7 Flash.</li>
      </ul>
      </li>
      </ul>
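      <p>The padding step described for b7836 (rounding the number of Q columns per block up to the next power of 2) can be illustrated with a small calculation. This is a sketch of the arithmetic only, not the CUDA kernel; the helper names are hypothetical, the ratio 3 comes from the text above, and 20 is used here simply as a second non-power-of-2 example.</p>

```python
def next_pow2(n):
    """Smallest power of two that is >= n, for n >= 1."""
    return 1 << (n - 1).bit_length()


def padding_overhead(gqa_ratio):
    """Return the padded size and the fraction of padded columns that is wasted."""
    padded = next_pow2(gqa_ratio)
    return padded, (padded - gqa_ratio) / padded


for ratio in (3, 8, 20):
    padded, waste = padding_overhead(ratio)
    print(f"GQA ratio {ratio:>2} -> padded to {padded:>2}, wasted fraction {waste:.3f}")
```

      <p>The wasted fraction is the "bit of compute" the description refers to; ratios that are already powers of 2 incur no padding.</p>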
      <h3>Full Commit Range</h3>
      <ul>
      <li>b7836 to b7836 (1 commit)</li>
      <li>Upstream releases: <a href="https://github.com/ggml-org/llama.cpp/compare/b7836...b7836">https://github.com/ggml-org/llama.cpp/compare/b7836...b7836</a></li>
      </ul>
      <hr />
      <h2>2026-01-26: Update to llama.cpp b7836</h2>
      <h3>Summary</h3>
      <p>Updated llama.cpp from b7836 to b7836, incorporating 1 upstream commits with performance improvements.</p>
      <h3>Notable Changes</h3>
      <h4>🚀 Performance Improvements</h4>
      <ul>
      <li><strong>b7836</strong>: CUDA: faster FA for GQA > 1 but not power of 2 (<a href="https://github.com/ggml-org/llama.cpp/pull/19092">#19092</a>)
      <ul>
      <li>This PR generalizes the CUDA MMA FlashAttention kernel to enable the GQA optimizations for models where the ratio between the number of Q heads and the number of K/V heads is not a power of 2. This is done by simply padding the Q columns per CUDA block to the next higher power of 2. This wastes a bit of compute but particularly for small batch sizes the kernel is I/O-bound anyways.</li>
      <li>On Ampere or newer this improves performance of GLM 4.7 Flash as well as some random models like Granite 3.0 with a GQA ratio of 3. On Volta the new code path is slower than master so it's disabled. On RDNA4 it seems to be faster but as of right now the performance of the MMA kernel is bad on RDNA for head sizes > 128 so there is no benefit for GLM 4.7 Flash.</li>
      <li>
      <details>
      </li>
      </ul>
      </li>
      </ul>
      <h3>Full Commit Range</h3>
      <ul>
      <li>b7836 to b7836 (1 commits)</li>
      <li>Upstream releases: <a href="https://github.com/ggml-org/llama.cpp/compare/b7836...b7836">https://github.com/ggml-org/llama.cpp/compare/b7836...b7836</a></li>
      </ul>
      <hr />
      <h2>2026-01-21: Update to llama.cpp b7788</h2>
      <h3>Summary</h3>
      <p>Updated llama.cpp from b7772 to b7788, incorporating 13 upstream commits with breaking changes, new features, and performance improvements.</p>
      <h3>Notable Changes</h3>
      <h4>⚠️ Breaking Changes</h4>
      <ul>
      <li><strong>b7782</strong>: ggml : cleanup path_str() (<a href="https://github.com/ggml-org/llama.cpp/pull/18928">#18928</a>)
      <ul>
      <li>Remove pragmas as <code>std::codecvt_utf8</code> is not used.</li>
      <li>Avoid implicit <code>strlen()</code>.</li>
      </ul>
      </li>
      </ul>
      <h4>🆕 New Features</h4>
      <ul>
      <li><strong>b7774</strong>: ggml : add ggml_build_forward_select (<a href="https://github.com/ggml-org/llama.cpp/pull/18550">#18550</a>)
      <ul>
      <li>target #18547</li>
      <li>alt #18549</li>
      <li>Add <code>GGML_TENSOR_FLAG_COMPUTE</code> flag indicating that a tensor in the graph must be computed</li>
      </ul>
      </li>
      <li><strong>b7777</strong>: jinja : fix undefined keys and attributes and int/float as bool (<a href="https://github.com/ggml-org/llama.cpp/pull/18924">#18924</a>)
      <ul>
      <li>Return <code>undefined</code> on undefined keys and attributes.</li>
      <li>Integers and floats can be represented as bools.</li>
      <li>Added <code>falsy</code> tests.</li>
      </ul>
      </li>
      </ul>
      <h4>🚀 Performance Improvements</h4>
      <ul>
      <li><strong>b7781</strong>: metal : enable FA for MLA heads (<a href="https://github.com/ggml-org/llama.cpp/pull/18950">#18950</a>)
      <ul>
      <li>ref #18936</li>
      <li>Re-enable FA for K head size of 576 (MQA mode of MLA) and adjust simdgroups and loop unrolling for performance.</li>
      </ul>
      </li>
      <li><strong>b7783</strong>: CUDA: Replace init_offsets kernel with iterators in cub-based argsort (<a href="https://github.com/ggml-org/llama.cpp/pull/18930">#18930</a>)
      <ul>
      <li>This is mostly a QOL improvement, saving us the cost of materializing the iterator.</li>
      </ul>
      </li>
      </ul>

🐛 Bug Fixes

  • b7772: DirectIO Model Loading: Extend and fix Fallback (#18887)
    • Due to issues with the DirectIO model loading path on Android, this PR adds EINVAL errors to the fallback condition. It also fixes a bug in the fallback to mmap when opening the file with the DirectIO flag fails.
  • b7787: gguf: display strerrno when cant load a model (#18884)
    • The author had trouble loading models with llama-server and saw only this error, even though the file was accessible:
    • [44039] E gguf_init_from_file: failed to open GGUF file 'mistral-7b-v0.1.Q8_0.gguf'
    • It seems that --models-dir and --models-presets do not interact as the author expected. This change displays the error string when the file cannot be opened, which helps with troubleshooting.
  • b7788: Fix GLM 4.7 Lite MoE gating func (#18980)
    • GLM 4.7 Lite uses SIGMOID, not SOFTMAX like Deepseek.
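
The difference between the two gating functions named in b7788 can be seen in a small Python sketch. This only contrasts softmax and sigmoid scoring on made-up router logits; the function names are hypothetical and it does not reproduce llama.cpp's expert routing code.

```python
import math


def softmax_gate(logits: list[float]) -> list[float]:
    """Softmax gating: scores are coupled and always sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid_gate(logits: list[float]) -> list[float]:
    """Sigmoid gating: each expert is scored independently in (0, 1)."""
    return [1.0 / (1.0 + math.exp(-x)) for x in logits]


# Made-up router logits for four experts.
router_logits = [2.0, 0.0, -1.0, 0.5]
print("softmax:", [round(s, 3) for s in softmax_gate(router_logits)])
print("sigmoid:", [round(s, 3) for s in sigmoid_gate(router_logits)])
```

With softmax, raising one expert's logit lowers every other expert's score. With sigmoid, the other scores are unchanged, so applying the wrong function changes the expert weights the model sees.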

Additional Changes

5 minor improvements: 1 documentation, 4 examples.

  • b7786: CUDA: Fix builds for older CCCL versions by ifdefing strided_iterator (#18964)
  • b7775: server: fix memory reservations in populate_token_probs (#18787)
    • Fixes the two Vector::reserve calls in the populate_token_probs function.
    • In case post_sampling is true the code now only reserves as much space in the Vector as is needed for the requested number of logprobs. This prevents reserving large amounts of memory that are not used.
    • In case post_sampling is false the code now clamps the reserved size to the maximum number of tokens the model supports. This prevents reserving large amounts of unused memory when the client requests more token logprobs than the model supports and, in extreme cases, crashes from invalid memory allocations.
  • b7779: server : refactor oai_parser_opt, move it to server_chat_params (#18937)
    • In this PR:
    • Rename oaicompat_parser_options --> server_chat_params
    • Store common_chat_templates_ptr inside it
  • b7784: cli : fix reasoning responses in CLI (#18961)
    • The chat format was not propagated to the task state in the CLI, so reasoning content was not parsed correctly.
    • With this PR, GLM-4.7 now works correctly in the CLI.
  • b7785: common, server : use the same User-Agent by default (#18957)
    • This commit also ensures that if a custom User-Agent is used, it will be the only one sent.
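
The reservation rule described for b7775 can be sketched in Python as follows. This is an illustration under stated assumptions, not the server code: the function and parameter names are hypothetical, and capping the post-sampling case by the number of available candidates is an assumption made here.

```python
def reserve_size(n_probs: int, n_vocab: int, post_sampling: bool, n_candidates: int) -> int:
    """Return how many entries to reserve for token probabilities.

    n_probs       number of logprobs requested by the client
    n_vocab       maximum number of tokens the model supports
    post_sampling whether probabilities are taken after sampling
    n_candidates  candidates left after sampling (assumed cap, hypothetical)
    """
    if n_probs < 0:
        raise ValueError("n_probs must not be negative")
    if post_sampling:
        # Reserve only what is needed for the requested number of logprobs.
        return min(n_probs, n_candidates)
    # Clamp oversized requests to what the model can actually return.
    return min(n_probs, n_vocab)


# A client asking for far more logprobs than the vocabulary holds
# no longer leads to a huge reservation.
print(reserve_size(n_probs=10**9, n_vocab=32000, post_sampling=False, n_candidates=0))
```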

Full Commit Range

  • b7772 to b7788 (13 commits)


2026-01-05: Update to llama.cpp b7631

2026-01-03: Update to llama.cpp b7621

2025-12-20: Update to llama.cpp b7488

2025-12-13: Update to llama.cpp b7376

2025-12-05: Update to llama.cpp b7278

2025-12-01: Update to llama.cpp b7213

2025-11-14: Update to llama.cpp b7058

2025-11-05: Update to llama.cpp b6957

2025-11-01: Update to llama.cpp b6916

2025-10-31: Update to llama.cpp b6900

2025-10-18: Update to llama.cpp b6792

2025-10-02: Update to llama.cpp b6666

This file lists notable changes synchronized from upstream llama.cpp releases. Each entry corresponds to the vendor submodule update in this package.

2025-09-17: Update to llama.cpp b6497


Download files

Download the file for your platform.

Source Distribution

llama_cpp_pydist-0.46.0.tar.gz (51.7 MB)

Uploaded Source

Built Distribution

llama_cpp_pydist-0.46.0-py3-none-any.whl (52.9 MB)

Uploaded Python 3

File details

Details for the file llama_cpp_pydist-0.46.0.tar.gz.

File metadata

  • Download URL: llama_cpp_pydist-0.46.0.tar.gz
  • Size: 51.7 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for llama_cpp_pydist-0.46.0.tar.gz
Algorithm Hash digest
SHA256 9d843aa1db2e4050745baccd584326ee9603270f3554d1d12ffb7b18ecc36c07
MD5 32748256ea7760dcf6149ba8ddac2edf
BLAKE2b-256 ed4dc780bc6d65da4c66d199b59eb811fc877a33b03e02d0689018d11a165062

File details

Details for the file llama_cpp_pydist-0.46.0-py3-none-any.whl.

File hashes

Hashes for llama_cpp_pydist-0.46.0-py3-none-any.whl
Algorithm Hash digest
SHA256 88eb7ef74bcbd536288d27301fb6a321fdbc419d633e5463b9cde2fcce54e079
MD5 9741dd8719d4a1d1a1f271a6d30fc723
BLAKE2b-256 9b36b010f9b72ff07c63cf19f24b5c7555b761f9ea8169b3e8460ddc8b120996
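
A downloaded file can be checked against the SHA256 digests listed above with Python's standard hashlib module. The following is a minimal sketch; the helper names and the assumption that the file sits in the current directory are illustrative.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path: str, expected_hex: str) -> bool:
    """Compare a file's SHA256 digest with the published one (case-insensitive)."""
    return sha256_of_file(path) == expected_hex.strip().lower()


# Digest copied from the "File hashes" listing for the source distribution.
SDIST = "llama_cpp_pydist-0.46.0.tar.gz"
SDIST_SHA256 = "9d843aa1db2e4050745baccd584326ee9603270f3554d1d12ffb7b18ecc36c07"

if Path(SDIST).exists():  # only runs when the file has been downloaded locally
    print("digest matches" if verify_download(SDIST, SDIST_SHA256) else "digest MISMATCH")
```

Reading in chunks keeps memory use low for archives of this size (about 50 MB).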
