Python bindings for the DS4 native inference engine

pyds4

Python bindings for the DS4 DeepSeek V4 Flash inference engine

pyds4 is a Python package for running DS4-supported DeepSeek V4 Flash GGUF models from Python. It wraps the DS4 native engine with synchronous and asyncio APIs, token streaming, chat prompt helpers, token logprobs, DSML tool-call helpers, session snapshots, and payload-backed disk KV cache helpers.

This is not a generic GGUF runner. It targets the model files and native API supported by DS4. Avalan uses pyds4 for native DS4 inference, but pyds4 is usable directly from any Python application.

Install

Install the published package from PyPI:

python -m pip install -U pyds4

Check which native backend the installed package was built for:

python - <<'PY'
import pyds4

backend = pyds4.__ds4_native_backend__
print("pyds4:", pyds4.__version__)
print("backend:", backend)
print("available:", pyds4.is_backend_available(backend))
if not pyds4.is_backend_available(backend):
    print(pyds4.backend_unavailable_reason(backend))
PY

pyds4 requires Python 3.11 or newer. Production targets are macOS arm64 with Metal and Linux with CUDA. CPU builds exist for diagnostics and tests only.

Each pyds4 wheel is built for a single native backend. If no wheel is available for your platform, build from source against a DS4 checkout as described in Build From Source.

Model Files

DS4 opens a local GGUF file directly from the filesystem. Use DS4 to download one of the supported DeepSeek V4 Flash GGUFs:

git clone https://github.com/antirez/ds4.git /path/to/ds4
cd /path/to/ds4
./download_model.sh q2-imatrix

The DS4 repository documents the available quantizations, memory expectations, optional MTP model, and current engine limitations. The examples below assume /path/to/ds4/ds4flash.gguf.

Async Streaming Quickstart

AsyncEngine owns DS4 on a single worker thread and serializes native calls there. For most applications, AsyncSession.stream_text() is the right high-level API: it advances the session, suppresses EOS text, buffers stop strings, and handles incremental UTF-8 decoding.

import asyncio

import pyds4

MODEL_PATH = "/path/to/ds4/ds4flash.gguf"
CTX_SIZE = 4096


async def main() -> None:
    backend = pyds4.Backend(pyds4.__ds4_native_backend__)
    if not pyds4.is_backend_available(backend.value):
        raise RuntimeError(pyds4.backend_unavailable_reason(backend.value))

    options = pyds4.EngineOptions(
        model_path=MODEL_PATH,
        backend=backend,
        native_log=False,
    )

    async with pyds4.AsyncEngine(options) as engine:
        prompt_tokens = await engine.encode_chat_prompt(
            system="You are concise.",
            prompt="Explain why KV caches matter in one paragraph.",
            think_mode=pyds4.think_mode_for_context(
                pyds4.ThinkMode.NONE,
                CTX_SIZE,
            ),
        )

        async with await engine.create_session(CTX_SIZE) as session:
            await session.sync(prompt_tokens)

            generation = pyds4.GenerationOptions(max_new_tokens=128)
            async for chunk in session.stream_text(generation):
                print(chunk, end="", flush=True)
            print()


asyncio.run(main())
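
Incremental UTF-8 decoding matters because one character can span the bytes of more than one token. The sketch below shows the idea with the standard library only. The byte chunks are made-up stand-ins for per-token output, and this is an illustration of the technique, not pyds4's internal code.

```python
import codecs

# Hypothetical per-token byte chunks: "é" (0xC3 0xA9) is split across two chunks.
token_bytes = [b"caf", b"\xc3", b"\xa9", b" ok"]

decoder = codecs.getincrementaldecoder("utf-8")(errors="strict")

chunks = []
for raw in token_bytes:
    text = decoder.decode(raw)  # returns "" while a character is incomplete
    if text:
        chunks.append(text)

# final=True raises if the stream ended in the middle of a character.
tail = decoder.decode(b"", final=True)
if tail:
    chunks.append(tail)

print(chunks)
```

Decoding each chunk independently with bytes.decode() would fail on the lone 0xC3 byte, which is why a stateful decoder is the usual choice for token streams.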

Enable sampling by passing a SamplingOptions instance to GenerationOptions:

generation = pyds4.GenerationOptions(
    max_new_tokens=128,
    sampling=pyds4.SamplingOptions(
        temperature=0.7,
        top_k=40,
        top_p=0.95,
        seed=1,
    ),
)
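
For readers unfamiliar with these knobs, the following pure-Python sketch shows what temperature, top_k, top_p, and seed conventionally mean when applied to a list of logits. It is an illustration under that assumption only: sample_token is a hypothetical helper, and DS4's native sampler may differ in ordering, tie-breaking, and edge cases.

```python
import math
import random


def sample_token(logits, temperature=0.7, top_k=40, top_p=0.95, seed=1):
    """Pick one index from `logits` using temperature, top-k, and top-p."""
    if temperature <= 0:
        # Treat non-positive temperature as greedy decoding (an assumption).
        return max(range(len(logits)), key=logits.__getitem__)

    # Temperature scaling, then a numerically stable softmax.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]
    total = sum(weights)
    probs = sorted(
        ((w / total, i) for i, w in enumerate(weights)), reverse=True
    )

    # Top-k: keep the k most likely candidates.
    probs = probs[:top_k]

    # Top-p: keep the smallest prefix whose cumulative mass reaches top_p.
    kept, mass = [], 0.0
    for p, i in probs:
        kept.append((p, i))
        mass += p
        if mass >= top_p:
            break

    rng = random.Random(seed)  # a fixed seed makes the draw repeatable
    return rng.choices([i for _, i in kept], weights=[p for p, _ in kept])[0]


print(sample_token([0.1, 5.0, 0.2], temperature=0))
```

The sketch re-seeds on every call to keep it short; a real sampler would normally keep one random generator alive across generation steps.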

If you build your own token loop but want the same stop-string behavior, StopStringBuffer exposes the generic buffering used by stream_text():

buffer = pyds4.StopStringBuffer(["</s>", "STOP"])
for chunk in buffer.push(decoded_token_text):
    print(chunk, end="")
for chunk in buffer.flush():
    print(chunk, end="")
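
To make the buffering idea concrete, here is a minimal sketch of one way stop-string holdback can work: text is released only once it can no longer be the beginning of a stop string. HoldbackBuffer is a hypothetical stand-in written for this explanation. It is not pyds4's StopStringBuffer, and unlike the snippet above its push() and flush() return a single string rather than an iterable of chunks.

```python
class HoldbackBuffer:
    """Hypothetical stand-in; not pyds4's StopStringBuffer implementation."""

    def __init__(self, stops):
        self.stops = [s for s in stops if s]
        self.pending = ""
        self.stopped = False

    def push(self, text):
        """Return the text that is safe to emit after receiving `text`."""
        if self.stopped:
            return ""
        self.pending += text
        # A complete stop string ends the stream; emit only what precedes it.
        hits = [i for i in (self.pending.find(s) for s in self.stops) if i >= 0]
        if hits:
            out, self.pending, self.stopped = self.pending[: min(hits)], "", True
            return out
        # Otherwise hold back the longest tail that could begin a stop string.
        hold = 0
        for stop in self.stops:
            for n in range(min(len(stop) - 1, len(self.pending)), 0, -1):
                if self.pending.endswith(stop[:n]):
                    hold = max(hold, n)
                    break
        cut = len(self.pending) - hold
        out, self.pending = self.pending[:cut], self.pending[cut:]
        return out

    def flush(self):
        """Emit whatever was held back once generation ends without a stop."""
        out, self.pending = self.pending, ""
        return "" if self.stopped else out


buf = HoldbackBuffer(["STOP"])
pieces = [buf.push(t) for t in ["Hello ", "ST", "ILL here S", "TOP ignored"]]
pieces.append(buf.flush())
print("".join(pieces))
```

The second push returns an empty string because "ST" might still turn into "STOP"; it is released on the next push once the continuation rules that out.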

Runnable Examples

The async example exposes the common knobs for backend, context size, sampling, streaming, and thread count:

python examples/generate_text_async.py \
  --model /path/to/ds4/ds4flash.gguf \
  --backend metal \
  --ctx-size 4096 \
  --max-new-tokens 128 \
  --temperature 0 \
  "Explain LLM distillation in one paragraph."

For lower-level control, the sync example shows the synchronous session loop: call sync() with the prompt, pick a token with argmax() or sample(), call eval() to advance, then decode the token with engine.token_text(token_id).

Tool Use With DSML

DeepSeek V4 Flash tool calls use DSML text. pyds4 can render prompts with tool schemas, tokenize rendered DSML chat prompts, parse generated tool calls, and stream argument-value deltas from a growing tool block. It does not execute tools; your application owns dispatching the parsed call and appending the tool result on the next turn.

import pyds4
from pyds4.dsml import (
    DsmlMessage,
    DsmlParseStatus,
    DsmlPrompt,
    DsmlToolCallBufferStatus,
    parse_generated_message,
    render_prompt,
    tool_call_buffer_status,
)

tool_schema = {
    "type": "function",
    "function": {
        "name": "math.calculator",
        "description": "Evaluate a small arithmetic expression.",
        "parameters": {
            "type": "object",
            "properties": {"expression": {"type": "string"}},
            "required": ["expression"],
        },
    },
}

rendered = render_prompt(
    DsmlPrompt(
        system_content="Use tools for arithmetic.",
        messages=[DsmlMessage(role="user", content="What is 4 * 7?")],
        tool_schemas=[tool_schema],
    ),
    think_mode=pyds4.ThinkMode.NONE,
)

async def collect_tool_call_text(engine: pyds4.AsyncEngine) -> str:
    prompt_tokens = await engine.tokenize_rendered_chat(rendered)
    generated = ""

    async with await engine.create_session(4096) as session:
        await session.sync(prompt_tokens)
        async for chunk in session.stream_text(
            pyds4.GenerationOptions(max_new_tokens=512),
        ):
            generated += chunk
            if (
                tool_call_buffer_status(generated)
                is DsmlToolCallBufferStatus.CLOSED
            ):
                break

    return generated


async def run_tool_prompt(engine: pyds4.AsyncEngine) -> None:
    generated = await collect_tool_call_text(engine)
    parsed = parse_generated_message(generated)
    if parsed.status is DsmlParseStatus.COMPLETE:
        for call in parsed.calls:
            print(call.name, call.arguments)

When continuing after a tool result, render the next prompt with the prior assistant tool call and a DsmlMessage(role="tool", content=...) result so the DSML transcript stays aligned with DS4's native prompt format. String parameter rendering escapes every accepted DSML parameter close-marker variant before it reaches the generated block, while preserving literal entity text such as &lt;/parameter> and &amp;lt;/parameter> when parsed back.
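
Because dispatch is left to the application, a small registry that maps tool names to handlers is often enough. The sketch below uses only the standard library and assumes each parsed call provides a name and a dict of arguments, as call.name and call.arguments suggest above. The TOOLS table, dispatch(), and the calculator handler are hypothetical names for illustration, not part of pyds4.

```python
import ast
import operator

# Operators the hypothetical calculator tool is willing to evaluate.
_BINARY = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def calculator(expression: str) -> str:
    """Evaluate +, -, *, / over numeric literals without calling eval()."""

    def walk(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY:
            return _BINARY[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        raise ValueError("unsupported expression")

    return str(walk(ast.parse(expression, mode="eval").body))


# Tool name -> handler. The names must match the schemas given to the model.
TOOLS = {"math.calculator": calculator}


def dispatch(name: str, arguments: dict) -> str:
    """Run one parsed call and return the text for the role="tool" message."""
    handler = TOOLS.get(name)
    if handler is None:
        return f"error: unknown tool {name!r}"
    try:
        return handler(**arguments)
    except Exception as exc:  # report failures back to the model as text
        return f"error: {exc}"


print(dispatch("math.calculator", {"expression": "4 * 7"}))
```

Returning errors as text rather than raising lets the next turn show the model what went wrong, which is a design choice of this sketch rather than a pyds4 requirement.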

Advanced APIs

Token Scores

Use AsyncSession.next_token() when you need token-level metadata instead of plain text chunks:

step = await session.next_token(
    decode=True,
    scores=pyds4.GenerationScoreOptions(
        mode=pyds4.TokenScoreMode.TOKEN_LOGPROB_AND_TOP_LOGPROBS,
        top_k=5,
    ),
)

print(step.decoded_text, step.token_logprob)
for score in step.top_logprobs:
    print(score.token_id, score.logprob)

The synchronous Session exposes the same primitives directly with argmax(), sample(), token_logprob(), top_logprobs(), and eval().
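
As background on what these scores represent, a token logprob is the natural log of the softmax probability assigned to that token. The sketch below computes log-probabilities and a top-k list from made-up logits with the standard library. The helper names are hypothetical and the code does not call pyds4; it only illustrates the arithmetic.

```python
import math


def log_softmax(logits):
    """Convert raw logits to log-probabilities in a numerically stable way."""
    peak = max(logits)
    log_total = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_total for x in logits]


def top_logprobs(logits, k):
    """Return the k most likely (token_id, logprob) pairs, best first."""
    scored = enumerate(log_softmax(logits))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


logits = [2.0, 0.5, -1.0, 3.0]  # made-up values for a 4-token vocabulary
for token_id, logprob in top_logprobs(logits, k=2):
    # math.exp(logprob) recovers the probability in the range 0..1.
    print(token_id, round(logprob, 3), round(math.exp(logprob), 3))
```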

Snapshots And Disk KV Cache

Sessions can save and restore in-memory snapshots and serialized payloads. Ds4DiskKvCache uses payloads to cache a prompt prefix on disk.

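from pathlib import Path

import pyds4
from pyds4.kv_cache import Ds4DiskKvCache

# Fragment: run inside a coroutine where engine, backend, prompt_tokens, and
# CTX_SIZE are already defined, as in the quickstart's main().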

cache = Ds4DiskKvCache(
    Path("~/.cache/pyds4/kv").expanduser(),
    model_namespace="deepseek-v4-flash-q2-imatrix",
    backend=backend,
)

async with await engine.create_session(CTX_SIZE) as session:
    restored = await cache.arestore(session, prompt_tokens, CTX_SIZE)
    if restored.status == "miss" and restored.synced:
        await cache.astore(
            session,
            prompt_tokens,
            CTX_SIZE,
            size_budget_bytes=2_000_000_000,
        )

    text = await session.generate_text(
        pyds4.GenerationOptions(max_new_tokens=128),
    )

Store immediately after syncing or restoring the prefix you want to cache. Ds4DiskKvCache caches payload bytes, not snapshots. On a hit, it loads the payload with load_payload(); on a miss or corrupt entry, it calls sync(prompt_tokens) by default. model_namespace is caller-defined and should identify the exact model and configuration. Cache metadata can include prompt text, and payload bytes are session state, so treat cache directories as sensitive. size_budget_bytes is opt-in and enforced after a payload is written.
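
Since cache directories should be treated as sensitive, one reasonable precaution on POSIX systems is to restrict the directory to the owning user before handing it to the cache. The sketch below is a standard-library illustration; make_private_dir is a hypothetical helper, and whether pyds4 sets permissions itself is not stated here.

```python
import os
import stat
import tempfile
from pathlib import Path


def make_private_dir(path: Path) -> Path:
    """Create `path` if needed and restrict it to the owning user (POSIX)."""
    path.mkdir(parents=True, exist_ok=True)
    # mkdir's mode argument is filtered by the umask and ignored for existing
    # directories, so set the permissions explicitly.
    os.chmod(path, stat.S_IRWXU)  # 0o700: owner read/write/execute only
    return path


# Demonstrated on a temporary location rather than a real cache path.
with tempfile.TemporaryDirectory() as root:
    cache_dir = make_private_dir(Path(root) / "pyds4" / "kv")
    mode = stat.S_IMODE(cache_dir.stat().st_mode)
    print(oct(mode))
```

Only the leaf directory is tightened; parent directories created along the way keep the default permissions. On Windows, os.chmod has limited effect, so access control there needs a different mechanism.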

Progress And Cancellation

AsyncSession.progress is an asyncio.Queue[pyds4.ProgressEvent] populated when the native backend reports long-running progress. Cancelling a mutating async operation poisons that session and closes the native session during cleanup; create a fresh session after cancellation.
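
A common way to consume a progress queue is a background reader task that is cancelled once the long-running call returns. The sketch below shows that pattern with plain asyncio. The integer events and long_operation() are stand-ins, since the fields of ProgressEvent are not described here. Note that only the reader task is cancelled, never the operation itself.

```python
import asyncio


async def long_operation(progress: asyncio.Queue) -> str:
    """Stand-in for a slow session call that reports progress as it goes."""
    for done in (25, 50, 75, 100):
        await asyncio.sleep(0.01)
        progress.put_nowait(done)
    return "finished"


async def drain(progress: asyncio.Queue, seen: list) -> None:
    """Consume events until cancelled."""
    while True:
        seen.append(await progress.get())


async def main() -> tuple:
    progress: asyncio.Queue = asyncio.Queue()
    seen: list = []
    reader = asyncio.create_task(drain(progress, seen))
    try:
        result = await long_operation(progress)
    finally:
        reader.cancel()  # stop the reader; the operation has already returned
        await asyncio.gather(reader, return_exceptions=True)
        # Pick up any events that arrived after the reader's last wake-up.
        while not progress.empty():
            seen.append(progress.get_nowait())
    return result, seen


result, seen = asyncio.run(main())
print(result, seen)
```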

MTP And Speculative Evaluation

EngineOptions accepts mtp_path, mtp_draft_tokens, and mtp_margin for DS4's optional MTP path. Engine metadata exposes has_mtp and mtp_draft_tokens, while sessions expose eval_speculative_argmax(). Treat this as an advanced DS4-specific path and validate it with your target model.
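
For orientation, greedy speculative decoding in general works by letting a cheap draft propose several tokens and keeping only the prefix that the main model would also have chosen. The sketch below illustrates that generic acceptance rule. It is not DS4's eval_speculative_argmax(), it does not model mtp_margin, and verify_draft and toy_target are hypothetical names.

```python
def verify_draft(context, draft, target_argmax):
    """Accept draft tokens while they match the target model's greedy choice.

    Returns the tokens to append: the accepted draft prefix plus one token
    chosen by the target model (a correction or a bonus token).
    """
    accepted = []
    for token in draft:
        expected = target_argmax(context + accepted)
        if token != expected:
            accepted.append(expected)  # replace the first wrong guess
            return accepted
        accepted.append(token)
    accepted.append(target_argmax(context + accepted))  # all guesses were right
    return accepted


# Toy target model: always predicts the previous token plus one.
def toy_target(tokens):
    return tokens[-1] + 1


partly_wrong = verify_draft([1, 2, 3], [4, 5, 9], toy_target)
all_right = verify_draft([1, 2, 3], [4, 5], toy_target)
print(partly_wrong, all_right)
```

The loop calls the target once per position for clarity. Practical engines typically score all draft positions in one batched pass, which is where the speedup comes from.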

Build From Source

Source builds need a DS4 checkout from the repository's default branch, plus CMake and a platform C/C++ toolchain:

git clone https://github.com/antirez/ds4.git /path/to/ds4

Build a Metal package on macOS arm64:

DS4_SOURCE_DIR=/path/to/ds4 \
PYDS4_BACKEND=metal \
python -m pip install --no-binary=pyds4 pyds4

Build a CUDA package on Linux:

DS4_SOURCE_DIR=/path/to/ds4 \
PYDS4_BACKEND=cuda \
CUDA_ARCH=90 \
python -m pip install --no-binary=pyds4 pyds4

To build this checkout and install it into a specific project's virtual environment, point PYTHON at that project's interpreter:

DS4_SOURCE_DIR=.local/ds4 \
PYDS4_BACKEND=metal \
PYTHON=/path/to/project/.venv/bin/python \
make ds4-bridge

If DS4_SOURCE_DIR is omitted during a source build, the package remains import-safe but native inference is unavailable.

For local wrapper development without a real GGUF or GPU, use the deterministic fake DS4 shim:

PYDS4_USE_FAKE_DS4=1 PYDS4_BACKEND=cpu \
python -m pip install -e ".[test,dev]"

Benchmark

Benchmark the pyds4 sync path, async primitive path, and async next_token() path:

python scripts/benchmark_generation.py \
  --model /path/to/ds4/ds4flash.gguf \
  --backend metal \
  --ctx-size 4096 \
  --max-new-tokens 128 \
  --mode pyds4 \
  --json-output /tmp/pyds4-bench.json

The benchmark reports open, prompt, warmup, sync, generation, and total times, along with time to first token, tokens per second, event-loop latency, queue round-trip latency, and an output preview.
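
As a reading aid for two of these figures, the sketch below derives time to first token and tokens per second from wall-clock timestamps using one common definition. The summarize() helper and the sample numbers are hypothetical, and the benchmark script may compute its values differently.

```python
def summarize(start, first_token_at, end, new_tokens):
    """Derive latency and throughput figures from wall-clock timestamps."""
    generation_time = end - start
    throughput = new_tokens / generation_time if generation_time > 0 else 0.0
    return {
        "time_to_first_token_s": first_token_at - start,
        "generation_time_s": generation_time,
        "tokens_per_second": throughput,
    }


# Made-up timestamps in seconds, e.g. as read from time.perf_counter().
stats = summarize(start=0.0, first_token_at=0.25, end=4.0, new_tokens=128)
print(stats)
```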

Test

Run the fake-native test suite:

PYDS4_USE_FAKE_DS4=1 PYDS4_BACKEND=cpu python -m pip install -e ".[test,dev]"
make test
make test-cpp-sanitizers

Run real-model integration tests when a supported DS4 GGUF is available:

PYDS4_MODEL=/path/to/ds4/ds4flash.gguf \
PYDS4_BACKEND=metal \
PYDS4_CTX=4096 \
python -m pytest -q \
  tests/test_real_ds4_integration.py \
  tests/test_async_real_ds4_integration.py

Build and smoke-test a wheel:

DS4_SOURCE_DIR=/path/to/ds4 PYDS4_BACKEND=metal make wheel

WHEEL="dist/pyds4-*.whl" \
SMOKE_BACKEND=metal \
SMOKE_EXPECT_AVAILABLE=true \
SMOKE_MODEL=/path/to/ds4/ds4flash.gguf \
SMOKE_CTX=4096 \
make wheel-smoke

Release

pyproject.toml is the version source. For a new release, bump it and commit:

make version VERSION=1.0.2

Build and publish one backend wheel plus the sdist:

DS4_SOURCE_DIR=/path/to/ds4 PYDS4_BACKEND=metal make release

Use PYDS4_BACKEND=cuda on a CUDA Linux build host to produce the NVIDIA wheel. Metal and CUDA wheels can be uploaded for the same pyds4 version because they have different platform tags. The GitHub Release workflow builds the sdist, macOS arm64 Metal wheels, and Linux x86_64 CUDA wheels for Python 3.11 through 3.14. The cuda_arch workflow input controls the NVIDIA architecture passed to CMake.

The CUDA wheel is audited in the release workflow and repaired to a manylinux_2_38_x86_64 wheel for PyPI. CUDA runtime and cuBLAS libraries are left external, so Linux installs must provide compatible NVIDIA CUDA 12 runtime libraries on the target host.
