
Speculative decoding engine with local and remote target model execution

Project description

AutoDraft: Automatic Cost-Performance Adaptation in User-Cloud Distributed Speculative Decoding


This repository is the official implementation of "AutoDraft: Automatic Cost-Performance Adaptation in User-Cloud Distributed Speculative Decoding," submitted to NeurIPS 2026.

Abstract

As demand for on-cloud Large Language Models (LLMs) explodes, the high inference cost has become a critical issue. Recently, user-cloud distributed speculative decoding has emerged as a promising paradigm, wherein a lightweight draft model on the user device generates candidate tokens while a large target model on the cloud server verifies them in parallel. However, existing approaches rely on static configurations, overlooking the heterogeneous performance of user devices and the alignment between draft and target models. This rigidity leads to redundant resource utilization. To address this, we propose AutoDraft, an adaptive framework that navigates the cost-performance trade-off by continuously profiling the execution environment and dynamically configuring the draft tree structure (width, depth, and transmitted nodes) in real time to accommodate diverse user-defined Service Level Objectives (SLOs), including monetary cost and inference throughput. Extensive evaluations demonstrate that AutoDraft achieves up to an x% reduction in overall inference costs. Furthermore, we encapsulate these technical optimizations into an accessible API, allowing users to effortlessly input their desired constraints and dynamically control the framework without requiring deep system expertise.

Framework


Getting Started

AutoDraft is an adaptive tree-based user-cloud distributed speculative decoding framework. It consists of two cooperating processes:

  1. Server process — runs the target model and verifies the draft tree.
  2. User process — runs the draft model and builds an adaptive token tree.

The server process must be launched first, because the user process opens a socket to the server immediately on start. Always boot the server, wait until it is listening, and then launch the user process (or the UI / config-driven runner that drives it).
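If you script the bring-up, a simple way to enforce this ordering is to poll the server port before starting the user side. Below is a minimal stdlib sketch (it is not part of the AutoDraft API, and the host/port values are just examples); it only checks that something is accepting TCP connections on the port:

import socket
import time

def wait_for_server(host="127.0.0.1", port=26001, timeout_s=120):
    """Block until a TCP connection to host:port succeeds, or time out."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            with socket.create_connection((host, port), timeout=2):
                return  # server is accepting connections
        except OSError:
            time.sleep(1)  # not listening yet; retry
    raise TimeoutError(f"server not reachable at {host}:{port}")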

We support two ways of bringing them up: a PyPI library (drop-in Python API) and a GitHub repository (full source, UI, configurable scripts).


Contents

  1. Python library usage
  2. GitHub repository usage
  3. Configure runs via an XML file

1. Python library usage

1.1 Install the library

pip install autodraft-sd

1.2 Server process example

Run this first, on the machine that will host the target model. It blocks forever (server loop), so leave the terminal open.

from autodraft import serve_target

serve_target(
    host="0.0.0.0",                  # bind address of the server process
    port=26001,                      # port to listen on
    server_name="autodraft",
    hf_token=None,                   # gated repos: pass token here or set HF_TOKEN
)

1.3 User process example

Once the server is up, run this on the user device. It connects to the server at target_host:target_port and uses it to verify the draft tree.

from autodraft import Autodraft

engine = Autodraft(
    draft_model="meta-llama/Llama-3.2-1B-Instruct",
    target_model="meta-llama/Llama-3.2-1B-Instruct",   # normally a larger model than the draft; the same small ID keeps this example lightweight
    draft_quantization="4bit",       # "none" / "4bit" / "8bit"
    target_quantization="4bit",      # "none" / "4bit" / "8bit"
    target_host="127.0.0.1",         # IP address of the server process
    target_port=26001,               # port the server process listens on
    cost="energy_total",             # "total_cost" (default) / "api_cost" / "energy_total"
                                     # / "draft_energy" / "target_energy"
    hf_token=None,                   # pass token here or set HF_TOKEN env var
)

result = engine.run(
    input_text="...",
    proactive=False,
    cs="balanced",                   # "tps" / "balanced" (default) / "cost", or 0~100 number
    save_tradeoff=True,              # save the reference trade-off curve (default True)
    tradeoff_dir=None,               # default: $MOBITREE_DATA_DIR/tradeoff
    server_name="autodraft",         # must match the server's server_name
    # Any other run_draft kwargs (~70 options) are forwarded as-is.
)
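Because cs also accepts a number (0 = TPS-first, 100 = cost-first), sweeping it is an easy way to see the trade-off on your own workload. A minimal sketch reusing the engine above, using only the documented arguments (the structure of result is library-defined, so it is printed as-is):

for cs in (0, 25, 50, 75, 100):
    result = engine.run(
        input_text="Explain speculative decoding in one paragraph.",
        cs=cs,                       # numeric form of "tps" / "balanced" / "cost"
        server_name="autodraft",     # must match the server's server_name
    )
    print(cs, result)                # result structure is library-defined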

2. GitHub repository usage

2.1 Prerequisites

  • NVIDIA driver
  • Python 3.10 or newer
  • git

2.2 Installation

2.2.1 Get the project

git clone git@github.com:PJChoi1/MobiTree.git MobiTree
cd MobiTree
git checkout seperate

2.2.2 Virtual environment + install requirements

python3 -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip setuptools wheel
pip install -r requirements.txt

2.3 Running

Bring the server up first, then the UI / user side.

2.3.1 Launch the server process — in terminal A:

source .venv/bin/activate
./run_target.sh --host 0.0.0.0 --port 26001 --device-map auto \
                --load-in-8bit --lazy-load --enable-auto-target-profile \
                --server-name EC2-A100

Wait until you see the server log indicating that it is listening on the port (lazy-load mode means no model is loaded yet — that happens when the user side connects).

2.3.2 Launch the UI — in terminal B:

source .venv/bin/activate
python3 -m uvicorn chat_ui.main:app --host 0.0.0.0 --port 8000

Open in your browser: http://localhost:8000

UI screenshot

2.3.3 Register the server in the UI

In the browser UI:

  1. Click Add Server.
  2. Enter Server Name — the same value you passed to --server-name in step 2.3.1.
  3. Enter IP Address.
  4. Enter Port.
  5. Click Add Server.

2.3.4 Pick a server / model and start

With the server registered, make the following selections in order:

  1. Pick the server from the Server dropdown.
  2. Pick the target model under Server Model.
  3. Pick the draft model under Draft Model.
  4. Pick quantization (4bit / 8bit) under Server Q / Draft Q.
  5. Click Start to launch the runtime, then send a message.

3. Configure runs via an XML file

Instead of typing many command-line flags every time, you can keep all run settings (models, dataset, cost objective, tree size, etc.) in a single XML file and point the runner at it. This makes it easy to tweak just the values you care about, and it makes runs reproducible: anyone with the same XML file can repeat the same run.

bash evaluation/run_main_experiment_overall_performance.sh \
     --config-xml evaluation/overall_performance_draft_energy_humaneval_example.xml

Two example configs ship with the repo (copy and edit them):

  • evaluation/overall_performance_draft_energy_humaneval_example.xml — HumanEval, draft-energy objective.
  • evaluation/overall_performance_total_cost_mt_bench_example.xml — MT-bench, total-cost objective.

3.1 Cost models

Every cost knob the framework supports is exposed as an explicit XML tag inside <objective>, so users can edit them directly in the XML instead of chasing defaults elsewhere. Pick the cost objective via OBJECTIVE_METRICS_CSV and edit the knobs that matter for it:

  • total_cost: Draft GPU $ + Target GPU $ + Communication $. Knobs: TARGET_PER_HOUR_COST, DRAFT_ELECTRICITY_COST_PER_KWH, USER_COMM_COST_PER_GB, CLOUD_OUTBOUND_COST_PER_GB.
  • api_cost: Target API $. Knobs: TARGET_PER_HOUR_COST.
  • energy_total: Draft GPU kWh + Target GPU kWh. No knobs; measured via NVML.
  • draft_energy: Draft GPU kWh. No knobs; measured via NVML.
  • target_energy: Target GPU kWh. No knobs; measured via NVML.

Excerpt (overall_performance_total_cost_mt_bench_example.xml):

<objective>
  <!-- Pick one (comma-separated for sweeps):
       total_cost / api_cost / energy_total / draft_energy / target_energy -->
  <OBJECTIVE_METRICS_CSV>total_cost</OBJECTIVE_METRICS_CSV>

  <!-- 0 = TPS-first, 100 = cost-first; space-separated for a sweep -->
  <AUTODRAFT_CS_LIST>0</AUTODRAFT_CS_LIST>

  <!-- Cost-model parameters. Knobs are ignored for energy_* objectives. -->
  <TARGET_PER_HOUR_COST>1.208</TARGET_PER_HOUR_COST>          <!-- $/h cloud GPU/API; used by: total_cost, api_cost -->
  <DRAFT_ELECTRICITY_COST_PER_KWH>0.2</DRAFT_ELECTRICITY_COST_PER_KWH>   <!-- $/kWh user side, multiplied by measured draft kWh; used by: total_cost -->
  <USER_COMM_COST_PER_GB>0.33</USER_COMM_COST_PER_GB>          <!-- $/GB user→cloud; used by: total_cost -->
  <CLOUD_OUTBOUND_COST_PER_GB>0.09</CLOUD_OUTBOUND_COST_PER_GB>  <!-- $/GB cloud→user; used by: total_cost -->
</objective>
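For intuition, here is one plausible reading of how the total_cost knobs combine, matching the "Draft GPU $ + Target GPU $ + Communication $" breakdown above. The framework's exact accounting may differ; this helper is purely illustrative and uses the example knob values from the excerpt:

def total_cost_usd(target_hours, draft_kwh, gb_up, gb_down):
    """Back-of-the-envelope total_cost with the example knob values above."""
    target = target_hours * 1.208          # TARGET_PER_HOUR_COST ($/h)
    draft = draft_kwh * 0.2                # DRAFT_ELECTRICITY_COST_PER_KWH ($/kWh)
    comm = gb_up * 0.33 + gb_down * 0.09   # USER_COMM_COST_PER_GB, CLOUD_OUTBOUND_COST_PER_GB ($/GB)
    return target + draft + comm

# A 10-minute session drawing 0.02 kWh on the draft GPU and moving
# 0.5 GB up / 0.1 GB down would cost roughly $0.38:
print(round(total_cost_usd(10 / 60, 0.02, 0.5, 0.1), 2))   # 0.38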

The same <objective> block lives in the draft_energy example too (with the dollar knobs marked "for reference / easy switching") so you can flip from energy to total-cost just by editing OBJECTIVE_METRICS_CSV — no other surgery needed.

3.2 Other config blocks

The XML also has runtime / models / dataset / algorithms / tree blocks. Excerpt:

<runtime>
  <TARGET_HOST>192.168.0.12</TARGET_HOST>
  <TARGET_PORT>26001</TARGET_PORT>
  <DEVICE_MAP>cuda:0</DEVICE_MAP>
  <DRAFT_DEVICE_NAME>rtx5080</DRAFT_DEVICE_NAME>
  <SERVER_NAME>rtxproa6000</SERVER_NAME>
</runtime>

<models>
  <BASE_MODEL_PATH>Qwen/Qwen2.5-14B-Instruct</BASE_MODEL_PATH>
  <DRAFT_MODEL_PATH>Qwen/Qwen2.5-1.5B-Instruct</DRAFT_MODEL_PATH>
  <TARGET_QUANTIZATION>none</TARGET_QUANTIZATION>
  <DRAFT_QUANTIZATION>none</DRAFT_QUANTIZATION>
</models>

<dataset><BENCHES_CSV>mt_bench</BENCHES_CSV></dataset>

<algorithms><ENABLE_HYBRID_AUTODRAFT>1</ENABLE_HYBRID_AUTODRAFT></algorithms>

<tree>
  <PROPOSED_NODES>150</PROPOSED_NODES>
  <PROPOSED_MAX_DEPTH>15</PROPOSED_MAX_DEPTH>
  <!-- ... profile width / node lists ... -->
</tree>

The same rule applies here: the server process (run_target.sh) must already be running on TARGET_HOST:TARGET_PORT before you launch the runner.
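If you script around these configs (for example, to sanity-check a file before a long run), the blocks are plain XML and can be read with the standard library. A minimal sketch, assuming the tag names shown in the excerpts above and a single root element wrapping the blocks:

import xml.etree.ElementTree as ET

root = ET.parse(
    "evaluation/overall_performance_total_cost_mt_bench_example.xml"
).getroot()

# Print the objective and endpoint this config will use.
host = root.findtext(".//TARGET_HOST")
port = root.findtext(".//TARGET_PORT")
print(f"objective: {root.findtext('.//OBJECTIVE_METRICS_CSV')}")
print(f"server:    {host}:{port}")
print(f"target:    {root.findtext('.//BASE_MODEL_PATH')}")
print(f"draft:     {root.findtext('.//DRAFT_MODEL_PATH')}")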
