Speculative decoding engine with local and remote target model execution
AutoDraft: Automatic Cost-Performance Adaptation in User-Cloud Distributed Speculative Decoding
This repository is the official implementation of "AutoDraft: Automatic Cost-Performance Adaptation in User-Cloud Distributed Speculative Decoding," submitted to IEEE SECON 2026.
This document is written so that even a first-time installer can get running by copy-pasting the commands.
- Target environment: Ubuntu / Linux + NVIDIA GPU
- Scope: driver / CUDA check, venv setup, dependency install, target server, FastAPI chat UI, and UI usage.
For the autodraft Python API (programmatic access to the speculative-decoding runtime), see section 8 at the end of this README.
1) Prerequisites
1-1. Required software
- NVIDIA driver
- Python 3.10 or newer
- git (optional)
- tmux or screen (optional) — convenient when leaving the server running for a while.
1-2. Driver / CUDA sanity check
nvidia-smi
python3 --version
If everything is OK:
- nvidia-smi prints GPU / driver info.
- python3 --version prints a version string.
Notes:
- This repository pins torch==2.7.0+cu128 in requirements.txt.
- You do not need a system-wide CUDA Toolkit (nvcc) installed. The PyTorch wheel + NVIDIA driver combination is sufficient.
- The cu128 extra index is already declared inside requirements.txt, so pip install -r requirements.txt is the only command you need.
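For reference, the relevant pins in requirements.txt look like this (illustrative excerpt — check the file in your checkout for the authoritative contents):

--extra-index-url https://download.pytorch.org/whl/cu128
torch==2.7.0+cu128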
2) Get the project
git clone git@github.com:PJChoi1/MobiTree.git MobiTree
cd MobiTree
git checkout seperate
Skip this step if you already have the source.
3) Virtual environment + dependencies
Run the block below as is:
python3 -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip setuptools wheel
pip install -r requirements.txt
Verify the install:
python -c "import torch; print('torch:', torch.__version__); print('cuda:', torch.cuda.is_available())"
cuda: True means GPU detection works.
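To also print which GPU torch sees (same style as the check above):

python -c "import torch; print('device:', torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'none')"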
4) Running (the most common path)
Follow the steps below in order. (Terminal A runs the UI; the target server comes later in terminal B, see 5-3.)
4-1. Step 1 — Launch the UI
In terminal A:
source .venv/bin/activate
python3 -m uvicorn chat_ui.main:app --host 0.0.0.0 --port 8000
Open http://localhost:8000 in your browser.
4-2. Step 2 — Verify UI is reachable
When the UI loads in the browser, confirm that the top Server panel and
the Add Server button are visible.
4-3. Step 3 — Register a server through the UI
In the browser UI:
- Click Add Server.
- Enter Server Name (e.g. icnl-server).
- Enter IP Address (e.g. 163.152.163.152).
- Enter Port (e.g. 26001).
- Click Add Server.
5) UI usage (for first-time users)
5-1. Pick server / model / quantization
Starting from a registered server, select in this order:
- Pick the server from the Server dropdown.
- Pick the target model under Server Model.
- Pick the draft model under Draft Model.
- Pick quantization (4bit / 8bit) under Server Q / Draft Q.
- Click Start to launch the runtime, then send a message.
5-2. Key buttons / settings
- Start: start the chat runtime.
- Stop: stop the chat runtime and request the target to unload the model.
- Profile LLM: refresh the reference cache / profile.
- Token Source Coloring: color tokens by their origin.
- Algorithm: pick the decoding strategy (MobiTree, Server-Only, Server-Only-AR, OPT-Tree, Fixed-tree).
- Mode: pick the run mode (Chat for conversation, Benchmark for evaluation).
- Max New Tokens: maximum tokens to generate per response.
- Dataset: evaluation dataset used in Benchmark mode.
5-3. Run the target with the same server_name
Important: the Server Name you registered in step 4-3 and the
--server-name you pass to the target launcher must match.
If you registered Server Name=icnl-server in step 4-3, in terminal B:
source .venv/bin/activate
./run_target.sh --host 0.0.0.0 --port 26001 --device-map auto \
--load-in-8bit --lazy-load --enable-auto-target-profile \
--server-name icnl-server
Notes:
- --host / --port: must match the IP Address / Port you typed in step 4-3.
- --server-name: must match the Server Name you typed in step 4-3.
- --lazy-load: load the model on the first incoming request.
- --enable-auto-target-profile: auto-generate the target profile if it doesn't exist yet.
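If you prefer Python to the shell launcher, the autodraft API described in section 8 can start the same target. A minimal sketch (serve_target is documented in 8-3; model loading and quantization are driven by the draft side over RPC, so no per-model flags are needed here):

from autodraft import serve_target

serve_target(
    host="0.0.0.0",
    port=26001,
    server_name="icnl-server",  # must match the Server Name from step 4-3
)  # blocks forever (server loop)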
6) Cache / profile file locations
- Target profile: data/profile/profile_target_<server-name>_<model>_tq-<quant>.json
- Reference cache: data/reference/ref_<server-name>_<base>_<device>_<draft>_tq-<...>_dq-<...>_mt_bench_<metric>_<mode>_*.json
Filenames are load-bearing — they must match exactly to be reused across runs.
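When a run unexpectedly rebuilds a profile, it helps to list what is already on disk. A small sketch (data/ is the default root, or $MOBITREE_DATA_DIR if set — see 8-4):

import os
from pathlib import Path

# Filenames encode the run conditions, so an unexpected rebuild usually
# means one condition (server name, model, quantization, ...) changed.
data_dir = Path(os.environ.get("MOBITREE_DATA_DIR", "data"))
for sub in ("profile", "reference"):
    for f in sorted((data_dir / sub).glob("*.json")):
        print(f"{sub}/{f.name}")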
7) Batch experiments via XML config
For reproducible end-to-end experiment sweeps (model × dataset × cost objective × algorithm), use the XML-driven runner. It calls the same draft / target binaries that the UI uses, but reads its settings from a single XML file so a run can be replayed exactly.
bash evaluation/run_main_experiment_overall_performance.sh \
--config-xml evaluation/overall_performance_draft_energy_humaneval_example.xml
Two example configs ship with the repo (copy one and tweak the values):
- evaluation/overall_performance_draft_energy_humaneval_example.xml — HumanEval, draft-energy objective.
- evaluation/overall_performance_total_cost_mt_bench_example.xml — MT-bench, total-cost objective.
Excerpt (overall_performance_draft_energy_humaneval_example.xml):
<?xml version="1.0" encoding="UTF-8"?>
<experiment_config>
  <runtime>
    <TARGET_HOST>192.168.0.12</TARGET_HOST>
    <TARGET_PORT>26001</TARGET_PORT>
    <DEVICE_MAP>cuda:0</DEVICE_MAP>
    <DRAFT_DEVICE_NAME>rtx5080</DRAFT_DEVICE_NAME>
    <SERVER_NAME>rtxproa6000</SERVER_NAME>
  </runtime>
  <models>
    <BASE_MODEL_PATH>Qwen/Qwen2.5-14B-Instruct</BASE_MODEL_PATH>
    <DRAFT_MODEL_PATH>Qwen/Qwen2.5-1.5B-Instruct</DRAFT_MODEL_PATH>
    <TARGET_QUANTIZATION>none</TARGET_QUANTIZATION>
    <DRAFT_QUANTIZATION>none</DRAFT_QUANTIZATION>
  </models>
  <objective>
    <OBJECTIVE_METRICS_CSV>draft_energy</OBJECTIVE_METRICS_CSV>
    <AUTODRAFT_CS_LIST>50</AUTODRAFT_CS_LIST>
  </objective>
  <dataset>
    <BENCHES_CSV>humaneval</BENCHES_CSV>
  </dataset>
  <algorithms>
    <ENABLE_HYBRID_AUTODRAFT>1</ENABLE_HYBRID_AUTODRAFT>
  </algorithms>
  <tree>
    <PROPOSED_NODES>150</PROPOSED_NODES>
    <PROPOSED_MAX_DEPTH>15</PROPOSED_MAX_DEPTH>
    <!-- ... profile width / node lists ... -->
  </tree>
</experiment_config>
Notes:
- TARGET_HOST / TARGET_PORT / SERVER_NAME must match the target server you launched in §5-3 (or via python examples/target.py).
- Configuration precedence is env vars → XML values → script defaults — environment variables on the command line still override the XML, which is handy for quick one-off tweaks without editing the file (see the example below).
- The runner accepts both forms: direct tags (<MAX_NEW_TOKENS>256</MAX_NEW_TOKENS>) and <parameter> entries (<parameter name="MAX_NEW_TOKENS">256</parameter>).
- Run bash evaluation/run_main_experiment_overall_performance.sh --help to see every variable the script understands.
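For example, to replay the same config with a different dataset without editing the XML (assuming mt_bench is the dataset tag the MT-bench example config uses):

BENCHES_CSV=mt_bench bash evaluation/run_main_experiment_overall_performance.sh \
    --config-xml evaluation/overall_performance_draft_energy_humaneval_example.xml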
8) autodraft Python API
autodraft is a thin Python wrapper around the MobiTree
speculative-decoding runtime. Existing shell scripts (run_target.sh,
run_mt_bench_sd.sh, etc.) and CLI behavior are kept intact; this
section is purely additive.
8-1. End-to-end usage examples
See examples/ for runnable scripts that show the full
target / draft flow:
- examples/target.py — start the target server in lazy-load mode (one terminal).
- examples/draft.py — run the draft side, send a prompt, and print the generated text + stats (another terminal).
Quickstart from a source checkout:
git clone https://github.com/PJChoi1/MobiTree.git
cd MobiTree
# Terminal A: target server
python examples/target.py
# Terminal B: draft side (produces real generated text)
python examples/draft.py
8-2. Install options
# Option A — editable from a source checkout
pip install -e .
pip install -e ".[dev]" # adds tests, lint, build, twine
# Option B — from PyPI
pip install autodraft-sd
A single pip install autodraft-sd covers everything: the wrapper, the
runtime (evaluation/ + opt_classic/), 4/8-bit quantization
(bitsandbytes), and trade-off PNG rendering (matplotlib). There are
no [quant] or [plot] extras to remember — autodraft-sd is a GPU
library, so splitting hairs over a few MB doesn't help.
The PyPI distribution name is autodraft-sd because autodraft is already taken on PyPI. The Python import name is still autodraft — write from autodraft import Autodraft in code (same pattern as pip install pillow → from PIL import ...).
The PyPI wheel ships both the wrapper API (autodraft/) and the
speculative-decoding runtime (evaluation/, opt_classic/), so a
single pip install autodraft-sd lets engine.run(...) and
serve_target(...) work without a source checkout (you still need a
GPU and a CUDA-matched PyTorch wheel — see 8-5).
Research-only assets (chat_ui/, data/ benchmark datasets, result/)
are intentionally excluded from the wheel; access them through the
GitHub source checkout.
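A quick post-install check that the wheel exposes both documented entry points:

python -c "from autodraft import Autodraft, serve_target; print('autodraft-sd import OK')"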
8-3. Autodraft(...) and engine.run() parameters
from autodraft import Autodraft
engine = Autodraft(
    draft_model="meta-llama/Llama-3.2-1B-Instruct",
    target_model="meta-llama/Llama-3.2-1B-Instruct",
    draft_quantization=None,   # None / "none" / "4bit" / "8bit"
    target_quantization=None,
    target_host="127.0.0.1",
    target_port=26001,
    cost="total_cost",         # "total_cost" (default) / "api_cost" /
                               # "energy_total" / "draft_energy" /
                               # "target_energy". Set once at init —
                               # the reference cache key depends on it,
                               # so to switch metrics, build a new
                               # Autodraft instance.
    hf_token=None,             # gated repos: pass token here or set HF_TOKEN
)

result = engine.run(
    input_text="...",
    proactive=False,
    cs="balanced",            # "tps" / "balanced" (default) / "cost",
                              # or a number in [0, 100]
    save_tradeoff=True,       # save reference trade-off curve
    tradeoff_dir=None,        # default: $MOBITREE_DATA_DIR/tradeoff
    server_name="autodraft",  # must match the target's server_name
    # Any other run_draft kwargs (~70 options) are forwarded as-is.
)
Result shape:
{
    "generated_text": "...",   # final model output
    "input_text": "...",
    "proactive": False,
    "cs": "balanced",
    "cost": "total_cost",
    "algorithm": "MobiTree",
    "stats": {                 # one-line summary
        "total_steps": 15,
        "total_new_tokens": 121,
        "total_time_seconds": 1.98,
        "tokens_per_second": 61.10,
        "tokens_per_step": 8.07,
        "avg_tree_width": 6.4,
        "avg_tree_depth": 7.2,
        "avg_nnodes": 50.5,    # avg nodes per tree
        "avg_accept_length": 7.07,
        "acceptance_ratio_avg": 0.98,
        "total_cost": 0.000125,  # in the unit of the chosen cost objective ($ or kWh)
        "api_cost": 0.0,
        "draft_cost": 0.000031,
        "target_cost": 0.000094,
    },
    "tradeoff_files": {
        # Filename is conditions-hashed (server, target, draft, device,
        # quantization, bench, cost, mode), so repeated calls with the
        # same conditions overwrite the same file (no timestamps).
        # Paired 1:1 with the reference cache.
        "json": ".../data/tradeoff/tradeoff_<server>_<target>_<device>_<draft>_tq-<...>_dq-<...>_<bench>_<cost>_<mode>.json",
        "png": ".../data/tradeoff/tradeoff_<server>_<target>_<device>_<draft>_tq-<...>_dq-<...>_<bench>_<cost>_<mode>.png",  # only if matplotlib is installed
    },
    "answer_row": { ... },     # answers[-1] from the integrated result
    "raw_result": { ... },     # full integrated result (latency_statistics, accept_stats, etc.)
}
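A minimal sketch of consuming this result, using only the keys documented above:

result = engine.run(input_text="Explain speculative decoding in one sentence.")
print(result["generated_text"])

stats = result["stats"]
print(f'{stats["tokens_per_second"]:.1f} tok/s, '
      f'{stats["avg_accept_length"]:.2f} accepted tokens per step')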
The target.py side is intentionally minimal because MobiTree always
runs split-process and the target loads whatever the draft tells it to
load via the reload_model RPC (lazy load):
from autodraft import serve_target
serve_target(
    host="0.0.0.0",
    port=26001,
    server_name="autodraft",
    hf_token=None,
)  # blocks forever (server loop)
8-4. Cache layout for the API
Profile / reference caches default to ./data/ under the directory
where you launch the script (i.e. python my_script.py writes to
<cwd>/data/profile/, <cwd>/data/reference/). This matches the
source-checkout convention where users run from the repo root.
The first run with no cache auto-bakes the target latency profile and the draft latency profile (tens of seconds to a few minutes). Later runs cache-hit and start fast. Cache filenames embed the GPU name, so moving to a different GPU rebuilds them. To pin a custom GPU label:
export MOBITREE_DEVICE_NAME="rtx-5080"
If unset, the code auto-detects via torch.cuda.get_device_name(0),
falling back to "unknown-gpu" if no GPU is available.
To share caches across projects (e.g. a fixed location):
export MOBITREE_DATA_DIR=/path/to/shared/cache
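Both variables can also be set from Python. A sketch, assuming (as with HF_TOKEN in 8-8) they must be set before the runtime is imported:

import os

# Assumption: read when the runtime loads, so set these before
# importing autodraft / constructing the engine.
os.environ["MOBITREE_DEVICE_NAME"] = "rtx-5080"
os.environ["MOBITREE_DATA_DIR"] = "/path/to/shared/cache"

from autodraft import Autodraft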
8-5. PyTorch installation note
autodraft-sd declares a generic torch dependency, but PyTorch wheels
are CUDA-specific. For a working install, install your matching wheel
first (e.g. torch==2.7.0+cu128 from this repo's requirements.txt)
and then pip install autodraft-sd. Otherwise pip will pull a
default CPU wheel that won't see your GPU.
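One known-good sequence matching this repo's pin (pick the wheel that matches your driver; see pytorch.org for other CUDA versions):

pip install torch==2.7.0+cu128 --extra-index-url https://download.pytorch.org/whl/cu128
pip install autodraft-sd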
8-6. Talking to a target on another machine
Run the target on the server machine, e.g. with python examples/target.py
(or the legacy ./run_target.sh), then on the client machine:
from autodraft import Autodraft
engine = Autodraft(
    draft_model="...",
    target_model="...",
    target_host="TARGET_SERVER_IP",
    target_port=26001,
    cost="total_cost",  # api_cost / energy_total / draft_energy / target_energy
)

result = engine.run(
    "input text",
    proactive=True,
    cs="balanced",      # "tps" / "balanced" / "cost", or 0~100 number
    server_name="my-server",
)
8-7. Running examples without pip install
To import autodraft without installing, the repo root must be on
sys.path. examples/draft.py and examples/target.py already include
a 4-line bootstrap that inserts their parent directory, so
python examples/draft.py from the repo root just works.
For your own scripts:
# (a) Run with -m from the repo root
python -m examples.draft
# (b) Add the repo to PYTHONPATH
export PYTHONPATH=/path/to/MobiTree:$PYTHONPATH
python my_script.py
# (c) Add the same sys.path bootstrap at the top of your script
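For option (c), the bootstrap is just a few lines. A sketch of the idea (the exact lines in examples/*.py may differ; adjust the number of .parent hops so the path lands on the repo root):

import sys
from pathlib import Path

# Put the repo root on sys.path so `import autodraft` resolves
# without a pip install.
sys.path.insert(0, str(Path(__file__).resolve().parent.parent))

from autodraft import Autodraft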
8-8. HuggingFace token
Gated models (meta-llama/*, etc.) require an HF access token. Two
options:
# (a) Pass directly to the constructor
engine = Autodraft(draft_model=..., target_model=..., hf_token="hf_xxx")
# (b) Set the env var (HF_TOKEN or HUGGING_FACE_HUB_TOKEN)
# export HF_TOKEN=hf_xxx
hf_token is masked as '***' in repr(engine). Internally it is
exported to HF_TOKEN / HUGGING_FACE_HUB_TOKEN before the runtime is
imported, so the same token reaches both draft-side from_pretrained
calls and any target server you launch in the same process tree.