
Speculative decoding engine with local and remote target model execution


AutoDraft: Automatic Cost-Performance Adaptation in User-Cloud Distributed Speculative Decoding

This repository is the official implementation of "AutoDraft: Automatic Cost-Performance Adaptation in User-Cloud Distributed Speculative Decoding," submitted to IEEE SECON 2026.

This document is written so that even a first-time installer can get running by copy-pasting the commands.

  • Target environment: Ubuntu / Linux + NVIDIA GPU
  • Scope: driver / CUDA check, venv setup, dependency install, target server, FastAPI chat UI, and UI usage.

For the autodraft Python API (programmatic access to the speculative-decoding runtime), see section 8 at the end of this README.


1) Prerequisites

1-1. Required software

  • NVIDIA driver
  • Python 3.10 or newer
  • git
  • (optional) tmux or screen — convenient when leaving the server running for a while.

1-2. Driver / CUDA sanity check

nvidia-smi
python3 --version

If everything is OK:

  • nvidia-smi prints GPU / driver info.
  • python3 --version prints a version string.

Notes:

  • This repository pins torch==2.7.0+cu128 in requirements.txt.
  • You do not need a system-wide CUDA Toolkit (nvcc) installed. The PyTorch wheel + NVIDIA driver combination is sufficient.
  • The cu128 extra index is already declared inside requirements.txt, so pip install -r requirements.txt is the only command you need.
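
For reference, the relevant lines in a requirements file set up this way typically look like the excerpt below (illustrative only; the repository's requirements.txt is authoritative):

# illustrative excerpt; check requirements.txt for the actual contents
--extra-index-url https://download.pytorch.org/whl/cu128
torch==2.7.0+cu128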

2) Get the project

git clone git@github.com:PJChoi1/MobiTree.git MobiTree
cd MobiTree
git checkout seperate

Skip this step if you already have the source.


3) Virtual environment + dependencies

Run the block below as is:

python3 -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip setuptools wheel
pip install -r requirements.txt

Verify the install:

python -c "import torch; print('torch:', torch.__version__); print('cuda:', torch.cuda.is_available())"

cuda: True means GPU detection works.


4) Running (the most common path)

Follow the steps below in order. (Terminal A runs the UI; the target server is launched in terminal B later, in section 5-3.)

4-1. Step 1 — Launch the UI

In terminal A:

source .venv/bin/activate
python3 -m uvicorn chat_ui.main:app --host 0.0.0.0 --port 8000

Open http://localhost:8000 in your browser (the port from the uvicorn command above); from another machine, use the server's IP address instead of localhost.

4-2. Step 2 — Verify UI is reachable

When the UI loads in the browser, confirm that the top Server panel and the Add Server button are visible.

4-3. Step 3 — Register a server through the UI

In the browser UI:

  1. Click Add Server.
  2. Enter Server Name (e.g. icnl-server).
  3. Enter IP Address (e.g. 163.152.163.152).
  4. Enter Port (e.g. 26001).
  5. Click Add Server.

5) UI usage (for first-time users)

5-1. Pick server / model / quantization

Starting from a registered server, select in this order:

  1. Pick the server from the Server dropdown.
  2. Pick the target model under Server Model.
  3. Pick the draft model under Draft Model.
  4. Pick quantization (4bit / 8bit) under Server Q / Draft Q.
  5. Click Start to launch the runtime, then send a message.

5-2. Key buttons / settings

  • Start: start the chat runtime.
  • Stop: stop the chat runtime and request the target to unload the model.
  • Profile LLM: refresh the reference cache / profile.
  • Token Source Coloring: color tokens by their origin.
  • Algorithm: pick the decoding strategy (MobiTree, Server-Only, Server-Only-AR, OPT-Tree, Fixed-tree).
  • Mode: pick the run mode (Chat for conversation, Benchmark for evaluation).
  • Max New Tokens: maximum tokens to generate per response.
  • Dataset: evaluation dataset used in Benchmark mode.

5-3. Run the target with the same server_name

Important: the Server Name you registered in step 4-3 and the --server-name you pass to the target launcher must match.

If you registered Server Name=icnl-server in step 4-3, in terminal B:

source .venv/bin/activate
./run_target.sh --host 0.0.0.0 --port 26001 --device-map auto \
                --load-in-8bit --lazy-load --enable-auto-target-profile \
                --server-name icnl-server

Notes:

  • --host / --port: must match the IP Address / Port you typed in step 4-3.
  • --server-name: must match the Server Name you typed in step 4-3.
  • --lazy-load: load the model on the first incoming request.
  • --enable-auto-target-profile: auto-generate the target profile if it doesn't exist yet.

6) Cache / profile file locations

  • Target profile: data/profile/profile_target_<server-name>_<model>_tq-<quant>.json
  • Reference cache: data/reference/ref_<server-name>_<base>_<device>_<draft>_tq-<...>_dq-<...>_mt_bench_<metric>_<mode>_*.json

Filenames encode the run configuration; a profile or reference cache is reused across runs only when every field in the filename matches exactly.
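
As a concrete (hypothetical) example, a target profile for a server registered as icnl-server, serving Qwen/Qwen2.5-14B-Instruct with 8-bit quantization, would land at a path shaped like the following (the exact model-name normalization may differ):

data/profile/profile_target_icnl-server_Qwen2.5-14B-Instruct_tq-8bit.json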


7) Batch experiments via XML config

For reproducible end-to-end experiment sweeps (model × dataset × cost objective × algorithm), use the XML-driven runner. It calls the same draft / target binaries that the UI uses, but reads its settings from a single XML file so a run can be replayed exactly.

bash evaluation/run_main_experiment_overall_performance.sh \
     --config-xml evaluation/overall_performance_draft_energy_humaneval_example.xml

Two example configs ship with the repo (copy one and tweak the values for your own setup):

  • evaluation/overall_performance_draft_energy_humaneval_example.xml — HumanEval, draft-energy objective.
  • evaluation/overall_performance_total_cost_mt_bench_example.xml — MT-bench, total-cost objective.

Excerpt (overall_performance_draft_energy_humaneval_example.xml):

<?xml version="1.0" encoding="UTF-8"?>
<experiment_config>
  <runtime>
    <TARGET_HOST>192.168.0.12</TARGET_HOST>
    <TARGET_PORT>26001</TARGET_PORT>
    <DEVICE_MAP>cuda:0</DEVICE_MAP>
    <DRAFT_DEVICE_NAME>rtx5080</DRAFT_DEVICE_NAME>
    <SERVER_NAME>rtxproa6000</SERVER_NAME>
  </runtime>

  <models>
    <BASE_MODEL_PATH>Qwen/Qwen2.5-14B-Instruct</BASE_MODEL_PATH>
    <DRAFT_MODEL_PATH>Qwen/Qwen2.5-1.5B-Instruct</DRAFT_MODEL_PATH>
    <TARGET_QUANTIZATION>none</TARGET_QUANTIZATION>
    <DRAFT_QUANTIZATION>none</DRAFT_QUANTIZATION>
  </models>

  <objective>
    <OBJECTIVE_METRICS_CSV>draft_energy</OBJECTIVE_METRICS_CSV>
    <AUTODRAFT_CS_LIST>50</AUTODRAFT_CS_LIST>
  </objective>

  <dataset>
    <BENCHES_CSV>humaneval</BENCHES_CSV>
  </dataset>

  <algorithms>
    <ENABLE_HYBRID_AUTODRAFT>1</ENABLE_HYBRID_AUTODRAFT>
  </algorithms>

  <tree>
    <PROPOSED_NODES>150</PROPOSED_NODES>
    <PROPOSED_MAX_DEPTH>15</PROPOSED_MAX_DEPTH>
    <!-- ... profile width / node lists ... -->
  </tree>
</experiment_config>

Notes:

  • TARGET_HOST / TARGET_PORT / SERVER_NAME must match the target server you launched in §5-3 (or via python examples/target.py).
  • Configuration precedence is env vars → XML values → script defaults: environment variables set on the command line override the XML, which is handy for quick one-off tweaks without editing the file (see the example after this list).
  • The runner accepts both forms: direct tags (<MAX_NEW_TOKENS>256</MAX_NEW_TOKENS>) and <parameter> entries (<parameter name="MAX_NEW_TOKENS">256</parameter>).
  • Run bash evaluation/run_main_experiment_overall_performance.sh --help to see every variable the script understands.
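
For example, an environment variable on the command line overrides the corresponding XML value (MAX_NEW_TOKENS is used here as an illustration; run --help for the full list):

MAX_NEW_TOKENS=512 bash evaluation/run_main_experiment_overall_performance.sh \
     --config-xml evaluation/overall_performance_draft_energy_humaneval_example.xml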

8) autodraft Python API

autodraft is a thin Python wrapper around the MobiTree speculative-decoding runtime. Existing shell scripts (run_target.sh, run_mt_bench_sd.sh, etc.) and CLI behavior are kept intact; this section is purely additive.

8-1. End-to-end usage examples

See examples/ for runnable scripts that show the full target / draft flow:

  • examples/target.py — start the target server in lazy-load mode (one terminal).
  • examples/draft.py — run the draft side, send a prompt, and print the generated text + stats (another terminal).

Quickstart from a source checkout:

git clone https://github.com/PJChoi1/MobiTree.git
cd MobiTree
# Terminal A: target server
python examples/target.py
# Terminal B: draft side (produces real generated text)
python examples/draft.py

8-2. Install options

# Option A — editable from a source checkout
pip install -e .
pip install -e ".[dev]"     # adds tests, lint, build, twine

# Option B — from PyPI
pip install autodraft-sd

A single pip install autodraft-sd covers everything: the wrapper, the runtime (evaluation/ + opt_classic/), 4/8-bit quantization (bitsandbytes), and trade-off PNG rendering (matplotlib). There are no [quant] or [plot] extras to remember; autodraft-sd targets GPU machines, so saving a few megabytes by splitting optional dependencies is not worth the extra install steps.

The PyPI distribution name is autodraft-sd because autodraft is already taken on PyPI. The Python import name is still autodraft: write from autodraft import Autodraft in code (the same pattern as Pillow, where you pip install pillow but write from PIL import ... in code).

The PyPI wheel ships both the wrapper API (autodraft/) and the speculative-decoding runtime (evaluation/, opt_classic/), so a single pip install autodraft-sd lets engine.run(...) and serve_target(...) work without a source checkout (you still need a GPU and a CUDA-matched PyTorch wheel — see 8-5).

Research-only assets (chat_ui/, data/ benchmark datasets, result/) are intentionally excluded from the wheel; access them through the GitHub source checkout.

8-3. Autodraft(...) and engine.run() parameters

from autodraft import Autodraft

engine = Autodraft(
    draft_model="meta-llama/Llama-3.2-1B-Instruct",
    target_model="meta-llama/Llama-3.2-1B-Instruct",
    draft_quantization=None,    # None / "none" / "4bit" / "8bit"
    target_quantization=None,
    target_host="127.0.0.1",
    target_port=26001,
    cost="total_cost",          # "total_cost" (default) / "api_cost" /
                                # "energy_total" / "draft_energy" /
                                # "target_energy". Set once at init —
                                # the reference cache key depends on it,
                                # so to switch metrics, build a new
                                # Autodraft instance.
    hf_token=None,              # gated repos: pass token here or set HF_TOKEN
)

result = engine.run(
    input_text="...",
    proactive=False,
    cs="balanced",              # "tps" / "balanced" (default) / "cost",
                                # or a number in [0, 100]
    save_tradeoff=True,         # save reference trade-off curve
    tradeoff_dir=None,          # default: $MOBITREE_DATA_DIR/tradeoff
    server_name="autodraft",    # must match the target's server_name
    # Any other run_draft kwargs (~70 options) are forwarded as-is.
)

Result shape:

{
    "generated_text": "...",        # final model output
    "input_text": "...",
    "proactive": False,
    "cs": "balanced",
    "cost": "total_cost",
    "algorithm": "MobiTree",
    "stats": {                      # one-line summary
        "total_steps": 15,
        "total_new_tokens": 121,
        "total_time_seconds": 1.98,
        "tokens_per_second": 61.10,
        "tokens_per_step": 8.07,
        "avg_tree_width": 6.4,
        "avg_tree_depth": 7.2,
        "avg_nnodes": 50.5,         # avg nodes per tree
        "avg_accept_length": 7.07,
        "acceptance_ratio_avg": 0.98,
        "total_cost": 0.000125,     # in the unit of the chosen cost objective ($ or kWh)
        "api_cost": 0.0,
        "draft_cost": 0.000031,
        "target_cost": 0.000094,
    },
    "tradeoff_files": {
        # Filename is conditions-hashed (server, target, draft, device,
        # quantization, bench, cost, mode), so repeated calls with the
        # same conditions overwrite the same file (no timestamps).
        # Paired 1:1 with the reference cache.
        "json": ".../data/tradeoff/tradeoff_<server>_<target>_<device>_<draft>_tq-<...>_dq-<...>_<bench>_<cost>_<mode>.json",
        "png":  ".../data/tradeoff/tradeoff_<server>_<target>_<device>_<draft>_tq-<...>_dq-<...>_<bench>_<cost>_<mode>.png",  # only if matplotlib is installed
    },
    "answer_row": { ... },          # answers[-1] from the integrated result
    "raw_result": { ... },          # full integrated result (latency_statistics, accept_stats, etc.)
}
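
A minimal sketch of consuming this result, using only the fields documented above:

print(result["generated_text"])

stats = result["stats"]
print(f"{stats['total_new_tokens']} tokens in {stats['total_time_seconds']:.2f}s "
      f"({stats['tokens_per_second']:.1f} tok/s, total cost {stats['total_cost']:.6f})")

# Path of the saved trade-off curve (when save_tradeoff=True)
print(result["tradeoff_files"]["json"])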

The target.py side is intentionally minimal: MobiTree always runs as two processes, and the target lazily loads whatever model the draft requests via the reload_model RPC:

from autodraft import serve_target

serve_target(
    host="0.0.0.0",
    port=26001,
    server_name="autodraft",
    hf_token=None,
)  # blocks forever (server loop)

8-4. Cache layout for the API

Profile / reference caches default to ./data/ under the directory where you launch the script (i.e. python my_script.py writes to <cwd>/data/profile/, <cwd>/data/reference/). This matches the source-checkout convention where users run from the repo root.

The first run with no cache automatically builds the target latency profile and the draft latency profile (tens of seconds to a few minutes). Subsequent runs hit the cache and start quickly. Cache filenames embed the GPU name, so moving to a different GPU rebuilds them. To pin a custom GPU label:

export MOBITREE_DEVICE_NAME="rtx-5080"

If unset, the code auto-detects via torch.cuda.get_device_name(0), falling back to "unknown-gpu" if no GPU is available.
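
A short sketch of that fallback behavior (not the library's actual code, just the logic described above):

import torch

def detect_device_name() -> str:
    # GPU name when CUDA is available, otherwise the documented fallback label.
    if torch.cuda.is_available():
        return torch.cuda.get_device_name(0)
    return "unknown-gpu"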

To share caches across projects (e.g. a fixed location):

export MOBITREE_DATA_DIR=/path/to/shared/cache

8-5. PyTorch installation note

autodraft-sd declares a generic torch dependency, but PyTorch wheels are CUDA-specific. For a working install, install your matching wheel first (e.g. torch==2.7.0+cu128 from this repo's requirements.txt) and then pip install autodraft-sd. Otherwise pip may resolve a default wheel built for a different CUDA version (or a CPU-only wheel) that won't see your GPU.
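
A typical sequence on a CUDA 12.8 machine might look like the following (the extra index URL is the standard PyTorch wheel index; adjust the CUDA suffix to your setup):

pip install torch==2.7.0+cu128 --extra-index-url https://download.pytorch.org/whl/cu128
pip install autodraft-sd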

8-6. Talking to a target on another machine

Run the target on the server machine, e.g. with python examples/target.py (or the legacy ./run_target.sh), then on the client machine:

from autodraft import Autodraft

engine = Autodraft(
    draft_model="...",
    target_model="...",
    target_host="TARGET_SERVER_IP",
    target_port=26001,
    cost="total_cost",   # api_cost / energy_total / draft_energy / target_energy
)

result = engine.run(
    "input text",
    proactive=True,
    cs="balanced",       # "tps" / "balanced" / "cost", or 0~100 number
    server_name="my-server",
)

8-7. Running examples without pip install

To import autodraft without installing, the repo root must be on sys.path. examples/draft.py and examples/target.py already include a 4-line bootstrap that inserts their parent directory, so python examples/draft.py from the repo root just works.

For your own scripts:

# (a) Run with -m from the repo root
python -m examples.draft

# (b) Add the repo to PYTHONPATH
export PYTHONPATH=/path/to/MobiTree:$PYTHONPATH
python my_script.py

# (c) Add the same sys.path bootstrap at the top of your script
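
A sketch of that bootstrap, assuming your script sits one directory below the repo root (as examples/draft.py does):

import os
import sys

# Put the repo root (the parent of this file's directory) on sys.path
# so `import autodraft` resolves without installing the package.
sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))

from autodraft import Autodraft  # now importable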

8-8. HuggingFace token

Gated models (meta-llama/*, etc.) require an HF access token. Two options:

# (a) Pass directly to the constructor
engine = Autodraft(draft_model=..., target_model=..., hf_token="hf_xxx")

# (b) Set the env var (HF_TOKEN or HUGGING_FACE_HUB_TOKEN)
#   export HF_TOKEN=hf_xxx

hf_token is masked as '***' in repr(engine). Internally it is exported to HF_TOKEN / HUGGING_FACE_HUB_TOKEN before the runtime is imported, so the same token reaches both draft-side from_pretrained calls and any target server you launch in the same process tree.
