
Speculative decoding engine with local and remote target model execution


AutoDraft: Automatic Cost-Performance Adaptation in User-Cloud Distributed Speculative Decoding

This repository is the official implementation of "AutoDraft: Automatic Cost-Performance Adaptation in User-Cloud Distributed Speculative Decoding," submitted to IEEE SECON 2026.

This document is written so that even a first-time installer can get running by copy-pasting the commands.

  • Target environment: Ubuntu / Linux + NVIDIA GPU
  • Scope: driver / CUDA check, venv setup, dependency install, target server, FastAPI chat UI, and UI usage.

For the autodraft Python API (programmatic access to the speculative-decoding runtime), see section 8 at the end of this README.


1) Prerequisites

1-1. Required software

  • NVIDIA driver
  • Python 3.10 or newer
  • git
  • (optional) tmux or screen — convenient when leaving the server running for a while.

1-2. Driver / CUDA sanity check

nvidia-smi
python3 --version

If everything is OK:

  • nvidia-smi prints GPU / driver info.
  • python3 --version prints a version string.

Notes:

  • This repository pins torch==2.7.0+cu128 in requirements.txt.
  • You do not need a system-wide CUDA Toolkit (nvcc) installed. The PyTorch wheel + NVIDIA driver combination is sufficient.
  • The cu128 extra index is already declared inside requirements.txt, so pip install -r requirements.txt is the only command you need.
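
For reference, the relevant lines in a requirements file set up this way typically look like the excerpt below (illustrative only; the repository's requirements.txt is authoritative):

# illustrative excerpt; check requirements.txt for the actual contents
--extra-index-url https://download.pytorch.org/whl/cu128
torch==2.7.0+cu128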

2) Get the project

git clone git@github.com:PJChoi1/MobiTree.git MobiTree
cd MobiTree
git checkout seperate

Skip this step if you already have the source.


3) Virtual environment + dependencies

Run the block below as is:

python3 -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip setuptools wheel
pip install -r requirements.txt

Verify the install:

python -c "import torch; print('torch:', torch.__version__); print('cuda:', torch.cuda.is_available())"

cuda: True means GPU detection works.


4) Running (the most common path)

Follow the steps below in order. (Terminal A runs the UI; the target server is launched in terminal B later, in section 5-3.)

4-1. Step 1 — Launch the UI

In terminal A:

source .venv/bin/activate
python3 -m uvicorn chat_ui.main:app --host 0.0.0.0 --port 8000

Open http://localhost:8000 in your browser (the port from the uvicorn command above); from another machine, use the server's IP address instead of localhost.

4-2. Step 2 — Verify UI is reachable

When the UI loads in the browser, confirm that the top Server panel and the Add Server button are visible.

4-3. Step 3 — Register a server through the UI

In the browser UI:

  1. Click Add Server.
  2. Enter Server Name (e.g. icnl-server).
  3. Enter IP Address (e.g. 163.152.163.152).
  4. Enter Port (e.g. 26001).
  5. Click Add Server.

5) UI usage (for first-time users)

5-1. Pick server / model / quantization

Starting from a registered server, select in this order:

  1. Pick the server from the Server dropdown.
  2. Pick the target model under Server Model.
  3. Pick the draft model under Draft Model.
  4. Pick quantization (4bit / 8bit) under Server Q / Draft Q.
  5. Click Start to launch the runtime, then send a message.

5-2. Key buttons / settings

  • Start: start the chat runtime.
  • Stop: stop the chat runtime and request the target to unload the model.
  • Profile LLM: refresh the reference cache / profile.
  • Token Source Coloring: color tokens by their origin.
  • Algorithm: pick the decoding strategy (MobiTree, Server-Only, Server-Only-AR, OPT-Tree, Fixed-tree).
  • Mode: pick the run mode (Chat for conversation, Benchmark for evaluation).
  • Max New Tokens: maximum tokens to generate per response.
  • Dataset: evaluation dataset used in Benchmark mode.

5-3. Run the target with the same server_name

Important: the Server Name you registered in step 4-3 and the --server-name you pass to the target launcher must match.

If you registered Server Name=icnl-server in step 4-3, in terminal B:

source .venv/bin/activate
./run_target.sh --host 0.0.0.0 --port 26001 --device-map auto \
                --load-in-8bit --lazy-load --enable-auto-target-profile \
                --server-name icnl-server

Notes:

  • --host / --port: must match the IP Address / Port you typed in step 4-3.
  • --server-name: must match the Server Name you typed in step 4-3.
  • --lazy-load: load the model on the first incoming request.
  • --enable-auto-target-profile: auto-generate the target profile if it doesn't exist yet.

6) Cache / profile file locations

  • Target profile: data/profile/profile_target_<server-name>_<model>_tq-<quant>.json
  • Reference cache: data/reference/ref_<server-name>_<base>_<device>_<draft>_tq-<...>_dq-<...>_mt_bench_<metric>_<mode>_*.json

Filenames encode the run configuration; a profile or reference cache is reused across runs only when every field in the filename matches exactly.
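
As a concrete (hypothetical) example, a target profile for a server registered as icnl-server, serving Qwen/Qwen2.5-14B-Instruct with 8-bit quantization, would land at a path shaped like the following (the exact model-name normalization may differ):

data/profile/profile_target_icnl-server_Qwen2.5-14B-Instruct_tq-8bit.json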


7) Batch experiments via XML config

For reproducible end-to-end experiment sweeps (model × dataset × cost objective × algorithm), use the XML-driven runner. It calls the same draft / target binaries that the UI uses, but reads its settings from a single XML file so a run can be replayed exactly.

bash evaluation/run_main_experiment_overall_performance.sh \
     --config-xml evaluation/overall_performance_draft_energy_humaneval_example.xml

Two example configs ship with the repo (copy one and tweak the values for your own setup):

  • evaluation/overall_performance_draft_energy_humaneval_example.xml — HumanEval, draft-energy objective.
  • evaluation/overall_performance_total_cost_mt_bench_example.xml — MT-bench, total-cost objective.

Excerpt (overall_performance_draft_energy_humaneval_example.xml):

<?xml version="1.0" encoding="UTF-8"?>
<experiment_config>
  <runtime>
    <TARGET_HOST>192.168.0.12</TARGET_HOST>
    <TARGET_PORT>26001</TARGET_PORT>
    <DEVICE_MAP>cuda:0</DEVICE_MAP>
    <DRAFT_DEVICE_NAME>rtx5080</DRAFT_DEVICE_NAME>
    <SERVER_NAME>rtxproa6000</SERVER_NAME>
  </runtime>

  <models>
    <BASE_MODEL_PATH>Qwen/Qwen2.5-14B-Instruct</BASE_MODEL_PATH>
    <DRAFT_MODEL_PATH>Qwen/Qwen2.5-1.5B-Instruct</DRAFT_MODEL_PATH>
    <TARGET_QUANTIZATION>none</TARGET_QUANTIZATION>
    <DRAFT_QUANTIZATION>none</DRAFT_QUANTIZATION>
  </models>

  <objective>
    <OBJECTIVE_METRICS_CSV>draft_energy</OBJECTIVE_METRICS_CSV>
    <AUTODRAFT_CS_LIST>50</AUTODRAFT_CS_LIST>
  </objective>

  <dataset>
    <BENCHES_CSV>humaneval</BENCHES_CSV>
  </dataset>

  <algorithms>
    <ENABLE_HYBRID_AUTODRAFT>1</ENABLE_HYBRID_AUTODRAFT>
  </algorithms>

  <tree>
    <PROPOSED_NODES>150</PROPOSED_NODES>
    <PROPOSED_MAX_DEPTH>15</PROPOSED_MAX_DEPTH>
    <!-- ... profile width / node lists ... -->
  </tree>
</experiment_config>

Notes:

  • TARGET_HOST / TARGET_PORT / SERVER_NAME must match the target server you launched in §5-3 (or via python examples/target.py).
  • Configuration precedence is env vars → XML values → script defaults: environment variables set on the command line override the XML, which is handy for quick one-off tweaks without editing the file (see the example after this list).
  • The runner accepts both forms: direct tags (<MAX_NEW_TOKENS>256</MAX_NEW_TOKENS>) and <parameter> entries (<parameter name="MAX_NEW_TOKENS">256</parameter>).
  • Run bash evaluation/run_main_experiment_overall_performance.sh --help to see every variable the script understands.
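
For example, an environment variable on the command line overrides the corresponding XML value (MAX_NEW_TOKENS is used here as an illustration; run --help for the full list):

MAX_NEW_TOKENS=512 bash evaluation/run_main_experiment_overall_performance.sh \
     --config-xml evaluation/overall_performance_draft_energy_humaneval_example.xml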

8) autodraft Python API

autodraft is a thin Python wrapper around the MobiTree speculative-decoding runtime. Existing shell scripts (run_target.sh, run_mt_bench_sd.sh, etc.) and CLI behavior are kept intact; this section is purely additive.

8-1. End-to-end usage examples

See examples/ for runnable scripts that show the full target / draft flow:

  • examples/target.py — start the target server in lazy-load mode (one terminal).
  • examples/draft.py — run the draft side, send a prompt, and print the generated text + stats (another terminal).

Quickstart from a source checkout:

git clone https://github.com/PJChoi1/MobiTree.git
cd MobiTree
# Terminal A: target server
python examples/target.py
# Terminal B: draft side (produces real generated text)
python examples/draft.py

8-2. Install options

# Option A — editable from a source checkout
pip install -e .
pip install -e ".[dev]"     # adds tests, lint, build, twine

# Option B — from PyPI
pip install autodraft-sd

A single pip install autodraft-sd covers everything: the wrapper, the runtime (evaluation/ + opt_classic/), 4/8-bit quantization (bitsandbytes), and trade-off PNG rendering (matplotlib). There are no [quant] or [plot] extras to remember; autodraft-sd targets GPU machines, so saving a few megabytes by splitting optional dependencies is not worth the extra install steps.

The PyPI distribution name is autodraft-sd because autodraft is already taken on PyPI. The Python import name is still autodraft: write from autodraft import Autodraft in code (the same pattern as Pillow, where you pip install pillow but write from PIL import ... in code).

The PyPI wheel ships both the wrapper API (autodraft/) and the speculative-decoding runtime (evaluation/, opt_classic/), so a single pip install autodraft-sd lets engine.run(...) and serve_target(...) work without a source checkout (you still need a GPU and a CUDA-matched PyTorch wheel — see 8-5).

Research-only assets (chat_ui/, data/ benchmark datasets, result/) are intentionally excluded from the wheel; access them through the GitHub source checkout.

8-3. Autodraft(...) and engine.run() parameters

from autodraft import Autodraft

engine = Autodraft(
    draft_model="meta-llama/Llama-3.2-1B-Instruct",
    target_model="meta-llama/Llama-3.2-1B-Instruct",
    draft_quantization=None,    # None / "none" / "4bit" / "8bit"
    target_quantization=None,
    target_host="127.0.0.1",
    target_port=26001,
    cost="total_cost",          # "total_cost" (default) / "api_cost" /
                                # "energy_total" / "draft_energy" /
                                # "target_energy". Set once at init —
                                # the reference cache key depends on it,
                                # so to switch metrics, build a new
                                # Autodraft instance.
    hf_token=None,              # gated repos: pass token here or set HF_TOKEN
)

result = engine.run(
    input_text="...",
    proactive=False,
    cs="balanced",              # "tps" / "balanced" (default) / "cost",
                                # or a number in [0, 100]
    save_tradeoff=True,         # save reference trade-off curve
    tradeoff_dir=None,          # default: $MOBITREE_DATA_DIR/tradeoff
    server_name="autodraft",    # must match the target's server_name
    # Any other run_draft kwargs (~70 options) are forwarded as-is.
)

Result shape:

{
    "generated_text": "...",        # final model output
    "input_text": "...",
    "proactive": False,
    "cs": "balanced",
    "cost": "total_cost",
    "algorithm": "MobiTree",
    "stats": {                      # one-line summary
        "total_steps": 15,
        "total_new_tokens": 121,
        "total_time_seconds": 1.98,
        "tokens_per_second": 61.10,
        "tokens_per_step": 8.07,
        "avg_tree_width": 6.4,
        "avg_tree_depth": 7.2,
        "avg_nnodes": 50.5,         # avg nodes per tree
        "avg_accept_length": 7.07,
        "acceptance_ratio_avg": 0.98,
        "total_cost": 0.000125,     # in the unit of the chosen cost objective ($ or kWh)
        "api_cost": 0.0,
        "draft_cost": 0.000031,
        "target_cost": 0.000094,
    },
    "tradeoff_files": {
        # Filename is conditions-hashed (server, target, draft, device,
        # quantization, bench, cost, mode), so repeated calls with the
        # same conditions overwrite the same file (no timestamps).
        # Paired 1:1 with the reference cache.
        "json": ".../data/tradeoff/tradeoff_<server>_<target>_<device>_<draft>_tq-<...>_dq-<...>_<bench>_<cost>_<mode>.json",
        "png":  ".../data/tradeoff/tradeoff_<server>_<target>_<device>_<draft>_tq-<...>_dq-<...>_<bench>_<cost>_<mode>.png",  # only if matplotlib is installed
    },
    "answer_row": { ... },          # answers[-1] from the integrated result
    "raw_result": { ... },          # full integrated result (latency_statistics, accept_stats, etc.)
}
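
A minimal sketch of consuming this result, using only the fields documented above:

print(result["generated_text"])

stats = result["stats"]
print(f"{stats['total_new_tokens']} tokens in {stats['total_time_seconds']:.2f}s "
      f"({stats['tokens_per_second']:.1f} tok/s, total cost {stats['total_cost']:.6f})")

# Path of the saved trade-off curve (when save_tradeoff=True)
print(result["tradeoff_files"]["json"])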

The target.py side is intentionally minimal: MobiTree always runs as two processes, and the target lazily loads whatever model the draft requests via the reload_model RPC:

from autodraft import serve_target

serve_target(
    host="0.0.0.0",
    port=26001,
    server_name="autodraft",
    hf_token=None,
)  # blocks forever (server loop)

8-4. Cache layout for the API

Profile / reference caches default to ./data/ under the directory where you launch the script (i.e. python my_script.py writes to <cwd>/data/profile/, <cwd>/data/reference/). This matches the source-checkout convention where users run from the repo root.

The first run with no cache automatically builds the target latency profile and the draft latency profile (tens of seconds to a few minutes). Subsequent runs hit the cache and start quickly. Cache filenames embed the GPU name, so moving to a different GPU rebuilds them. To pin a custom GPU label:

export MOBITREE_DEVICE_NAME="rtx-5080"

If unset, the code auto-detects via torch.cuda.get_device_name(0), falling back to "unknown-gpu" if no GPU is available.
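
A short sketch of that fallback behavior (not the library's actual code, just the logic described above):

import torch

def detect_device_name() -> str:
    # GPU name when CUDA is available, otherwise the documented fallback label.
    if torch.cuda.is_available():
        return torch.cuda.get_device_name(0)
    return "unknown-gpu"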

To share caches across projects (e.g. a fixed location):

export MOBITREE_DATA_DIR=/path/to/shared/cache

8-5. PyTorch installation note

autodraft-sd declares a generic torch dependency, but PyTorch wheels are CUDA-specific. For a working install, install your matching wheel first (e.g. torch==2.7.0+cu128 from this repo's requirements.txt) and then pip install autodraft-sd. Otherwise pip may resolve a default wheel built for a different CUDA version (or a CPU-only wheel) that won't see your GPU.
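
A typical sequence on a CUDA 12.8 machine might look like the following (the extra index URL is the standard PyTorch wheel index; adjust the CUDA suffix to your setup):

pip install torch==2.7.0+cu128 --extra-index-url https://download.pytorch.org/whl/cu128
pip install autodraft-sd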

8-6. Talking to a target on another machine

Run the target on the server machine, e.g. with python examples/target.py (or the legacy ./run_target.sh), then on the client machine:

from autodraft import Autodraft

engine = Autodraft(
    draft_model="...",
    target_model="...",
    target_host="TARGET_SERVER_IP",
    target_port=26001,
    cost="total_cost",   # api_cost / energy_total / draft_energy / target_energy
)

result = engine.run(
    "input text",
    proactive=True,
    cs="balanced",       # "tps" / "balanced" / "cost", or 0~100 number
    server_name="my-server",
)

8-7. Running examples without pip install

To import autodraft without installing, the repo root must be on sys.path. examples/draft.py and examples/target.py already include a 4-line bootstrap that inserts their parent directory, so python examples/draft.py from the repo root just works.

For your own scripts:

# (a) Run with -m from the repo root
python -m examples.draft

# (b) Add the repo to PYTHONPATH
export PYTHONPATH=/path/to/MobiTree:$PYTHONPATH
python my_script.py

# (c) Add the same sys.path bootstrap at the top of your script
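
A sketch of that bootstrap, assuming your script sits one directory below the repo root (as examples/draft.py does):

import os
import sys

# Put the repo root (the parent of this file's directory) on sys.path
# so `import autodraft` resolves without installing the package.
sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))

from autodraft import Autodraft  # now importable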

8-8. HuggingFace token

Gated models (meta-llama/*, etc.) require an HF access token. Two options:

# (a) Pass directly to the constructor
engine = Autodraft(draft_model=..., target_model=..., hf_token="hf_xxx")

# (b) Set the env var (HF_TOKEN or HUGGING_FACE_HUB_TOKEN)
#   export HF_TOKEN=hf_xxx

hf_token is masked as '***' in repr(engine). Internally it is exported to HF_TOKEN / HUGGING_FACE_HUB_TOKEN before the runtime is imported, so the same token reaches both draft-side from_pretrained calls and any target server you launch in the same process tree.
