Local evaluation harness for the DreamHouse timber-frame benchmark
Project description
DreamHouse Benchmark
Evaluate AI models on timber-frame structure generation. Your model receives reference images and constraints, generates 3D geometry in Blender, and the server validates it against 10 structural engineering tests.
Prerequisites
- Python 3.9+
- Blender 4.5.4 LTS is the recommended runtime. This release uses Blender's bundled Python 3.11, which matches the provided validator artifact.
pip install dreamhouse
Server
The evaluator runs locally on your machine. The server in this repo mirrors the public /v1/... API and invokes Blender under the hood to validate each submission.
http://localhost:8000
Interactive Swagger docs: http://localhost:8000/docs
See Running the server locally below for setup.
Quick Start
Install the package from the repo:
pip install dreamhouse
For development from a cloned repository, use pip install -e . instead.
Check your local environment:
dreamhouse doctor
Set up benchmark artifacts. This downloads the split task pack, reassembles it, verifies the checksum, and installs the validator artifact:
dreamhouse setup --download-artifacts
If Blender is not already installed, use the tested default:
dreamhouse setup --download-artifacts --install-blender
The recommended runtime is Blender 4.5.4 LTS with bundled Python 3.11.
If you use your own Blender install, run dreamhouse doctor and make sure its
Python version is 3.11.x.
Validator artifacts are selected by Blender's bundled Python version. The
initial release provides validation.pyc for Python 3.11. Future releases may
include versioned artifacts such as validation-cp311.pyc; dreamhouse setup
will choose the matching file automatically.
Run a harness smoke test:
dreamhouse smoke-test --task BN_01_0003 --output-dir ./runs/smoke_BN_01_0003
This uses a built-in stub agent. It verifies setup, Blender execution, geometry export, server submission, and validation. It is not a model evaluation.
Run one task with your own agent:
dreamhouse run \
--task BN_01_0003 \
--agent my_agent:generate \
--output-dir ./runs/BN_01_0003
Your agent function must accept (prompt: str, images: list[str], feedback: list[dict]) and return Blender Python code. See
examples/agent_template.py.
Two common agent patterns are supported out of the box:
Hosted OpenAI API:
export OPENAI_API_KEY=...
export OPENAI_MODEL=gpt-4.1
dreamhouse run --task BN_01_0003 --agent examples.openai_agent:generate
Self-hosted OpenAI-compatible endpoint:
export OPENAI_BASE_URL=http://127.0.0.1:8001/v1
export OPENAI_API_KEY=dummy
export OPENAI_MODEL=my-vision-model
dreamhouse run --task BN_01_0003 --agent examples.openai_compatible_agent:generate
For local integration testing without a real model, run the mock endpoint:
python examples/mock_openai_server.py --port 8001
List available task ids:
dreamhouse list-tasks --limit 20
You can also start only the local API server:
dreamhouse server --port 8000
How It Works
Inputs and outputs
Users provide:
dreamhouse_tasks_1200.dhpack- installed validator artifact
- Blender path
- model code that turns a task prompt + 5 reference images into Blender Python
Each run outputs:
- generated Blender code
- generated
structure.blend - exported geometry submission
- validation results and feedback history
The benchmark does not require users to access the original source .blend
files.
Step 1: Get a task and prompt your model → your model outputs Blender Python code
Step 2: Execute the code in Blender, export geometry → produces a JSON submission file
Step 3: Submit to the server and get results → pass/fail on 10 structural tests
Step 4: (Optional) Feed failures back, iterate
Full working code for every step: examples/quickstart.py
Example inputs/outputs for every step: examples/walkthrough/
Step 1: Get a task and prompt your model
Fetch a task from the server — you'll get constraints (footprint, stories, roof type) and 5 reference images (front, back, left, right, top). Pass these to your model and ask it to generate a Blender Python script.
import requests
SERVER = "http://localhost:8000"
task = requests.get(f"{SERVER}/v1/tasks/AF_01_0018").json()
Example task response: examples/walkthrough/1_task_response.json
Example prompt for your model: examples/walkthrough/2_example_prompt.md
Step 2: Execute in Blender and prepare submission
Run your model's output in Blender headless, then export the geometry using the bundled helper script.
import subprocess
subprocess.run(["blender", "--background", "--python", "generated_code.py"], check=True)
subprocess.run([
"blender", "--background", "structure.blend",
"--python", "helpers/blender_export.py",
"--", "submission.json", "AF_01_0018"
], check=True)
This produces submission.json — a list of members with their names, positions, and bounding boxes. Example: examples/walkthrough/4_submission.json
Step 3: Submit and get validation results
Create a session, submit the geometry, and poll for results.
import json, time
session = requests.post(f"{SERVER}/v1/sessions", json={
"task_id": "AF_01_0018", "model_id": "my-model-v1", "protocol": "stepwise",
}).json()
with open("submission.json") as f:
geometry = json.load(f)
job = requests.post(
f"{SERVER}/v1/sessions/{session['session_id']}/submit",
json={"members": geometry["members"]},
).json()
while True:
time.sleep(2)
result = requests.get(
f"{SERVER}/v1/sessions/{session['session_id']}/results/{job['job_id']}"
).json()
if result["status"] in ("complete", "failed"):
break
print(result["results"]["all_passed"]) # True or False
print(result["results"]["tests"]) # per-test pass/fail
Example results: examples/walkthrough/7_result_pass.json, examples/walkthrough/7_result_fail.json
Step 4: (Optional) Iterate
If tests failed, feed the results back to your model, regenerate, and submit again to the same session.
Structural Tests
Each submission is validated against 10 tests. all_passed is true only when all 10 pass.
| Test | What it checks |
|---|---|
completeness |
Has members from all 4 categories: foundation, floor, walls, roof |
load_path |
Continuous load path from roof down to foundation |
span_limits |
Joists/rafters don't exceed allowable spans for their size |
deflection |
Members stay within deflection limits |
roof_coverage |
Rafters cover the full footprint without large gaps |
gap_detection |
No gaps larger than 24" on-center between framing members |
point_load |
Posts/beams align with supports below |
cantilever |
Cantilevers don't exceed backspan/4 or 24" |
stability_score |
At least 80% of members are connected to the ground |
dual_end_connection |
Rafters and studs are supported at both ends |
Member Naming
The validator infers each member's role from its name. Include one of these keywords (case-insensitive):
| Category | Keywords |
|---|---|
| Foundation | Sill, Post, BeamPost, Foundation |
| Floor | CenterBeam, Rim, Joist |
| Walls | Plate, Stud, King, Trimmer, Header, Cripple |
| Roof | Ridge, Rafter, Raf, Collar, Lookout, Purlin, Valley, Hip |
Names must be unique: Stud_01, Stud_02, etc.
Running the server locally
The benchmark runs locally. Users need this repo, Blender, and the provided benchmark artifacts:
dreamhouse_tasks_1200.dhpack— task metadata and reference images- compiled validator artifact — installed into
server/_private/
The original benchmark source .blend files are not required to run
evaluation.
1. Install dependencies
pip install dreamhouse
2. Download benchmark artifacts
Download the runtime artifacts from the maintainer-provided release folder:
https://drive.google.com/drive/folders/1hY4xohyQ7IxxQSG5tz0U-e6OZOdGOV3b?usp=drive_link
The folder should contain:
SHA256SUMS.txt
validation.pyc
dreamhouse_tasks_1200.dhpack.part-00
dreamhouse_tasks_1200.dhpack.part-01
dreamhouse_tasks_1200.dhpack.part-02
dreamhouse setup --download-artifacts performs the download, reassembly,
checksum verification, and validator install automatically. For manual setup,
download all files into one local directory and reassemble the task pack:
cat dreamhouse_tasks_1200.dhpack.part-* > dreamhouse_tasks_1200.dhpack
shasum -a 256 dreamhouse_tasks_1200.dhpack
Expected checksum:
390273e4b300ea35985ea569e4b1684a60ce3feb865f8194ff87c801109dff86
If the checksum does not match, re-download the split files before running the benchmark.
3. Install the validator
Install the validator artifact provided with the benchmark release:
python scripts/install_validator.py /path/to/validation.pyc
# Verify it is in place
python scripts/install_validator.py --check
If you are a maintainer working from source, the same command also accepts a
local validation.py and compiles it for you. The installed validator lives
in server/_private/, which is ignored by git.
4. Point the server at the task pack
Set DREAMHOUSE_TASKS_PACK to the .dhpack file:
export DREAMHOUSE_TASKS_PACK=/absolute/path/to/dreamhouse_tasks_1200.dhpack
Do not unpack the task pack. The local server reads it directly.
5. Start the server
# Blender executable (adjust for your OS if different)
export BLENDER_PATH=/Applications/Blender.app/Contents/MacOS/Blender
uvicorn server.app:app --host 127.0.0.1 --port 8000
Swagger UI: http://localhost:8000/docs Health check: http://localhost:8000/healthz
If port 8000 is already in use (e.g. by another local service), pick a different port and tell the example clients about it:
uvicorn server.app:app --host 127.0.0.1 --port 8765
export DREAMHOUSE_SERVER=http://localhost:8765
Configuration
| Env var | Purpose | Default |
|---|---|---|
DREAMHOUSE_TASKS_PACK |
Path to the provided task pack | unset |
DREAMHOUSE_VALIDATOR |
Path to the compiled validator | <repo>/server/_private/validation.pyc |
DREAMHOUSE_BLENDER_TIMEOUT |
Seconds per validation run | 180 |
BLENDER_PATH |
Blender executable | auto-detected by dreamhouse; server fallback is macOS default |
DREAMHOUSE_SERVER |
Server URL used by the example clients | http://localhost:8000 |
Rate Limits
- Session creation: 5/minute
- Geometry submission: 10/minute
- Sessions expire after 48 hours
Ready-to-Run Examples
examples/quickstart.py — single-task loop
export DREAMHOUSE_SERVER=http://localhost:8000 # if different, e.g. 8765
python examples/quickstart.py \
--task BN_01_0003 \
--agent my_agent:generate \
--output-dir ./runs/BN_01_0003
Use dreamhouse run --agent module:function for real model evaluation.
The shipped stub is only used by dreamhouse smoke-test; it emits four sill
plates so you can verify the harness plumbing.
For direct script-level harness testing only, pass --use-stub.
What to expect in the console
A successful run prints 7 numbered phases per attempt, then an Artifacts summary:
[1] Fetching task BN_01_0003... pulls task spec + 5 reference images
[2] Creating eval session... returns a session_id
[3] Generating code (attempt N)... calls your --agent function
[4] Executing in Blender... runs the code, saves structure.blend
[5] Exporting geometry... writes submission.json with N members
[6] Submitting to eval server... POSTs /v1/sessions/.../submit
[7] Results: [VALIDATION FAILED] validator's verdict on those members
Passed: 5/10
Failed tests: completeness, load_path, roof_coverage, ...
If all_passed is True the loop exits; otherwise it retries up to
--max-retries times (default 3) using the feedback.
With the stub model you should expect:
- Steps 1-6 all succeed (plumbing is fine)
Passed: 5/10every attempt (sill plates only)- Failing tests:
completeness,load_path,roof_coverage,gap_detection,stability_score - Member count grows each retry (4 → 8 → 12) because the Blender scene is preserved across retries and the stub keeps adding the same 4 sills. Real iterative models use this to refine their previous attempt.
If any step fails before [7] you'll see one of:
Blender error:— generated code raised in Blender (syntax error, bad API call, etc.)No members exported— Blender ran but the collection was emptyNo results from server— submit/poll timed out or the server hit an internal error. Check the uvicorn log and that the validator is installed (python scripts/install_validator.py --check).
Output layout
--output-dir (or a fresh temp directory if omitted):
<output_dir>/
├── task.json task spec fetched from the server
├── images/ reference images for this task only
├── structure.blend last Blender scene produced by the model
├── submission.json latest export (same content as newest attempt's)
├── attempts/
│ ├── attempt_1/
│ │ ├── code.py model-generated Blender Python for this attempt
│ │ ├── submission.json exported geometry sent to the server
│ │ └── result.json full validation result returned by the server
│ └── attempt_2/...
├── results.json latest validation result (shortcut)
└── summary.json task id, session id, every attempt, final verdict
How to read each file
task.json— exactly what the server handed the model, including the constraints (footprint, stories, roof type) used to write the prompt. Useful when debugging why the model misunderstood the task.images/*.png— the 5 reference views for the current task.attempts/attempt_N/code.py— the raw Blender Python the model produced for attempt N. Open it in any editor to inspect what the model chose to generate.attempts/attempt_N/submission.json— the geometry that was actually sent to the server. One entry per framing member withname,location,dimensions,bbox_world_corners,matrix_world. Thenameprefix (Sill_,Stud_,Rafter_, ...) is what the validator uses to classify each member.attempts/attempt_N/result.json— the full response from/v1/sessions/.../results/..., includingtests(per-test pass/fail),all_passed,pass_rate, andstability_details/dual_end_detailsdiagnostics.structure.blend— the Blender scene after the last attempt. Open in Blender to visually inspect what was built; handy for debugging geometry issues the numeric tests point at.results.json— a shortcut to the last attempt'sresult.json, so you don't have to know which attempt number was last.summary.json— the single most useful file. Contains the task id, session id, finalstatus(passed,failed,blender_error,export_empty,submit_failed, orno_submission), every attempt's member count / pass rate / failed tests / feedback string, and the finalresultsobject. This is what you read into a notebook or dashboard when scoring a model across many tasks.
Interpreting the tests
All 10 tests are described in detail in the Structural Tests
table above. When reading results.json["tests"], the most common
failure patterns are:
| Failure | Usually means |
|---|---|
completeness |
Model didn't produce at least one member from each of foundation / floor / walls / roof |
load_path |
Loads from roof/floors don't reach a foundation member |
roof_coverage |
Rafters don't span the full footprint (partial roof) |
gap_detection |
Framing spacing is wider than 24" on-center in some region |
stability_score |
<80% of members are connected (via adjacency) to the ground |
dual_end_connection |
Rafters or studs not supported at both ends |
point_load |
Posts/beams landing on unsupported spans |
cantilever |
Overhang exceeds backspan/4 or 24" |
span_limits |
Joist/rafter spans exceed IRC tables for their cross-section |
deflection |
Members sag more than the allowable deflection limit |
examples/pipeline_scaffold.py — multi-step pipeline with retries
python examples/pipeline_scaffold.py --config examples/config_example.yaml
Configured via examples/config_example.yaml.
Subclass ModelBackend (see DummyBackend in the file) to plug in your
own model. Writes per-step validation reports and per-attempt artifacts
into the configured output_dir.
The DREAMHOUSE_SERVER env var overrides the YAML server_url, so a
single config file works across ports without edits.
License
See LICENSE for details.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file dreamhouse-0.1.0.tar.gz.
File metadata
- Download URL: dreamhouse-0.1.0.tar.gz
- Upload date:
- Size: 43.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.5
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
a728c962ce0a1fda21c720030c92beeb441455dd3e0bdf5bd3ac8cb03644588d
|
|
| MD5 |
793346ec350c361f105dc1b77ba10473
|
|
| BLAKE2b-256 |
ae6344481161073fd4d091d1365354742ebe7c26313f46d082f990d297474a4c
|
File details
Details for the file dreamhouse-0.1.0-py3-none-any.whl.
File metadata
- Download URL: dreamhouse-0.1.0-py3-none-any.whl
- Upload date:
- Size: 44.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.5
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
3487e691001bf7b3a07f25d0dceab2895793db1115f30c822349b24cc008d575
|
|
| MD5 |
e92bc78cd64680f59fa637566d4b5944
|
|
| BLAKE2b-256 |
19fde75dbcbc5d9d4f77d7b24b855296d8c13eb9777583f78806573dd3dcacb8
|