Vision-language models as state estimators for PDDL-based planning

S3E: Semantic Symbolic State Estimation

Overview

s3e is a Python package for estimating grounded PDDL state predicates from images using vision-language models (VLMs).

It is designed for workflows that need to connect visual observations to symbolic planning. Given a PDDL domain and problem, s3e enumerates grounded predicates, translates them into model-friendly queries, and returns one of three things: boolean state assignments, per-predicate probabilities, or normalized model outputs suitable for inspection and debugging.
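As a rough illustration of what "enumerating grounded predicates" means, the sketch below expands each predicate over all type-compatible objects. It is not s3e's implementation: the `ground_predicates` helper and its dictionary inputs are hypothetical, and s3e may filter or order groundings differently (for example, it may or may not keep self-pairs such as `on(a,a)`).

```python
from itertools import product


def ground_predicates(predicates, objects):
    """Enumerate grounded predicates as strings such as 'on(a,b)'.

    predicates: dict mapping predicate name -> list of parameter types
    objects:    dict mapping object name -> type
    (Both input shapes are assumptions made for this sketch.)
    """
    grounded = []
    for name, param_types in predicates.items():
        # For each parameter position, collect the objects of the required type.
        candidates = [
            [obj for obj, typ in objects.items() if typ == ptype]
            for ptype in param_types
        ]
        # The Cartesian product yields every type-compatible argument tuple.
        for combo in product(*candidates):
            grounded.append(f"{name}({','.join(combo)})")
    return grounded


predicates = {"on": ["block", "block"], "clear": ["block"]}
objects = {"a": "block", "b": "block"}
print(ground_predicates(predicates, objects))
```

The number of groundings grows with the product of candidate objects per parameter, which is why the estimator batches predicate queries.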

The package integrates with Unified Planning / PDDL-based systems and supports HuggingFace VLMs, OpenAI-backed VLMs, and custom backends.

For a longer tutorial, see the tutorial notebook.

Features

  • Estimate boolean symbolic states or probabilistic predicate values from one or more images.
  • Parse PDDL domains and problems from strings or .pddl files.
  • Automatically ground predicates over the current problem objects.
  • Translate predicates with pluggable strategies: IdentityTranslator, TemplateTranslator, PrewrittenTranslator, and LLMTranslator.
  • Use HuggingFace VLMs, OpenAI VLMs, or custom implementations via the VLMBackend interface.
  • Support multi-image estimation, either in a single pass over all images or by averaging per-image results.
  • Expose normalized VLMOutput objects for prompt tuning and backend inspection.
  • Convert estimated states back into Unified Planning-compatible state objects.
  • Cache LLM-generated predicate translations for reuse across runs.
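To make the per-image averaging feature above concrete, here is a minimal sketch of how probabilities estimated separately for each image could be combined. The `average_probabilities` helper is hypothetical and only illustrates the idea; s3e's actual aggregation may differ.

```python
def average_probabilities(per_image_probs):
    """Average per-predicate probabilities across images.

    per_image_probs: list of dicts, one per image, each mapping a grounded
    predicate string to a probability. All dicts are assumed to share keys.
    """
    if not per_image_probs:
        raise ValueError("need probabilities for at least one image")
    n = len(per_image_probs)
    return {
        pred: sum(probs[pred] for probs in per_image_probs) / n
        for pred in per_image_probs[0]
    }


view_1 = {"on(a,b)": 0.8, "clear(a)": 0.9}
view_2 = {"on(a,b)": 0.4, "clear(a)": 0.7}
print(average_probabilities([view_1, view_2]))
```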

Installation

Prerequisites

  • Python >=3.10
  • pip
  • git if installing from source
  • For larger HuggingFace VLMs, a GPU-capable PyTorch environment is recommended

Install from source

git clone https://github.com/CLAIR-LAB-TECHNION/s3e.git
cd s3e
pip install -e .

You can also install directly from the GitHub repository without cloning:

pip install "git+https://github.com/CLAIR-LAB-TECHNION/s3e.git"

Optional dependencies

Install OpenAI support:

pip install -e '.[openai]'

Install development dependencies:

pip install -e '.[dev]'

Optional acceleration for supported HuggingFace models:

FlashAttention installation is platform- and hardware-dependent. If your chosen model and environment support it, follow the FlashAttention installation guide to set it up.

Quick Start / Usage

The example below uses a small HuggingFace model and template-based predicate translation.

from PIL import Image

from s3e import SemanticStateEstimator, TemplateTranslator

domain_pddl = """
(define (domain blocksworld)
  (:requirements :typing)
  (:types block)
  (:predicates
    (on ?x - block ?y - block)
    (clear ?x - block)
  )
)
"""

problem_pddl = """
(define (problem bw-2)
  (:domain blocksworld)
  (:objects a b - block)
  (:init (on a b) (clear a))
  (:goal (on b a))
)
"""

translator = TemplateTranslator(
    {
        "on": "Is the {0} block on top of the {1} block?",
        "clear": "Is the {0} block clear?",
    }
)

estimator = SemanticStateEstimator(
    domain_pddl,
    problem_pddl,
    vlm="HuggingFaceTB/SmolVLM-256M-Instruct",
    query_translator=translator,
    user_prompt_template="Answer yes or no only: {query}",
)

images = [Image.open("scene.png")]

state = estimator(images)
probabilities = estimator.estimate_probabilities(images)

print(state)
print(probabilities)

You can also inspect normalized backend outputs directly:

raw_outputs = estimator.estimate_raw(images)
print(raw_outputs["on(a,b)"])

To convert the boolean state back into a Unified Planning state object:

from s3e.pddl.up_utils import state_dict_to_up_state

up_state = state_dict_to_up_state(estimator.up_problem, state)

For OpenAI-backed models, install the optional dependency and use an OpenAI/-prefixed model ID, for example "OpenAI/gpt-4o".

API Reference / Configuration

Core estimator

SemanticStateEstimator(domain, problem, vlm, ...) is the main entry point.

Key arguments:

  • domain, problem: PDDL domain and problem, provided either as strings or file paths.
  • vlm: a VLMBackend instance or a model string. Strings prefixed with OpenAI/ select the OpenAI backend; all other strings select the HuggingFace backend.
  • query_translator: translation strategy used to convert grounded predicates into queries.
  • confidence: default threshold used when converting probabilities into booleans.
  • multi_image_strategy: either "single" or "average".
  • probability_method: either "logprobs" or "text_match".
  • true_tokens, false_tokens: optional token groups used for probability extraction.
  • batch_size: number of predicate queries grouped into each backend batch.
  • user_prompt_template: format string for each translated query; must contain {query}.
  • additional_instructions: additional text appended to the system prompt.
  • vlm_kwargs: keyword arguments forwarded when vlm is provided as a model string.
  • inference_kwargs: per-query inference arguments forwarded to backend query/query_batch calls.
    • For OpenAI models, these are request arguments for chat.completions.create (for example temperature, max_completion_tokens).
    • For HuggingFace models, these are forwarded to model(...) in logprobs mode and model.generate(...) in generation mode.
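One plausible reading of the "logprobs" probability method with true_tokens and false_tokens is sketched below: sum the probability mass of the "true" tokens and normalize it against the combined mass of both groups. This is an assumption about the general technique, not a description of s3e's internals; the `yes_probability` helper, its default token groups, and the 0.5 fallback are hypothetical.

```python
import math


def yes_probability(token_logprobs,
                    true_tokens=("yes", "Yes"),
                    false_tokens=("no", "No")):
    """Normalized probability of a 'true' answer.

    token_logprobs: dict mapping candidate next tokens to log-probabilities.
    Tokens missing from the dict contribute no probability mass.
    """
    p_true = sum(math.exp(token_logprobs[t]) for t in true_tokens if t in token_logprobs)
    p_false = sum(math.exp(token_logprobs[t]) for t in false_tokens if t in token_logprobs)
    total = p_true + p_false
    if total == 0.0:
        # Assumption: with no mass on either group, report maximal uncertainty.
        return 0.5
    return p_true / total


logprobs = {"yes": math.log(0.6), "no": math.log(0.2), "maybe": math.log(0.2)}
print(yes_probability(logprobs))
```

Normalizing over the two token groups ignores mass assigned to unrelated tokens, so the result is a relative preference between "true" and "false" rather than the raw token probability.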

vlm_kwargs and inference_kwargs are intentionally different:

  • vlm_kwargs configure backend/client construction.
    • OpenAI backend: forwarded to openai.OpenAI(...) (for example api_key, base_url, timeout).
    • HuggingFace backend: forwarded to backend/model construction (for example device_map, torch_dtype, attn_implementation).
  • inference_kwargs configure runtime inference and are forwarded on every query.

Example:

estimator = SemanticStateEstimator(
    domain_pddl,
    problem_pddl,
    vlm="OpenAI/gpt-4o",
    vlm_kwargs={"api_key": "..."},
    inference_kwargs={"temperature": 0.2, "max_completion_tokens": 200},
)

For HuggingFace generation mode (probability_method="text_match"), s3e applies a deterministic default (do_sample=False) unless overridden via inference_kwargs. No default generation cap is imposed; set max_new_tokens in inference_kwargs if you want an explicit cap.
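For intuition about the "text_match" method, the sketch below maps generated text to a probability by checking its first word against token groups. This is only an illustration under stated assumptions: the `text_match_probability` helper, the token groups, and the 0.5 value for unparseable answers are hypothetical and may not match s3e's behavior.

```python
import re


def text_match_probability(text,
                           true_tokens=("yes", "true"),
                           false_tokens=("no", "false")):
    """Map a generated answer to 1.0, 0.0, or 0.5 based on its first word."""
    words = re.findall(r"[a-z]+", text.lower())
    first = words[0] if words else ""
    if first in true_tokens:
        return 1.0
    if first in false_tokens:
        return 0.0
    # Assumption: answers that match neither group are treated as uncertain.
    return 0.5


print(text_match_probability("Yes."))
print(text_match_probability(" no, the block is covered"))
```

Because only the leading word is inspected in this sketch, a prompt template such as "Answer yes or no only: {query}" keeps the output easy to match.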

Common methods:

  • estimator(images) -> dict[str, bool]: return a boolean symbolic state.
  • estimate_probabilities(images) -> dict[str, float]: return per-predicate probabilities.
  • estimate_raw(images) -> dict[str, VLMOutput]: return normalized backend outputs.
  • swap_problem(domain, problem): rebuild the estimator for a new planning problem.
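The relationship between the probability output and the boolean state can be pictured as thresholding with the confidence value. The sketch below is hypothetical: the `to_boolean_state` helper, the 0.5 default, and the choice of `>=` over `>` are assumptions, not confirmed s3e behavior.

```python
def to_boolean_state(probabilities, confidence=0.5):
    """Convert per-predicate probabilities into a boolean state.

    A predicate is considered true when its probability reaches the
    threshold (assumption: the comparison is inclusive).
    """
    return {pred: p >= confidence for pred, p in probabilities.items()}


probabilities = {"on(a,b)": 0.91, "on(b,a)": 0.07, "clear(a)": 0.55}
print(to_boolean_state(probabilities, confidence=0.6))
```

Raising the threshold makes the estimator more conservative about asserting that a predicate holds, which may matter when a planner acts on the resulting state.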

Translators

  • IdentityTranslator: use grounded predicates as-is.
  • TemplateTranslator: format grounded predicates with per-predicate templates.
  • PrewrittenTranslator: provide explicit prompts for each grounded predicate.
  • LLMTranslator: generate natural-language prompts with an LLM and optionally cache them.
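Template-based translation can be understood as plain positional string formatting over a grounded predicate's arguments. The sketch below reproduces that idea with a hypothetical `translate` function operating on strings like "on(a,b)"; it does not use or mirror the TemplateTranslator class's actual internals.

```python
def translate(grounded, templates):
    """Format a grounded predicate such as 'on(a,b)' with its template.

    templates: dict mapping predicate name -> format string using positional
    placeholders ({0}, {1}, ...), as in the Quick Start example.
    """
    name, _, rest = grounded.partition("(")
    # Strip the closing parenthesis and split the argument list.
    args = [arg.strip() for arg in rest.rstrip(")").split(",") if arg.strip()]
    return templates[name].format(*args)


templates = {
    "on": "Is the {0} block on top of the {1} block?",
    "clear": "Is the {0} block clear?",
}
print(translate("on(a,b)", templates))
print(translate("clear(a)", templates))
```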

Environment variables and optional configuration

  • OPENAI_API_KEY: required for OpenAIVLM and OpenAI-backed LLMTranslator usage.
  • cache_dir on LLMTranslator: enables on-disk caching of generated predicate translations.
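The benefit of on-disk caching is that an expensive LLM call is made only once per grounded predicate across runs. The sketch below shows one simple way such a cache could work, using a JSON file; the `cached_translate` helper, the `translations.json` file name, and the storage format are all hypothetical and are not taken from s3e.

```python
import json
import tempfile
from pathlib import Path


def cached_translate(predicate, translate_fn, cache_dir):
    """Return a cached translation, calling translate_fn only on a miss."""
    cache_file = Path(cache_dir) / "translations.json"  # hypothetical file name
    cache = json.loads(cache_file.read_text()) if cache_file.exists() else {}
    if predicate not in cache:
        cache[predicate] = translate_fn(predicate)
        cache_file.parent.mkdir(parents=True, exist_ok=True)
        cache_file.write_text(json.dumps(cache, indent=2))
    return cache[predicate]


calls = []


def fake_llm(predicate):
    """Stand-in for an LLM call; records each invocation."""
    calls.append(predicate)
    return f"Does {predicate} hold in the image?"


with tempfile.TemporaryDirectory() as tmp:
    first = cached_translate("on(a,b)", fake_llm, tmp)
    second = cached_translate("on(a,b)", fake_llm, tmp)  # served from disk
    print(first == second, len(calls))
```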

Contributing

Install development dependencies:

pip install -e '.[dev]'

Run the fast test loop:

pytest -m "not slow"

Run the full test suite:

pytest

To contribute:

  1. Fork the repository and create a feature branch.
  2. Add or update tests for behavioral changes.
  3. Run the relevant test commands before submitting.
  4. Open a pull request with a concise description of the change and its motivation.

License

This project is licensed under the MIT License. See LICENSE for details.

Citation

@inproceedings{azranS3ESemanticSymbolic2025,
  title = {{{S3E}}: {{Semantic Symbolic State Estimation With Vision-Language Foundation Models}}},
  shorttitle = {{{S3E}}},
  booktitle = {{{AAAI}} 2025 {{Workshop LM4Plan}}},
  author = {Azran, Guy and Goshen, Yuval and Yuan, Kai and Keren, Sarah},
  year = 2025,
}
