Workflow of reproducible multimodal inference for urban environment evaluation.

Urban-WORM

Introduction

Urban-WORM (Workflow Of Reproducible Multimodal Inference) is a high-level interface for building geo-referenced urban datasets with model-generated ground-truth labels. It covers the full pipeline: collecting crowdsourced street views, photos, and sounds near building footprints, running batched vision-language model (VLM) inference, and exporting labeled metadata in an organized layout.

Features

Data collection

  • Collect geotagged street views (Mapillary/Google), photos (Flickr), and audio (Freesound/Radio Aporee) within the proximity of building footprints or other POIs
  • Calibrate panorama orientation to face a given location; auto-compute field-of-view from building footprints
  • Filter personal photos with face detection; slice audio recordings into fixed-duration clips
  • Crash-safe checkpointing — pass checkpoint_path to any collection method; already-fetched locations are skipped on resume, so a failed run never starts from zero
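
The orientation and field-of-view bullet above can be pictured geometrically. The sketch below is not Urban-WORM's implementation; it only illustrates one plausible way to derive a camera heading and a field of view from a camera position and a building footprint. The function names, the `padding_deg` parameter, and the (lon, lat) tuple layout are assumptions made for this example.

```python
import math


def bearing_deg(lon1, lat1, lon2, lat2):
    """Initial great-circle bearing from point 1 to point 2, in degrees clockwise from north."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlon = math.radians(lon2 - lon1)
    y = math.sin(dlon) * math.cos(phi2)
    x = math.cos(phi1) * math.sin(phi2) - math.sin(phi1) * math.cos(phi2) * math.cos(dlon)
    return math.degrees(math.atan2(y, x)) % 360.0


def heading_and_fov(camera, footprint, padding_deg=10.0):
    """Return (heading, fov) in degrees so that a view centred on heading spans the footprint.

    camera is a (lon, lat) tuple and footprint is a list of (lon, lat) vertices.
    Assumes the footprint subtends less than 180 degrees as seen from the camera.
    """
    bearings = [bearing_deg(camera[0], camera[1], lon, lat) for lon, lat in footprint]
    # Express every bearing relative to the first one, wrapped to [-180, 180),
    # so a footprint that straddles north (e.g. 355 and 5 degrees) is handled.
    ref = bearings[0]
    offsets = [((b - ref + 180.0) % 360.0) - 180.0 for b in bearings]
    lo, hi = min(offsets), max(offsets)
    heading = (ref + (lo + hi) / 2.0) % 360.0
    fov = (hi - lo) + padding_deg  # hypothetical margin around the building
    return heading, fov
```

A panorama could then be cropped or re-projected around `heading` with a horizontal extent of `fov` degrees.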

Inference / ground-truth labeling

  • Define a structured output schema once; all backends share the same one_inference / batch_inference interface
  • Unsloth (recommended) — GPU-accelerated local VLM with optional GPU batching; 2–4× faster than Ollama; automatically spreads the model across all visible GPUs when more than one is present, with OOM-safe chunk retry so failed batches fall back to item-by-item instead of producing silent stub outputs
  • Ollama — lightweight local inference, no GPU required
  • llama.cpp — highly customizable sampling; supports audio input
  • Cloud APIs — Claude (Anthropic), GPT-4o (OpenAI), Gemini (Google) via InferenceAPI
  • Crash-safe checkpointing on all batch_inference methods — resume mid-run without reprocessing completed images
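
The OOM-safe chunk retry mentioned in the Unsloth bullet can be sketched as follows. This is an illustration under stated assumptions, not the package's code: `run_in_chunks`, `infer_batch`, and `clear_cache` are hypothetical names, and the built-in `MemoryError` stands in for a GPU out-of-memory exception so the example runs without a GPU.

```python
def run_in_chunks(items, infer_batch, batch_size, clear_cache=lambda: None):
    """Run infer_batch over items in chunks; on an out-of-memory error,
    retry that chunk one item at a time.

    Returns (results, failed), where failed lists the items that still
    ran out of memory when processed alone.
    """
    results, failed = [], []
    for start in range(0, len(items), batch_size):
        chunk = list(items[start:start + batch_size])
        try:
            results.extend(infer_batch(chunk))
        except MemoryError:  # stand-in for a CUDA out-of-memory error
            clear_cache()
            for item in chunk:
                try:
                    results.extend(infer_batch([item]))
                except MemoryError:
                    clear_cache()
                    failed.append(item)  # recorded explicitly, never a silent stub
    return results, failed
```

With PyTorch, the analogous pieces would be catching `torch.cuda.OutOfMemoryError` and passing `torch.cuda.empty_cache` as `clear_cache`. Returning the failed items separately keeps them visible for review.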

Note: models can make mistakes and results still need to be reviewed and used carefully.

Export

  • GeoTaggedData.export() — one call produces a metadata.csv paired with an organized images/ or audio/ folder, with optional label columns merged in

Installation

Step 1 — Core package

pip install urban-worm

Step 2 — Choose your inference backend

Unsloth is the recommended backend for local inference (GPU-accelerated, fastest).

Unsloth — recommended (GPU required)

Install a GPU-specific torch build before the unsloth extra; otherwise pip falls back to a slow CPU-only build:

# CUDA (most modern NVIDIA GPUs):
pip install torch --index-url https://download.pytorch.org/whl/cu124

# macOS Apple Silicon (MPS):
pip install torch          # MPS is enabled by default on macOS

Then install the extra:

pip install "urban-worm[unsloth]"
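
Because the install order matters, it may help to confirm which accelerator the installed torch build can see. The check below is a minimal sketch: `detect_accelerator` is a hypothetical helper, not part of Urban-WORM, and it returns a plain string instead of raising when torch is missing.

```python
import importlib.util


def detect_accelerator() -> str:
    """Report which accelerator the installed torch build can use."""
    if importlib.util.find_spec("torch") is None:
        return "torch-missing"
    import torch

    if torch.cuda.is_available():
        return "cuda"
    # Older torch builds may lack the MPS backend, hence the getattr guard.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


print(detect_accelerator())
```

A result of `cpu` on a machine with an NVIDIA GPU suggests that a CPU-only torch wheel was installed.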

Tested checkpoints: unsloth/Qwen3-VL-3B-Instruct, unsloth/Qwen3-VL-8B-Instruct, unsloth/gemma-3-4b-it, unsloth/Qwen2-VL-2B-Instruct, unsloth/Qwen2.5-VL-7B-Instruct-bnb-4bit. Any vision model that unsloth.FastVisionModel can load should work.

Ollama — lightweight local inference (no GPU required)

Install the Ollama application first:

# Linux
curl -fsSL https://ollama.com/install.sh | sh

# macOS
brew install ollama

# Windows — download the installer from https://ollama.com/

Then install the Python client:

pip install "urban-worm[ollama]"

llama.cpp — CLI-based local inference

The llama-mtmd-cli binary must be installed separately:

# macOS / Linux
brew install llama.cpp

# Windows
winget install llama.cpp

More options: llama.cpp install guide. GGUF model collections: ggml-org multimodal GGUFs.

The Python binding is installed via the extra:

# CPU build (no compile flags needed):
pip install "urban-worm[llamacpp]"

# CUDA build:
CMAKE_ARGS="-DGGML_CUDA=on" pip install "urban-worm[llamacpp]"

# Metal build (macOS):
CMAKE_ARGS="-DGGML_METAL=on" pip install "urban-worm[llamacpp]"

Cloud APIs (Claude / GPT-4o / Gemini)

pip install "urban-worm[api]"

Audio support (optional)

Only needed if you use get_sound_from_location():

pip install "urban-worm[audio]"

All extras at once

Note: GPU torch must still be pre-installed before running pip install "urban-worm[all]". See the Unsloth section above.

pip install "urban-worm[all]"          # all backends + API providers (no audio)
pip install "urban-worm[all,audio]"    # + audio slicing

Dev install from source

pip install -e git+https://github.com/billbillbilly/urbanworm.git#egg=urban-worm
pip install "urban-worm[dev]"

Usage

Collect street views with crash-safe checkpointing

from urbanworm import GeoTaggedData

gtd = GeoTaggedData()
gtd.getBuildings(bbox=(-83.208, 42.374, -83.206, 42.375), source='osm')

# Step 1 — fetch metadata from Mapillary (resumes from svi.jsonl if interrupted)
gtd.get_svi_from_locations(
    key="YOUR_MAPILLARY_KEY",
    distance=30,
    reoriented=True,
    checkpoint_path="run/svi.jsonl",
)

# Step 2 — download images to disk (resume-safe: existing files are never overwritten)
gtd.download_to_dir(data='svi', to_dir='run/images')
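
The resume behaviour of `checkpoint_path` can be pictured as an append-only JSONL log. The sketch below shows that general pattern only. The record layout (`id`, `result`) and the helper names are assumptions, and Urban-WORM's actual checkpoint format may differ.

```python
import json
from pathlib import Path


def load_done(checkpoint_path):
    """Read completed records from a JSONL checkpoint, keyed by id."""
    done = {}
    path = Path(checkpoint_path)
    if not path.exists():
        return done
    with path.open("r", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                continue  # a crash mid-write can leave a truncated line
            done[record["id"]] = record
    return done


def _ends_with_newline(path):
    """True if the file is missing, empty, or already ends with a newline."""
    if not path.exists() or path.stat().st_size == 0:
        return True
    with path.open("rb") as fh:
        fh.seek(-1, 2)
        return fh.read(1) == b"\n"


def run_with_checkpoint(ids, fetch, checkpoint_path):
    """Call fetch(id) for every id not yet in the checkpoint, logging each result."""
    path = Path(checkpoint_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    done = load_done(path)
    needs_newline = not _ends_with_newline(path)
    with path.open("a", encoding="utf-8") as fh:
        if needs_newline:
            fh.write("\n")  # keep a truncated tail from merging with new records
        for item_id in ids:
            if item_id in done:
                continue  # already fetched in an earlier run
            record = {"id": item_id, "result": fetch(item_id)}
            fh.write(json.dumps(record) + "\n")
            fh.flush()  # persist each record before moving on
            done[item_id] = record
    return [done[i] for i in ids]
```

Writing one line per completed item means an interrupted run loses at most the item that was in flight.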

Inference with a local VLM (Unsloth — recommended)

from urbanworm import InferenceUnsloth
from typing import Literal

schema = {
    "occupancy": (Literal["occupied", "unoccupied", "uncertain"], ...),
    "visual_evidence": (str, ...),
}

infer = InferenceUnsloth(
    llm="unsloth/Qwen3-VL-3B-Instruct",
    load_in_4bit=True,
    geo_tagged_data=gtd,
    schema=schema,
    # device and max_memory are optional — defaults shown below:
    # device=None        → auto: "auto" when multiple GPUs are detected,
    #                      "cuda:0" for a single GPU, "cpu" otherwise
    # max_memory=None    → auto: 90 % of each GPU's total VRAM, e.g.
    #                      {0: "10GiB", 1: "10GiB"} for two 12 GB GPUs
)

df = infer.batch_inference(
    system="You are an urban researcher assessing housing conditions.",
    prompt="Is this house occupied or vacant? Describe the visual evidence.",
    batch_size=4,             # batch > 1 trades VRAM for throughput
    max_new_tokens=256,
    checkpoint_path="run/labels.jsonl",   # resume-safe
)

Multi-GPU note — when multiple CUDA GPUs are present, InferenceUnsloth automatically sets device_map="auto" and splits the model layers across all of them. You can override the per-GPU memory budget with max_memory, for example max_memory={0: "10GiB", 1: "10GiB"} to leave 2 GB headroom on each of two 12 GB cards. If a batch triggers an out-of-memory error at runtime, the failed chunk is automatically retried one item at a time after clearing the CUDA cache, so you lose at most one image rather than the entire batch.
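
The defaults described above can be written out as simple arithmetic. The sketch below restates the documented rules (device choice by GPU count, 90 % of each GPU's VRAM) in standalone form. The function names and the round-down-to-whole-GiB step are assumptions, not the library's code.

```python
GIB = 1024 ** 3


def default_device(gpu_count: int) -> str:
    """Mirror the documented rule: "auto" for several GPUs, "cuda:0" for one, else "cpu"."""
    if gpu_count > 1:
        return "auto"
    if gpu_count == 1:
        return "cuda:0"
    return "cpu"


def default_max_memory(total_bytes_per_gpu, fraction=0.9):
    """Budget a fraction of each GPU's total memory, rounded down to whole GiB (assumed)."""
    return {
        idx: f"{int(total * fraction // GIB)}GiB"
        for idx, total in enumerate(total_bytes_per_gpu)
    }


# Two 12 GiB cards: 0.9 * 12 = 10.8, rounded down to 10 GiB each.
print(default_device(2), default_max_memory([12 * GIB, 12 * GIB]))
```

With PyTorch, the per-GPU totals could be read from `torch.cuda.get_device_properties(i).total_memory` for each `i` in `range(torch.cuda.device_count())`.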

Inference with a cloud API

from urbanworm import InferenceAPI

infer = InferenceAPI(
    llm="claude-sonnet-4-5",   # or "gpt-4o", "gemini-2.0-flash"
    provider="anthropic",       # or "openai", "google"
    api_key="YOUR_API_KEY",
    geo_tagged_data=gtd,
    schema=schema,
)

df = infer.batch_inference(
    system="You are an urban researcher assessing housing conditions.",
    prompt="Is this house occupied or vacant? Describe the visual evidence.",
    checkpoint_path="run/labels_claude.jsonl",
)
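
Both backends above receive the same `schema`, so a model reply can be checked against it regardless of where it came from. The `(type, ...)` tuples resemble the field definitions accepted by `pydantic.create_model`, but whether Urban-WORM uses pydantic internally is an assumption. The sketch below uses only the standard library, and `validate_reply` is a hypothetical helper.

```python
import json
from typing import Literal, get_args, get_origin

schema = {
    "occupancy": (Literal["occupied", "unoccupied", "uncertain"], ...),
    "visual_evidence": (str, ...),
}


def validate_reply(raw_json, schema):
    """Parse a JSON reply and check every field against the schema.

    Raises ValueError on malformed JSON, a missing required field,
    or a value of the wrong type.
    """
    data = json.loads(raw_json)  # json.JSONDecodeError is a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("reply must be a JSON object")
    checked = {}
    for name, (expected, default) in schema.items():
        if name not in data:
            if default is ...:  # Ellipsis marks the field as required
                raise ValueError(f"missing required field: {name}")
            checked[name] = default
            continue
        value = data[name]
        if get_origin(expected) is Literal:
            if value not in get_args(expected):
                raise ValueError(f"{name}: {value!r} is not an allowed value")
        elif not isinstance(value, expected):
            raise ValueError(f"{name}: expected {expected.__name__}")
        checked[name] = value
    return checked
```

Rejecting replies that fall outside the `Literal` choices is one way to catch cases where a model answers "vacant" although the schema expects "unoccupied".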

Export to an organized dataset

# Produces dataset/metadata.csv + dataset/images/
csv_path = gtd.export(output_dir="dataset", data="svi", labels=df)
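
To make the resulting layout concrete, here is a standard-library sketch of what an export of this kind involves: copying files into an images/ folder and writing a metadata.csv with label columns merged in. The column names (`id`, `lon`, `lat`, `file`) and the `export_dataset` helper are hypothetical. The real `export()` may use different columns and logic.

```python
import csv
import shutil
from pathlib import Path


def export_dataset(records, labels, output_dir):
    """Copy each record's file into output_dir/images and write output_dir/metadata.csv.

    records: list of dicts with "id", "lon", "lat", "path" (assumed layout).
    labels:  dict mapping record id to a dict of label columns.
    """
    out = Path(output_dir)
    img_dir = out / "images"
    img_dir.mkdir(parents=True, exist_ok=True)
    # Collect every label column so the CSV header covers all rows.
    label_cols = sorted({key for row in labels.values() for key in row})
    csv_path = out / "metadata.csv"
    with csv_path.open("w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=["id", "lon", "lat", "file"] + label_cols)
        writer.writeheader()
        for rec in records:
            src = Path(rec["path"])
            shutil.copy2(src, img_dir / src.name)
            row = {"id": rec["id"], "lon": rec["lon"], "lat": rec["lat"],
                   "file": f"images/{src.name}"}
            row.update(labels.get(rec["id"], {}))  # unlabeled rows keep empty cells
            writer.writerow(row)
    return csv_path
```

Storing the image path relative to the dataset folder keeps the export portable when the folder is moved or shared.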

More examples: docs/1_basic_inference.ipynb, docs/3_ground_truth_labeling.ipynb.

To do

v0.1.x:

  • A module for collecting social media data (Flickr and Freesound)
  • A method for running inference on sound recordings

v0.2.x:

  • Crash-safe checkpointing for collection and inference
  • Cloud API inference backend (Claude / GPT-4o / Gemini)
  • export() — organized dataset export with metadata CSV
  • Full ground-truth labeling tutorial notebook
  • A web UI providing interactive operation and data visualization

Legal Notice

This repository and its content are provided for educational and research purposes only. By using the information and code provided, users acknowledge that they are using the APIs and models at their own risk and agree to comply with any applicable laws and regulations.

Acknowledgements

The inference backends are built on:

The GIS data sourcing, image processing, and data collection functionality is built on:

The development of this package is supported and inspired by the city of Detroit.

