Workflow of reproducible multimodal inference for urban environment evaluation.


Urban-WORM

Introduction

Urban-WORM (Workflow Of Reproducible Multimodal Inference) is a user-friendly high-level interface designed for building geo-referenced urban datasets with model-generated ground-truth labels. It covers the full pipeline — from collecting crowdsourced street views, photos, and sounds near building footprints, through batched VLM inference, to an organized export of labeled metadata.

Features

Data collection

  • Collect geotagged street views (Mapillary/Google), photos (Flickr), and audio (Freesound/Radio Aporee) near building footprints or other points of interest (POIs)
  • Calibrate panorama orientation to face a given location; auto-compute field-of-view from building footprints
  • Filter personal photos with face detection; slice audio recordings into fixed-duration clips
  • Crash-safe checkpointing — pass checkpoint_path to any collection method; already-fetched locations are skipped on resume, so a failed run never starts from zero
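The two calibration steps come down to a little bearing arithmetic. The heading that points a panorama at a building is the initial great-circle bearing from the camera to the target, and a field of view wide enough to frame the building follows from the angular spread of its footprint vertices as seen from the camera. The sketch below assumes footprints are lists of (lon, lat) vertices; the function names, the vertex-mean stand-in for the centroid, and the padding_deg / max_fov defaults are illustrative choices, not urban-worm's API.

```python
import math
from typing import List, Tuple

LonLat = Tuple[float, float]

def bearing_deg(lon1: float, lat1: float, lon2: float, lat2: float) -> float:
    """Initial great-circle bearing from point 1 to point 2, in degrees clockwise from north."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlon = math.radians(lon2 - lon1)
    x = math.sin(dlon) * math.cos(phi2)
    y = math.cos(phi1) * math.sin(phi2) - math.sin(phi1) * math.cos(phi2) * math.cos(dlon)
    return math.degrees(math.atan2(x, y)) % 360.0

def fov_from_footprint(cam_lon: float, cam_lat: float, footprint: List[LonLat],
                       padding_deg: float = 10.0, max_fov: float = 120.0) -> Tuple[float, float]:
    """Return (heading, fov) in degrees so every footprint vertex falls inside the view.

    Hypothetical helper: padding_deg and max_fov are illustrative defaults.
    """
    # Aim first at the vertex mean (a cheap stand-in for the polygon centroid).
    mean_lon = sum(p[0] for p in footprint) / len(footprint)
    mean_lat = sum(p[1] for p in footprint) / len(footprint)
    center = bearing_deg(cam_lon, cam_lat, mean_lon, mean_lat)
    # Signed angle of each vertex relative to that aim, wrapped into [-180, 180).
    offsets = [
        (bearing_deg(cam_lon, cam_lat, lon, lat) - center + 180.0) % 360.0 - 180.0
        for lon, lat in footprint
    ]
    lo, hi = min(offsets), max(offsets)
    heading = (center + (lo + hi) / 2.0) % 360.0  # re-center on the angular extent
    fov = min(hi - lo + padding_deg, max_fov)     # pad, then cap at a usable width
    return heading, fov
```

Working in signed offsets around a provisional aim avoids the wrap-around problem when a building straddles due north (bearings near both 359° and 1°).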

Inference / ground-truth labeling

  • Define a structured output schema once; all backends share the same one_inference / batch_inference interface
  • Unsloth (recommended) — GPU-accelerated local VLM inference with optional GPU batching, 2–4× faster than Ollama. When more than one GPU is visible, the model is spread across all of them automatically; an out-of-memory error triggers a chunk retry that falls back to item-by-item processing instead of producing silent stub outputs
  • Ollama — lightweight local inference, no GPU required
  • llama.cpp — highly customizable sampling; supports audio input
  • Cloud APIs — Claude (Anthropic), GPT-4o (OpenAI), Gemini (Google) via InferenceAPI
  • Crash-safe checkpointing on all batch_inference methods — resume mid-run without reprocessing completed images
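Resume-safe checkpointing can be pictured as an append-only JSONL log: each finished item is written as one JSON line and flushed, and a restarted run reads the log back and skips every key it already contains. A minimal sketch of that pattern follows; the helper names and the "id" key are hypothetical, and urban-worm's actual checkpoint records may carry different fields.

```python
import json
import os

def load_done_keys(checkpoint_path, key="id"):
    """Return the set of keys already recorded in a JSONL checkpoint (empty if none)."""
    done = set()
    if not os.path.exists(checkpoint_path):
        return done
    with open(checkpoint_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                done.add(json.loads(line)[key])
            except (json.JSONDecodeError, KeyError, TypeError):
                continue  # e.g. a half-written last line left by a crash
    return done

def _ensure_trailing_newline(path):
    """Terminate a truncated last line so new records start on a fresh line."""
    if os.path.exists(path) and os.path.getsize(path) > 0:
        with open(path, "rb") as f:
            f.seek(-1, os.SEEK_END)
            last = f.read(1)
        if last != b"\n":
            with open(path, "ab") as f:
                f.write(b"\n")

def run_with_checkpoint(items, process, checkpoint_path, key="id"):
    """Call process(item) for every item not yet in the checkpoint; return how many ran."""
    done = load_done_keys(checkpoint_path, key)
    os.makedirs(os.path.dirname(checkpoint_path) or ".", exist_ok=True)
    _ensure_trailing_newline(checkpoint_path)
    ran = 0
    with open(checkpoint_path, "a", encoding="utf-8") as f:
        for item in items:
            if item[key] in done:
                continue  # finished in an earlier run
            record = {key: item[key], **process(item)}
            f.write(json.dumps(record) + "\n")
            f.flush()  # hand the line to the OS so a process crash cannot lose it
            done.add(item[key])
            ran += 1
    return ran
```

Appending one line per item means a crash costs at most the item in flight, and the log doubles as a record of what has already been labeled.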

Note: models can make mistakes and results still need to be reviewed and used carefully.

Export

  • GeoTaggedData.export() — one call produces a metadata.csv paired with an organized images/ or audio/ folder, with optional label columns merged in

Installation

Step 1 — Core package

pip install urban-worm

Step 2 — Choose your inference backend

Unsloth is the recommended backend for local inference (GPU-accelerated, fastest).

Unsloth — recommended (GPU required)

A GPU-enabled torch build must be installed before the unsloth extra; otherwise pip may resolve a slow CPU-only build:

# CUDA (most modern NVIDIA GPUs):
pip install torch --index-url https://download.pytorch.org/whl/cu124

# macOS Apple Silicon (MPS):
pip install torch          # MPS is enabled by default on macOS

Then install the extra:

pip install "urban-worm[unsloth]"

Tested checkpoints: unsloth/Qwen3-VL-3B-Instruct, unsloth/Qwen3-VL-8B-Instruct, unsloth/gemma-3-4b-it, unsloth/Qwen2-VL-2B-Instruct, unsloth/Qwen2.5-VL-7B-Instruct-bnb-4bit. Any vision model that unsloth.FastVisionModel can load should work.
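Because a CPU-only torch wheel installs without complaint, it is worth confirming which accelerator torch actually sees before loading a model. This check uses only standard torch calls (torch.cuda.is_available, torch.cuda.device_count, torch.backends.mps.is_available); the wrapper function itself is a hypothetical helper, not part of urban-worm.

```python
def torch_accelerator():
    """Report which accelerator the installed torch build can use (hypothetical helper)."""
    try:
        import torch
    except ImportError:
        return "torch not installed"
    if torch.cuda.is_available():
        # More than one device here is what triggers multi-GPU model splitting.
        return f"cuda ({torch.cuda.device_count()} device(s))"
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu only"

print(torch_accelerator())
```

A "cpu only" result on a machine with an NVIDIA card usually means the wrong wheel was installed.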

Ollama — lightweight local inference (no GPU required)

Install the Ollama application first:

# Linux
curl -fsSL https://ollama.com/install.sh | sh

# macOS
brew install ollama

# Windows — download the installer from https://ollama.com/

Then install the Python client:

pip install "urban-worm[ollama]"

llama.cpp — CLI-based local inference

The llama-mtmd-cli binary must be installed separately:

# macOS / Linux
brew install llama.cpp

# Windows
winget install llama.cpp

More options: llama.cpp install guide. GGUF model collections: ggml-org multimodal GGUFs.

The Python binding is installed via the extra:

# CPU build (no compile flags needed):
pip install "urban-worm[llamacpp]"

# CUDA build:
CMAKE_ARGS="-DGGML_CUDA=on" pip install "urban-worm[llamacpp]"

# Metal build (macOS):
CMAKE_ARGS="-DGGML_METAL=on" pip install "urban-worm[llamacpp]"
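To confirm the binary works on its own before going through urban-worm, it can be called directly. The -m, --mmproj, --image and -p options are llama-mtmd-cli's model, multimodal projector, image and prompt flags; the file names are placeholders, and this is not necessarily how urban-worm invokes the CLI internally.

```python
import shutil
import subprocess

def build_mtmd_command(model, mmproj, image, prompt, binary="llama-mtmd-cli"):
    """Assemble a llama-mtmd-cli call that asks one question about one image."""
    return [binary, "-m", model, "--mmproj", mmproj, "--image", image, "-p", prompt]

def run_mtmd(cmd):
    """Run the command if the binary is on PATH; return its stdout, or None if missing."""
    if shutil.which(cmd[0]) is None:
        return None  # llama.cpp not installed or not on PATH
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout

# Placeholder GGUF files: substitute a model and its matching mmproj file.
cmd = build_mtmd_command("model.gguf", "mmproj.gguf", "house.jpg", "Describe the building.")
```

Passing the arguments as a list (rather than one shell string) keeps prompts with spaces or quotes intact.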

Cloud APIs (Claude / GPT-4o / Gemini)

pip install "urban-worm[api]"

Audio support (optional)

Only needed if you use get_sound_from_location():

pip install "urban-worm[audio]"
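Slicing recordings into fixed-duration clips comes down to stepping through the sample array in windows of clip_seconds × sample_rate samples. The sketch below computes those windows; whether urban-worm drops, keeps, or pads a short final clip is not stated here, so drop_last is a hypothetical switch.

```python
def clip_bounds(n_samples, sample_rate, clip_seconds, drop_last=True):
    """Return (start, end) sample indices of consecutive fixed-length clips."""
    clip_len = int(round(clip_seconds * sample_rate))
    if clip_len <= 0:
        raise ValueError("clip_seconds * sample_rate must be at least one sample")
    bounds = []
    for start in range(0, n_samples, clip_len):
        end = min(start + clip_len, n_samples)
        if end - start < clip_len and drop_last:
            break  # discard the short tail instead of emitting an uneven clip
        bounds.append((start, end))
    return bounds

# A 25 s recording at 16 kHz cut into 10 s clips yields two full clips.
print(clip_bounds(25 * 16000, 16000, 10))
```

With a 1-D sample array, clip i is then simply samples[start:end] for the i-th pair.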

All extras at once

Note: a GPU-enabled torch build must still be installed before running pip install "urban-worm[all]". See the Unsloth section above.

pip install "urban-worm[all]"          # all backends + API providers (no audio)
pip install "urban-worm[all,audio]"    # + audio slicing

Dev install from source

pip install -e git+https://github.com/billbillbilly/urbanworm.git#egg=urban-worm
pip install "urban-worm[dev]"

Usage

Collect street views with crash-safe checkpointing

from urbanworm import GeoTaggedData

gtd = GeoTaggedData()
gtd.getBuildings(bbox=(-83.208, 42.374, -83.206, 42.375), source='osm')

# Step 1 — fetch metadata from Mapillary (resumes from svi.jsonl if interrupted)
gtd.get_svi_from_locations(
    key="YOUR_MAPILLARY_KEY",
    distance=30,
    reoriented=True,
    checkpoint_path="run/svi.jsonl",
)

# Step 2 — download images to disk (resume-safe: existing files are never overwritten)
gtd.download_to_dir(data='svi', to_dir='run/images')
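The distance argument limits collection to imagery captured near each building. Assuming it is a radius in meters (the unit is not stated above), the filter amounts to a great-circle distance test like the haversine sketch below; within_radius and the (id, lon, lat) tuple layout are hypothetical.

```python
import math

EARTH_RADIUS_M = 6_371_008.8  # mean Earth radius in meters

def haversine_m(lon1, lat1, lon2, lat2):
    """Great-circle distance between two lon/lat points, in meters."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def within_radius(target, candidates, distance_m=30.0):
    """Keep (id, lon, lat) candidates within distance_m of target (lon, lat), nearest first."""
    hits = []
    for cid, lon, lat in candidates:
        d = haversine_m(target[0], target[1], lon, lat)
        if d <= distance_m:
            hits.append((d, cid))
    return [cid for _, cid in sorted(hits)]
```

Sorting by distance makes it easy to keep only the closest image per building when several fall inside the radius.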

Inference with a local VLM (Unsloth — recommended)

from urbanworm import InferenceUnsloth
from typing import Literal

schema = {
    "occupancy": (Literal["occupied", "unoccupied", "uncertain"], ...),
    "visual_evidence": (str, ...),
}

infer = InferenceUnsloth(
    llm="unsloth/Qwen3-VL-3B-Instruct",
    load_in_4bit=True,
    geo_tagged_data=gtd,
    schema=schema,
    # device and max_memory are optional — defaults shown below:
    # device=None        → auto: "auto" when multiple GPUs are detected,
    #                      "cuda:0" for a single GPU, "cpu" otherwise
    # max_memory=None    → auto: 90 % of each GPU's total VRAM, e.g.
    #                      {0: "10GiB", 1: "10GiB"} for two 12 GB GPUs
)

df = infer.batch_inference(
    system="You are an urban researcher assessing housing conditions.",
    prompt="Is this house occupied or vacant? Describe the visual evidence.",
    batch_size=4,             # batch > 1 trades VRAM for throughput
    max_new_tokens=256,
    checkpoint_path="run/labels.jsonl",   # resume-safe
)
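The (type, ...) tuples in the schema dict follow pydantic's create_model field convention, where ... marks a field as required. Assuming urban-worm turns the dict into a pydantic model (the README does not say so explicitly), the sketch below shows how one raw JSON reply would be checked against it. It needs pydantic v2 for model_validate_json; parse_reply is a hypothetical helper.

```python
from typing import Literal

from pydantic import ValidationError, create_model

schema = {
    "occupancy": (Literal["occupied", "unoccupied", "uncertain"], ...),
    "visual_evidence": (str, ...),
}

# Build a model class from the same dict passed to the inference backends.
Label = create_model("Label", **schema)

def parse_reply(raw_json):
    """Validate one model reply; return a plain dict, or None if it breaks the schema."""
    try:
        return Label.model_validate_json(raw_json).model_dump()
    except ValidationError:
        return None  # malformed JSON, a missing field, or a value outside the Literal
```

A reply of "vacant" is rejected here because the schema only allows "occupied", "unoccupied", or "uncertain"; keeping the prompt wording aligned with the Literal values reduces such misses.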

Multi-GPU note — when multiple CUDA GPUs are present, InferenceUnsloth automatically sets device_map="auto" and splits the model layers across all of them. You can override the per-GPU memory budget with max_memory, for example max_memory={0: "10GiB", 1: "10GiB"} to leave 2 GB headroom on each of two 12 GB cards. If a batch triggers an out-of-memory error at runtime, the failed chunk is automatically retried one item at a time after clearing the CUDA cache, so you lose at most one image rather than the entire batch.
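The OOM fallback can be sketched independently of any framework: run each chunk as one batch, and if it raises an out-of-memory error, clear the cache and retry the chunk one item at a time, recording None only for items that still fail alone. This is an illustration of the technique, not urban-worm's code. With PyTorch one would pass oom_error=torch.cuda.OutOfMemoryError and clear_cache=torch.cuda.empty_cache; the defaults below let it run without a GPU.

```python
from typing import Callable, List, Optional, Sequence, Tuple, Type, TypeVar

T = TypeVar("T")
R = TypeVar("R")

def run_chunks_with_fallback(
    items: Sequence[T],
    infer_batch: Callable[[List[T]], List[R]],
    batch_size: int,
    oom_error: Type[BaseException] = MemoryError,
    clear_cache: Callable[[], None] = lambda: None,
) -> Tuple[List[Optional[R]], List[int]]:
    """Run items in chunks; on OOM, retry that chunk one item at a time.

    Returns (results, failed_indices). A result is None only for an item
    that still fails on its own, so no stub output is produced silently.
    """
    results: List[Optional[R]] = []
    failed: List[int] = []
    for start in range(0, len(items), batch_size):
        chunk = list(items[start:start + batch_size])
        try:
            results.extend(infer_batch(chunk))
            continue  # whole chunk succeeded
        except oom_error:
            clear_cache()  # free memory held by the failed batch
        for offset, item in enumerate(chunk):
            try:
                results.extend(infer_batch([item]))
            except oom_error:
                clear_cache()
                results.append(None)
                failed.append(start + offset)
    return results, failed
```

Returning the failed indices alongside the results lets a caller retry or flag exactly the images that were lost.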

Inference with a cloud API

from urbanworm import InferenceAPI

infer = InferenceAPI(
    llm="claude-sonnet-4-5",   # or "gpt-4o", "gemini-2.0-flash"
    provider="anthropic",       # or "openai", "google"
    api_key="YOUR_API_KEY",
    geo_tagged_data=gtd,
    schema=schema,
)

df = infer.batch_inference(
    system="You are an urban researcher assessing housing conditions.",
    prompt="Is this house occupied or vacant? Describe the visual evidence.",
    checkpoint_path="run/labels_claude.jsonl",
)

Export to an organized dataset

# Produces dataset/metadata.csv + dataset/images/
csv_path = gtd.export(output_dir="dataset", data="svi", labels=df)
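Conceptually, an export with merged labels copies each media file into one folder and writes one CSV row per file that joins its metadata to its labels on a shared id. The sketch below shows that shape with the standard library only; the record keys (id, path), the file column, and renaming files by id are hypothetical, since the exact metadata.csv layout is not documented here.

```python
import csv
import os
import shutil

def export_dataset(records, labels, output_dir, subdir="images"):
    """Copy files into output_dir/subdir and write output_dir/metadata.csv.

    records: list of dicts sharing the same keys, including "id" and "path" (hypothetical).
    labels:  dict mapping id -> dict of label columns; ids may be missing.
    """
    media_dir = os.path.join(output_dir, subdir)
    os.makedirs(media_dir, exist_ok=True)
    label_cols = sorted({col for row in labels.values() for col in row})
    meta_cols = [k for k in records[0] if k != "path"] if records else []
    csv_path = os.path.join(output_dir, "metadata.csv")
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=meta_cols + ["file"] + label_cols)
        writer.writeheader()
        for rec in records:
            # Name files by id so two sources with the same basename cannot collide.
            name = f"{rec['id']}{os.path.splitext(rec['path'])[1]}"
            shutil.copy2(rec["path"], os.path.join(media_dir, name))
            row = {k: rec[k] for k in meta_cols}
            row["file"] = f"{subdir}/{name}"  # path relative to metadata.csv
            row.update(labels.get(rec["id"], {}))  # unlabeled rows keep blank label cells
            writer.writerow(row)
    return csv_path
```

Storing paths relative to metadata.csv keeps the dataset folder portable when it is moved or zipped.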

More examples: docs/1_basic_inference.ipynb, docs/3_ground_truth_labeling.ipynb.

To do

v0.1.x:

  • A module for collecting social media data (Flickr and Freesound)
  • A method for running inference on sound recordings

v0.2.x:

  • Crash-safe checkpointing for collection and inference
  • Cloud API inference backend (Claude / GPT-4o / Gemini)
  • export() — organized dataset export with metadata CSV
  • Full ground-truth labeling tutorial notebook
  • A web UI providing interactive operation and data visualization

Legal Notice

This repository and its content are provided for educational and research purposes only. By using the information and code provided, users acknowledge that they are using the APIs and models at their own risk and agree to comply with any applicable laws and regulations.

Acknowledgements

The inference backends are built on:

The GIS data sourcing, image processing, and data collection functionality is built on:

The development of this package is supported and inspired by the City of Detroit.

Download files

Source Distribution

urban_worm-0.2.2.tar.gz (309.7 kB)

Uploaded Source

Built Distribution

urban_worm-0.2.2-py3-none-any.whl (294.4 kB)

Uploaded Python 3

File details

Details for the file urban_worm-0.2.2.tar.gz.

File metadata

  • Download URL: urban_worm-0.2.2.tar.gz
  • Upload date:
  • Size: 309.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.5

File hashes

Hashes for urban_worm-0.2.2.tar.gz
Algorithm Hash digest
SHA256 310923d2f063875f2fec58bf7ca125e3e6b5195ee1563800f390f1059169f88b
MD5 d161dea11bb3b50337d3f47fa57e31f6
BLAKE2b-256 08460eb4def8f9bddcc1cc4543acfdbf97f2b7a78132d056ab31586b3a588b16

File details

Details for the file urban_worm-0.2.2-py3-none-any.whl.

File metadata

  • Download URL: urban_worm-0.2.2-py3-none-any.whl
  • Upload date:
  • Size: 294.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.5

File hashes

Hashes for urban_worm-0.2.2-py3-none-any.whl
Algorithm Hash digest
SHA256 b4e1ac6ffc30c32c5254e7ef467b5588e505785eb53cf7f86f03b1962cf85266
MD5 db6199a4d1eea09fbc6e3e343d98985f
BLAKE2b-256 7a3562fbded9d170bc83c439ce49631f261ff4aef17f77454d3034b85dbf98ab

