
Singapore Public Housing (HDB) Valuation Engine using geospatial accessibility scoring.


HDB Valuation Engine 🇸🇬



A quantitative tool for identifying undervalued Singapore public housing assets using spatial data analysis.

The Real-World Problems

  1. The LRT Deception: Commercial portals treat LRT (Light Rail) and MRT (Heavy Rail) as equal. This is false. LRT loops add significant commute latency. Buyers need a metric that rewards True Connectivity.
  2. The Lease Illusion: Buyers often fixate on raw price, ignoring lease decay. A 'cheap' flat with 50 years remaining is often a worse asset than a pricier unit with 95 years.
  3. Data Overload: With thousands of transactions, manual comparisons are impossible. Buyers need statistical anomaly detection, not just a search bar.

The Engineering Solution

This engine ingests historical transaction data to calculate a 'True Value Score' for every flat.

  • LRT-Exclusion Algorithm: Uses Regex filtering and KDTree spatial indexing to calculate walking distance strictly to MRT nodes.
  • Depreciation Logic: Normalizes price against remaining lease life to find the true cost of ownership.
  • Z-Score Ranking: Identifies properties trading 2 deviations below their cluster average.

Key Features

  • Dual Interface: Use as a CLI tool OR as a Python module in your own projects
  • 🎨 Beautiful CLI (v0.4.1+): Rich terminal output with colored tables, progress spinners, and visual feedback
  • Strict OOP pipeline with type hints and logging
  • Robust lease parsing and inference (handles text and infers from lease_commence_date + month)
  • Lease-adjusted price efficiency metric and group-wise Z-Score valuation
  • Extended filters (exact/partial) and numeric ranges
  • TransportScorer with KDTree and strict LRT exclusion (regex ^(BP|S[WE]|P[WE]))
  • Export to CSV/JSON/Parquet; optional full export
  • Configurable peer grouping via --group-by
  • Caching for fast repeated transport queries; cache management subcommand

Algorithm Overview

Core Valuation Pipeline

  1. Lease parsing/inference

    • Parse remaining_lease strings to float years (e.g., 61 years 04 months → 61.33).
    • If absent, infer years: remaining_years = 99 - ((YYYY + (MM-1)/12) - lease_commence_year).
  2. Bala's Curve: Non-Linear Lease Depreciation (v0.3.0+)

    • Why it matters: HDB leases don't lose value linearly. A flat with 80 years remaining holds almost full value, while one with 30 years faces steep depreciation. Traditional linear models miss this critical market behavior.
    • Mathematical model: depreciation_factor = exp(-k × ((99 - remaining) / 99)^n)
      • Default parameters: k=3.0 (decay rate), n=2.5 (curve steepness)
    • Real-world behavior:
      • 99 years → factor = 1.00 (no depreciation)
      • 80 years → factor = 0.95 (minimal depreciation, ~5% loss)
      • 60 years → factor = 0.75 (moderate depreciation, ~25% loss)
      • 40 years → factor = 0.44 (accelerating depreciation, ~56% loss)
      • 20 years → factor = 0.18 (severe depreciation, ~82% loss)
    • Impact: Properties with shorter leases get penalized more heavily in valuation, reflecting true market economics and helping buyers avoid "cheap but depreciating" traps.
    • Academic foundation: Based on Bala's studies on Singapore HDB lease decay and observed market behavior in resale transactions.
  3. Price efficiency (lease-adjusted)

    • Base: price_efficiency = resale_price / (floor_area_sqm × remaining_lease_years)
    • Adjusted: price_efficiency_adjusted = base_efficiency / depreciation_factor
    • Lower values indicate better cost per effective area-year (better value)
  4. Group-wise Z-Score

    • Group by --group-by (default: town, flat_type) and compute z = (x - μ) / σ.
    • Identifies statistical outliers within peer groups (e.g., cheap 4-ROOM flats in PUNGGOL compared to other 4-ROOM PUNGGOL flats)
  5. Valuation score

    • valuation_score = -z_price_efficiency (higher → more undervalued relative to peers).
    • Combined with growth potential analysis to identify "deep value" opportunities
  6. Transport accessibility (optional)

    • Compute nearest MRT exit distance (LRT excluded) and Accessibility_Score = max(0, 10 - 2 × dist_km).
    • By default adjusts price_efficiency; use --no-accessibility-adjust for analysis-only.
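
The numbered steps above can be sketched end-to-end in a few lines of pandas. This is a simplified illustration of the documented formulas, not the package's internal code; the function names (`depreciation_factor`, `parse_remaining_lease`, `score`) are hypothetical, and column names follow the normalized lowercase schema:

```python
import math
import re

import pandas as pd

K, N = 3.0, 2.5  # Bala's Curve defaults (k = decay rate, n = curve steepness)

def depreciation_factor(remaining_years: float) -> float:
    """exp(-k * ((99 - remaining) / 99) ** n): 1.0 for a fresh 99-year lease."""
    return math.exp(-K * ((99.0 - remaining_years) / 99.0) ** N)

def parse_remaining_lease(text: str) -> float:
    """'61 years 04 months' -> ~61.33."""
    m = re.match(r"(\d+)\s*years?(?:\s*(\d+)\s*months?)?", text)
    return int(m.group(1)) + int(m.group(2) or 0) / 12.0

def score(df: pd.DataFrame, group_by=("town", "flat_type")) -> pd.DataFrame:
    """Steps 2-5: depreciation, adjusted price efficiency, group z-score, valuation."""
    df = df.copy()
    factor = df["remaining_lease_years"].map(depreciation_factor)
    base = df["resale_price"] / (df["floor_area_sqm"] * df["remaining_lease_years"])
    df["price_efficiency_adjusted"] = base / factor
    grp = df.groupby(list(group_by))["price_efficiency_adjusted"]
    z = (df["price_efficiency_adjusted"] - grp.transform("mean")) / grp.transform("std")
    df["valuation_score"] = -z  # higher => more undervalued vs peers
    return df
```

The depreciation factors reproduce the table above (e.g. `depreciation_factor(80)` ≈ 0.95, `depreciation_factor(40)` ≈ 0.44).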

System Architecture & Data Flow

flowchart TB
    subgraph Input["📊 Data Sources"]
        A1[HDB Resale CSV<br/>~200K+ Records]
        A2[LTA MRT Station<br/>GeoJSON API]
    end

    subgraph Pipeline["🔄 Processing Pipeline"]
        B1[HDBLoader<br/>Schema Normalization]
        B2[FeatureEngineer<br/>Lease Parsing & Inference]
        B3[TransportScorer<br/>KDTree Spatial Indexing]
        B4[ValuationEngine<br/>Statistical Scoring]
        B5[ReportGenerator<br/>Filtering & Ranking]
    end

    subgraph Algorithms["🧮 Core Algorithms"]
        C1["Lease Depreciation Model<br/>remaining = 99 - (txn_year - commence_year)"]
        C2["Price Efficiency<br/>PE = price / (area × lease_years)"]
        C3["LRT Exclusion Filter<br/>Regex: ^(BP|S[WE]|P[WE])"]
        C4["KDTree Nearest Neighbor<br/>O(log n) Spatial Query"]
        C5["Haversine Distance<br/>Great-Circle Calculation"]
        C6["Group-wise Z-Score<br/>z = (x - μ) / σ<br/>within (town, flat_type) cohorts"]
        C7["Accessibility Score<br/>AS = max(0, 10 - 2×dist_km)"]
        C8["Valuation Score<br/>VS = -z_PE × (1 + AS/10)"]
    end

    subgraph Output["📈 Outputs"]
        D1[Ranked DataFrame<br/>Top-N Undervalued Units]
        D2[CLI Report<br/>Formatted Table]
        D3[Export Files<br/>CSV/JSON/Parquet]
        D4[Programmatic API<br/>Python Module Integration]
    end

    A1 --> B1
    A2 --> B3
    B1 --> B2
    B2 --> C1
    B2 --> C2
    C1 --> B4
    C2 --> B4
    B3 --> C3
    C3 --> C4
    C4 --> C5
    C5 --> C7
    B4 --> C6
    C6 --> C8
    C7 --> C8
    C8 --> B5
    B5 --> D1
    D1 --> D2
    D1 --> D3
    D1 --> D4

    style Input fill:#e1f5ff,stroke:#01579b,stroke-width:2px
    style Pipeline fill:#f3e5f5,stroke:#4a148c,stroke-width:2px
    style Algorithms fill:#fff3e0,stroke:#e65100,stroke-width:2px
    style Output fill:#e8f5e9,stroke:#1b5e20,stroke-width:2px

    style C4 fill:#ffeb3b,stroke:#f57f17,stroke-width:3px
    style C6 fill:#ffeb3b,stroke:#f57f17,stroke-width:3px
    style C8 fill:#ffeb3b,stroke:#f57f17,stroke-width:3px

Technical Highlights

🎯 Statistical Rigor

  • Group-wise Z-score normalization ensures fair peer comparison across 26 towns and 7 flat types
  • Robust handling of zero-variance groups and missing data
  • Mathematical foundation allows for reproducible, bias-free property valuation

๐Ÿ—บ๏ธ Geospatial Innovation

  • KDTree spatial indexing enables O(log n) nearest-neighbor queries on 200K+ properties
  • Haversine distance calculation accounts for Earth's curvature (±0.5% accuracy)
  • Regex-based LRT exclusion (BP/SW/SE/PW/PE lines) ensures only heavy rail stations are considered
  • Caching mechanism reduces repeated spatial queries from minutes to milliseconds
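
A minimal sketch of this spatial approach using SciPy's cKDTree (illustrative only — the `MRTIndex` class and helper names are hypothetical, not the library's API). Points are projected onto the unit sphere so that a Euclidean nearest-neighbor query agrees with great-circle nearness, and the chord length is converted back to kilometers:

```python
import re

import numpy as np
from scipy.spatial import cKDTree

EARTH_R_KM = 6371.0

def to_unit_xyz(lat_deg, lon_deg):
    """Project lat/lon onto the unit sphere so Euclidean NN == great-circle NN."""
    lat, lon = np.radians(np.asarray(lat_deg)), np.radians(np.asarray(lon_deg))
    return np.column_stack([np.cos(lat) * np.cos(lon),
                            np.cos(lat) * np.sin(lon),
                            np.sin(lat)])

class MRTIndex:
    """Nearest-station lookup: O(log n) queries after an O(n log n) build."""
    def __init__(self, lat, lon, names):
        self.names = np.asarray(names)
        self.tree = cKDTree(to_unit_xyz(lat, lon))

    def nearest(self, lat, lon):
        chord, idx = self.tree.query(to_unit_xyz(np.atleast_1d(lat), np.atleast_1d(lon)))
        # Convert 3D chord length back to great-circle distance in km
        dist_km = 2.0 * EARTH_R_KM * np.arcsin(np.clip(chord / 2.0, 0.0, 1.0))
        return self.names[idx], dist_km

def accessibility_score(dist_km):
    """Accessibility_Score = max(0, 10 - 2 * dist_km), as documented."""
    return np.maximum(0.0, 10.0 - 2.0 * dist_km)

# LRT exclusion: drop station codes on the BP/SW/SE/PW/PE loops
LRT_RE = re.compile(r"^(BP|S[WE]|P[WE])")
mrt_only = [c for c in ["NS17", "PW3", "BP6", "CC15"] if not LRT_RE.match(c)]
```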

💰 Financial Modeling

  • Lease depreciation model adjusts for Singapore's 99-year leasehold system
  • Time-value-of-money consideration through remaining lease normalization
  • Price efficiency metric captures $/sqm/year for true cost-of-ownership analysis

๐Ÿ—๏ธ Software Engineering

  • Object-oriented pipeline with strict type hints (PEP 484 compliant)
  • Dual interface: CLI for analysts, Python API for integration
  • 66% test coverage with 26/26 tests passing
  • Comprehensive logging and error handling for production reliability

Installation

Dashboard Command Center (Streamlit)

Optional: Install the Palantir-style dashboard dependencies:

pip install ".[dashboard]"

Run the Command Center:

streamlit run dashboard/app.py

Streamlit Cloud deployment (zero-config)

In Streamlit Cloud:

  • App file: dashboard/app.py
  • Requirements file: dashboard/requirements.txt

The dashboard uses pydeck CARTO basemaps (pdk.map_styles.CARTO_DARK) so it runs instantly without API keys.

Notes:

  • Flux live simulation requires pip install ".[flux]". If Warp/CUDA is unavailable, the dashboard falls back to Replay Mode (load frames from dashboard/data/flux_frames/).
  • Generate Replay Mode frames locally:
    pip install ".[flux]"
    hdb-valuation-engine flux-replay \
      --agents 100000 \
      --frames 60 \
      --steps-per-frame 5 \
      --out-dir dashboard/data/flux_frames \
      --parquet dashboard/data/flux_frames.parquet \
      --fps 10
    

Replay Mode (Streamlit Cloud) supports animated playback controls (play/pause, loop, speed, scrub).

Replay data format (for Streamlit Cloud)

Replay frames live in dashboard/data/flux_frames/ and are simple JSON files:

{"lon": [103.8, 103.81], "lat": [1.30, 1.31]}

This minimal schema is deliberate: it keeps the dashboard fast, portable, and independent of GPU availability.

If you provide --parquet, flux-replay also emits a long-form time-series (frame, t_seconds, lon, lat) for analytics.
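
Producing and consuming that schema takes only a few lines; a sketch (the `frame_0000.json` file naming and helper names here are assumptions, not the CLI's guaranteed layout):

```python
import json
import tempfile
from pathlib import Path

def write_frame(path: Path, lon: list, lat: list) -> None:
    """One replay frame: parallel lon/lat arrays, nothing else."""
    assert len(lon) == len(lat)
    path.write_text(json.dumps({"lon": lon, "lat": lat}))

def load_frames(frames_dir: Path) -> list:
    """Frames play back in sorted filename order (frame_0000.json, frame_0001.json, ...)."""
    return [json.loads(p.read_text()) for p in sorted(frames_dir.glob("*.json"))]

# Demo: write two frames to a scratch dir and read them back
frames_dir = Path(tempfile.mkdtemp())
write_frame(frames_dir / "frame_0000.json", [103.8, 103.81], [1.30, 1.31])
write_frame(frames_dir / "frame_0001.json", [103.82], [1.32])
frames = load_frames(frames_dir)
```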

Dashboard dataset resolution order (multi-path)

The dashboard resolves its default resale dataset in the following order:

  1. ResaleFlatPrices/<default csv> (vendored in-repo; Streamlit Cloud friendly)
  2. .data/ResaleFlatPrices/<default csv> (local cache)
  3. .data/<default csv>

If none are found, pass an explicit path or run hdb-valuation-engine fetch.

  • The dashboard is a portfolio showcase layer and is not required for core library usage.

Optional: To enable the GPU-accelerated Agent Simulation (Project Flux), install with the flux extra: pip install ".[flux]".

Quick start (CPU backend for CI / laptops):

from simulation import FluxEngine

engine = FluxEngine(num_agents=1024, grid_res=64, device="cpu")
engine.step(60)  # 1 second at dt=1/60
pos = engine.positions_numpy()  # (N, 2) NumPy array

Project Flux: Current Limitations (and how we plan to fix them)

Project Flux is intentionally an MVP-grade GPU simulation stack today: it proves the CUDA/SoA execution path and provides deterministic, benchmarkable dynamics. The next milestones focus on algorithmic realism and scalability.

Current limitations

  • No agentโ€“agent interactions: the current kernel is O(N) with flow-field sampling only (no collision avoidance, flocking, congestion, or social forces).
  • Single-step Euler integrator: stable for demos but not ideal for stiff dynamics, constraints, or long-horizon simulations.
  • Grid sampling is nearest-cell: no bilinear interpolation, which can cause aliasing at low grid resolutions.
  • Geospatial calibration is minimal: positions are in a normalized 2D domain; coupling to real Singapore geometry requires projection + calibrated datasets.
  • Single-GPU, single-process execution: no domain decomposition or multi-GPU scaling.
  • Outputs are snapshots (by default): trajectory logging for 500k+ agents requires chunked, columnar storage and careful IO design.

Planned solutions (algorithm + data upgrades)

  • Spatial hashing / uniform grid neighbor search: add O(N) (expected) neighbor interactions (flocking, collision avoidance, density) without O(N²) blowups.
  • Higher-order or semi-implicit integration: upgrade to RK2/RK4 or symplectic/semi-implicit schemes where appropriate.
  • Bilinear flow interpolation: smoother, resolution-independent flow sampling.
  • Dataset-driven fields: learn or fit flow fields from transport networks / OD matrices / pedestrian graph datasets rather than analytic fields.
  • Domain decomposition: tiling + halo exchange strategies for multi-GPU / multi-process scaling.
  • Columnar time-series output: chunked Parquet (or Arrow IPC) with downsampling strategies for visualization.
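
As one concrete example of the planned upgrades, bilinear flow interpolation replaces nearest-cell lookup with a weighted average of the four surrounding cells. A NumPy sketch, assuming an (H, W, 2) flow field and in-domain fractional coordinates:

```python
import numpy as np

def bilinear_sample(field: np.ndarray, x: float, y: float) -> np.ndarray:
    """Sample an (H, W, 2) flow field at fractional grid coords (x, y), clamped at edges."""
    h, w = field.shape[:2]
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = x - x0, y - y0  # fractional offsets within the cell
    top = (1 - fx) * field[y0, x0] + fx * field[y0, x1]
    bot = (1 - fx) * field[y1, x0] + fx * field[y1, x1]
    return (1 - fy) * top + fy * bot
```

Unlike nearest-cell lookup, this reproduces a linear field exactly between grid points, which is what removes the aliasing at low grid resolutions.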

Engineering TODO themes (how to read the codebase)

When you see TODO tags in the code, they are intentionally categorized to communicate engineering intent:

  • TODO(optimization): algorithmic complexity and data-structure upgrades (e.g., moving from naive loops to spatial indexing).
  • TODO(gpu): reducing CPUโ†”GPU transfers or moving computations fully on-device.
  • TODO(research): improving modeling fidelity via better metrics or calibrated proxies (graph impedance, learned costs).
  • TODO(limitations): explicit simplifying assumptions we plan to relax with richer datasets (demographics, behavior).

See docs/usage-guide.md and docs/source/roadmap.md for the active engineering roadmap.

From PyPI:

pip install hdb-valuation-engine

From source (recommended in a virtual environment):

python3 -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\Activate.ps1
pip install -r requirements.txt

Quick Start

Platform-agnostic data fetching (no Make needed):

  • Fetch all supported datasets (HDB resale CSV + MRT exits GeoJSON):
hdb-valuation-engine fetch
  • Fetch entire HDB resale dataset (no row limit) plus MRT exits:
hdb-valuation-engine fetch --limit 0
  • Only MRT exits to a custom path:
hdb-valuation-engine fetch --datasets mrt --mrt-out .data/LTAMRTStationExitGEOJSON.geojson
  • Only HDB resale CSV with 10k rows to default location:
hdb-valuation-engine fetch --datasets resale --limit 10000

Module usage (NEW - Clean Python API):

from hdb_valuation_engine import HDBValuationEngineApp

# Initialize the engine
app = HDBValuationEngineApp()

# Process data and get results
results = app.process(
    input_path="ResaleFlatPrices/Resale flat prices based on registration date from Jan-2017 onwards.csv",
    town="PUNGGOL",
    budget=600000,
    top_n=5
)

# Results is a pandas DataFrame
print(results)
print(f"\nFound {len(results)} undervalued properties")

# Access specific columns
for idx, row in results.iterrows():
    print(f"{row['town']}, {row['flat_type']}: ${row['resale_price']:,.0f} (Score: {row['valuation_score']:.2f})")

With MRT accessibility (default):

from hdb_valuation_engine import HDBValuationEngineApp

app = HDBValuationEngineApp()

# MRT scoring runs by default when the default MRT dataset is present
results = app.process(
    input_path="resale.csv",
    town="BISHAN",
    budget=800000,
    top_n=10
)

# Results include MRT distance and accessibility scores
print(results[["town", "resale_price", "Nearest_MRT", "Dist_m", "Accessibility_Score", "valuation_score"]])

To use a custom MRT catalog:

results = app.process(
    input_path="resale.csv",
    mrt_catalog=".data/LTAMRTStationExitGEOJSON.geojson",
    town="BISHAN",
    budget=800000,
    top_n=10
)

See EXAMPLES.md for 10+ comprehensive usage examples including:

  • Using pre-loaded DataFrames
  • Custom grouping and filters
  • Exporting results
  • Using individual pipeline components
  • Integration with Flask/web APIs

📚 Interactive Jupyter Tutorials

Learn the HDB Valuation Engine through hands-on interactive notebooks:

notebooks/01_quickstart_tutorial.ipynb 🚀

15-minute beginner tutorial

  • Loading and processing HDB data
  • Running valuation analysis with filters
  • Visualizing Bala's Curve depreciation
  • Understanding valuation scores
  • Exporting results

notebooks/02_advanced_analysis.ipynb 📊

30-minute intermediate deep dive

  • Transport accessibility and MRT proximity analysis
  • Peer group strategies and z-score interpretation
  • Statistical outlier detection
  • Cross-town market comparisons
  • Custom filtering workflows

notebooks/03_custom_workflows.ipynb 🛠️

20-minute advanced customization

  • Building custom analysis pipelines
  • Experimenting with depreciation models
  • Creating rental yield estimators
  • Multi-scenario batch processing
  • Property comparison tools

See notebooks/README.md for full documentation, installation guide, and learning paths.


🎨 Rich CLI Interface (v0.4.1+)

The HDB Valuation Engine features a beautiful, modern command-line interface powered by the Rich library:

✨ Visual Features

Before (v0.4.0):

Processing complete.
town        flat_type    resale_price    floor_area_sqm
PUNGGOL     2 ROOM       225000          50.0
BISHAN      4 ROOM       550000          90.0

After (v0.4.1):

┌───────────────────────────────────────────────────────────────────────────┐
│ HDB Valuation Engine v0.4.1                                               │
│ Identifying undervalued properties using Bala's Curve & transport scoring │
└───────────────────────────────────────────────────────────────────────────┘
⠙ Processing data...

Filters: Town: PUNGGOL | Budget: $600,000
Found 15 properties

                     🏠 Top 10 Undervalued HDB Properties
┌──────┬─────────┬───────────┬──────────────┬──────────┬───────────┬─────────────┬───────┐
│ Rank │ Town    │ Flat Type │ Address      │    Price │ Area (m²) │ Lease (yrs) │ Score │
├──────┼─────────┼───────────┼──────────────┼──────────┼───────────┼─────────────┼───────┤
│    1 │ PUNGGOL │ 4 ROOM    │ 310A Pungg…  │ $450,000 │      90.0 │        85.2 │  2.34 │
│    2 │ PUNGGOL │ 4 ROOM    │ 268C Pungg…  │ $475,000 │      92.0 │        89.5 │  1.87 │
│    3 │ PUNGGOL │ 3 ROOM    │ 110 Edgef…   │ $385,000 │      67.0 │        82.1 │  1.65 │
└──────┴─────────┴───────────┴──────────────┴──────────┴───────────┴─────────────┴───────┘

⠋ Exporting to CSV...
✓ Exported top 10 results to results.csv

🎯 Rich Features

  • Styled Tables: Beautiful Unicode borders with color-coded scores

    • 🟢 Bold Green: Excellent value (score ≥ 2.0)
    • 🟢 Green: Good value (score ≥ 1.0)
    • 🟡 Yellow: Fair value (score ≥ 0)
    • ⚪ Dim: Below average (score < 0)
  • Progress Indicators: Animated spinners for long operations

    • Data processing
    • Export operations
    • Cache building
  • Status Icons: Clear visual feedback

    • ✓ Success messages
    • ⚠ Warnings
    • ✗ Errors
  • Smart Formatting:

    • Prices: $450,000 (thousands separators)
    • Areas: 90.0 m² (decimal precision)
    • Scores: 2.34 (2 decimal places)
  • Filter Summaries: See your active filters at a glance

  • Result Counts: Know exactly how many properties match
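
The color thresholds above amount to a simple mapping from valuation score to a Rich style string (an illustrative sketch; the package's actual helper may be named and structured differently):

```python
def score_style(score: float) -> str:
    """Map a valuation score to a Rich style per the documented thresholds."""
    if score >= 2.0:
        return "bold green"   # excellent value
    if score >= 1.0:
        return "green"        # good value
    if score >= 0.0:
        return "yellow"       # fair value
    return "dim"              # below average
```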

Example Usage

# Basic search with Rich output
hdb-valuation-engine --town PUNGGOL --budget 600000 --top 10

# With export and progress indicators
hdb-valuation-engine --town BISHAN --output results.csv --output-format csv

# Multiple filters with visual feedback
hdb-valuation-engine --town-like PUNGG --flat-type "4 ROOM" --budget 500000

CLI usage (after install):

hdb-valuation-engine --input "ResaleFlatPrices/Resale flat prices based on registration date from Jan-2017 onwards.csv" --top 5 -v

Usage

hdb-valuation-engine --input <path/to/file.csv> [OPTIONS]

Core options

  • --input Path to HDB resale CSV data
  • --top Number of results to display (default: 10)
  • Logging: -v (INFO) or -vv (DEBUG)

Filters

  • Town: --town PUNGGOL (exact), --town-like unggol (partial)
  • Flat Type: --flat-type "5 ROOM" (exact), --flat-type-like room (partial)
  • Flat Model: --flat-model "Improved" (exact), --flat-model-like improv (partial)
  • Storey: --storey-min 7 --storey-max 12
  • Area (sqm): --area-min 60 --area-max 120
  • Remaining Lease (years): --lease-min 60 --lease-max 95
  • Budget (max resale_price): --budget 600000

Grouping (peer comparison)

--group-by town flat_type [flat_model]

Transport Accessibility (MRT via GeoJSON)

  • Fast, cached KDTree for nearest MRT exit queries (10k+ rows). Cache saved under .cache_transport/.
  • Provide LTA MRT Station Exit GeoJSON to enable accessibility scoring:
# You can fetch a current GeoJSON via the built-in fetcher
hdb-valuation-engine fetch --datasets mrt --mrt-out .data/LTAMRTStationExitGEOJSON.geojson

# Then reference it when running valuations
--mrt-catalog .data/LTAMRTStationExitGEOJSON.geojson
  • Excludes LRT strictly via regex ^(BP|S[WE]|P[WE]) and filters names containing LRT as a fallback.
  • Adds:
    • Nearest_MRT
    • Dist_m
    • Accessibility_Score = max(0, 10 - 2 * dist_km)
  • Analysis-only mode (no adjustment):
--no-accessibility-adjust

Exporting

--output top10.csv --output-format csv            # CSV (default)
--output top10.json --output-format json          # JSON
--output top10.parquet --output-format parquet    # Parquet (falls back to CSV if engine missing)
--export-full                                     # Export all filtered rows instead of Top-N

Quick Usage Examples

  1. Cache management subcommand
# Show cache directory
hdb-valuation-engine cache -v

# Clear cache in default location
hdb-valuation-engine cache --clear -v

# Use a custom cache dir
hdb-valuation-engine cache --transport-cache-dir .transport_cache --clear -v
  2. Build and cache KDTree from LTA GeoJSON; show Top-10 with adjustment:
hdb-valuation-engine \
  --input "ResaleFlatPrices/Resale flat prices based on registration date from Jan-2017 onwards.csv" \
  --mrt-catalog ".data/LTAMRTStationExitGEOJSON.geojson" \
  --top 10 -v
  3. Use cached KDTree on subsequent runs (faster); analysis-only mode (no price adjustment):
hdb-valuation-engine \
  --input "ResaleFlatPrices/Resale flat prices based on registration date from Jan-2017 onwards.csv" \
  --mrt-catalog ".data/LTAMRTStationExitGEOJSON.geojson" \
  --no-accessibility-adjust --top 10 -v
  4. Custom cache directory and force clear before building:
hdb-valuation-engine \
  --input "ResaleFlatPrices/Resale flat prices based on registration date from Jan-2017 onwards.csv" \
  --mrt-catalog ".data/LTAMRTStationExitGEOJSON.geojson" \
  --transport-cache-dir ".transport_cache" --clear-transport-cache --top 5 -v
  5. CSV catalog path (still supported; auto-excludes LRT lines):
hdb-valuation-engine \
  --input "ResaleFlatPrices/Resale flat prices based on registration date from Jan-2017 onwards.csv" \
  --mrt-catalog "/path/to/mrt_catalog.csv" --top 5 -v
  6. Combine with group-by and export options:
hdb-valuation-engine \
  --input "ResaleFlatPrices/Resale flat prices based on registration date from Jan-2017 onwards.csv" \
  --mrt-catalog ".data/LTAMRTStationExitGEOJSON.geojson" \
  --group-by town flat_type flat_model \
  --export-full --output top.json --output-format json --top 10 -v

Smoke Test Summary

  • 2017 onwards: Parsed remaining_lease strings successfully; produced Top-10 Punggol table under budget 600k. Export worked.
  • 2012–2014: Inferred remaining lease from lease_commence_date and month; produced Top-10 Punggol table.
  • 2000–Feb 2012: Inference path also worked; produced Top-5 for Ang Mo Kio under budget 200k.
  • Extended filters and partial matching verified; --output, --export-full, and --output-format worked as expected.

Design & Implementation Notes

  • Columns normalized to lowercase with underscores
  • Robust z-score handling for zero-variance groups
  • Logging across load, feature engineering, scoring, filtering, and export

Release and Tagging

To create a 0.1.0 release and push the tag:

git add -A
git commit -m "chore(release): cut 0.1.0"

git tag -a v0.1.0 -m "Initial PyPI packaging for hdb-valuation-engine"

git push origin main
git push origin v0.1.0

Running Tests

Note: You can fetch data without Make on any platform using the built-in fetch command:

# Fetch all datasets with defaults
hdb-valuation-engine fetch

# Fetch entire resale CSV and MRT exits
hdb-valuation-engine fetch --limit 0
  • Recommended: use a virtual environment
python3 -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\Activate.ps1
pip install -r requirements.txt
pytest -q

Optional dataset for an extra smoke test

One test is skipped by default unless a local dataset is available. To enable it:

  • Create a folder named ResaleFlatPrices at the repository root (same level as tests/ and README.md).
  • Place one or more HDB resale CSV files inside that folder, for example:
    • Resale flat prices based on registration date from Jan-2017 onwards.csv

You can fetch a small sample automatically with:

make setup-venv          # one-time environment setup
make fetch-sample-data   # downloads a subset into ./ResaleFlatPrices/

When this folder exists and contains at least one .csv file, the optional smoke test in tests/test_cli_export.py::TestOptionalRealDataset will run. If the folder is missing or empty, the test is skipped with reason:

ResaleFlatPrices folder not present; skipping optional smoke test

โš ๏ธ Current Limitations

While the HDB Valuation Engine provides sophisticated quantitative analysis, users should be aware of the following limitations in our approach and available data:

📊 Data & Methodology Constraints

1. Historical Data Only

  • Limitation: Analysis is based on past transactions from data.gov.sg
  • Impact: Cannot predict future market conditions, policy changes, or economic shifts
  • Mitigation: Use as one input among many; combine with market research and professional advice

2. Bala's Curve Parameterization

  • Limitation: Default parameters (k=3.0, n=2.5) are empirically derived but not officially validated
  • Impact: Depreciation curve may not perfectly match individual property circumstances
  • Mitigation: Parameters are configurable; users can adjust based on their research or domain expertise

3. Incomplete Quality Metrics

  • Missing factors we cannot quantify:
    • Unit condition and renovation status
    • View quality (facing, unblocked)
    • Noise levels (traffic, construction)
    • Unit position within block (corner, middle)
    • Block facilities (lift landing, accessibility)
    • Estate maturity and community amenities
    • Upcoming infrastructure developments
  • Impact: Two flats with identical specs may have vastly different actual value
  • Mitigation: Use tool for initial screening; conduct physical inspections before decisions

4. MRT Accessibility Simplification

  • Limitation: Distance-to-MRT is Euclidean (straight-line), not walking distance
  • Missing factors:
    • Bus connectivity and frequency
    • Actual walking paths and obstacles
    • MRT line quality differences (Express vs regular)
    • Future MRT line plans
  • Impact: Score may over/undervalue properties based on real commute experience
  • Mitigation: Visit properties and test actual commute times

5. Static Peer Grouping

  • Limitation: Z-scores compare within (town, flat_type) only by default
  • Impact: May miss value opportunities across towns or compare incomparable properties
  • Mitigation: Use custom --group-by parameters; analyze multiple groupings

๐Ÿ› ๏ธ Technical & Data Limitations

6. Schema Assumptions

  • Limitation: Expects standardized column names from data.gov.sg format
  • Impact: May fail or produce incorrect results with differently structured data
  • Mitigation: Review Schema class in loader.py; customize if needed

7. No Ground Truth Validation

  • Limitation: We cannot verify if "undervalued" properties actually become good investments
  • Impact: High valuation score ≠ guaranteed good deal
  • Mitigation: This is a screening tool, not investment advice

8. Outlier Sensitivity

  • Limitation: Z-scores can be skewed by extreme outliers in small peer groups
  • Impact: Unusual transactions can distort valuations for entire groups
  • Mitigation: Review raw data; filter by sample size; use multiple grouping strategies

9. No Macroeconomic Context

  • Missing factors:
    • Interest rate environment
    • Government housing policies (grants, restrictions)
    • Economic cycles and unemployment
    • Population growth and immigration trends
  • Impact: Tool cannot warn about systemic overvaluation or market timing
  • Mitigation: Consult economic indicators and professional financial advisors

๐Ÿ—๏ธ Singapore-Specific Constraints

10. HDB-Only Focus

  • Limitation: Does not cover private condos, landed property, or commercial real estate
  • Impact: Cannot compare HDB vs private housing value propositions
  • Mitigation: Use specialized tools for private property analysis

11. Policy Change Risk

  • Limitation: Cannot predict changes to:
    • Lease Buyback Scheme eligibility
    • Voluntary Early Redevelopment Scheme (VERS)
    • CPF usage rules
    • Resale levy structures
  • Impact: Tool may not capture full financial picture of HDB ownership
  • Mitigation: Stay updated on HDB policies and consult with HDB directly

12. No Rental Yield Analysis

  • Limitation: Does not estimate rental income potential or investment yields
  • Impact: Cannot advise on buy-to-rent strategies
  • Mitigation: Use separate rental market analysis tools

🎯 Recommended Usage

This tool is best used as:

  • ✅ Initial screening to identify potentially undervalued properties
  • ✅ Quantitative input to supplement qualitative research
  • ✅ Learning tool to understand lease depreciation and market dynamics
  • ✅ Comparative analysis within similar property cohorts

This tool should NOT be used as:

  • โŒ Sole basis for property purchase decisions
  • โŒ Investment advice or financial recommendations
  • โŒ Replacement for professional valuation services
  • โŒ Predictor of future appreciation or returns

Always combine with:

  • Physical property inspections
  • Professional property agents and valuers
  • Financial advisors for affordability analysis
  • Legal consultation for transaction structure
  • Personal circumstances and long-term plans
