A package to scrape Utah housing data for certain cities of Salt Lake and Utah counties

Project description

A tutorial and full documentation are available on the project's documentation site.

STAT 386 Final Project: Utah Housing Data Scraper

Overview

This project is a Python package designed to collect and analyze Utah housing data from UtahRealEstate.com. It focuses on properties in Utah County and Salt Lake County, providing structured data as well as tools for cleaning, visualization, and analysis.

The package's web scraper uses Playwright for browser automation, and the package is designed for easy integration into other Python projects.


Features

  • Scrapes housing listings for multiple cities in Utah
  • Extracts details such as:
    • MLS number
    • Price
    • Address
    • Beds, Baths, Square Footage
    • Year Built, Lot Size, Garage
    • Listing Agent
  • Outputs data as:
    • Pandas DataFrame or
    • CSV file
  • Configurable:
    • Number of listings per city
    • Target cities

Project Structure

Stat_386_final_project/
├── LICENSE
├── README.md
├── Documentation.qmd
├── Tutorial.qmd
├── TechnicalReport.qmd
├── index.qmd
├── pyproject.toml
├── streamlit_page.py
├── styles.css
├── uv.lock
├── data/
│   ├── Salt_Lake_County_housing_data.csv
│   ├── test_data.csv
│   └── utah_housing_data_ORIGINAL.csv
├── scripts/
│   ├── _scraper_less_intensive.py
│   ├── salt_lake_county.py
│   └── scraper.py
├── src/
│   └── utah_housing_stat386/
│       ├── __init__.py
│       ├── core.py
│       ├── cleaning.py
│       ├── demo.py
│       ├── data/
│       └── streamlit_app.py
├── tests/
│   ├── package_test.py
│   ├── test_cleaning.py
│   └── test.ipynb
└── docs/
    └── (generated Quarto HTML files)

Package

The main package is utah_housing_stat386, located in the src/utah_housing_stat386/ directory. It contains the core functionality for scraping, cleaning, and data handling.

  • core.py: Contains the main scraping logic and data fetching functions
  • cleaning.py: Data cleaning and validation functions for housing data
  • demo.py: Demo functions for testing and quick prototyping
  • __init__.py: Initializes the package and exposes all public functions

Installation

  1. Install the package from PyPI:
    pip install utah_housing_stat386

  2. Install Playwright and its browsers (required for scraping):
    pip install playwright
    playwright install
    
    This downloads the browser binaries (Chromium, Firefox, WebKit) that Playwright needs.

Usage

The main functionality is exposed via the get_data function in utah_housing_stat386.core, which scrapes data directly. Warning: this function is extremely memory-intensive. If static data is sufficient, it is easiest (and highly recommended) to use the data_no_scape function instead.

Example: Basic Data Fetching

# Import dependencies
from utah_housing_stat386.core import get_data
from utah_housing_stat386.cleaning import data_no_scape
import pandas as pd
import nest_asyncio
nest_asyncio.apply()


#####  Dynamic scraping  #####

# Fetch data for specific cities, 5 listings per city, return as DataFrame
df = get_data(max_listings=5, cities=['provo', 'salt-lake-city'], output="pandas")
print(df.head())

# Save data to a CSV file instead
get_data(max_listings=5, output="csv")


#####  Static data (RECOMMENDED)  #####
df_static = data_no_scape()

Configuration

  • max_listings: Number of listings per city (default: 5)
  • cities: List of cities (default: all supported cities)
  • output: "pandas" DataFrame or "csv" file (default: "pandas")

Supported cities include:

  • Utah County: alpine, american-fork, eagle-mountain, highland, lindon, lehi, orem, provo, saratoga-springs, spanish-fork
  • Salt Lake County: draper, holladay, midvale, millcreek, cottonwood-heights, murray, salt-lake-city, sandy, south-jordan, south-salt-lake, sugarhouse, west-jordan, west-valley
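The city names above are URL-style slugs. As an illustration (not part of the package's API), a small helper can validate requested slugs against the supported lists before calling get_data:

```python
# Hypothetical helper (not part of utah_housing_stat386): validate city
# slugs against the supported lists before passing them to get_data.
UTAH_COUNTY = {
    "alpine", "american-fork", "eagle-mountain", "highland", "lindon",
    "lehi", "orem", "provo", "saratoga-springs", "spanish-fork",
}
SALT_LAKE_COUNTY = {
    "draper", "holladay", "midvale", "millcreek", "cottonwood-heights",
    "murray", "salt-lake-city", "sandy", "south-jordan", "south-salt-lake",
    "sugarhouse", "west-jordan", "west-valley",
}
SUPPORTED = UTAH_COUNTY | SALT_LAKE_COUNTY

def validate_cities(cities):
    """Return the requested slugs, raising if any are unsupported."""
    unknown = sorted(set(cities) - SUPPORTED)
    if unknown:
        raise ValueError(f"Unsupported cities: {unknown}")
    return list(cities)
```

Failing fast here avoids launching a Playwright browser only to discover a typo in a city name.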

Data Files

  • data/utah_housing_data_ORIGINAL.csv: Sample of scraped data for Utah County
  • data/Salt_Lake_County_housing_data.csv: Sample of scraped data for Salt Lake County
  • data/test_data.csv: Test dataset (produced in development)

Scripts (produced in development)

  • scripts/scraper.py: Main scraper script using Playwright
  • scripts/_scraper_less_intensive.py: Less intensive version of the scraper
  • scripts/salt_lake_county.py: Script to scrape Salt Lake County data

Other Files & Resources

  • pyproject.toml: Project configuration, dependencies, and package metadata
  • uv.lock: Lock file for reproducible dependency management
  • streamlit_page.py: Interactive Streamlit web interface for data exploration
  • Tutorial.qmd: User guide and tutorial for the package
  • TechnicalReport.qmd: Detailed technical documentation and methodology
  • tests/: Unit tests and integration tests for the package
  • docs/: Pre-built Quarto HTML documentation files

Data Cleaning

The package includes comprehensive data cleaning functions that transform raw scraped data into an analysis-ready format.

Quick Start with Cleaned Data

from utah_housing_stat386 import get_cleaned_data, cleaned_static_data

# Get cleaned data directly (via scraping, memory-intensive)
df_clean = get_cleaned_data(max_listings=10, output="pandas")
print(df_clean.head())

# Get static data (highly recommended)
df_static_clean = cleaned_static_data()
print(df_static_clean.head())

Manual Cleaning Workflow

from utah_housing_stat386.cleaning import data_no_scape
from utah_housing_stat386 import get_data, clean_housing_data, remove_duplicates, remove_invalid_entries

# Get raw data (statically)
df_raw = data_no_scape()

# Apply cleaning step-by-step
df_clean = clean_housing_data(df_raw)
df_clean = remove_duplicates(df_clean)
df_clean = remove_invalid_entries(df_clean)
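The three steps above can also be chained with pandas' pipe. The sketch below uses stand-in cleaners with the same shape (DataFrame in, DataFrame out); the package's real clean_housing_data, remove_duplicates, and remove_invalid_entries would slot in the same way:

```python
import pandas as pd

# Stand-in cleaners, shown only to illustrate chaining with .pipe();
# they are not the package's actual implementations.
def clean_housing_data(df):
    df = df.copy()
    df["price"] = pd.to_numeric(df["price"], errors="coerce")
    return df

def remove_duplicates(df):
    return df.drop_duplicates(subset=["mls_number"])

def remove_invalid_entries(df):
    return df.dropna(subset=["price"])

df_raw = pd.DataFrame({
    "mls_number": [1, 1, 2, 3],
    "price": ["481999", "481999", "bad", "350000"],
})

# Equivalent to applying the three steps one at a time
df_clean = (
    df_raw
    .pipe(clean_housing_data)
    .pipe(remove_duplicates)
    .pipe(remove_invalid_entries)
)
```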

Individual Cleaning Functions

from utah_housing_stat386 import clean_price, clean_lot_size, clean_garage

# Clean specific fields
df['price'] = df['price'].apply(clean_price)
df['lot_size'] = df['lot_size'].apply(clean_lot_size)  # Converts to acres
df['garage'] = df['garage'].apply(clean_garage)  # Extracts garage spaces

Cleaning Functions Reference

  • clean_price(): converts price strings to numeric ("$481,999" → 481999.0)
  • clean_numeric_field(): cleans beds, baths, sqft ("1,252" → 1252.0)
  • clean_year_built(): validates year built ("1919" → 1919)
  • clean_lot_size(): converts lot size to acres ("0.10 Ac" → 0.1)
  • clean_garage(): extracts garage spaces ("2 Car" → 2)
  • clean_address(): standardizes and trims address strings ("123 Main St,, " → "123 Main St")
  • clean_city(): normalizes city names to lowercase and trims whitespace (" Provo " → "provo")
  • check_is_nan(): checks whether a value is NaN or empty (None / "" → True)
  • clean_housing_data(): applies all cleaning steps to a DataFrame
  • remove_duplicates(): removes duplicate listings
  • remove_invalid_entries(): removes rows with missing critical data
  • get_cleaned_data(): fetches data via get_data, applies cleaning, and returns a DataFrame or writes a CSV
  • data_no_scape(): loads the bundled static CSV files and concatenates them into a DataFrame
  • cleaned_static_data(): loads the static CSVs and returns a cleaned DataFrame
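To make the conversions concrete, here is a minimal, self-contained sketch of what field cleaners like these might look like. These are illustrative reimplementations consistent with the examples above, not the package's actual code:

```python
import re

# Illustrative reimplementations of three field cleaners; the package's
# actual implementations may differ.
def clean_price(value):
    """'$481,999' -> 481999.0; unparseable values -> None."""
    digits = re.sub(r"[^\d.]", "", str(value))
    return float(digits) if digits else None

def clean_lot_size(value):
    """'0.10 Ac' -> 0.1 (acres)."""
    match = re.search(r"[\d.]+", str(value))
    return float(match.group()) if match else None

def clean_garage(value):
    """'2 Car' -> 2 (number of garage spaces)."""
    match = re.search(r"\d+", str(value))
    return int(match.group()) if match else None
```

Returning None for unparseable values (rather than raising) lets a later remove_invalid_entries step drop incomplete rows in one place.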

Demo & Testing

The package includes demo functionality to get started quickly:

from utah_housing_stat386 import run_demo, demo_cleaning, load_demo_data

# Run full demo with sample data
run_demo()

# Load demo dataset
df_demo = load_demo_data()

# See demo cleaning in action
demo_cleaning()

Tests are located in the tests/ directory and can be run with:

pytest tests/
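A unit test for one of the cleaners might look like the following. This is a hypothetical example in the style of tests/test_cleaning.py; clean_price here is a local stand-in so the example runs on its own:

```python
import re

# Local stand-in for the package's clean_price, so this example is
# self-contained; a real test would import it from the package instead.
def clean_price(value):
    digits = re.sub(r"[^\d.]", "", str(value))
    return float(digits) if digits else None

def test_clean_price_strips_symbols():
    assert clean_price("$481,999") == 481999.0

def test_clean_price_handles_empty():
    assert clean_price("") is None
```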

License

MIT 2025


Download files

Download the file for your platform.

Source Distribution

utah_housing_stat386-0.3.2.tar.gz (12.5 kB)

Uploaded Source

Built Distribution


utah_housing_stat386-0.3.2-py3-none-any.whl (15.1 kB)

Uploaded Python 3

File details

Details for the file utah_housing_stat386-0.3.2.tar.gz.

File metadata

  • Download URL: utah_housing_stat386-0.3.2.tar.gz
  • Size: 12.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.5

File hashes

Hashes for utah_housing_stat386-0.3.2.tar.gz
Algorithm Hash digest
SHA256 5851f419f0536b0cc024dcc6d4662162623b84ece1dc8494a308fb3654b872c7
MD5 533bf0a3a8778ef9ca78eb90eba565b6
BLAKE2b-256 68a900ded732bf154b1b796a6c805a20db4e6a59279b2b97ea611affe82d80bc


File details

Details for the file utah_housing_stat386-0.3.2-py3-none-any.whl.

File hashes

Hashes for utah_housing_stat386-0.3.2-py3-none-any.whl
Algorithm Hash digest
SHA256 dd86391c472465b4f5a1cb1e7dde87077d38b654efa6645d95417695fdc3f99a
MD5 b914a79c849d05c5c735f600bb2a1329
BLAKE2b-256 bc5cb93abec0b3956c12920b18a3a002668a3cd7119f9bd92b6d001f5fb4195e

