A package to scrape Utah housing data for certain cities of Salt Lake and Utah counties
If you would like to run through a tutorial or read the documentation, you can go here.
STAT 386 Final Project: Utah Housing Data Scraper
Overview
This project is a Python package designed to collect and analyze Utah housing data from UtahRealEstate.com. It focuses on properties in Utah County and Salt Lake County, providing structured data as well as tools for cleaning, visualization, and analysis.
The package's web scraper uses Playwright for browser automation, and the package is designed for easy integration into other Python projects.
Features
- Scrapes housing listings for multiple cities in Utah
- Extracts details such as:
- MLS number
- Price
- Address
- Beds, Baths, Square Footage
- Year Built, Lot Size, Garage
- Listing Agent
- Outputs data as:
- Pandas DataFrame or
- CSV file
- Configurable:
- Number of listings per city
- Target cities
Project Structure
Stat_386_final_project/
├── LICENSE
├── README.md
├── Documentation.qmd
├── Tutorial.qmd
├── TechnicalReport.qmd
├── index.qmd
├── pyproject.toml
├── streamlit_page.py
├── styles.css
├── uv.lock
├── data/
│ ├── Salt_Lake_County_housing_data.csv
│ ├── test_data.csv
│ └── utah_housing_data_ORIGINAL.csv
├── scripts/
│ ├── _scraper_less_intensive.py
│ ├── salt_lake_county.py
│ └── scraper.py
├── src/
│ └── utah_housing_stat386/
│ ├── __init__.py
│ ├── core.py
│ ├── cleaning.py
│ ├── demo.py
│ ├── data/
│ └── streamlit_app.py
├── tests/
│ ├── package_test.py
│ ├── test_cleaning.py
│ └── test.ipynb
└── docs/
└── (generated Quarto HTML files)
Package
The main package is utah_housing_stat386, located in the src/utah_housing_stat386/ directory. It contains the core functionality for scraping, cleaning, and data handling.
- core.py: Contains the main scraping logic and data-fetching functions
- cleaning.py: Data cleaning and validation functions for housing data
- demo.py: Demo functions for testing and quick prototyping
- __init__.py: Initializes the package and exposes all public functions
Installation
- Install Playwright browsers (required for scraping):
pip install playwright
playwright install
This will download the necessary browser binaries (Chromium, Firefox, WebKit) for Playwright.
Usage
The main functionality is exposed via the get_data function in utah_housing_stat386.core, which scrapes data directly. Warning: this function is extremely memory intensive. If static data is sufficient, it is easiest, and highly recommended, to simply use the data_no_scape function instead.
Example: Basic Data Fetching
# Import dependencies
from utah_housing_stat386.core import get_data
from utah_housing_stat386.cleaning import data_no_scape
import pandas as pd
import nest_asyncio
nest_asyncio.apply()
##### Dynamic scraping #####
# Fetch data for specific cities, 5 listings per city, return as DataFrame
df = get_data(max_listings=5, cities=['provo', 'salt-lake-city'], output="pandas")
print(df.head())
# Save data to CSV file instead
get_data(max_listings=5, output="csv")
##### Static data (RECOMMENDED) #####
df_static = data_no_scape()
Configuration
- max_listings: Number of listings per city (default: 5)
- cities: List of cities (default: all supported cities)
- output:
"pandas"DataFrame or"csv"file (default:"pandas")
Supported cities include:
- Utah County: alpine, american-fork, eagle-mountain, highland, lindon, lehi, orem, provo, saratoga-springs, spanish-fork
- Salt Lake County: draper, holladay, midvale, millcreek, cottonwood-heights, murray, salt-lake-city, sandy, south-jordan, south-salt-lake, sugarhouse, west-jordan, west-valley
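City names are passed as lowercase, hyphen-separated slugs. As an illustration, a small hypothetical helper (to_city_slug is not part of the package) could convert ordinary city names into that format:

```python
def to_city_slug(name: str) -> str:
    """Convert a city name to the hyphenated slug format the scraper expects.

    Trims whitespace, lowercases, and joins words with hyphens,
    e.g. "Salt Lake City" -> "salt-lake-city".
    """
    return "-".join(name.strip().lower().split())
```

Slugs produced this way can then be passed in the cities list, e.g. cities=[to_city_slug("Salt Lake City")].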
Data Files
- data/utah_housing_data_ORIGINAL.csv: Sample of scraped data for Utah County
- data/Salt_Lake_County_housing_data.csv: Sample of scraped data for Salt Lake County
- data/test_data.csv: Test dataset (produced in development)
Scripts (produced in development)
- scripts/scraper.py: Main scraper script using Playwright
- scripts/_scraper_less_intensive.py: Less intensive version of the scraper
- scripts/salt_lake_county.py: Script to scrape Salt Lake County data
Other Files & Resources
- pyproject.toml: Project configuration, dependencies, and package metadata
- uv.lock: Lock file for reproducible dependency management
- streamlit_page.py: Interactive Streamlit web interface for data exploration
- Tutorial.qmd: User guide and tutorial for the package
- TechnicalReport.qmd: Detailed technical documentation and methodology
- tests/: Unit tests and integration tests for the package
- docs/: Pre-built Quarto HTML documentation files
Data Cleaning
The package includes comprehensive data cleaning functions to transform raw scraped data into analysis-ready format.
Quick Start with Cleaned Data
from utah_housing_stat386 import get_cleaned_data, cleaned_static_data
# Get cleaned data directly (via scraping, memory-intensive)
df_clean = get_cleaned_data(max_listings=10, output="pandas")
print(df_clean.head())
# Get static data (highly recommended)
df_static_clean = cleaned_static_data()
print(df_static_clean.head())
Manual Cleaning Workflow
from utah_housing_stat386.cleaning import data_no_scape
from utah_housing_stat386 import get_data, clean_housing_data, remove_duplicates, remove_invalid_entries
# Get raw data (statically)
df_raw = data_no_scape()
# Apply cleaning step-by-step
df_clean = clean_housing_data(df_raw)
df_clean = remove_duplicates(df_clean)
df_clean = remove_invalid_entries(df_clean)
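To make the deduplication and filtering steps concrete, here is a self-contained sketch using plain pandas. The two functions below are illustrative stand-ins for the package's remove_duplicates and remove_invalid_entries (their assumed behavior is inferred from the reference table; the real implementations, including the exact column names, may differ):

```python
import numpy as np
import pandas as pd

def remove_duplicates_sketch(df: pd.DataFrame) -> pd.DataFrame:
    # Deduplicate on the MLS number, keeping the first occurrence
    return df.drop_duplicates(subset="mls_number", keep="first")

def remove_invalid_entries_sketch(df: pd.DataFrame) -> pd.DataFrame:
    # Drop rows missing critical fields such as price or address
    return df.dropna(subset=["price", "address"])

raw = pd.DataFrame({
    "mls_number": ["123", "123", "456", "789"],
    "price": [481999.0, 481999.0, np.nan, 350000.0],
    "address": ["123 Main St", "123 Main St", "55 Center St", None],
})
clean = remove_invalid_entries_sketch(remove_duplicates_sketch(raw))
# Only the first listing survives: the duplicate, the missing-price row,
# and the missing-address row are all removed.
```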
Individual Cleaning Functions
from utah_housing_stat386 import clean_price, clean_lot_size, clean_garage
# Clean specific fields
df['price'] = df['price'].apply(clean_price)
df['lot_size'] = df['lot_size'].apply(clean_lot_size) # Converts to acres
df['garage'] = df['garage'].apply(clean_garage) # Extracts garage spaces
Cleaning Functions Reference
| Function | Description | Example Input | Example Output |
|---|---|---|---|
| clean_price() | Converts price strings to numeric | "$481,999" | 481999.0 |
| clean_numeric_field() | Cleans beds, baths, sqft | "1,252" | 1252.0 |
| clean_year_built() | Validates year built | "1919" | 1919 |
| clean_lot_size() | Converts to acres | "0.10 Ac" | 0.1 |
| clean_garage() | Extracts garage spaces | "2 Car" | 2 |
| clean_housing_data() | Applies all cleaning | DataFrame | Cleaned DataFrame |
| remove_duplicates() | Removes duplicate listings | DataFrame | Deduplicated DataFrame |
| remove_invalid_entries() | Removes rows with missing critical data | DataFrame | Filtered DataFrame |
| check_is_nan() | Checks whether a value is NaN or empty | None / "" | True / False |
| clean_address() | Standardizes and trims address strings | "123 Main St,, " | "123 Main St" |
| clean_city() | Normalizes city names to lowercase and trims whitespace | " Provo " | "provo" |
| get_cleaned_data() | Fetches data (via get_data), applies cleaning, and returns a DataFrame or writes a CSV | get_cleaned_data(max_listings=5) | Cleaned DataFrame or path to CSV |
| data_no_scape() | Loads the bundled static CSV files and concatenates them into a DataFrame | n/a | DataFrame |
| cleaned_static_data() | Loads static CSVs and returns a cleaned DataFrame (applies cleaning pipeline) | n/a | Cleaned DataFrame |
Demo & Testing
The package includes demo functionality to get started quickly:
from utah_housing_stat386 import run_demo, demo_cleaning, load_demo_data
# Run full demo with sample data
run_demo()
# Load demo dataset
df_demo = load_demo_data()
# See demo cleaning in action
demo_cleaning()
Tests are located in the tests/ directory and can be run with:
pytest tests/
License
MIT 2025
File details
Details for the file utah_housing_stat386-0.3.2.tar.gz.
File metadata
- Download URL: utah_housing_stat386-0.3.2.tar.gz
- Upload date:
- Size: 12.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.5
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 5851f419f0536b0cc024dcc6d4662162623b84ece1dc8494a308fb3654b872c7 |
| MD5 | 533bf0a3a8778ef9ca78eb90eba565b6 |
| BLAKE2b-256 | 68a900ded732bf154b1b796a6c805a20db4e6a59279b2b97ea611affe82d80bc |
File details
Details for the file utah_housing_stat386-0.3.2-py3-none-any.whl.
File metadata
- Download URL: utah_housing_stat386-0.3.2-py3-none-any.whl
- Upload date:
- Size: 15.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.5
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | dd86391c472465b4f5a1cb1e7dde87077d38b654efa6645d95417695fdc3f99a |
| MD5 | b914a79c849d05c5c735f600bb2a1329 |
| BLAKE2b-256 | bc5cb93abec0b3956c12920b18a3a002668a3cd7119f9bd92b6d001f5fb4195e |