
HoWDe

HoWDe (Home and Work Detection) is a Python package designed to identify home and work locations from individual timestamped sequences of stop locations. It processes stop location data to label each location as 'Home', 'Work', or 'None' based on user-defined parameters and heuristics.

A complete description of the algorithm can be found in our pre-print.

Features

  • Processes stop location datasets to detect home and work locations.
  • Allows customization through various parameters to fine-tune detection heuristics.
  • Supports batch processing with multiple parameter configurations.
  • Outputs results as a PySpark DataFrame for seamless integration with big data workflows.

Installation

HoWDe requires Python 3.6 or later and a functional PySpark environment.

1. Install PySpark

Before installing HoWDe, ensure PySpark and Java are properly configured. For detailed setup instructions, please refer to the official PySpark Installation Guidelines.

Installation Note:
PySpark may raise a Py4JJavaError if Java or Spark is not properly configured. We recommend checking the Debugging PySpark and Py4JJavaError Guidelines.

Compatibility Note:
Once PySpark/Java is correctly configured, HoWDe runs consistently across macOS, Ubuntu, and Windows. The following environments have been tested:

  • Python 3.9 + PySpark 3.3 + Java 20.0
  • Python 3.12 + PySpark 4.0 + Java 17.0

2. Install HoWDe

Once PySpark is installed and configured, you can install HoWDe via pip:

pip install HoWDe

Usage

The core function of the HoWDe package is HoWDe_labelling, which performs the detection of home and work locations.

def HoWDe_labelling(
    input_data,
    edit_config_default=None,
    range_window_home=28,
    range_window_work=42,
    C_hours=0.4,
    C_days_H=0.4,
    C_days_W=0.5,
    f_hours_H=0.7,
    f_hours_W=0.4,
    f_days_W=0.6,
    output_format="stop",
    verbose=False,
):
    """
    Perform Home and Work Detection (HoWDe)
    """

📥 Input Data

HoWDe expects the input to be a PySpark DataFrame containing one row per user stop, with the following columns:

| Column | Type | Description |
| --- | --- | --- |
| useruuid | str or int | Unique user identifier. |
| loc | str or int | Stop location ID (unique per useruuid). |
| start | long | Start time of the stop (Unix timestamp). |
| end | long | End time of the stop (Unix timestamp). |
| tz_hour_start, tz_minute_start | int | Optional. Time zone offsets (hours and minutes) used to convert UTC timestamps to local time, if applicable. |
| country | str | Optional. Country code; if not provided, the default label "GL0B" is assigned. |

⚠️ Avoid using -1 to label meaningful stops, as stops with loc = -1 are dropped, following the Infostop convention.

Example

+---------+-----+-------------+-------------+---------------+----------------+---------+
| useruuid| loc | start       | end         | tz_hour_start | tz_minute_start| country |
+---------+-----+-------------+-------------+---------------+----------------+---------+
| 1001    |  1 | 1704031200  | 1704034800  | 1             | 0              | DK      |
| 1001    |  2 | 1704056400  | 1704060000  | 1             | 0              | DK      |
+---------+-----+-------------+-------------+---------------+----------------+---------+

💡 Scalability Tip: This package involves heavy computations (e.g., window functions, UDFs). To ensure efficient parallel processing, use df.repartition("useruuid") to distribute data across partitions evenly. This reduces memory bottlenecks and improves resource utilization.

⚙️ Key Parameters

| Parameter | Type | Description | Suggested value [range] |
| --- | --- | --- | --- |
| range_window_home | int or list | Sliding window size (in days) used to detect home locations. | 28 [14-112] |
| range_window_work | int or list | Sliding window size (in days) used to detect work locations. | 42 [14-112] |
| C_hours | float or list | Minimum fraction of night/business hourly bins with data in a day. | 0.4 [0.2-0.9] |
| C_days_H | float or list | Minimum fraction of days with data in a home-detection window. | 0.4 [0.1-0.6] |
| C_days_W | float or list | Minimum fraction of days with data in a work-detection window. | 0.5 [0.4-0.6] |
| f_hours_H | float or list | Minimum average fraction of night hourly bins (across days in the window) required for a location to qualify as Home. | 0.7 [0.5-0.9] |
| f_hours_W | float or list | Minimum average fraction of business hourly bins (across days in the window) required for a location to qualify as Work. | 0.4 [0.4-0.6] |
| f_days_W | float or list | Minimum fraction of days within the window a location must be visited to qualify as Work. | 0.6 [0.5-0.8] |

All parameters listed above can also be provided as lists to explore multiple configurations in a single run.

💡 Tuning Tip: When adjusting detection parameters, start by refining the temporal coverage filters C_days_H and C_days_W to match the characteristics of your data. Once these are well aligned, tune the estimation thresholds f_hours_H, f_hours_W, and f_days_W according to the specifics of your case study. These estimation thresholds play a major role in determining how strictly the algorithm identifies consistent home and work locations.

While we provide recommended parameter ranges to guide your exploration, the hard-coded limits in howde/config.py are intentionally more relaxed: they simply prevent nonsensical values. Inputs falling outside these hard limits will raise an error.

🔧 Other Parameters

  • edit_config_default (dict, optional): Optional dictionary that allows overriding the default settings in howde/config.py to fine-tune preprocessing and detection behavior.
    The dictionary should include parameters:

    • is_time_local — interpret timestamps as local time (True) or UTC (False)
    • min_stop_t — minimum stop duration (seconds)
    • start_hour_day, end_hour_day — hours used for home detection
    • start_hour_work, end_hour_work — hours used for work detection
    • data_for_predict — use only past data for estimation
  • output_format (str): If "stop", returns stop-level data with a location_type column and one row per stop. If "change", returns a compact DataFrame with one row per day on which the home or work location changes.

  • verbose (bool): If True, reports processing steps.
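As an illustration, an override dictionary might look like the following. The key names are those listed above; all values here are purely illustrative and hypothetical (the actual defaults live in howde/config.py and should be checked there):

```python
# Hypothetical edit_config_default override; values are illustrative only.
my_config = {
    "is_time_local": True,     # timestamps are already in local time
    "min_stop_t": 300,         # drop stops shorter than 300 seconds (illustrative)
    "start_hour_day": 22,      # night window for home detection (illustrative)
    "end_hour_day": 7,
    "start_hour_work": 9,      # business window for work detection (illustrative)
    "end_hour_work": 17,
    "data_for_predict": False, # use the full sliding window, not only past data
}

# Passed to the main entry point as:
# labeled_data = HoWDe_labelling(input_data, edit_config_default=my_config)
```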

📤 Returns

If a single parameter configuration is used, the function returns a PySpark DataFrame with three additional columns:

  • detect_H_loc The location ID (loc) identified as Home. Assigned if the location satisfies all filtering criteria. As such, it represents a day-level assessment, taking into account observations within a sliding window of t ± range_window_home / 2 days.
  • detect_W_loc The location ID (loc) identified as Work. Assigned if the location satisfies all filtering criteria. As such, it represents a day-level assessment, taking into account observations within a sliding window of t ± range_window_work / 2 days.
  • location_type Indicates the detected location type for each stop ('H' for Home, 'W' for Work, or 'O' for Other), based on matching the stop location to the inferred home/work labels.

If multiple parameter configurations are provided (as lists), the function returns a list of dictionaries, each with keys:

  • configs: including the configuration used
  • res: including the resulting labeled PySpark DataFrame (as described above)

Example Usage

from pyspark.sql import SparkSession
from howde import HoWDe_labelling

# Initialize Spark session
spark = SparkSession.builder.appName('HoWDeApp').getOrCreate()

# Load your stop location data
input_data = spark.read.parquet('path_to_your_data.parquet')

# Run HoWDe labelling
labeled_data = HoWDe_labelling(
    input_data,
    range_window_home=28,
    range_window_work=42,
    C_hours=0.4,
    C_days_H=0.4,
    C_days_W=0.5,
    f_hours_H=0.7,
    f_hours_W=0.4,
    f_days_W=0.6,
    output_format="stop",
    verbose=False,
)

# Show the results
labeled_data.show()

See more examples at /tutorials

Data

Anonymized stop location data with true home and work labels will be available at:

De Sojo Caso, Silvia; Lucchini, Lorenzo; Alessandretti, Laura (2025). Benchmark datasets for home and work location detection: stop sequences and annotated labels. Technical University of Denmark. Dataset. https://doi.org/10.11583/DTU.28846325

License

This project is licensed under the MIT License. See the License file for details.
