
Create a report for mobility data with differential privacy guarantees.


dp_mobility_report: A Python package to create a mobility report with differential privacy (DP) guarantees, especially for urban human mobility data.

Quickstart

Install

pip install dp-mobility-report

or from GitHub:

pip install git+https://github.com/FreeMoveProject/dp_mobility_report

Data preparation

df:

  • A pandas DataFrame.

  • Expected columns: user ID uid, trip ID tid, timestamp datetime, and latitude and longitude lat and lng in CRS EPSG:4326. The datetime column is expected to hold datetime-like strings, e.g., in the format yyyy-mm-dd hh:mm:ss. If datetime contains int values instead, they are interpreted as sequence positions, which supports datasets that consist only of sequences without timestamps. (This format closely follows the scikit-mobility TrajDataFrame.)

  • Here you can find an example dataset.
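
A minimal sketch of bringing raw records into this column layout. The source column names (user, trip, time, latitude, longitude) and the sample values are hypothetical and only serve as an illustration:

```python
import pandas as pd

# Hypothetical raw records; the source column names are assumptions.
raw = pd.DataFrame(
    {
        "user": [1, 1, 2],
        "trip": [10, 10, 20],
        "time": ["2022-01-01 08:00:00", "2022-01-01 08:15:00", "2022-01-02 17:30:00"],
        "latitude": [52.52, 52.53, 52.50],
        "longitude": [13.40, 13.41, 13.35],
    }
)

# Rename to the column names expected by the report.
df = raw.rename(
    columns={
        "user": "uid",
        "trip": "tid",
        "time": "datetime",
        "latitude": "lat",
        "longitude": "lng",
    }
)

# Check that the timestamps parse as yyyy-mm-dd hh:mm:ss.
# This raises a ValueError on malformed values; the column itself stays a string.
pd.to_datetime(df["datetime"], format="%Y-%m-%d %H:%M:%S")

# Basic sanity checks for EPSG:4326 coordinates.
assert df["lat"].between(-90, 90).all()
assert df["lng"].between(-180, 180).all()
```

For a dataset without timestamps, the datetime column would instead hold integer sequence positions (e.g., 0, 1, 2 within each trip).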

tessellation:

  • A geopandas GeoDataFrame with polygons.

  • Expected columns: tile_id.

  • The tessellation is used for spatial aggregations of the data.

  • Here you can find an example tessellation.

  • If you don’t have a tessellation, you can use this code to create a tessellation.
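
As an illustration of what such a tessellation can look like, the following sketch builds a regular rectangular grid. It is not the code referred to above; the function names, the bounding box, the grid size, and the use of string tile IDs are all assumptions made for this example:

```python
def grid_cells(min_lng, min_lat, max_lng, max_lat, n_cols, n_rows):
    """Return (tile_id, (west, south, east, north)) for each cell of a regular grid."""
    width = (max_lng - min_lng) / n_cols
    height = (max_lat - min_lat) / n_rows
    cells = []
    for row in range(n_rows):
        for col in range(n_cols):
            west = min_lng + col * width
            south = min_lat + row * height
            tile_id = str(row * n_cols + col)
            cells.append((tile_id, (west, south, west + width, south + height)))
    return cells


def to_tessellation(cells):
    """Convert grid cells into a GeoDataFrame with a tile_id column (needs geopandas)."""
    import geopandas as gpd
    from shapely.geometry import box

    return gpd.GeoDataFrame(
        {"tile_id": [tile_id for tile_id, _ in cells]},
        geometry=[box(*bounds) for _, bounds in cells],
        crs="EPSG:4326",
    )


# Hypothetical bounding box (lng/lat in EPSG:4326), split into 4 x 2 tiles.
cells = grid_cells(13.0, 52.3, 13.8, 52.7, n_cols=4, n_rows=2)
```

Calling to_tessellation(cells) then yields a GeoDataFrame of polygons with a tile_id column. A regular grid is the simplest choice; administrative boundaries or other polygons work the same way as long as each has a tile_id.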

Create a DpMobilityReport

import pandas as pd
import geopandas as gpd
from dp_mobility_report import DpMobilityReport

df = pd.read_csv(
    "https://raw.githubusercontent.com/FreeMoveProject/dp_mobility_report/main/tests/test_files/test_data.csv"
)
tessellation = gpd.read_file(
    "https://raw.githubusercontent.com/FreeMoveProject/dp_mobility_report/main/tests/test_files/test_tessellation.geojson"
)

report = DpMobilityReport(df, tessellation, privacy_budget=10, max_trips_per_user=5)

report.to_file("my_mobility_report.html")

The parameter privacy_budget (in terms of epsilon-DP) determines how much noise is added to the data: the smaller the budget, the more noise. The budget is split between all analyses of the report. If the value is set to None, no noise (i.e., no privacy guarantee) is applied to the report.
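
To give an intuition for how the budget relates to the noise, here is a minimal sketch of the Laplace mechanism with an evenly split budget. This is not the package's implementation: the even split, the number of analyses, the sensitivity value, and all function names are assumptions for illustration only.

```python
import random


def split_budget(privacy_budget, n_analyses):
    """Evenly split a total epsilon across analyses (an assumed split)."""
    return privacy_budget / n_analyses


def laplace_noise(scale, rng):
    """Draw from Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_count(true_count, epsilon, sensitivity, rng):
    """Add Laplace noise with scale sensitivity / epsilon to a count."""
    return true_count + laplace_noise(sensitivity / epsilon, rng)


rng = random.Random(0)

# Assumed setup: total budget 10 shared by 5 analyses, and a count to which
# each user contributes at most 5 trips (sensitivity 5).
epsilon_per_analysis = split_budget(10, 5)
print(noisy_count(1000, epsilon_per_analysis, sensitivity=5, rng=rng))
```

The noise scale is sensitivity / epsilon, so halving the budget of an analysis doubles the expected magnitude of its noise, and adding more analyses to the same total budget makes each one noisier.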

The parameter max_trips_per_user specifies the maximum number of trips a user can contribute to the dataset. If a user has more trips, a random sample of max_trips_per_user trips is drawn. If the value is set to None, the full dataset is used. Note that deriving the maximum trips per user from the data violates the differential privacy guarantee. Thus, None should only be used in combination with privacy_budget=None.
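
The effect of this sampling can be sketched with plain pandas. The helper below is hypothetical and only mirrors the behavior described above; the package's own sampling may be implemented differently:

```python
import pandas as pd


def limit_trips_per_user(df, max_trips_per_user, seed=None):
    """Keep all records of at most max_trips_per_user randomly chosen trips per user."""
    # One row per (user, trip), in random order.
    trips = df[["uid", "tid"]].drop_duplicates().sample(frac=1, random_state=seed)
    # After shuffling, the first k trips of each user form a random sample.
    keep = trips[trips.groupby("uid").cumcount() < max_trips_per_user]
    return df.merge(keep, on=["uid", "tid"], how="inner")


# Toy data: user 1 has three trips, user 2 has one trip (two records each).
df = pd.DataFrame(
    {
        "uid": [1, 1, 1, 1, 1, 1, 2, 2],
        "tid": [10, 10, 11, 11, 12, 12, 20, 20],
    }
)
sampled = limit_trips_per_user(df, max_trips_per_user=2, seed=42)
trips_per_user = sampled.groupby("uid")["tid"].nunique()
```

Fixing max_trips_per_user in advance, rather than reading it off the data, is what bounds each user's contribution independently of the dataset.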

Please refer to the documentation for information on further parameters. Here you can find information on the analyses of the report.

Example HTMLs can be found in the examples folder.

Create a BenchmarkReport

A benchmark report evaluates the similarity of two (differentially private) mobility reports from one or two mobility datasets. This can be based on two datasets (df_base and df_alternative) or on one dataset (df_base) with different privacy settings. The arguments df, privacy_budget, user_privacy, max_trips_per_user and budget_split can differ between the two reports and are set with the corresponding suffix _base or _alternative. The other arguments are the same for both reports. For the evaluation, similarity measures (namely the (mean) absolute percentage error (PE), Jensen-Shannon divergence (JSD), Kullback-Leibler divergence (KLD), and the earth mover’s distance (EMD)) are computed to quantify the statistical similarity for each analysis. The evaluation, i.e., the benchmark report, is generated as an HTML file using the .to_file() method.
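
To illustrate what two of these measures express, the following sketch computes KLD and JSD for two small histograms. It uses the textbook definitions with a base-2 logarithm; the package may use a different log base or a distance (square-root) variant, and the function names and counts here are made up for the example:

```python
import math


def normalize(counts):
    """Turn non-negative counts into a probability distribution."""
    total = sum(counts)
    return [c / total for c in counts]


def kld(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits; inf if q is 0 where p is not."""
    result = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0:
            continue
        if q_i == 0:
            return math.inf
        result += p_i * math.log2(p_i / q_i)
    return result


def jsd(p, q):
    """Jensen-Shannon divergence in bits: symmetric and bounded by [0, 1]."""
    m = [(p_i + q_i) / 2 for p_i, q_i in zip(p, q)]
    return 0.5 * kld(p, m) + 0.5 * kld(q, m)


# Hypothetical histogram counts of the same analysis in two reports.
base = normalize([40, 30, 20, 10])
alternative = normalize([35, 30, 25, 10])
print(jsd(base, alternative))
```

A JSD of 0 means the two distributions are identical, while 1 (in bits) means they do not overlap at all. KLD is asymmetric and becomes infinite when the second distribution has an empty bin that the first one fills, which is one reason a bounded measure such as JSD can be preferable for noisy histograms.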

Benchmark of two different datasets

This example creates a benchmark report with similarity measures for two mobility datasets, called base and alternative in the following. This is intended to compare different datasets with the same or no privacy budget.

import pandas as pd
import geopandas as gpd
from dp_mobility_report import BenchmarkReport

# -- insert paths --
df_base = pd.read_csv("mobility_dataset_base.csv")
df_alternative = pd.read_csv("mobility_dataset_alternative.csv")
tessellation = gpd.read_file("tessellation.gpkg")

benchmark_report = BenchmarkReport(
    df_base=df_base, tessellation=tessellation, df_alternative=df_alternative
)

# Dictionary containing the similarity measures for each analysis
similarity_measures = benchmark_report.similarity_measures
# The measure selection indicates which similarity measure
# (e.g. KLD, JSD, EMD, PE) has been selected for each analysis
measure_selection = benchmark_report.measure_selection

# If you do not want to access the selection of similarity measures
# but e.g. the Jensen-Shannon divergence for all analyses:
jsd = benchmark_report.jsd

# benchmark_report.to_file("my_benchmark_mobility_report.html")

The parameter measure_selection specifies which similarity measures should be chosen for the similarity_measures dictionary that is an attribute of the BenchmarkReport. The default is set to a specific set of similarity measures for each analysis which can be accessed by dp_mobility_report.default_measure_selection(). The default of single analyses can be overwritten as shown in the following:

from dp_mobility_report import BenchmarkReport, default_measure_selection
from dp_mobility_report import constants as const

# print the default measure selection
print(default_measure_selection())

# change the default for visits_per_tile from EMD to JSD.
# For the other analyses, the default measure is retained
custom_measure_selection = {const.VISITS_PER_TILE: const.JSD}

benchmark_report = BenchmarkReport(
    df_base=df_base,
    tessellation=tessellation,
    df_alternative=df_alternative,
    measure_selection=custom_measure_selection,
)

Benchmark of the same dataset with different privacy settings

This example creates a BenchmarkReport with similarity measures for the same mobility dataset with different privacy settings (privacy_budget, user_privacy, max_trips_per_user and budget_split) to assess the utility loss of the privacy budget for the different analyses.

import pandas as pd
import geopandas as gpd
from dp_mobility_report import BenchmarkReport

# -- insert paths --
df_base = pd.read_csv("mobility_dataset_base.csv")
tessellation = gpd.read_file("tessellation.gpkg")

benchmark_report = BenchmarkReport(
    df_base=df_base,
    tessellation=tessellation,
    privacy_budget_base=None,
    privacy_budget_alternative=5,
    max_trips_per_user_base=None,
    max_trips_per_user_alternative=4,
)

similarity_measures = benchmark_report.similarity_measures

# benchmark_report.to_file("my_benchmark_mobility_report.html")

Please refer to the documentation for information on further parameters.

Examples

Berlin mobility data simulated using the DLR TAPAS Model: [Code used for Berlin]

Madrid CRTM survey data: [Code used for Madrid]

Beijing Geolife dataset: [Code used for Beijing]

Benchmark Report: [Code used for Benchmarkreport of Berlin]

(Here you find the code of the data preprocessing to obtain the needed format)

Citing

If you use dp-mobility-report, please cite the following paper:

@article{doi:10.1080/17489725.2022.2148008,
                author = {Alexandra Kapp and Saskia Nuñez von Voigt and Helena Mihaljević and Florian Tschorsch},
                title = {Towards mobility reports with user-level privacy},
                journal = {Journal of Location Based Services},
                volume = {17},
                number = {2},
                pages = {95-121},
                year  = {2023},
                publisher = {Taylor & Francis},
                doi = {10.1080/17489725.2022.2148008}
}

Credits

This package was highly inspired by the pandas-profiling/pandas-profiling and scikit-mobility packages.

This package was created with Cookiecutter and the audreyr/cookiecutter-pypackage project template.

This package was developed as part of the freemove project which is funded by the German Federal Ministry of Education and Research (BMBF).

History

0.2.10 (2024-01-03)

  • Fix to work with pandas 2.2.0rc0 update

0.2.9 (2023-08-17)

  • Fix to work with pandas 2.1.0rc0 update

0.2.8 (2023-04-03)

  • Bug fix: smape of trips per day

0.2.7 (2023-03-30)

  • Update requirements

0.2.6 (2023-03-24)

  • Bug fix: shape mismatch in similarity_measures for edge case (only counts in bin “inf”)

0.2.5 (2023-03-24)

  • Bug fix: compatibility with pandas >= 2.0 and pandas < 2.0

0.2.4 (2023-03-23)

  • Enhance HTML design

  • Include info texts for all analyses

  • Include documentation for differential privacy and an info box about DP in the report

  • Enhance documentation

  • Add option for subtitle in DpMobilityReport and BenchmarkReport to name the report.

0.2.3 (2023-02-13)

  • Bug fix: handle if no visit is within the tessellation

  • Bug fix: handle if no OD trip is within the tessellation

  • Bug fix: unify histogram bins rounding issue

0.2.2 (2023-02-01)

  • Bug fix: exclude user_time_delta if there is no user with at least two trips.

  • Bug fix: set max_trips_per_user correctly if user_privacy=False.

  • Enhancement: do not exclude jump_length and travel_time if no tessellation is given

0.2.1 (2023-01-24)

  • Bug fix: Correct range of scale for visits per time and tile map.

0.2.0 (2023-01-23)

  • Create a BenchmarkReport class that evaluates the similarity of two (differentially private) mobility reports from one or two mobility datasets and creates an HTML output similar to the DpMobilityReport.

0.1.8 (2023-01-16)

  • Refine handling of OD Analysis input data:
    • warn if there are no trips with more than a single record and exclude OD Analysis

    • use all trips for travel time and jump length computation instead of only trips inside tessellation.

0.1.7 (2023-01-10)

  • Restructuring of HTML headlines.

0.1.6 (2023-01-09)

  • Refactoring of template files.

0.1.5 (2022-12-12)

  • Remove scikit-mobility dependency and refactor od flow visualization.

0.1.4 (2022-12-07)

  • Remove Google Fonts from HTML.

0.1.3 (2022-12-05)

  • Handle FutureWarning of pandas.

0.1.2 (2022-11-24)

  • Enhanced documentation for all properties of DpMobilityReport class

0.1.1 (2022-10-27)

  • fix bug: prevent error “key trips not found” in trips_over_time if sum of trip_count is 0

0.1.0 (2022-10-21)

  • make tessellation an Optional parameter

  • allow DataFrames without timestamps but sequence numbering instead (i.e., integer for timestamp column)

  • allow to set seed for reproducible sampling of the dataset (according to max_trips_per_user)

0.0.8 (2022-10-20)

  • Fixes addressing deprecation warnings.

0.0.7 (2022-10-17)

  • parameter for a custom split of the privacy budget between different analyses

  • extend ‘analysis_selection’ to include single analyses instead of entire segments

  • parameter for ‘analysis_exclusion’ instead of selection

  • bug fix: include all possible categories for days and hour of days

  • bug fix: show correct percentage of outliers

  • show 95% confidence-interval instead of upper and lower bound

  • show privacy budget and confidence interval for each analysis

0.0.6 (2022-09-30)

  • Remove scaling of counts to match a consistent trip_count / record_count (from ds_statistics) in visits_per_tile, visits_per_time_tile and od_flows. Scaling was implemented to keep the report consistent, though it is removed for now as it introduces new issues.

  • Minor bug fixes in the visualization: outliers were not correctly converted into percentage.

0.0.5 (2022-08-26)

  • Bug fix: correct scaling of timewindow counts.

0.0.4 (2022-08-22)

  • Simplify naming: from MobilityDataReport to DpMobilityReport

  • Simplify import: from ‘from dp_mobility_report import md_report.MobilityDataReport’ to ‘from dp_mobility_report import DpMobilityReport’

  • Enhance documentation: change style and correctly include API reference.

0.0.3 (2022-07-22)

  • Fix broken link.

0.0.2 (2022-07-22)

  • First release to PyPI.

  • It includes all basic functionality, though still in alpha version and under development.

0.0.1 (2021-12-16)

  • First version used for evaluation in Alexandra Kapp, Saskia Nuñez von Voigt, Helena Mihaljević & Florian Tschorsch (2022) Towards mobility reports with user-level privacy, Journal of Location Based Services, DOI: 10.1080/17489725.2022.2148008.
