Synthetic multilingual accommodation review data generator for Hack4Her travel-safety prototypes.
Hack4Her Mock Accommodation Reviews
This repo contains a dependency-free Python generator for synthetic Booking.com-style accommodation reviews for the Hack4Her challenge theme: women's safety while travelling.
The generated data is mock data only. Reviews, properties, labels, and coordinates are synthetic and must not be interpreted as real Booking.com customer reviews or real safety ratings for any location.
Generated Files
The default 1k balanced dataset has already been generated:
data/mock_reviews_balanced_1000.csv
data/mock_reviews_balanced_1000.jsonl
data/mock_reviews_balanced_1000.summary.json
data/mock_review_source_context_pool_10000.csv
data/mock_review_source_context_pool_10000.jsonl
Additional 1k scenario datasets are available in:
data/scenarios/
data/random/
Larger 10k scenario datasets are available in:
data/scenarios_10k/
data/random_10k/
Pre-generated participant-ready starter packs are available in:
data/starter_1000/
data/starter_10000/
New generated outputs default to data_output_generated/, which is ignored by git.
The dataset includes multilingual reviews in English, Spanish, French, German, Dutch, Italian, Portuguese, and Arabic.
Run
For detailed usage, see docs/USAGE.md. For PyPI publishing, see docs/PUBLISHING.md.
Installable Package
After the package is published to PyPI:
python -m pip install hack4her-review-data
hack4her-data --starter-pack --records 1000
For the Rich/Typer visual terminal:
python -m pip install "hack4her-review-data[cli]"
hack4her-data-cli
Participant Start Point
Use starter packs when teams need data to begin building without seeing organizer labels.
Fancy terminal UI:
macOS/Linux:
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements-cli.txt
python3 scripts/hack4her_cli.py
Windows PowerShell:
py -m venv .venv
.\.venv\Scripts\Activate.ps1
pip install -r requirements-cli.txt
python scripts\hack4her_cli.py
The fancy CLI opens a Booking.com Hack4Her branded terminal menu where teams select the dataset type, record count, output format, and output folder. Outputs created from the visual menu automatically hide organizer and evaluation labels in the main dataset and add a separate 10% labeled golden sample for validation or scoring. The interface is built on Rich and Typer and runs cross-platform. It includes an animated Booking.com header, smaller Hack4Her text in pink, scenario safety-mix previews, output-folder checks, generation-plan panels, written-file summaries, and animated progress bars. The dependency-free script below remains available for teams that want to use only the Python standard library.
Direct fancy CLI commands also work:
python3 scripts/hack4her_cli.py menu
python3 scripts/hack4her_cli.py doctor
python3 scripts/hack4her_cli.py starter --records 1000
python3 scripts/hack4her_cli.py scenarios
Generate participant-ready CSV files for all deterministic scenarios:
python3 scripts/generate_mock_reviews.py --starter-pack --records 1000
Choose any size from 1000 to 10000 in steps of 1000:
python3 scripts/generate_mock_reviews.py --starter-pack --records 5000
python3 scripts/generate_mock_reviews.py --starter-pack --records 10000
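If the generator is driven from another Python script, the size rule above can be checked before the command is built. The following is a minimal sketch, not part of the package: the helper name starter_pack_command is hypothetical, and it only uses the script path and the --starter-pack and --records flags shown in this README. The generator itself may accept other values; this sketch simply enforces the documented choices.

```python
import sys

# Documented choices: 1000 to 10000 in steps of 1000.
ALLOWED_RECORDS = range(1000, 10001, 1000)


def starter_pack_command(records, python=sys.executable):
    """Return an argv list for a starter-pack run (hypothetical helper)."""
    if records not in ALLOWED_RECORDS:
        raise ValueError(
            f"records must be a multiple of 1000 between 1000 and 10000, got {records}"
        )
    return [
        python,
        "scripts/generate_mock_reviews.py",
        "--starter-pack",
        "--records",
        str(records),
    ]


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch has no side effects.
    print(" ".join(starter_pack_command(5000)))
```

The returned list can be passed to subprocess.run from the repository root when an actual run is wanted.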
Starter packs default to:
data_output_generated/
Each starter pack contains one public CSV per scenario, one 10% labeled golden CSV per scenario, summaries, and a small README explaining how to choose a dataset.
Generate the default deterministic 1k balanced dataset:
python3 scripts/generate_mock_reviews.py
Generate a specific scenario:
python3 scripts/generate_mock_reviews.py --records 1000 --scenario safety_heavy --output-dir data_output_generated
python3 scripts/generate_mock_reviews.py --records 1000 --scenario location_focus --output-dir data_output_generated
python3 scripts/generate_mock_reviews.py --records 1000 --scenario host_focus --output-dir data_output_generated
python3 scripts/generate_mock_reviews.py --records 1000 --scenario stay_focus --output-dir data_output_generated
python3 scripts/generate_mock_reviews.py --records 1000 --scenario mostly_positive --output-dir data_output_generated
Generate all deterministic scenarios:
python3 scripts/generate_mock_reviews.py --all-scenarios --records 1000 --output-dir data_output_generated
Generate all deterministic 10k scenarios:
python3 scripts/generate_mock_reviews.py --all-scenarios --records 10000 --output-dir data_output_generated
Generate a deliberately random set. This changes on each run unless --seed is provided:
python3 scripts/generate_mock_reviews.py --scenario random --records 1000 --output-dir data_output_generated
Generate a 10k random set:
python3 scripts/generate_mock_reviews.py --scenario random --records 10000 --output-dir data_output_generated
Generate a participant-facing version without helper labels:
python3 scripts/generate_mock_reviews.py --records 1000 --scenario balanced --public --format csv --output-dir data_output_generated
With --public, the full main dataset hides organizer labels and the script also writes a _golden_10pct.csv file with labels for 10% of rows.
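One way a team might use the golden sample is to score its own predictions against the labeled rows only. The sketch below is an illustration under stated assumptions: the join column review_id is hypothetical (this README does not name an identifier column), and the label is assumed to be stored as text such as True or False. Only the is_safety_related column name comes from this README.

```python
# Text values treated as a positive label (assumed encoding).
TRUTHY = {"true", "1", "yes"}


def accuracy_on_golden(predictions, golden_rows,
                       key="review_id", label="is_safety_related"):
    """Share of golden rows where the predicted flag matches the label.

    predictions: dict mapping row id -> bool (your model's output).
    golden_rows: list of dicts, e.g. from csv.DictReader on the golden CSV.
    Rows without a prediction are skipped; returns None if nothing overlaps.
    """
    scored = []
    for row in golden_rows:
        row_id = row[key]
        if row_id not in predictions:
            continue
        truth = row[label].strip().lower() in TRUTHY
        scored.append(predictions[row_id] == truth)
    if not scored:
        return None
    return sum(scored) / len(scored)


if __name__ == "__main__":
    # Toy rows for illustration only; not taken from the real files.
    golden = [
        {"review_id": "r1", "is_safety_related": "True"},
        {"review_id": "r2", "is_safety_related": "False"},
        {"review_id": "r3", "is_safety_related": "True"},
        {"review_id": "r4", "is_safety_related": "False"},
    ]
    preds = {"r1": True, "r2": False, "r3": False, "r4": False, "r9": True}
    print(accuracy_on_golden(preds, golden))
```

Predictions for rows outside the golden sample, such as r9 here, are ignored because no label is available for them.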
Reproducibility
Deterministic scenarios use a stable 10k source context pool and a default seed of 20260522, so everyone running the same command gets the same records. The normal record choices are 1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, and 10000. The random scenario intentionally uses a fresh random seed unless you pass --seed.
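The idea behind seeded reproducibility can be illustrated with the standard library alone. This is a generic sketch, not the generator's actual implementation: the function sample_indices is hypothetical, and only the seed value 20260522 and the 10k pool size come from this README.

```python
import random

DEFAULT_SEED = 20260522  # default seed quoted in this README


def sample_indices(pool_size, records, seed=DEFAULT_SEED):
    """Pick `records` distinct row indices from a pool, reproducibly."""
    rng = random.Random(seed)  # private RNG, leaves global random state alone
    return rng.sample(range(pool_size), records)


if __name__ == "__main__":
    first = sample_indices(10_000, 1000)
    second = sample_indices(10_000, 1000)
    # Same seed and same arguments give the same selection on every call.
    print(first == second)
```

Using a dedicated random.Random instance rather than the module-level functions keeps the result independent of any other code that also draws random numbers.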
Use --write-source-pool to write the synthetic 10k source context pool:
python3 scripts/generate_mock_reviews.py --records 1000 --scenario balanced --write-source-pool --output-dir data_output_generated
Scenarios
balanced: mixed travel reviews with a visible safety signal.
safety_heavy: many safety-related reviews across location, host, and stay.
location_focus: safety around neighborhood, route, entrance, or transit.
host_focus: host conduct, check-in conduct, and support response.
stay_focus: room, lock, access, privacy, and on-property safety concerns.
mostly_positive: mostly normal or positive reviews with sparse safety concerns.
random: non-deterministic topic mix for surprise testing.
Useful Columns
review_text, review_title, language, rating: primary participant-facing review fields.
city, country, latitude, longitude, area_type: useful for map prototypes.
is_safety_related, safety_category, safety_concern_level, safety_signal: helper labels for testing or evaluation.
topic, sentiment, labels: additional organizer-facing metadata.
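A labeled CSV with these columns can be explored with the standard library. The sketch below reads an inline sample so that it runs without any files. The three sample rows are made up for illustration, and the value encodings (two-letter language codes, a numeric rating, True/False text for is_safety_related) are assumptions; check a real file before relying on them.

```python
import csv
import io
from collections import Counter

# Made-up rows; column names come from this README, values are assumed.
SAMPLE_CSV = """review_text,language,rating,is_safety_related
"Quiet street, felt fine walking back at night.",en,8,False
"La cerradura de la puerta no funcionaba.",es,4,True
"Das Zimmer war sauber und hell.",de,9,False
"""

# Text values treated as a positive label (assumed encoding).
TRUTHY = {"true", "1", "yes"}


def load_rows(handle):
    """Read every CSV row into a dict keyed by column name."""
    return list(csv.DictReader(handle))


def is_flagged(row):
    """True when the helper label marks the review as safety related."""
    return row.get("is_safety_related", "").strip().lower() in TRUTHY


rows = load_rows(io.StringIO(SAMPLE_CSV))
flagged = [row for row in rows if is_flagged(row)]
by_language = Counter(row["language"] for row in rows)

if __name__ == "__main__":
    print(len(flagged), dict(by_language))
```

To read a generated file instead, open it with open(path, newline="", encoding="utf-8") and pass the handle to load_rows. Public files written with --public hide the helper labels, in which case is_flagged returns False for every row.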