
Synthetic multilingual accommodation review data generator for Hack4Her travel-safety prototypes.


Hack4Her Mock Accommodation Reviews

This repo contains a dependency-free Python generator for synthetic Booking.com-style accommodation reviews, built for the Hack4Her challenge theme of women's safety while travelling.

The generated data is mock data only. Reviews, properties, labels, and coordinates are synthetic and must not be interpreted as real Booking.com customer reviews or real safety ratings for any location.

Generated Files

The default 1k balanced dataset has already been generated:

  • data/mock_reviews_balanced_1000.csv
  • data/mock_reviews_balanced_1000.jsonl
  • data/mock_reviews_balanced_1000.summary.json
  • data/mock_review_source_context_pool_10000.csv
  • data/mock_review_source_context_pool_10000.jsonl

Additional 1k scenario datasets are available in:

  • data/scenarios/
  • data/random/

Larger 10k scenario datasets are available in:

  • data/scenarios_10k/
  • data/random_10k/

Pre-generated participant-ready starter packs are available in:

  • data/starter_1000/
  • data/starter_10000/

New generated outputs default to data_output_generated/, which is ignored by git.

The dataset includes multilingual reviews in English, Spanish, French, German, Dutch, Italian, Portuguese, and Arabic.
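
As a quick sanity check on the language mix, the JSONL output can be read with the standard library alone. The sketch below is illustrative and not part of the package: count_languages is a hypothetical helper, the sample rows are made up, and the language codes "en" and "es" are an assumption about how the language column is encoded.

```python
import json
from collections import Counter
from io import StringIO


def count_languages(lines):
    """Count reviews per language from an iterable of JSONL lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        # "language" is listed under Useful Columns; the fallback is an assumption.
        counts[record.get("language", "unknown")] += 1
    return counts


# In-memory stand-in for a generated .jsonl file (hypothetical rows and codes).
sample = StringIO(
    '{"review_text": "Great stay", "language": "en"}\n'
    '{"review_text": "Muy tranquilo", "language": "es"}\n'
    '{"review_text": "Safe area", "language": "en"}\n'
)
print(count_languages(sample))
```

To run it against a real file, pass an open file handle instead of the StringIO object, for example open("data/mock_reviews_balanced_1000.jsonl", encoding="utf-8").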

Run

For detailed usage, see docs/USAGE.md. For PyPI publishing, see docs/PUBLISHING.md.

Installable Package

After the package is published to PyPI:

python -m pip install hack4her-review-data
hack4her-data --starter-pack --records 1000

For the Rich/Typer visual terminal:

python -m pip install "hack4her-review-data[cli]"
hack4her-data-cli

Participant Start Point

Use starter packs when teams need data to begin building without seeing organizer labels.

Fancy terminal UI:

macOS/Linux:

python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements-cli.txt
python3 scripts/hack4her_cli.py

Windows PowerShell:

py -m venv .venv
.\.venv\Scripts\Activate.ps1
pip install -r requirements-cli.txt
python scripts\hack4her_cli.py

The fancy CLI opens a Booking.com Hack4Her branded terminal menu where teams select the dataset type, record count, output format, and output folder. Outputs from the visual menu automatically hide organizer/evaluation labels in the main dataset and add a separate 10% labeled golden sample for validation or scoring.

The interface is built on Rich/Typer and runs cross-platform. It shows an animated Booking.com header with smaller Hack4Her text in pink, scenario safety-mix previews, output-folder checks, generation-plan panels, written-file summaries, and animated progress bars. The dependency-free script below remains available for teams that want to use only the Python standard library.

Direct fancy CLI commands also work:

python3 scripts/hack4her_cli.py menu
python3 scripts/hack4her_cli.py doctor
python3 scripts/hack4her_cli.py starter --records 1000
python3 scripts/hack4her_cli.py scenarios

Generate participant-ready CSV files for all deterministic scenarios:

python3 scripts/generate_mock_reviews.py --starter-pack --records 1000

Choose any size from 1000 to 10000 in steps of 1000:

python3 scripts/generate_mock_reviews.py --starter-pack --records 5000
python3 scripts/generate_mock_reviews.py --starter-pack --records 10000

Starter packs default to:

  • data_output_generated/

Each starter pack contains one public CSV per scenario, one 10% labeled golden CSV per scenario, summaries, and a small README explaining how to choose a dataset.
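
Once a starter pack has been written, teams may want to separate the public CSVs from the labeled golden ones. The sketch below is one way to do that with pathlib. It assumes the golden files in a starter pack end in _golden_10pct.csv, the suffix this README gives for --public output; split_starter_pack and the file names in the demo are hypothetical.

```python
import tempfile
from pathlib import Path

# Assumption: starter-pack golden files use the same suffix as --public output.
GOLDEN_SUFFIX = "_golden_10pct.csv"


def split_starter_pack(folder):
    """Return (public_csvs, golden_csvs) as sorted lists of file names."""
    public, golden = [], []
    for path in sorted(Path(folder).glob("*.csv")):
        target = golden if path.name.endswith(GOLDEN_SUFFIX) else public
        target.append(path.name)
    return public, golden


# Demo on a throwaway folder with hypothetical file names.
with tempfile.TemporaryDirectory() as demo_dir:
    for name in ("balanced.csv", "balanced_golden_10pct.csv", "README.txt"):
        (Path(demo_dir) / name).write_text("placeholder\n", encoding="utf-8")
    print(split_starter_pack(demo_dir))
```

Summaries and the README are skipped because only *.csv files are matched.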

Generate the default deterministic 1k balanced dataset:

python3 scripts/generate_mock_reviews.py

Generate a specific scenario:

python3 scripts/generate_mock_reviews.py --records 1000 --scenario safety_heavy --output-dir data_output_generated
python3 scripts/generate_mock_reviews.py --records 1000 --scenario location_focus --output-dir data_output_generated
python3 scripts/generate_mock_reviews.py --records 1000 --scenario host_focus --output-dir data_output_generated
python3 scripts/generate_mock_reviews.py --records 1000 --scenario stay_focus --output-dir data_output_generated
python3 scripts/generate_mock_reviews.py --records 1000 --scenario mostly_positive --output-dir data_output_generated

Generate all deterministic scenarios:

python3 scripts/generate_mock_reviews.py --all-scenarios --records 1000 --output-dir data_output_generated

Generate all deterministic 10k scenarios:

python3 scripts/generate_mock_reviews.py --all-scenarios --records 10000 --output-dir data_output_generated

Generate a deliberately random set. This changes on each run unless --seed is provided:

python3 scripts/generate_mock_reviews.py --scenario random --records 1000 --output-dir data_output_generated

Generate a 10k random set:

python3 scripts/generate_mock_reviews.py --scenario random --records 10000 --output-dir data_output_generated

Generate a participant-facing version without helper labels:

python3 scripts/generate_mock_reviews.py --records 1000 --scenario balanced --public --format csv --output-dir data_output_generated

With --public, the full main dataset hides organizer labels and the script also writes a _golden_10pct.csv file with labels for 10% of rows.
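
The golden file can be matched back to the public rows to build a small labeled validation set. The following is a minimal sketch, not the organizers' scoring method. It assumes both files share an identifier column, called review_id here purely for illustration; check the actual CSV header before relying on it. The label column names come from the Useful Columns list.

```python
import csv
from io import StringIO

# Helper labels named under "Useful Columns".
LABEL_COLUMNS = (
    "is_safety_related",
    "safety_category",
    "safety_concern_level",
    "safety_signal",
)


def load_rows(handle):
    """Read a CSV file handle into a list of dicts keyed by column name."""
    return list(csv.DictReader(handle))


def attach_golden_labels(public_rows, golden_rows, key="review_id"):
    """Return public rows that have a golden match, with label columns added.

    `key` is a hypothetical join column; replace it with the real identifier.
    """
    golden_by_key = {row[key]: row for row in golden_rows}
    labeled = []
    for row in public_rows:
        match = golden_by_key.get(row[key])
        if match is None:
            continue  # most public rows have no golden label
        labels = {col: match[col] for col in LABEL_COLUMNS if col in match}
        labeled.append({**row, **labels})
    return labeled


# In-memory stand-ins for the public and golden CSV files (made-up rows).
public_csv = StringIO("review_id,review_text\nr1,Quiet street\nr2,Broken lock\n")
golden_csv = StringIO("review_id,is_safety_related\nr2,true\n")
print(attach_golden_labels(load_rows(public_csv), load_rows(golden_csv)))
```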

Reproducibility

Deterministic scenarios use a stable 10k source context pool and a default seed of 20260522, so everyone running the same command gets the same records. The normal record choices are 1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, and 10000. The random scenario intentionally uses a fresh random seed unless you pass --seed.
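
The idea behind the fixed seed can be illustrated with Python's random module. This is a sketch of the general technique, not the generator's actual code: pick_indices is a hypothetical helper that draws row positions from a pool, and only the seed value, the pool size, and the record choices are taken from this README.

```python
import random

DEFAULT_SEED = 20260522  # default seed stated in this README
RECORD_CHOICES = tuple(range(1000, 10001, 1000))  # 1000, 2000, ..., 10000


def pick_indices(pool_size, records, seed=DEFAULT_SEED):
    """Pick `records` distinct positions from a pool of `pool_size` rows.

    The same seed always yields the same positions. Passing seed=None
    seeds from system entropy, so each run differs.
    """
    if records not in RECORD_CHOICES:
        raise ValueError(f"records must be one of {RECORD_CHOICES}")
    rng = random.Random(seed)  # private RNG, leaves the global state untouched
    return rng.sample(range(pool_size), records)


first = pick_indices(10000, 1000)
second = pick_indices(10000, 1000)
print(first == second)  # same seed, same selection
```

Using a private random.Random instance rather than random.seed keeps the result independent of any other code that touches the global generator.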

Use --write-source-pool to write the synthetic 10k source context pool:

python3 scripts/generate_mock_reviews.py --records 1000 --scenario balanced --write-source-pool --output-dir data_output_generated

Scenarios

  • balanced: mixed travel reviews with a visible safety signal.
  • safety_heavy: many safety-related reviews across location, host, and stay.
  • location_focus: safety around neighborhood, route, entrance, or transit.
  • host_focus: host conduct, check-in conduct, and support response.
  • stay_focus: room, lock, access, privacy, and on-property safety concerns.
  • mostly_positive: mostly normal or positive reviews with sparse safety concerns.
  • random: non-deterministic topic mix for surprise testing.
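
For teams that script their data setup, the scenario names above can be turned into command lines programmatically. The sketch below only builds argument lists from flags shown in this README (--records, --scenario, --output-dir, --seed); build_command is a hypothetical helper, and passing an integer after --seed is an assumption about that flag's form.

```python
import shlex
import sys

DETERMINISTIC_SCENARIOS = (
    "balanced",
    "safety_heavy",
    "location_focus",
    "host_focus",
    "stay_focus",
    "mostly_positive",
)


def build_command(scenario, records=1000, output_dir="data_output_generated", seed=None):
    """Build the argument list for one generator run (nothing is executed)."""
    cmd = [
        sys.executable,
        "scripts/generate_mock_reviews.py",
        "--records", str(records),
        "--scenario", scenario,
        "--output-dir", output_dir,
    ]
    if seed is not None:
        cmd += ["--seed", str(seed)]  # assumed form: --seed <integer>
    return cmd


for name in DETERMINISTIC_SCENARIOS:
    print(shlex.join(build_command(name)))
```

Each list can be passed to subprocess.run(cmd, check=True) from the repository root. For the deterministic scenarios, the --all-scenarios flag already covers the same ground in one call.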

Useful Columns

  • review_text, review_title, language, rating: primary participant-facing review fields.
  • city, country, latitude, longitude, area_type: useful for map prototypes.
  • is_safety_related, safety_category, safety_concern_level, safety_signal: helper labels for testing or evaluation.
  • topic, sentiment, labels: additional organizer-facing metadata.
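
As an example of how these columns might feed a map prototype, the sketch below counts safety-related reviews per city from labeled rows. It is illustrative only: safety_counts_by_city is a hypothetical helper, the sample rows and place names are made up, and the set of truthy strings is a guess at how is_safety_related is encoded in the CSV.

```python
import csv
from collections import defaultdict
from io import StringIO

# Assumption: the boolean label is stored as one of these strings.
TRUTHY = {"1", "true", "yes"}


def safety_counts_by_city(rows):
    """Count safety-related reviews per (city, country) pair."""
    counts = defaultdict(int)
    for row in rows:
        flag = str(row.get("is_safety_related", "")).strip().lower()
        if flag in TRUTHY:
            counts[(row["city"], row["country"])] += 1
    return dict(counts)


# In-memory stand-in for a labeled CSV (made-up rows).
sample = StringIO(
    "city,country,latitude,longitude,is_safety_related\n"
    "Exampleville,Nowhereland,10.0,20.0,true\n"
    "Exampleville,Nowhereland,10.1,20.1,false\n"
    "Sampletown,Nowhereland,11.0,21.0,True\n"
)
print(safety_counts_by_city(csv.DictReader(sample)))
```

Because the public dataset hides these labels, this only applies to the golden sample or to organizer-facing files.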


Download files


Source Distribution

hack4her_review_data-0.1.0.tar.gz (48.1 kB)

Uploaded Source

Built Distribution


hack4her_review_data-0.1.0-py3-none-any.whl (44.1 kB)

Uploaded Python 3

File details

Details for the file hack4her_review_data-0.1.0.tar.gz.

File metadata

  • Download URL: hack4her_review_data-0.1.0.tar.gz
  • Size: 48.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for hack4her_review_data-0.1.0.tar.gz:

  • SHA256: 3128a17d7d9d5d5b88119dbfb153ea8be897d5817da53607d9ae1c7a8e6bba69
  • MD5: a985263704bb3587c361367ce62faeff
  • BLAKE2b-256: f4be4798681cea0ba5fad2ed9314a0da50f37e7676245595fb4ee1b08692c4fa


Provenance

The following attestation bundles were made for hack4her_review_data-0.1.0.tar.gz:

Publisher: publish.yml on iflashlord/hack4her-review-data


File details

Details for the file hack4her_review_data-0.1.0-py3-none-any.whl.

File hashes

Hashes for hack4her_review_data-0.1.0-py3-none-any.whl:

  • SHA256: 6345d25d79604cea51af6b977a015b3e3c62c6bc65ce1f11344df4a6a52b615c
  • MD5: aa66c48c5b2fc69dc2689b308fe33244
  • BLAKE2b-256: 0598f3ed5998b4da16c09ac27ab02a1a69e149ca751dff76beb6f46c9f1442b5


Provenance

The following attestation bundles were made for hack4her_review_data-0.1.0-py3-none-any.whl:

Publisher: publish.yml on iflashlord/hack4her-review-data

