Synthetic data generator for snail mutation survey

Snailz

snailz is a synthetic data generator that models a study of snails in the Pacific Northwest that are growing to unusual size as a result of exposure to pollution. The package generates fully reproducible datasets of varying sizes and statistical properties, and is intended primarily for classroom use. For example, an instructor can give each learner a unique dataset to analyze, while learners can test their analysis pipelines on datasets they generate themselves.

The Story

Years ago, logging companies dumped toxic waste in a remote region of Vancouver Island. As the containers leaked and the pollution spread, some snails in the region began growing unusually large. Your team is now collecting and analyzing specimens from affected regions to determine if exposure to pollution is responsible.

snailz generates three related sets of data:

  • Grids: the survey grids where pollution levels are measured.
  • Persons: the scientists conducting the study.
  • Samples: the snails collected from the survey sites.

Usage

  1. pip install snailz (or the equivalent command for your Python environment).
  2. snailz --help to see available commands.

To generate example data in a fresh directory:

# Create and activate Python virtual environment
$ uv venv
$ source .venv/bin/activate

# Install snailz and dependencies
$ uv pip install snailz

# Write default parameter values to the ./params.json file
$ snailz --defaults > params.json

# Generate all output files in the ./data directory
$ snailz --params params.json --outdir data

Parameters

snailz reads its controlling parameters from a JSON file, and can generate a file of default values as a starting point. The parameters, their purposes, and their default values are:

Name              Purpose                                     Default
clumsy_factor     personal effect on mass measurement         0.5
grid_size         width and height of (square) survey grids   11
locale            locale for person name generation           et_EE
num_grids         number of survey grids                      3
num_persons       number of persons                           5
num_samples       number of samples                           20
pollution_factor  pollution effect on mass                    0.3
precision         decimal places used to record masses        2
sample_date_max   maximum sample date                         2025-03-31
sample_date_min   minimum sample date                         2025-01-01
sample_mass_max   maximum sample mass                         1.5
sample_mass_min   minimum sample mass                         0.5
seed              random number generation seed               123456
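
One way to give each learner a unique dataset is to start from the default parameters and change only a few values, such as the seed. The sketch below is not part of snailz: override_params is a hypothetical helper, and it assumes the parameter file is a flat JSON object whose keys match the names in the table above. The example input is a shortened stand-in rather than the real output of snailz --defaults.

```python
import json


def override_params(defaults_text, **overrides):
    """Return new parameter JSON text with selected values replaced."""
    params = json.loads(defaults_text)
    # Reject misspelled names instead of silently adding new keys.
    unknown = sorted(set(overrides) - set(params))
    if unknown:
        raise KeyError(f"unknown parameter(s): {unknown}")
    params.update(overrides)
    return json.dumps(params, indent=2)


# Shortened stand-in for the text written by `snailz --defaults`.
defaults_text = '{"num_samples": 20, "seed": 123456}'
learner_text = override_params(defaults_text, seed=1001, num_samples=50)
```

The resulting text can be saved to a new file and passed to snailz with --params in place of params.json.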

Data Dictionary

All of the generated data is stored in CSV files.

Grids

The pollution readings for each survey grid are stored in a file Gnnnn.csv (e.g., G0003.csv). These CSV files do not have column headers; instead, each contains a square integer matrix of pollution readings. A typical file is:

0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,1,1,0,0,0,0
0,0,0,0,0,0,0,0,1,2,1,0,0,0,0
0,0,0,0,0,0,0,0,2,1,0,0,0,0,0
0,0,0,0,0,0,0,1,2,0,0,0,0,0,0
0,0,0,0,0,0,0,1,2,1,0,0,0,0,0
0,0,0,0,0,0,0,0,1,2,0,0,0,0,0
0,0,0,0,0,0,0,2,2,1,0,0,0,0,0
0,0,0,0,0,0,0,1,3,0,0,0,0,0,0
0,0,0,0,0,0,0,1,3,1,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
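
Because grid files have no header row, they can be read directly as a matrix of integers. The following is a minimal sketch using only the Python standard library. read_grid is a hypothetical helper, not a snailz function, and the 3x3 input is a small stand-in for a real grid file.

```python
import csv
import io


def read_grid(stream):
    """Read a headerless grid CSV into a list of integer rows."""
    grid = [[int(cell) for cell in row] for row in csv.reader(stream) if row]
    # Every row must be as long as the number of rows for the grid to be square.
    if any(len(row) != len(grid) for row in grid):
        raise ValueError("grid is not square")
    return grid


# A small 3x3 stand-in for a Gnnnn.csv file.
example = io.StringIO("0,1,0\n0,2,1\n0,0,3\n")
grid = read_grid(example)
total = sum(sum(row) for row in grid)  # total pollution over all cells
peak = max(max(row) for row in grid)   # highest single reading
```

To read a real file, pass an open file object instead, for example open("data/G0003.csv", newline=""), where the path is an assumption based on the --outdir data example above.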

Persons

persons.csv stores the scientists performing the study in CSV format (with column headers):

id   personal  family
P06  Artur     Aasmäe
P07  Katrin    Kool

Its fields are:

Field     Purpose        Properties
id        identifier     text, unique, required
personal  personal name  text, required
family    family name    text, required
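
As a sketch of how the person records might be loaded, the code below builds a lookup from identifier to full name and checks the uniqueness property listed above. read_persons is a hypothetical helper, and the code assumes the file is comma-separated with the header row id,personal,family.

```python
import csv
import io


def read_persons(stream):
    """Map each person id to a full name, enforcing unique ids."""
    persons = {}
    for row in csv.DictReader(stream):
        if row["id"] in persons:
            raise ValueError(f"duplicate person id: {row['id']}")
        persons[row["id"]] = f"{row['personal']} {row['family']}"
    return persons


# Stand-in for persons.csv using the two rows shown above.
example = io.StringIO("id,personal,family\nP06,Artur,Aasmäe\nP07,Katrin,Kool\n")
persons = read_persons(example)
```

When reading a real file, opening it with encoding="utf-8" is a reasonable assumption, since generated names can contain non-ASCII characters.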

Samples

samples.csv stores information about sampled snails in CSV format (with column headers):

sample_id  grid_id  x  y  person  when        mass
S0001      G0001    9  8  P0004   2025-01-16  1.02
S0002      G0001    8  9  P0005   2025-03-30  2.39

Its fields are:

Field      Purpose                   Properties
sample_id  specimen identifier       text, unique, required
grid_id    grid identifier           text, required
x          X coordinate in grid      integer, required
y          Y coordinate in grid      integer, required
person     who collected the sample  text, required
when       date sample collected     date, required
mass       sample weight in grams    real, required
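
The field types above suggest how each column might be converted when loading the file. The sketch below is one possible approach using the standard library. read_samples is a hypothetical helper, and the code assumes comma-separated values and ISO-format dates, as in the example rows.

```python
import csv
import io
from datetime import date


def read_samples(stream):
    """Read sample rows, converting each field to its documented type."""
    samples = []
    for row in csv.DictReader(stream):
        samples.append({
            "sample_id": row["sample_id"],
            "grid_id": row["grid_id"],
            "x": int(row["x"]),
            "y": int(row["y"]),
            "person": row["person"],
            "when": date.fromisoformat(row["when"]),
            "mass": float(row["mass"]),
        })
    return samples


# Stand-in for samples.csv using the two rows shown above.
example = io.StringIO(
    "sample_id,grid_id,x,y,person,when,mass\n"
    "S0001,G0001,9,8,P0004,2025-01-16,1.02\n"
    "S0002,G0001,8,9,P0005,2025-03-30,2.39\n"
)
samples = read_samples(example)
heaviest = max(samples, key=lambda s: s["mass"])
```

With typed values in hand, the records can be filtered by date, grouped by grid_id or person, or compared against the pollution grids.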

The output directory also contains a file called changes.json that records parameters used to alter data, such as the daily growth rate of snails and the ID of the clumsy scientist whose measurements have systematic errors.

Colophon

snailz was inspired by the Palmer Penguins dataset and by conversations with Rohan Alexander about his book Telling Stories with Data.

The snail logo was created by sunar.ko.

My thanks to everyone who built the tools this project relies on, including:

  • pydantic for storing and validating data (including parameters).
  • pytest and faker for testing.
  • ruff for checking the code.
  • uv for managing packages and the virtual environment.
