Synthetic data generator for snail mutation survey

Snailz

snailz is a synthetic data generator that models a study of Pacific Northwest snails that are growing unusually large because of exposure to pollution. The package generates fully reproducible datasets of varying sizes and statistical properties and is intended for classroom use. For example, an instructor can give each learner a unique dataset to analyze, while learners can test their analysis pipelines on datasets they generate themselves.

The Story

Years ago, logging companies dumped toxic waste in a remote region of Vancouver Island. As the containers leaked and the pollution spread, some snails in the region began growing unusually large. Your team is now collecting and analyzing specimens from affected regions to determine whether exposure to pollution is responsible.

snailz generates several related datasets:

  • Grids: the survey grids where pollution levels are measured.
  • Persons: the scientists conducting the study.
  • Samples: the snails collected from the survey sites.
  • Machines: the equipment used in the survey.
  • Ratings: the scientists' proficiency ratings with the machines.

Usage

To generate example data in a fresh directory:

# Create and activate Python virtual environment.
$ uv venv
$ source .venv/bin/activate

# Install snailz and dependencies.
$ uv pip install snailz

# Get help.
$ snailz --help

# Generate and display a dataset using the default parameters.
$ snailz --outdir -

# Write default parameter values to ./params.json for editing.
$ snailz --defaults > params.json

# Generate output with custom parameters in the ./data directory.
$ snailz --params params.json --outdir data

Parameters

snailz reads its controlling parameters from a JSON file and can generate a file of default parameter values as a starting point. The parameters, their purposes, and their default values are:

Name              Purpose                                    Default
clumsy_factor     personal effect on mass measurement        0.5
grid_gap          minimum spacing between grids (m)          1000.0
grid_size         width and height of (square) survey grids  11
grid_spacing      size of survey grid cell (m)               20
lat0              reference latitude of grids (deg)          48.8666632
lon0              reference longitude of grids (deg)         -124.1999992
locale            locale for person name generation          et_EE
num_grids         number of survey grids                     3
num_machines      number of pieces of laboratory equipment   5
num_persons       number of persons                          6
num_samples       number of samples                          20
pollution_factor  pollution effect on mass                   0.3
precision         decimal places used to record masses       2
sample_date       min/max sample dates (YYYY-MM-DD)          (2025-01-01, 2025-01-01)
sample_size       sample mass mean and std. dev. (g)         (50, 10)
seed              random number generation seed              123456
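
Because the parameters live in a JSON file, an instructor could prepare one parameter file per learner by changing only the seed. The following is a minimal sketch, assuming the file is a flat JSON object keyed by the parameter names in the table above. The helper names learner_params and write_params and the file-naming scheme are hypothetical and not part of snailz.

```python
import json
from pathlib import Path


def learner_params(defaults: dict, seeds: list) -> list:
    """Return one copy of the defaults per seed, differing only in 'seed'."""
    return [{**defaults, "seed": seed} for seed in seeds]


def write_params(param_sets: list, outdir: Path) -> list:
    """Write each parameter set to its own JSON file and return the paths."""
    outdir.mkdir(parents=True, exist_ok=True)
    paths = []
    for params in param_sets:
        # Hypothetical naming scheme: one file per seed.
        path = outdir / f"params_{params['seed']}.json"
        path.write_text(json.dumps(params, indent=2))
        paths.append(path)
    return paths


# Illustrative subset of the defaults (assumed flat JSON layout).
defaults = {"num_grids": 3, "num_samples": 20, "seed": 123456}
param_sets = learner_params(defaults, [101, 102, 103])
```

Each file written by write_params could then be passed to snailz with the --params and --outdir options shown under Usage.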

Data Dictionary

All of the generated data is stored in CSV files and in a SQLite database.
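
The database's file name and table names are not listed here, so a reasonable first step is to ask SQLite what the file contains. The sketch below uses Python's standard sqlite3 module. The function name list_tables is hypothetical, and the in-memory database with a persons table merely stands in for the generated file.

```python
import sqlite3


def list_tables(connection: sqlite3.Connection) -> list:
    """Return the names of all tables in the database, sorted alphabetically."""
    rows = connection.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
    return [name for (name,) in rows]


# Stand-in for the generated database; replace ":memory:" with the real path.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE persons (person_id TEXT, family TEXT)")
tables = list_tables(connection)
print(tables)
```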

Grids

The pollution readings for each survey grid are stored in a file Gnnnn.csv (e.g., G0003.csv). These CSV files do not have column headers; instead, each contains a square integer matrix of pollution readings. A typical file is:

0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,1,1,0,0,0,0
0,0,0,0,0,0,0,0,1,2,1,0,0,0,0
0,0,0,0,0,0,0,0,2,1,0,0,0,0,0
0,0,0,0,0,0,0,1,2,0,0,0,0,0,0
0,0,0,0,0,0,0,1,2,1,0,0,0,0,0
0,0,0,0,0,0,0,0,1,2,0,0,0,0,0
0,0,0,0,0,0,0,2,2,1,0,0,0,0,0
0,0,0,0,0,0,0,1,3,0,0,0,0,0,0
0,0,0,0,0,0,0,1,3,1,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
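
Because these files have no header row, they can be read as plain rows of integers. The sketch below parses a small matrix and lists its non-zero cells. Treating the column index as x and the row index as y, both counted from zero, is an assumption made for illustration, since the orientation is not specified here. The names read_grid and polluted_cells are hypothetical.

```python
import csv
import io


def read_grid(stream) -> list:
    """Read a headerless CSV of integers into a list of rows."""
    return [[int(value) for value in row] for row in csv.reader(stream) if row]


def polluted_cells(grid: list) -> list:
    """Return (x, y, pollution) for every non-zero cell.

    Assumption: x is the column index and y is the row index, from zero.
    """
    return [
        (x, y, value)
        for y, row in enumerate(grid)
        for x, value in enumerate(row)
        if value > 0
    ]


# A 3x3 excerpt in the same format as a Gnnnn.csv file.
text = "0,0,0\n0,1,2\n0,0,3\n"
grid = read_grid(io.StringIO(text))
cells = polluted_cells(grid)
```

To read a real file, pass an open file object (opened with newline="") to read_grid in place of the io.StringIO wrapper.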

The pollution readings for polluted grid cells are also stored in tidy format in grids.csv:

grid_id  x  y  lat                lon                 pollution
G0001    1  3  48.86720218670499  -124.1997260797134  1
G0001    2  3  48.86720218670499  -124.1994529594268  1

Its fields are:

Field      Purpose                  Properties
grid_id    identifier               text, unique, required
x          X coordinate in grid     integer, required
y          Y coordinate in grid     integer, required
lat        latitude of grid cell    real, required
lon        longitude of grid cell   real, required
pollution  pollution at that point  integer, required
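
The two sample rows above are consistent with a simple flat-earth conversion from cell offsets to coordinates, using the default lat0, lon0, and grid_spacing and roughly 111,320 m per degree of latitude. This is inferred from those sample values only, not taken from the snailz source, so the sketch below is an illustration rather than the package's actual method. The function name cell_to_lat_lon and the constant are assumptions.

```python
import math

METERS_PER_DEGREE = 111_320.0  # assumed constant, inferred from the sample rows


def cell_to_lat_lon(x, y, lat0=48.8666632, lon0=-124.1999992, spacing=20.0):
    """Approximate the latitude and longitude of grid cell (x, y).

    Assumption: each step in y moves `spacing` meters north and each step
    in x moves `spacing` meters east of the reference point (lat0, lon0).
    """
    lat = lat0 + y * spacing / METERS_PER_DEGREE
    # A degree of longitude shrinks with the cosine of the latitude.
    lon = lon0 + x * spacing / (METERS_PER_DEGREE * math.cos(math.radians(lat0)))
    return lat, lon


lat, lon = cell_to_lat_lon(1, 3)
```

Under these assumptions, cell (1, 3) lands close to the lat and lon recorded in the first sample row.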

Persons

persons.csv stores the scientists performing the study in CSV format (with column headers):

person_id  personal  family  supervisor_id
P06        Artur     Aasmäe  P22
P07        Katrin    Kool

Its fields are:

Field          Purpose                                 Properties
person_id      identifier                              text, unique, required
personal       personal name                           text, required
family         family name                             text, required
supervisor_id  identifier of supervisor (may be empty) text
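
Since supervisor_id may be empty, code that reads persons.csv needs to treat a blank value as "no supervisor". A minimal sketch using the standard csv module follows. It assumes the file is comma-separated with the header shown above, and the function name read_supervisors is hypothetical.

```python
import csv
import io
from typing import Dict, Optional


def read_supervisors(stream) -> Dict[str, Optional[str]]:
    """Map each person_id to their supervisor_id, or None if it is blank."""
    return {
        row["person_id"]: row["supervisor_id"] or None
        for row in csv.DictReader(stream)
    }


# The two sample rows above, written as comma-separated text.
text = (
    "person_id,personal,family,supervisor_id\n"
    "P06,Artur,Aasmäe,P22\n"
    "P07,Katrin,Kool,\n"
)
supervisors = read_supervisors(io.StringIO(text))
```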

Samples

samples.csv stores information about sampled snails in CSV format (with column headers):

sample_id  grid_id  x  y  pollution  person_id  timestamp   mass  diameter
S0001      G0001    9  8  0          P0004      2025-01-16  71.5  29.6
S0002      G0001    8  9  1          P0005      2025-03-30  62.1  28.9

Its fields are:

Field      Purpose                   Properties
sample_id  specimen identifier       text, unique, required
grid_id    grid identifier           text, required
x          X coordinate in grid      integer, required
y          Y coordinate in grid      integer, required
pollution  pollution at that point   integer, required
person_id  who collected the sample  text, required
timestamp  date sample collected     date, required
mass       sample weight (g)         real, required
diameter   sample diameter (mm)      real, required
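
A natural first analysis is to compare snail mass across pollution levels. The sketch below groups samples by the pollution column and averages mass using only the standard library. It assumes comma-separated input with the header shown above, and mean_mass_by_pollution is a hypothetical helper name.

```python
import csv
import io
from collections import defaultdict
from statistics import mean


def mean_mass_by_pollution(stream) -> dict:
    """Return the mean sample mass (g) for each pollution level."""
    masses = defaultdict(list)
    for row in csv.DictReader(stream):
        masses[int(row["pollution"])].append(float(row["mass"]))
    return {level: mean(values) for level, values in sorted(masses.items())}


# The two sample rows above, written as comma-separated text.
text = (
    "sample_id,grid_id,x,y,pollution,person_id,timestamp,mass,diameter\n"
    "S0001,G0001,9,8,0,P0004,2025-01-16,71.5,29.6\n"
    "S0002,G0001,8,9,1,P0005,2025-03-30,62.1,28.9\n"
)
means = mean_mass_by_pollution(io.StringIO(text))
```

Two rows are far too few to draw any conclusion; the point is only the shape of the calculation, which applies unchanged to a full samples.csv.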

Machines

machines.csv stores a list of machines used in the survey:

machine_id  name
M0001       Therma Sensor
M0002       Nano Fuge

Its fields are:

Field       Purpose             Properties
machine_id  machine identifier  text, unique, required
name        machine name        text, required

Ratings

ratings.csv stores the proficiency ratings of scientists with various machines:

person_id  machine_id  rating
P0006      M0004       1
P0001      M0003

Its fields are:

Field       Purpose                       Properties
person_id   who has the rating            text, required
machine_id  the machine they are rated on text, required
rating      numeric rating                integer

Extra Files

The output directory also contains a file called changes.json that records parameters used to alter data, such as the daily growth rate of snails and the ID of the clumsy scientist whose measurements have systematic errors.

Colophon

snailz was inspired by the Palmer Penguins dataset and by conversations with Rohan Alexander about his book Telling Stories with Data.

My thanks to everyone who built the tools this project relies on, including:

  • faker for data generation.
  • mkdocs for documentation.
  • pydantic for storing and validating data (including parameters).
  • pytest for testing.
  • ruff for checking the code.
  • taskipy for running tasks.
  • uv for managing packages and the virtual environment.

The snail logo was created by sunar.ko.

Acknowledgments

  • Greg Wilson is a programmer, author, and educator based in Toronto. He was the co-founder and first Executive Director of Software Carpentry and received ACM SIGSOFT's Influential Educator Award in 2020.
