Synthetic data generator for snail mutation survey

Project description

Snailz

snailz is a synthetic data generator that models a study of snails in the Pacific Northwest that are growing unusually large as a result of exposure to pollution. The package generates fully reproducible datasets of varying sizes and statistical properties, and is intended for classroom use. For example, an instructor can give each learner a unique dataset to analyze, while learners can test their analysis pipelines on datasets they generate themselves.

The Story

Years ago, logging companies dumped toxic waste in a remote region of Vancouver Island. As the containers leaked and the pollution spread, some snails in the region began growing unusually large. Your team is now collecting and analyzing specimens from affected regions to determine whether exposure to pollution is responsible.

snailz generates several related datasets:

Grids: the survey grids where pollution levels are measured.
Persons: the scientists conducting the study.
Samples: the snails collected from the survey sites.
Machines: the equipment used in the survey.
Ratings: the scientists' proficiency ratings with the machines.

Usage

To generate example data in a fresh directory:

# Create and activate Python virtual environment.
$ uv venv
$ source .venv/bin/activate

# Install snailz and dependencies.
$ uv pip install snailz

# Get help.
$ snailz --help

# Generate and display a dataset using the default parameters.
$ snailz --outdir -

# Write default parameter values to ./params.json for editing.
$ snailz --defaults > params.json

# Generate output with custom parameters in the ./data directory.
$ snailz --params params.json --outdir data

Parameters

snailz reads its controlling parameters from a JSON file, and can generate a file with default parameter values as a starting point. The parameters, their purposes, and their default values are:

Name	Purpose	Default
`clumsy_factor`	personal effect on mass measurement	0.5
`grid_gap`	minimum spacing between grids (m)	1000.0
`grid_size`	width and height of (square) survey grids	11
`grid_spacing`	size of survey grid cell (m)	20
`lat0`	reference latitude of grids (deg)	48.8666632
`lon0`	reference longitude of grids (deg)	-124.1999992
`locale`	locale for person name generation	et_EE
`num_grids`	number of survey grids	3
`num_machines`	number of pieces of laboratory equipment	5
`num_persons`	number of persons	6
`num_samples`	number of samples	20
`pollution_factor`	pollution effect on mass	0.3
`precision`	decimal places used to record masses	2
`sample_date`	min/max sample dates (YYYY-MM-DD)	(2025-01-01, 2025-01-01)
`sample_size`	sample mass mean and std. dev. (g)	(50, 10)
`seed`	random number generation seed	123456
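
Because the parameters are plain JSON, an instructor could derive one parameter set per learner by changing only `seed`. The sketch below assumes the JSON file is a flat object whose keys match the names in the table above; the helper name `per_learner_params` and the seed-offset scheme are illustrative and are not part of snailz.

```python
import copy
import json


def per_learner_params(defaults: dict, learners: list) -> dict:
    """Return one parameter set per learner, differing only in `seed`."""
    if "seed" not in defaults:
        raise KeyError("expected a 'seed' entry in the default parameters")
    result = {}
    for offset, learner in enumerate(learners, start=1):
        params = copy.deepcopy(defaults)
        # Assumption: a different seed yields a different dataset.
        params["seed"] = defaults["seed"] + offset
        result[learner] = params
    return result


# Hypothetical subset of the defaults listed in the table above.
defaults = {"num_grids": 3, "num_samples": 20, "seed": 123456}
variants = per_learner_params(defaults, ["alice", "bob"])
for learner, params in variants.items():
    # Each JSON string could be saved to a file and passed to --params.
    print(learner, json.dumps(params))
```

Each resulting file can then be used with `snailz --params <file> --outdir <dir>` as shown in the Usage section.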

Data Dictionary

All of the generated data is stored in CSV files and in a SQLite database.
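
The name of the database file and the names of its tables are not given here, so a reasonable first step is to ask SQLite which tables exist. The following is a minimal sketch using Python's standard `sqlite3` module; the in-memory database and its `example` table are stand-ins so that the code runs without any snailz output.

```python
import sqlite3


def list_tables(connection: sqlite3.Connection) -> list:
    """Return the names of all tables in a SQLite database, sorted."""
    rows = connection.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]


# Stand-in database; the table name here is hypothetical.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE example (id TEXT PRIMARY KEY)")
print(list_tables(connection))
connection.close()
```

To inspect real output, replace `":memory:"` with the path to the generated database file.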

Grids

The pollution readings for each survey grid are stored in a file Gnnnn.csv (e.g., G0003.csv). These CSV files do not have column headers; instead, each contains a square integer matrix of pollution readings. A typical file is:

0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,1,1,0,0,0,0
0,0,0,0,0,0,0,0,1,2,1,0,0,0,0
0,0,0,0,0,0,0,0,2,1,0,0,0,0,0
0,0,0,0,0,0,0,1,2,0,0,0,0,0,0
0,0,0,0,0,0,0,1,2,1,0,0,0,0,0
0,0,0,0,0,0,0,0,1,2,0,0,0,0,0
0,0,0,0,0,0,0,2,2,1,0,0,0,0,0
0,0,0,0,0,0,0,1,3,0,0,0,0,0,0
0,0,0,0,0,0,0,1,3,1,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
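
Since these files have no header row, they can be read as a plain matrix of integers. The sketch below uses only the standard library; the function name `read_grid` and the small sample grid are illustrative.

```python
import csv
import io


def read_grid(text: str) -> list:
    """Parse headerless CSV text into a square matrix of integers."""
    rows = [
        [int(cell) for cell in row]
        for row in csv.reader(io.StringIO(text))
        if row  # skip blank lines
    ]
    size = len(rows)
    if any(len(row) != size for row in rows):
        raise ValueError("grid is not square")
    return rows


# A small made-up grid in the same layout as the file shown above.
sample = "0,0,0\n0,1,2\n0,3,0\n"
grid = read_grid(sample)
print(len(grid), max(max(row) for row in grid))
```

For a real file, pass the result of `pathlib.Path("G0001.csv").read_text()` instead of `sample`.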

The pollution readings for polluted grid cells are also stored in tidy format in grids.csv:

grid_id	x	y	lat	lon	pollution
G0001	1	3	48.86720218670499	-124.1997260797134	1
G0001	2	3	48.86720218670499	-124.1994529594268	1
…	…	…	…	…	…

Its fields are:

Field	Purpose	Properties
`grid_id`	identifier	text, unique, required
`x`	X coordinate in grid	integer, required
`y`	Y coordinate in grid	integer, required
`lat`	latitude of grid cell	real, required
`lon`	longitude of grid cell	real, required
`pollution`	pollution at that point	integer, required
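
The relationship between a grid matrix and its tidy form can be sketched as follows. This is an illustration, not snailz's own code: it assumes that `x` is the column index and `y` the row index, both counted from zero, and it omits `lat` and `lon` because the conversion from grid coordinates to latitude and longitude is not described here.

```python
def to_tidy(grid_id: str, matrix: list) -> list:
    """List the polluted (non-zero) cells of a grid, one record per cell."""
    records = []
    for y, row in enumerate(matrix):  # assumption: y is the row index
        for x, value in enumerate(row):  # assumption: x is the column index
            if value > 0:
                records.append(
                    {"grid_id": grid_id, "x": x, "y": y, "pollution": value}
                )
    return records


# A made-up 2x2 grid with two polluted cells.
tidy = to_tidy("G0001", [[0, 0], [1, 2]])
for record in tidy:
    print(record)
```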

Persons

persons.csv stores the scientists performing the study in CSV format (with column headers):

person_id	personal	family	supervisor_id
P06	Artur	Aasmäe	P22
P07	Katrin	Kool
…	…	…	…

Its fields are:

Field	Purpose	Properties
`person_id`	identifier	text, unique, required
`personal`	personal name	text, required
`family`	family name	text, required
`supervisor_id`	identifier of the person's supervisor	text
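
Because `supervisor_id` is not required, code that reads this file should allow for an empty value. The sketch below assumes that a person without a supervisor has an empty field, and reproduces the two rows shown above as comma-separated text; the function name `supervisors` is illustrative.

```python
import csv
import io
from typing import Dict, Optional


def supervisors(text: str) -> Dict[str, Optional[str]]:
    """Map each person_id to a supervisor_id, or None if the field is empty."""
    reader = csv.DictReader(io.StringIO(text))
    return {row["person_id"]: (row["supervisor_id"] or None) for row in reader}


# The two rows shown above, written as comma-separated text.
sample = (
    "person_id,personal,family,supervisor_id\n"
    "P06,Artur,Aasmäe,P22\n"
    "P07,Katrin,Kool,\n"
)
print(supervisors(sample))
```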

Samples

samples.csv stores information about sampled snails in CSV format (with column headers):

sample_id	grid_id	x	y	pollution	person_id	timestamp	mass	diameter
S0001	G0001	9	8	0	P0004	2025-01-16	71.5	29.6
S0002	G0001	8	9	1	P0005	2025-03-30	62.1	28.9
…	…	…	…	…	…	…	…	…

Its fields are:

Field	Purpose	Properties
`sample_id`	specimen identifier	text, unique, required
`grid_id`	grid identifier	text, required
`x`	X coordinate in grid	integer, required
`y`	Y coordinate in grid	integer, required
`pollution`	pollution at that point	integer, required
`person_id`	who collected the sample	text, required
`timestamp`	date sample collected	date, required
`mass`	sample weight (g)	real, required
`diameter`	sample diameter (mm)	real, required
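
A natural first analysis is to compare snail mass across pollution levels. The sketch below groups the `mass` column by `pollution` and averages each group using only the standard library; the function name `mean_mass_by_pollution` is illustrative, and the sample text reproduces the two rows shown above as comma-separated values.

```python
import csv
import io
from collections import defaultdict
from statistics import mean


def mean_mass_by_pollution(text: str) -> dict:
    """Return the mean sample mass (g) for each pollution level."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(text)):
        groups[int(row["pollution"])].append(float(row["mass"]))
    return {level: mean(masses) for level, masses in sorted(groups.items())}


# The two rows shown above, written as comma-separated text.
sample = (
    "sample_id,grid_id,x,y,pollution,person_id,timestamp,mass,diameter\n"
    "S0001,G0001,9,8,0,P0004,2025-01-16,71.5,29.6\n"
    "S0002,G0001,8,9,1,P0005,2025-03-30,62.1,28.9\n"
)
result = mean_mass_by_pollution(sample)
print(result)
```

Two rows are far too few to draw any conclusion; with a full dataset the same function gives one mean per pollution level.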

Machines

machines.csv stores a list of machines used in the survey:

machine_id	name
M0001	Therma Sensor
M0002	Nano Fuge
…	…

Its fields are:

Field	Purpose	Properties
`machine_id`	machine identifier	text, unique, required
`name`	machine name	text, required

Ratings

ratings.csv stores the proficiency ratings of scientists with various machines:

person_id	machine_id	rating
P0006	M0004	1
P0001	M0003
…	…	…

Its fields are:

Field	Purpose	Properties
`person_id`	who has the rating	text, required
`machine_id`	the machine they are rated on	text, required
`rating`	numeric rating	integer
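
Since `rating` is not required, a reader should convert it to an integer only when a value is present. The sketch below assumes a missing rating appears as an empty field, as in the second row shown above; the function name `read_ratings` is illustrative.

```python
import csv
import io


def read_ratings(text: str) -> list:
    """Return (person_id, machine_id, rating) tuples; rating is None if absent."""
    ratings = []
    for row in csv.DictReader(io.StringIO(text)):
        value = row["rating"]
        rating = int(value) if value else None
        ratings.append((row["person_id"], row["machine_id"], rating))
    return ratings


# The two rows shown above, written as comma-separated text.
sample = "person_id,machine_id,rating\nP0006,M0004,1\nP0001,M0003,\n"
ratings = read_ratings(sample)
print(ratings)
```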

Extra Files

The output directory also contains a file called changes.json that records parameters used to alter data, such as the daily growth rate of snails and the ID of the clumsy scientist whose measurements have systematic errors.

Colophon

snailz was inspired by the Palmer Penguins dataset and by conversations with Rohan Alexander about his book Telling Stories with Data.

My thanks to everyone who built the tools this project relies on, including:

faker for data generation.
mkdocs for documentation.
pydantic for storing and validating data (including parameters).
pytest for testing.
ruff for checking the code.
taskipy for running tasks.
uv for managing packages and the virtual environment.

The snail logo was created by sunar.ko.

Acknowledgments

Greg Wilson is a programmer, author, and educator based in Toronto. He was the co-founder and first Executive Director of Software Carpentry and received ACM SIGSOFT's Influential Educator Award in 2020.

Project details

Release history

5.5.4

Feb 20, 2026

5.5.3

Feb 20, 2026

5.5.2

Feb 20, 2026

5.5.1

Feb 15, 2026

5.5.0

Feb 13, 2026

5.4.0

Feb 13, 2026

5.3.0

Feb 8, 2026

5.2.1

Feb 1, 2026

5.2.0

Feb 1, 2026

5.1.0

Feb 1, 2026

5.0.1

Jan 31, 2026

5.0.0

Jan 31, 2026

4.3.0

Jan 25, 2026

4.2.1

Jan 25, 2026

This version

4.2.0

Jan 23, 2026

3.3.0

Jan 17, 2026

3.2.0

Jul 5, 2025

3.1.0

Jul 4, 2025

3.0.0

Jul 4, 2025

2.2.0

May 3, 2025

2.1.1

May 2, 2025

2.1.0

May 2, 2025

2.0.0

Apr 29, 2025

1.4.2

Apr 18, 2025

1.4.1

Apr 18, 2025

1.4.0

Apr 18, 2025

1.3.0

Apr 13, 2025

1.2.0

Apr 10, 2025

1.1.0

Apr 7, 2025

1.0.0

Apr 6, 2025

0.2.10

Apr 1, 2025

0.2.9

Apr 1, 2025

0.2.8

Mar 31, 2025

0.2.7

Mar 30, 2025

0.2.6

Mar 30, 2025

0.2.5

Mar 30, 2025

0.2.4

Mar 29, 2025

0.2.3

Mar 28, 2025

0.2.2

Mar 27, 2025

0.2.1

Mar 27, 2025

0.2.0

Mar 25, 2025

0.1.16

Mar 15, 2025

0.1.15

Dec 14, 2024

0.1.14

Aug 21, 2024

0.1.13

Aug 17, 2024

0.1.12

Aug 13, 2024

0.1.11

Aug 12, 2024

0.1.10

Aug 11, 2024

0.1.9

Aug 10, 2024

0.1.8

Aug 8, 2024

0.1.7

Aug 7, 2024

0.1.6

Aug 4, 2024

0.1.5

Aug 3, 2024

0.1.4

Aug 2, 2024

0.1.3

Jul 14, 2024

Download files

Source Distribution

snailz-4.2.0.tar.gz (745.7 kB)

Uploaded Jan 23, 2026 Source

Built Distribution

snailz-4.2.0-py3-none-any.whl (15.8 kB)

Uploaded Jan 23, 2026 Python 3

File details

Details for the file snailz-4.2.0.tar.gz.

File metadata

Download URL: snailz-4.2.0.tar.gz
Upload date: Jan 23, 2026
Size: 745.7 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.12

File hashes

Hashes for snailz-4.2.0.tar.gz
Algorithm	Hash digest
SHA256	`9d3cac7455327a48ff4834f6f38faadd32f3db6980d2082300a513e02dca28e7`
MD5	`8333e6afe180d03a3e1fe6c0ea40ed9b`
BLAKE2b-256	`240ab0166a6132588c67727d617d1651ed1513ee7840aedae43b2b7176ae015c`

See more details on using hashes here.
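
A downloaded file can be checked against the SHA256 digest listed above with Python's standard `hashlib` module. This is a minimal sketch; the function name `sha256_matches` is illustrative, and the demonstration hashes an empty byte string rather than the real archive so that it runs without a download.

```python
import hashlib
import hmac


def sha256_matches(data: bytes, expected_hex: str) -> bool:
    """Return True if the SHA256 digest of `data` equals `expected_hex`."""
    digest = hashlib.sha256(data).hexdigest()
    return hmac.compare_digest(digest, expected_hex.strip().lower())


# Demonstration with the well-known SHA256 digest of empty input.
EMPTY_SHA256 = "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"
print(sha256_matches(b"", EMPTY_SHA256))
```

For the real archive, read the file in binary mode, for example with `pathlib.Path("snailz-4.2.0.tar.gz").read_bytes()`, and pass the SHA256 value from the table above as `expected_hex`.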

File details

Details for the file snailz-4.2.0-py3-none-any.whl.

File metadata

Download URL: snailz-4.2.0-py3-none-any.whl
Upload date: Jan 23, 2026
Size: 15.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.12

File hashes

Hashes for snailz-4.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`aa899bdf78669dfc2ddcb46062f7f130c6a4e40b57c6ba9301ff6de3f0b931f3`
MD5	`45c41d845c6d64eef1972ebba872ea44`
BLAKE2b-256	`9e8ed2133049cf0350807828a8ddbe892e5ab7fa6cd696dcb64a389850cf80fe`

See more details on using hashes here.
