A fork of the dbt Labs Jaffle Shop Generator, for baking real-time jaffles on Kafka.

Project description

🥪 Jaffle Shop Generator 🏭

[!NOTE] This is not an official dbt Labs project. It is maintained on a volunteer basis by dbt Labs employees who are passionate about analytics engineering, the dbt Community, and jaffles, and feel that generating datasets for learning and practicing is important. Please understand it's a work in progress and not supported in the same way as dbt itself.

The Jaffle Shop Generator or jafgen is a simple command line tool for generating synthetic datasets suitable for analytics engineering practice or demonstrations. The data is generated in CSV format and is designed to be used with a relational database. It follows a simple schema, with tables for:

  • Customers (who place Orders)
  • Orders (from those Customers)
  • Products (the food and beverages the Orders contain)
  • Order Items (of those Products)
  • Supplies (needed for making those Products)
  • Stores (where the Orders are placed and fulfilled)
  • Tweets (Customers sometimes issue Tweets after placing an Order)
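
As a rough illustration of how these tables relate, the sketch below joins Customers to Orders using only the standard library. The column names (`id`, `name`, `customer_id`, `store_id`) and the sample rows are assumptions made for the example; check the headers of the generated CSV files for the real ones.

```python
import csv
import io

# Hypothetical sample rows standing in for the generated CSV files.
# Real jafgen column names may differ.
CUSTOMERS_CSV = """id,name
c1,Ada
c2,Grace
"""

ORDERS_CSV = """id,customer_id,store_id
o1,c1,s1
o2,c1,s1
o3,c2,s1
"""


def read_rows(text):
    """Parse CSV text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def orders_per_customer(customers, orders):
    """Count orders for each customer by following the customer_id key."""
    counts = {c["id"]: 0 for c in customers}
    for order in orders:
        counts[order["customer_id"]] += 1
    return {c["name"]: counts[c["id"]] for c in customers}


customers = read_rows(CUSTOMERS_CSV)
orders = read_rows(ORDERS_CSV)
result = orders_per_customer(customers, orders)
print(result)
```

In practice the same join would usually be written in SQL or dbt once the files are loaded into a database.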

It uses some straightforward math to create seasonality and trends in the data, for instance weekends being less busy than weekdays, customers having certain preferences, and new store locations opening over time. We plan to add more data types and complexity as the codebase evolves.
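
To make the idea of seasonality and trend concrete, here is a minimal sketch of a daily demand function. It is not jafgen's actual formula: the weekend factor, the growth rate, the base volume, and the start date are all hypothetical values chosen for illustration.

```python
from datetime import date

WEEKEND_FACTOR = 0.6   # hypothetical: weekends are quieter than weekdays
ANNUAL_GROWTH = 0.10   # hypothetical: 10% more orders each year


def expected_orders(day, start, base=100.0):
    """Return an illustrative expected order count for a given day."""
    years_elapsed = (day - start).days / 365
    trend = (1 + ANNUAL_GROWTH) ** years_elapsed
    # date.weekday() returns 5 for Saturday and 6 for Sunday
    weekly = WEEKEND_FACTOR if day.weekday() >= 5 else 1.0
    return base * trend * weekly


start = date(2016, 1, 1)       # a Friday, used here only as an example
friday = expected_orders(start, start)
saturday = expected_orders(date(2016, 1, 2), start)
print(friday, saturday)
```

Multiplying a slow trend by a repeating weekly factor is one common way to get data that looks seasonal without being perfectly regular.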

Installation

Requires Python 3.10 or higher.

If you have pipx installed, jafgen is an ideal tool to use via pipx. You can generate data without installing anything in the local workspace using the following:

pipx run jafgen [options]

You can also install jafgen into your project or workspace, ideally in a virtual environment.

pip install jafgen

Use

jafgen takes one argument:

  • [int] Years to generate data for. The default is 1 year.

The following options are available:

  • --days [int] The number of days to generate data for. If both years and days are set, they will be added together.

  • --pre sets a prefix for the generated files in the format [prefix]_[file_name].csv. It defaults to raw.

Generate a simulation spanning 3 years from 2016-2019 with a prefix of cool:

jafgen 3 --pre cool
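
The way the argument and options combine can be sketched as follows. This is an illustration of the behavior described above, not jafgen's own code: the function names are hypothetical, a year is assumed to be 365 days, and `customers` is an assumed file name.

```python
def total_days(years=1, days=0):
    """Add years and days together, assuming 365 days per year."""
    return years * 365 + days


def output_name(file_name, prefix="raw"):
    """Build a file name in the documented [prefix]_[file_name].csv format."""
    return f"{prefix}_{file_name}.csv"


# Equivalent of `jafgen 3 --pre cool` under the assumptions above
span = total_days(years=3)
name = output_name("customers", prefix="cool")
print(span, name)
```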

Purpose

Finding a good dataset to practice, learn, or teach analytics engineering with can be difficult. Most open datasets are great for machine learning: they offer single wide tables that you can manipulate and analyze. Full, real relational databases, on the other hand, are generally kept private by the companies that own them. Not only that, but they're a bit too real. Getting one into a state that a beginner or intermediate practitioner can understand requires a substantial amount of analytics engineering transformation.

To that end, this project generates relatively simple, clean (but importantly, not perfect) data for a variety of entities in discrete tables, which can be transformed and combined into analytical building blocks. There are even trends (like seasonality) and personas (like buying patterns) that can be sussed out through data modeling!

Approach

The great @drewbanin watched the movie Synecdoche, New York, and was inspired by the idea of creating a complete simulation of a world. Rather than using discrete rules to generate synthetic data, the idea was to set up entities with behavior patterns and let them loose to interact with each other. This was the birth of the Jaffle Shop Generator. There are customers, stores, products, and more, all with their own behaviors and interactions as time passes. These combine to create unique and realistic datasets on every run.

An important caveat is that jafgen is not idempotent. By design, it generates new data every time you run it based on the simulation's interactions. This is intended behavior, as it allows for more realistic and interesting data generation. Certain aspects are hard coded, like stores opening at certain times, but the output data is always unique.
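
The entity-driven approach, and why it is not idempotent, can be pictured with a toy simulation. Everything here is a simplified assumption for illustration: the `Customer` class, the `order_probability` parameter, and the `simulate` function are hypothetical and do not mirror jafgen's internals.

```python
import random


class Customer:
    """A toy entity whose only behavior is to maybe place an order each day."""

    def __init__(self, customer_id, order_probability):
        self.customer_id = customer_id
        self.order_probability = order_probability

    def maybe_order(self, day, rng):
        """Return an order dict if this customer orders today, else None."""
        if rng.random() < self.order_probability:
            return {"customer_id": self.customer_id, "day": day}
        return None


def simulate(customers, days, rng):
    """Step through each day and let every customer act on their own."""
    orders = []
    for day in range(days):
        for customer in customers:
            order = customer.maybe_order(day, rng)
            if order is not None:
                orders.append(order)
    return orders


customers = [Customer("c1", 0.8), Customer("c2", 0.2)]

# An unseeded generator will usually produce different orders on each run.
fresh_run = simulate(customers, 30, random.Random())

# Seeding is shown only to contrast with the behavior described above;
# it is not presented as a jafgen option.
seeded_a = simulate(customers, 30, random.Random(42))
seeded_b = simulate(customers, 30, random.Random(42))
print(len(fresh_run), seeded_a == seeded_b)
```

Because the output depends on random draws made as the entities interact, two runs only match when the random source is fixed.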

We hope over time to add more complex behaviors and trends to the simulation!

Contribution

We welcome contributions to the project! Getting started is relatively simple. Clone the repo, spin up a virtual environment, and install the dependencies:

gh repo clone dbt-labs/jaffle-shop-generator
cd jaffle-shop-generator
# Create and activate a virtual environment (macOS/Linux shell syntax)
python3 -m venv .venv
source .venv/bin/activate
# Install the package requirements
pip install -r requirements.txt
# Install the dev tooling (ruff and pytest)
pip install -r dev-requirements.txt
# Install the package in editable mode
pip install -e .

Working out from the jafgen command, you can see the main entrypoint in jaffle_shop_generator/cli.py. This calls the simulation found in jafgen/simulation.py. The simulation is where most of the magic happens.

We recommend installing our Git hook scripts locally. To do that, install Lefthook and run:

lefthook install

Download files

Source Distribution

jafkafegen-0.4.17.tar.gz (26.1 kB)

Uploaded Source

Built Distribution

jafkafegen-0.4.17-py3-none-any.whl (79.1 kB)

Uploaded Python 3

File details

Details for the file jafkafegen-0.4.17.tar.gz.

File metadata

  • Download URL: jafkafegen-0.4.17.tar.gz
  • Upload date:
  • Size: 26.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/2.1.3 CPython/3.9.6 Darwin/24.6.0

File hashes

Hashes for jafkafegen-0.4.17.tar.gz
Algorithm Hash digest
SHA256 25fc2911289321da91010dbe10631dfeeee2fd6e51361364021332b79ae70007
MD5 212add6c70aaa7d824c2fffac954c5d0
BLAKE2b-256 8d2b11fa6d7516b4215ac0f35269dd592b85ac58a573c2ee6a344ae225081146

File details

Details for the file jafkafegen-0.4.17-py3-none-any.whl.

File metadata

  • Download URL: jafkafegen-0.4.17-py3-none-any.whl
  • Upload date:
  • Size: 79.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/2.1.3 CPython/3.9.6 Darwin/24.6.0

File hashes

Hashes for jafkafegen-0.4.17-py3-none-any.whl
Algorithm Hash digest
SHA256 5bec05d3852706648a638ae1db12882dc93cc3a81b9d23e8cbf72ee89b2306ab
MD5 024ec20a51a4d4cd3959a66c3ff8ce26
BLAKE2b-256 631b14d0f05f17aef4c3be086778a6887660b8ccabbc7c36b7e9b724c62562b8
