
A package which makes it easy to generate TPC-H data in parallel with DuckDB

Project description

tpch-datagen - by GizmoData

A utility to generate TPC-H data in parallel using DuckDB and multi-processing

Why?

Because generating TPC-H data can be time-consuming and resource-intensive. This project provides a way to generate TPC-H data in parallel using DuckDB and multi-processing.
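As a sketch of the underlying mechanism: DuckDB's built-in tpch extension can generate one slice of the dataset at a time via the children/step parameters of its dbgen table function, which is what makes chunked, parallel generation possible. This is illustrative only; the exact calls tpch-datagen issues internally may differ.

```shell
# Sketch: generate chunk 0 of 10 for scale factor 1 with the DuckDB CLI.
# Each chunk can run in its own process, which is the core of the
# parallelism this package automates.
duckdb -c "INSTALL tpch; LOAD tpch; CALL dbgen(sf = 1, children = 10, step = 0);"
```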

Setup (to run locally)

Install Python package

You can install tpch-datagen from PyPI or from source.

Option 1 - from PyPI

# Create the virtual environment
python3 -m venv .venv

# Activate the virtual environment
. .venv/bin/activate

pip install tpch-datagen

Option 2 - from source - for development

git clone https://github.com/gizmodata/tpch-datagen

cd tpch-datagen

# Create the virtual environment
python3 -m venv .venv

# Activate the virtual environment
. .venv/bin/activate

# Upgrade pip, setuptools, and wheel
pip install --upgrade pip setuptools wheel

# Install TPC-H Datagen - in editable mode with client and dev dependencies
pip install --editable .[dev]

Note

For the following commands - if you are running from source in --editable mode (for development purposes) - you will need to set the PYTHONPATH environment variable as follows:

export PYTHONPATH=$(pwd)/src
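Once installed (from PyPI, or from source with PYTHONPATH set), you can sanity-check the installation with the CLI's version flag, which is listed in the help output below:

```shell
# Should print the installed tpch-datagen package version and exit
tpch-datagen --version
```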

Usage

Here are the options for the tpch-datagen command:

tpch-datagen --help
Usage: tpch-datagen [OPTIONS]

Options:
  --version / --no-version        Prints the TPC-H Datagen package version and
                                  exits.  [required]
  --scale-factor INTEGER          The TPC-H Scale Factor to use for data
                                  generation.
  --data-directory TEXT           The target output data directory to put the
                                  files into  [default: data; required]
  --work-directory TEXT           The work directory to use for data
                                  generation.  [default: /tmp; required]
  --overwrite / --no-overwrite    Can we overwrite the target directory if it
                                  already exists...  [default: no-overwrite;
                                  required]
  --num-chunks INTEGER            The number of chunks that will be generated
                                  - more chunks equals smaller memory
                                  requirements, but more files generated.
                                  [default: 10; required]
  --num-processes INTEGER         The maximum number of processes for the
                                  multi-processing pool to use for data
                                  generation.  [default: 10; required]
  --duckdb-threads INTEGER        The number of DuckDB threads to use for data
                                  generation (within each job process).
                                  [default: 1; required]
  --per-thread-output / --no-per-thread-output
                                  Controls whether to write the output to a
                                  single file or multiple files (for each
                                  process).  [default: per-thread-output;
                                  required]
  --compression-method [none|snappy|gzip|zstd]
                                  The compression method to use for the
                                  parquet files generated.  [default: zstd;
                                  required]
  --file-size-bytes TEXT          The target file size for the parquet files
                                  generated.  [default: 100m; required]
  --parquet-version [v1|v2]       The version of Parquet to use for the
                                  parquet files generated.  [default: v2;
                                  required]
  --help                          Show this message and exit.

[!NOTE]
Default values may change depending on the number of CPU cores you have, etc.
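For example, a typical invocation might look like the following (all flags are taken from the help text above; the specific values are illustrative - adjust them for your machine):

```shell
# Generate scale factor 1 TPC-H data as zstd-compressed Parquet,
# split into 20 chunks across 8 worker processes (2 DuckDB threads each)
tpch-datagen \
  --scale-factor 1 \
  --data-directory ./tpch-data \
  --num-chunks 20 \
  --num-processes 8 \
  --duckdb-threads 2 \
  --compression-method zstd \
  --overwrite
```

More chunks lower the peak memory needed per process, at the cost of producing more output files.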

Handy development commands

Version management

Bump the version of the application - (you must have installed from source with the [dev] extras)
bumpver update --patch

Project details


Download files

Download the file for your platform.

Source Distribution

tpch_datagen-0.0.8.tar.gz (8.0 kB)

Uploaded Source

Built Distribution

tpch_datagen-0.0.8-py3-none-any.whl (7.6 kB)

Uploaded Python 3

File details

Details for the file tpch_datagen-0.0.8.tar.gz.

File metadata

  • Download URL: tpch_datagen-0.0.8.tar.gz
  • Upload date:
  • Size: 8.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for tpch_datagen-0.0.8.tar.gz
  • SHA256: 9c2ab22fa6f65d355faff928a4edfaff9d5447a921ff292fb0bff2318438fb1b
  • MD5: 7a044be8e9e790ae7e383e98ebbbc4b1
  • BLAKE2b-256: fee1c57cc6f8693710c71f6cfa4424cde2e01f3fe6aa10340de00a5f5019fe04
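If you download the source distribution manually, you can check it against the published SHA256 digest with standard tools. A sketch (the pip download flags are standard; the digest is the one listed above for this release, and the ./dl directory name is just an example):

```shell
# Download the source distribution without dependencies, then verify its digest
pip download tpch-datagen==0.0.8 --no-deps --no-binary :all: -d ./dl
echo "9c2ab22fa6f65d355faff928a4edfaff9d5447a921ff292fb0bff2318438fb1b  ./dl/tpch_datagen-0.0.8.tar.gz" | sha256sum --check
```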

Provenance

The following attestation bundles were made for tpch_datagen-0.0.8.tar.gz:

Publisher: ci.yml on gizmodata/tpch-datagen

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file tpch_datagen-0.0.8-py3-none-any.whl.

File metadata

  • Download URL: tpch_datagen-0.0.8-py3-none-any.whl
  • Upload date:
  • Size: 7.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for tpch_datagen-0.0.8-py3-none-any.whl
  • SHA256: 37e2e023e878be96eb67f38f34de8e14f219412adb1612b70f693aa027816c98
  • MD5: ea40f7540c9372712af3d9a4ae4935da
  • BLAKE2b-256: 47f95bb85955bf086306d424d5390320af67c997e256192f6228c08e9b33e170

Provenance

The following attestation bundles were made for tpch_datagen-0.0.8-py3-none-any.whl:

Publisher: ci.yml on gizmodata/tpch-datagen

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
