A package which makes it easy to generate TPC-H data in parallel with DuckDB
tpch-datagen
A utility to generate TPC-H data in parallel using DuckDB and multi-processing
Why?
Because generating TPC-H data can be time-consuming and resource-intensive. This project provides a way to generate TPC-H data in parallel using DuckDB and multi-processing.
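As a rough illustration of the idea (not necessarily how tpch-datagen itself is implemented), the sketch below splits generation into chunks and hands each chunk to a worker process. It assumes the `duckdb` Python package is installed and that the `dbgen` function of DuckDB's `tpch` extension accepts the `sf`, `children`, and `step` parameters, with `step` taken to be 0-based. The helper names and the per-chunk output file names are hypothetical.

```python
from multiprocessing import Pool


def dbgen_sql(scale_factor: int, num_chunks: int, step: int) -> str:
    """Build the DuckDB statement that generates one chunk of TPC-H data."""
    # Assumption: `children` is the total number of parts and `step`
    # (0-based) selects which part to generate.
    return f"CALL dbgen(sf={scale_factor}, children={num_chunks}, step={step})"


def plan_tasks(scale_factor: int, num_chunks: int) -> list[tuple[int, int, int]]:
    """Return one (scale_factor, num_chunks, step) task per chunk."""
    return [(scale_factor, num_chunks, step) for step in range(num_chunks)]


def generate_chunk(task: tuple[int, int, int]) -> int:
    """Worker: generate one chunk in a private in-memory DuckDB database."""
    import duckdb  # imported lazily, inside the worker process

    scale_factor, num_chunks, step = task
    con = duckdb.connect()
    con.execute("INSTALL tpch")
    con.execute("LOAD tpch")
    con.execute(dbgen_sql(scale_factor, num_chunks, step))
    # Hypothetical output layout: one Parquet file per chunk, lineitem only.
    con.execute(
        f"COPY lineitem TO 'lineitem_{step}.parquet' "
        "(FORMAT parquet, COMPRESSION zstd)"
    )
    con.close()
    return step


def generate_all(scale_factor: int, num_chunks: int, num_processes: int) -> list[int]:
    """Generate every chunk using a pool of worker processes."""
    with Pool(processes=num_processes) as pool:
        finished = pool.imap_unordered(generate_chunk, plan_tasks(scale_factor, num_chunks))
        return sorted(finished)
```

Each worker only ever holds one chunk in memory, which is why more chunks mean a smaller memory footprint at the cost of more output files.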
Setup (to run locally)
Install Python package
You can install tpch-datagen from PyPI or from source.
Option 1 - from PyPI
# Create the virtual environment
python3 -m venv .venv
# Activate the virtual environment
. .venv/bin/activate
pip install tpch-datagen
Option 2 - from source - for development
git clone https://github.com/gizmodata/tpch-datagen
cd tpch-datagen
# Create the virtual environment
python3 -m venv .venv
# Activate the virtual environment
. .venv/bin/activate
# Upgrade pip, setuptools, and wheel
pip install --upgrade pip setuptools wheel
# Install TPC-H Datagen - in editable mode with dev dependencies
pip install --editable .[dev]
Note
For the following commands, if you are running from source in --editable mode (for development purposes), you will need to set the PYTHONPATH environment variable as follows:
export PYTHONPATH=$(pwd)/src
Usage
Here are the options for the tpch-datagen command:
tpch-datagen --help
Usage: tpch-datagen [OPTIONS]
Options:
--version / --no-version Prints the Spark Connect Proxy version and
exits. [required]
--scale-factor INTEGER The TPC-H Scale Factor to use for data
generation.
--data-directory TEXT The target output data directory to put the
files into [default: data; required]
--work-directory TEXT The work directory to use for data
generation. [default: /tmp; required]
--overwrite / --no-overwrite Can we overwrite the target directory if it
already exists... [default: no-overwrite;
required]
--num-chunks INTEGER The number of chunks that will be generated
- more chunks equals smaller memory
requirements, but more files generated.
[default: 10; required]
--num-processes INTEGER The maximum number of processes for the
multi-processing pool to use for data
generation. [default: 10; required]
--duckdb-threads INTEGER The number of processes to use for data
generation. [default: 1; required]
--per-thread-output / --no-per-thread-output
Controls whether to write the output to a
single file or multiple files (for each
process). [default: per-thread-output;
required]
--compression-method [none|snappy|gzip|zstd]
The compression method to use for the
parquet files generated. [default: zstd;
required]
--file-size-bytes TEXT The target file size for the parquet files
generated. [default: 100m; required]
--help Show this message and exit.
Note
Some default values depend on your machine, such as the number of CPU cores available.
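If you want to drive the command from a script, the following sketch assembles an invocation from a few of the options listed above. The flag names and the compression choices are taken from the help output; the `build_command` and `run_datagen` helpers are hypothetical and exist only for illustration.

```python
import subprocess

# Choices accepted by --compression-method, per the help output above.
COMPRESSION_METHODS = ("none", "snappy", "gzip", "zstd")


def build_command(
    scale_factor: int,
    data_directory: str = "data",
    num_chunks: int = 10,
    num_processes: int = 10,
    compression_method: str = "zstd",
    overwrite: bool = False,
) -> list[str]:
    """Assemble the tpch-datagen argument list from the documented flags."""
    if compression_method not in COMPRESSION_METHODS:
        raise ValueError(f"unsupported compression method: {compression_method}")
    return [
        "tpch-datagen",
        "--scale-factor", str(scale_factor),
        "--data-directory", data_directory,
        "--num-chunks", str(num_chunks),
        "--num-processes", str(num_processes),
        "--compression-method", compression_method,
        "--overwrite" if overwrite else "--no-overwrite",
    ]


def run_datagen(**options) -> subprocess.CompletedProcess:
    """Run tpch-datagen; raises CalledProcessError on a non-zero exit."""
    return subprocess.run(build_command(**options), check=True)
```

For example, `run_datagen(scale_factor=10, num_chunks=20, overwrite=True)` would generate scale factor 10 in 20 chunks, assuming tpch-datagen is installed in the active environment.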
Handy development commands
Version management
Bump the version of the application (you must have installed from source with the [dev] extras):
bumpver update --patch