bindings to libdbgen / tpch-dbgen

Project description

Ergonomically create TPC-H data through Python as Arrow tables.

import pytpch
import pyarrow as pa

# Generate TPC-H data at scale 1 (~1GB)
tables: dict[str, pa.Table] = pytpch.dbgen(sf=1)

# Generate a single table at scale 1
tables: dict[str, pa.Table] = pytpch.dbgen(sf=1, table=pytpch.Table.Nation)

# Generate a single chunk out of n chunks of a single table.
# This is especially helpful at larger scale factors: you can generate subsets
# of the data in parallel, then store them separately or concatenate them afterwards.
tables: dict[str, pa.Table] = pytpch.dbgen(sf=1, table=pytpch.Table.Nation, n_chunks=10, nth_step=0)


# NOTE! As mentioned in the docs for this function, it is NOT thread-safe.
#       If you want to generate data in parallel, you must do so in separate processes for now,
#       e.g. with `multiprocessing` or `concurrent.futures.ProcessPoolExecutor`.
#       Fixing this is a TODO: the original C code keeps its state in many global and static function
#       variables. The refactoring in milesgranger/libdbgen resets that state between function calls,
#       but the shared globals are still there, so concurrent calls from threads are not safe.
#
# Example of generating data in parallel:
from concurrent.futures import ProcessPoolExecutor

n_chunks = 10  # 10 total chunks

def gen_step(step):
    # Generate one of the n_chunks pieces at scale factor 10
    return pytpch.dbgen(sf=10, n_chunks=n_chunks, nth_step=step)

# Use processes, not threads: dbgen is not thread-safe (see the note above)
with ProcessPoolExecutor() as executor:
    jobs: list[dict[str, pa.Table]] = list(executor.map(gen_step, range(n_chunks)))

# Default reference queries provided (1-22) as:
print(pytpch.QUERY_1)
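
The parallel example above leaves you with one dict per chunk. A minimal sketch of regrouping those results by table before joining them is below. It assumes the dict keys are table names (as the `dict[str, pa.Table]` annotation suggests), and `group_chunks_by_table` is a hypothetical helper, not part of pytpch.

```python
from collections import defaultdict
from typing import Any


def group_chunks_by_table(chunks: list[dict[str, Any]]) -> dict[str, list[Any]]:
    """Collect the pieces of each table across all chunk results, in chunk order.

    Hypothetical helper: assumes each chunk result maps a table name to that
    chunk's piece of the table.
    """
    grouped: dict[str, list[Any]] = defaultdict(list)
    for chunk in chunks:
        for table_name, part in chunk.items():
            grouped[table_name].append(part)
    return dict(grouped)
```

With `jobs` from the example above, something like `{name: pa.concat_tables(parts) for name, parts in group_chunks_by_table(jobs).items()}` should then give one Arrow table per name, provided every chunk of a table shares the same schema.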

Tell me more...

Python bindings (through Rust, because why not) to libdbgen, which is a fork of databricks/tpch-dbgen for generating TPC-H data.

tpch-dbgen was originally a CLI for generating CSV files of TPC-H data. I wanted to turn it into an ergonomic Python API for use in other projects.

TODOS (roughly in order of priority):

  • Support for more than Linux x86_64 (mostly just adapting C lib and updating CI)
  • Write directly to Arrow, removing CSV writing (w/ nanoarrow probably)
  • Make thread safe (remove global and static function variables in C lib, and remove changing of CWD)
  • Separate out the Rust stuff into its own crate.

Build from source...

Roughly:

  • git clone --recursive git@github.com:milesgranger/pytpch.git
  • python -m pip install maturin
  • maturin build --release

That'll only work if you're on x86_64 Linux for now. You can try adapting build.rs for other platforms, but good luck with that.


Built Distribution

pytpch-0.1.0-cp310-cp310-manylinux_2_34_x86_64.whl (1.3 MB view details)

Uploaded CPython 3.10 manylinux: glibc 2.34+ x86-64

File details

Details for the file pytpch-0.1.0-cp310-cp310-manylinux_2_34_x86_64.whl.

File metadata

File hashes

Hashes for pytpch-0.1.0-cp310-cp310-manylinux_2_34_x86_64.whl
Algorithm Hash digest
SHA256 dafcb32a8bce50910d9dd8f3a239190a7fbf0bdc8f11247c23512a0e1ee0bfa4
MD5 d53d1b5ef6e2e36716f88a4c8b669e2f
BLAKE2b-256 3ef6b9762f8d5cc47139fcf38d0d1d28a9827ce8ea09be49b097ece87381f3dd
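
If you download the wheel by hand, you can check it against the SHA256 digest listed above. A minimal sketch using only the standard library follows. The helper names are hypothetical, and the file path in the usage note assumes the wheel sits in the current directory.

```python
import hashlib
from pathlib import Path

# SHA256 digest copied from the hash listing above
EXPECTED_SHA256 = "dafcb32a8bce50910d9dd8f3a239190a7fbf0bdc8f11247c23512a0e1ee0bfa4"


def sha256_of(path: Path, block_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, reading it in blocks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for block in iter(lambda: fh.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def matches_published_hash(path: Path, expected: str = EXPECTED_SHA256) -> bool:
    """Compare a file's digest with the expected one, ignoring hex letter case."""
    return sha256_of(path) == expected.lower()
```

For example, `matches_published_hash(Path("pytpch-0.1.0-cp310-cp310-manylinux_2_34_x86_64.whl"))` should return `True` for an intact download of that file.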
