A minimal, Polars-focused data processing benchmark suite

These details have not been verified by PyPI

Project links

Project description

PDS Benchmark

A minimal, Polars-focused data processing benchmark suite for evaluating performance across different execution modes.

Features

Polars-focused: Optimized for Polars with support for streaming, in-memory, and GPU execution
Minimal dependencies: Lightweight with essential dependencies only
Easy installation: Single command installation via pip
TPC-H benchmarks: Industry-standard TPC-H queries (1-22)
Flexible configuration: Comprehensive CLI options
Multiple modes: Streaming, in-memory, and GPU acceleration support

Quick Start

Installation

# Install from PyPI (when published)
pip install pds-benchmark

# Or install from source
git clone https://github.com/pds-benchmark/pds-benchmark
cd pds-benchmark
pip install -e .

Basic Usage

# Run with defaults (Polars streaming, scale factor 1)
pds-benchmark run

# Run with specific options
pds-benchmark run --scale-factor 10 --mode streaming --runs 3

# Run specific queries
pds-benchmark run --queries "1,6,12"

# Generate Dataset
pds-benchmark generate tpch --scale-factor 10 --output "./dataset"

Configuration

CLI Options

pds-benchmark run --help

Key options:

--scale-factor, -s: TPC-H scale factor (default: 1)
--mode: Execution mode - streaming, in-memory, or gpu (default: streaming)
--runs, -r: Number of measurement runs (default: 3)
--queries: Comma-separated query numbers to run
--data-path: Directory for datasets (default: ./data)
--extra-config: JSON string for advanced configuration

Advanced Configuration

Use the --extra-config option to pass advanced settings as a JSON string:

pds-benchmark run --mode gpu --extra-config '{"gpu_device": 1, "gpu_memory_resource": "cuda-pool"}'

Key	Description	Applicable To
`gpu_device`	GPU device ID (e.g., 0, 1)	Polars (GPU mode)
`gpu_memory_resource`	Memory resource type (e.g., `cuda`, `cuda-pool`)	Polars (GPU mode)
`threads`	Number of threads to use	DuckDB
`memory_limit`	Memory limit (e.g., "10GB")	DuckDB

Execution Modes

Streaming Mode (Default)

Memory-efficient processing for large datasets:

pds-benchmark run --mode streaming --scale-factor 100

In-Memory Mode

Load all data into memory for fastest query execution:

pds-benchmark run --mode in-memory --scale-factor 10

GPU Mode

GPU-accelerated processing (requires NVIDIA GPU + CUDA):

pds-benchmark run --mode gpu --scale-factor 10

GPU dependencies are installed automatically when first used. Falls back to CPU if unavailable.

Use Cases

CI/CD Integration

# .github/workflows/benchmark.yml
name: Performance Benchmark
on: [push, pull_request]

jobs:
  benchmark:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v4
        with:
          python-version: "3.11"
      
      - name: Install benchmark
        run: pip install pds-benchmark
      
      - name: Run benchmark
        run: pds-benchmark run --scale-factor 1 --runs 1

Performance Testing

# Quick performance check
pds-benchmark run --scale-factor 1 --runs 1

# Full benchmark suite
pds-benchmark run --scale-factor 10 --runs 5

# Compare execution modes
pds-benchmark run --mode streaming --scale-factor 10 --runs 3
pds-benchmark run --mode in-memory --scale-factor 10 --runs 3

Requirements

Python 3.11+
Polars 1.0+
tqdm
For GPU mode: NVIDIA GPU with CUDA support

System Requirements

Memory: 4GB+ RAM for scale factor 1, 16GB+ for scale factor 10
Storage: 1GB+ free space for datasets
GPU (optional): CUDA 11.8+ compatible GPU for GPU mode

Architecture

The benchmark uses a plugin-based architecture:

src/pds_benchmark/
├── executor.py              # Generic benchmark runner
├── plugins/
│   ├── __init__.py          # BenchmarkPlugin interface (ABC)
│   ├── polars/              # Polars plugin
│   │   ├── __init__.py      # PolarsPlugin (streaming/in-memory/GPU)
│   │   └── queries/tpch/    # TPC-H query implementations
│   ├── duckdb/              # DuckDB plugin
│   │   ├── __init__.py      # DuckDBPlugin
│   │   └── queries/         # Query implementations
│   └── pandas/              # Pandas plugin
│       └── __init__.py      # PandasPlugin

Each plugin implements:

setup(config): Initialize the library
load_table(path): Load data from parquet
execute_query(func, data): Execute a query
get_available_queries(): List available queries
load_query(): Load a specific query

This makes it easy to add support for new libraries.

Output

Results are saved as JSON files with comprehensive metadata:

{
  "metadata": {
    "library": "polars",
    "version": "1.32.0",
    "execution_mode": "streaming",
    "scale_factor": 10,
    "timestamp": "2024-01-15T10:30:00"
  },
  "results": {
    "Q1": {
      "load_median": 0.001,
      "exec_median": 0.145,
      "total_median": 0.146,
      "runs": 3
    }
  }
}

Troubleshooting

Common Issues

Import Error: Ensure Polars is installed

pip install polars>=1.0.0

Dataset Generation Fails: Install tpchgen-cli

pip install tpchgen-cli

GPU Mode Issues: GPU dependencies install automatically when needed

pds-benchmark --verbose run --mode gpu  # Shows fallback details

Permission Errors: Ensure write access to data directory

chmod 755 ./data

Debug Mode

Enable verbose logging:

pds-benchmark --verbose run --scale-factor 1

Contributing

Fork the repository
Create a feature branch
Make your changes
Run tests: pytest
Submit a pull request

License

MIT License - see LICENSE for details.

Benchmark Results

Quick comparison of execution modes (TPC-H scale factor 1 & 10):

# Compare modes easily
python compare_modes.py --scale-factor 1 --queries 1,21 --runs 2

Key Findings:

Streaming mode consistently outperforms in-memory mode at larger scales
Complex queries (Q9, Q18, Q21) show 2-3x better performance in streaming
Simple queries show similar performance between modes
Memory usage is significantly lower in streaming mode

Recommendation: Use streaming mode (default) for production workloads.

Testing

Quick Validation

# Run smoke tests (30-60 seconds)
./run_tests.sh

# Or directly:
python test_runner.py

Comprehensive Testing

# Full test suite (2-5 minutes)
./run_tests.sh --full

# Test specific library
./run_tests.sh --polars
./run_tests.sh --duckdb

# Clean start
./run_tests.sh --clean

The test suite validates:

✅ CLI functionality (help, queries)
✅ Polars TPC-H (streaming, in-memory, GPU fallback)
✅ DuckDB TPC-H and TPC-DS
✅ Multiple scale factors
✅ Error handling

Recommendation: Run ./run_tests.sh after any code changes to ensure everything still works.

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

This version

0.1.0

Dec 23, 2025

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pds_benchmark-0.1.0.tar.gz (27.1 kB view details)

Uploaded Dec 23, 2025 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

pds_benchmark-0.1.0-py3-none-any.whl (54.1 kB view details)

Uploaded Dec 23, 2025 Python 3

File details

Details for the file pds_benchmark-0.1.0.tar.gz.

File metadata

Download URL: pds_benchmark-0.1.0.tar.gz
Upload date: Dec 23, 2025
Size: 27.1 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.9.18 {"installer":{"name":"uv","version":"0.9.18","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"macOS","version":null,"id":null,"libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}

File hashes

Hashes for pds_benchmark-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`b70db61b00ef5467d56357252c9f91250db2d20fc796fcb3e77a0e6105ed4b3a`
MD5	`1c1871b47ead726047d76d412c75d6a9`
BLAKE2b-256	`d95caaed4d16b267022b537972af4823c041b7ce350a54d6205af276f98a5c66`

See more details on using hashes here.

File details

Details for the file pds_benchmark-0.1.0-py3-none-any.whl.

File metadata

Download URL: pds_benchmark-0.1.0-py3-none-any.whl
Upload date: Dec 23, 2025
Size: 54.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.9.18 {"installer":{"name":"uv","version":"0.9.18","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"macOS","version":null,"id":null,"libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}

File hashes

Hashes for pds_benchmark-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`af28badd73fecd1cf255cdabf14f04520e193b3fa2c581eb114510eb5ab8b220`
MD5	`94dc6688bfb5655fbbeeac5341f50f7f`
BLAKE2b-256	`e8166e165f2aa4c58d55e79cd837df5d5bd7b9f47c78c9f95a264ff06bd7b20f`

See more details on using hashes here.

pds-benchmark 0.1.0

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

PDS Benchmark

Features

Quick Start

Installation

Basic Usage

Configuration

CLI Options

Advanced Configuration

Execution Modes

Streaming Mode (Default)

In-Memory Mode

GPU Mode

Use Cases

CI/CD Integration

Performance Testing

Requirements

System Requirements

Architecture

Output

Troubleshooting

Common Issues

Debug Mode

Contributing

License

Benchmark Results

Testing

Quick Validation

Comprehensive Testing

Links

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes