PDS Benchmark

A minimal, Polars-focused data processing benchmark suite for evaluating performance across different execution modes.

Features

  • Polars-focused: Built around and optimized for Polars query execution
  • Minimal dependencies: Core requirements are Polars and tqdm
  • Easy installation: A single pip command
  • TPC-H benchmarks: Industry-standard TPC-H queries (1-22)
  • Flexible configuration: CLI options plus JSON-based advanced settings
  • Multiple modes: Streaming, in-memory, and GPU-accelerated execution

Quick Start

Installation

# Install from PyPI
pip install pds-benchmark

# Or install from source
git clone https://github.com/pds-benchmark/pds-benchmark
cd pds-benchmark
pip install -e .

Basic Usage

# Run with defaults (Polars streaming, scale factor 1)
pds-benchmark run

# Run with specific options
pds-benchmark run --scale-factor 10 --mode streaming --runs 3

# Run specific queries
pds-benchmark run --queries "1,6,12"

# Generate a TPC-H dataset
pds-benchmark generate tpch --scale-factor 10 --output "./dataset"

Configuration

CLI Options

pds-benchmark run --help

Key options:

  • --scale-factor, -s: TPC-H scale factor (default: 1)
  • --mode: Execution mode - streaming, in-memory, or gpu (default: streaming)
  • --runs, -r: Number of measurement runs (default: 3)
  • --queries: Comma-separated query numbers to run
  • --data-path: Directory for datasets (default: ./data)
  • --extra-config: JSON string for advanced configuration
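The --queries value is a comma-separated string such as "1,6,12". As a minimal sketch of how such a string could be validated against the TPC-H range 1-22, the hypothetical helper below (parse_queries is an illustrative name, not part of pds-benchmark) shows one way to do it before launching a long run:

```python
def parse_queries(spec: str, lowest: int = 1, highest: int = 22) -> list[int]:
    """Turn a string such as "1,6,12" into a sorted list of unique query numbers.

    Hypothetical helper for illustration; the real CLI may parse differently.
    """
    numbers = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas such as "1,,6"
        number = int(part)  # raises ValueError on non-numeric input
        if not lowest <= number <= highest:
            raise ValueError(f"query {number} is outside {lowest}-{highest}")
        numbers.add(number)
    return sorted(numbers)


print(parse_queries("1,6,12"))  # [1, 6, 12]
```

Checking the list up front means a typo such as "1,6,212" fails immediately rather than partway through a benchmark run.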

Advanced Configuration

Use the --extra-config option to pass advanced settings as a JSON string:

pds-benchmark run --mode gpu --extra-config '{"gpu_device": 1, "gpu_memory_resource": "cuda-pool"}'

Key                  Description                                   Applicable To
-------------------  --------------------------------------------  -----------------
gpu_device           GPU device ID (e.g., 0, 1)                    Polars (GPU mode)
gpu_memory_resource  Memory resource type (e.g., cuda, cuda-pool)  Polars (GPU mode)
threads              Number of threads to use                      DuckDB
memory_limit         Memory limit (e.g., "10GB")                   DuckDB
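Hand-writing JSON inside shell quotes is easy to get wrong. The sketch below, which assumes only the flags documented above, builds the same command from a Python dict so that json.dumps handles the JSON syntax; build_run_command is a hypothetical helper name, not part of the package.

```python
import json
import shlex


def build_run_command(mode: str, extra_config: dict) -> list[str]:
    """Assemble an argument list for `pds-benchmark run` (illustrative helper)."""
    return [
        "pds-benchmark",
        "run",
        "--mode",
        mode,
        "--extra-config",
        json.dumps(extra_config),  # serialises the dict to a valid JSON string
    ]


cmd = build_run_command("gpu", {"gpu_device": 1, "gpu_memory_resource": "cuda-pool"})

# shlex.join quotes the JSON argument so the line can be pasted into a shell.
print(shlex.join(cmd))
```

The list form can also be passed directly to subprocess.run(cmd, check=True), which avoids shell quoting altogether.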

Execution Modes

Streaming Mode (Default)

Memory-efficient processing for large datasets:

pds-benchmark run --mode streaming --scale-factor 100

In-Memory Mode

Load all data into memory for fastest query execution:

pds-benchmark run --mode in-memory --scale-factor 10

GPU Mode

GPU-accelerated processing (requires NVIDIA GPU + CUDA):

pds-benchmark run --mode gpu --scale-factor 10

GPU dependencies are installed automatically on first use. If no GPU is available, the run falls back to CPU execution.

Use Cases

CI/CD Integration

# .github/workflows/benchmark.yml
name: Performance Benchmark
on: [push, pull_request]

jobs:
  benchmark:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v4
        with:
          python-version: "3.11"
      
      - name: Install benchmark
        run: pip install pds-benchmark
      
      - name: Run benchmark
        run: pds-benchmark run --scale-factor 1 --runs 1
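A CI job is most useful when it can fail on a regression. As a rough sketch, assuming the results JSON layout shown in the Output section, a follow-up step could flag queries whose exec_median exceeds a time budget. The function name and the 1.0 second budget are hypothetical, and the location of the results file is not assumed here, so the sample report is inlined.

```python
import json


def queries_over_budget(report: dict, budget_seconds: float) -> list[str]:
    """Return the names of queries whose exec_median exceeds the budget."""
    return [
        name
        for name, stats in report["results"].items()
        if stats["exec_median"] > budget_seconds
    ]


# Inline sample in the documented layout; in CI this would come from
# json.load() on the results file written by the benchmark run.
report = json.loads('{"results": {"Q1": {"exec_median": 0.145, "runs": 1}}}')

slow = queries_over_budget(report, budget_seconds=1.0)  # budget is an assumption
print("over budget:", slow)  # over budget: []
```

In a workflow step, exiting with a non-zero status when the list is non-empty (for example via raise SystemExit(1)) would mark the job as failed.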

Performance Testing

# Quick performance check
pds-benchmark run --scale-factor 1 --runs 1

# Full benchmark suite
pds-benchmark run --scale-factor 10 --runs 5

# Compare execution modes
pds-benchmark run --mode streaming --scale-factor 10 --runs 3
pds-benchmark run --mode in-memory --scale-factor 10 --runs 3
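Once both modes have been run, the two result files can be compared per query. The sketch below assumes the JSON layout shown in the Output section and computes a simple ratio of exec_median values; the helper names are hypothetical and the numbers are placeholders chosen for illustration, not measurements.

```python
def query_number(name: str) -> int:
    """Sort key so that "Q2" comes before "Q10"."""
    return int(name.lstrip("Q"))


def speedups(baseline: dict, candidate: dict, metric: str = "exec_median") -> dict:
    """Return baseline/candidate ratios for queries present in both reports."""
    shared = baseline["results"].keys() & candidate["results"].keys()
    ratios = {}
    for name in sorted(shared, key=query_number):
        candidate_time = candidate["results"][name][metric]
        if candidate_time > 0:  # skip degenerate timings to avoid dividing by zero
            ratios[name] = baseline["results"][name][metric] / candidate_time
    return ratios


# Placeholder reports (not real benchmark results).
in_memory = {"results": {"Q1": {"exec_median": 0.50}, "Q6": {"exec_median": 0.125}}}
streaming = {"results": {"Q1": {"exec_median": 0.25}, "Q6": {"exec_median": 0.125}}}

for name, ratio in speedups(in_memory, streaming).items():
    print(f"{name}: streaming is {ratio:.2f}x the speed of in-memory")
```

A ratio above 1 means the candidate (here streaming) was faster than the baseline for that query.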

Requirements

  • Python 3.11+
  • Polars 1.0+
  • tqdm
  • For GPU mode: NVIDIA GPU with CUDA support

System Requirements

  • Memory: 4GB+ RAM for scale factor 1, 16GB+ for scale factor 10
  • Storage: 1GB+ free space for datasets; larger scale factors need proportionally more
  • GPU (optional): CUDA 11.8+ compatible GPU for GPU mode

Architecture

The benchmark uses a plugin-based architecture:

src/pds_benchmark/
├── executor.py              # Generic benchmark runner
├── plugins/
│   ├── __init__.py          # BenchmarkPlugin interface (ABC)
│   ├── polars/              # Polars plugin
│   │   ├── __init__.py      # PolarsPlugin (streaming/in-memory/GPU)
│   │   └── queries/tpch/    # TPC-H query implementations
│   ├── duckdb/              # DuckDB plugin
│   │   ├── __init__.py      # DuckDBPlugin
│   │   └── queries/         # Query implementations
│   └── pandas/              # Pandas plugin
│       └── __init__.py      # PandasPlugin

Each plugin implements:

  • setup(config): Initialize the library
  • load_table(path): Load data from parquet
  • execute_query(func, data): Execute a query
  • get_available_queries(): List available queries
  • load_query(): Load a specific query

This makes it easy to add support for new libraries.
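To make the interface concrete, here is a minimal sketch of what such an abstract base class and a toy plugin could look like. The method names come from the list above; the type hints, the name argument to load_query, and the ListPlugin class are assumptions for illustration, so the actual definitions in plugins/__init__.py may differ.

```python
from abc import ABC, abstractmethod
from typing import Any, Callable


class BenchmarkPlugin(ABC):
    """Illustrative interface; signatures are assumed, not copied from the package."""

    @abstractmethod
    def setup(self, config: dict) -> None: ...

    @abstractmethod
    def load_table(self, path: str) -> Any: ...

    @abstractmethod
    def execute_query(self, func: Callable[[Any], Any], data: Any) -> Any: ...

    @abstractmethod
    def get_available_queries(self) -> list: ...

    @abstractmethod
    def load_query(self, name: str) -> Callable[[Any], Any]: ...


class ListPlugin(BenchmarkPlugin):
    """Toy plugin that treats a Python list as its 'table'."""

    def setup(self, config: dict) -> None:
        self.config = dict(config)

    def load_table(self, path: str) -> Any:
        return [1, 2, 3]  # stands in for reading a parquet file at `path`

    def execute_query(self, func, data):
        return func(data)

    def get_available_queries(self) -> list:
        return ["sum"]

    def load_query(self, name: str):
        return {"sum": sum}[name]


plugin = ListPlugin()
plugin.setup({})
data = plugin.load_table("unused.parquet")
result = plugin.execute_query(plugin.load_query("sum"), data)
print(result)  # 6
```

Because every method is abstract, a plugin that forgets one of them fails at instantiation time rather than in the middle of a benchmark.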

Output

Results are saved as JSON files with comprehensive metadata:

{
  "metadata": {
    "library": "polars",
    "version": "1.32.0",
    "execution_mode": "streaming",
    "scale_factor": 10,
    "timestamp": "2024-01-15T10:30:00"
  },
  "results": {
    "Q1": {
      "load_median": 0.001,
      "exec_median": 0.145,
      "total_median": 0.146,
      "runs": 3
    }
  }
}
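The load_median, exec_median, and total_median fields summarise repeated runs. How the tool computes them is not spelled out here, so the following is only a sketch of one plausible approach: time the load and execute phases separately with time.perf_counter and take the median over the runs. The measure function and its arguments are hypothetical.

```python
import statistics
import time


def measure(load, execute, runs: int = 3) -> dict:
    """Time `load()` and `execute(data)` over several runs and report medians."""
    load_times, exec_times, total_times = [], [], []
    for _ in range(runs):
        start = time.perf_counter()
        data = load()
        loaded = time.perf_counter()
        execute(data)
        done = time.perf_counter()
        load_times.append(loaded - start)
        exec_times.append(done - loaded)
        total_times.append(done - start)
    return {
        "load_median": statistics.median(load_times),
        "exec_median": statistics.median(exec_times),
        # Median of per-run totals (an assumption); this need not equal
        # load_median + exec_median.
        "total_median": statistics.median(total_times),
        "runs": runs,
    }


# Stand-in workload: build a list, then sum it.
stats = measure(lambda: list(range(100_000)), sum, runs=3)
print(stats)
```

Using the median rather than the mean keeps a single slow run, for example one hit by a cold cache, from dominating the reported figure.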

Troubleshooting

Common Issues

Import Error: Ensure Polars is installed

pip install "polars>=1.0.0"

Dataset Generation Fails: Install tpchgen-cli

pip install tpchgen-cli

GPU Mode Issues: GPU dependencies install automatically when needed; run with --verbose to see why a run fell back to CPU

pds-benchmark --verbose run --mode gpu  # Shows fallback details

Permission Errors: Ensure write access to data directory

chmod 755 ./data

Debug Mode

Enable verbose logging:

pds-benchmark --verbose run --scale-factor 1

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Run tests: pytest
  5. Submit a pull request

License

MIT License - see LICENSE for details.

Benchmark Results

Quick comparison of execution modes (TPC-H scale factors 1 and 10):

# Compare modes easily
python compare_modes.py --scale-factor 1 --queries 1,21 --runs 2

Key Findings:

  • Streaming mode consistently outperforms in-memory mode at larger scales
  • Complex queries (Q9, Q18, Q21) run 2-3x faster in streaming mode
  • Simple queries show similar performance between modes
  • Memory usage is significantly lower in streaming mode

Recommendation: Use streaming mode (default) for production workloads.

Testing

Quick Validation

# Run smoke tests (30-60 seconds)
./run_tests.sh

# Or directly:
python test_runner.py

Comprehensive Testing

# Full test suite (2-5 minutes)
./run_tests.sh --full

# Test specific library
./run_tests.sh --polars
./run_tests.sh --duckdb

# Clean start
./run_tests.sh --clean

The test suite validates:

  • ✅ CLI functionality (help, queries)
  • ✅ Polars TPC-H (streaming, in-memory, GPU fallback)
  • ✅ DuckDB TPC-H and TPC-DS
  • ✅ Multiple scale factors
  • ✅ Error handling

Recommendation: Run ./run_tests.sh after any code changes to ensure everything still works.
