PDS Benchmark
A minimal, Polars-focused data processing benchmark suite for evaluating performance across different execution modes.
Features
- Polars-focused: Optimized for Polars with support for streaming, in-memory, and GPU execution
- Minimal dependencies: Lightweight with essential dependencies only
- Easy installation: Single command installation via pip
- TPC-H benchmarks: Industry-standard TPC-H queries (1-22)
- Flexible configuration: Comprehensive CLI options
- Multiple modes: Streaming, in-memory, and GPU acceleration support
Quick Start
Installation
# Install from PyPI (when published)
pip install pds-benchmark
# Or install from source
git clone https://github.com/pds-benchmark/pds-benchmark
cd pds-benchmark
pip install -e .
Basic Usage
# Run with defaults (Polars streaming, scale factor 1)
pds-benchmark run
# Run with specific options
pds-benchmark run --scale-factor 10 --mode streaming --runs 3
# Run specific queries
pds-benchmark run --queries "1,6,12"
# Generate a TPC-H dataset
pds-benchmark generate tpch --scale-factor 10 --output "./dataset"
Configuration
CLI Options
pds-benchmark run --help
Key options:
- --scale-factor, -s: TPC-H scale factor (default: 1)
- --mode: Execution mode - streaming, in-memory, or gpu (default: streaming)
- --runs, -r: Number of measurement runs (default: 3)
- --queries: Comma-separated query numbers to run
- --data-path: Directory for datasets (default: ./data)
- --extra-config: JSON string for advanced configuration
Advanced Configuration
Use the --extra-config option to pass advanced settings as a JSON string:
pds-benchmark run --mode gpu --extra-config '{"gpu_device": 1, "gpu_memory_resource": "cuda-pool"}'
| Key | Description | Applicable To |
|---|---|---|
| gpu_device | GPU device ID (e.g., 0, 1) | Polars (GPU mode) |
| gpu_memory_resource | Memory resource type (e.g., cuda, cuda-pool) | Polars (GPU mode) |
| threads | Number of threads to use | DuckDB |
| memory_limit | Memory limit (e.g., "10GB") | DuckDB |
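When the --extra-config value is assembled by a script, quoting the JSON by hand is easy to get wrong. The sketch below builds the command line from a Python dict instead. The helper name build_run_command is hypothetical and not part of pds-benchmark; only the flags and keys shown above are used.

```python
import json
import shlex


def build_run_command(mode: str, extra_config: dict) -> str:
    """Return a POSIX-shell-safe `pds-benchmark run` command line.

    Hypothetical helper for illustration; not part of pds-benchmark.
    """
    # Serialize the settings to a single-line JSON string.
    payload = json.dumps(extra_config, separators=(",", ":"))
    argv = ["pds-benchmark", "run", "--mode", mode, "--extra-config", payload]
    # shlex.join quotes every argument that needs it, including the JSON.
    return shlex.join(argv)


command = build_run_command(
    "gpu", {"gpu_device": 1, "gpu_memory_resource": "cuda-pool"}
)
print(command)
```

Passing the arguments as a list to subprocess.run would avoid shell quoting altogether; the string form is mainly useful for logging or for pasting into a CI step.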
Execution Modes
Streaming Mode (Default)
Memory-efficient processing for large datasets:
pds-benchmark run --mode streaming --scale-factor 100
In-Memory Mode
Load all data into memory for fastest query execution:
pds-benchmark run --mode in-memory --scale-factor 10
GPU Mode
GPU-accelerated processing (requires NVIDIA GPU + CUDA):
pds-benchmark run --mode gpu --scale-factor 10
GPU dependencies are installed automatically on first use. If GPU execution is unavailable, the benchmark falls back to CPU.
Use Cases
CI/CD Integration
# .github/workflows/benchmark.yml
name: Performance Benchmark
on: [push, pull_request]
jobs:
  benchmark:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v4
        with:
          python-version: "3.11"
      - name: Install benchmark
        run: pip install pds-benchmark
      - name: Run benchmark
        run: pds-benchmark run --scale-factor 1 --runs 1
Performance Testing
# Quick performance check
pds-benchmark run --scale-factor 1 --runs 1
# Full benchmark suite
pds-benchmark run --scale-factor 10 --runs 5
# Compare execution modes
pds-benchmark run --mode streaming --scale-factor 10 --runs 3
pds-benchmark run --mode in-memory --scale-factor 10 --runs 3
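The mode comparison above can also be driven from a script so that both runs share the same scale factor and run count. This is a minimal sketch: the function names are hypothetical, and it assumes pds-benchmark is installed and on PATH before run_comparison is called.

```python
import shutil
import subprocess

# The two CPU modes compared in the commands above.
MODES = ("streaming", "in-memory")


def comparison_commands(scale_factor: int = 10, runs: int = 3) -> list[list[str]]:
    """Build one `pds-benchmark run` argument list per execution mode."""
    return [
        [
            "pds-benchmark", "run",
            "--mode", mode,
            "--scale-factor", str(scale_factor),
            "--runs", str(runs),
        ]
        for mode in MODES
    ]


def run_comparison(scale_factor: int = 10, runs: int = 3) -> None:
    """Run the modes one after another; requires the CLI to be installed."""
    if shutil.which("pds-benchmark") is None:
        raise RuntimeError("pds-benchmark is not on PATH")
    for argv in comparison_commands(scale_factor, runs):
        # check=True stops the comparison if a run fails.
        subprocess.run(argv, check=True)


# Print the commands without executing them.
for argv in comparison_commands():
    print(" ".join(argv))
```

Running the modes sequentially rather than in parallel keeps them from competing for CPU and memory, which would otherwise distort the timings.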
Requirements
- Python 3.11+
- Polars 1.0+
- tqdm
- For GPU mode: NVIDIA GPU with CUDA support
System Requirements
- Memory: 4GB+ RAM for scale factor 1, 16GB+ for scale factor 10
- Storage: 1GB+ free space for datasets
- GPU (optional): CUDA 11.8+ compatible GPU for GPU mode
Architecture
The benchmark uses a plugin-based architecture:
src/pds_benchmark/
├── executor.py              # Generic benchmark runner
├── plugins/
│   ├── __init__.py          # BenchmarkPlugin interface (ABC)
│   ├── polars/              # Polars plugin
│   │   ├── __init__.py      # PolarsPlugin (streaming/in-memory/GPU)
│   │   └── queries/tpch/    # TPC-H query implementations
│   ├── duckdb/              # DuckDB plugin
│   │   ├── __init__.py      # DuckDBPlugin
│   │   └── queries/         # Query implementations
│   └── pandas/              # Pandas plugin
│       └── __init__.py      # PandasPlugin
Each plugin implements:
- setup(config): Initialize the library
- load_table(path): Load data from Parquet
- execute_query(func, data): Execute a query
- get_available_queries(): List available queries
- load_query(): Load a specific query
This makes it easy to add support for new libraries.
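To illustrate the shape of such a plugin, here is a rough sketch of an abstract base class with the five methods listed above, plus a toy implementation. The type hints, the query_number parameter of load_query, and the ToyPlugin class are assumptions made for this example; the actual signatures in pds_benchmark.plugins may differ.

```python
from abc import ABC, abstractmethod
from typing import Any, Callable


class BenchmarkPlugin(ABC):
    """Assumed shape of the plugin interface; real signatures may differ."""

    @abstractmethod
    def setup(self, config: dict[str, Any]) -> None: ...

    @abstractmethod
    def load_table(self, path: str) -> Any: ...

    @abstractmethod
    def execute_query(self, func: Callable[[Any], Any], data: Any) -> Any: ...

    @abstractmethod
    def get_available_queries(self) -> list[int]: ...

    @abstractmethod
    def load_query(self, query_number: int) -> Callable[[Any], Any]: ...


class ToyPlugin(BenchmarkPlugin):
    """Stand-in plugin that works on plain Python lists (illustration only)."""

    def setup(self, config):
        self.config = dict(config)

    def load_table(self, path):
        # A real plugin would read the Parquet file at `path`.
        return [1, 2, 3, 4]

    def execute_query(self, func, data):
        return func(data)

    def get_available_queries(self):
        return [1]

    def load_query(self, query_number):
        if query_number != 1:
            raise KeyError(query_number)
        return sum  # the only "query" here adds up the values


plugin = ToyPlugin()
plugin.setup({})
data = plugin.load_table("lineitem.parquet")
result = plugin.execute_query(plugin.load_query(1), data)
print(result)  # prints 10
```

Because the base class is abstract, a plugin that omits any of the five methods cannot be instantiated, so an incomplete plugin fails before a benchmark starts rather than partway through it.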
Output
Results are saved as JSON files with comprehensive metadata:
{
  "metadata": {
    "library": "polars",
    "version": "1.32.0",
    "execution_mode": "streaming",
    "scale_factor": 10,
    "timestamp": "2024-01-15T10:30:00"
  },
  "results": {
    "Q1": {
      "load_median": 0.001,
      "exec_median": 0.145,
      "total_median": 0.146,
      "runs": 3
    }
  }
}
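A result file with this layout can be summarized in a few lines of Python. The sketch below parses the sample shown above from an inline string; in practice the document would be read from the JSON file the benchmark wrote, whose name and location are not specified here. The summarize function is a hypothetical helper.

```python
import json

# The sample result document from above, inlined so the sketch is self-contained.
SAMPLE = """
{
  "metadata": {
    "library": "polars",
    "version": "1.32.0",
    "execution_mode": "streaming",
    "scale_factor": 10,
    "timestamp": "2024-01-15T10:30:00"
  },
  "results": {
    "Q1": {"load_median": 0.001, "exec_median": 0.145, "total_median": 0.146, "runs": 3}
  }
}
"""


def summarize(document: dict) -> list[tuple[str, float, float, float]]:
    """Return (query, load, exec, total) rows, slowest total first."""
    rows = [
        (name, stats["load_median"], stats["exec_median"], stats["total_median"])
        for name, stats in document["results"].items()
    ]
    return sorted(rows, key=lambda row: row[3], reverse=True)


doc = json.loads(SAMPLE)
meta = doc["metadata"]
print(f'{meta["library"]} {meta["version"]} '
      f'({meta["execution_mode"]}, scale factor {meta["scale_factor"]})')
for name, load, exec_time, total in summarize(doc):
    print(f"{name}: load={load:.3f} exec={exec_time:.3f} total={total:.3f}")
```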
Troubleshooting
Common Issues
Import Error: Ensure Polars is installed
pip install "polars>=1.0.0"
Dataset Generation Fails: Install tpchgen-cli
pip install tpchgen-cli
GPU Mode Issues: GPU dependencies install automatically when needed
pds-benchmark --verbose run --mode gpu # Shows fallback details
Permission Errors: Ensure write access to data directory
chmod 755 ./data
Debug Mode
Enable verbose logging:
pds-benchmark --verbose run --scale-factor 1
Contributing
- Fork the repository
- Create a feature branch
- Make your changes
- Run tests: pytest
- Submit a pull request
License
MIT License - see LICENSE for details.
Benchmark Results
Quick comparison of execution modes (TPC-H scale factor 1 & 10):
# Compare modes easily
python compare_modes.py --scale-factor 1 --queries 1,21 --runs 2
Key Findings:
- Streaming mode consistently outperforms in-memory mode at larger scales
- Complex queries (Q9, Q18, Q21) run 2-3x faster in streaming mode
- Simple queries show similar performance between modes
- Memory usage is significantly lower in streaming mode
Recommendation: Use streaming mode (default) for production workloads.
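To check findings like these on your own hardware, the per-query ratio between two result documents can be computed as sketched below. The speedups helper is hypothetical, and the numbers in the two dicts are made-up placeholders, not measurements. The sketch assumes the "Q<number>" keys and the exec_median field used in the result format this project writes.

```python
def speedups(baseline: dict, candidate: dict) -> dict[str, float]:
    """Return baseline/candidate exec_median ratios for queries in both runs.

    A ratio above 1.0 means the candidate run was faster for that query.
    """
    shared = baseline["results"].keys() & candidate["results"].keys()
    ratios = {}
    # Sort numerically so Q9 comes before Q18 (assumes "Q<number>" keys).
    for query in sorted(shared, key=lambda q: int(q[1:])):
        candidate_time = candidate["results"][query]["exec_median"]
        if candidate_time > 0:
            baseline_time = baseline["results"][query]["exec_median"]
            ratios[query] = baseline_time / candidate_time
    return ratios


# Placeholder values for illustration only; these are NOT benchmark results.
in_memory = {"results": {"Q1": {"exec_median": 0.2}, "Q9": {"exec_median": 3.0}}}
streaming = {"results": {"Q1": {"exec_median": 0.2}, "Q9": {"exec_median": 1.5}}}

for query, ratio in speedups(in_memory, streaming).items():
    print(f"{query}: streaming is {ratio:.1f}x the in-memory speed")
```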
Testing
Quick Validation
# Run smoke tests (30-60 seconds)
./run_tests.sh
# Or directly:
python test_runner.py
Comprehensive Testing
# Full test suite (2-5 minutes)
./run_tests.sh --full
# Test specific library
./run_tests.sh --polars
./run_tests.sh --duckdb
# Clean start
./run_tests.sh --clean
The test suite validates:
- ✅ CLI functionality (help, queries)
- ✅ Polars TPC-H (streaming, in-memory, GPU fallback)
- ✅ DuckDB TPC-H and TPC-DS
- ✅ Multiple scale factors
- ✅ Error handling
Recommendation: Run ./run_tests.sh after any code changes to ensure everything still works.