Skip to main content

A synthetic pandas query generation tool

Project description

Pandas Query Generator 🐼

Pandas Query Generator (pqg) is a tool designed to help users generate synthetic pandas queries for training machine learning models that estimate query execution costs or predict cardinality.

The distributed Python package is called pqg, and has only been tested on a unix-based system.

web-view Web interface showing query statistics after generation cli-view Generated query output and execution results with the CLI

Installation

You can install the query generator using pip, the Python package manager:

pip install pqg

Alternatively, you can use the local web playground:

cd www && bun install && bunx --bun vite

n.b. This command will require you to have bun installed on your machine.

This will spin up a development server at http://localhost:5173 where you can interact with the playground. You can upload your schemas, tweak query parameters and generate queries.

Usage

The query generator exposes both a command-line tool and library interface.

CLI

Below is the standard output of pqg --help, which elaborates on the various command-line arguments the tool accepts:

usage: pqg [--disable-multi-processing] [--ensure-non-empty] [--filter] [--groupby-aggregation-probability] [--max-groupby-columns] [--max-merges] [--max-projection-columns] [--max-selection-conditions] [--multi-line] --num-queries [--output-file] [--projection-probability] --schema [--selection-probability] [--sort] [--verbose]

Pandas Query Generator CLI

options:
  -h --help Show this help message and exit
  --disable-multi-processing Generate and execute queries in a consecutive fashion (default: False)
  --ensure-non-empty Ensure generated queries return a non-empty result set when executed on sample data (default: False)
  --filter Filter generated queries by specific criteria
  --groupby-aggregation-probability Probability of including groupby aggregation operations (default: 0.5)
  --max-groupby-columns Maximum number of columns in group by operations (default: 5)
  --max-merges Maximum number of table merges allowed (default: 2)
  --max-projection-columns Maximum number of columns to project (default: 5)
  --max-selection-conditions Maximum number of conditions in selection operations (default: 5)
  --multi-line Format queries on multiple lines (default: False)
  --num-queries num_queries The number of queries to generate
  --output-file The name of the file to write the results to
  --projection-probability Probability of including projection operations (default: 0.5)
  --schema schema Path to the relational schema JSON file
  --selection-probability Probability of including selection operations (default: 0.5)
  --sort Whether or not to sort the queries by complexity (default: False)
  --verbose Print extra generation information and statistics (default: False)

The required options, as shown, are --num-queries and --schema. The --num-queries option simply instructs the program to generate a certain amount of queries.

The --schema option is a pointer to a JSON file path that describes meta-information about the data we're generating queries for.

A sample schema looks like this:

{
  "entities": {
    "customer": {
      "primary_key": "C_CUSTKEY",
      "properties": {
        "C_CUSTKEY": { "type": "int", "min": 1, "max": 1000 },
        "C_NAME": { "type": "string", "starting_character": ["A", "B", "C"] },
        "C_STATUS": { "type": "enum", "values": ["active", "inactive"] }
      },
      "foreign_keys": {}
    },
    "order": {
      "primary_key": "O_ORDERKEY",
      "properties": {
        "O_ORDERKEY": { "type": "int", "min": 1, "max": 5000 },
        "O_CUSTKEY": { "type": "int", "min": 1, "max": 1000 },
        "O_TOTALPRICE": { "type": "float", "min": 10.0, "max": 1000.0 },
        "O_ORDERSTATUS": {
          "type": "enum",
          "values": ["pending", "completed", "cancelled"]
        }
      },
      "foreign_keys": {
        "O_CUSTKEY": ["C_CUSTKEY", "customer"]
      }
    }
  }
}

This file can be found in /examples/customer/schema.json, generate a few queries from this schema with pqg --num-queries 100 --schema examples/customer/schema.json --verbose.

Other example schema files can be found under the /examples directory.

Library

We expose various structures that make it easy to generate queries fast:

from pqg import Generator, GenerateOptions, Schema, QueryStructure, QueryPool, QueryFilter

# Assumes `schema.json` exists and conforms to the schema format
schema = Schema.from_file('schema.json')

query_structure = QueryStructure(
  groupby_aggregation_probability=0.5,
  max_groupby_columns=4,
  max_merges=10,
  max_projection_columns=5,
  max_selection_conditions=10,
  projection_probability=0.5,
  selection_probability=0.5
)

generator = Generator(schema, query_structure)

# Generate 1000 queries
generate_options = GenerateOptions(num_queries=1000)
query_pool: QueryPool = generator.generate(generate_options)

# Filter out queries with non-empty result sets
query_pool.filter(QueryFilter.NON_EMPTY)

# Sort queries by complexity
query_pool.sort()

# Output each query
print(*query_pool, sep='\n\n')

Comprehensive internal documentation is generated using the sphinx Python package.

You can generate the documentation using the following command in the project root:

cd docs && uv run sphinx-build -M html source build

...then serve it with:

python3 -m http.server 8000 --directory docs/build/html

This will serve the documentation files at http://localhost:8000. Open it up in your preferred browser to see the generated site.

How does it work?

Check out the paper in the /docs folder for more information!

Prior Art

This version of the Pandas Query Generator is based off of the thorough research work of previous students of COMP 400 at McGill University, namely Edge Satir, Hongxin Huo and Dailun Li.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pqg-0.3.2.tar.gz (24.4 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

pqg-0.3.2-py3-none-any.whl (25.8 kB view details)

Uploaded Python 3

File details

Details for the file pqg-0.3.2.tar.gz.

File metadata

  • Download URL: pqg-0.3.2.tar.gz
  • Upload date:
  • Size: 24.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.5.4

File hashes

Hashes for pqg-0.3.2.tar.gz
Algorithm Hash digest
SHA256 9a76e67b6c8e9872a2a2f679f5b59ca13abbe00336a3761cdbe41db4690a5c9a
MD5 cd3c1660784e52e3ec15784202f0f16c
BLAKE2b-256 24e862fff14900c91f87b6c0cd3f62182c5858e7fee9b4c7a6bee609462bf899

See more details on using hashes here.

File details

Details for the file pqg-0.3.2-py3-none-any.whl.

File metadata

  • Download URL: pqg-0.3.2-py3-none-any.whl
  • Upload date:
  • Size: 25.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.5.4

File hashes

Hashes for pqg-0.3.2-py3-none-any.whl
Algorithm Hash digest
SHA256 fec400eeaebc87b5130156d65a8d5741da213fd57c130f83253f82c6a9752c97
MD5 843afeca734ed719ecf9ad66164cf54f
BLAKE2b-256 d4f23a6aa4958d048bf16fa93201c3c71f47c11fd23e7a92aa654175e03b5806

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page