Skip to main content

A synthetic pandas query generation tool

Project description

Pandas Query Generator 🐼

Pandas Query Generator (pqg) is a tool designed to help users generate synthetic pandas queries for training machine learning models that estimate query execution costs or predict cardinality.

The binary is called pqg and has only been tested on a unix-based system.

Installation

You can install the query generator using pip, the Python package manager:

pip install pqg

Usage

Below is the standard output of pqg --help, which elaborates on the various command-line arguments the tool accepts:

usage: pqg [--filter] [--groupby-aggregation-probability] [--max-groupby-columns] [--max-merges] [--max-projection-columns] [--max-selection-conditions] [--multi-line] --num-queries [--output-file] [--projection-probability] --schema [--selection-probability] [--sort] [--verbose]

Pandas Query Generator CLI

options:
  -h --help Show this help message and exit
  --filter Filter generated queries by specific criteria
  --groupby-aggregation-probability Probability of including groupby aggregation operations (default: 0.5)
  --max-groupby-columns Maximum number of columns in group by operations (default: 5)
  --max-merges Maximum number of table merges allowed (default: 2)
  --max-projection-columns Maximum number of columns to project (default: 5)
  --max-selection-conditions Maximum number of conditions in selection operations (default: 5)
  --multi-line Format queries on multiple lines (default: False)
  --num-queries num_queries The number of queries to generate
  --output-file The name of the file to write the results to (default: queries.txt)
  --projection-probability Probability of including projection operations (default: 0.5)
  --schema schema Path to the relational schema JSON file
  --selection-probability Probability of including selection operations (default: 0.5)
  --sort Whether or not to sort the queries by complexity (default: False)
  --verbose Print extra generation information and statistics (default: False)

The required options, as shown, are --num-queries and --schema. The --num-queries option simply instructs the program to generate a certain amount of queries.

The --schema option is a pointer to a JSON file path that describes meta-information about the data we're generating queries for.

A sample schema looks like this:

{
  "entities": {
    "customer": {
      "primary_key": "C_CUSTKEY",
      "properties": {
        "C_CUSTKEY": { "type": "int", "min": 1, "max": 1000 },
        "C_NAME": { "type": "string", "starting_character": ["A", "B", "C"] },
        "C_STATUS": { "type": "enum", "values": ["active", "inactive"] }
      },
      "foreign_keys": {}
    },
    "order": {
      "primary_key": "O_ORDERKEY",
      "properties": {
        "O_ORDERKEY": { "type": "int", "min": 1, "max": 5000 },
        "O_CUSTKEY": { "type": "int", "min": 1, "max": 1000 },
        "O_TOTALPRICE": { "type": "float", "min": 10.0, "max": 1000.0 },
        "O_ORDERSTATUS": { "type": "enum", "values": ["pending", "completed", "cancelled"] }
      },
      "foreign_keys": {
        "O_CUSTKEY": ["C_CUSTKEY", "customer"]
      }
    }
  }
}

This file can be found in /examples/customer/schema.json, generate a few queries from this schema with pqg --num-queries 100 --schema examples/customer/schema.json --verbose.

Schemas for these files can be found in their respective directories within /examples.

How does it work?

Check out the paper in the /docs folder for more information!

Prior Art

This version of the Pandas Query Generator is based off of the thorough research work of previous students of COMP 400 at McGill University, namely Edge Satir, Hongxin Huo and Dailun Li.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pqg-0.2.0.tar.gz (22.2 kB view details)

Uploaded Source

Built Distribution

pqg-0.2.0-py3-none-any.whl (23.9 kB view details)

Uploaded Python 3

File details

Details for the file pqg-0.2.0.tar.gz.

File metadata

  • Download URL: pqg-0.2.0.tar.gz
  • Upload date:
  • Size: 22.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.4.30

File hashes

Hashes for pqg-0.2.0.tar.gz
Algorithm Hash digest
SHA256 c174ce09d9687d1c848877a1faa6162d2b8a6ac72fb959c7d3d8af1e44f8c57a
MD5 8fd489be94890d355b6a01b55e266518
BLAKE2b-256 787b017c2799654d6c938ce2cf76649c7498c56bd3d6e59bd99c9078214817b0

See more details on using hashes here.

File details

Details for the file pqg-0.2.0-py3-none-any.whl.

File metadata

  • Download URL: pqg-0.2.0-py3-none-any.whl
  • Upload date:
  • Size: 23.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.4.30

File hashes

Hashes for pqg-0.2.0-py3-none-any.whl
Algorithm Hash digest
SHA256 e68bb3129f323527ee649df6b1df44a9a01e4220ec2970ba2b32761b2a1d3906
MD5 b435bd73c4150240a4f8b54079a3a688
BLAKE2b-256 bb65d43dff85ae634559284aba3f04dd5136a19c6bd254a30d84e6c6195a6206

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page