Fuzzy Data Benchmark

fuzzydata

The fuzzydata Workflow Generator

The fuzzydata workflow generator enables:

  • Abstract specification of Dataframe-based Workflows
  • Generation of randomized tables and workflows
  • Loading and replay of workflows on multiple clients

fuzzydata currently runs on the following clients (the values accepted by the --wf_client option described under Usage): pandas, modin, and sql.

fuzzydata is designed to be extensible, so you can implement your own client. See the existing clients in fuzzydata/clients for examples of how to extend the abstract Artifact, Operation and Workflow classes.
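The general shape of such an extension can be sketched without the library itself. The base classes below are stand-ins written for this illustration; they are not fuzzydata's real Artifact, Operation and Workflow classes, and every method name (`num_rows`, `apply`, `run`) is hypothetical. The existing clients in fuzzydata/clients remain the reference for the actual interfaces.

```python
from abc import ABC, abstractmethod

# Stand-in base classes for illustration only. They are NOT fuzzydata's real
# abstract classes; all method names here are hypothetical.
class SketchArtifact(ABC):
    def __init__(self, label):
        self.label = label

    @abstractmethod
    def num_rows(self):
        """Return the number of rows held by this artifact."""


class SketchOperation(ABC):
    @abstractmethod
    def apply(self, artifact, new_label):
        """Return a new artifact derived from `artifact`."""


# A toy "client" whose tables are plain lists of dicts.
class ListArtifact(SketchArtifact):
    def __init__(self, label, rows):
        super().__init__(label)
        self.rows = rows

    def num_rows(self):
        return len(self.rows)


class ListFilter(SketchOperation):
    def __init__(self, column, minimum):
        self.column = column
        self.minimum = minimum

    def apply(self, artifact, new_label):
        # Keep only rows whose value in `column` is at least `minimum`.
        kept = [row for row in artifact.rows if row[self.column] >= self.minimum]
        return ListArtifact(new_label, kept)


class SketchWorkflow:
    """Tracks artifacts by label and records which operation produced each one."""

    def __init__(self):
        self.artifacts = {}
        self.edges = []  # (source_label, operation_name, target_label)

    def add_artifact(self, artifact):
        self.artifacts[artifact.label] = artifact

    def run(self, source_label, operation, new_label):
        result = operation.apply(self.artifacts[source_label], new_label)
        self.add_artifact(result)
        self.edges.append((source_label, type(operation).__name__, new_label))
        return result


if __name__ == "__main__":
    wf = SketchWorkflow()
    wf.add_artifact(ListArtifact("artifact_0", [{"x": i} for i in range(10)]))
    out = wf.run("artifact_0", ListFilter("x", 5), "artifact_1")
    print(out.label, out.num_rows())
```

The point of the split is that the workflow only talks to the abstract interface, so a different backend can be swapped in by supplying new artifact and operation subclasses.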

Installation

Install fuzzydata from PyPI using pip:

pip install fuzzydata

fuzzydata does not install modin or SQLAlchemy by default. To include them, pick one of the optional install extras:

pip install fuzzydata[modin|sql|all]
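Since the optional backends are separate installs, it can be useful to check which ones are importable before choosing a client. The following is a minimal sketch using only the Python standard library. The mapping from client name to module name (`pandas`, `modin`, `sqlalchemy`) is an assumption based on the package names mentioned above, not something fuzzydata exposes.

```python
from importlib.util import find_spec

# Assumed mapping from client name to the top-level module it needs.
# Illustrative only; it is not read from fuzzydata's packaging metadata.
CLIENT_MODULES = {
    "pandas": "pandas",
    "modin": "modin",
    "sql": "sqlalchemy",
}


def available_clients(modules=CLIENT_MODULES):
    """Return the client names whose backing module can be found for import."""
    return [name for name, module in modules.items() if find_spec(module) is not None]


if __name__ == "__main__":
    print("Importable clients:", ", ".join(available_clients()) or "none")
```

`find_spec` only locates the module without importing it, so the check stays cheap even for heavy packages.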

Usage

Example usage can be found in the examples directory. Run fuzzydata --help to list the supported command-line options:

$ fuzzydata --help
usage: fuzzydata [-h] [--wf_client WF_CLIENT] [--output_dir OUTPUT_DIR] [--wf_name WF_NAME]
              [--columns COLUMNS] [--rows ROWS] [--versions VERSIONS] [--bfactor BFACTOR]
              [--matfreq MATFREQ] [--npp NPP] [--log LOG] [--replay_dir REPLAY_DIR]
              [--wf_options WF_OPTIONS] [--exclude_ops EXCLUDE_OPS] [--scale_artifact SCALE_ARTIFACT]

optional arguments:
  -h, --help            show this help message and exit
  --wf_client WF_CLIENT
                        Workflow Client to be used (Default pandas). Available Workflows: pandas|modin|sql
  --output_dir OUTPUT_DIR
                        Location of Output datasets to be stored
  --wf_name WF_NAME     prefix for each workflow to be generated dir to be the path prefix for these files.
  --columns COLUMNS     Number of columns in the base version
  --rows ROWS           Number of rows in the base version
  --versions VERSIONS   Number of artifact versions to generate
  --bfactor BFACTOR     Workflow Branching factor, 0.1 is linear, 100 is star-like
  --matfreq MATFREQ     Materialization frequency, i.e. how many operations before writing out an artifact
  --log LOG             Set Logging Level
  --replay_dir REPLAY_DIR
                        Replay existing workflow in directory
  --wf_options WF_OPTIONS
                        JSON-encoded workflow engine options like sql_string or modin_engine
  --exclude_ops EXCLUDE_OPS
                        JSON-encoded list of ops to exclude e.g. ["pivot"]
  --scale_artifact SCALE_ARTIFACT
                        JSON-encoded dict of {artifact_label: new_size} to be scaled up e.g. {"artifact_0"
                        : 1000000}
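Several of these options take JSON-encoded values, which are easy to get wrong when quoting by hand in a shell. As a sketch, the command line can be assembled in Python so that `json.dumps` and `shlex.join` handle the encoding and quoting. The helper name `build_fuzzydata_command` and the example values (output directory, workflow name, sizes) are made up for illustration; only the flag names come from the help text above, and which flags a given run actually needs is not confirmed here.

```python
import json
import shlex


def build_fuzzydata_command(output_dir, wf_name, exclude_ops=None, scale_artifact=None,
                            columns=None, rows=None, versions=None):
    """Assemble a fuzzydata argv list, JSON-encoding the structured options."""
    argv = ["fuzzydata", "--output_dir", output_dir, "--wf_name", wf_name]
    # Plain numeric options are passed through as strings when provided.
    for flag, value in (("--columns", columns), ("--rows", rows), ("--versions", versions)):
        if value is not None:
            argv += [flag, str(value)]
    # Structured options are documented as JSON-encoded strings.
    if exclude_ops is not None:
        argv += ["--exclude_ops", json.dumps(exclude_ops)]
    if scale_artifact is not None:
        argv += ["--scale_artifact", json.dumps(scale_artifact)]
    return argv


if __name__ == "__main__":
    argv = build_fuzzydata_command(
        "/tmp/fuzzydata_out",          # hypothetical output directory
        "demo",                        # hypothetical workflow name
        exclude_ops=["pivot"],
        scale_artifact={"artifact_0": 1000000},
        columns=20, rows=1000, versions=10,
    )
    # Print a shell-safe command line; pass argv to subprocess.run to execute it.
    print(shlex.join(argv))
```

Passing the list form to `subprocess.run` avoids shell quoting altogether, while the printed form is convenient for copying into a terminal.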

Documentation

Our paper is available at https://doi.org/10.1145/3531348.3532178.

If you use fuzzydata in your research, please consider citing our paper:

@inproceedings{10.1145/3531348.3532178,
author = {Rehman, Mohammed Suhail and Elmore, Aaron},
title = {FuzzyData: A Scalable Workload Generator for Testing Dataframe Workflow Systems},
year = {2022},
isbn = {9781450393539},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3531348.3532178},
doi = {10.1145/3531348.3532178},
booktitle = {Proceedings of the 2022 Workshop on 9th International Workshop of Testing Database Systems},
pages = {17–24},
numpages = {8},
location = {Philadelphia, PA, USA},
series = {DBTest '22}
}

License

MIT License

Contributing to fuzzydata

Check out the current roadmap in docs/roadmap.md. You are always welcome to develop a new client for fuzzydata.

Contact

Suhail Rehman / ChiData Group @ UChicago CS
