Fuzzy Data Benchmark
The fuzzydata Workflow Generator
The fuzzydata workflow generator enables:
- Abstract specification of Dataframe-based Workflows
- Generation of randomized tables and workflows
- Loading and replay of workflows on multiple clients
Fuzzydata is currently designed to run using the following clients: pandas, modin, and sql.
fuzzydata is designed to be extensible; you may implement your own client.
Please see the existing clients in fuzzydata/clients for ways to extend the abstract Artifact, Operation, and Workflow classes for your client.
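The extension pattern described above can be sketched roughly as follows. This is only an illustration of subclassing three abstract base classes: the base classes here are stand-ins defined inside the example, and every method name (`generate`, `apply`, `new_artifact`) and every concrete class (`ListArtifact`, `HeadOperation`, `ListWorkflow`) is hypothetical. The real interfaces to override are the ones in fuzzydata/clients.

```python
from abc import ABC, abstractmethod


# NOTE: these three base classes are stand-ins written for this sketch.
# They are NOT fuzzydata's real Artifact, Operation and Workflow classes;
# see fuzzydata/clients for the actual methods a client must override.
class Artifact(ABC):
    def __init__(self, label):
        self.label = label
        self.rows = []

    @abstractmethod
    def generate(self, num_rows, num_cols):
        """Fill the artifact with a table of the given shape."""


class Operation(ABC):
    @abstractmethod
    def apply(self, artifact):
        """Return a new artifact derived from `artifact`."""


class Workflow(ABC):
    def __init__(self, name):
        self.name = name
        self.artifacts = []

    @abstractmethod
    def new_artifact(self, label):
        """Create, register and return an empty artifact."""


# Hypothetical client that stores each table as a plain list of rows.
class ListArtifact(Artifact):
    def generate(self, num_rows, num_cols):
        self.rows = [
            [r * num_cols + c for c in range(num_cols)] for r in range(num_rows)
        ]
        return self


class HeadOperation(Operation):
    def __init__(self, n):
        self.n = n

    def apply(self, artifact):
        # Keep only the first n rows of the source artifact.
        out = ListArtifact(f"{artifact.label}_head")
        out.rows = artifact.rows[: self.n]
        return out


class ListWorkflow(Workflow):
    def new_artifact(self, label):
        artifact = ListArtifact(label)
        self.artifacts.append(artifact)
        return artifact
```

A new client would follow the same shape: one subclass per abstract class, each wrapping the backend's own table type and operations.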
Installation
Install using pip:
pip install fuzzydata
fuzzydata does not install modin or SQLAlchemy by default, but either can be requested as an install option:
pip install fuzzydata[modin|sql|all]
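Because the optional backends may or may not be present, it can help to check for them before choosing a client. The following is a minimal sketch using only the standard library. The helper name `available_backends` and the mapping from install option to import name are assumptions made for this example, not part of fuzzydata.

```python
from importlib.util import find_spec

# Assumed mapping from install option to top-level import name.
# "sqlalchemy" is the import name of the SQLAlchemy distribution.
OPTIONAL_BACKENDS = {"modin": "modin", "sql": "sqlalchemy"}


def available_backends(backends=OPTIONAL_BACKENDS):
    """Map each install option to True if its package can be imported."""
    return {
        extra: find_spec(module) is not None
        for extra, module in backends.items()
    }


if __name__ == "__main__":
    for extra, present in available_backends().items():
        print(f"{extra}: {'installed' if present else 'missing'}")
```

`find_spec` only looks the module up; it does not import it, so the check is cheap even for heavy packages.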
Usage
Some examples of fuzzydata usage are in the examples directory. You can also run the fuzzydata command with --help to get a list of the command-line options it supports:
$ fuzzydata --help
usage: fuzzydata [-h] [--wf_client WF_CLIENT] [--output_dir OUTPUT_DIR] [--wf_name WF_NAME]
[--columns COLUMNS] [--rows ROWS] [--versions VERSIONS] [--bfactor BFACTOR]
[--matfreq MATFREQ] [--npp NPP] [--log LOG] [--replay_dir REPLAY_DIR]
[--wf_options WF_OPTIONS] [--exclude_ops EXCLUDE_OPS] [--scale_artifact SCALE_ARTIFACT]
optional arguments:
-h, --help show this help message and exit
--wf_client WF_CLIENT
Workflow Client to be used (Default pandas). Available Workflows: pandas|modin|sql
--output_dir OUTPUT_DIR
Location of Output datasets to be stored
--wf_name WF_NAME prefix for each workflow to be generated dir to be the path prefix for these files.
--columns COLUMNS Number of columns in the base version
--rows ROWS Number of rows in the base version
--versions VERSIONS Number of artifact versions to generate
--bfactor BFACTOR Workflow Branching factor, 0.1 is linear, 100 is star-like
--matfreq MATFREQ Materialization frequency, i.e. how many operations before writing out an artifact
--log LOG Set Logging Level
--replay_dir REPLAY_DIR
Replay existing workflow in directory
--wf_options WF_OPTIONS
JSON-encoded workflow engine options like sql_string or modin_engine
--exclude_ops EXCLUDE_OPS
JSON-encoded list of ops to exclude e.g. ["pivot"]
--scale_artifact SCALE_ARTIFACT
JSON-encoded dict of {artifact_label: new_size} to be scaled up e.g. {"artifact_0": 1000000}
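Several of the options above take JSON-encoded values, which are easy to misquote by hand in a shell. One way to avoid that is to build the argument list in Python and let `json.dumps` do the encoding. This is a sketch under stated assumptions: the helper `build_fuzzydata_command`, the output directory and the workflow name are hypothetical, and only flags listed in the help text above are used.

```python
import json
import shlex


def build_fuzzydata_command(output_dir, wf_name, wf_client="pandas",
                            exclude_ops=None, scale_artifact=None):
    """Return an argv list for the fuzzydata CLI (hypothetical helper)."""
    argv = [
        "fuzzydata",
        "--wf_client", wf_client,
        "--output_dir", output_dir,
        "--wf_name", wf_name,
    ]
    if exclude_ops is not None:
        # --exclude_ops expects a JSON-encoded list, e.g. ["pivot"]
        argv += ["--exclude_ops", json.dumps(exclude_ops)]
    if scale_artifact is not None:
        # --scale_artifact expects a JSON-encoded dict of label to new size
        argv += ["--scale_artifact", json.dumps(scale_artifact)]
    return argv


# Example values below are placeholders chosen for illustration.
argv = build_fuzzydata_command(
    "/tmp/fuzzydata_out",
    "demo",
    exclude_ops=["pivot"],
    scale_artifact={"artifact_0": 1000000},
)

# shlex.join quotes each argument so the line can be pasted into a shell.
print(shlex.join(argv))
```

The resulting list can also be passed directly to `subprocess.run(argv)`, which sidesteps shell quoting entirely.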
Documentation
A preprint of our paper to appear at DBTest'22 is here
License
Contributing to fuzzydata
Check out the current roadmap in docs/roadmap.md. You are always welcome to develop a new client for fuzzydata.
Contact