Analytic generation of datasets with specified statistical characteristics.
Overview
AutoGen (analyticsdf) is a Python library for generating synthetic data with the statistical characteristics you specify.
Features
The library provides functions for specifying and generating a wide range of datasets with chosen statistical characteristics. A specification covers both the predictor matrix and the response vector.
Some common configurations:
- High correlation and multi-collinearity among predictor variables
- Interaction effects between variables
- Skewed distributions of predictor and response variables
- Nonlinear relationships between predictor and response variables
Check the Analyticsdf documentation for more details.
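To make the configurations listed above concrete, here is a minimal sketch of what such data can look like. It uses plain NumPy rather than the analyticsdf API, so the variable names, coefficients, and distributions are illustrative assumptions and not the library's own implementation.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
n = 1000

# High correlation: two predictors drawn from a bivariate normal
# with an off-diagonal covariance of 0.9 (an arbitrary choice).
cov = np.array([[1.0, 0.9],
                [0.9, 1.0]])
x1, x2 = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=n).T

# Skewed distribution: a right-skewed log-normal predictor.
x3 = rng.lognormal(mean=0.0, sigma=1.0, size=n)

# Response combining a linear term, an interaction term (x1 * x2),
# a nonlinear term (log of x3), and a small amount of noise.
y = 2.0 * x1 + 1.5 * x1 * x2 + np.log1p(x3) + rng.normal(scale=0.1, size=n)

corr_x1_x2 = np.corrcoef(x1, x2)[0, 1]
print(f"sample correlation of x1 and x2: {corr_x1_x2:.2f}")
```

The library is meant to let you describe this kind of structure declaratively; see its documentation for the actual methods.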
Inspirations
- Sklearn Make Datasets functionality
- MIT Synthetic Data Vault project
- MIT Data to AI Lab
- datacebo
- 2016 IEEE conference paper, The Synthetic Data Vault.
Install
The beta package of this library is publicly available on both PyPI and Anaconda. Install analyticsdf using pip or conda. We recommend using a virtual environment to avoid conflicts with other software on your device.
pip install analyticsdf
conda install -c faye-yufan analyticsdf
Getting Started
Import the dataset generation class from the package and experiment with its methods.
from analyticsdf.analyticsdataframe import AnalyticsDataframe
ad = AnalyticsDataframe(1000, 6)
ad.predictor_matrix.head()
The predictor matrix is initialized with all null values. Now let's update the predictors with some distributions:
for var in ['X1', 'X2', 'X3', 'X4', 'X5']:
    ad.update_predictor_uniform(var, 0, 100)
ad.update_predictor_categorical('X6', ["Red", "Yellow", "Blue"], [0.3, 0.4, 0.3])
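For readers who want to see what these updates amount to, the following is a rough equivalent in pandas and NumPy. It is a sketch under assumptions, not the library's code: it assumes the predictor matrix behaves like a 1000 x 6 DataFrame with columns X1 to X6, and the names `predictors` and `rng` are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)
n_rows = 1000
columns = [f"X{i}" for i in range(1, 7)]

# Start from an all-null frame, mirroring the empty predictor matrix.
predictors = pd.DataFrame(np.nan, index=range(n_rows), columns=columns)

# Fill X1..X5 with uniform draws between 0 and 100.
for var in columns[:5]:
    predictors[var] = rng.uniform(0, 100, size=n_rows)

# Fill X6 with categories drawn at the given probabilities.
predictors["X6"] = rng.choice(
    ["Red", "Yellow", "Blue"], size=n_rows, p=[0.3, 0.4, 0.3]
)

print(predictors.head())
```

With 1000 rows, the category frequencies in X6 should land close to the requested 0.3, 0.4, and 0.3.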
Once the dataframe looks the way you want, you can visualize it:
df_visualization_bi(ad)
Next Steps
We plan to add a user interface to the library so that users can configure, manipulate, and view datasets more easily.
License
AutoGen is released under the MIT License.
Download files
Source Distribution
File details
Details for the file analyticsdf-0.0.8.3.tar.gz.
File metadata
- Download URL: analyticsdf-0.0.8.3.tar.gz
- Upload date:
- Size: 10.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.0.0 CPython/3.8.5
File hashes
Algorithm | Hash digest
---|---
SHA256 | e4a58df7a2657b151c71c65d8d9d0869823f540b2c4cc405924803b5dfdf6429
MD5 | 6c69983eec9578b17651f143f7ae2472
BLAKE2b-256 | 56cd248c9cf566818c81678122477394daee3d95893bd3afd5175ce67bdbdd7e
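If you download the source distribution by hand, you can compare its SHA256 digest with the value listed above. The sketch below uses only the standard library; the helper names `sha256_of` and `matches_expected` are hypothetical, and it assumes the archive sits in the current working directory.

```python
import hashlib
from pathlib import Path

# SHA256 digest published for analyticsdf-0.0.8.3.tar.gz.
EXPECTED_SHA256 = "e4a58df7a2657b151c71c65d8d9d0869823f540b2c4cc405924803b5dfdf6429"


def sha256_of(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_SHA256):
    """Check a file's SHA256 digest against the expected value."""
    return sha256_of(path) == expected.lower()


archive = Path("analyticsdf-0.0.8.3.tar.gz")  # assumed download location
if archive.exists():
    print("digest matches:", matches_expected(archive))
else:
    print(f"{archive} not found; download it first")
```

A mismatch means the file differs from the published one and should not be installed.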