Skip to main content

DaSy DataSynthesizer - Create synthetic data for machine learning research

Project description

dasy-ml

DaSy DataSynthesizer - Create synthetic data with desired statistical properties for machine learning research.

Install

pip install dasy-ml
import dasy

Introduction

When researching machine learning algorithms, we often want to know how they behave on data with specific properties. For example: linearly separable, correlated, isotropic, etc. This library aims to provide functionality to construct synthetic datasets with any desired statistical properties, so researchers can easily study how algorithms respond to different types of data.

Why is this useful for machine learning research compared to using existing datasets?

  • Existing datasets may lack certain statistical properties you want to test your algorithm against.
  • You may not have enough information about where an existing dataset comes from. For example, is it IID?
  • You may want to test against many different types of data.
  • You may want to arbitrarily adjust the size of the dataset.

Note: this is not a library for adding synthetic data to an existing dataset - there are already many other libraries that do this.

Examples

Above, the input X data is simply sampled from a Gaussian centered at the origin. Then, the data is labeled by creating random centroids and labeling each point according to its nearest centroid (similar to the first step in k-means). On the left with only 2 classes, the classes are linearly separable. With 3 or more classes, they are no longer linearly separable and the boundaries essentially form a Voronoi diagram.

DataSynthesizers and Labelers

The core of synthetic-data are DataSynthesizers and Labelers.

DataSynthesizers sample inputs X from the feature-space.

Labelers take inputs X and assign labels y to them.

These are very general classes. The procedure for creating X typically involves sampling from some probability distribution. Assigning labels may be a deterministic or probabilistic function. Each x or y may be created independently but does not have to be, for example if created through a Markov process.

Discussion of Kinds of Data

Independent vs. Non-Independent Data

Time-Series Data

Data for Classification Problems

Deterministic vs. Probabilistic Labels

If for any given input x, the label must always be a specific value, then the labels are deterministic. In other words, the label y=f(x), where f is a pure function. Typically, y is encoded as a one-hot vector.

On the other hand, if a given input x may be assigned different labels, then labels are probabilistic. Here, y is drawn from the possible classes according to some probability distribution p(x), representing the probability of each class for the given input.

Theoretically, it is possible to achieve 100% accuracy on a deterministic classification problem. This is impossible in a probabilistic classification problem.

Noisy Labels

Linearly Separable Data

Data for Regression Problems

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

dasy-ml-0.0.2.tar.gz (4.6 kB view hashes)

Uploaded Source

Built Distribution

dasy_ml-0.0.2-py3-none-any.whl (6.0 kB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page