DaSy DataSynthesizer - Create synthetic data for machine learning research
dasy-ml
DaSy DataSynthesizer - Create synthetic data with desired statistical properties for machine learning research.
Install
pip install dasy-ml
import dasy
Introduction
When researching machine learning algorithms, we often want to know how they behave on data with specific properties, such as data that is linearly separable, correlated, or isotropic. This library aims to provide functionality to construct synthetic datasets with the desired statistical properties, so researchers can easily study how algorithms respond to different types of data.
Why is this useful for machine learning research compared to using existing datasets?
- Existing datasets may lack certain statistical properties you want to test your algorithm against.
- You may not have enough information about where an existing dataset comes from. For example, is it IID?
- You may want to test against many different types of data.
- You may want to arbitrarily adjust the size of the dataset.
Note: this is not a library for adding synthetic data to an existing dataset - there are already many other libraries that do this.
Examples
In the example figure, the input data X is sampled from a Gaussian centered at the origin. The data is then labeled by creating random centroids and assigning each point the label of its nearest centroid (similar to the assignment step in k-means). In the left panel, with only 2 classes, the classes are linearly separable. With 3 or more classes, they are no longer linearly separable and the boundaries essentially form a Voronoi diagram.
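The labeling procedure described here can be sketched with plain NumPy. This is a minimal illustration, not dasy's own API: the function names sample_gaussian and label_by_nearest_centroid are hypothetical, and drawing the centroids from a standard Gaussian is an assumption.

```python
import numpy as np


def sample_gaussian(n_samples, n_features, rng):
    """Draw inputs X from an isotropic Gaussian centered at the origin."""
    return rng.standard_normal((n_samples, n_features))


def label_by_nearest_centroid(X, n_classes, rng):
    """Label each row of X with the index of its nearest random centroid.

    Assumption: centroids are drawn from the same standard Gaussian as X.
    """
    centroids = rng.standard_normal((n_classes, X.shape[1]))
    # Pairwise squared Euclidean distances, shape (n_samples, n_classes).
    diffs = X[:, None, :] - centroids[None, :, :]
    sq_dists = np.sum(diffs ** 2, axis=2)
    return np.argmin(sq_dists, axis=1), centroids


rng = np.random.default_rng(0)
X = sample_gaussian(500, 2, rng)
y, centroids = label_by_nearest_centroid(X, 3, rng)
```

With two centroids, the decision boundary is the perpendicular bisector of the segment joining them, which is why the 2-class case is linearly separable.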
DataSynthesizers and Labelers
The core of the library consists of DataSynthesizers and Labelers.
DataSynthesizers sample inputs X from the feature-space.
Labelers take inputs X and assign labels y to them.
These are very general classes. The procedure for creating X typically involves sampling from some probability distribution. Assigning labels may be a deterministic or probabilistic function. Each x or y may be generated independently of the others, but it does not have to be: for example, samples generated by a Markov process depend on the preceding sample.
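To make the division of responsibilities concrete, here is a rough sketch of how a synthesizer and a labeler could fit together. All class and method names below (GaussianSynthesizer, RandomWalkSynthesizer, SignLabeler, sample, label, make_dataset) are hypothetical and chosen for illustration only; they are not taken from dasy and its real interface may differ.

```python
import numpy as np


class GaussianSynthesizer:
    """Hypothetical synthesizer: samples independent inputs from a standard Gaussian."""

    def __init__(self, n_features, seed=0):
        self.n_features = n_features
        self.rng = np.random.default_rng(seed)

    def sample(self, n_samples):
        return self.rng.standard_normal((n_samples, self.n_features))


class RandomWalkSynthesizer:
    """Hypothetical synthesizer: a Markov process where each x depends on the previous x."""

    def __init__(self, n_features, step_scale=0.1, seed=0):
        self.n_features = n_features
        self.step_scale = step_scale
        self.rng = np.random.default_rng(seed)

    def sample(self, n_samples):
        steps = self.step_scale * self.rng.standard_normal((n_samples, self.n_features))
        # The cumulative sum turns independent steps into a dependent sequence.
        return np.cumsum(steps, axis=0)


class SignLabeler:
    """Hypothetical deterministic labeler: y = 1 if the first feature is positive, else 0."""

    def label(self, X):
        return (X[:, 0] > 0).astype(int)


def make_dataset(synthesizer, labeler, n_samples):
    """Sample inputs, then label them."""
    X = synthesizer.sample(n_samples)
    return X, labeler.label(X)


X_iid, y_iid = make_dataset(GaussianSynthesizer(n_features=2), SignLabeler(), 200)
X_walk, y_walk = make_dataset(RandomWalkSynthesizer(n_features=2), SignLabeler(), 200)
```

Keeping sampling and labeling in separate objects means any synthesizer can be paired with any labeler, which is one plausible reason for the split described above.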
Discussion of Kinds of Data
Independent vs. Non-Independent Data
Time-Series Data
Data for Classification Problems
Deterministic vs. Probabilistic Labels
If for any given input x, the label must always be a specific value, then the labels are deterministic. In other words, the label y=f(x), where f is a pure function. Typically, y is encoded as a one-hot vector.
On the other hand, if a given input x may be assigned different labels, then labels are probabilistic. Here, y is drawn from the possible classes according to some probability distribution p(x), representing the probability of each class for the given input.
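In principle, a classifier can reach 100% accuracy on a deterministic classification problem, because each input has exactly one correct label. In a probabilistic classification problem this is generally impossible: whenever an input can receive more than one label with nonzero probability, even the best possible classifier (one that always predicts the most probable class) will sometimes be wrong.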
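The difference can be checked numerically. The sketch below uses only NumPy and assumes, purely for illustration, that p(x) is a softmax over random linear scores; the function names are hypothetical and not part of dasy. It compares the accuracy of the best possible predictor (always pick the most probable class) on deterministic and on probabilistic labels.

```python
import numpy as np


def class_probabilities(X, weights):
    """Assumed form of p(x): a softmax over linear scores."""
    scores = X @ weights
    scores -= scores.max(axis=1, keepdims=True)  # for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)


def deterministic_labels(probs):
    """y = f(x): always the most probable class."""
    return np.argmax(probs, axis=1)


def probabilistic_labels(probs, rng):
    """Draw one class per row according to that row's probabilities."""
    cdf = np.cumsum(probs, axis=1)
    u = rng.random((probs.shape[0], 1))
    # Count how many cumulative values lie below u; clip to guard against rounding.
    return np.minimum((u > cdf).sum(axis=1), probs.shape[1] - 1)


rng = np.random.default_rng(0)
X = rng.standard_normal((10_000, 2))
weights = rng.standard_normal((2, 3))
probs = class_probabilities(X, weights)

y_det = deterministic_labels(probs)
y_prob = probabilistic_labels(probs, rng)

best_guess = np.argmax(probs, axis=1)
acc_det = np.mean(best_guess == y_det)     # perfect accuracy is attainable
acc_prob = np.mean(best_guess == y_prob)   # falls short of 100%
bayes_accuracy = probs.max(axis=1).mean()  # expected accuracy of the best predictor
```

On probabilistic labels, the measured accuracy of the best predictor should be close to bayes_accuracy, the average probability of the most likely class, which is strictly below 1 here.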
Noisy Labels
Linearly Separable Data
Data for Regression Problems