A package to evaluate how close a synthetic data set is to real data.
TabularDataSynthesizer
The primary goal of the tabular data synthesizer project is to support the general synthesis of tabular data, whatever its shape or form. Currently, the synthesizer supports the following data types:
- Nominal
- Ordinal
- Continuous
- Dates (treated as continuous data)
TODO:
- Datetime
- Free text
The tabular data synthetization process consists of several steps:
- Tokenizing the data for the relevant columns. The 'relevant' columns are those with dtype category or object. Their values are tokenized using the pd.factorize function, which maps each distinct value to an integer. We save this mapping and its inverse. This tokenization step allows us to input everyday data that contains textual columns as well.
- Converting the numerical representation into one that can be used by a neural network. In short, this means getting all values into the range [-1, 1]. There are several implementations of this. For continuous values, there are currently three options:
  - Gaussian Mixture Models. A combination of several Gaussians is fit to the data of a single column and can represent the data when it does not follow a typical Gaussian shape, which is the assumption of most neural networks.
  - Bayesian Gaussian Mixture Models. The BGMM is an adaptation of the Gaussian Mixture Model that, in short, allows a varying number of components to be learned. This method takes considerably longer to fit, but should typically give somewhat better results.
  - Scaler. Furthermore, we can use normalization and standardization to bring the data into the required range. However, this often has caveats for the neural network, since the resulting distributions are typically not Gaussian.
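The two steps above can be sketched roughly as follows. This is a minimal illustration, not the package's actual implementation: the function names `tokenize`, `scale_minmax`, and `gmm_normalize` are hypothetical. The README does not say how the mixture model is turned into values in [-1, 1], so the sketch assumes one common scheme: assign each value to its most likely component, then scale it by that component's mean and four standard deviations and clip the result. It relies on `pd.factorize` from pandas and on `GaussianMixture` and `BayesianGaussianMixture` from scikit-learn.

```python
import numpy as np
import pandas as pd
from sklearn.mixture import BayesianGaussianMixture, GaussianMixture


def tokenize(df):
    """Replace category/object columns by integer codes.

    Returns the encoded frame and, per column, the index of unique values
    that serves as the inverse mapping (code -> original value).
    """
    encoded = df.copy()
    inverse = {}
    for col in df.select_dtypes(include=["category", "object"]).columns:
        codes, uniques = pd.factorize(df[col])
        encoded[col] = codes
        inverse[col] = uniques
    return encoded, inverse


def scale_minmax(values):
    """Plain scaler: linearly map values onto [-1, 1]."""
    lo, hi = values.min(), values.max()
    return 2.0 * (values - lo) / (hi - lo) - 1.0


def gmm_normalize(values, n_components=3, bayesian=False, seed=0):
    """Assumed scheme: normalize each value relative to its most likely component.

    Returns the scaled values (clipped to [-1, 1]) and the component index
    chosen for each value.
    """
    x = np.asarray(values, dtype=float).reshape(-1, 1)
    model_cls = BayesianGaussianMixture if bayesian else GaussianMixture
    model = model_cls(n_components=n_components, random_state=seed).fit(x)
    component = model.predict(x)
    means = model.means_[component, 0]
    stds = np.sqrt(model.covariances_[component, 0, 0])
    scaled = np.clip((x[:, 0] - means) / (4.0 * stds), -1.0, 1.0)
    return scaled, component


# Small usage example on made-up data.
frame = pd.DataFrame({
    "color": pd.Series(["red", "blue", "red"], dtype="object"),
    "age": [20.0, 30.0, 40.0],
})
encoded_frame, inverse_maps = tokenize(frame)
# Decode the integer codes back to the original labels.
restored = inverse_maps["color"].take(encoded_frame["color"].to_numpy())
```

Passing `bayesian=True` swaps in the Bayesian variant, which can leave some components effectively unused. Note that `pd.factorize` encodes missing values as -1 by default, so the inverse lookup shown here assumes the column has no missing entries.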