A package to evaluate how close a synthetic data set is to real data.

Project description

TabularDataSynthesizer

The primary goal of the tabular data synthesizer project is to support the general synthesis of tabular data, whatever its shape or form. Currently, the synthesizer supports the following data types:

  • Nominal
  • Ordinal
  • Continuous
  • Dates (treated as continuous data)

TODO:

  • Datetime
  • Free text

The tabular data synthesis process consists of several steps:

  1. Tokenizing the data for the relevant columns. The 'relevant' columns are those with dtype category or object. Their values are tokenized using the pd.factorize function, which maps each distinct value to an integer. We save both this mapping and its inverse. This tokenization step allows us to input everyday data that has textual columns as well.
  2. Converting the numerical representation into a representation that can be used by a neural network. In short, this means getting all values into the range [-1, 1]. There are several implementations of this. For continuous values, there are currently three options:
    1. Gaussian Mixture Models. A combination of several Gaussians is fit to the data of a single column and can represent the data when it does not follow a typical Gaussian shape, which is the assumption of most neural networks.
    2. Bayesian Gaussian Mixture Models. The BGMM is an adaptation of the Gaussian Mixture Model that, in short, allows a varying number of components to be learned. This method takes quite a bit longer to fit, but typically gives somewhat better results.
    3. Scaler. Furthermore, we can use normalization and standardization to get the data into the required range. However, this often has caveats for the neural network, since the resulting distributions are typically not Gaussian.
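
The two steps above can be sketched roughly as follows. This is a minimal illustration and not the package's actual code: the function names tokenize, gmm_encode and gmm_decode are hypothetical, and the per-component scaling (distance to the component mean divided by four standard deviations, then clipped) is an assumed way of mapping values into [-1, 1], since the description does not state the exact formula. Missing values are not handled.

```python
import numpy as np
import pandas as pd
from sklearn.mixture import GaussianMixture


def tokenize(df):
    """Step 1 (sketch): replace category/object columns by integer codes."""
    df = df.copy()
    mappings = {}
    for col in df.select_dtypes(include=["category", "object"]).columns:
        codes, uniques = pd.factorize(df[col])
        df[col] = codes
        # uniques[code] recovers the original value, so this is the inverse map.
        mappings[col] = uniques
    return df, mappings


def gmm_encode(values, n_components=3, seed=0):
    """Step 2 (sketch): scale a continuous column relative to its GMM component.

    Assumption: each value is assigned to its most likely component and
    scaled as (x - mean) / (4 * std), clipped to [-1, 1].
    """
    x = np.asarray(values, dtype=float).reshape(-1, 1)
    gmm = GaussianMixture(n_components=n_components, random_state=seed).fit(x)
    component = gmm.predict(x)
    means = gmm.means_[component, 0]
    # Default covariance_type="full": shape (n_components, 1, 1) for one column.
    stds = np.sqrt(gmm.covariances_[component, 0, 0])
    scaled = np.clip((x[:, 0] - means) / (4 * stds), -1.0, 1.0)
    return scaled, component, gmm


def gmm_decode(scaled, component, gmm):
    """Invert gmm_encode (exact only for values that were not clipped)."""
    means = gmm.means_[component, 0]
    stds = np.sqrt(gmm.covariances_[component, 0, 0])
    return scaled * 4 * stds + means
```

Under the same assumptions, swapping GaussianMixture for sklearn.mixture.BayesianGaussianMixture would give a BGMM variant of the sketch, since both estimators expose fit, predict, means_ and covariances_. The chosen component index has to be kept next to the scaled value, because the scaled value alone cannot be decoded.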

Download files

Source Distribution

tabular-data-synthesizer-0.2.1.tar.gz (13.5 kB view details)

Uploaded Source

File details

Details for the file tabular-data-synthesizer-0.2.1.tar.gz.

File metadata

  • Download URL: tabular-data-synthesizer-0.2.1.tar.gz
  • Upload date:
  • Size: 13.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.23.0 setuptools/46.1.1 requests-toolbelt/0.9.1 tqdm/4.43.0 CPython/3.7.1

File hashes

Hashes for tabular-data-synthesizer-0.2.1.tar.gz
Algorithm Hash digest
SHA256 787f42d866eefa1385401d7dcf06854d7672077eac70abd15458a210db306ff0
MD5 a2377a5320ff79666a671f7601a80fc5
BLAKE2b-256 f608afd82e317c5fdb5756b0172b1803d69ab3d5130390825b60085cd54b4fc7
