Skip to main content

Reversible Data Transforms

Project description


This repository is part of The Synthetic Data Vault Project, a project from DataCebo.

Development Status PyPi Shield Unit Tests Downloads Coverage Status Slack

Overview

RDT (Reversible Data Transforms) is a Python library that transforms raw data into fully numerical data, ready for data science. The transforms are reversible, allowing you to convert from numerical data back into your original format.

Install

Install RDT using pip or conda. We recommend using a virtual environment to avoid conflicts with other software on your device.

pip install rdt
conda install -c conda-forge rdt

For more information about using reversible data transformations, visit the RDT Documentation.

Quickstart

In this short series of tutorials we will guide you through a series of steps that will help you getting started using RDT to transform columns, tables and datasets.

Load the demo data

After you have installed RDT, you can get started using the demo dataset.

from rdt import get_demo

customers = get_demo()

This dataset contains some randomly generated values that describe the customers of an online marketplace.

  last_login email_optin credit_card  age  dollars_spent
0 2021-06-26       False        VISA   29          99.99
1 2021-02-10       False        VISA   18            NaN
2        NaT       False        AMEX   21           2.50
3 2020-09-26        True         NaN   45          25.00
4 2020-12-22         NaN    DISCOVER   32          19.99

Let's transform this data so that each column is converted to full, numerical data ready for data science.

Creating the HyperTransformer & config

The HyperTransformer is capable of transforming multi-column datasets.

from rdt import HyperTransformer

ht = HyperTransformer()

The HyperTransformer needs to know about the columns in your dataset and which transformers to apply to each. These are described by a config. We can ask the HyperTransformer to automatically detect it based on the data we plan to use.

ht.detect_initial_config(data=customers)

This will create and set the config.

Config:
{
    "sdtypes": {
        "last_login": "datetime",
        "email_optin": "boolean",
        "credit_card": "categorical",
        "age": "numerical",
        "dollars_spent": "numerical"
    },
    "transformers": {
        "last_login": "UnixTimestampEncoder()",
        "email_optin": "BinaryEncoder()",
        "credit_card": "FrequencyEncoder()",
        "age": "FloatFormatter()",
        "dollars_spent": "FloatFormatter()"
    }
}

The sdtypes dictionary describes the semantic data types of each of your columns and the transformers dictionary describes which transformer to use for each column. You can customize the transformers and their settings. (See the Transformers Glossary for more information).

Fitting & using the HyperTransformer

The HyperTransformer references the config while learning the data during the fit stage.

ht.fit(customers)

Once the transformer is fit, it's ready to use. Use the transform method to transform all columns of your dataset at once.

transformed_data = ht.transform(customers)
   last_login.value  email_optin.value  credit_card.value  age.value  dollars_spent.value
0      1.624666e+18                0.0                0.2         29                99.99
1      1.612915e+18                0.0                0.2         18                36.87
2      1.611814e+18                0.0                0.5         21                 2.50
3      1.601078e+18                1.0                0.7         45                25.00
4      1.608595e+18                0.0                0.9         32                19.99

The HyperTransformer applied the assigned transformer to each individual column. Each column now contains fully numerical data that you can use for your project!

When you're done with your project, you can also transform the data back to the original format using the reverse_transform method.

original_format_data = ht.reverse_transform(transformed_data)
  last_login email_optin credit_card  age  dollars_spent
0        NaT       False        VISA   29          99.99
1 2021-02-10       False        VISA   18            NaN
2        NaT       False        AMEX   21            NaN
3 2020-09-26        True         NaN   45          25.00
4 2020-12-22       False    DISCOVER   32          19.99

What's Next?

To learn more about reversible data transformations, visit the RDT Documentation.




The Synthetic Data Vault Project was first created at MIT's Data to AI Lab in 2016. After 4 years of research and traction with enterprise, we created DataCebo in 2020 with the goal of growing the project. Today, DataCebo is the proud developer of SDV, the largest ecosystem for synthetic data generation & evaluation. It is home to multiple libraries that support synthetic data, including:

  • 🔄 Data discovery & transformation. Reverse the transforms to reproduce realistic data.
  • 🧠 Multiple machine learning models -- ranging from Copulas to Deep Learning -- to create tabular, multi table and time series data.
  • 📊 Measuring quality and privacy of synthetic data, and comparing different synthetic data generation models.

Get started using the SDV package -- a fully integrated solution and your one-stop shop for synthetic data. Or, use the standalone libraries for specific needs.

Project details


Release history Release notifications | RSS feed

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

rdt_identity-1.9.2.tar.gz (119.9 kB view details)

Uploaded Source

Built Distribution

rdt_identity-1.9.2-py2.py3-none-any.whl (11.9 kB view details)

Uploaded Python 2 Python 3

File details

Details for the file rdt_identity-1.9.2.tar.gz.

File metadata

  • Download URL: rdt_identity-1.9.2.tar.gz
  • Upload date:
  • Size: 119.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.8.0 pkginfo/1.9.6 readme-renderer/42.0 requests/2.31.0 requests-toolbelt/1.0.0 urllib3/2.0.6 tqdm/4.66.1 importlib-metadata/6.8.0 keyring/24.2.0 rfc3986/2.0.0 colorama/0.4.6 CPython/3.10.13

File hashes

Hashes for rdt_identity-1.9.2.tar.gz
Algorithm Hash digest
SHA256 09e3480ce6c2e8013e79b453158fbec75e1b35224150c93388c9fcb1a0dcf8f9
MD5 998042726c5d01c2816c186287b5e664
BLAKE2b-256 4afbf63acaab097ef1ac330a232c39caca356eb2203ca1429be2232838d12127

See more details on using hashes here.

File details

Details for the file rdt_identity-1.9.2-py2.py3-none-any.whl.

File metadata

  • Download URL: rdt_identity-1.9.2-py2.py3-none-any.whl
  • Upload date:
  • Size: 11.9 kB
  • Tags: Python 2, Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.8.0 pkginfo/1.9.6 readme-renderer/42.0 requests/2.31.0 requests-toolbelt/1.0.0 urllib3/2.0.6 tqdm/4.66.1 importlib-metadata/6.8.0 keyring/24.2.0 rfc3986/2.0.0 colorama/0.4.6 CPython/3.10.13

File hashes

Hashes for rdt_identity-1.9.2-py2.py3-none-any.whl
Algorithm Hash digest
SHA256 abfedef987094bf7448a391b9679f20b82dc519a657518a1561149b6543c08ca
MD5 bc9340c56ba20d0e756e3403104f3533
BLAKE2b-256 83d46989baa59c0258ac1be0be27a04ed2a2fb4df772cc4202df0680b416fd48

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page