Reversible Data Transforms

This repository is part of The Synthetic Data Vault Project, a project from DataCebo.

Overview

RDT (Reversible Data Transforms) is a Python library that transforms raw data into fully numerical data, ready for data science. The transforms are reversible, allowing you to convert from numerical data back into your original format.

Install

Install RDT using pip or conda. We recommend using a virtual environment to avoid conflicts with other software on your device.

pip install rdt
conda install -c conda-forge rdt

For more information about using reversible data transformations, visit the RDT Documentation.
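To confirm that the install succeeded, you can ask Python which version of the package it sees. This is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical and not part of RDT.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string for `package`, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints a version string if RDT is installed in this environment, otherwise None.
print(installed_version('rdt'))
```

If this prints `None`, the package is not visible to the interpreter you are running, which usually means the virtual environment is not activated.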

Quickstart

This short tutorial walks you through the steps to get started using RDT to transform columns, tables and datasets.

Load the demo data

After you have installed RDT, you can get started using the demo dataset.

from rdt import get_demo

customers = get_demo()

This dataset contains some randomly generated values that describe the customers of an online marketplace.

  last_login email_optin credit_card  age  dollars_spent
0 2021-06-26       False        VISA   29          99.99
1 2021-02-10       False        VISA   18            NaN
2        NaT       False        AMEX   21           2.50
3 2020-09-26        True         NaN   45          25.00
4 2020-12-22         NaN    DISCOVER   32          19.99

Let's transform this data so that each column is converted to fully numerical data, ready for data science.

Creating the HyperTransformer & config

The HyperTransformer is capable of transforming multi-column datasets.

from rdt import HyperTransformer

ht = HyperTransformer()

The HyperTransformer needs to know about the columns in your dataset and which transformers to apply to each. These are described by a config. We can ask the HyperTransformer to automatically detect it based on the data we plan to use.

ht.detect_initial_config(data=customers)

This will create and set the config.

Config:
{
    "sdtypes": {
        "last_login": "datetime",
        "email_optin": "boolean",
        "credit_card": "categorical",
        "age": "numerical",
        "dollars_spent": "numerical"
    },
    "transformers": {
        "last_login": "UnixTimestampEncoder()",
        "email_optin": "BinaryEncoder()",
        "credit_card": "FrequencyEncoder()",
        "age": "FloatFormatter()",
        "dollars_spent": "FloatFormatter()"
    }
}

The sdtypes dictionary describes the semantic data type of each of your columns, and the transformers dictionary describes which transformer to use for each column. You can customize the transformers and their settings (see the Transformers Glossary for more information).

Fitting & using the HyperTransformer

The HyperTransformer references the config while learning the data during the fit stage.

ht.fit(customers)

Once the transformer is fit, it's ready to use. Use the transform method to transform all columns of your dataset at once.

transformed_data = ht.transform(customers)

   last_login.value  email_optin.value  credit_card.value  age.value  dollars_spent.value
0      1.624666e+18                0.0                0.2         29                99.99
1      1.612915e+18                0.0                0.2         18                36.87
2      1.611814e+18                0.0                0.5         21                 2.50
3      1.601078e+18                1.0                0.7         45                25.00
4      1.608595e+18                0.0                0.9         32                19.99

The HyperTransformer applied the assigned transformer to each individual column. Each column now contains fully numerical data that you can use for your project!
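To make the idea of a reversible transform concrete, here is a hand-written sketch of a datetime round trip using only the standard library. It is not RDT's implementation. It assumes that values such as 1.624666e+18 in the last_login.value column are nanoseconds since the Unix epoch, which is consistent with the first row of the demo data (2021-06-26). The function names are hypothetical.

```python
from datetime import datetime, timezone

NANOS_PER_SECOND = 1_000_000_000


def to_unix_nanos(dt):
    """Forward transform: timezone-aware datetime -> integer nanoseconds since the epoch.

    Sub-second precision is dropped in this simplified sketch.
    """
    return int(dt.timestamp()) * NANOS_PER_SECOND


def from_unix_nanos(nanos):
    """Reverse transform: integer nanoseconds since the epoch -> UTC datetime."""
    return datetime.fromtimestamp(nanos // NANOS_PER_SECOND, tz=timezone.utc)


last_login = datetime(2021, 6, 26, tzinfo=timezone.utc)
encoded = to_unix_nanos(last_login)

print(f"{encoded:e}")                           # 1.624666e+18
print(from_unix_nanos(encoded) == last_login)   # True
```

The forward step produces a plain number that a model can consume, and the reverse step recovers the original date exactly, which is the property the transformers above are designed around.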

When you're done with your project, you can also transform the data back to the original format using the reverse_transform method.

original_format_data = ht.reverse_transform(transformed_data)

  last_login email_optin credit_card  age  dollars_spent
0        NaT       False        VISA   29          99.99
1 2021-02-10       False        VISA   18            NaN
2        NaT       False        AMEX   21            NaN
3 2020-09-26        True         NaN   45          25.00
4 2020-12-22       False    DISCOVER   32          19.99

What's Next?

To learn more about reversible data transformations, visit the RDT Documentation.




The Synthetic Data Vault Project was first created at MIT's Data to AI Lab in 2016. After 4 years of research and traction with enterprise, we created DataCebo in 2020 with the goal of growing the project. Today, DataCebo is the proud developer of SDV, the largest ecosystem for synthetic data generation & evaluation. It is home to multiple libraries that support synthetic data, including:

  • 🔄 Data discovery & transformation. Reverse the transforms to reproduce realistic data.
  • 🧠 Multiple machine learning models -- ranging from Copulas to Deep Learning -- to create tabular, multi table and time series data.
  • 📊 Measuring quality and privacy of synthetic data, and comparing different synthetic data generation models.

Get started using the SDV package -- a fully integrated solution and your one-stop shop for synthetic data. Or, use the standalone libraries for specific needs.
