
Replicate structure of private or protected data for testing.


data-duper

data-duper is a tool to replicate the structure of private or protected data for testing.

Requires Python ≥3.8. Code style: black. Imports: isort.

What does it solve?

When testing how software handles data, it is best to use data that is as similar to the real data as possible, without revealing sensitive information to the test environment. This is where data-duper comes in: it lets you create an authentic replica of your private or protected data.

How to get it?

The source code is currently hosted on GitHub at: https://github.com/kjanker/data-duper.

The latest released version is available from the Python Package Index (PyPI) under the name data-duper, for example via pip install data-duper.

What does it do?

data-duper works like a learning model: you train the duper on your real data and afterwards generate a new data set of arbitrary size. The new data set, or dupe, has the same structure as the real data, i.e., the same columns and dtypes, as well as the same string composition and distribution of numerical values. NA values are ignored by default but can optionally be included as well.

Methods

  • numerical values (float, int, datetime) are drawn from an interpolated empirical distribution
  • identifier strings of fixed length and structure are replicated with regular expressions
  • features with only a few distinct values (category, bool) are redrawn according to their frequency of occurrence
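
The three methods above can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not data-duper's actual implementation: the function names draw_numeric, draw_categorical, and draw_identifier are hypothetical, linear interpolation between sorted observations is assumed for the numerical case, and the identifier case approximates a regular expression by one character class per position.

```python
import random
import string
from collections import Counter


def draw_numeric(values, size, rng):
    """Sample from a linearly interpolated empirical distribution (assumed scheme)."""
    xs = sorted(values)
    if len(xs) < 2:
        return [xs[0]] * size
    draws = []
    for _ in range(size):
        # A uniform draw picks a fractional position along the sorted observations.
        pos = rng.random() * (len(xs) - 1)
        i = min(int(pos), len(xs) - 2)
        # Interpolate linearly between the two neighbouring observations.
        draws.append(xs[i] + (pos - i) * (xs[i + 1] - xs[i]))
    return draws


def draw_categorical(values, size, rng):
    """Redraw values with probability proportional to their observed frequency."""
    counts = Counter(values)
    return rng.choices(list(counts), weights=list(counts.values()), k=size)


def draw_identifier(values, size, rng):
    """Rebuild fixed-length identifiers position by position."""
    length = len(values[0])
    if any(len(v) != length for v in values):
        raise ValueError("identifiers must share one fixed length")
    pools = []
    for pos in range(length):
        seen = {v[pos] for v in values}
        # Generalize each position to a character class, like [0-9] or [A-Z] in a regex.
        for alphabet in (string.digits, string.ascii_uppercase, string.ascii_lowercase):
            if seen <= set(alphabet):
                pools.append(alphabet)
                break
        else:
            # Mixed or punctuation characters are kept as the observed literals.
            pools.append("".join(sorted(seen)))
    return ["".join(rng.choice(pool) for pool in pools) for _ in range(size)]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
numbers = draw_numeric([1.0, 2.0, 4.0], size=5, rng=rng)
flags = draw_categorical(["yes", "yes", "no"], size=5, rng=rng)
ids = draw_identifier(["AB-12", "CD-34"], size=5, rng=rng)
```

Each function only needs the observed values of one column, which is why columns can be replicated one at a time.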

Limitations

  • value distributions are replicated as draw probabilities; for small dupe sets, the realized distribution may therefore differ slightly from the real one
  • correlations between columns are not replicated (this ensures the real data is better obscured)
  • descriptive strings such as notes, names, etc. are not obscured but only reshuffled
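
The second limitation can be illustrated with a small, self-contained sketch. It assumes that each column is redrawn independently of the others and uses plain resampling of observed values as a stand-in for data-duper's actual drawing methods; the helper pearson and all variable names are hypothetical.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


rng = random.Random(42)  # fixed seed so the sketch is reproducible

# "Real" data: the second column is a linear function of the first.
x_real = [rng.gauss(0.0, 1.0) for _ in range(2000)]
y_real = [2.0 * x + 1.0 for x in x_real]

# Dupe: every column is redrawn on its own, so row pairings are lost.
x_dupe = rng.choices(x_real, k=2000)
y_dupe = rng.choices(y_real, k=2000)

real_corr = pearson(x_real, y_real)
dupe_corr = pearson(x_dupe, y_dupe)
print(f"real: {real_corr:.2f}, dupe: {dupe_corr:.2f}")
```

The marginal distribution of each column is preserved, while the relationship between the columns disappears. Tests that rely on cross-column consistency therefore need additional handling.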

How can I use it?

You simply initialize a new Duper instance, fit it on your real data df_real, and make a data dupe df_dupe of desired size n.

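from duper import Duper

duper = Duper()                    # create an untrained duper
duper.fit(df=df_real)              # learn the structure of the real data df_real
df_dupe = duper.make(size=10000)   # generate a dupe with 10,000 rows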
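
After making a dupe, it can be useful to confirm that it has the promised structure. The following sketch assumes that df_real and df_dupe are pandas DataFrames, which the README implies but does not state. The helper same_structure is hypothetical and not part of data-duper, and the dupe here is built by hand so that the example runs without the package installed.

```python
import pandas as pd


def same_structure(df_real, df_dupe):
    """Return True if both frames share column names (in order) and dtypes."""
    same_columns = list(df_real.columns) == list(df_dupe.columns)
    same_dtypes = list(df_real.dtypes) == list(df_dupe.dtypes)
    return same_columns and same_dtypes


# Hand-made stand-ins; in practice df_dupe would come from duper.make(...).
df_real = pd.DataFrame({"id": ["AB-12", "CD-34"], "amount": [1.5, 2.5]})
df_dupe = pd.DataFrame({"id": ["AB-34", "CD-12", "AB-12"], "amount": [1.7, 2.1, 2.4]})

structure_ok = same_structure(df_real, df_dupe)
```

The number of rows is deliberately not compared, since a dupe can have any size.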

Open issues

  • include optional correlations between selected columns
  • improve the algorithm of the regex duper

Get in touch

Don't hesitate to contact me if you like the idea and want to get in touch.

