An anonymization tool for production databases

pynonymizer

pynonymizer is a universal tool for translating sensitive production database dumps into anonymized copies.

This can help you support GDPR/Data Protection in your organization without compromising on the quality of your testing data.

Why are anonymized databases important?

The primary source of information on how your database is used is the production database itself. The production dataset is usually significantly larger than any development copy and contains a wider range of data.

From time to time, it is prudent to run a new feature or stage a test against this dataset rather than one created artificially by developers or testing frameworks. Anonymized databases let us use the structures present in production while stripping out any personally identifiable data that would constitute a breach of privacy for end users and, subsequently, a breach of GDPR.

With anonymized databases, copies can be processed regularly and distributed easily, leaving your developers and testers with a rich source of information on the volume and general makeup of the system in production. They can be used to run better staging environments and integration tests, and even to simulate database migrations.

Below is an excerpt from an anonymized database:

id  salutation  firstname  surname    email                              dob
1   Dr.         Bernard    Gough      tnelson@powell.com                 2000-07-03
2   Mr.         Molly      Bennett    clarkeharriet@price-fry.com        2014-05-19
3   Mrs.        Chelsea    Reid       adamsamber@clayton.com             1974-09-08
4   Dr.         Grace      Armstrong  tracy36@wilson-matthews.com        1963-12-15
5   Dr.         Stanley    James      christine15@stewart.net            1976-09-16
6   Dr.         Mark       Walsh      dgardner@ward.biz                  2004-08-28
7   Mrs.        Josephine  Chambers   hperry@allen.com                   1916-04-04
8   Dr.         Stephen    Thomas     thompsonheather@smith-stevens.com  1995-04-17
9   Ms.         Damian     Thompson   yjones@cox.biz                     2016-10-02
10  Miss        Geraldine  Harris     porteralice@francis-patel.com      1910-09-28
11  Ms.         Gemma      Jones      mandylewis@patel-thomas.net        1990-06-03
12  Dr.         Glenn      Carr       garnervalerie@farrell-parsons.biz  1998-04-19

How does it work?

pynonymizer replaces personally identifiable data in your database with realistic pseudorandom data from the Faker library or from other functions. A wide variety of data types is available, so you can pick one that suits the column in question, for example:

  • unique_email
  • company
  • file_path
  • [...]

pynonymizer's main data replacement mechanism, fake_update, selects values at random from a small pool of generated data (--seed-rows controls how many rows of Faker data are available). This approach was chosen for compatibility and speed of operation, but it does not guarantee uniqueness, so it may or may not suit your exact use case. For a full list of data generation strategies, see the docs on strategyfiles.
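To see why drawing from a small pool cannot guarantee uniqueness, here is a minimal sketch in plain Python. It is not pynonymizer's implementation: the function names build_seed_pool and fake_update_sketch are hypothetical, and a simple random-string generator stands in for Faker so that the example needs only the standard library.

```python
import random
import string


def build_seed_pool(seed_rows, rng):
    """Build a small pool of fake email-like values (a stand-in for Faker output)."""
    pool = []
    for _ in range(seed_rows):
        user = "".join(rng.choice(string.ascii_lowercase) for _ in range(8))
        pool.append(f"{user}@example.com")
    return pool


def fake_update_sketch(row_count, pool, rng):
    """Give every row a value picked at random from the pool (hypothetical helper)."""
    return [rng.choice(pool) for _ in range(row_count)]


rng = random.Random(0)  # fixed seed so repeated runs behave the same way
pool = build_seed_pool(seed_rows=5, rng=rng)
emails = fake_update_sketch(row_count=20, pool=pool, rng=rng)
distinct = len(set(emails))

# With 20 rows and only 5 seed values, repeats are unavoidable.
print(f"{len(emails)} rows, {distinct} distinct values")
```

A larger pool makes repeats less frequent but never rules them out, which is why a column with a uniqueness constraint needs a strategy that generates distinct values.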

Examples

You can see strategyfile examples for existing databases, such as the wordpress or adventureworks sample databases, in the examples folder.

Process outline

  1. Restore from dumpfile to temporary database.
  2. Anonymize temporary database with strategy.
  3. Dump resulting data to file.
  4. Drop temporary database.

If this workflow doesn't work for you, see process control to find out whether it can be adjusted to suit your needs.
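The four steps of the process outline can be illustrated with a small, self-contained sketch. It uses Python's built-in sqlite3 module and an in-memory database purely for demonstration. pynonymizer itself targets mysql, mssql, and postgres, so the table, the dump contents, and the anonymize_dump function below are all assumptions made for this example, and a single fixed UPDATE stands in for a real strategy.

```python
import sqlite3

# A tiny made-up "production" dump, used only for this illustration.
DUMP = """
CREATE TABLE users (id INTEGER PRIMARY KEY, firstname TEXT, email TEXT);
INSERT INTO users VALUES (1, 'Alice', 'alice@real-customer.test');
INSERT INTO users VALUES (2, 'Bob', 'bob@real-customer.test');
"""


def anonymize_dump(dump_sql):
    """Restore, anonymize, dump, and drop, using a throwaway in-memory database."""
    # 1. Restore from the dump into a temporary database.
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(dump_sql)
        # 2. Anonymize the temporary database (a fixed rule stands in for a strategy).
        conn.execute(
            "UPDATE users SET firstname = 'user' || id, "
            "email = 'user' || id || '@example.com'"
        )
        conn.commit()
        # 3. Dump the resulting data as SQL text.
        result = "\n".join(conn.iterdump())
    finally:
        # 4. Drop the temporary database (closing an in-memory database discards it).
        conn.close()
    return result


anonymized = anonymize_dump(DUMP)
print(anonymized)
```

Because the anonymization runs against a temporary copy, the original dump is never modified.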

mysql

  • mysql/mysqldump must be in $PATH
  • Local or remote mysql >= 5.5
  • Supported Inputs:
    • Plain SQL over stdin
    • Plain SQL file .sql
    • GZip-compressed SQL file .gz
  • Supported Outputs:
    • Plain SQL over stdout
    • Plain SQL file .sql
    • GZip-compressed SQL file .gz
    • LZMA-compressed SQL file .xz

mssql

  • Requires extra dependencies: install package pynonymizer[mssql]
  • MSSQL >= 2008
  • For RESTORE_DB/DUMP_DB operations, the database server must be running on the same machine as pynonymizer. MSSQL RESTORE and BACKUP instructions are executed by the database server itself, so piping a local backup to a remote server is not possible.
  • The anonymize process can be performed on remote servers, but you are responsible for creating/managing the target database.
  • Supported Inputs:
    • Local backup file
  • Supported Outputs:
    • Local backup file

postgres

  • psql/pg_dump must be in $PATH
  • Local or remote postgres server
  • Supported Inputs:
    • Plain SQL over stdin
    • Plain SQL file .sql
    • GZip-compressed SQL file .gz
  • Supported Outputs:
    • Plain SQL over stdout
    • Plain SQL file .sql
    • GZip-compressed SQL file .gz
    • LZMA-compressed SQL file .xz
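The file-based outputs listed for mysql and postgres differ only in compression, which is signalled by the file extension. As a rough sketch of how a caller might pick a writer from the extension, the hypothetical helper open_dump below maps .sql, .gz, and .xz to the matching standard-library opener. This is an illustration only and does not describe how pynonymizer handles files internally. Each file is read back purely to check the sketch.

```python
import gzip
import lzma
import os
import tempfile

# Hypothetical mapping from file extension to a standard-library opener.
OPENERS = {".sql": open, ".gz": gzip.open, ".xz": lzma.open}


def open_dump(path, mode):
    """Open a dump file in text mode, choosing compression from the extension."""
    ext = os.path.splitext(path)[1]
    if ext not in OPENERS:
        raise ValueError(f"unsupported dump extension: {ext!r}")
    # "t" forces text mode, which gzip.open and lzma.open do not use by default.
    return OPENERS[ext](path, mode + "t", encoding="utf-8")


sql = "INSERT INTO users VALUES (1, 'user1');\n"
results = {}
with tempfile.TemporaryDirectory() as tmp:
    for name in ("dump.sql", "dump.sql.gz", "dump.sql.xz"):
        path = os.path.join(tmp, name)
        with open_dump(path, "w") as fh:
            fh.write(sql)
        # Read the file back to confirm the round trip preserves the SQL text.
        with open_dump(path, "r") as fh:
            results[name] = fh.read()

print(sorted(results))
```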

Getting Started

Usage

CLI

  1. Write a strategyfile for your database
  2. Run pynonymizer --help for a description of the available options
  3. Start Anonymizing!

Package

pynonymizer can also be invoked programmatically from other Python code. See the module entrypoint pynonymizer or pynonymizer/pynonymize.py.

import pynonymizer

pynonymizer.run(
    input_path="./backup.sql",
    strategyfile_path="./strategy.yml",
    # [...] remaining keyword arguments are elided in the original
)
