Data Anonymizer for Django

Django Scrubber

django_scrubber is a Django app meant to help you anonymize your project's database data. It destructively alters data directly in the DB and therefore should not be used in production.

The main use case is providing developers with realistic data to use during development, without having to distribute your customers' or users' potentially sensitive information. To accomplish this, django_scrubber should be plugged in as a step during the creation of your database dumps.

Simply mark the fields you want to anonymize and call the scrub_data management command. Data will be replaced based on different scrubbers (see below), which define how the anonymous content will be generated.

If you want to be sure that you don't forget any fields as development goes on, you can use the management command scrub_validation in your CI/CD pipeline to check for any missing fields.

Installation

Simply run:

pip install django-scrubber

Then add django_scrubber to your Django INSTALLED_APPS, i.e. in settings.py add:

INSTALLED_APPS = [
  ...
  'django_scrubber',
  ...
]

Scrubbing data

In order to scrub data, i.e. to replace DB data with anonymized versions, django-scrubber must know which models and fields it should act on, and how the data should be replaced.

There are a few different ways to select which data should be scrubbed, namely: explicitly per model field; or globally per name or field type.

Adding scrubbers directly to model, matching scrubbers to fields by name:

class MyModel(Model):
    somefield = CharField()

    class Scrubbers:
        somefield = scrubbers.Hash('somefield')

Adding scrubbers globally, either by field name or field type:

# (in settings.py)

SCRUBBER_GLOBAL_SCRUBBERS = {
    'name': scrubbers.Hash,
    EmailField: scrubbers.Hash,
}

Model scrubbers override field-name scrubbers, which in turn override field-type scrubbers.

To disable global scrubbing in some specific model, simply set the respective field scrubber to None.
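The precedence described above can be pictured with a small pure-Python sketch. It is only an illustration of the documented order (model, then field name, then field type, with None switching scrubbing off); the function name resolve_scrubber and the use of plain strings in place of scrubber and field classes are assumptions made for the example, not django_scrubber's internals.

```python
# Illustrative only: mimics the documented precedence, not the library's code.
_UNSET = object()  # distinguishes "no entry" from an explicit None


def resolve_scrubber(field_name, field_type, model_scrubbers, global_scrubbers):
    """Return the scrubber chosen for a field, or None if it is left alone."""
    # 1. A scrubber set on the model wins; an explicit None disables scrubbing.
    choice = model_scrubbers.get(field_name, _UNSET)
    if choice is not _UNSET:
        return choice
    # 2. Next comes a global scrubber keyed by field name.
    if field_name in global_scrubbers:
        return global_scrubbers[field_name]
    # 3. Last comes a global scrubber keyed by field type.
    return global_scrubbers.get(field_type)


# Strings stand in for scrubber classes and field types.
global_scrubbers = {"name": "Hash", "EmailField": "Lorem"}
model_scrubbers = {"name": "Faker", "contact": None}

by_model = resolve_scrubber("name", "CharField", model_scrubbers, global_scrubbers)
by_name = resolve_scrubber("name", "CharField", {}, global_scrubbers)
by_type = resolve_scrubber("email", "EmailField", {}, global_scrubbers)
disabled = resolve_scrubber("contact", "EmailField", model_scrubbers, global_scrubbers)
```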

Scrubbers defined for non-existing fields will raise a warning but not fail the scrubbing process.

Which mechanism will be used to scrub the selected data is determined by using one of the provided scrubbers in django_scrubber.scrubbers. See below for a list. Alternatively, values may be anything that can be used as a value in a QuerySet.update() call (like Func instances, string literals, etc), or any callable that returns such an object when called with a Field object as argument.

By default, django_scrubber will affect all models from all registered apps. This may lead to issues with third-party apps if the global scrubbers are too general. This can be avoided with the SCRUBBER_APPS_LIST setting. Using this, you might for instance split your INSTALLED_APPS into multiple SYSTEM_APPS and LOCAL_APPS, then set SCRUBBER_APPS_LIST = LOCAL_APPS, to scrub only your own apps.
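A minimal settings.py sketch of that split could look as follows. The myproject.* app names are hypothetical, and placing django_scrubber itself in SYSTEM_APPS is just one possible choice.

```python
# settings.py (sketch; the "myproject.*" apps are hypothetical placeholders)
SYSTEM_APPS = [
    "django.contrib.auth",
    "django.contrib.contenttypes",
    "django.contrib.sessions",
    "django_scrubber",
]
LOCAL_APPS = [
    "myproject.accounts",
    "myproject.orders",
]

INSTALLED_APPS = SYSTEM_APPS + LOCAL_APPS

# Only models from your own apps are scrubbed.
SCRUBBER_APPS_LIST = LOCAL_APPS
```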

Finally just run ./manage.py scrub_data to destructively scrub the registered fields.

Arguments to the scrub_data command

--model Scrub only a single model (format <app_label>.<model_name>)

--keep-sessions Will NOT truncate all (by definition critical) session data.

--remove-fake-data Will truncate the database table storing preprocessed data for the Faker library.

Built-In scrubbers

Empty/Null

The simplest scrubbers: replace the field's content with the empty string or NULL, respectively.

class Scrubbers:
    somefield = scrubbers.Empty
    someother = scrubbers.Null

These scrubbers have no options.

Keeper

When running the validation or working in strict mode, you may want to actively decide to keep certain data instead of scrubbing it. In this case, you can just define scrubbers.Keep.

class Scrubbers:
    non_critical_field = scrubbers.Keep

This scrubber has no options.

Hash

Simple hashing of content:

class Scrubbers:
    somefield = scrubbers.Hash  # will use the field itself as source
    someotherfield = scrubbers.Hash('somefield')  # can optionally pass a different field name as hashing source

Currently, this uses the MD5 hash, which is supported in a wide variety of DB engines. Additionally, since security is not the main objective, a short hash is preferable: it has a lower risk of being longer than whatever field it is supposed to replace.
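To get a feel for the lengths involved, here is a small stand-alone sketch using Python's hashlib. An MD5 hex digest is always 32 characters long. Cutting it down to the field's max_length is an assumption made for this example; the text above does not say how django_scrubber handles fields shorter than the hash, and the helper name md5_scrub is hypothetical.

```python
import hashlib


def md5_scrub(value, max_length=None):
    """Hash a string with MD5 and optionally cut it to max_length characters."""
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()  # always 32 hex chars
    # Truncation is an assumption of this sketch, not documented behaviour.
    return digest if max_length is None else digest[:max_length]


full = md5_scrub("alice@example.com")
short = md5_scrub("alice@example.com", max_length=20)
```

The same input always yields the same hash, so repeated scrubbing of unchanged data gives stable results.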

Lorem

Simple scrubber meant to replace TextField with a static block of text. Has no options.

class Scrubbers:
    somefield = scrubbers.Lorem

Concat

Wrapper around django.db.functions.Concat to enable simple concatenation of scrubbers. This is useful if you want to ensure a field's uniqueness through composition of, for instance, the Hash and Faker (see below) scrubbers.

The following will generate random email addresses by hashing the user-part and using faker for the domain part:

class Scrubbers:
    email = scrubbers.Concat(scrubbers.Hash('email'), models.Value('@'), scrubbers.Faker('domain_name'))
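Why composition helps with uniqueness can be shown in plain Python, outside the database. This is a sketch only: scrub_email is a hypothetical helper, and the domain is passed in by hand instead of coming from faker.

```python
import hashlib


def scrub_email(original, domain):
    """Build '<md5 of original>@<domain>', mirroring the Concat example."""
    user_part = hashlib.md5(original.encode("utf-8")).hexdigest()
    return user_part + "@" + domain


# Two different source addresses that happen to get the same fake domain.
a = scrub_email("alice@corp.example", "example.org")
b = scrub_email("bob@corp.example", "example.org")
```

Even when the fake domain repeats, the hashed user part differs for different source values, so the combined addresses stay distinct.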

Faker

Replaces content with the help of faker.

class Scrubbers:
    first_name = scrubbers.Faker('first_name')
    last_name = scrubbers.Faker('last_name')
    past_date = scrubbers.Faker('past_date', start_date="-30d", tzinfo=None)

The replacements are done on the database-level and should therefore be able to cope with large amounts of data with reasonable performance.

The Faker scrubber requires at least one argument: the faker provider used to generate random data. All faker providers are supported, and you can also register your own custom providers.
Any remaining arguments will be passed through to that provider. Please refer to the faker docs to find out whether a provider accepts arguments and what they do.

Locales

Faker will be initialized with the current Django LANGUAGE_CODE and will populate the DB with localized data. If you want scrubbing to use a different locale, set SCRUBBER_FAKER_LOCALE (see Settings below).

Idempotency

By default, the faker instance used to populate the DB uses a fixed random seed, in order to ensure different scrubbings of the same data generate the same output. This is particularly useful if the scrubbed data is imported as a dump by developers, since changing data during troubleshooting would otherwise be confusing.

This behaviour can be changed by setting SCRUBBER_RANDOM_SEED=None, which ensures every scrubbing will generate random source data.
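The effect of a fixed seed can be demonstrated with Python's standard random module. This sketch does not use faker or django_scrubber; the NAMES list and the fake_names helper are made up for the illustration.

```python
import random

NAMES = ["Ada", "Grace", "Linus", "Margaret", "Guido"]


def fake_names(count, seed):
    """Pick `count` names using a generator seeded with `seed`."""
    rng = random.Random(seed)  # seed=None would draw from system entropy instead
    return [rng.choice(NAMES) for _ in range(count)]


# Same seed, same sequence: two "scrubbings" agree with each other.
first_run = fake_names(5, seed=42)
second_run = fake_names(5, seed=42)
```

With seed=None the two runs would generally differ, which corresponds to setting SCRUBBER_RANDOM_SEED=None.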

Limitations

Scrubbing unique fields may lead to IntegrityErrors, since there is no guarantee that the random content will not be repeated. Playing with different settings for SCRUBBER_RANDOM_SEED and SCRUBBER_ENTRIES_PER_PROVIDER may alleviate the problem. Unfortunately, for performance reasons, the source data for scrubbing with faker is added to the database, and arbitrarily increasing SCRUBBER_ENTRIES_PER_PROVIDER will significantly slow down scrubbing (besides still not guaranteeing uniqueness).
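A rough back-of-the-envelope calculation shows why a larger pool reduces collisions without ever ruling them out. The sketch assumes each row receives a value drawn uniformly and independently from a pool of pool_size entries; how django_scrubber actually assigns values is not specified here, so treat the numbers as an intuition aid only.

```python
def collision_probability(pool_size, rows):
    """Chance that at least two of `rows` uniform draws from the pool coincide."""
    if rows > pool_size:
        return 1.0  # pigeonhole: more rows than distinct values
    p_all_unique = 1.0
    for i in range(rows):
        p_all_unique *= (pool_size - i) / pool_size
    return 1.0 - p_all_unique


# Default pool of 1000 entries versus a pool one hundred times larger.
p_default = collision_probability(1000, 50)
p_large = collision_probability(100_000, 50)
```

Under this assumption, even 50 rows with a unique field are more likely than not to collide with the default pool of 1000 entries.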

When using Django < 2.1 on sqlite, a bug within Django causes field-specific scrubbing (e.g. date_object) to fail. Please consider using a different database backend or upgrading to the latest Django version.

Scrubbing third-party models

Sometimes you just don't have control over some code, but you still want to scrub the data of a given model.

A good example is the Django user model. It contains sensitive data, and you would have to override the whole model just to add the Scrubbers class.

Here is how to do it instead:

  1. Define your Scrubber class somewhere in your codebase (like a scrubbers.py)
# scrubbers.py
from django.conf import settings
from django.db import models

from django_scrubber import scrubbers


class UserScrubbers:
    # To scrub with a specific locale such as de_DE, see SCRUBBER_FAKER_LOCALE.
    first_name = scrubbers.Faker('first_name')
    last_name = scrubbers.Faker('last_name')
    username = scrubbers.Faker('uuid4')
    password = scrubbers.Faker('sha1')
    last_login = scrubbers.Null
    # settings.SCRUBBER_DOMAIN is a custom setting you define in your own project.
    email = scrubbers.Concat(first_name, models.Value('.'), last_name, models.Value('@'),
                             models.Value(settings.SCRUBBER_DOMAIN))
  2. Set up a mapping between your third-party model and your scrubber class
# settings.py
SCRUBBER_MAPPING = {
    "auth.User": "apps.account.scrubbers.UserScrubbers",
}
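The mapping values are dotted paths to a class. As a rough sketch of what resolving such a path involves, the following uses only the standard library; load_by_dotted_path is a hypothetical helper, and django_scrubber's own lookup may differ. A standard-library class stands in for a real Scrubbers class so the example runs on its own.

```python
from importlib import import_module


def load_by_dotted_path(dotted_path):
    """Import 'package.module.ClassName' and return the named attribute."""
    module_path, _, attr_name = dotted_path.rpartition(".")
    module = import_module(module_path)
    return getattr(module, attr_name)


# Stand-in for a path like "apps.account.scrubbers.UserScrubbers".
ordered_dict_cls = load_by_dotted_path("collections.OrderedDict")
```

A typo in the path surfaces as ImportError or AttributeError in this sketch, so the module holding your scrubber class must be importable under exactly the path given in the mapping.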

Settings

SCRUBBER_GLOBAL_SCRUBBERS:

Dictionary of global scrubbers. Keys should be either field names as strings or field type classes. Values should be one of the scrubbers provided in django_scrubber.scrubbers.

Example:

SCRUBBER_GLOBAL_SCRUBBERS = {
    'name': scrubbers.Hash,
    EmailField: scrubbers.Hash,
}

SCRUBBER_RANDOM_SEED:

The seed used when generating random content by the Faker scrubber. Setting this to None means each scrubbing will generate different data.

(default: 42)

SCRUBBER_ENTRIES_PER_PROVIDER:

Number of entries to use as source for Faker scrubber. Increasing this value will increase the randomness of generated data, but decrease performance.

(default: 1000)

SCRUBBER_SKIP_UNMANAGED:

Do not attempt to scrub models which are not managed by the ORM.

(default: True)

SCRUBBER_APPS_LIST:

Only scrub models belonging to these specific django apps. If unset, will scrub all installed apps.

(default: None)

SCRUBBER_ADDITIONAL_FAKER_PROVIDERS:

Add additional fake providers to be used by Faker. Each must be given as the full dotted path to the provider class.

(default: {*()}, empty set)

SCRUBBER_FAKER_LOCALE:

Set an alternative locale for Faker used during the scrubbing process.

(default: None, falls back to Django's default locale)

SCRUBBER_MAPPING:

Maps a model to a Scrubbers class that does not have to live inside that model. Useful if you have no control over the code of the model you'd like to scrub.

SCRUBBER_MAPPING = {
    "auth.User": "my_app.scrubbers.UserScrubbers",
}

(default: {})

SCRUBBER_STRICT_MODE:

When strict mode is activated, you have to define a scrubbing policy for every field of every type defined in SCRUBBER_REQUIRED_FIELD_TYPES. If you have unscrubbed fields and this flag is active, you can't run python manage.py scrub_data.

(default: False)

SCRUBBER_REQUIRED_FIELD_TYPES:

Field types that require a scrubbing policy. Defaults to all text-based Django model fields, since privacy-relevant data is usually stored in text fields; numbers and booleans usually can't contain sensitive personal data. Fields of these types are checked when running python manage.py scrub_validation.

(default: (models.CharField, models.TextField, models.URLField, models.JSONField, models.GenericIPAddressField, models.EmailField,))
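Conceptually, the check looks for fields of a required type that have no scrubbing policy. The following pure-Python sketch illustrates the idea with plain dictionaries and strings; missing_policies and the sample data are hypothetical, and the sketch ignores global scrubbers and the model whitelist, which the real check takes into account in ways not shown here.

```python
# Strings stand in for Django field classes in this illustration.
REQUIRED_FIELD_TYPES = {"CharField", "TextField", "EmailField"}


def missing_policies(model_fields, scrubbers):
    """Return names of required-type fields that have no scrubbing policy.

    model_fields maps field name -> field type name,
    scrubbers maps field name -> chosen policy (including Keep).
    """
    return sorted(
        name
        for name, type_name in model_fields.items()
        if type_name in REQUIRED_FIELD_TYPES and name not in scrubbers
    )


customer_fields = {
    "id": "AutoField",
    "name": "CharField",
    "email": "EmailField",
    "notes": "TextField",
    "is_active": "BooleanField",
}
customer_scrubbers = {"name": "Faker('name')", "notes": "Keep"}

unscrubbed = missing_policies(customer_fields, customer_scrubbers)
```

Here only email is reported: notes counts as handled because Keep is an explicit decision, and the non-text fields are not required to have a policy.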

SCRUBBER_REQUIRED_FIELD_MODEL_WHITELIST:

A list of models which will not be checked by scrub_validation or in strict mode. Defaults to the non-privacy-related Django base models. Items can either be full model names (e.g. auth.Group) or regular expression patterns matching against the full model name (e.g. re.compile('auth.*') to whitelist all auth models).

(default: ('auth.Group', 'auth.Permission', 'contenttypes.ContentType', 'sessions.Session', 'sites.Site', 'django_scrubber.FakeData', 'db.TestModel',))
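A whitelist mixing plain names and compiled patterns might be evaluated along these lines. This is a sketch under assumptions: is_whitelisted is a hypothetical helper, and using fullmatch is a choice made for the example; the library may match patterns differently.

```python
import re

WHITELIST = (
    "auth.Group",
    "sessions.Session",
    re.compile(r"contenttypes\..*"),  # every model of the contenttypes app
)


def is_whitelisted(model_name, whitelist=WHITELIST):
    """True if model_name equals a string entry or matches a pattern entry."""
    for entry in whitelist:
        if isinstance(entry, re.Pattern):
            if entry.fullmatch(model_name):
                return True
        elif entry == model_name:
            return True
    return False
```

Escaping the dot (contenttypes\.) keeps the pattern from accidentally matching unrelated app labels that merely start with the same letters.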

Logging

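Scrubber uses the default Django logging setup. The logger name is django_scrubber.scrubbers, so configuring its parent logger django_scrubber covers it as well. For example, to log to the console (assuming a console handler is already defined), you could set up the logger like this: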

LOGGING['loggers']['django_scrubber'] = {
    'handlers': ['console'],
    'propagate': True,
    'level': 'DEBUG',
}
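The snippet above assumes that a LOGGING dict with a console handler already exists in your settings. A complete, stand-alone version could look like the sketch below; the handler definition is an assumption for the example, and the explicit dictConfig call is only there so it runs outside Django, which applies settings.LOGGING for you.

```python
import logging
import logging.config

LOGGING = {
    "version": 1,
    "disable_existing_loggers": False,
    "handlers": {
        # Writes log records to stderr.
        "console": {"class": "logging.StreamHandler"},
    },
    "loggers": {
        "django_scrubber": {
            "handlers": ["console"],
            "propagate": True,
            "level": "DEBUG",
        },
    },
}

# Django does this itself with settings.LOGGING; done by hand here for the demo.
logging.config.dictConfig(LOGGING)

# The child logger inherits level and handlers from "django_scrubber".
scrubber_logger = logging.getLogger("django_scrubber.scrubbers")
```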

Making a new release

bumpversion is used to manage releases.

Add your changes to the CHANGELOG and run bumpversion <major|minor|patch>, then push (including tags)

Changelog

All notable changes to this project will be documented in this file.

The format is based on Keep a Changelog and this project adheres to Semantic Versioning.

[3.0.0] - 2024-09-10

Breaking

  • Removed SCRUBBER_VALIDATION_WHITELIST in favour of SCRUBBER_REQUIRED_FIELD_MODEL_WHITELIST - Thanks @GitRon

Changed

  • Added Django test model db.TestModel to default whitelist of SCRUBBER_REQUIRED_FIELD_MODEL_WHITELIST - Thanks @GitRon
  • Removed support for the mock package in unit tests
  • Adjusted some default settings

[2.1.1] - 2024-08-20

Changed

  • Fixed an issue where the management command scrub_validation could fail even though all models were skipped - Thanks @GitRon

[2.1.0] - 2024-08-20

Changed

  • Added support for Django version 5.1 - Thanks @GitRon
  • Added SCRUBBER_VALIDATION_WHITELIST and excluded Django core test model - Thanks @GitRon

[2.0.0] - 2024-06-27

Changed

  • BREAKING: Remove support for Django below version 4.2
  • BREAKING: Remove support for Python below version 3.8
  • BREAKING: Minimum required Faker version is now 20.0.0, released 11/2023
  • Added support for Django version 5.0
  • Added support for Python version 3.12
  • Add docker compose setup to run tests

[1.3.0] - 2024-06-05

Added

  • Add support for regular expressions in SCRUBBER_REQUIRED_FIELD_MODEL_WHITELIST - Thanks @fbinz

[1.2.2] - 2023-11-04

Changed

  • Set default_auto_field for django-scrubber to django.db.models.AutoField to prevent overrides from Django settings - Thanks @GitRon

[1.2.1] - 2023-11-03

Invalid

[1.2.0] - 2023-04-01

Changed

  • Added scrubber validation - Thanks @GitRon
  • Added strict mode - Thanks @GitRon

[1.1.0] - 2022-07-11

Changed

  • Invalid fields on scrubbers will no longer raise exception but just trigger warnings now
  • Author list completed

[1.0.0] - 2022-07-11

Changed

  • Meta data for python package improved - Thanks @GitRon

[0.9.0] - 2022-06-27

Added

[0.8.0] - 2022-05-01

Added

  • Add keep-sessions argument to scrub_data command. Will NOT truncate all (by definition critical) session data. Thanks @GitRon
  • Add remove-fake-data argument to scrub_data command. Will truncate the database table storing preprocessed data for the Faker library. Thanks @GitRon
  • Add Django 3.2 and 4.0 to test matrix

Changed

  • Remove Python 3.6 from test matrix
  • Remove Django 2.2 and 3.1 from test matrix

[0.7.0] - 2022-02-24

Changed

  • Remove upper boundary for Faker as they release non-breaking major upgrades way too often, please pin a working release in your own app

[0.6.2] - 2022-02-08

Changed

  • Support faker 12.x

[0.6.1] - 2022-01-25

Changed

  • Support faker 11.x

[0.6.0] - 2021-10-18

Added

  • Add support to override Faker locale in scrubber settings

Changed

  • Publish coverage only on main repository

[0.5.6] - 2021-10-08

Changed

  • Pin psycopg2 in CI to 2.8.6 as 2.9+ is incompatible with Django 2.2

[0.5.5] - 2021-10-08

Changed

  • Support faker 9.x

[0.5.4] - 2021-04-13

Changed

  • Support faker 8.x

[0.5.3] - 2021-02-04

Changed

  • Support faker 6.x

[0.5.2] - 2021-01-12

Changed

  • Add tests for Python 3.9
  • Add tests for Django 3.1
  • Support faker 5.x
  • Update dev package requirements

[0.5.1] - 2020-10-16

Changed

  • Fix travis setup

[0.5.0] - 2020-10-16

Added

  • Support for django-model-utils 4.x.x

Changed

  • Add compatibility for Faker 3.x.x, remove support for Faker < 0.8.0
  • Remove support for Python 2.7 and 3.5
  • Remove support for Django 1.x

[0.4.4] - 2019-12-11

Fixed

  • add the same version restrictions on faker to setup.py

[0.4.3] - 2019-12-04

Added

  • add empty and null scrubbers

Changed

  • make Lorem scrubber lazy, matching README

Fixed

  • set more stringent version requirements (faker >= 3 breaks builds)

[0.4.1] - 2019-11-16

Fixed

  • correctly clear fake data model to fix successive calls to scrub_data (thanks Benedikt Bauer)

[0.4.0] - 2019-11-13

Added

  • Faker scrubber now supports passing arbitrary arguments to faker providers and also non-text fields (thanks Benedikt Bauer and Ronny Vedrilla)

[0.3.1] - 2018-09-10

Fixed

  • #9 Hash scrubber choking on fields with max_length=None - Thanks to Charlie Denton

[0.3.0] - 2018-09-06

Added

  • Finally added some basic tests (thanks Marco De Felice)
  • Hash scrubber can now also be used on sqlite

Changed

  • BREAKING: scrubbers that are lazily initialized now receive Field instances as parameters, instead of field names. If you have custom scrubbers depending on the previous behavior, these should be updated. Accessing the field's name from the object instance is trivial: field_instance.name. E.g.: if you have some_field = MyCustomScrubber in any of your models' Scrubbers, this class must accept a Field instance as first parameter. Note that explicitly initializing any of the built-in scrubbers with field names is still supported, so if you were just using built-in scrubbers, you should not be affected by this change.
  • related to the above, FuncField derived classes can now do connection-based setup by implementing the connection_setup method. This is mostly useful for doing different things based on the DB vendor, and is used to implement MD5() on sqlite (see added feature above)
  • Ignore proxy models when scrubbing (thanks Marco De Felice)
  • Expand tests to include python 3.7 and django 2.1

[0.2.1] - 2018-08-14

Added

  • Option to scrub only one model from the management command
  • Support loading additional faker providers by config setting SCRUBBER_ADDITIONAL_FAKER_PROVIDERS

Changed

[0.2.0] - 2018-08-13

Added

  • scrubbers.Concat to make simple concatenation of scrubbers possible

[0.1.4] - 2018-08-13

Changed

  • Make our README look beautiful on PyPI

[0.1.3] - 2018-08-13

Fixed

[0.1.2] - 2018-06-22

Changed

  • Use bumpversion and travis to make new releases
  • rename project: django_scrubber → django-scrubber

[0.1.0] - 2018-06-22

Added

  • Initial release

