Fast data anonymization with Polars

CloakData - Data Anonymizer

A flexible data anonymization library built on Polars, designed for privacy, compliance, and testing with low overhead.


What's New in 2.0.0

  • Improved masking and transformation consistency
  • Standardized built-in methods to return pl.Expr
  • Added round_date
  • Improved parameter handling and null safety
  • Expanded test coverage
  • Reorganized built-in methods under src/cloakdata/native_methods/

Features

  • Masking: full, partial, emails, numbers
  • Replacement: static values, exact mapping, contains-based rules
  • Sequential IDs: numeric and alphabetical
  • Generalization: age, date, number ranges
  • Randomization: choices, digits, shuffle, date offsets
  • Conditional rules with nested logic
  • Custom runtime methods with register_method(...)
  • Built on Polars for fast vectorized execution

How It Works

  1. Load data into a Polars DataFrame
  2. Define rules in a config dictionary
  3. Call anonymize(df, config)
  4. Receive a transformed DataFrame

Quickstart

import polars as pl

from cloakdata import anonymize

df = pl.DataFrame(
    {
        "name": ["Alice Smith", "Bob Jones"],
        "email": ["alice@example.com", "bob@example.com"],
        "age": [25, 42],
    }
)

config = {
    "columns": {
        "name": {"method": "initials_only"},
        "email": {"method": "mask_email"},
        "age": {"method": "generalize_age"},
    }
}

out = anonymize(df, config)
print(out)

Example Config

{
  "columns": {
    "name": { "method": "initials_only" },
    "email": { "method": "mask_email" },
    "phone": { "method": "mask_number" },
    "cpf": {
      "method": "replace_with_random_digits",
      "params": { "digits": 11 }
    },
    "status": {
      "method": "replace_exact",
      "params": { "mapping": { "active": "A", "inactive": "I" } }
    },
    "id_seq": { "method": "sequential_numeric", "params": { "prefix": "ID" } },
    "ref_code": { "method": "sequential_alpha", "params": { "prefix": "REF" } },
    "comments": { "method": "truncate", "params": { "length": 5 } },
    "age": { "method": "generalize_age" },
    "birth_date": { "method": "generalize_date", "params": { "mode": "month" } },
    "state": { "method": "random_choice", "params": { "choices": ["SP", "RJ", "MG", "BA"] } },
    "last_access": { "method": "date_offset", "params": { "min_days": -2, "max_days": 2 } },
    "feedback": { "method": "shuffle" }
  }
}
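Because the config is plain JSON, it can be kept in a file and loaded with the standard library before being passed to anonymize. The sketch below only checks the shape of a config; check_config is a hypothetical helper written for this example and is not part of cloakdata, and the assumption that every rule needs a "method" key is inferred from the examples in this README.

```python
import json

# A trimmed version of the example config, stored as JSON text.
raw = """
{
  "columns": {
    "name": { "method": "initials_only" },
    "cpf": { "method": "replace_with_random_digits", "params": { "digits": 11 } },
    "status": {
      "method": "replace_exact",
      "params": { "mapping": { "active": "A", "inactive": "I" } }
    }
  }
}
"""


def check_config(config: dict) -> list[str]:
    """Hypothetical helper: return a list of problems, empty if the shape looks fine."""
    problems = []
    for column, rules in config.get("columns", {}).items():
        # A column may hold one rule object or a list of rule objects.
        for rule in rules if isinstance(rules, list) else [rules]:
            if "method" not in rule:
                problems.append(f"{column}: rule is missing 'method'")
    return problems


config = json.loads(raw)
print(check_config(config))  # []
```

The resulting dict has the same structure as the Python config used in the Quickstart.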

Conditional Rules

Single condition:

"cpf": {
  "method": "replace_with_random_digits",
  "params": { "digits": 11 },
  "condition": {
    "column": "status",
    "operator": "equals",
    "value": "active"
  }
}

Multiple rules per column:

"city": [
  { "method": "replace_with_value", "params": { "value": "X" } },
  {
    "method": "mask_partial",
    "params": { "visible_start": 1, "visible_end": 1 },
    "condition": { "column": "country", "operator": "equals", "value": "BR" }
  }
]

Nested conditions:

"age": {
  "method": "generalize_age",
  "condition": {
    "all": [
      { "column": "country", "operator": "equals", "value": "BR" },
      {
        "any": [
          { "column": "status", "operator": "equals", "value": "active" },
          { "column": "status", "operator": "equals", "value": "archived" }
        ]
      }
    ]
  }
}

Supported operators: equals, not_equals, in, not_in, gt, gte, lt, lte, contains, not_contains

Supported groups: all, any, not
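To make the nesting semantics concrete, here is a minimal row-level sketch of how such a condition tree could be evaluated. It is not cloakdata's implementation (the library works on Polars expressions), the comparison semantics of each operator are assumed from their names, and the shape of a "not" group (wrapping a single condition) is an assumption.

```python
from typing import Any, Callable

# Operator names come from the list above; their semantics are assumed.
OPERATORS: dict[str, Callable[[Any, Any], bool]] = {
    "equals": lambda a, b: a == b,
    "not_equals": lambda a, b: a != b,
    "in": lambda a, b: a in b,
    "not_in": lambda a, b: a not in b,
    "gt": lambda a, b: a > b,
    "gte": lambda a, b: a >= b,
    "lt": lambda a, b: a < b,
    "lte": lambda a, b: a <= b,
    "contains": lambda a, b: b in a,
    "not_contains": lambda a, b: b not in a,
}


def matches(condition: dict[str, Any], row: dict[str, Any]) -> bool:
    """Evaluate a possibly nested condition against one row (a plain dict)."""
    if "all" in condition:
        return all(matches(c, row) for c in condition["all"])
    if "any" in condition:
        return any(matches(c, row) for c in condition["any"])
    if "not" in condition:
        # Assumption: "not" wraps exactly one condition object.
        return not matches(condition["not"], row)
    compare = OPERATORS[condition["operator"]]
    return compare(row[condition["column"]], condition["value"])


# The nested condition from the example above.
nested = {
    "all": [
        {"column": "country", "operator": "equals", "value": "BR"},
        {
            "any": [
                {"column": "status", "operator": "equals", "value": "active"},
                {"column": "status", "operator": "equals", "value": "archived"},
            ]
        },
    ]
}

print(matches(nested, {"country": "BR", "status": "archived"}))  # True
print(matches(nested, {"country": "US", "status": "active"}))    # False
```

Null handling is left out of this sketch; how the library treats null values inside conditions is not stated here.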


Example Input to Output

Input:

name         email              age  status
Alice Smith  alice@example.com  25   active
Bob Jones    bob@example.com    42   inactive

Config:

{
  "columns": {
    "name": { "method": "initials_only" },
    "email": { "method": "mask_email" },
    "age": { "method": "generalize_age" },
    "cpf": {
      "method": "replace_with_random_digits",
      "params": { "digits": 8 },
      "condition": {
        "column": "status",
        "operator": "equals",
        "value": "active"
      }
    }
  }
}

Output:

name  email              age    cpf
A.S.  xxxxx@example.com  20-29  48291034
B.J.  xxxxx@example.com  40-49  null
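The name, email, and age columns in this output can be approximated in plain Python, which may help when reasoning about what each method does. This is a sketch that reproduces the values shown, not the library's code: the fixed five-character mask and the ten-year bucket width are assumptions read off the output above.

```python
def initials_only(name: str) -> str:
    # Take the first letter of each whitespace-separated part: "Alice Smith" -> "A.S."
    return "".join(part[0].upper() + "." for part in name.split())


def mask_email(email: str, mask: str = "xxxxx") -> str:
    # Replace the local part with a fixed mask and keep the domain.
    _, _, domain = email.partition("@")
    return f"{mask}@{domain}"


def generalize_age(age: int, width: int = 10) -> str:
    # Bucket the age into a closed range of `width` years, e.g. 25 -> "20-29".
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


rows = [
    {"name": "Alice Smith", "email": "alice@example.com", "age": 25},
    {"name": "Bob Jones", "email": "bob@example.com", "age": 42},
]

for row in rows:
    print(initials_only(row["name"]), mask_email(row["email"]), generalize_age(row["age"]))
# A.S. xxxxx@example.com 20-29
# B.J. xxxxx@example.com 40-49
```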

Examples

Runnable scripts live under examples/.


Supported Methods

Method                      Description
full_mask                   Fixed mask or literal
mask_email                  Masks the local part of an email
mask_number                 Keeps leading characters and masks the rest
mask_credit_card            Masks card digits while preserving the last visible digits
mask_partial                Masks the middle while preserving visible edges
truncate                    Truncates strings to a fixed length
initials_only               Converts names to initials
replace_with_value          Replaces all values with a static value
hash_value                  Generates deterministic hashes, with optional salt
redact_regex                Redacts regex matches inside free text
replace_with_hash_bucket    Replaces values with deterministic hash buckets
replace_exact               Replaces exact values using a mapping
replace_by_contains         Replaces values that contain substrings
replace_with_random_digits  Generates deterministic digit strings
sequential_numeric          Sequential numeric pseudonyms
sequential_alpha            Sequential alphabetic pseudonyms
generalize_age              Groups ages into ranges
generalize_date             Reduces date and datetime granularity
generalize_number_range     Buckets numeric values into fixed intervals
generalize_zip_code         Preserves a visible postal-code prefix and masks the rest
coarsen_datetime            Coarsens timestamps into minute-of-day buckets (time-only or full datetime), hour, part-of-day, weekday, weekend/weekday, or configurable business-hours labels
top_k_bucket                Keeps the top-k most frequent categories and buckets the rest
random_choice               Picks deterministic values from a fixed set
noise_numeric               Adds deterministic numeric noise within configured bounds
shuffle                     Shuffles values while keeping row count
date_offset                 Applies deterministic date offsets
clip_range                  Constrains numeric values to configured min/max bounds
round_number                Rounds numeric values
round_date                  Rounds dates to month or year start
coalesce_cols               Returns the first non-null value across columns
null_if_matches             Converts known placeholders or regex matches into null
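As an illustration of one entry, the idea behind top_k_bucket can be sketched with the standard library. The function below is a stand-in written for this example, not the library's implementation; the "OTHER" label and the parameter names k and other are assumptions.

```python
from collections import Counter


def top_k_bucket(values: list[str], k: int, other: str = "OTHER") -> list[str]:
    # Count frequencies, keep the k most common labels, bucket everything else.
    keep = {label for label, _ in Counter(values).most_common(k)}
    return [v if v in keep else other for v in values]


states = ["SP", "SP", "SP", "RJ", "RJ", "MG", "BA"]
print(top_k_bucket(states, k=2))
# ['SP', 'SP', 'SP', 'RJ', 'RJ', 'OTHER', 'OTHER']
```

Rare categories are often the most identifying ones, which is why collapsing them into a single bucket reduces re-identification risk while keeping the dominant categories usable.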

Notes

  • hash_value is deterministic: the same input (and salt) always yields the same hash. Prefer it when you need stable one-way pseudonymization.
  • replace_with_hash_bucket is deterministic bucketing, not unique pseudonymization. Different input values can land in the same bucket when the number of unique values is greater than the configured number of buckets.
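The difference between the two can be sketched with hashlib. This is an illustration of the general technique only: the choice of SHA-256, the way the salt is combined with the value, and the function signatures are assumptions, not the library's actual behavior.

```python
import hashlib


def hash_value(value: str, salt: str = "") -> str:
    # Deterministic one-way hash: same value and salt always give the same digest.
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()


def hash_bucket(value: str, buckets: int, salt: str = "") -> int:
    # Map the digest onto a fixed number of buckets; collisions are expected.
    return int(hash_value(value, salt), 16) % buckets


emails = [f"user{i}@example.com" for i in range(10)]

# Stable across calls, so joins on the hashed column still work.
print(hash_value(emails[0]) == hash_value(emails[0]))  # True

# Ten unique inputs cannot fit into four buckets without sharing.
assigned = [hash_bucket(e, buckets=4) for e in emails]
print(len(set(assigned)) < len(set(emails)))  # True
```

With ten inputs and four buckets, at least two inputs must share a bucket, which is exactly why bucketing should not be treated as a unique pseudonym.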

Project Structure

src/
  cloakdata/
    native_methods/  # Built-in methods organized by domain
tests/               # Pytest suite
examples/            # Runnable examples
README.md
pyproject.toml

Built-in methods live under src/cloakdata/native_methods/ and are registered automatically with @native_method.
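Decorator-based registration generally follows the pattern sketched below. The names REGISTRY and register are hypothetical and chosen to avoid implying anything about the real @native_method or register_method(...) signatures; the sketch also works on plain strings, whereas the built-in methods return pl.Expr.

```python
from typing import Callable

# Hypothetical registry mapping method names to callables.
REGISTRY: dict[str, Callable[..., str]] = {}


def register(name: str) -> Callable[[Callable[..., str]], Callable[..., str]]:
    """Return a decorator that stores the function under `name`."""
    def decorator(func: Callable[..., str]) -> Callable[..., str]:
        REGISTRY[name] = func
        return func
    return decorator


@register("truncate")
def truncate(value: str, length: int) -> str:
    # Keep only the first `length` characters.
    return value[:length]


# A rule in the same shape as the config examples in this README.
rule = {"method": "truncate", "params": {"length": 5}}
method = REGISTRY[rule["method"]]
print(method("hello world", **rule["params"]))  # hello
```

Looking methods up by name is what lets a config file refer to them as plain strings.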


Installation

pip install cloakdata

Or with uv:

uv add cloakdata

Development

git clone https://github.com/youruser/cloakdata
cd cloakdata
uv sync --extra dev
pre-commit install
pytest -v

Roadmap

  • Regex-based redaction
  • Hashing strategies
  • Parallel processing for large datasets

Contributing

See CONTRIBUTING.md for setup, coding standards, how to add a new anonymization method, and the PR checklist.

Notice

See NOTICE for attribution details.

License

MIT. Copyright Jeferson Peter.
