Fast data anonymization with Polars

CloakData - Data Anonymizer

A flexible data anonymization library built on Polars, designed for privacy, compliance, and testing with low overhead.


Current Highlights

  • Built-in methods are organized by domain under src/cloakdata/native_methods/
  • Native methods are registered automatically with @native_method
  • Practical support for masking, replacement, generalization, randomization, and data cleanup
  • Config-driven anonymization with conditional rules
  • Runnable examples for the main built-in methods
  • Built on Polars for fast vectorized execution

Features

  • Masking:
    • full_mask
    • mask_email
    • mask_number
    • mask_credit_card
    • mask_cpf
    • mask_partial
    • truncate
    • initials_only
  • Replacement and pseudonymization:
    • replace_with_value
    • replace_exact
    • replace_by_contains
    • replace_with_random_digits
    • hash_value
    • replace_with_hash_bucket
    • redact_regex
  • Generalization:
    • generalize_age
    • generalize_date
    • generalize_number_range
    • generalize_zip_code
    • top_k_bucket
    • coarsen_datetime
  • Randomization and transforms:
    • random_choice
    • shuffle
    • noise_numeric
    • date_offset
    • round_number
    • round_date
    • clip_range
  • Utilities:
    • coalesce_cols
    • null_if_matches
  • Sequential pseudonyms:
    • sequential_numeric
    • sequential_alpha
  • Conditional rules with nested logic
  • Custom runtime methods with register_method(...)

How It Works

  1. Load data into a Polars DataFrame
  2. Define rules in a config dictionary
  3. Call anonymize(df, config)
  4. Receive a transformed DataFrame

Quickstart

import polars as pl

from cloakdata import anonymize

df = pl.DataFrame(
    {
        "name": ["Alice Smith", "Bob Jones"],
        "email": ["alice@example.com", "bob@example.com"],
        "age": [25, 42],
    }
)

config = {
    "columns": {
        "name": {"method": "initials_only"},
        "email": {"method": "mask_email"},
        "age": {"method": "generalize_age"},
    }
}

out = anonymize(df, config)
print(out)

Example Config

{
  "columns": {
    "name": { "method": "initials_only" },
    "email": { "method": "mask_email" },
    "email_hash": {
      "method": "hash_value",
      "params": { "salt": "team-2026" }
    },
    "phone": { "method": "mask_number", "params": { "keep": 3 } },
    "cpf": {
      "method": "mask_cpf",
      "params": { "keep_last": 2 }
    },
    "status": {
      "method": "replace_exact",
      "params": { "mapping": { "active": "A", "inactive": "I" } }
    },
    "notes": {
      "method": "redact_regex",
      "params": {
        "pattern": "[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}",
        "replacement": "[EMAIL]"
      }
    },
    "id_seq": { "method": "sequential_numeric", "params": { "prefix": "ID" } },
    "ref_code": { "method": "sequential_alpha", "params": { "prefix": "REF" } },
    "comments": { "method": "truncate", "params": { "length": 5 } },
    "age": { "method": "generalize_age" },
    "birth_date": { "method": "generalize_date", "params": { "mode": "month" } },
    "zip_code": { "method": "generalize_zip_code", "params": { "visible_prefix": 3 } },
    "state": { "method": "random_choice", "params": { "choices": ["SP", "RJ", "MG", "BA"] } },
    "last_access": { "method": "date_offset", "params": { "min_days": -2, "max_days": 2 } },
    "event_time": { "method": "coarsen_datetime", "params": { "mode": "part_of_day" } },
    "feedback": { "method": "shuffle" },
    "score": { "method": "clip_range", "params": { "min": 0, "max": 100 } }
  }
}
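
Because the config is plain JSON, it can live in a file or string and be parsed with the standard library before it is passed to anonymize. The sketch below is one way to do that; the load_config helper and its shape checks are hypothetical and are not part of cloakdata.

```python
import json


# Hypothetical helper, not part of cloakdata: parse a JSON config and run a
# few basic shape checks before handing the dict to anonymize(df, config).
def load_config(text: str) -> dict:
    config = json.loads(text)
    columns = config.get("columns")
    if not isinstance(columns, dict):
        raise ValueError('config must contain a "columns" object')
    for name, rules in columns.items():
        # A column may carry a single rule or a list of rules.
        rule_list = rules if isinstance(rules, list) else [rules]
        for rule in rule_list:
            if "method" not in rule:
                raise ValueError(f"rule for column {name!r} has no method")
    return config


raw = """
{
  "columns": {
    "email": { "method": "mask_email" },
    "phone": { "method": "mask_number", "params": { "keep": 3 } }
  }
}
"""
config = load_config(raw)
print(sorted(config["columns"]))  # ['email', 'phone']
```

Checking the shape up front means a misspelled key fails before any data is touched.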

Conditional Rules

Single condition:

"cpf": {
  "method": "mask_cpf",
  "params": { "keep_last": 2 },
  "condition": {
    "column": "status",
    "operator": "equals",
    "value": "active"
  }
}

Multiple rules per column:

"city": [
  { "method": "replace_with_value", "params": { "value": "X" } },
  {
    "method": "mask_partial",
    "params": { "visible_start": 1, "visible_end": 1 },
    "condition": { "column": "country", "operator": "equals", "value": "BR" }
  }
]

Nested conditions:

"age": {
  "method": "generalize_age",
  "condition": {
    "all": [
      { "column": "country", "operator": "equals", "value": "BR" },
      {
        "any": [
          { "column": "status", "operator": "equals", "value": "active" },
          { "column": "status", "operator": "equals", "value": "archived" }
        ]
      }
    ]
  }
}

Supported operators: equals, not_equals, in, not_in, gt, gte, lt, lte, contains, not_contains

Supported groups: all, any, not
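
To make the semantics of nested conditions concrete, here is a small pure-Python evaluator for the shapes shown above. It is a sketch, not cloakdata's implementation: it checks one row (a dict) at a time instead of building vectorized Polars expressions, it ignores null handling, and it assumes that "not" wraps a single nested condition, which the examples above do not show.

```python
from typing import Any, Callable

# Row-level stand-ins for the documented operators (illustrative only).
OPERATORS: dict[str, Callable[[Any, Any], bool]] = {
    "equals": lambda a, b: a == b,
    "not_equals": lambda a, b: a != b,
    "in": lambda a, b: a in b,
    "not_in": lambda a, b: a not in b,
    "gt": lambda a, b: a > b,
    "gte": lambda a, b: a >= b,
    "lt": lambda a, b: a < b,
    "lte": lambda a, b: a <= b,
    "contains": lambda a, b: b in a,
    "not_contains": lambda a, b: b not in a,
}


def matches(condition: dict, row: dict) -> bool:
    """Return True when the row satisfies the (possibly nested) condition."""
    if "all" in condition:
        return all(matches(c, row) for c in condition["all"])
    if "any" in condition:
        return any(matches(c, row) for c in condition["any"])
    if "not" in condition:
        # Assumption: "not" wraps exactly one nested condition.
        return not matches(condition["not"], row)
    compare = OPERATORS[condition["operator"]]
    return compare(row[condition["column"]], condition["value"])


condition = {
    "all": [
        {"column": "country", "operator": "equals", "value": "BR"},
        {
            "any": [
                {"column": "status", "operator": "equals", "value": "active"},
                {"column": "status", "operator": "equals", "value": "archived"},
            ]
        },
    ]
}

print(matches(condition, {"country": "BR", "status": "archived"}))  # True
print(matches(condition, {"country": "BR", "status": "inactive"}))  # False
print(matches(condition, {"country": "US", "status": "active"}))    # False
```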


Example Input to Output

Input (the cpf column targeted by the config is not shown):

name         email              age  status
Alice Smith  alice@example.com  25   active
Bob Jones    bob@example.com    42   inactive

Config:

{
  "columns": {
    "name": { "method": "initials_only" },
    "email": { "method": "mask_email" },
    "age": { "method": "generalize_age" },
    "cpf": {
      "method": "mask_cpf",
      "params": { "keep_last": 2 },
      "condition": {
        "column": "status",
        "operator": "equals",
        "value": "active"
      }
    }
  }
}

Output:

name  email              age    cpf
A.S.  xxxxx@example.com  20-29  *********01
B.J.  xxxxx@example.com  40-49  null
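
The name, email, and age columns of that output can be reproduced with a few lines of plain Python, which may help when reasoning about what each method does to a value. These functions are approximations written to match the sample rows only; the fixed-width "xxxxx" mask and the 10-year age bands are assumptions inferred from the output, not cloakdata's actual code.

```python
# Pure-Python approximations of three built-in methods (illustrative only).

def initials_only(name: str) -> str:
    # "Alice Smith" -> "A.S."
    return "".join(f"{part[0].upper()}." for part in name.split())


def mask_email(email: str) -> str:
    # Assumption: the local part becomes a fixed-width "xxxxx" mask, which
    # matches both sample rows regardless of the local part's length.
    _, _, domain = email.partition("@")
    return f"xxxxx@{domain}"


def generalize_age(age: int, width: int = 10) -> str:
    # Assumption: fixed 10-year bands labelled "low-high".
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


rows = [
    ("Alice Smith", "alice@example.com", 25),
    ("Bob Jones", "bob@example.com", 42),
]
for name, email, age in rows:
    print(initials_only(name), mask_email(email), generalize_age(age))
# A.S. xxxxx@example.com 20-29
# B.J. xxxxx@example.com 40-49
```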

Examples

Runnable scripts live under examples/.


Supported Methods

Method                      Description
full_mask                   Fixed mask or literal
mask_email                  Masks the local part of an email
mask_number                 Keeps leading characters and masks the rest
mask_credit_card            Masks card digits while preserving the last visible digits
mask_cpf                    Masks Brazilian CPF values while preserving the final visible digits
mask_partial                Masks the middle while preserving visible edges
truncate                    Truncates strings to a fixed length
initials_only               Converts names to initials
replace_with_value          Replaces all values with a static value
hash_value                  Generates deterministic hashes, with optional salt
redact_regex                Redacts regex matches inside free text
replace_with_hash_bucket    Replaces values with deterministic hash buckets
replace_exact               Replaces exact values using a mapping
replace_by_contains         Replaces values that contain substrings
replace_with_random_digits  Generates deterministic digit strings
sequential_numeric          Sequential numeric pseudonyms
sequential_alpha            Sequential alphabetic pseudonyms
generalize_age              Groups ages into ranges
generalize_date             Reduces date and datetime granularity
generalize_number_range     Buckets numeric values into fixed intervals
generalize_zip_code         Preserves a visible postal-code prefix and masks the rest
coarsen_datetime            Coarsens timestamps into minute-of-day buckets (time-only or full datetime), hour, part-of-day, weekday, weekend/weekday, or configurable business-hours labels
top_k_bucket                Keeps the top-k most frequent categories and buckets the rest
random_choice               Picks deterministic values from a fixed set
noise_numeric               Adds deterministic numeric noise within configured bounds
shuffle                     Shuffles values while keeping row count
date_offset                 Applies deterministic date offsets
clip_range                  Constrains numeric values to configured min/max bounds
round_number                Rounds numeric values
round_date                  Rounds dates to month or year start
coalesce_cols               Returns the first non-null value across columns
null_if_matches             Converts known placeholders or regex matches into null
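
The masking methods in this table differ mainly in which part of a value stays visible. The sketch below illustrates the three patterns (keep a prefix, keep a suffix, keep both edges) in plain Python. The function names, the "*" mask character, and the sample inputs are hypothetical; only the parameter names keep, keep_last, visible_start, and visible_end come from the config examples earlier in this README.

```python
def mask_keep_prefix(value: str, keep: int, mask_char: str = "*") -> str:
    # Keep the first `keep` characters and mask the rest.
    return value[:keep] + mask_char * max(len(value) - keep, 0)


def mask_keep_suffix(value: str, keep_last: int, mask_char: str = "*") -> str:
    # Keep the last `keep_last` characters and mask everything before them.
    if keep_last <= 0:
        return mask_char * len(value)
    return mask_char * max(len(value) - keep_last, 0) + value[-keep_last:]


def mask_middle(value: str, visible_start: int, visible_end: int,
                mask_char: str = "*") -> str:
    # Keep both edges visible and mask the middle. Values too short to have
    # a middle are returned unchanged (a choice made for this sketch).
    hidden = len(value) - visible_start - visible_end
    if hidden <= 0:
        return value
    tail = value[len(value) - visible_end:]
    return value[:visible_start] + mask_char * hidden + tail


# Made-up sample values, used only for illustration.
print(mask_keep_prefix("5551234567", keep=3))        # 555*******
print(mask_keep_suffix("12345678901", keep_last=2))  # *********01
print(mask_middle("Salvador", 1, 1))                 # S******r
```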

Notes

  • hash_value is deterministic, which makes it the better choice when you need stable one-way pseudonymization.
  • replace_with_hash_bucket is deterministic bucketing, not unique pseudonymization. Different input values can land in the same bucket when the number of unique values is greater than the configured number of buckets.
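
The difference between the two is easy to see with the standard library. The following sketch is not cloakdata's implementation: the choice of SHA-256, the way the salt is concatenated, and the bucket numbering are all assumptions made for illustration.

```python
import hashlib


def hash_value(value: str, salt: str = "") -> str:
    # Salted SHA-256 digest: the same (salt, value) pair always yields the
    # same token. SHA-256 is an assumption made for this sketch.
    return hashlib.sha256(f"{salt}{value}".encode("utf-8")).hexdigest()


def hash_bucket(value: str, buckets: int, salt: str = "") -> int:
    # Reduce the digest to one of `buckets` groups; collisions are expected.
    digest = hashlib.sha256(f"{salt}{value}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets


emails = [f"user{i}@example.com" for i in range(10)]

tokens = {hash_value(e, salt="team-2026") for e in emails}
groups = {hash_bucket(e, buckets=4, salt="team-2026") for e in emails}

print(len(tokens))  # 10 distinct tokens for 10 distinct inputs
print(len(groups))  # at most 4, so some inputs must share a bucket
```

With ten inputs and four buckets, at least two inputs must share a bucket, so bucket labels cannot serve as unique identifiers.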

Project Structure

src/
  cloakdata/
    native_methods/  # Built-in methods organized by domain
tests/               # Pytest suite
examples/            # Runnable examples
README.md
pyproject.toml

Built-in methods live under src/cloakdata/native_methods/ and are registered automatically with @native_method.
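
For readers unfamiliar with decorator-based registration, the general pattern looks roughly like the sketch below. Everything in it is hypothetical: the REGISTRY dict, the register decorator, and apply_rule are stand-ins to show the idea and do not reflect the actual signatures of @native_method or register_method(...).

```python
from typing import Callable

# Hypothetical registry, shown only to illustrate decorator-based
# registration. These names are not cloakdata's internals.
REGISTRY: dict[str, Callable[..., str]] = {}


def register(name: str):
    """Return a decorator that stores the function under `name`."""
    def decorator(func):
        REGISTRY[name] = func
        return func
    return decorator


@register("full_mask")
def full_mask(value: str, mask: str = "***") -> str:
    return mask


@register("truncate")
def truncate(value: str, length: int) -> str:
    return value[:length]


def apply_rule(value: str, rule: dict) -> str:
    # Look the method up by name and forward the rule's params as keywords.
    func = REGISTRY[rule["method"]]
    return func(value, **rule.get("params", {}))


print(apply_rule("hello world", {"method": "truncate", "params": {"length": 5}}))  # hello
print(sorted(REGISTRY))  # ['full_mask', 'truncate']
```

The appeal of this pattern is that a method only has to be defined and decorated, and config lookups by name work without a hand-maintained dispatch table.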


Installation

pip install cloakdata

Or with uv:

uv add cloakdata

Development

git clone https://github.com/Jeferson-Peter/cloakdata
cd cloakdata
uv sync --extra dev
pre-commit install
pytest -v

Choosing Methods

  • Use hash_value when you need stable one-way pseudonymization.
  • Use replace_with_hash_bucket when you need deterministic grouping and collisions are acceptable.
  • Use generalize_date when you want period-style date abstraction such as month, quarter, or year.
  • Use round_date when you want canonical rounded dates such as month-start or year-start.
  • Use coarsen_datetime when you want timestamp abstraction such as hour buckets, part-of-day labels, weekdays, or business-hours labels.
  • Use null_if_matches before anonymization when your source data contains placeholders such as N/A or unknown, or junk values that match a regex.
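
The three date-related choices above produce different kinds of output: a period label, a canonical date, or a time-of-day category. The sketch below shows the distinction with the standard library only. The label formats ("2024-Q3", "afternoon") and the part-of-day hour boundaries are assumptions made for illustration and may not match what cloakdata emits.

```python
from datetime import date, datetime


def generalize_date(d: date, mode: str) -> str:
    # Period-style labels; the exact label formats are assumptions.
    if mode == "month":
        return f"{d.year}-{d.month:02d}"
    if mode == "quarter":
        return f"{d.year}-Q{(d.month - 1) // 3 + 1}"
    if mode == "year":
        return str(d.year)
    raise ValueError(f"unknown mode: {mode}")


def round_date(d: date, mode: str) -> date:
    # Canonical dates: first day of the month or of the year.
    if mode == "month":
        return d.replace(day=1)
    if mode == "year":
        return d.replace(month=1, day=1)
    raise ValueError(f"unknown mode: {mode}")


def part_of_day(ts: datetime) -> str:
    # Assumed boundaries: 06-12 morning, 12-18 afternoon, 18-24 evening.
    if 6 <= ts.hour < 12:
        return "morning"
    if 12 <= ts.hour < 18:
        return "afternoon"
    if ts.hour >= 18:
        return "evening"
    return "night"


d = date(2024, 8, 17)
print(generalize_date(d, "quarter"))                # 2024-Q3
print(round_date(d, "month"))                       # 2024-08-01
print(part_of_day(datetime(2024, 8, 17, 14, 30)))   # afternoon
```

A period label is a string and is no longer a date, while a rounded date keeps its type, which matters if downstream code still does date arithmetic.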

Contributing

See CONTRIBUTING.md for setup, coding standards, how to add a new anonymization method, and the PR checklist.

Notice

See NOTICE for attribution details.

License

MIT. Copyright Jeferson Peter.
