
Fast data anonymization with Polars


CloakData: Data Anonymizer


A flexible and extensible data anonymization library built on Polars. Designed for privacy, compliance, and testing with minimal overhead.


What's New (2.0.0)

  • Improved all masking and transformation methods for consistency and safety.
  • Standardized method signatures to return pl.Expr for better composability.
  • Added round_date to round dates to month or year start.
  • Improved parameter handling (defaults, null-safety, predictable behavior).
  • Refactored and validated tests to ensure stability across changes.
  • Improved documentation and moved detailed examples into examples/.
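
To make the new round_date behavior concrete, here is a plain-Python sketch of rounding a date down to the start of its month or year. It uses only the standard library and is not CloakData's implementation; the function name `round_date_down` and its `unit` argument are hypothetical.

```python
from datetime import date


def round_date_down(d: date, unit: str = "month") -> date:
    """Round a date down to the first day of its month or year (hypothetical helper)."""
    if unit == "month":
        return d.replace(day=1)
    if unit == "year":
        return d.replace(month=1, day=1)
    raise ValueError(f"unsupported unit: {unit!r}")


print(round_date_down(date(2025, 7, 29)))          # 2025-07-01
print(round_date_down(date(2025, 7, 29), "year"))  # 2025-01-01
```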

Features

  • Masking: full, partial, emails, phone numbers.
  • Replacement: static values, dictionaries, substrings.
  • Sequential IDs: numeric or alphabetical.
  • Truncation & initials extraction.
  • Generalization: ages into ranges, dates into month/year.
  • Randomization: choices, digits, shuffling.
  • Date offsetting with reproducible seeds.
  • Conditional rules: multi-rules, nested (all/any/not), logical groups (and/or).
  • Built on Polars → fast & scalable.
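
One way to picture "date offsetting with reproducible seeds" is the standard-library sketch below. It derives the day offset from a hash of the seed and the date, so the same inputs always shift the same way. The hashing scheme, the function name `offset_date`, and the `seed` argument are assumptions for illustration; only `min_days` and `max_days` come from the example config, and CloakData's actual algorithm may differ.

```python
import hashlib
from datetime import date, timedelta


def offset_date(d: date, min_days: int = -2, max_days: int = 2, seed: int = 0) -> date:
    """Shift a date by a pseudo-random number of days in [min_days, max_days].

    The offset comes from a SHA-256 hash of (seed, date), which makes the
    result repeatable. This scheme is an assumption, not CloakData's algorithm.
    """
    if min_days > max_days:
        raise ValueError("min_days must not exceed max_days")
    digest = hashlib.sha256(f"{seed}:{d.isoformat()}".encode()).digest()
    span = max_days - min_days + 1
    offset = min_days + int.from_bytes(digest[:8], "big") % span
    return d + timedelta(days=offset)


original = date(2025, 7, 20)
shifted = offset_date(original, seed=42)

# Same seed and input give the same output, and the shift stays in range.
assert shifted == offset_date(original, seed=42)
assert -2 <= (shifted - original).days <= 2
print(original, "->", shifted)
```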

How it works

  1. Load your dataset into a Polars DataFrame.
  2. Define anonymization rules in a simple JSON config.
  3. Call anonymize(df, config) → get a safe anonymized DataFrame.
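
For step 2, a config stored as JSON has to become a Python dict before it is handed to `anonymize`. The sketch below parses such a config with the standard `json` module and normalizes each column to a list of rules. `load_rules` is a hypothetical helper written for this illustration and is not part of CloakData.

```python
import json

CONFIG_TEXT = """
{
  "columns": {
    "name": { "method": "initials_only" },
    "cpf": { "method": "replace_with_random_digits", "params": { "digits": 11 } }
  }
}
"""


def load_rules(text: str) -> dict[str, list[dict]]:
    """Parse a JSON config and return {column: [rule, ...]} (hypothetical helper)."""
    config = json.loads(text)
    rules = {}
    for column, spec in config["columns"].items():
        # A column may carry a single rule (object) or several (array).
        rule_list = spec if isinstance(spec, list) else [spec]
        for rule in rule_list:
            if "method" not in rule:
                raise ValueError(f"rule for column {column!r} has no 'method'")
        rules[column] = rule_list
    return rules


rules = load_rules(CONFIG_TEXT)
for column, rule_list in rules.items():
    print(column, [rule["method"] for rule in rule_list])
```

The Quickstart passes the config to `anonymize` as a plain dict, so `json.loads` (or `json.load` for a file) should be all that is needed to go from a JSON file to that dict.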

Example Config

{
  "columns": {
    "name": { "method": "initials_only" },
    "email": { "method": "mask_email" },
    "phone": { "method": "mask_number" },
    "cpf": {
      "method": "replace_with_random_digits",
      "params": { "digits": 11 }
    },
    "status": {
      "method": "replace_exact",
      "params": { "mapping": { "active": "A", "inactive": "I" } }
    },
    "id_seq": { "method": "sequential_numeric", "params": { "prefix": "ID" } },
    "ref_code": { "method": "sequential_alpha", "params": { "prefix": "REF" } },
    "comments": { "method": "truncate", "params": { "length": 5 } },
    "age": { "method": "generalize_age" },
    "birth_date": { "method": "generalize_date", "params": { "mode": "month_year" } },
    "state": { "method": "random_choice", "params": { "choices": ["SP","RJ","MG","BA"] } },
    "last_access": { "method": "date_offset", "params": { "min_days": -2, "max_days": 2 } },
    "feedback": { "method": "shuffle" }
  }
}
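
As a mental model for how a config like this drives the transformations, the sketch below maps method names to small functions and applies them column by column. It works on plain Python lists rather than Polars, implements only `truncate` and `replace_exact` as described in the Supported Methods table, and the `METHODS` registry and `apply_config` function are hypothetical. It does not show how CloakData is structured internally.

```python
from typing import Callable


def truncate(values: list, length: int) -> list:
    """Cut strings to at most `length` characters; nulls (None) pass through."""
    return [v if v is None else v[:length] for v in values]


def replace_exact(values: list, mapping: dict) -> list:
    """Swap values found in `mapping`; everything else is left unchanged."""
    return [mapping.get(v, v) for v in values]


# Hypothetical registry: method name -> function(values, **params).
METHODS: dict[str, Callable[..., list]] = {
    "truncate": truncate,
    "replace_exact": replace_exact,
}


def apply_config(table: dict[str, list], config: dict) -> dict[str, list]:
    """Return a new table with each configured column transformed."""
    out = dict(table)
    for column, rule in config["columns"].items():
        func = METHODS[rule["method"]]
        out[column] = func(table[column], **rule.get("params", {}))
    return out


table = {
    "status": ["active", "inactive", "pending"],
    "comments": ["Great service", None, "ok"],
}
config = {
    "columns": {
        "status": {
            "method": "replace_exact",
            "params": {"mapping": {"active": "A", "inactive": "I"}},
        },
        "comments": {"method": "truncate", "params": {"length": 5}},
    }
}
print(apply_config(table, config))
# {'status': ['A', 'I', 'pending'], 'comments': ['Great', None, 'ok']}
```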

Conditional Rules

Apply transformations only when conditions are met.

Single condition

"cpf": {
  "method": "replace_with_random_digits",
  "params": { "digits": 11 },
  "condition": {
    "column": "status",
    "operator": "equals",
    "value": "active"
  }
}

Multiple rules per column

"city": [
  { "method": "replace_with_value", "params": { "value": "X" } },
  {
    "method": "mask_partial",
    "params": { "visible_start": 1, "visible_end": 1 },
    "condition": { "column": "country", "operator": "equals", "value": "BR" }
  }
]

Nested conditions

"age": {
  "method": "generalize_age",
  "condition": {
    "all": [
      { "column": "country", "operator": "equals", "value": "BR" },
      { "any": [
          { "column": "status", "operator": "equals", "value": "active" },
          { "column": "status", "operator": "equals", "value": "archived" }
        ]
      }
    ]
  }
}

  • Operators supported: equals, not_equals, in, not_in, gt, gte, lt, lte, contains, not_contains
  • Groups: all, any, not
  • Logical: and, or
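
The semantics of the operators and groups listed above can be sketched as a small recursive evaluator over a single row. This is a plain-Python illustration and not CloakData's code: the function name `matches` is hypothetical, `not` is assumed to wrap a single condition, the and/or logical forms are left out because their config shape is not shown here, and null handling is ignored.

```python
import operator

# Leaf operators from the list above, as plain-Python predicates.
# Each takes (column_value, configured_value).
OPERATORS = {
    "equals": operator.eq,
    "not_equals": operator.ne,
    "in": lambda a, b: a in b,
    "not_in": lambda a, b: a not in b,
    "gt": operator.gt,
    "gte": operator.ge,
    "lt": operator.lt,
    "lte": operator.le,
    "contains": lambda a, b: b in a,
    "not_contains": lambda a, b: b not in a,
}


def matches(condition: dict, row: dict) -> bool:
    """Recursively evaluate a condition against one row (a dict)."""
    if "all" in condition:
        return all(matches(c, row) for c in condition["all"])
    if "any" in condition:
        return any(matches(c, row) for c in condition["any"])
    if "not" in condition:
        # Assumption: "not" wraps exactly one nested condition.
        return not matches(condition["not"], row)
    test = OPERATORS[condition["operator"]]
    return test(row[condition["column"]], condition["value"])


# The nested condition from the example above.
condition = {
    "all": [
        {"column": "country", "operator": "equals", "value": "BR"},
        {"any": [
            {"column": "status", "operator": "equals", "value": "active"},
            {"column": "status", "operator": "equals", "value": "archived"},
        ]},
    ]
}

print(matches(condition, {"country": "BR", "status": "archived"}))  # True
print(matches(condition, {"country": "BR", "status": "inactive"}))  # False
print(matches(condition, {"country": "US", "status": "active"}))    # False
```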


Example Input → Output

Input DataFrame:

| name | email | age | status |
|---|---|---|---|
| Alice Smith | alice@example.com | 25 | active |
| Bob Jones | bob@example.com | 42 | inactive |

Config:

{
  "columns": {
    "name": { "method": "initials_only" },
    "email": { "method": "mask_email" },
    "age": { "method": "generalize_age" },
    "cpf": {
      "method": "replace_with_random_digits",
      "params": { "digits": 8 },
      "condition": {
        "column": "status",
        "operator": "equals",
        "value": "active"
      }
    }
  }
}

Output DataFrame:

| name | email | age | cpf |
|---|---|---|---|
| A.S. | xxxxx@example.com | 20-29 | 48291034 |
| B.J. | xxxxx@example.com | 40-49 | (null) |
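
The name, email, and age columns of this output can be reproduced with a few lines of plain Python, which may help when reasoning about what each method does. These are sketches of the visible behavior only and not CloakData's Polars implementation. The fixed `"xxxxx"` default mask is inferred from the output above, and the `width` argument is a hypothetical parameter.

```python
def initials_only(name: str) -> str:
    """Reduce a full name to dotted initials, e.g. 'Alice Smith' -> 'A.S.'."""
    return "".join(part[0].upper() + "." for part in name.split())


def mask_email(email: str, mask: str = "xxxxx") -> str:
    """Replace the local part with a fixed mask and keep the domain.

    Assumes the address contains '@'; no fallback domain is handled here.
    """
    _local, _sep, domain = email.partition("@")
    return f"{mask}@{domain}"


def generalize_age(age: int, width: int = 10) -> str:
    """Bucket an age into a range such as '20-29' (width is hypothetical)."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


rows = [
    ("Alice Smith", "alice@example.com", 25),
    ("Bob Jones", "bob@example.com", 42),
]
for name, email, age in rows:
    print(initials_only(name), mask_email(email), generalize_age(age))
# A.S. xxxxx@example.com 20-29
# B.J. xxxxx@example.com 40-49
```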

Examples

Runnable, self-contained scripts are in the examples/ folder.


Supported Methods

| Method | Description | Example Input → Output |
|---|---|---|
| full_mask | Fixed mask or literal; supports char, len, mask_literal, match_length, preserve_nulls. | 12345 → ***** / XXXXXXXX / REDACTED |
| mask_email | Masks local part; supports mask, fallback_domain, preserve_nulls. | john@example.com → xxxxx@example.com |
| mask_number | Keep first N digits, then mask the rest (configurable keep, mask, len, preserve_nulls). | 123456789 → 123***** • 98765 + keep=2, mask="X" → 98XXX • 42 + keep=2, len=4, mask="#" → 42#### |
| mask_partial | Partial masking with configurable visibility. | abcdef → a****f (visible_start=1, visible_end=1) |
| replace_with_value | Replace entire column with a static value (dtype preserved). Optionally keep nulls with preserve_nulls=True. Requires value. | ["a", None, "b"] + value="X" → "X","X","X" • preserve_nulls=True → "X", None, "X" • value=123 → 123,123,123 |
| replace_exact | Replace values that exactly match keys in a mapping. Values not in the mapping are unchanged. Dtype is inferred from replacements (no forced Utf8). | ["a","b","c"] + {"a":"X"} → ["X","b","c"] • [1,2,3] + {1:99,3:-1} → [99,2,-1] • [True,False] + {True:False} → [False,False] |
| replace_by_contains | Replace values when they contain given substrings. Literal by default; first match wins; nulls preserved. Options: mapping, substr+replacement, case_sensitive, use_regex. | ["foo","bar","baz"] + {"ba":"X"} → ["foo","X","X"] • case_sensitive=False: "Hello" + {"hello":"X"} → "X" • use_regex=True: {"\\d{3}":"HIT"} on "id=123" → "HIT" |
| replace_with_random_digits | Replace values with randomly generated digit strings (fixed length). | 11111 → 80239 |
| sequential_numeric | Sequential numeric pseudonyms with optional prefix (prefix=None → raw integers, default "val"). | ["Alice","Bob","Alice"] → ["val 1","val 2","val 1"] |
| sequential_alpha | Sequential alphabetic pseudonyms with optional prefix; duplicates get the same label; order by first appearance. | ["Alice","Bob","Alice"] → ["val A","val B","val A"] |
| truncate | Truncates strings to a maximum length (nulls preserved unless configured). | "Porto Alegre" → "Port" |
| initials_only | Convert names to initials. | John Doe → J.D. |
| generalize_age | Group ages into ranges. | 25 → 20-29 |
| generalize_date | Generalize dates/datetimes by reducing granularity (year, month, quarter, semester, week, date, datetime). | 2025-07-20 → 2025-07 ; 2025-07-20 → 2025-Q3 |
| generalize_number_range | Bucket numeric values into fixed-size ranges (e.g. 0–9, 10–19). | 42 → 40-49 |
| random_choice | Replace values with a deterministic pseudo-random choice from a fixed set (null-safe). | São Paulo → X / Y (with seed) |
| shuffle | Shuffle column values (row order preserved). | [A, B, C] → [B, C, A] |
| date_offset | Apply a deterministic pseudo-random day offset within a configurable range (null-safe). | 2025-07-20 → 2025-07-18 |
| coalesce_cols | Return the first non-null value from a list of columns, respecting priority order. | (None, Y) → Y |
| round_number | Round numeric values to a configurable number of decimal places. | 3.14159 (digits=2) → 3.14 |
| round_date | Round dates down to the start of a month or year. | 2025-07-29 → 2025-07-01 |
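
A few of the table's examples are easier to follow when worked through in code. The plain-Python sketches below reproduce the documented input/output pairs for mask_partial, mask_number, the sequential pseudonyms, and generalize_date. They are illustrations of the described behavior, not CloakData's Polars expressions. Assumptions are marked in the comments: the library's `len` parameter is called `length` here to avoid shadowing the builtin, and the helper names `sequential_labels`, `to_letters`, and the `mode` values are hypothetical.

```python
from datetime import date


def mask_partial(text: str, visible_start: int = 1, visible_end: int = 1, mask: str = "*") -> str:
    """Keep the first and last characters visible and mask the middle."""
    hidden = len(text) - visible_start - visible_end
    if hidden <= 0:
        # Too short to keep both ends: mask everything (an assumption).
        return mask * len(text)
    return text[:visible_start] + mask * hidden + text[len(text) - visible_end:]


def mask_number(value: str, keep: int, mask: str = "*", length=None) -> str:
    """Keep the first `keep` characters, then append mask characters.

    `length` (the library's `len`) is read as the number of mask characters;
    by default it equals the number of characters hidden. Inferred from the
    table's examples.
    """
    hidden = max(len(value) - keep, 0)
    count = hidden if length is None else length
    return value[:keep] + mask * count


def to_letters(index: int) -> str:
    """0 -> 'A', 25 -> 'Z', 26 -> 'AA' (spreadsheet-style; an assumption)."""
    letters = ""
    index += 1
    while index > 0:
        index, rem = divmod(index - 1, 26)
        letters = chr(ord("A") + rem) + letters
    return letters


def sequential_labels(values: list, prefix: str = "val", alpha: bool = False) -> list:
    """Give each distinct value a label in order of first appearance."""
    seen: dict = {}
    labels = []
    for v in values:
        index = seen.setdefault(v, len(seen))
        token = to_letters(index) if alpha else str(index + 1)
        labels.append(f"{prefix} {token}")
    return labels


def generalize_date(d: date, mode: str = "month") -> str:
    """Reduce a date to 'YYYY-MM' or 'YYYY-Qn' (two of the listed granularities)."""
    if mode == "month":
        return f"{d.year}-{d.month:02d}"
    if mode == "quarter":
        return f"{d.year}-Q{(d.month - 1) // 3 + 1}"
    raise ValueError(f"unsupported mode: {mode!r}")


print(mask_partial("abcdef"))                                  # a****f
print(mask_number("98765", keep=2, mask="X"))                  # 98XXX
print(mask_number("42", keep=2, length=4, mask="#"))           # 42####
print(sequential_labels(["Alice", "Bob", "Alice"]))            # ['val 1', 'val 2', 'val 1']
print(sequential_labels(["Alice", "Bob", "Alice"], alpha=True))  # ['val A', 'val B', 'val A']
print(generalize_date(date(2025, 7, 20), "quarter"))           # 2025-Q3
```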

Project Structure

src/
 └── cloakdata/           # Core library
tests/                    # Test suite (pytest + Polars)
examples/                 # Sample CSVs & configs
README.md                 # Project docs
pyproject.toml            # Build system (uv/hatch)

Installation

pip install cloakdata

Or with uv:

uv add cloakdata

Quickstart

import polars as pl
from cloakdata import anonymize

df = pl.DataFrame({
    "name": ["Alice Smith", "Bob Jones"],
    "email": ["alice@example.com", "bob@example.com"],
    "age": [25, 42]
})

config = {
    "columns": {
        "name": { "method": "initials_only" },
        "email": { "method": "mask_email" },
        "age": { "method": "generalize_age" }
    }
}

out = anonymize(df, config)
print(out)

Development

git clone https://github.com/youruser/cloakdata
cd cloakdata
uv sync
pre-commit install
pytest -v

Roadmap

  • Regex-based redaction
  • Hashing strategies (SHA256, bcrypt)
  • Parallel processing for large datasets

Contributing

We love contributions! See CONTRIBUTING.md for setup, coding standards, how to add a new anonymization method, tests and the PR checklist.

Notice

See NOTICE for attribution details.

License

MIT © Jeferson Peter
