Fast data anonymization with Polars
CloakData: Data Anonymizer
A flexible and extensible data anonymization library built on Polars. Designed for privacy, compliance, and testing with minimal overhead.
What's New (2.0.0)
- Improved all masking and transformation methods for consistency and safety.
- Standardized method signatures to return `pl.Expr` for better composability.
- Added `round_date` to round dates to month or year start.
- Improved parameter handling (defaults, null-safety, predictable behavior).
- Refactored and validated tests to ensure stability across changes.
- Improved documentation and moved detailed examples into examples/.
Features
- Masking: full, partial, emails, phone numbers.
- Replacement: static values, dictionaries, substrings.
- Sequential IDs: numeric or alphabetical.
- Truncation and initials extraction.
- Generalization: ages into ranges, dates into month/year.
- Randomization: choices, digits, shuffling.
- Date offsetting with reproducible seeds.
- Conditional rules: multiple rules per column, nested groups (`all`/`any`/`not`), logical groups (`and`/`or`).
- Built on Polars: fast and scalable.
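The idea behind date offsetting with reproducible seeds can be pictured with a small plain-Python sketch. This is not CloakData's implementation: the helper name `offset_date`, the `seed` argument, and the hashing scheme are assumptions, chosen only to show how a per-value offset can stay stable across runs. The `min_days` and `max_days` arguments mirror the `date_offset` params used elsewhere in this README.

```python
import hashlib
from datetime import date, timedelta


def offset_date(value: date, min_days: int = -2, max_days: int = 2, seed: int = 0) -> date:
    """Shift a date by a pseudo-random number of days derived from (seed, value)."""
    if min_days > max_days:
        raise ValueError("min_days must be <= max_days")
    # Hash the seed together with the date so the same input always gets the same offset.
    digest = hashlib.sha256(f"{seed}:{value.isoformat()}".encode()).digest()
    span = max_days - min_days + 1
    offset = min_days + int.from_bytes(digest[:8], "big") % span
    return value + timedelta(days=offset)


original = date(2025, 7, 20)
shifted = offset_date(original, min_days=-2, max_days=2, seed=42)
print(original, "->", shifted)
```

Running it twice with the same seed gives the same shifted date, while a different seed will usually give a different one.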
How it works
- Load your dataset into a Polars `DataFrame`.
- Define anonymization rules in a simple JSON config.
- Call `anonymize(df, config)` to get a safe anonymized DataFrame.
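Because the config is plain JSON, it can be inspected before it is handed to `anonymize`. The sketch below uses only the standard library. `normalize_rules` is a hypothetical helper, not part of CloakData, and it assumes the two shapes shown in this README: a single rule object per column, or a list of rule objects.

```python
import json

CONFIG_JSON = """
{
  "columns": {
    "name": { "method": "initials_only" },
    "city": [
      { "method": "replace_with_value", "params": { "value": "X" } },
      { "method": "mask_partial", "params": { "visible_start": 1, "visible_end": 1 } }
    ]
  }
}
"""


def normalize_rules(config: dict) -> dict[str, list[dict]]:
    """Return {column: [rule, ...]}, wrapping single-rule entries in a list."""
    normalized = {}
    for column, entry in config["columns"].items():
        rules = entry if isinstance(entry, list) else [entry]
        for rule in rules:
            # Every rule in this README carries a "method" key; fail early if one is missing.
            if "method" not in rule:
                raise ValueError(f"rule for column {column!r} has no 'method'")
        normalized[column] = rules
    return normalized


rules = normalize_rules(json.loads(CONFIG_JSON))
for column, column_rules in rules.items():
    print(column, [rule["method"] for rule in column_rules])
```

A check like this catches a misspelled key before any data is touched. Whether the library performs similar validation itself is not stated here.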
Example Config
{
  "columns": {
    "name": { "method": "initials_only" },
    "email": { "method": "mask_email" },
    "phone": { "method": "mask_number" },
    "cpf": {
      "method": "replace_with_random_digits",
      "params": { "digits": 11 }
    },
    "status": {
      "method": "replace_exact",
      "params": { "mapping": { "active": "A", "inactive": "I" } }
    },
    "id_seq": { "method": "sequential_numeric", "params": { "prefix": "ID" } },
    "ref_code": { "method": "sequential_alpha", "params": { "prefix": "REF" } },
    "comments": { "method": "truncate", "params": { "length": 5 } },
    "age": { "method": "generalize_age" },
    "birth_date": { "method": "generalize_date", "params": { "mode": "month_year" } },
    "state": { "method": "random_choice", "params": { "choices": ["SP", "RJ", "MG", "BA"] } },
    "last_access": { "method": "date_offset", "params": { "min_days": -2, "max_days": 2 } },
    "feedback": { "method": "shuffle" }
  }
}
Conditional Rules
Apply transformations only when conditions are met.
Single condition
"cpf": {
  "method": "replace_with_random_digits",
  "params": { "digits": 11 },
  "condition": {
    "column": "status",
    "operator": "equals",
    "value": "active"
  }
}
Multiple rules per column
"city": [
  { "method": "replace_with_value", "params": { "value": "X" } },
  {
    "method": "mask_partial",
    "params": { "visible_start": 1, "visible_end": 1 },
    "condition": { "column": "country", "operator": "equals", "value": "BR" }
  }
]
Nested conditions
"age": {
  "method": "generalize_age",
  "condition": {
    "all": [
      { "column": "country", "operator": "equals", "value": "BR" },
      {
        "any": [
          { "column": "status", "operator": "equals", "value": "active" },
          { "column": "status", "operator": "equals", "value": "archived" }
        ]
      }
    ]
  }
}
Operators supported:
equals, not_equals, in, not_in, gt, gte, lt, lte, contains, not_contains
Groups: all, any, not
Logical: and, or
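To make the semantics of nested groups concrete, here is a plain-Python sketch that evaluates a condition against one row represented as a dict. It is an illustration, not CloakData's evaluator: the function name `matches` is hypothetical, `not` is assumed to wrap a single condition, and null values are not handled.

```python
import operator

# The operators listed above, mapped to plain Python comparisons.
OPERATORS = {
    "equals": operator.eq,
    "not_equals": operator.ne,
    "gt": operator.gt,
    "gte": operator.ge,
    "lt": operator.lt,
    "lte": operator.le,
    "in": lambda value, options: value in options,
    "not_in": lambda value, options: value not in options,
    "contains": lambda value, part: part in value,
    "not_contains": lambda value, part: part not in value,
}


def matches(condition: dict, row: dict) -> bool:
    """Recursively evaluate a condition (leaf or all/any/not group) for one row."""
    if "all" in condition:
        return all(matches(sub, row) for sub in condition["all"])
    if "any" in condition:
        return any(matches(sub, row) for sub in condition["any"])
    if "not" in condition:
        return not matches(condition["not"], row)
    # Leaf condition: look up the comparison and apply it to the row's value.
    compare = OPERATORS[condition["operator"]]
    return compare(row[condition["column"]], condition["value"])


condition = {
    "all": [
        {"column": "country", "operator": "equals", "value": "BR"},
        {"any": [
            {"column": "status", "operator": "equals", "value": "active"},
            {"column": "status", "operator": "equals", "value": "archived"},
        ]},
    ]
}

print(matches(condition, {"country": "BR", "status": "archived"}))  # True
print(matches(condition, {"country": "BR", "status": "inactive"}))  # False
print(matches(condition, {"country": "US", "status": "active"}))    # False
```

The library works on whole columns rather than row dicts, so treat this only as a way to reason about which rows a nested condition selects.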
Example Input → Output
Input DataFrame:
| name | email | age | status |
|---|---|---|---|
| Alice Smith | alice@example.com | 25 | active |
| Bob Jones | bob@example.com | 42 | inactive |
Config:
{
  "columns": {
    "name": { "method": "initials_only" },
    "email": { "method": "mask_email" },
    "age": { "method": "generalize_age" },
    "cpf": {
      "method": "replace_with_random_digits",
      "params": { "digits": 8 },
      "condition": {
        "column": "status",
        "operator": "equals",
        "value": "active"
      }
    }
  }
}
Output DataFrame:
| name | email | age | cpf |
|---|---|---|---|
| A.S. | xxxxx@example.com | 20-29 | 48291034 |
| B.J. | xxxxx@example.com | 40-49 | (null) |
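The name, email, and age columns of the output table can be reproduced by hand with a few plain-Python functions. These are simplified stand-ins written for illustration, not the library's Polars expressions. The function bodies, the fixed `xxxxx` mask, and the bucket width of 10 are assumptions inferred from the example values.

```python
def initials_only(name: str) -> str:
    """Reduce each whitespace-separated part of a name to its initial plus a dot."""
    return "".join(f"{part[0].upper()}." for part in name.split())


def mask_email(email: str, mask: str = "xxxxx") -> str:
    """Replace the local part with a fixed mask and keep the domain."""
    _, _, domain = email.partition("@")
    return f"{mask}@{domain}"


def generalize_age(age: int, width: int = 10) -> str:
    """Bucket an age into a range label of the given width."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


rows = [
    ("Alice Smith", "alice@example.com", 25),
    ("Bob Jones", "bob@example.com", 42),
]
for name, email, age in rows:
    print(initials_only(name), mask_email(email), generalize_age(age))
```

The two printed lines match the first three columns of the table: `A.S. xxxxx@example.com 20-29` and `B.J. xxxxx@example.com 40-49`.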
Examples
Runnable, self-contained scripts are in the examples/ folder.
- Masking: masking/full_mask.py, masking/mask_email.py
- Replacement: replacement/replace_with_value.py
- Generalization: generalization/generalize_age.py
- Dates: dates/date_offset.py
- Randomization: randomization/random_choice.py
- Utilities: utilities/coalesce_cols.py
Supported Methods
| Method | Description | Example Input → Output |
|---|---|---|
| `full_mask` | Fixed mask or literal; supports `char`, `len`, `mask_literal`, `match_length`, `preserve_nulls`. | `12345` → `*****` / `XXXXXXXX` / `REDACTED` |
| `mask_email` | Masks local part; supports `mask`, `fallback_domain`, `preserve_nulls`. | `john@example.com` → `xxxxx@example.com` |
| `mask_number` | Keeps the first N digits, then masks the rest (configurable `keep`, `mask`, `len`, `preserve_nulls`). | `123456789` → `123*****` • `98765` + `keep=2, mask="X"` → `98XXX` • `42` + `keep=2, len=4, mask="#"` → `42####` |
| `mask_partial` | Partial masking with configurable visibility. | `abcdef` → `a****f` (`visible_start=1, visible_end=1`) |
| `replace_with_value` | Replaces the entire column with a static value (dtype preserved). Optionally keeps nulls with `preserve_nulls=True`. Requires `value`. | `["a", None, "b"]` + `value="X"` → `"X","X","X"` • `preserve_nulls=True` → `"X", None, "X"` • `value=123` → `123,123,123` |
| `replace_exact` | Replaces values that exactly match keys in a mapping. Values not in the mapping are unchanged. Dtype is inferred from replacements (no forced Utf8). | `["a","b","c"]` + `{"a":"X"}` → `["X","b","c"]` • `[1,2,3]` + `{1:99,3:-1}` → `[99,2,-1]` • `[True,False]` + `{True:False}` → `[False,False]` |
| `replace_by_contains` | Replaces values when they contain given substrings. Literal by default; first match wins; nulls preserved. Options: `mapping`, `substr` + `replacement`, `case_sensitive`, `use_regex`. | `["foo","bar","baz"]` + `{"ba":"X"}` → `["foo","X","X"]` • `case_sensitive=False`: `"Hello"` + `{"hello":"X"}` → `"X"` • `use_regex=True`: `{"\\d{3}":"HIT"}` on `"id=123"` → `"HIT"` |
| `replace_with_random_digits` | Replaces values with randomly generated digit strings (fixed length). | `11111` → `80239` |
| `sequential_numeric` | Sequential numeric pseudonyms with optional prefix (`prefix=None` → raw integers, default `"val"`). | `["Alice","Bob","Alice"]` → `["val 1","val 2","val 1"]` |
| `sequential_alpha` | Sequential alphabetic pseudonyms with optional prefix; duplicates get the same label; ordered by first appearance. | `["Alice","Bob","Alice"]` → `["val A","val B","val A"]` |
| `truncate` | Truncates strings to a maximum length (nulls preserved unless configured). | `"Porto Alegre"` → `"Port"` |
| `initials_only` | Converts names to initials. | `John Doe` → `J.D.` |
| `generalize_age` | Groups ages into ranges. | `25` → `20-29` |
| `generalize_date` | Generalizes dates/datetimes by reducing granularity (year, month, quarter, semester, week, date, datetime). | `2025-07-20` → `2025-07` ; `2025-07-20` → `2025-Q3` |
| `generalize_number_range` | Buckets numeric values into fixed-size ranges (e.g. 0-9, 10-19). | `42` → `40-49` |
| `random_choice` | Replaces values with a deterministic pseudo-random choice from a fixed set (null-safe). | `São Paulo` → `X` / `Y` (with seed) |
| `shuffle` | Shuffles column values (row order preserved). | `[A, B, C]` → `[B, C, A]` |
| `date_offset` | Applies a deterministic pseudo-random day offset within a configurable range (null-safe). | `2025-07-20` → `2025-07-18` |
| `coalesce_cols` | Returns the first non-null value from a list of columns, respecting priority order. | `(None, Y)` → `Y` |
| `round_number` | Rounds numeric values to a configurable number of decimal places. | `3.14159` (`digits=2`) → `3.14` |
| `round_date` | Rounds dates down to the start of a month or year. | `2025-07-29` → `2025-07-01` |
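The first-appearance rule described for `sequential_numeric` and `sequential_alpha` is easy to check in plain Python. The sketch below is an illustration under assumptions, not the library code: `sequential_labels` and `alpha_label` are hypothetical names, keeping nulls as nulls is assumed, and the spreadsheet-style continuation after Z (AA, AB, and so on) is a guess, since the table only shows single letters.

```python
def alpha_label(index: int) -> str:
    """Convert a zero-based position to letters: 0 is A, 25 is Z, 26 is AA (assumed scheme)."""
    label = ""
    index += 1
    while index > 0:
        index, remainder = divmod(index - 1, 26)
        label = chr(ord("A") + remainder) + label
    return label


def sequential_labels(values, prefix="val", alphabetic=False):
    """Map each distinct value to a label numbered by order of first appearance."""
    seen = {}
    labels = []
    for value in values:
        if value is None:
            labels.append(None)  # keep nulls as nulls (assumed behaviour)
            continue
        if value not in seen:
            seen[value] = len(seen)
        position = seen[value]
        suffix = alpha_label(position) if alphabetic else str(position + 1)
        # With no prefix this sketch returns the bare suffix as a string.
        labels.append(f"{prefix} {suffix}" if prefix else suffix)
    return labels


print(sequential_labels(["Alice", "Bob", "Alice"]))
print(sequential_labels(["Alice", "Bob", "Alice"], alphabetic=True))
```

Both calls give repeated inputs the same pseudonym, which is what keeps joins and group-bys meaningful after anonymization.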
Project Structure
src/
└── cloakdata/ # Core library
tests/ # Test suite (pytest + Polars)
examples/ # Sample CSVs & configs
README.md # Project docs
pyproject.toml # Build system (uv/hatch)
Installation
pip install cloakdata
Or with uv:
uv add cloakdata
Quickstart
import polars as pl
from cloakdata import anonymize

# Small sample frame containing personal data
df = pl.DataFrame({
    "name": ["Alice Smith", "Bob Jones"],
    "email": ["alice@example.com", "bob@example.com"],
    "age": [25, 42],
})

# One rule per column: a method name plus optional params
config = {
    "columns": {
        "name": {"method": "initials_only"},
        "email": {"method": "mask_email"},
        "age": {"method": "generalize_age"},
    }
}

out = anonymize(df, config)
print(out)
Development
git clone https://github.com/youruser/cloakdata
cd cloakdata
uv sync
pre-commit install
pytest -v
Roadmap
- Regex-based redaction
- Hashing strategies (SHA256, bcrypt)
- Parallel processing for large datasets
Contributing
We love contributions! See CONTRIBUTING.md for setup, coding standards, how to add a new anonymization method, tests and the PR checklist.
Notice
See NOTICE for attribution details.
License
MIT © Jeferson Peter
File details
Details for the file cloakdata-2.0.0.tar.gz.
File metadata
- Download URL: cloakdata-2.0.0.tar.gz
- Upload date:
- Size: 6.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.9.24
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 39c288039018574d08ac8beaea589246e39874dff72ee0c7697db503c62a94d5 |
| MD5 | 660eca4395aabcc28701eee99630e412 |
| BLAKE2b-256 | f91cf4426c8d31aa3b43ab434f0d1f7dbefb5ac52ccd47ad353ff7b5f8055ca0 |
File details
Details for the file cloakdata-2.0.0-py3-none-any.whl.
File metadata
- Download URL: cloakdata-2.0.0-py3-none-any.whl
- Upload date:
- Size: 7.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.9.24
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 21a4583d961df93918e4d54cedd9c9d2655fec5e10f4227d828c6ccbdd44ea7e |
| MD5 | d31d23135a093de755d5851d2be7aa6d |
| BLAKE2b-256 | 5d557979dff24764a8512ea6900e12dae8877915533cb488bc4c35b263e5cd6e |