Fast data anonymization with Polars
CloakData: Data Anonymizer
A flexible and extensible data anonymization library built on Polars. Designed for privacy, compliance, and testing with minimal overhead.
What's New (1.1.0)
- Added Conditional Rules with multi-rule support per column.
- Added nested conditions: all, any, not.
- Logical operators supported: and, or.
- Extended test coverage for conditions.
- Internal refactors & style improvements (ruff).
Features
- Masking: full, partial, emails, phone numbers.
- Replacement: static values, dictionaries, substrings.
- Sequential IDs: numeric or alphabetical.
- Truncation & initials extraction.
- Generalization: ages into ranges, dates into month/year.
- Randomization: choices, digits, shuffling.
- Date offsetting with reproducible seeds.
- Conditional rules: multi-rules, nested (all/any/not), logical groups (and/or).
- Built on Polars: fast & scalable.
How it works
- Load your dataset into a Polars DataFrame.
- Define anonymization rules in a simple JSON config.
- Call anonymize(df, config) to get an anonymized DataFrame.
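Because the rules live in plain JSON, a config can be sanity-checked before it is handed to anonymize. The sketch below is only an illustration built on the config shape used in this README (a top-level columns object whose entries carry a method and optional params, either as one object or as a list of them). The helper check_config is hypothetical and is not part of the cloakdata API.

```python
import json


def check_config(config: dict) -> list[str]:
    """Hypothetical helper (not part of cloakdata): list problems in a config."""
    columns = config.get("columns")
    if not isinstance(columns, dict):
        return ["config must contain a 'columns' object"]

    problems = []
    for name, rules in columns.items():
        # A column may carry a single rule (object) or several (array).
        rule_list = rules if isinstance(rules, list) else [rules]
        for i, rule in enumerate(rule_list):
            if not isinstance(rule, dict) or "method" not in rule:
                problems.append(f"{name}[{i}]: missing 'method'")
            elif not isinstance(rule.get("params", {}), dict):
                problems.append(f"{name}[{i}]: 'params' must be an object")
    return problems


raw = """
{
  "columns": {
    "name": { "method": "initials_only" },
    "cpf": { "method": "replace_with_random_digits", "params": { "digits": 11 } },
    "email": { "params": {} }
  }
}
"""
config = json.loads(raw)
print(check_config(config))  # ["email[0]: missing 'method'"]
```

A check like this only validates structure; whether a method name or parameter is actually accepted is decided by the library itself.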
Example Config
{
"columns": {
"name": { "method": "initials_only" },
"email": { "method": "mask_email" },
"phone": { "method": "mask_number" },
"cpf": {
"method": "replace_with_random_digits",
"params": { "digits": 11 }
},
"status": {
"method": "replace_exact",
"params": { "mapping": { "active": "A", "inactive": "I" } }
},
"id_seq": { "method": "sequential_numeric", "params": { "prefix": "ID" } },
"ref_code": { "method": "sequential_alpha", "params": { "prefix": "REF" } },
"comments": { "method": "truncate", "params": { "length": 5 } },
"age": { "method": "generalize_age" },
"birth_date": { "method": "generalize_date", "params": { "mode": "month_year" } },
"state": { "method": "random_choice", "params": { "choices": ["SP","RJ","MG","BA"] } },
"last_access": { "method": "date_offset", "params": { "min_days": -2, "max_days": 2 } },
"feedback": { "method": "shuffle" }
}
}
Conditional Rules
Apply transformations only when conditions are met.
Single condition
"cpf": {
"method": "replace_with_random_digits",
"params": { "digits": 11 },
"condition": {
"column": "status",
"operator": "equals",
"value": "active"
}
}
Multiple rules per column
"city": [
{ "method": "replace_with_value", "params": { "value": "X" } },
{
"method": "mask_partial",
"params": { "visible_start": 1, "visible_end": 1 },
"condition": { "column": "country", "operator": "equals", "value": "BR" }
}
]
Nested conditions
"age": {
"method": "generalize_age",
"condition": {
"all": [
{ "column": "country", "operator": "equals", "value": "BR" },
{ "any": [
{ "column": "status", "operator": "equals", "value": "active" },
{ "column": "status", "operator": "equals", "value": "archived" }
]
}
]
}
}
Operators supported:
equals, not_equals, in, not_in, gt, gte, lt, lte, contains, not_contains
Groups: all, any, not
Logical: and, or
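One way to read the condition syntax above is as a small recursive predicate evaluated per row. The sketch below is a plain-Python illustration of that reading, not the library's implementation (which operates on Polars DataFrames). The function matches and the OPERATORS table are hypothetical names, and the form of not (wrapping a single condition object) is an assumption.

```python
# Illustrative only: evaluates a condition dict against one row (a plain dict).
OPERATORS = {
    "equals": lambda a, b: a == b,
    "not_equals": lambda a, b: a != b,
    "in": lambda a, b: a in b,
    "not_in": lambda a, b: a not in b,
    "gt": lambda a, b: a > b,
    "gte": lambda a, b: a >= b,
    "lt": lambda a, b: a < b,
    "lte": lambda a, b: a <= b,
    "contains": lambda a, b: b in a,
    "not_contains": lambda a, b: b not in a,
}


def matches(condition: dict, row: dict) -> bool:
    """Hypothetical helper: True if the row satisfies the condition."""
    if "all" in condition:
        return all(matches(c, row) for c in condition["all"])
    if "any" in condition:
        return any(matches(c, row) for c in condition["any"])
    if "not" in condition:
        # Assumption: "not" wraps a single condition object.
        return not matches(condition["not"], row)
    compare = OPERATORS[condition["operator"]]
    return compare(row[condition["column"]], condition["value"])


# Same nested condition as in the "Nested conditions" example.
condition = {
    "all": [
        {"column": "country", "operator": "equals", "value": "BR"},
        {"any": [
            {"column": "status", "operator": "equals", "value": "active"},
            {"column": "status", "operator": "equals", "value": "archived"},
        ]},
    ]
}

print(matches(condition, {"country": "BR", "status": "archived"}))  # True
print(matches(condition, {"country": "BR", "status": "inactive"}))  # False
print(matches(condition, {"country": "US", "status": "active"}))    # False
```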
Example Input → Output
Input DataFrame:
| name | email | age | status |
|---|---|---|---|
| Alice Smith | alice@example.com | 25 | active |
| Bob Jones | bob@example.com | 42 | inactive |
Config:
{
"columns": {
"name": { "method": "initials_only" },
"email": { "method": "mask_email" },
"age": { "method": "generalize_age" },
"cpf": {
"method": "replace_with_random_digits",
"params": { "digits": 8 },
"condition": {
"column": "status",
"operator": "equals",
"value": "active"
}
}
}
}
Output DataFrame:
| name | email | age | cpf |
|---|---|---|---|
| A.S. | xxxxx@example.com | 20-29 | 48291034 |
| B.J. | xxxxx@example.com | 40-49 | (null) |
Examples by Method
Below are minimal examples of how each anonymization method works.
All examples assume:
import polars as pl
from cloakdata import anonymize
Masking
Full mask
df = pl.DataFrame({"ssn": ["123-45-6789", "987-65-4321"]})
config = {"columns": {"ssn": {"method": "full_mask"}}}
print(anonymize(df, config))
Mask email
df = pl.DataFrame({"email": ["john@example.com", "invalid"]})
config = {"columns": {"email": {"method": "mask_email"}}}
print(anonymize(df, config))
Mask number
df = pl.DataFrame({"phone": ["123456789", "987654321"]})
config = {"columns": {"phone": {"method": "mask_number"}}}
print(anonymize(df, config))
Mask partial
df = pl.DataFrame({"code": ["abcdef", "12345"]})
config = {"columns": {"code": {"method": "mask_partial", "params": {"visible_start": 2, "visible_end": 2}}}}
print(anonymize(df, config))
Replacement
Static value
df = pl.DataFrame({"city": ["NY", "LA"]})
config = {"columns": {"city": {"method": "replace_with_value", "params": {"value": "Unknown"}}}}
print(anonymize(df, config))
Exact mapping
df = pl.DataFrame({"status": ["active", "inactive"]})
config = {"columns": {"status": {"method": "replace_exact", "params": {"mapping": {"active": "A", "inactive": "I"}}}}}
print(anonymize(df, config))
Substring mapping
df = pl.DataFrame({"text": ["error: 404", "ok"]})
config = {"columns": {"text": {"method": "replace_by_contains", "params": {"mapping": {"error": "ERR"}}}}}
print(anonymize(df, config))
Sequential IDs
df = pl.DataFrame({"user": ["Alice", "Bob", "Charlie"]})
config = {"columns": {
"user": {"method": "sequential_numeric", "params": {"prefix": "U"}}
}}
print(anonymize(df, config))
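Truncation & Initials
# Column names match the config keys below so that each rule has a column to act on.
df = pl.DataFrame({"short": ["Alice Smith", "Bob Jones"], "initials": ["Alice Smith", "Bob Jones"]})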
config = {"columns": {
"short": {"method": "truncate", "params": {"length": 3}},
"initials": {"method": "initials_only"}
}}
print(anonymize(df, config))
Generalization
df = pl.DataFrame({"age": [25, 42], "date": ["2025-07-20", "2025-01-15"], "salary": [2300, 12500]})
config = {"columns": {
"age": {"method": "generalize_age"},
"date": {"method": "generalize_date", "params": {"mode": "year"}},
"salary": {"method": "generalize_number_range", "params": {"interval": 5000}}
}}
print(anonymize(df, config))
Randomization
df = pl.DataFrame({
"state": ["SP", "RJ", "MG"],
"cpf": ["11111", "22222", "33333"],
"col": ["A", "B", "C"]
})
config = {"columns": {
"state": {"method": "random_choice", "params": {"choices": ["AA", "BB"], "seed": 42}},
"cpf": {"method": "replace_with_random_digits", "params": {"digits": 5}},
"col": {"method": "shuffle", "params": {"seed": 42}}
}}
print(anonymize(df, config))
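Dates
# Column names match the config keys below so that each rule has a column to act on.
df = pl.DataFrame({"offset": ["2025-07-29", "2025-07-30"], "rounded": ["2025-07-29", "2025-07-30"]})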
config = {"columns": {
"offset": {"method": "date_offset", "params": {"min_days": -2, "max_days": 2, "seed": 42}},
"rounded": {"method": "round_date", "params": {"mode": "month"}}
}}
print(anonymize(df, config))
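The seed parameter is what makes a date offset reproducible: the same seed yields the same sequence of random shifts. The sketch below shows the idea in plain Python with the standard library. The helper offset_dates is hypothetical, and drawing one independent offset per value is an assumption about the behavior, not a description of cloakdata's internals.

```python
import random
from datetime import date, timedelta


def offset_dates(values, min_days, max_days, seed=None):
    """Hypothetical helper: shift each ISO date by a random number of days."""
    rng = random.Random(seed)  # private generator, global random state is untouched
    return [
        # randint is inclusive on both ends, so min_days and max_days can occur.
        (date.fromisoformat(v) + timedelta(days=rng.randint(min_days, max_days))).isoformat()
        for v in values
    ]


dates = ["2025-07-29", "2025-07-30"]
first = offset_dates(dates, -2, 2, seed=42)
second = offset_dates(dates, -2, 2, seed=42)
print(first == second)  # True: same seed, same offsets
```

Without a seed, each run draws different offsets, which is usually what you want for a one-off export but makes test fixtures unstable.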
Utilities
# The numeric column is named "rounded" to match the config key below.
df = pl.DataFrame({"a": [None, "X"], "b": ["Y", None], "rounded": [3.14159, 2.71828]})
config = {"columns": {
"coalesced": {"method": "coalesce_cols", "params": {"columns": ["a", "b"]}},
"rounded": {"method": "round_number", "params": {"digits": 2}}
}}
print(anonymize(df, config))
Supported Methods
| Method | Description | Example Input → Output |
|---|---|---|
| full_mask | Replace all values with ***** | 12345 → ***** |
| mask_email | Hide local part of email, keep domain | john@example.com → xxxxx@example.com |
| mask_number | Keep first 3 chars, mask rest | 123456789 → 123***** |
| mask_partial | Show start & end, mask middle | abcdef → ab**ef |
| replace_with_value | Replace with a static value | NY → Unknown |
| replace_exact | Replace exact matches by mapping | active → A |
| replace_by_contains | Replace if substring exists | error: 404 → ERR |
| sequential_numeric | Sequential numeric pseudonyms | Alice, Bob → U 1, U 2 |
| sequential_alpha | Sequential alphabetic pseudonyms | Alice, Bob → U A, U B |
| truncate | Truncate strings to fixed length | Alexander → Alex |
| initials_only | Convert names to initials | John Doe → J.D. |
| generalize_age | Group ages into 10-year ranges | 25 → 20-29 |
| generalize_date | Reduce granularity (year or month_year) | 2025-07-20 → 2025 |
| generalize_number_range | Bucketize numbers by interval | 23 → 20-29 |
| random_choice | Randomly pick value from list | SP → AA or BB |
| replace_with_random_digits | Random digits with fixed length | 11111 → 80239 |
| shuffle | Shuffle column values | [A,B,C] → [B,C,A] |
| date_offset | Random offset within day range | 2025-07-20 → 2025-07-18 |
| coalesce_cols | Take first non-null from multiple cols | (None, Y) → Y |
| round_number | Round numeric values to fixed decimals | 3.14159 → 3.14 |
| round_date | Round date down to month or year start | 2025-07-29 → 2025-07-01 |
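To make a few of the table rows concrete, here is a plain-Python sketch that reproduces the documented input and output pairs for mask_partial, initials_only and generalize_number_range. These functions only mirror the examples in the table; they are not cloakdata's Polars-based implementation. The handling of edge cases (for instance, values too short to mask) and the inclusive upper bound of a bucket are assumptions.

```python
def mask_partial(value: str, visible_start: int, visible_end: int) -> str:
    """Keep the first and last characters visible and star out the middle."""
    hidden = len(value) - visible_start - visible_end
    if hidden <= 0:
        return value  # assumption: values too short to mask are left unchanged
    # Slice from len(value) - visible_end so that visible_end == 0 works too.
    return value[:visible_start] + "*" * hidden + value[len(value) - visible_end:]


def initials_only(name: str) -> str:
    """Reduce each whitespace-separated part of a name to its initial."""
    return "".join(part[0].upper() + "." for part in name.split())


def generalize_number_range(value: int, interval: int) -> str:
    """Map an integer to the bucket that contains it, e.g. 23 -> '20-29'."""
    low = (value // interval) * interval
    return f"{low}-{low + interval - 1}"  # assumption: inclusive upper bound


print(mask_partial("abcdef", 2, 2))       # ab**ef
print(initials_only("John Doe"))          # J.D.
print(generalize_number_range(23, 10))    # 20-29
```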
Project Structure
src/
└── cloakdata/ # Core library
tests/ # Test suite (pytest + Polars)
examples/ # Sample CSVs & configs
README.md # Project docs
pyproject.toml # Build system (uv/hatch)
Installation
pip install cloakdata
Or with uv:
uv add cloakdata
Quickstart
import polars as pl
from cloakdata import anonymize
df = pl.DataFrame({
"name": ["Alice Smith", "Bob Jones"],
"email": ["alice@example.com", "bob@example.com"],
"age": [25, 42]
})
config = {
"columns": {
"name": { "method": "initials_only" },
"email": { "method": "mask_email" },
"age": { "method": "generalize_age" }
}
}
out = anonymize(df, config)
print(out)
Development
git clone https://github.com/youruser/cloakdata
cd cloakdata
uv sync
pre-commit install
pytest -v
Roadmap
- Regex-based redaction
- Hashing strategies (SHA256, bcrypt)
- Parallel processing for large datasets
Contributing
We love contributions! See CONTRIBUTING.md for setup, coding standards, how to add a new anonymization method, tests and the PR checklist.
Notice
See NOTICE for attribution details.
License
MIT © Jeferson Peter