Fast data anonymization with Polars
CloakData - Data Anonymizer
A flexible data anonymization library built on Polars, designed for privacy, compliance, and testing with low overhead.
Current Highlights
- Built-in methods are organized by domain under src/cloakdata/native_methods/
- Native methods are registered automatically with @native_method
- Practical support for masking, replacement, generalization, randomization, and data cleanup
- Config-driven anonymization with conditional rules
- Runnable examples for the main built-in methods
- Built on Polars for fast vectorized execution
Features
- Masking: full_mask, mask_email, mask_number, mask_credit_card, mask_cpf, mask_partial
- Replacement and pseudonymization: replace_with_value, replace_exact, replace_by_contains, replace_with_random_digits, hash_value, replace_with_hash_bucket, redact_regex
- Generalization: generalize_age, generalize_date, generalize_number_range, generalize_zip_code, top_k_bucket, coarsen_datetime
- Randomization and transforms: random_choice, shuffle, noise_numeric, date_offset, round_number, round_date, clip_range
- Utilities: coalesce_cols, null_if_matches
- Sequential pseudonyms: sequential_numeric, sequential_alpha
- Conditional rules with nested logic
- Custom runtime methods with register_method(...)
How It Works
- Load data into a Polars DataFrame
- Define rules in a config dictionary
- Call anonymize(df, config)
- Receive a transformed DataFrame
Quickstart
import polars as pl

from cloakdata import anonymize

# Sample data containing direct identifiers.
df = pl.DataFrame(
    {
        "name": ["Alice Smith", "Bob Jones"],
        "email": ["alice@example.com", "bob@example.com"],
        "age": [25, 42],
    }
)

# One rule per column; "method" names the built-in transformation to apply.
config = {
    "columns": {
        "name": {"method": "initials_only"},
        "email": {"method": "mask_email"},
        "age": {"method": "generalize_age"},
    }
}

# anonymize returns a transformed DataFrame.
out = anonymize(df, config)
print(out)
Example Config
{
  "columns": {
    "name": { "method": "initials_only" },
    "email": { "method": "mask_email" },
    "email_hash": {
      "method": "hash_value",
      "params": { "salt": "team-2026" }
    },
    "phone": { "method": "mask_number", "params": { "keep": 3 } },
    "cpf": {
      "method": "mask_cpf",
      "params": { "keep_last": 2 }
    },
    "status": {
      "method": "replace_exact",
      "params": { "mapping": { "active": "A", "inactive": "I" } }
    },
    "notes": {
      "method": "redact_regex",
      "params": {
        "pattern": "[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}",
        "replacement": "[EMAIL]"
      }
    },
    "id_seq": { "method": "sequential_numeric", "params": { "prefix": "ID" } },
    "ref_code": { "method": "sequential_alpha", "params": { "prefix": "REF" } },
    "comments": { "method": "truncate", "params": { "length": 5 } },
    "age": { "method": "generalize_age" },
    "birth_date": { "method": "generalize_date", "params": { "mode": "month" } },
    "zip_code": { "method": "generalize_zip_code", "params": { "visible_prefix": 3 } },
    "state": { "method": "random_choice", "params": { "choices": ["SP", "RJ", "MG", "BA"] } },
    "last_access": { "method": "date_offset", "params": { "min_days": -2, "max_days": 2 } },
    "event_time": { "method": "coarsen_datetime", "params": { "mode": "part_of_day" } },
    "feedback": { "method": "shuffle" },
    "score": { "method": "clip_range", "params": { "min": 0, "max": 100 } }
  }
}
Conditional Rules
Single condition:
"cpf": {
  "method": "mask_cpf",
  "params": { "keep_last": 2 },
  "condition": {
    "column": "status",
    "operator": "equals",
    "value": "active"
  }
}
Multiple rules per column:
"city": [
  { "method": "replace_with_value", "params": { "value": "X" } },
  {
    "method": "mask_partial",
    "params": { "visible_start": 1, "visible_end": 1 },
    "condition": { "column": "country", "operator": "equals", "value": "BR" }
  }
]
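Because a column can carry either a single rule object or a list of rules, code that inspects a config usually normalizes both shapes first. The following is a minimal sketch using only the standard library; `rules_by_column` is a hypothetical helper, not part of CloakData, and it makes no claim about how the library orders or resolves multiple rules.

```python
import json
from typing import Any, Dict, List

# A small config in the same shape as the snippets above.
raw = """
{
  "columns": {
    "email": { "method": "mask_email" },
    "city": [
      { "method": "replace_with_value", "params": { "value": "X" } },
      {
        "method": "mask_partial",
        "params": { "visible_start": 1, "visible_end": 1 },
        "condition": { "column": "country", "operator": "equals", "value": "BR" }
      }
    ]
  }
}
"""


def rules_by_column(config: Dict[str, Any]) -> Dict[str, List[Dict[str, Any]]]:
    """Hypothetical helper: return every column's rules as a list."""
    normalized: Dict[str, List[Dict[str, Any]]] = {}
    for column, spec in config["columns"].items():
        # A single rule object becomes a one-element list.
        rules = spec if isinstance(spec, list) else [spec]
        for rule in rules:
            if "method" not in rule:
                raise ValueError(f"rule for column {column!r} has no 'method'")
        normalized[column] = rules
    return normalized


rules = rules_by_column(json.loads(raw))
for column, column_rules in rules.items():
    print(column, [rule["method"] for rule in column_rules])
```

A check like this can catch a misspelled key before the config reaches anonymize(df, config).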
Nested conditions:
"age": {
  "method": "generalize_age",
  "condition": {
    "all": [
      { "column": "country", "operator": "equals", "value": "BR" },
      {
        "any": [
          { "column": "status", "operator": "equals", "value": "active" },
          { "column": "status", "operator": "equals", "value": "archived" }
        ]
      }
    ]
  }
}
Supported operators:
equals, not_equals, in, not_in, gt, gte, lt, lte, contains, not_contains
Supported groups:
all, any, not
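To make the semantics of nested groups concrete, here is a minimal pure-Python sketch that evaluates a condition against a single row represented as a dict. It is an illustration only, not CloakData's implementation (which runs on Polars). The `matches` function and `OPERATORS` table are hypothetical names, and the sketch assumes that `not` wraps exactly one condition and that `contains` tests substring or membership.

```python
from typing import Any, Callable, Dict, Mapping

# Hypothetical operator table; the keys mirror the operator names listed above.
OPERATORS: Dict[str, Callable[[Any, Any], bool]] = {
    "equals": lambda a, b: a == b,
    "not_equals": lambda a, b: a != b,
    "in": lambda a, b: a in b,
    "not_in": lambda a, b: a not in b,
    "gt": lambda a, b: a > b,
    "gte": lambda a, b: a >= b,
    "lt": lambda a, b: a < b,
    "lte": lambda a, b: a <= b,
    "contains": lambda a, b: b in a,
    "not_contains": lambda a, b: b not in a,
}


def matches(condition: Mapping[str, Any], row: Mapping[str, Any]) -> bool:
    """Evaluate a possibly nested condition against one row."""
    if "all" in condition:
        return all(matches(child, row) for child in condition["all"])
    if "any" in condition:
        return any(matches(child, row) for child in condition["any"])
    if "not" in condition:
        # Assumption: "not" wraps a single condition object.
        return not matches(condition["not"], row)
    compare = OPERATORS[condition["operator"]]
    return compare(row[condition["column"]], condition["value"])


# The nested condition from the example above.
condition = {
    "all": [
        {"column": "country", "operator": "equals", "value": "BR"},
        {
            "any": [
                {"column": "status", "operator": "equals", "value": "active"},
                {"column": "status", "operator": "equals", "value": "archived"},
            ]
        },
    ]
}

rows = [
    {"country": "BR", "status": "active"},
    {"country": "BR", "status": "inactive"},
    {"country": "US", "status": "archived"},
]
results = [matches(condition, row) for row in rows]
print(results)  # [True, False, False]
```

Only the first row satisfies both the country check and one of the two status checks, so only that row would be transformed under this reading of the rule.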
Example Input to Output
Input:
| name | email | age | status |
|---|---|---|---|
| Alice Smith | alice@example.com | 25 | active |
| Bob Jones | bob@example.com | 42 | inactive |
Config:
{
  "columns": {
    "name": { "method": "initials_only" },
    "email": { "method": "mask_email" },
    "age": { "method": "generalize_age" },
    "cpf": {
      "method": "mask_cpf",
      "params": { "keep_last": 2 },
      "condition": {
        "column": "status",
        "operator": "equals",
        "value": "active"
      }
    }
  }
}
Output:
| name | email | age | cpf |
|---|---|---|---|
| A.S. | xxxxx@example.com | 20-29 | *********01 |
| B.J. | xxxxx@example.com | 40-49 | null |
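The transformations in this table can be approximated in plain Python to show what each one does to a single value. This is a sketch for intuition, not the library's code: the helper names (`to_initials`, `mask_local_part`, `age_band`, `mask_keep_last`), the fixed `xxxxx` mask, the ten-year band width, and the sample CPF are all assumptions chosen to match the output shown above.

```python
def to_initials(name: str) -> str:
    """'Alice Smith' -> 'A.S.'"""
    return "".join(f"{part[0].upper()}." for part in name.split())


def mask_local_part(email: str, mask: str = "xxxxx") -> str:
    """Replace everything before '@' with a fixed mask."""
    local, sep, domain = email.partition("@")
    return f"{mask}{sep}{domain}" if sep else mask


def age_band(age: int, width: int = 10) -> str:
    """Map an age to an inclusive range label such as '20-29'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def mask_keep_last(value: str, keep_last: int = 2, char: str = "*") -> str:
    """Mask all characters except the last `keep_last`."""
    visible = value[-keep_last:] if keep_last > 0 else ""
    return char * (len(value) - len(visible)) + visible


print(to_initials("Alice Smith"))            # A.S.
print(mask_local_part("alice@example.com"))  # xxxxx@example.com
print(age_band(25))                          # 20-29
print(mask_keep_last("12345678901"))         # *********01 (made-up CPF)
```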
Examples
Runnable scripts live under examples/.
- Masking: examples/masking/full_mask.py, examples/masking/mask_email.py, examples/masking/mask_number.py, examples/masking/mask_partials.py, examples/masking/mask_credit_card.py, examples/masking/mask_cpf.py, examples/masking/truncate.py
- Replacement: examples/replace/replace_with_value.py, examples/replace/replace_exact.py, examples/replace/replace_by_contains.py, examples/replace/replace_with_random_digits.py, examples/replace/hash_value.py, examples/replace/redact_regex.py, examples/replace/replace_with_hash_bucket.py
- Generalization: examples/generalize/generalize_age.py, examples/generalize/generalize_date.py, examples/generalize/generalize_number_range.py, examples/generalize/generalize_zip_code.py, examples/generalize/top_k_bucket.py, examples/generalize/coarsen_datetime.py
- Randomization: examples/random/random_choice.py, examples/random/noise_numeric.py, examples/random/shuffle.py, examples/random/date_offset.py
- Numeric transforms: examples/round/round_number.py, examples/round/round_date.py, examples/round/clip_range.py
- Sequential: examples/sequential/sequential_numeric.py, examples/sequential/sequential_alpha.py
- Utilities: examples/utils/coalesce.py, examples/utils/null_if_matches.py
Supported Methods
| Method | Description |
|---|---|
| full_mask | Fixed mask or literal |
| mask_email | Masks the local part of an email |
| mask_number | Keeps leading characters and masks the rest |
| mask_credit_card | Masks card digits while preserving the last visible digits |
| mask_cpf | Masks Brazilian CPF values while preserving the final visible digits |
| mask_partial | Masks the middle while preserving visible edges |
| truncate | Truncates strings to a fixed length |
| initials_only | Converts names to initials |
| replace_with_value | Replaces all values with a static value |
| hash_value | Generates deterministic hashes, with optional salt |
| redact_regex | Redacts regex matches inside free text |
| replace_with_hash_bucket | Replaces values with deterministic hash buckets |
| replace_exact | Replaces exact values using a mapping |
| replace_by_contains | Replaces values that contain substrings |
| replace_with_random_digits | Generates deterministic digit strings |
| sequential_numeric | Sequential numeric pseudonyms |
| sequential_alpha | Sequential alphabetic pseudonyms |
| generalize_age | Groups ages into ranges |
| generalize_date | Reduces date and datetime granularity |
| generalize_number_range | Buckets numeric values into fixed intervals |
| generalize_zip_code | Preserves a visible postal-code prefix and masks the rest |
| coarsen_datetime | Coarsens timestamps into buckets, minute-of-day buckets (time-only or full datetime), hour, part-of-day, weekday, weekend/weekday, or configurable business-hours labels |
| top_k_bucket | Keeps the top-k most frequent categories and buckets the rest |
| random_choice | Picks deterministic values from a fixed set |
| noise_numeric | Adds deterministic numeric noise within configured bounds |
| shuffle | Shuffles values while keeping row count |
| date_offset | Applies deterministic date offsets |
| clip_range | Constrains numeric values to configured min/max bounds |
| round_number | Rounds numeric values |
| round_date | Rounds dates to month or year start |
| coalesce_cols | Returns the first non-null value across columns |
| null_if_matches | Converts known placeholders or regex matches into null |
Notes
- hash_value is deterministic and is the better choice when you need stable one-way pseudonymization.
- replace_with_hash_bucket is deterministic bucketing, not unique pseudonymization. Different input values can land in the same bucket, and some must collide once the number of unique values exceeds the configured number of buckets.
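The difference between the two approaches can be sketched with the standard library. This is not CloakData's implementation: the use of SHA-256, the way the salt is prepended, and the helper names `salted_hash` and `hash_bucket` are assumptions made for illustration.

```python
import hashlib


def salted_hash(value: str, salt: str = "") -> str:
    """Deterministic one-way token: SHA-256 over salt + value."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()


def hash_bucket(value: str, n_buckets: int) -> int:
    """Deterministic bucket index in the range [0, n_buckets)."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_buckets


emails = ["alice@example.com", "bob@example.com", "carol@example.com"]

tokens = {e: salted_hash(e, salt="team-2026") for e in emails}
buckets = {e: hash_bucket(e, n_buckets=2) for e in emails}

# The same input and salt always yield the same token.
assert tokens["alice@example.com"] == salted_hash("alice@example.com", "team-2026")

# Three distinct values in two buckets: at least two must share a bucket.
assert len(set(buckets.values())) < len(emails)
```

Hash tokens keep distinct values distinguishable, which suits joins across tables, while buckets deliberately merge values and trade uniqueness for coarser grouping.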
Project Structure
src/
  cloakdata/
    native_methods/   # Built-in methods organized by domain
tests/                # Pytest suite
examples/             # Runnable examples
README.md
pyproject.toml
Built-in methods live under src/cloakdata/native_methods/ and are registered automatically with @native_method.
Installation
pip install cloakdata
Or with uv:
uv add cloakdata
Development
git clone https://github.com/Jeferson-Peter/cloakdata
cd cloakdata
uv sync --extra dev
pre-commit install
pytest -v
Choosing Methods
- Use hash_value when you need stable one-way pseudonymization.
- Use replace_with_hash_bucket when you need deterministic grouping and collisions are acceptable.
- Use generalize_date when you want period-style date abstraction such as month, quarter, or year.
- Use round_date when you want canonical rounded dates such as month-start or year-start.
- Use coarsen_datetime when you want timestamp abstraction such as hour buckets, part-of-day labels, weekdays, or business-hours labels.
- Use null_if_matches before anonymization when your source data contains placeholders such as N/A, unknown, or regex-shaped junk values.
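The distinction between the three date-oriented methods is easier to see on a single timestamp. The sketch below uses only the standard library and is not the library's code: the label formats, the reading of month-start as truncation to the first day, and the part-of-day hour boundaries are all assumptions.

```python
from datetime import date, datetime


def month_period(d: date) -> str:
    """Period-style label (generalize_date style): a string, not a date."""
    return f"{d.year:04d}-{d.month:02d}"


def quarter_period(d: date) -> str:
    """Period-style quarter label such as '2024-Q3'."""
    return f"{d.year:04d}-Q{(d.month - 1) // 3 + 1}"


def month_start(d: date) -> date:
    """Canonical date (round_date style): still a date, set to day 1."""
    return d.replace(day=1)


def part_of_day(ts: datetime) -> str:
    """Timestamp label (coarsen_datetime style); boundaries are hypothetical."""
    if 5 <= ts.hour < 12:
        return "morning"
    if 12 <= ts.hour < 18:
        return "afternoon"
    if 18 <= ts.hour < 22:
        return "evening"
    return "night"


ts = datetime(2024, 8, 17, 14, 35)
print(month_period(ts.date()))    # 2024-08
print(quarter_period(ts.date()))  # 2024-Q3
print(month_start(ts.date()))     # 2024-08-01
print(part_of_day(ts))            # afternoon
```

A period label suits grouping and reporting, a canonical date keeps the column usable in date arithmetic, and a time-of-day label discards the calendar date entirely.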
Contributing
See CONTRIBUTING.md for setup, coding standards, how to add a new anonymization method, and the PR checklist.
Notice
See NOTICE for attribution details.
License
MIT. Copyright Jeferson Peter.