Fast data anonymization with Polars

CloakData: Data Anonymizer

A flexible, extensible data anonymization library built on Polars. It is designed for privacy, compliance, and testing workflows, and adds minimal overhead.


✨ Features

  • 🔒 Masking: full, partial, emails, phone numbers.
  • 🔄 Replacement: static values, dictionaries, substrings.
  • 🔢 Sequential IDs: numeric or alphabetical.
  • ✂️ Truncation & initials extraction.
  • 📊 Generalization: ages into ranges, dates into month/year.
  • 🎲 Randomization: choices, digits, shuffling.
  • 📅 Date offsetting with reproducible seeds.
  • 🧩 Conditional rules based on other columns.
  • ⚡ Built on Polars → fast & scalable.

⚙️ How it works

  1. Load your dataset into a Polars DataFrame.
  2. Define anonymization rules in a simple JSON config.
  3. Call anonymize(df, config) → get a safe anonymized DataFrame.
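
Step 2 can be sketched with the standard library alone. The snippet parses a JSON config and checks its basic shape before it is handed to anonymize. The `load_config` helper and its validation rules are illustrative assumptions, not part of CloakData.

```python
import json

# Hypothetical config text; in practice it might be read from a file.
CONFIG_TEXT = """
{
  "columns": {
    "name": { "method": "initials_only" },
    "email": { "method": "mask_email" }
  }
}
"""


def load_config(text):
    """Parse a JSON config and check the shape used throughout this README."""
    config = json.loads(text)
    if not isinstance(config, dict) or not isinstance(config.get("columns"), dict):
        raise ValueError('config must be an object with a "columns" object')
    for column, rule in config["columns"].items():
        if "method" not in rule:
            raise ValueError(f'rule for column {column!r} is missing "method"')
    return config


config = load_config(CONFIG_TEXT)
# Map each column name to the method that would be applied to it.
methods = {column: rule["method"] for column, rule in config["columns"].items()}
print(methods)
```

The parsed result is a plain dict, the same kind of object the examples in this README pass as the second argument of anonymize(df, config).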

🧪 Example Config

{
  "columns": {
    "name": { "method": "initials_only" },
    "email": { "method": "mask_email" },
    "phone": { "method": "mask_number" },
    "cpf": {
      "method": "replace_with_random_digits",
      "params": { "digits": 11 }
    },
    "status": {
      "method": "replace_exact",
      "params": { "mapping": { "active": "A", "inactive": "I" } }
    },
    "id_seq": { "method": "sequential_numeric", "params": { "prefix": "ID" } },
    "ref_code": { "method": "sequential_alpha", "params": { "prefix": "REF" } },
    "comments": { "method": "truncate", "params": { "length": 5 } },
    "age": { "method": "generalize_age" },
    "birth_date": { "method": "generalize_date", "params": { "mode": "month_year" } },
    "state": { "method": "random_choice", "params": { "choices": ["SP","RJ","MG","BA"] } },
    "last_access": { "method": "date_offset", "params": { "min_days": -2, "max_days": 2 } },
    "feedback": { "method": "shuffle" }
  }
}

🧠 Conditional Rules

Apply transformations only when conditions are met:

"cpf": {
  "method": "replace_with_random_digits",
  "params": { "digits": 11 },
  "condition": {
    "column": "status",
    "operator": "equals",
    "value": "active"
  }
}

Supported operators

| Operator | Description |
| --- | --- |
| equals | Equal to |
| not_equals | Not equal to |
| in | Value in list |
| not_in | Value not in list |
| gt / gte | Greater than / greater or equal |
| lt / lte | Less than / less or equal |
| contains | Substring exists in string |
| not_contains | Substring does not exist in string |
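
To make the operator semantics concrete, here is a plain-Python sketch that evaluates a condition against a single row represented as a dict. It mirrors the descriptions in the table only. The `condition_matches` helper is hypothetical, and how CloakData itself handles nulls or type mismatches is not covered.

```python
import operator

# One callable per operator name from the table: (actual, expected) -> bool.
OPERATORS = {
    "equals": operator.eq,
    "not_equals": operator.ne,
    "in": lambda actual, expected: actual in expected,
    "not_in": lambda actual, expected: actual not in expected,
    "gt": operator.gt,
    "gte": operator.ge,
    "lt": operator.lt,
    "lte": operator.le,
    "contains": lambda actual, expected: expected in actual,
    "not_contains": lambda actual, expected: expected not in actual,
}


def condition_matches(row, condition):
    """Return True when the row's value satisfies the condition."""
    check = OPERATORS[condition["operator"]]
    return check(row[condition["column"]], condition["value"])


row = {"status": "active", "age": 25, "email": "alice@example.com"}
print(condition_matches(row, {"column": "status", "operator": "equals", "value": "active"}))
print(condition_matches(row, {"column": "age", "operator": "lt", "value": 18}))
```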

Example Input → Output

Input DataFrame:

| name | email | age | status |
| --- | --- | --- | --- |
| Alice Smith | alice@example.com | 25 | active |
| Bob Jones | bob@example.com | 42 | inactive |

Config:

{
  "columns": {
    "name": { "method": "initials_only" },
    "email": { "method": "mask_email" },
    "age": { "method": "generalize_age" },
    "cpf": {
      "method": "replace_with_random_digits",
      "params": { "digits": 8 },
      "condition": {
        "column": "status",
        "operator": "equals",
        "value": "active"
      }
    }
  }
}

Output DataFrame:

| name | email | age | cpf |
| --- | --- | --- | --- |
| A.S. | xxxxx@example.com | 20-29 | 48291034 |
| B.J. | xxxxx@example.com | 40-49 | (null) |

🧩 Examples by Method

Below are minimal examples of how each anonymization method works.

All examples assume:

import polars as pl
from cloakdata import anonymize

🔒 Masking

Full mask

df = pl.DataFrame({"ssn": ["123-45-6789", "987-65-4321"]})
config = {"columns": {"ssn": {"method": "full_mask"}}}
print(anonymize(df, config))

Mask email

df = pl.DataFrame({"email": ["john@example.com", "invalid"]})
config = {"columns": {"email": {"method": "mask_email"}}}
print(anonymize(df, config))

Mask number

df = pl.DataFrame({"phone": ["123456789", "987654321"]})
config = {"columns": {"phone": {"method": "mask_number"}}}
print(anonymize(df, config))

Mask partial

df = pl.DataFrame({"code": ["abcdef", "12345"]})
config = {"columns": {"code": {"method": "mask_partial", "params": {"visible_start": 2, "visible_end": 2}}}}
print(anonymize(df, config))
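
For intuition, the string logic behind two of these masks can be sketched in plain Python. The results follow the Supported Methods table (abcdef → ab**ef, john@example.com → xxxxx@example.com). The function bodies are illustrative, not CloakData's implementation, and the handling of too-short strings and of values without "@" is an assumption.

```python
def mask_partial(value, visible_start, visible_end, mask_char="*"):
    """Keep the first and last characters visible and mask the middle."""
    hidden = len(value) - visible_start - visible_end
    if hidden <= 0:
        # Assumption: nothing left to mask, so the value is returned as-is.
        return value
    tail = value[len(value) - visible_end:]
    return value[:visible_start] + mask_char * hidden + tail


def mask_email(value, mask="xxxxx"):
    """Replace the local part of an email address and keep the domain."""
    _local, separator, domain = value.partition("@")
    if not separator:
        # Assumption: values without "@" are left unchanged in this sketch.
        return value
    return mask + "@" + domain


print(mask_partial("abcdef", 2, 2))
print(mask_email("john@example.com"))
```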

🔄 Replacement

Static value

df = pl.DataFrame({"city": ["NY", "LA"]})
config = {"columns": {"city": {"method": "replace_with_value", "params": {"value": "Unknown"}}}}
print(anonymize(df, config))

Exact mapping

df = pl.DataFrame({"status": ["active", "inactive"]})
config = {"columns": {"status": {"method": "replace_exact", "params": {"mapping": {"active": "A", "inactive": "I"}}}}}
print(anonymize(df, config))

Substring mapping

df = pl.DataFrame({"text": ["error: 404", "ok"]})
config = {"columns": {"text": {"method": "replace_by_contains", "params": {"mapping": {"error": "ERR"}}}}}
print(anonymize(df, config))
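
The difference between exact and substring mapping can be sketched in plain Python. Both helpers are illustrative only. Leaving unmatched values unchanged is an assumption, since the README only shows matching inputs.

```python
def replace_exact(value, mapping):
    """Replace the value only when it equals one of the mapping keys."""
    return mapping.get(value, value)


def replace_by_contains(value, mapping):
    """Replace the whole value when it contains one of the mapping keys."""
    for needle, replacement in mapping.items():
        if needle in value:
            return replacement
    # Assumption: values with no matching substring are returned unchanged.
    return value


print(replace_exact("active", {"active": "A", "inactive": "I"}))
print(replace_by_contains("error: 404", {"error": "ERR"}))
```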

🔢 Sequential IDs

df = pl.DataFrame({"user": ["Alice", "Bob", "Charlie"]})
config = {"columns": {
    "user": {"method": "sequential_numeric", "params": {"prefix": "U"}}
}}
print(anonymize(df, config))
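
A plain-Python sketch of sequential pseudonyms follows. It assumes that IDs are assigned in order of first appearance and that repeated values reuse their ID, which this README does not confirm. The "U 1" spacing follows the Supported Methods table, and the `sequential_numeric` function here is a stand-in, not the library's code.

```python
def sequential_numeric(values, prefix):
    """Map each distinct value to a prefixed counter, in order of appearance."""
    assigned = {}
    result = []
    for value in values:
        if value not in assigned:
            # Assumption: the counter starts at 1 and repeated values reuse their ID.
            assigned[value] = f"{prefix} {len(assigned) + 1}"
        result.append(assigned[value])
    return result


print(sequential_numeric(["Alice", "Bob", "Charlie"], "U"))
```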

✂️ Truncation & Initials

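# Config keys are assumed to name existing columns (as in the other examples),
# so the DataFrame provides one column per rule.
df = pl.DataFrame({
    "short": ["Alice Smith", "Bob Jones"],
    "initials": ["Alice Smith", "Bob Jones"]
})
config = {"columns": {
    "short": {"method": "truncate", "params": {"length": 3}},
    "initials": {"method": "initials_only"}
}}
print(anonymize(df, config))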

📊 Generalization

df = pl.DataFrame({"age": [25, 42], "date": ["2025-07-20", "2025-01-15"], "salary": [2300, 12500]})
config = {"columns": {
    "age": {"method": "generalize_age"},
    "date": {"method": "generalize_date", "params": {"mode": "year"}},
    "salary": {"method": "generalize_number_range", "params": {"interval": 5000}}
}}
print(anonymize(df, config))
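
The bucketing arithmetic behind range generalization can be sketched as follows for non-negative integers. The label format "lower-upper" follows the 20-29 example in the Supported Methods table. The function is illustrative, and the labels it yields for other intervals are what this sketch produces, not confirmed library output.

```python
def generalize_number_range(value, interval):
    """Return the 'lower-upper' bucket of width `interval` containing value."""
    lower = (value // interval) * interval
    return f"{lower}-{lower + interval - 1}"


def generalize_age(age):
    """Ages are grouped into 10-year ranges, as described in the table."""
    return generalize_number_range(age, 10)


print(generalize_age(25))
print(generalize_number_range(12500, 5000))
```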

🎲 Randomization

df = pl.DataFrame({
    "state": ["SP", "RJ", "MG"],
    "cpf": ["11111", "22222", "33333"],
    "col": ["A", "B", "C"]
})

config = {"columns": {
    "state": {"method": "random_choice", "params": {"choices": ["AA", "BB"], "seed": 42}},
    "cpf": {"method": "replace_with_random_digits", "params": {"digits": 5}},
    "col": {"method": "shuffle", "params": {"seed": 42}}
}}

print(anonymize(df, config))
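
The role of the seed parameter can be illustrated with Python's standard random module: a generator created from a fixed seed produces the same sequence on every run. The sketch uses stand-in functions and says nothing about which random generator CloakData uses internally.

```python
import random


def random_choices(values, choices, seed=None):
    """Replace every value with a random pick from `choices`."""
    rng = random.Random(seed)
    return [rng.choice(choices) for _ in values]


def random_digits(values, digits, seed=None):
    """Replace every value with a random string of `digits` digits."""
    rng = random.Random(seed)
    return ["".join(rng.choice("0123456789") for _ in range(digits)) for _ in values]


def shuffle_values(values, seed=None):
    """Return the same values in a random order."""
    rng = random.Random(seed)
    shuffled = list(values)
    rng.shuffle(shuffled)
    return shuffled


# The same seed yields the same result on every call.
first = random_choices(["SP", "RJ", "MG"], ["AA", "BB"], seed=42)
second = random_choices(["SP", "RJ", "MG"], ["AA", "BB"], seed=42)
print(first == second)
```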

📅 Dates

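# Config keys are assumed to name existing columns (as in the other examples),
# so the DataFrame provides one date column per rule.
df = pl.DataFrame({
    "offset": ["2025-07-29", "2025-07-30"],
    "rounded": ["2025-07-29", "2025-07-30"]
})
config = {"columns": {
    "offset": {"method": "date_offset", "params": {"min_days": -2, "max_days": 2, "seed": 42}},
    "rounded": {"method": "round_date", "params": {"mode": "month"}}
}}
print(anonymize(df, config))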
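
Both date operations can be sketched with the standard datetime module, assuming ISO-formatted date strings as in the example. The rounding result matches the Supported Methods table (2025-07-29 → 2025-07-01). Drawing an independent offset for every row is an assumption, and the functions are illustrative rather than the library's code.

```python
import random
from datetime import date, timedelta


def round_date(value, mode):
    """Round an ISO date string down to the start of its month or year."""
    parsed = date.fromisoformat(value)
    if mode == "month":
        return parsed.replace(day=1).isoformat()
    if mode == "year":
        return parsed.replace(month=1, day=1).isoformat()
    raise ValueError(f"unsupported mode: {mode!r}")


def offset_dates(values, min_days, max_days, seed=None):
    """Shift each date by a random number of days in [min_days, max_days]."""
    rng = random.Random(seed)
    shifted = []
    for value in values:
        # Assumption: each row gets its own offset from the seeded generator.
        delta = timedelta(days=rng.randint(min_days, max_days))
        shifted.append((date.fromisoformat(value) + delta).isoformat())
    return shifted


print(round_date("2025-07-29", "month"))
print(offset_dates(["2025-07-29", "2025-07-30"], -2, 2, seed=42))
```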

🧩 Utilities

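# "rounded" is assumed to name the numeric column being rounded, so the
# DataFrame column is named accordingly; "coalesced" takes its inputs from params.
df = pl.DataFrame({"a": [None, "X"], "b": ["Y", None], "rounded": [3.14159, 2.71828]})
config = {"columns": {
    "coalesced": {"method": "coalesce_cols", "params": {"columns": ["a", "b"]}},
    "rounded": {"method": "round_number", "params": {"digits": 2}}
}}
print(anonymize(df, config))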
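
Row by row, these two utilities amount to "first non-null value" and ordinary rounding. The plain-Python sketch below applies them to the values from the example above. The `coalesce` helper is a stand-in for the per-row behavior, not CloakData's code.

```python
def coalesce(*values):
    """Return the first value that is not None, or None if all are None."""
    return next((value for value in values if value is not None), None)


a = [None, "X"]
b = ["Y", None]
n = [3.14159, 2.71828]

# Combine columns a and b row by row, then round n to two decimals.
coalesced = [coalesce(left, right) for left, right in zip(a, b)]
rounded = [round(value, 2) for value in n]
print(coalesced, rounded)
```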

📊 Supported Methods

| Method | Description | Example Input → Output |
| --- | --- | --- |
| full_mask | Replace all values with ***** | 12345 → ***** |
| mask_email | Hide local part of email, keep domain | john@example.com → xxxxx@example.com |
| mask_number | Keep first 3 chars, mask rest | 123456789 → 123***** |
| mask_partial | Show start & end, mask middle | abcdef → ab**ef |
| replace_with_value | Replace with a static value | NY → Unknown |
| replace_exact | Replace exact matches by mapping | active → A |
| replace_by_contains | Replace if substring exists | error: 404 → ERR |
| sequential_numeric | Sequential numeric pseudonyms | Alice, Bob → U 1, U 2 |
| sequential_alpha | Sequential alphabetic pseudonyms | Alice, Bob → U A, U B |
| truncate | Truncate strings to fixed length | Alexander → Alex |
| initials_only | Convert names to initials | John Doe → J.D. |
| generalize_age | Group ages in 10y ranges | 25 → 20-29 |
| generalize_date | Reduce granularity (year or month_year) | 2025-07-20 → 2025 |
| generalize_number_range | Bucketize numbers by interval | 23 → 20-29 |
| random_choice | Randomly pick value from list | SP → AA or BB |
| replace_with_random_digits | Random digits with fixed length | 11111 → 80239 |
| shuffle | Shuffle column values | [A,B,C] → [B,C,A] |
| date_offset | Random offset within day range | 2025-07-20 → 2025-07-18 |
| coalesce_cols | Take first non-null from multiple cols | (None, Y) → Y |
| round_number | Round numeric values to fixed decimals | 3.14159 → 3.14 |
| round_date | Round date down to month or year start | 2025-07-29 → 2025-07-01 |

📂 Project Structure

src/
 └── cloakdata/           # Core library
tests/                    # Test suite (pytest + Polars)
examples/                 # Sample CSVs & configs
README.md                 # Project docs
pyproject.toml            # Build system (uv/hatch)

⚡ Installation

pip install cloakdata

Or with uv:

uv add cloakdata

🚀 Quickstart

import polars as pl
from cloakdata import anonymize

df = pl.DataFrame({
    "name": ["Alice Smith", "Bob Jones"],
    "email": ["alice@example.com", "bob@example.com"],
    "age": [25, 42]
})

config = {
    "columns": {
        "name": { "method": "initials_only" },
        "email": { "method": "mask_email" },
        "age": { "method": "generalize_age" }
    }
}

out = anonymize(df, config)
print(out)

🛠️ Development

git clone https://github.com/youruser/cloakdata
cd cloakdata
uv sync
pre-commit install
pytest -v

🔮 Roadmap

  • Regex-based redaction
  • Hashing strategies (SHA256, bcrypt)
  • Parallel processing for large datasets

🤝 Contributing

We love contributions! See CONTRIBUTING.md for setup, coding standards, how to add a new anonymization method, tests and the PR checklist.

📄 Notice

See NOTICE for attribution details.

📜 License

MIT © Jeferson Peter
