
Fast data anonymization with Polars


CloakData: Data Anonymizer


A flexible and extensible data anonymization library built on Polars. Designed for privacy, compliance, and testing with minimal overhead.


🧾 What’s New (1.1.0)

  • ✅ Added Conditional Rules with multi-rule support per column.
  • ✅ Added nested conditions: all, any, not.
  • ✅ Logical operators supported: and, or.
  • ✅ Extended test coverage for conditions.
  • 🧹 Internal refactors & style improvements (ruff).

✨ Features

  • 🔒 Masking: full, partial, emails, phone numbers.
  • 🔄 Replacement: static values, dictionaries, substrings.
  • 🔢 Sequential IDs: numeric or alphabetical.
  • ✂️ Truncation & initials extraction.
  • 📊 Generalization: ages into ranges, dates into month/year.
  • 🎲 Randomization: choices, digits, shuffling.
  • 📅 Date offsetting with reproducible seeds.
  • 🧩 Conditional rules: multi-rules, nested (all/any/not), logical groups (and/or).
  • ⚡ Built on Polars → fast & scalable.

⚙️ How it works

  1. Load your dataset into a Polars DataFrame.
  2. Define anonymization rules in a simple JSON config.
  3. Call anonymize(df, config) → get a safe anonymized DataFrame.
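
The three steps above can be sketched as follows. Only anonymize(df, config) and the "columns" key come from this README; the inline JSON string, the load_config helper, and the file name in the closing comment are hypothetical. The final call is left as a comment so the snippet runs without Polars or cloakdata installed.

```python
import json

# Hypothetical inline config; in practice this would usually live in a .json file.
CONFIG_JSON = """
{
  "columns": {
    "name":  { "method": "initials_only" },
    "email": { "method": "mask_email" },
    "cpf":   { "method": "replace_with_random_digits", "params": { "digits": 11 } }
  }
}
"""


def load_config(text: str) -> dict:
    """Parse a JSON config and check the shape used in this README (hypothetical helper)."""
    config = json.loads(text)
    if not isinstance(config.get("columns"), dict):
        raise ValueError('config must contain a "columns" object')
    for column, rules in config["columns"].items():
        # A column maps either to one rule object or to a list of rule objects.
        for rule in rules if isinstance(rules, list) else [rules]:
            if "method" not in rule:
                raise ValueError(f'rule for column "{column}" has no "method"')
    return config


config = load_config(CONFIG_JSON)
for column, rule in config["columns"].items():
    print(f"{column}: {rule['method']}")

# With Polars and cloakdata installed, steps 1 and 3 would then look like:
#   df = pl.read_csv("data.csv")   # hypothetical file name
#   out = anonymize(df, config)
```

Validating the config before calling the library is a design choice of this sketch, not something the library is known to require.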

🧪 Example Config

{
  "columns": {
    "name": { "method": "initials_only" },
    "email": { "method": "mask_email" },
    "phone": { "method": "mask_number" },
    "cpf": {
      "method": "replace_with_random_digits",
      "params": { "digits": 11 }
    },
    "status": {
      "method": "replace_exact",
      "params": { "mapping": { "active": "A", "inactive": "I" } }
    },
    "id_seq": { "method": "sequential_numeric", "params": { "prefix": "ID" } },
    "ref_code": { "method": "sequential_alpha", "params": { "prefix": "REF" } },
    "comments": { "method": "truncate", "params": { "length": 5 } },
    "age": { "method": "generalize_age" },
    "birth_date": { "method": "generalize_date", "params": { "mode": "month_year" } },
    "state": { "method": "random_choice", "params": { "choices": ["SP","RJ","MG","BA"] } },
    "last_access": { "method": "date_offset", "params": { "min_days": -2, "max_days": 2 } },
    "feedback": { "method": "shuffle" }
  }
}

🧠 Conditional Rules

Apply transformations only when conditions are met.

Single condition

"cpf": {
  "method": "replace_with_random_digits",
  "params": { "digits": 11 },
  "condition": {
    "column": "status",
    "operator": "equals",
    "value": "active"
  }
}

Multiple rules per column

"city": [
  { "method": "replace_with_value", "params": { "value": "X" } },
  {
    "method": "mask_partial",
    "params": { "visible_start": 1, "visible_end": 1 },
    "condition": { "column": "country", "operator": "equals", "value": "BR" }
  }
]

Nested conditions

"age": {
  "method": "generalize_age",
  "condition": {
    "all": [
      { "column": "country", "operator": "equals", "value": "BR" },
      { "any": [
          { "column": "status", "operator": "equals", "value": "active" },
          { "column": "status", "operator": "equals", "value": "archived" }
        ]
      }
    ]
  }
}

  • Operators supported: equals, not_equals, in, not_in, gt, gte, lt, lte, contains, not_contains
  • Groups: all, any, not
  • Logical: and, or
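
To make the semantics of the operators and groups concrete, here is a minimal pure-Python sketch that evaluates a condition against a single row held in a dict. It is an illustration only, not CloakData's implementation: the matches function and the OPERATORS table are hypothetical, "not" is assumed to wrap a single condition, and "contains" is assumed to be a substring test. The and/or logical form is left out because this section does not show its syntax.

```python
import operator

# Hypothetical operator table mirroring the operator names listed in this section.
OPERATORS = {
    "equals": operator.eq,
    "not_equals": operator.ne,
    "in": lambda cell, value: cell in value,
    "not_in": lambda cell, value: cell not in value,
    "gt": operator.gt,
    "gte": operator.ge,
    "lt": operator.lt,
    "lte": operator.le,
    "contains": lambda cell, value: value in cell,
    "not_contains": lambda cell, value: value not in cell,
}


def matches(condition: dict, row: dict) -> bool:
    """Return True if `row` satisfies `condition` (a leaf or an all/any/not group)."""
    if "all" in condition:
        return all(matches(child, row) for child in condition["all"])
    if "any" in condition:
        return any(matches(child, row) for child in condition["any"])
    if "not" in condition:
        # Assumption: "not" wraps exactly one condition.
        return not matches(condition["not"], row)
    compare = OPERATORS[condition["operator"]]
    return compare(row[condition["column"]], condition["value"])


# The nested condition from the "Nested conditions" example.
condition = {
    "all": [
        {"column": "country", "operator": "equals", "value": "BR"},
        {"any": [
            {"column": "status", "operator": "equals", "value": "active"},
            {"column": "status", "operator": "equals", "value": "archived"},
        ]},
    ]
}

print(matches(condition, {"country": "BR", "status": "archived"}))  # True
print(matches(condition, {"country": "BR", "status": "inactive"}))  # False
print(matches(condition, {"country": "US", "status": "active"}))    # False
```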


Example Input → Output

Input DataFrame:

| name | email | age | status | cpf |
| --- | --- | --- | --- | --- |
| Alice Smith | alice@example.com | 25 | active | 12345678 |
| Bob Jones | bob@example.com | 42 | inactive | (null) |

Config:

{
  "columns": {
    "name": { "method": "initials_only" },
    "email": { "method": "mask_email" },
    "age": { "method": "generalize_age" },
    "cpf": {
      "method": "replace_with_random_digits",
      "params": { "digits": 8 },
      "condition": {
        "column": "status",
        "operator": "equals",
        "value": "active"
      }
    }
  }
}

Output DataFrame:

| name | email | age | status | cpf |
| --- | --- | --- | --- | --- |
| A.S. | xxxxx@example.com | 20-29 | active | 48291034 |
| B.J. | xxxxx@example.com | 40-49 | inactive | (null) |

🧩 Examples by Method

Below are minimal examples of how each anonymization method works.

All examples assume:

import polars as pl
from cloakdata import anonymize

🔒 Masking

Full mask

df = pl.DataFrame({"ssn": ["123-45-6789", "987-65-4321"]})
config = {"columns": {"ssn": {"method": "full_mask"}}}
print(anonymize(df, config))

Mask email

df = pl.DataFrame({"email": ["john@example.com", "invalid"]})
config = {"columns": {"email": {"method": "mask_email"}}}
print(anonymize(df, config))

Mask number

df = pl.DataFrame({"phone": ["123456789", "987654321"]})
config = {"columns": {"phone": {"method": "mask_number"}}}
print(anonymize(df, config))

Mask partial

df = pl.DataFrame({"code": ["abcdef", "12345"]})
config = {"columns": {"code": {"method": "mask_partial", "params": {"visible_start": 2, "visible_end": 2}}}}
print(anonymize(df, config))
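
As a rough guide to what the masking examples should return, the sketch below reproduces two rows of the Supported Methods table (abcdef → ab**ef and john@example.com → xxxxx@example.com) in plain Python. The functions are hypothetical stand-ins, not the library's code. In particular, returning too-short values and non-email strings unchanged is an assumption of this sketch, and the library may treat those cases differently.

```python
def mask_partial(value: str, visible_start: int, visible_end: int, mask_char: str = "*") -> str:
    """Keep the first and last characters visible and mask everything in between."""
    hidden = len(value) - visible_start - visible_end
    if hidden <= 0:
        # Assumption: values too short to have a hidden middle are returned unchanged.
        return value
    tail = value[len(value) - visible_end:]
    return value[:visible_start] + mask_char * hidden + tail


def mask_email(value: str, placeholder: str = "xxxxx") -> str:
    """Replace the local part of an address with a placeholder, keeping the domain."""
    local, sep, domain = value.partition("@")
    if not sep or not local or not domain:
        # Assumption: strings that do not look like addresses are returned unchanged.
        return value
    return f"{placeholder}@{domain}"


print(mask_partial("abcdef", 2, 2))    # ab**ef
print(mask_partial("12345", 2, 2))     # 12*45
print(mask_email("john@example.com"))  # xxxxx@example.com
```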

🔄 Replacement

Static value

df = pl.DataFrame({"city": ["NY", "LA"]})
config = {"columns": {"city": {"method": "replace_with_value", "params": {"value": "Unknown"}}}}
print(anonymize(df, config))

Exact mapping

df = pl.DataFrame({"status": ["active", "inactive"]})
config = {"columns": {"status": {"method": "replace_exact", "params": {"mapping": {"active": "A", "inactive": "I"}}}}}
print(anonymize(df, config))

Substring mapping

df = pl.DataFrame({"text": ["error: 404", "ok"]})
config = {"columns": {"text": {"method": "replace_by_contains", "params": {"mapping": {"error": "ERR"}}}}}
print(anonymize(df, config))
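
The difference between the two mapping methods is easy to miss: replace_exact matches the whole value, while replace_by_contains replaces the whole value as soon as a key occurs inside it (error: 404 → ERR in the Supported Methods table). A minimal plain-Python sketch of that behaviour follows. The function bodies are hypothetical; leaving unmatched values unchanged and using the first matching key are assumptions.

```python
def replace_exact(value: str, mapping: dict) -> str:
    """Replace the value only when it is exactly equal to a mapping key."""
    return mapping.get(value, value)


def replace_by_contains(value: str, mapping: dict) -> str:
    """Replace the whole value with the entry of the first key found as a substring."""
    for needle, replacement in mapping.items():
        if needle in value:
            return replacement
    # Assumption: values that contain no key are left as they are.
    return value


status_map = {"active": "A", "inactive": "I"}
print([replace_exact(v, status_map) for v in ["active", "inactive", "pending"]])
# ['A', 'I', 'pending']

print([replace_by_contains(v, {"error": "ERR"}) for v in ["error: 404", "ok"]])
# ['ERR', 'ok']
```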

🔢 Sequential IDs

df = pl.DataFrame({"user": ["Alice", "Bob", "Charlie"]})
config = {"columns": {
    "user": {"method": "sequential_numeric", "params": {"prefix": "U"}}
}}
print(anonymize(df, config))

✂️ Truncation & Initials

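# Config keys are column names, so the DataFrame needs "short" and "initials" columns.
df = pl.DataFrame({
    "short": ["Alice Smith", "Bob Jones"],
    "initials": ["Alice Smith", "Bob Jones"]
})
config = {"columns": {
    "short": {"method": "truncate", "params": {"length": 3}},
    "initials": {"method": "initials_only"}
}}
print(anonymize(df, config))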
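
For reference, the two transformations can be written in a few lines of plain Python. This sketch follows the rows Alexander → Alex and John Doe → J.D. from the Supported Methods table. The functions are hypothetical illustrations rather than the library's implementation, and splitting names on whitespace is an assumption.

```python
def truncate(value: str, length: int) -> str:
    """Keep only the first `length` characters."""
    return value[:length]


def initials_only(name: str) -> str:
    """Turn each whitespace-separated part of a name into its first letter plus a dot."""
    return "".join(f"{part[0]}." for part in name.split())


print(truncate("Alexander", 4))      # Alex
print(initials_only("John Doe"))     # J.D.
print(initials_only("Alice Smith"))  # A.S.
```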

📊 Generalization

df = pl.DataFrame({"age": [25, 42], "date": ["2025-07-20", "2025-01-15"], "salary": [2300, 12500]})
config = {"columns": {
    "age": {"method": "generalize_age"},
    "date": {"method": "generalize_date", "params": {"mode": "year"}},
    "salary": {"method": "generalize_number_range", "params": {"interval": 5000}}
}}
print(anonymize(df, config))
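
The bucketing arithmetic behind generalize_age and generalize_number_range can be sketched as below. The label format with an inclusive upper bound is inferred from the 25 → 20-29 row in the Supported Methods table. The functions themselves are hypothetical, handle non-negative integers only, and may differ from the library's output for other inputs.

```python
def generalize_number_range(value: int, interval: int) -> str:
    """Map a non-negative integer to the label of the bucket that contains it."""
    lower = (value // interval) * interval
    upper = lower + interval - 1  # inclusive upper bound, as in "20-29"
    return f"{lower}-{upper}"


def generalize_age(age: int) -> str:
    """Ages are grouped into 10-year ranges."""
    return generalize_number_range(age, 10)


print([generalize_age(a) for a in [25, 42]])
# ['20-29', '40-49']

print([generalize_number_range(s, 5000) for s in [2300, 12500]])
# ['0-4999', '10000-14999']
```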

🎲 Randomization

df = pl.DataFrame({
    "state": ["SP", "RJ", "MG"],
    "cpf": ["11111", "22222", "33333"],
    "col": ["A", "B", "C"]
})

config = {"columns": {
    "state": {"method": "random_choice", "params": {"choices": ["AA", "BB"], "seed": 42}},
    "cpf": {"method": "replace_with_random_digits", "params": {"digits": 5}},
    "col": {"method": "shuffle", "params": {"seed": 42}}
}}

print(anonymize(df, config))
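
The role of the seed parameter can be illustrated with Python's standard random module: a generator created from the same seed produces the same sequence, so repeated runs give identical output. This is a sketch of the idea only. The helper names are hypothetical, and the concrete values will not match what the library produces for the same seed.

```python
import random
from typing import Optional


def shuffle_column(values: list, seed: Optional[int] = None) -> list:
    """Return a shuffled copy; the same seed always yields the same order."""
    rng = random.Random(seed)
    shuffled = list(values)
    rng.shuffle(shuffled)
    return shuffled


def random_choice_column(values: list, choices: list, seed: Optional[int] = None) -> list:
    """Replace every value with a random pick from `choices`."""
    rng = random.Random(seed)
    return [rng.choice(choices) for _ in values]


def random_digits_column(values: list, digits: int, seed: Optional[int] = None) -> list:
    """Replace every value with a string of `digits` random decimal digits."""
    rng = random.Random(seed)
    return ["".join(rng.choice("0123456789") for _ in range(digits)) for _ in values]


first = shuffle_column(["A", "B", "C"], seed=42)
second = shuffle_column(["A", "B", "C"], seed=42)
print(first == second)  # True: same seed, same order

print(random_choice_column(["SP", "RJ", "MG"], ["AA", "BB"], seed=42))
print(random_digits_column(["11111", "22222", "33333"], 5, seed=42))
```

Digits are generated as strings so that leading zeros survive, which matters for identifiers such as the cpf column in the examples.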

📅 Dates

# One column per rule; config keys must match the DataFrame's column names.
df = pl.DataFrame({
    "offset": ["2025-07-29", "2025-07-30"],
    "rounded": ["2025-07-29", "2025-07-30"]
})
config = {"columns": {
    "offset": {"method": "date_offset", "params": {"min_days": -2, "max_days": 2, "seed": 42}},
    "rounded": {"method": "round_date", "params": {"mode": "month"}}
}}
print(anonymize(df, config))
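
A plain-Python sketch of the two date operations, using datetime.date values: date_offset shifts each date by a random number of days in [min_days, max_days], and round_date snaps a date to the start of its month or year (2025-07-29 → 2025-07-01 in the Supported Methods table). The functions are hypothetical, and drawing an independent offset per row is an assumption about the behaviour.

```python
import random
from datetime import date, timedelta
from typing import Optional


def date_offset(values: list, min_days: int, max_days: int, seed: Optional[int] = None) -> list:
    """Shift each date by a random whole number of days within [min_days, max_days]."""
    rng = random.Random(seed)
    # Assumption: every row gets its own independently drawn offset.
    return [d + timedelta(days=rng.randint(min_days, max_days)) for d in values]


def round_date(d: date, mode: str) -> date:
    """Round a date down to the first day of its month or year."""
    if mode == "month":
        return d.replace(day=1)
    if mode == "year":
        return d.replace(month=1, day=1)
    raise ValueError(f"unsupported mode: {mode!r}")


dates = [date(2025, 7, 29), date(2025, 7, 30)]
shifted = date_offset(dates, min_days=-2, max_days=2, seed=42)
print(shifted == date_offset(dates, min_days=-2, max_days=2, seed=42))  # True

print(round_date(date(2025, 7, 29), "month"))  # 2025-07-01
print(round_date(date(2025, 7, 29), "year"))   # 2025-01-01
```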

🧩 Utilities

df = pl.DataFrame({"a": [None, "X"], "b": ["Y", None], "n": [3.14159, 2.71828]})
config = {"columns": {
    "coalesced": {"method": "coalesce_cols", "params": {"columns": ["a", "b"]}},
    # round_number is applied to the existing numeric column "n".
    "n": {"method": "round_number", "params": {"digits": 2}}
}}
print(anonymize(df, config))

📊 Supported Methods

| Method | Description | Example Input → Output |
| --- | --- | --- |
| full_mask | Replace all values with ***** | 12345 → ***** |
| mask_email | Hide local part of email, keep domain | john@example.com → xxxxx@example.com |
| mask_number | Keep first 3 chars, mask rest | 123456789 → 123***** |
| mask_partial | Show start & end, mask middle | abcdef → ab**ef |
| replace_with_value | Replace with a static value | NY → Unknown |
| replace_exact | Replace exact matches by mapping | active → A |
| replace_by_contains | Replace if substring exists | error: 404 → ERR |
| sequential_numeric | Sequential numeric pseudonyms | Alice, Bob → U 1, U 2 |
| sequential_alpha | Sequential alphabetic pseudonyms | Alice, Bob → U A, U B |
| truncate | Truncate strings to fixed length | Alexander → Alex |
| initials_only | Convert names to initials | John Doe → J.D. |
| generalize_age | Group ages in 10y ranges | 25 → 20-29 |
| generalize_date | Reduce granularity (year or month_year) | 2025-07-20 → 2025 |
| generalize_number_range | Bucketize numbers by interval | 23 → 20-29 |
| random_choice | Randomly pick value from list | SP → AA or BB |
| replace_with_random_digits | Random digits with fixed length | 11111 → 80239 |
| shuffle | Shuffle column values | [A,B,C] → [B,C,A] |
| date_offset | Random offset within day range | 2025-07-20 → 2025-07-18 |
| coalesce_cols | Take first non-null from multiple cols | (None, Y) → Y |
| round_number | Round numeric values to fixed decimals | 3.14159 → 3.14 |
| round_date | Round date down to month or year start | 2025-07-29 → 2025-07-01 |

📂 Project Structure

src/
 └── cloakdata/           # Core library
tests/                    # Test suite (pytest + Polars)
examples/                 # Sample CSVs & configs
README.md                 # Project docs
pyproject.toml            # Build system (uv/hatch)

⚡ Installation

pip install cloakdata

Or with uv:

uv add cloakdata

🚀 Quickstart

import polars as pl
from cloakdata import anonymize

df = pl.DataFrame({
    "name": ["Alice Smith", "Bob Jones"],
    "email": ["alice@example.com", "bob@example.com"],
    "age": [25, 42]
})

config = {
    "columns": {
        "name": { "method": "initials_only" },
        "email": { "method": "mask_email" },
        "age": { "method": "generalize_age" }
    }
}

out = anonymize(df, config)
print(out)

🛠️ Development

git clone https://github.com/youruser/cloakdata
cd cloakdata
uv sync
pre-commit install
pytest -v

🔮 Roadmap

  • Regex-based redaction
  • Hashing strategies (SHA256, bcrypt)
  • Parallel processing for large datasets

🤝 Contributing

We love contributions! See CONTRIBUTING.md for setup, coding standards, how to add a new anonymization method, tests and the PR checklist.

📄 Notice

See NOTICE for attribution details.

📜 License

MIT © Jeferson Peter
