Privacy-by-design for tabular data: LGPD compliance in Python.

datalock

datalock is a Python library for privacy-by-design work with tabular data. It supports LGPD/GDPR compliance through automatic PII detection and masking, AES-256-GCM encrypted storage (the .dlk format), expressive data manipulation on top of Polars, and transparent canary data for leak tracing.

pip install datalock

import datalock as dd
import os

SALT = os.environ["DATALOCK_SALT"]
KEY  = os.environ["DATALOCK_KEY"]

df       = dd.read("clientes.csv")              # any format → pl.DataFrame
df_safe  = dd.mask(df, salt=SALT)               # detect + mask PII (LGPD)
dd.store(df_safe, "clientes.dlk", key=KEY)      # AES-256-GCM encrypted
df_back  = dd.read("clientes.dlk", key=KEY)     # decrypt and read back

datalock was renamed from logus-lgpd. The old import, import logus as lg, still works as an alias.


What datalock does

| Capability | Function |
|---|---|
| Read any tabular format | dd.read() |
| Detect PII automatically | dd.scan() |
| Mask PII (HMAC-SHA256) | dd.mask() |
| Save with AES-256-GCM | dd.store() |
| Expressive manipulation | dd.where(), dd.groupby(), dd.add_column() |
| Full pipeline in one call | dd.process() |
| Data quality validation | dd.validate() |
| Database with masking | dd.db() |
| Directory PII inventory | dd.scan_directory() |
| Canary leak tracing | dd.store(..., canary=True) |
| Masked text (free-form) | dd.mask_text(..., strategy="semantic") |
| Data contracts | dd.contract() |
| Privacy metrics | dd.check.kanon(), dd.check.risk() |
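
The privacy metrics entry refers to k-anonymity, which is not explained elsewhere in this README. As a rough illustration of the idea only (not of how dd.check.kanon() is implemented or called), k is the size of the smallest group of rows that share the same quasi-identifier values. The function name, column names, and sample rows below are made up for the example.

```python
from collections import Counter


def k_anonymity(rows, quasi_identifiers):
    """Return k: the size of the smallest group of rows that share
    the same values for all quasi-identifier columns."""
    if not rows:
        return 0
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(groups.values())


# Illustrative data only
rows = [
    {"uf": "SP", "faixa_etaria": "30-39", "renda": 7200},
    {"uf": "SP", "faixa_etaria": "30-39", "renda": 5100},
    {"uf": "RJ", "faixa_etaria": "40-49", "renda": 9800},
    {"uf": "RJ", "faixa_etaria": "40-49", "renda": 4300},
    {"uf": "RJ", "faixa_etaria": "40-49", "renda": 6100},
]

k = k_anonymity(rows, ["uf", "faixa_etaria"])
print(k)  # → 2
```

A dataset with a low k (especially k = 1) contains rows that can be singled out by their quasi-identifiers alone, even after direct identifiers are masked.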

Installation

pip install datalock                    # core
pip install "datalock[sql]"             # + SQL via DuckDB
pip install "datalock[excel]"           # + Excel (.xlsx)
pip install "datalock[synthetic]"       # + Faker for richer synthetic data
pip install "datalock[full]"            # everything

Requires: Python ≥ 3.10, Polars ≥ 1.0, pandas ≥ 2.0, pyarrow ≥ 14.0


Quick Start

import datalock as dd
import os

SALT = os.environ["DATALOCK_SALT"]
KEY  = os.environ["DATALOCK_KEY"]

# Backward compatibility: the old module name still imports
import logus as lg

Read any file format

df = dd.read("clientes.csv")
df = dd.read("clientes.parquet")
df = dd.read("clientes.dlk", key=KEY)

# Big data — no OOM
df   = dd.read("big.parquet", head=100_000)
df   = dd.read("big.parquet", sample=500_000)
info = dd.read("big.parquet", header_only=True)
df   = dd.read("big.parquet", n_chunks=5, chunks=[2, 4])
for chunk in dd.read("big.parquet", n_chunks=10, iter_chunks=True):
    process(chunk)

Detect and mask PII

reports = dd.scan(df)
df_safe = dd.mask(df, salt=SALT)
df_safe = dd.mask(df.lazy(), salt=SALT)    # LazyFrame stays lazy

# Custom PII patterns (company-specific identifiers)
reports = dd.scan(df, custom_patterns={
    "num_contrato": r"^CTR-[0-9]{8}$",
    "matricula":    r"^[0-9]{6}-[A-Z]$",
})
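
To make the two steps above more concrete, here is a minimal standard-library sketch of pattern-based column detection followed by HMAC-SHA256 masking. It is not datalock's implementation: the helper names, the 80% match threshold, and the 16-character truncation are assumptions chosen for the example.

```python
import hashlib
import hmac
import re

# Same patterns as in the dd.scan() example above
CUSTOM_PATTERNS = {
    "num_contrato": re.compile(r"^CTR-[0-9]{8}$"),
    "matricula": re.compile(r"^[0-9]{6}-[A-Z]$"),
}


def detect_column(values, patterns=CUSTOM_PATTERNS, threshold=0.8):
    """Return the first pattern name matched by at least `threshold`
    of the non-null values, or None (threshold is an assumption)."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        return None
    for name, rx in patterns.items():
        hits = sum(1 for v in non_null if rx.match(str(v)))
        if hits / len(non_null) >= threshold:
            return name
    return None


def hmac_mask(value, salt, length=16):
    """Deterministic keyed pseudonym: truncated hex HMAC-SHA256 of the value."""
    mac = hmac.new(salt.encode("utf-8"), str(value).encode("utf-8"), hashlib.sha256)
    return mac.hexdigest()[:length]


contracts = ["CTR-00012345", "CTR-99999999", None, "CTR-12345678"]
kind = detect_column(contracts)
masked = [hmac_mask(v, "demo-salt") if v is not None else None for v in contracts]
```

Because the HMAC is keyed by the salt, the same input always maps to the same pseudonym under one salt (so joins still work), while someone without the salt cannot rebuild the mapping by hashing guessed values.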

Save encrypted (.dlk)

dd.store(df, "clientes.dlk", key=KEY)
dd.store(df, "clientes.dlk", key=KEY, salt=SALT)
dd.store(df, "clientes.dlk", key=KEY, expires_at="2025-12-31")

# Asymmetric — share without sharing the key
priv, pub = dd.generate_keypair("ec")
dd.store(df, "clientes.dlk", public_key=pub)
df = dd.read("clientes.dlk", private_key=priv)

Canary data (transparent leak tracing)

# Inject canary rows silently — user never sees them
dd.store(df, "clientes.dlk", key=KEY, canary=True)
df_back = dd.read("clientes.dlk", key=KEY)
# df_back.shape == df.shape  — canary rows stripped automatically

# If "canary.1ba472d8@datalock.internal" appears in a breach dump:
dd.canary_check("canary.1ba472d8@datalock.internal")
# → {"pipeline_id": "crm_jan2025", "filepath": "clientes.dlk", ...}
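
The general canary technique can be sketched in a few lines. Everything below is hypothetical (the registry, the helper names, and the way the token is derived); it only illustrates the flow of registering a unique marker, injecting it on write, stripping it on read, and looking it up later.

```python
import hashlib
import hmac

# Hypothetical in-memory registry; a real system would persist this mapping.
CANARY_REGISTRY = {}


def make_canary_email(pipeline_id, filepath, secret):
    """Derive a unique canary address and remember where it was planted."""
    message = f"{pipeline_id}|{filepath}".encode("utf-8")
    token = hmac.new(secret, message, hashlib.sha256).hexdigest()[:8]
    email = f"canary.{token}@datalock.internal"
    CANARY_REGISTRY[email] = {"pipeline_id": pipeline_id, "filepath": filepath}
    return email


def inject_canary(rows, canary_email):
    """Append one synthetic row carrying the canary value."""
    return rows + [{"nome": "Canary Row", "email": canary_email}]


def strip_canaries(rows):
    """Drop rows whose email is a registered canary."""
    return [r for r in rows if r.get("email") not in CANARY_REGISTRY]


def lookup_canary(value):
    """Return the registration record for a leaked value, or None."""
    return CANARY_REGISTRY.get(value)


original = [{"nome": "Ana", "email": "ana@example.com"}]
email = make_canary_email("crm_jan2025", "clientes.dlk", b"demo-secret")
stored = inject_canary(original, email)
restored = strip_canaries(stored)
```

The canary value never belongs to a real person, so finding it outside the organisation indicates that the specific file it was planted in has leaked.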

Mask text (free-form strings)

text = "Cliente CPF 111.444.777-35, email joao@empresa.com"

dd.mask_text(text, salt=SALT, strategy="redact")
# → "Cliente [CPF], [EMAIL]"

dd.mask_text(text, salt=SALT, strategy="hash")
# → "Cliente 3f2a8b1c9d4e7f0a, 9e1d3c7f2a845b61"

dd.mask_text(text, salt=SALT, strategy="semantic")
# → "Cliente 478.622.984-97, roberto.santos@gmail.com"
# Real-looking fake data (CPF mathematically valid, no faker needed)
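
"Mathematically valid" refers to the two CPF check digits, which are computed from the first nine digits with a mod-11 weighted sum. The sketch below shows that standard check; the function names are illustrative and it says nothing about how datalock generates its replacement values.

```python
import re


def cpf_check_digits(base):
    """Compute the two CPF check digits for nine base digits."""
    digits = list(base)
    for _ in range(2):
        # Weights run from len(digits) + 1 down to 2
        start_weight = len(digits) + 1
        total = sum(d * w for d, w in zip(digits, range(start_weight, 1, -1)))
        remainder = total % 11
        digits.append(0 if remainder < 2 else 11 - remainder)
    return digits[-2], digits[-1]


def is_valid_cpf(text):
    """True if `text` holds 11 digits with correct check digits.
    Sequences of one repeated digit are rejected by convention."""
    digits = [int(c) for c in re.sub(r"\D", "", text)]
    if len(digits) != 11 or len(set(digits)) == 1:
        return False
    return cpf_check_digits(digits[:9]) == (digits[9], digits[10])


print(is_valid_cpf("111.444.777-35"))  # → True
print(is_valid_cpf("478.622.984-97"))  # → True
print(is_valid_cpf("111.444.777-36"))  # → False
```

Both CPFs used in the mask_text examples above pass this check, which is what lets the masked output flow through downstream validators unchanged.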

Scan a directory for PII

inventory = dd.scan_directory("./dados/", recursive=True)
print(inventory.summary())
inventory.to_html("inventario_pii.html")
inventory.to_json("inventario_pii.json")

for path, fi in inventory.items():
    if fi.max_risk == "high":
        print(f"HIGH RISK: {path} {list(fi.pii_columns.keys())}")
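
For intuition, a directory inventory can be approximated with the standard library alone. This sketch handles only CSV files and two patterns, and flags a column only when every sampled value matches; the helper names, the patterns, and that rule are all assumptions, not a description of dd.scan_directory().

```python
import csv
import re
from itertools import islice
from pathlib import Path

# Simplified illustrative patterns
PATTERNS = {
    "CPF": re.compile(r"^\d{3}\.\d{3}\.\d{3}-\d{2}$"),
    "EMAIL": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
}


def scan_csv(path, sample_rows=100):
    """Map column name -> PII type for columns whose sampled,
    non-empty values all match one pattern."""
    with open(path, newline="", encoding="utf-8") as fh:
        rows = list(islice(csv.DictReader(fh), sample_rows))
    found = {}
    for column in (rows[0].keys() if rows else []):
        values = [r[column] for r in rows if r[column]]
        for pii_type, rx in PATTERNS.items():
            if values and all(rx.match(v) for v in values):
                found[column] = pii_type
                break
    return found


def scan_dir(root, recursive=True):
    """Return {file path: {column: PII type}} for every CSV under root."""
    glob = Path(root).rglob if recursive else Path(root).glob
    return {str(p): scan_csv(p) for p in sorted(glob("*.csv"))}
```

Calling scan_dir("./dados/") would return a plain dict keyed by file path, which is enough to see which files deserve a closer look.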

Manipulate data

dd.where(df, uf="SP")
dd.where(df, renda_mensal=(5_000, 15_000))
dd.groupby(df, "uf", {"n": ("*","count"), "media": ("renda","mean")})
dd.add_column(df,
    imposto = dd.col("renda_mensal") * 0.275,
    faixa   = dd.when(dd.col("renda_mensal") > 10_000, "alta")
                .when(dd.col("renda_mensal") > 5_000, "media")
                .otherwise("baixa"),
)
dd.shift(df, 1)        # lag — previous period value
dd.lead(df, 1)         # next period value
dd.explode(df, "tags") # list column → multiple rows

Contracts, validation, database

# Data contract
contrato = dd.contract({
    "cpf":   {"type":"str","not_null":True,"pii":"CPF","mask":"hash"},
    "renda": {"type":"float","min":0,"max":500_000},
})
result = contrato.apply(df, salt=SALT)
contrato.save("schema.contract.json")

# Database
banco = dd.db("postgresql://user:pass@host/db", salt=SALT)
df    = dd.read(banco, "clientes")
banco.upsert(df_new, "clientes", on="cpf")

# Audit webhook
dd.configure(audit_webhook="https://hooks.slack.com/...")
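
The contract dictionary above reads as a set of per-column rules. As a hedged illustration of what checking such rules involves (the validate function, the TYPES mapping, and the problem labels are invented here and are not datalock's API), a plain-Python version might look like this:

```python
# Assumed mapping from contract type names to Python types
TYPES = {"str": str, "float": (int, float), "int": int}


def validate(rows, contract):
    """Return a list of (row_index, column, problem) tuples."""
    problems = []
    for i, row in enumerate(rows):
        for column, rule in contract.items():
            value = row.get(column)
            if value is None:
                if rule.get("not_null"):
                    problems.append((i, column, "null"))
                continue
            expected = TYPES.get(rule.get("type"))
            if expected and not isinstance(value, expected):
                problems.append((i, column, "type"))
                continue
            if "min" in rule and value < rule["min"]:
                problems.append((i, column, "below min"))
            if "max" in rule and value > rule["max"]:
                problems.append((i, column, "above max"))
    return problems


contract = {
    "cpf": {"type": "str", "not_null": True},
    "renda": {"type": "float", "min": 0, "max": 500_000},
}
rows = [
    {"cpf": "111.444.777-35", "renda": 7200.0},
    {"cpf": None, "renda": -10.0},
    {"cpf": "478.622.984-97", "renda": "alta"},
]
problems = validate(rows, contract)
print(problems)
# → [(1, 'cpf', 'null'), (1, 'renda', 'below min'), (2, 'renda', 'type')]
```

Keeping the rules in a serialisable dictionary is what makes it possible to save a contract to a JSON file and apply the same checks in another pipeline.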

The .dlk format

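A .dlk file is a binary container that combines AES-256-GCM encryption, HKDF-SHA256 key derivation, and a zstd-compressed Parquet payload. Files written by v1.0.1 and later start with the magic bytes b"DLOCK"; older files that use b"LOGUS" are still readable.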

info = dd.inspect("clientes.dlk", key=KEY)
# → {"shape":[150000,12], "columns":[...], "column_stats":{...}, "expires_at":...}
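
Two of the building blocks named above can be shown with the standard library: recognising a container by its magic bytes and deriving a 256-bit key with HKDF-SHA256 (RFC 5869). The sketch assumes the magic bytes sit at the very start of the file, and the salt and info values are placeholders; the actual .dlk layout and key schedule are not documented here. The AES-256-GCM step itself is omitted because it needs a third-party package.

```python
import hashlib
import hmac

MAGIC_CURRENT = b"DLOCK"  # v1.0.1+
MAGIC_LEGACY = b"LOGUS"   # files written before the rename


def container_kind(header):
    """Classify a file by its leading bytes (assumed to be at offset 0)."""
    if header.startswith(MAGIC_CURRENT):
        return "dlock"
    if header.startswith(MAGIC_LEGACY):
        return "logus"
    return None


def hkdf_sha256(ikm, salt, info, length=32):
    """HKDF (RFC 5869) with SHA-256: extract, then expand to `length` bytes."""
    if length > 255 * 32:
        raise ValueError("requested key is too long for HKDF-SHA256")
    if not salt:
        salt = b"\x00" * 32
    prk = hmac.new(salt, ikm, hashlib.sha256).digest()  # extract
    okm, block, counter = b"", b"", 1
    while len(okm) < length:  # expand
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]


# Placeholder inputs for illustration
key = hkdf_sha256(b"user-supplied key", b"per-file salt", b"dlk payload")
print(len(key))  # → 32
```

A 32-byte output matches the key size AES-256 requires, and varying the info argument yields independent keys from the same input secret.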

Backward compatibility

# All of these still work after the rename:
import logus as lg
lg.mask(df, salt=SALT)   # identical to dd.mask()
lg.read("f.lgs", key=KEY)  # .lgs files still read correctly

License

AGPL-3.0 — see LICENSE.
