Privacy-by-design for tabular data: LGPD compliance in Python.
datalock
datalock is a Python library for privacy-by-design with tabular data.
It provides LGPD/GDPR compliance tooling, automatic PII detection and masking,
AES-256-GCM encrypted storage (.dlk format), expressive data manipulation over
Polars, and transparent canary data for leak tracing.
pip install datalock
import datalock as dd
import os
SALT = os.environ["DATALOCK_SALT"]
KEY = os.environ["DATALOCK_KEY"]
df = dd.read("clientes.csv") # any format → pl.DataFrame
df_safe = dd.mask(df, salt=SALT) # detect + mask PII (LGPD)
dd.store(df_safe, "clientes.dlk", key=KEY) # AES-256-GCM encrypted
df_back = dd.read("clientes.dlk", key=KEY) # decrypt and read back
Renamed from logus-lgpd. The import logus as lg alias still works.
What datalock does
| Capability | Function |
|---|---|
| Read any tabular format | dd.read() |
| Detect PII automatically | dd.scan() |
| Mask PII (HMAC-SHA256) | dd.mask() |
| Save with AES-256-GCM | dd.store() |
| Expressive manipulation | dd.where(), dd.groupby(), dd.add_column() |
| Full pipeline in one call | dd.process() |
| Data quality validation | dd.validate() |
| Database with masking | dd.db() |
| Directory PII inventory | dd.scan_directory() |
| Canary leak tracing | dd.store(..., canary=True) |
| Masked text (free-form) | dd.mask_text(..., strategy="semantic") |
| Data contracts | dd.contract() |
| Privacy metrics | dd.check.kanon(), dd.check.risk() |
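The table lists HMAC-SHA256 as the masking primitive. As a minimal sketch of what keyed, deterministic masking means (this is not datalock's implementation; `mask_value` and the 16-character truncation are assumptions made for illustration), the standard library is enough:

```python
import hashlib
import hmac


def mask_value(value: str, salt: str, length: int = 16) -> str:
    """Return a keyed, deterministic pseudonym for a single value.

    Hypothetical helper: the truncation length is an assumption, not
    something the datalock documentation specifies.
    """
    digest = hmac.new(salt.encode("utf-8"), value.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:length]


# Same value + same salt -> same pseudonym, so joins across masked tables still work.
a = mask_value("111.444.777-35", salt="demo-salt")
b = mask_value("111.444.777-35", salt="demo-salt")
# A different salt yields an unrelated pseudonym.
c = mask_value("111.444.777-35", salt="other-salt")
print(a == b, a == c)
```

Because the digest is keyed by the salt, someone without the salt cannot rebuild the mapping by hashing candidate values, which is why the salt is read from the environment rather than hard-coded.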
Installation
pip install datalock # core
pip install "datalock[sql]" # + SQL via DuckDB
pip install "datalock[excel]" # + Excel (.xlsx)
pip install "datalock[synthetic]" # + Faker for richer synthetic data
pip install "datalock[full]" # everything
Requires: Python ≥ 3.10, Polars ≥ 1.0, pandas ≥ 2.0, pyarrow ≥ 14.0
Quick Start
import datalock as dd
import os
SALT = os.environ["DATALOCK_SALT"]
KEY = os.environ["DATALOCK_KEY"]
# Backward compat — both work
import logus as lg # still works
Read any file format
df = dd.read("clientes.csv")
df = dd.read("clientes.parquet")
df = dd.read("clientes.dlk", key=KEY)
# Big data — no OOM
df = dd.read("big.parquet", head=100_000)
df = dd.read("big.parquet", sample=500_000)
info = dd.read("big.parquet", header_only=True)
df = dd.read("big.parquet", n_chunks=5, chunks=[2, 4])
for chunk in dd.read("big.parquet", n_chunks=10, iter_chunks=True):
    process(chunk)
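To make the `n_chunks` / `chunks` idea concrete, here is a rough sketch of how a file's rows could be split into contiguous ranges and a subset picked. The helper names, the even-split rule, and the 1-based chunk numbering are all assumptions; the source does not state how datalock numbers or sizes its chunks.

```python
def chunk_bounds(n_rows: int, n_chunks: int) -> list[tuple[int, int]]:
    """Split n_rows into n_chunks contiguous [start, stop) ranges of near-equal size."""
    base, extra = divmod(n_rows, n_chunks)
    bounds = []
    start = 0
    for i in range(n_chunks):
        # The first `extra` chunks absorb one leftover row each.
        size = base + (1 if i < extra else 0)
        bounds.append((start, start + size))
        start += size
    return bounds


def select_chunks(n_rows: int, n_chunks: int, chunks: list[int]) -> list[tuple[int, int]]:
    """Pick specific chunks by number (assumed 1-based here)."""
    bounds = chunk_bounds(n_rows, n_chunks)
    return [bounds[i - 1] for i in chunks]


print(select_chunks(10, 5, [2, 4]))
```

Only the selected row ranges need to be materialized, which is what keeps memory bounded on large files.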
Detect and mask PII
reports = dd.scan(df)
df_safe = dd.mask(df, salt=SALT)
df_safe = dd.mask(df.lazy(), salt=SALT) # LazyFrame stays lazy
# Custom PII patterns (company-specific identifiers)
reports = dd.scan(df, custom_patterns={
    "num_contrato": r"^CTR-[0-9]{8}$",
    "matricula": r"^[0-9]{6}-[A-Z]$",
})
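The custom patterns above are ordinary anchored regular expressions. A minimal sketch of pattern-based column detection, assuming a column is flagged when most of its non-empty values match (the `detect_columns` helper and the 0.8 threshold are illustrative assumptions, not datalock's actual rule):

```python
import re

# Patterns copied from the example above.
CUSTOM_PATTERNS = {
    "num_contrato": r"^CTR-[0-9]{8}$",
    "matricula": r"^[0-9]{6}-[A-Z]$",
}


def detect_columns(
    columns: dict[str, list[str]],
    patterns: dict[str, str],
    threshold: float = 0.8,
) -> dict[str, str]:
    """Map column name -> pattern label when enough values match the pattern."""
    compiled = {label: re.compile(rx) for label, rx in patterns.items()}
    found = {}
    for name, values in columns.items():
        non_empty = [v for v in values if v]
        if not non_empty:
            continue
        for label, rx in compiled.items():
            hits = sum(1 for v in non_empty if rx.match(v))
            if hits / len(non_empty) >= threshold:
                found[name] = label
                break
    return found


sample = {
    "contrato": ["CTR-12345678", "CTR-00000001"],
    "mat": ["123456-A", "654321-Z"],
    "cidade": ["Recife", "Natal"],
}
print(detect_columns(sample, CUSTOM_PATTERNS))
```

Matching on values rather than column names means a renamed column is still caught, at the cost of scanning a sample of the data.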
Save encrypted (.dlk)
dd.store(df, "clientes.dlk", key=KEY)
dd.store(df, "clientes.dlk", key=KEY, salt=SALT)
dd.store(df, "clientes.dlk", key=KEY, expires_at="2025-12-31")
# Asymmetric — share without sharing the key
priv, pub = dd.generate_keypair("ec")
dd.store(df, "clientes.dlk", public_key=pub)
df = dd.read("clientes.dlk", private_key=priv)
Canary data (transparent leak tracing)
# Inject canary rows silently — user never sees them
dd.store(df, "clientes.dlk", key=KEY, canary=True)
df_back = dd.read("clientes.dlk", key=KEY)
# df_back.shape == df.shape — canary rows stripped automatically
# If "canary.1ba472d8@datalock.internal" appears in a breach dump:
dd.canary_check("canary.1ba472d8@datalock.internal")
# → {"pipeline_id": "crm_jan2025", "filepath": "clientes.dlk", ...}
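One way to picture canary tracing: each stored file gets a unique, plausible-looking address that no real customer owns, and a registry maps that address back to where it was planted. The sketch below is an assumption-laden illustration (the `CanaryRegistry` class, the HMAC-derived token, and the in-memory registry are hypothetical), not a description of datalock's internals.

```python
import hashlib
import hmac
from typing import Optional

CANARY_DOMAIN = "datalock.internal"  # domain taken from the example above


def canary_email(secret: bytes, pipeline_id: str, filepath: str) -> str:
    """Derive a stable 8-hex-character token for one (pipeline, file) pair."""
    message = f"{pipeline_id}|{filepath}".encode("utf-8")
    token = hmac.new(secret, message, hashlib.sha256).hexdigest()[:8]
    return f"canary.{token}@{CANARY_DOMAIN}"


class CanaryRegistry:
    """Hypothetical in-memory registry of issued canary addresses."""

    def __init__(self, secret: bytes) -> None:
        self._secret = secret
        self._issued: dict[str, dict[str, str]] = {}

    def issue(self, pipeline_id: str, filepath: str) -> str:
        email = canary_email(self._secret, pipeline_id, filepath)
        self._issued[email] = {"pipeline_id": pipeline_id, "filepath": filepath}
        return email

    def check(self, email: str) -> Optional[dict[str, str]]:
        """Return where a leaked address was planted, or None if unknown."""
        return self._issued.get(email)

    def strip(self, rows: list[dict]) -> list[dict]:
        """Drop canary rows so readers see only the original data."""
        return [row for row in rows if row.get("email") not in self._issued]


registry = CanaryRegistry(b"demo-secret")
email = registry.issue("crm_jan2025", "clientes.dlk")
rows = [{"email": "joao@empresa.com"}, {"email": email}]
print(registry.check(email), len(registry.strip(rows)))
```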
Mask text (free-form strings)
text = "Cliente CPF 111.444.777-35, email joao@empresa.com"
dd.mask_text(text, salt=SALT, strategy="redact")
# → "Cliente [CPF], [EMAIL]"
dd.mask_text(text, salt=SALT, strategy="hash")
# → "Cliente 3f2a8b1c9d4e7f0a, 9e1d3c7f2a845b61"
dd.mask_text(text, salt=SALT, strategy="semantic")
# → "Cliente 478.622.984-97, roberto.santos@gmail.com"
# Real-looking fake data (CPF mathematically valid, no faker needed)
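"Mathematically valid" refers to the two CPF check digits, which follow the public modulo-11 rule. The sketch below computes and verifies them; the function names are illustrative and this is not datalock's generator.

```python
import re


def cpf_check_digits(base9: str) -> str:
    """Compute the two check digits for the first nine CPF digits."""

    def digit(digits: list[int], first_weight: int) -> int:
        # Weights run from first_weight down to 2.
        total = sum(d * w for d, w in zip(digits, range(first_weight, 1, -1)))
        remainder = (total * 10) % 11
        return 0 if remainder == 10 else remainder

    digits = [int(c) for c in base9]
    d1 = digit(digits, 10)
    d2 = digit(digits + [d1], 11)
    return f"{d1}{d2}"


def is_valid_cpf(cpf: str) -> bool:
    """Check the length, reject repeated-digit CPFs, and verify both check digits."""
    digits = re.sub(r"\D", "", cpf)
    if len(digits) != 11 or len(set(digits)) == 1:
        return False
    return cpf_check_digits(digits[:9]) == digits[9:]


# Both CPFs that appear in the example above pass the check.
print(is_valid_cpf("111.444.777-35"), is_valid_cpf("478.622.984-97"))
```

A semantic mask that emits check-digit-valid CPFs keeps downstream validators and test fixtures working without exposing the real number.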
Scan a directory for PII
inventory = dd.scan_directory("./dados/", recursive=True)
print(inventory.summary())
inventory.to_html("inventario_pii.html")
inventory.to_json("inventario_pii.json")
for path, fi in inventory.items():
    if fi.max_risk == "high":
        print(f"HIGH RISK: {path} → {list(fi.pii_columns.keys())}")
Manipulate data
dd.where(df, uf="SP")
dd.where(df, renda_mensal=(5_000, 15_000))
dd.groupby(df, "uf", {"n": ("*","count"), "media": ("renda","mean")})
dd.add_column(df,
    imposto = dd.col("renda_mensal") * 0.275,
    faixa = dd.when(dd.col("renda_mensal") > 10_000, "alta")
        .when(dd.col("renda_mensal") > 5_000, "media")
        .otherwise("baixa"),
)
dd.shift(df, 1) # lag — previous period value
dd.lead(df, 1) # next period value
dd.explode(df, "tags") # list column → multiple rows
Contracts, validation, database
# Data contract
contrato = dd.contract({
    "cpf": {"type":"str","not_null":True,"pii":"CPF","mask":"hash"},
    "renda": {"type":"float","min":0,"max":500_000},
})
result = contrato.apply(df, salt=SALT)
contrato.save("schema.contract.json")
# Database
banco = dd.db("postgresql://user:pass@host/db", salt=SALT)
df = dd.read(banco, "clientes")
banco.upsert(df_new, "clientes", on="cpf")
# Audit webhook
dd.configure(audit_webhook="https://hooks.slack.com/...")
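For the data contract earlier in this section, it can help to see what the `type`, `not_null`, `min`, and `max` rules amount to. This is a plain-Python sketch under assumed semantics (the `violations` helper and its message format are hypothetical; `pii` and `mask` handling is left out):

```python
from typing import Any

# Subset of the contract shown above; "pii" and "mask" are omitted in this sketch.
CONTRACT = {
    "cpf": {"type": "str", "not_null": True},
    "renda": {"type": "float", "min": 0, "max": 500_000},
}

# Assumption: integers are accepted wherever a float is expected.
TYPES = {"str": str, "float": (int, float)}


def violations(rows: list[dict[str, Any]], contract: dict[str, dict]) -> list[str]:
    """Return one human-readable message per broken rule."""
    problems = []
    for i, row in enumerate(rows):
        for column, rule in contract.items():
            value = row.get(column)
            if value is None:
                if rule.get("not_null"):
                    problems.append(f"row {i}: {column} is null")
                continue
            if not isinstance(value, TYPES[rule["type"]]):
                problems.append(f"row {i}: {column} is not {rule['type']}")
                continue
            if "min" in rule and value < rule["min"]:
                problems.append(f"row {i}: {column} below min {rule['min']}")
            if "max" in rule and value > rule["max"]:
                problems.append(f"row {i}: {column} above max {rule['max']}")
    return problems


rows = [
    {"cpf": "11144477735", "renda": 3200.0},
    {"cpf": None, "renda": -5.0},
    {"cpf": "47862298497", "renda": 900_000.0},
]
for message in violations(rows, CONTRACT):
    print(message)
```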
The .dlk format
Binary container: AES-256-GCM + HKDF-SHA256 + Parquet/zstd.
Magic bytes: b"DLOCK" (v1.0.1+). Files written with the earlier b"LOGUS" magic bytes are still readable.
info = dd.inspect("clientes.dlk", key=KEY)
# → {"shape":[150000,12], "columns":[...], "column_stats":{...}, "expires_at":...}
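The format line names HKDF-SHA256 for key derivation and two accepted magic prefixes. As a sketch only (the real header layout, the salt and `info` values, and how datalock feeds the key into HKDF are not documented here and are assumed), HKDF per RFC 5869 and a magic-byte check look like this with the standard library:

```python
import hashlib
import hmac

# Magic prefixes named above; the labels are illustrative.
MAGICS = {b"DLOCK": "datalock", b"LOGUS": "logus (legacy)"}


def sniff_magic(header: bytes) -> str:
    """Identify a container by its leading magic bytes."""
    for magic, name in MAGICS.items():
        if header.startswith(magic):
            return name
    raise ValueError("unrecognized magic bytes")


def hkdf_sha256(ikm: bytes, salt: bytes, info: bytes, length: int = 32) -> bytes:
    """HKDF extract-then-expand (RFC 5869) using HMAC-SHA256."""
    if not salt:
        salt = b"\x00" * hashlib.sha256().digest_size
    # Extract: concentrate the input keying material into a pseudorandom key.
    prk = hmac.new(salt, ikm, hashlib.sha256).digest()
    # Expand: chain HMAC blocks until enough output bytes are available.
    okm = b""
    block = b""
    counter = 1
    while len(okm) < length:
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]


# A 32-byte output is the key size AES-256 requires.
key = hkdf_sha256(b"user-supplied key", salt=b"per-file salt", info=b"dlk-demo")
print(len(key), sniff_magic(b"DLOCK\x01rest-of-header"))
```

Deriving a fresh key per file from a random salt means two files stored with the same user key do not share an encryption key.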
Backward compatibility
# All of these still work after the rename:
import logus as lg
lg.mask(df, salt=SALT) # identical to dd.mask()
lg.read("f.lgs", key=KEY) # .lgs files still read correctly
License
AGPL-3.0 — see LICENSE.
File details
Details for the file datalock-1.0.1.tar.gz.
File metadata
- Download URL: datalock-1.0.1.tar.gz
- Upload date:
- Size: 243.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.13
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 43463cefb041d79841d201144ec001ecce8b4c64c195e44184f1faafacc56e0c |
| MD5 | 7e54ed19aa3a9327e52289719481775d |
| BLAKE2b-256 | 9614412c3297dba27165985f857c01ef64ece3400081f4a983947f6a30ab53f9 |
File details
Details for the file datalock-1.0.1-py3-none-any.whl.
File metadata
- Download URL: datalock-1.0.1-py3-none-any.whl
- Upload date:
- Size: 270.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.13
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 60c55cb04384b0e7c30c054735b082ac043ad74776b9b1d21e5bbee45d4ba4b5 |
| MD5 | 69d0dd8fa78ce89f43f10c20edccd3f8 |
| BLAKE2b-256 | 668f1668026fc1d10e4c0fae1c34dc729b820180f87524e1d8319094a28b4da6 |