
The .dlk format: encrypted storage with AES-256-GCM, HKDF, canary data, and LGPD compliance for data science in Python.


datalock

datalock is a Python library for privacy-by-design work with tabular data. It offers LGPD/GDPR compliance tooling, automatic PII detection and masking, AES-256-GCM encrypted storage (the .dlk format), expressive data manipulation on top of Polars, and transparent canary data for leak tracing.

pip install datalock

import datalock as dd
import os

SALT = os.environ["DATALOCK_SALT"]
KEY  = os.environ["DATALOCK_KEY"]

df       = dd.read("clientes.csv")              # any format → pl.DataFrame
df_safe  = dd.mask(df, salt=SALT)               # detect + mask PII (LGPD)
dd.store(df_safe, "clientes.dlk", key=KEY)      # AES-256-GCM encrypted
df_back  = dd.read("clientes.dlk", key=KEY)     # decrypt and read back

datalock was renamed from logus-lgpd. The legacy import, import logus as lg, still works.


What datalock does

Capability                    Function
Read any tabular format       dd.read()
Detect PII automatically      dd.scan()
Mask PII (HMAC-SHA256)        dd.mask()
Save with AES-256-GCM         dd.store()
Expressive manipulation       dd.where(), dd.groupby(), dd.add_column()
Full pipeline in one call     dd.process()
Data quality validation       dd.validate()
Database with masking         dd.db()
Directory PII inventory       dd.scan_directory()
Canary leak tracing           dd.store(..., canary=True)
Masked text (free-form)       dd.mask_text(..., strategy="semantic")
Data contracts                dd.contract()
Privacy metrics               dd.check.kanon(), dd.check.risk()
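
For readers unfamiliar with the k-anonymity metric listed in the table, here is a minimal standard-library sketch of the idea: k is the size of the smallest group of rows that share the same quasi-identifier values. It does not call datalock and is not necessarily how dd.check.kanon() is implemented; the function name and sample rows are hypothetical.

```python
from collections import Counter


def k_anonymity(rows, quasi_identifiers):
    """Return the size of the smallest group sharing the same quasi-identifier values."""
    if not rows:
        return 0
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(groups.values())


# Hypothetical sample data: two rows in (SP, alta), three rows in (RJ, media).
rows = [
    {"uf": "SP", "faixa": "alta"},
    {"uf": "SP", "faixa": "alta"},
    {"uf": "RJ", "faixa": "media"},
    {"uf": "RJ", "faixa": "media"},
    {"uf": "RJ", "faixa": "media"},
]
k = k_anonymity(rows, ["uf", "faixa"])
```

A low k means at least one combination of quasi-identifiers points to very few people, which usually calls for coarser grouping or suppression before sharing.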

Installation

pip install datalock                    # core
pip install "datalock[sql]"             # + SQL via DuckDB
pip install "datalock[excel]"           # + Excel (.xlsx)
pip install "datalock[synthetic]"       # + Faker for richer synthetic data
pip install "datalock[full]"            # everything

Requires: Python ≥ 3.10, Polars ≥ 1.0, pandas ≥ 2.0, pyarrow ≥ 14.0


Quick Start

import datalock as dd
import os

SALT = os.environ["DATALOCK_SALT"]
KEY  = os.environ["DATALOCK_KEY"]

# Backward compatibility: the legacy module name still imports
import logus as lg
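
The Quick Start reads DATALOCK_SALT and DATALOCK_KEY from the environment but does not show how to create them. One possible approach is sketched below with Python's secrets module. Whether datalock expects hex strings of this length is an assumption; check the project's own documentation for the accepted key format. The helper name is hypothetical.

```python
import secrets


def new_secret_hex(n_bytes=32):
    """Return n_bytes of cryptographically strong randomness as a hex string."""
    return secrets.token_hex(n_bytes)


# Assumption: 32 random bytes (64 hex characters) each for the salt and the key.
salt_value = new_secret_hex()
key_value = new_secret_hex()
```

Values like these belong in a secrets manager or the deployment environment, not in source control.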

Read any file format

df = dd.read("clientes.csv")
df = dd.read("clientes.parquet")
df = dd.read("clientes.dlk", key=KEY)

# Large files: load only part of the data to avoid out-of-memory errors
df   = dd.read("big.parquet", head=100_000)
df   = dd.read("big.parquet", sample=500_000)
info = dd.read("big.parquet", header_only=True)
df   = dd.read("big.parquet", n_chunks=5, chunks=[2, 4])
for chunk in dd.read("big.parquet", n_chunks=10, iter_chunks=True):
    process(chunk)  # process() stands in for your own per-chunk handler

Detect and mask PII

reports = dd.scan(df)
df_safe = dd.mask(df, salt=SALT)
df_safe = dd.mask(df.lazy(), salt=SALT)    # LazyFrame stays lazy

# Custom PII patterns (company-specific identifiers)
reports = dd.scan(df, custom_patterns={
    "num_contrato": r"^CTR-[0-9]{8}$",
    "matricula":    r"^[0-9]{6}-[A-Z]$",
})
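
The capability table names HMAC-SHA256 as the masking primitive. The sketch below shows what salted, deterministic pseudonymization with HMAC-SHA256 looks like using only the standard library. It illustrates the general technique, not datalock's internals: the function name, the 16-character truncation, and the salt values are assumptions.

```python
import hashlib
import hmac


def pseudonymize(value: str, salt: str, length: int = 16) -> str:
    """Return a keyed, deterministic token for value (truncated hex HMAC-SHA256)."""
    digest = hmac.new(salt.encode("utf-8"), value.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:length]


# Same value and salt give the same token, so joins across tables still work.
a = pseudonymize("111.444.777-35", salt="example-salt")
b = pseudonymize("111.444.777-35", salt="example-salt")
# A different salt gives an unrelated token.
c = pseudonymize("111.444.777-35", salt="other-salt")
```

Because the salt acts as the HMAC key, someone without it cannot rebuild the mapping by hashing candidate values, which is why the salt is kept in the environment rather than in code.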

Save encrypted (.dlk)

dd.store(df, "clientes.dlk", key=KEY)
dd.store(df, "clientes.dlk", key=KEY, salt=SALT)
dd.store(df, "clientes.dlk", key=KEY, expires_at="2025-12-31")

# Asymmetric — share without sharing the key
priv, pub = dd.generate_keypair("ec")
dd.store(df, "clientes.dlk", public_key=pub)
df = dd.read("clientes.dlk", private_key=priv)

Canary data (transparent leak tracing)

# Inject canary rows silently — user never sees them
dd.store(df, "clientes.dlk", key=KEY, canary=True)
df_back = dd.read("clientes.dlk", key=KEY)
# df_back.shape == df.shape  — canary rows stripped automatically

# If "canary.1ba472d8@datalock.internal" appears in a breach dump:
dd.canary_check("canary.1ba472d8@datalock.internal")
# → {"pipeline_id": "crm_jan2025", "filepath": "clientes.dlk", ...}
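
To make the canary idea concrete, here is a small self-contained sketch of the inject, strip, and look-up cycle on plain lists of dictionaries. It is an illustration under stated assumptions, not datalock's implementation: the helper names, the in-memory registry, and the way the 8-character tag is derived are all hypothetical. Only the address shape (canary.<tag>@datalock.internal) comes from the example above.

```python
import hashlib
import hmac

CANARY_DOMAIN = "datalock.internal"  # domain taken from the example above


def make_canary(pipeline_id, secret):
    """Build a fake row whose email carries a tag derived from the pipeline id."""
    tag = hmac.new(secret.encode(), pipeline_id.encode(), hashlib.sha256).hexdigest()[:8]
    return {"nome": "Canary", "email": f"canary.{tag}@{CANARY_DOMAIN}"}


def inject(rows, canary):
    """Return a copy of rows with the canary row appended."""
    return rows + [canary]


def strip(rows, canary_emails):
    """Return rows without any registered canary rows."""
    return [r for r in rows if r["email"] not in canary_emails]


registry = {}  # hypothetical registry: canary email -> where it was planted

rows = [{"nome": "Ana", "email": "ana@example.com"}]
canary = make_canary("crm_jan2025", secret="example-secret")
registry[canary["email"]] = {"pipeline_id": "crm_jan2025", "filepath": "clientes.dlk"}

stored = inject(rows, canary)               # what would be written to disk
restored = strip(stored, registry.keys())   # what a legitimate reader gets back
hit = registry.get(canary["email"])         # lookup when the address shows up in a dump
```

The point of the design is that the canary only exists in the stored copy, so its appearance anywhere else identifies which file leaked.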

Mask text (free-form strings)

text = "Cliente CPF 111.444.777-35, email joao@empresa.com"

dd.mask_text(text, salt=SALT, strategy="redact")
# → "Cliente [CPF], [EMAIL]"

dd.mask_text(text, salt=SALT, strategy="hash")
# → "Cliente 3f2a8b1c9d4e7f0a, 9e1d3c7f2a845b61"

dd.mask_text(text, salt=SALT, strategy="semantic")
# → "Cliente 478.622.984-97, roberto.santos@gmail.com"
# Real-looking fake data (CPF mathematically valid, no faker needed)
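
The phrase "mathematically valid" refers to the two CPF check digits, which are computed from the first nine digits with a weighted sum modulo 11. The sketch below implements that public rule so the examples above can be verified. The function names are hypothetical and this is not datalock code.

```python
import re


def cpf_check_digits(base9):
    """Compute the two CPF check digits for the first nine digits."""
    digits = [int(c) for c in base9]
    for _ in range(2):
        # Weights run from len(digits) + 1 down to 2 (10..2, then 11..2).
        weights = range(len(digits) + 1, 1, -1)
        total = sum(d * w for d, w in zip(digits, weights))
        # A remainder of 10 maps to check digit 0.
        digits.append((total * 10) % 11 % 10)
    return f"{digits[9]}{digits[10]}"


def cpf_is_valid(cpf):
    """Return True if cpf has 11 digits, is not a repeated digit, and its check digits match."""
    digits = re.sub(r"\D", "", cpf)
    if len(digits) != 11 or len(set(digits)) == 1:
        return False
    return cpf_check_digits(digits[:9]) == digits[9:]


original_ok = cpf_is_valid("111.444.777-35")   # CPF from the input text
semantic_ok = cpf_is_valid("478.622.984-97")   # CPF from the "semantic" output
```

Both CPFs pass the check, which is what lets semantically masked data flow through downstream validators that reject malformed identifiers.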

Scan a directory for PII

inventory = dd.scan_directory("./dados/", recursive=True)
print(inventory.summary())
inventory.to_html("inventario_pii.html")
inventory.to_json("inventario_pii.json")

for path, fi in inventory.items():
    if fi.max_risk == "high":
        print(f"HIGH RISK: {path}: {list(fi.pii_columns.keys())}")

Manipulate data

dd.where(df, uf="SP")
dd.where(df, renda_mensal=(5_000, 15_000))
dd.groupby(df, "uf", {"n": ("*","count"), "media": ("renda","mean")})
dd.add_column(df,
    imposto = dd.col("renda_mensal") * 0.275,
    faixa   = dd.when(dd.col("renda_mensal") > 10_000, "alta")
                .when(dd.col("renda_mensal") > 5_000, "media")
                .otherwise("baixa"),
)
dd.shift(df, 1)        # lag — previous period value
dd.lead(df, 1)         # next period value
dd.explode(df, "tags") # list column → multiple rows

Contracts, validation, database

# Data contract
contrato = dd.contract({
    "cpf":   {"type":"str","not_null":True,"pii":"CPF","mask":"hash"},
    "renda": {"type":"float","min":0,"max":500_000},
})
result = contrato.apply(df, salt=SALT)
contrato.save("schema.contract.json")

# Database
banco = dd.db("postgresql://user:pass@host/db", salt=SALT)
df    = dd.read(banco, "clientes")
banco.upsert(df_new, "clientes", on="cpf")

# Audit webhook
dd.configure(audit_webhook="https://hooks.slack.com/...")
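
To show what the rules in the contract dictionary above express, here is a plain-Python sketch that checks a single record against the type, not_null, min, and max keys. It is only an illustration of the semantics those keys suggest. The function name, the message format, and the treatment of integers as valid floats are assumptions, and the pii and mask keys are left out.

```python
def check_record(record, contract):
    """Return a list of human-readable violations for one record."""
    type_map = {"str": str, "float": (int, float)}
    problems = []
    for name, rules in contract.items():
        value = record.get(name)
        if value is None:
            if rules.get("not_null"):
                problems.append(f"{name}: null not allowed")
            continue
        expected = type_map.get(rules.get("type"))
        if expected and not isinstance(value, expected):
            problems.append(f"{name}: expected {rules['type']}")
            continue
        if "min" in rules and value < rules["min"]:
            problems.append(f"{name}: below min {rules['min']}")
        if "max" in rules and value > rules["max"]:
            problems.append(f"{name}: above max {rules['max']}")
    return problems


contract = {
    "cpf":   {"type": "str", "not_null": True},
    "renda": {"type": "float", "min": 0, "max": 500_000},
}
ok = check_record({"cpf": "111.444.777-35", "renda": 7500.0}, contract)
bad = check_record({"cpf": None, "renda": -1.0}, contract)
```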

The .dlk format

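A .dlk file is a binary container that combines AES-256-GCM encryption, HKDF-SHA256 key derivation, and a zstd-compressed Parquet payload. Files written by v1.1.1 and later start with the magic bytes b"DLOCK". Older files with the b"LOGUS" magic bytes are still readable.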

info = dd.inspect("clientes.dlk", key=KEY)
# → {"shape":[150000,12], "columns":[...], "column_stats":{...}, "expires_at":...}
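
The format description names HKDF-SHA256 for key derivation. For readers who have not met it, the sketch below implements the extract-and-expand construction from RFC 5869 with the standard library. How datalock chooses its salt and info inputs is not documented here, so the values used are placeholders and the function name is hypothetical.

```python
import hashlib
import hmac

HASH_LEN = hashlib.sha256().digest_size  # 32 bytes


def hkdf_sha256(key_material, salt, info, length=32):
    """Derive length bytes from key_material using HKDF with SHA-256 (RFC 5869)."""
    if length > 255 * HASH_LEN:
        raise ValueError("requested output is too long for HKDF-SHA256")
    # Extract: condense the input key material into a fixed-size pseudorandom key.
    prk = hmac.new(salt or b"\x00" * HASH_LEN, key_material, hashlib.sha256).digest()
    # Expand: chain HMAC blocks, each bound to the info string and a counter byte.
    okm, block, counter = b"", b"", 1
    while len(okm) < length:
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]


# Placeholder inputs: one master secret yields independent keys per purpose.
master = b"example master key material"
enc_key = hkdf_sha256(master, salt=b"example-salt", info=b"encryption")
tag_key = hkdf_sha256(master, salt=b"example-salt", info=b"canary")
```

A 32-byte output matches the key size AES-256 needs, and varying the info string keeps keys for different purposes separate even when they come from the same secret.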

Backward compatibility

# All of these still work after the rename:
import logus as lg
lg.mask(df, salt=SALT)   # identical to dd.mask()
lg.read("f.lgs", key=KEY)  # .lgs files still read correctly
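
One way a tool could tell current and legacy containers apart is by the magic bytes mentioned in the format section. The sketch below assumes the magic bytes sit at the very start of the file, which the source does not state explicitly. The function name is hypothetical and the demo file is a dummy, not a real .dlk container.

```python
import os
import tempfile

MAGIC_CURRENT = b"DLOCK"  # v1.1.1+
MAGIC_LEGACY = b"LOGUS"   # files written before the rename


def container_kind(path):
    """Classify a file by its first five bytes (assumed location of the magic bytes)."""
    with open(path, "rb") as fh:
        head = fh.read(5)
    if head == MAGIC_CURRENT:
        return "current"
    if head == MAGIC_LEGACY:
        return "legacy"
    return "unknown"


# Demo with a dummy file that only carries the magic bytes plus padding.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.dlk")
    with open(demo_path, "wb") as fh:
        fh.write(MAGIC_CURRENT + b"\x00" * 16)
    kind = container_kind(demo_path)
```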

License

AGPL-3.0 — see LICENSE.
