Skip to main content

Safe processing utilities for CSVW data

Project description

CSVW-SAFE Utility Library

This library provides Python utilities for generating, validating, and testing CSVW-SAFE metadata and associated dummy datasets for differential privacy (DP) development and safe data modeling workflows.

It includes five main scripts:

  1. make_metadata_from_data.py
  2. make_dummy_from_metadata.py
  3. validate_metadata.py
  4. validate_metadata_shacl.py (requires pyshacl)
  5. assert_same_structure.py

Overview

In addition, two other scripts are available for conversion of csvw-safe metadata to smartnoise sql and opendp libraries: 6. csvw_to_smartnoise_sql.py converts the metadata to the format expected in smartnoise-sql 7. assert_same_structure.py prepares a context object for opendp with margin and information extracted from csvw-metadata format.

Overview

NOTES:

  • These scripts assist safe data modeling workflows; they DO NOT replace governance decisions on what is public information or not.
  • IMPORTANT: Automatically generated metadata may contain sensitive information — MANUAL REVIEW IS ALWAYS REQUIRED before further steps.

For a description of CSVW-SAFE metadata, see here.


Installation

Install Python 3.11+ and

pip install csvw-safe

or for development:

git clone https://github.com/dscc-admin-ch/csvw-safe-library.git
cd csvw-safe-library
pip install -e .[dev]

For testing:

cd csvw-safe-library
pip install -e .[dev]
pytest --cov=csvw_safe --cov-report=term-missing tests/

Learn via example

To get to know the library with examples, see the [notebook on the extended penguin dataset](notebook https://github.com/dscc-admin-ch/csvw-safe/blob/update_readme/csvw-safe-library/examples/Use-Library.ipynb) and the associated outputs in metadata example folder.

Scripts Overview

1. make_metadata_from_data.py

Purpose

Automatically generate baseline CSVW-SAFE metadata from an existing dataset.

This script infers:

  • Column datatypes
  • Nullability and missingness rates
  • Numeric bounds (min/max)
  • Optional continuous partitions
  • Contribution constraints (DP-oriented metadata)
  • Optional column dependencies
  • Optional column grouping metadata

Important: This tool is for automated metadata drafting only. All outputs must be manually reviewed (and properties can be removed) before publication.

The script first builds a pydantic TableMetadata model and then serialises it to a json-ld via a to_dict() method. See TableMetadata.md for more detailed explanation on the inner workings.

Differential Privacy (DP) Contribution Levels

The script provides flexibility in defining the level of detail for DP metadata.

Warning: Increasing the level of detail (i.e., more granular contribution definitions) can increase the risk of privacy leakage.
It is strongly recommended to:

  • Choose the lowest level of detail sufficient for your use case
  • Carefully review and validate the generated metadata

Four contribution levels are supported: table, table_with_keys, column and partition. By default the contribution level is default_contributions_level=table. If a different level is required for a column, it can by given via the argument fine_contributions_level (see CLI usage examples below).

1. table level

Defines DP constraints only at the table level.

Characteristics:

  • Only table-level DP properties are specified
  • Column metadata is minimal and includes:
    • name
    • datatype
    • required
    • privacy_id
    • nullable_proportion
    • minimum / maximum (if applicable)
  • No:
    • public_keys_values properties on column
    • ColumnGroup class
    • Partition class

Use case:

  • When only global dataset-level privacy guarantees are required
  • Safest option in terms of minimizing privacy leakage risk
2. table_with_keys level

As table level but with keys on categorical columns and ColumnGroup.

Use case:

  • As table with keys being public information (like months in year, hours in day).
3. column level

Defines DP constraints at both the table_with_keys and column levels.

Requirements:

  • privacy_unit must be specified to compute contribution bounds

Characteristics:

  • Includes all table-level information
  • Adds per-column DP properties (maximum contribution when grouping by the column):
    • max_length
    • max_groups_per_unit
    • max_contributions
  • For categorical columns:
    • Extracts public_keys_values (set of possible values)
  • Introduces column groups (ColumnGroupMetadata):
    • Represent combinations of columns
    • Include:
      • public_keys_values (combinations of values)
      • DP parameters for grouped contributions (maximum contribution when grouping by the group of columns)

Not included:

  • No Partition objects

Use case:

  • When per-column and multi-column contribution constraints are needed
  • Balanced trade-off between utility and privacy
4. partition level

Defines DP constraints at the table_with_keys, column, and partition levels.

Characteristics:

  • Includes all column-level information
  • Introduces explicit Partition objects
  • DP parameters are defined at:
    • Table level (global bounds)
    • Partition level (fine-grained bounds)

Partition behavior:

  • Each Partition specifies:
    • A predicate (categorical value or continuous range)
    • DP parameters (maximum contribution in the partition):
      • max_length
      • max_groups_per_unit
      • max_contributions
  • These parameters represent the maximum contribution of a privacy unit within that specific partition

Continuous columns:

  • If bounds (minimum, maximum) are provided:
    • The column is divided into partitions (e.g., ranges)
    • Each partition is assigned its own DP constraints

Use case:

  • When fine-grained control over contributions is required
  • Highest expressiveness, but also highest privacy risk
Summary
Level Scope Risk Level
table Table only Lowest
table_with_keys Table with keys in categorical columns Medium
column Table + Column Medium
partition Table + Column + Partition Highest

Start with the table level and only increase granularity if required.
Always validate that all information are already public information.

CLI Usage Examples

# Basic usage
python make_metadata_from_data.py data.csv --privacy_unit user_id,

It is possible to compute dependencies (bigger, depends on, etc) between columns with

# Enable dependency detection (default: True)
python make_metadata_from_data.py data.csv \
  --privacy_unit user_id \
  --with_dependencies True

It is also possible to describe partitions level of continuous data if public bounds are provided

# Add continuous partitions
python make_metadata_from_data.py data.csv \
  --privacy_unit user_id \
  --continuous_partitions '{"age": [0, 18, 30, 50, 100]}'

It is also possible to describe group of columns information (like after grouping by a list of columns) to have their metadata

# Define column groups
python make_metadata_from_data.py data.csv \
  --privacy_unit user_id \
  --column_groups '[["age", "income"], ["city", "country"]]'
# Set default contribution level
python make_metadata_from_data.py data.csv \
  --privacy_unit user_id \
  --default_contributions_level table

# Column-specific contribution overrides
python make_metadata_from_data.py data.csv \
  --privacy_unit user_id \
  --fine_contributions_level '{"age": "column", "income": "partition"}'

Save output to specific file

python make_metadata_from_data.py data.csv \
  --privacy_unit user_id \
  --output my_metadata.json

Notes

  • Datetime columns are automatically inferred using pandas.to_datetime.
  • Numeric bounds are computed only for non-string columns.
  • Contribution levels control per-privacy-unit contribution constraints.
  • Dependency detection may increase runtime on large datasets.
  • Output is a JSON-serializable CSVW-SAFE metadata structure.

Future plans:

  • Allow a DP vs non-DP mode (with/without) DP attributes
  • Allow finer contribution level descrition (for now column level is very broad)

2. make_dummy_from_metadata.py

Purpose

Generate a synthetic dummy dataset from CSVW-SAFE metadata.

The generator creates structured data that follows the declared metadata constraints, including:

  • Column datatypes
  • Numeric and categorical partitions
  • Optional dependency structure between columns
  • Nullable proportions
  • Column-group constraints (when provided)

Important: This tool produces synthetic structural data only.
It does not preserve semantic meaning or real-world correlations beyond what is encoded in metadata.

Output Guarantees

The generated dataset:

  • Respects declared column schema (datatypes)
  • Respects partition definitions (categorical + continuous)
  • Respects numeric bounds when defined
  • Applies nullable proportions per column
  • Optionally respects column-group partition constraints
  • Produces reproducible results via random seed

Typical Use Cases

  • Unit testing of CSVW-SAFE and DP pipelines
  • Schema validation without real data access
  • Debugging metadata-driven transformations
  • Synthetic data generation for integration tests

CLI Usage Examples

Basic example with 100 rows:

# Basic
python make_dummy_from_metadata.py metadata.json --output dummy.csv

Set a seed (seed=42) and a number of rows (rows=1000) for a reproducible example:

python make_dummy_from_metadata.py metadata.json \
  --rows 1000 \
  --seed 42 \
  --output dummy.csv

3. validate_metadata.py

Purpose

Validate a CSVW-SAFE metadata file against the internal metadata schema.

This tool ensures that a metadata file is structurally correct and conforms to the expected CSVW-SAFE specification as defined by the internal TableMetadata model.

It is primarily used as a validation step before using metadata for:

  • dummy dataset generation
  • DP pipeline configuration
  • downstream schema-driven processing

This validator performs schema-level validation only, including:

  • Required fields presence
  • Type correctness
  • Structural consistency of metadata objects
  • Compatibility with the TableMetadata model

Validation is implemented via a Pydantic model (TableMetadata.from_dict). See TableMetadata.md for more detailed explanation of the underlying pydantic model used to validate the metadata.

Output behaviour:

  • If metadata is valid → script exits silently (no output)
  • If metadata is invalid → raises a validation exception and exits with error

CLI Usage

python validate_metadata.py metadata.json

4. validate_metadata_shacl.py

Purpose

Validate CSVW-SAFE metadata using a SHACL constraint schema.

This tool performs structural validation of metadata expressed in JSON-LD format against a SHACL shapes graph defined in Turtle format.

It is the most strict validation layer in the CSVW-SAFE toolchain, intended to ensure full compliance with RDF-based constraints.

Validation Scope

This validator checks:

  • RDF structural consistency of metadata (JSON-LD parsing)
  • Constraint satisfaction against SHACL shapes
  • Class/property-level restrictions defined in the schema
  • Cross-field structural rules defined in the SHACL graph

Unlike validate_metadata.py, this tool performs formal SHACL validation, not just schema validation.

Python usage

python validate_metadata_shacl.py metadata.jsonld shapes.ttl

Validation output On success: SHACL validation SUCCESSFUL On failure: SHACL validation FAILED with a

Typical Use Cases

  • Formal compliance validation of CSVW-SAFE metadata
  • CI/CD enforcement of metadata correctness
  • Pre-deployment validation in RDF-based pipelines
  • Ensuring compatibility with external SHACL-aware systems

Notes

  • Metadata must be valid JSON-LD RDF
  • SHACL shapes must be valid Turtle RDF
  • This is the strictest validation layer
  • More expressive than Pydantic-based validation (validate_metadata.py)

5. assert_same_structure.py

Purpose

Verify that a generated dummy CSV preserves the structural properties of an original dataset under the CSVW-SAFE assumptions.

This tool ensures that synthetic data produced by make_dummy_from_metadata.py remains schema-compatible with the original dataset used to derive metadata.

This validator checks structure only. It does not assess statistical similarity or data realism.

The tool checks:

  • Column names and ordering
  • Inferred CSVW-SAFE datatypes
  • Nullability constraints (required vs optional columns)
  • Optional categorical domain compatibility (subset check)

It does not check:

  • Statistical similarity between datasets
  • Distributional properties
  • Correlation structure
  • Semantic correctness of values

Core Validation Logic

Ensures that both datasets share identical schema:

  • Same column names
  • Same column ordering

Each column is type-checked using CSVW-SAFE inference:

  • Datatypes are inferred via infer_xmlschema_datatype
  • Integer subtype differences are tolerated (e.g., small vs large integer variants)

Validates whether required/optional status is preserved:

  • A column is considered required if it has no missing values
  • Both datasets must agree on required vs optional status per column

If enabled, ensures:

  • All values in dummy dataset are a subset of original dataset values
  • Uses is_categorical() to detect categorical columns

CLI Usage

python assert_same_structure.py original.csv dummy.csv

Skip categorical validation

python validate_dummy_structure.py original.csv dummy.csv --no-categories

Typical Use Cases

  • Validate synthetic dataset generation correctness
  • Regression testing for metadata-driven pipelines
  • Ensuring structural integrity in DP synthetic data workflows
  • Debugging mismatches between metadata and generated datasets Notes
  • This tool is intentionally strict on schema alignment but lenient on integer type variations
  • Designed to validate synthetic structural fidelity, not realism
  • Works best in combination with: make_metadata_from_data.py and make_dummy_from_metadata.py

6. csvw_to_smartnoise_sql.py

Purpose

Convert CSVW-SAFE metadata into the format expected by SmartNoise SQL.

This script transforms a CSVW-SAFE JSON metadata file into a SmartNoise-compatible YAML configuration, enabling direct use in differential privacy queries with SmartNoise SQL.

The script maps CSVW-SAFE metadata into SmartNoise SQL structure:

  • Table-level privacy constraints:
    • max_contributionsmax_ids
  • Column definitions:
    • Datatypes (converted to SmartNoise types)
    • Nullability
    • Value bounds (minimum / maximumlower / upper)
    • Privacy identifier (privacy_idprivate_id)
  • Optional DP configuration parameters:
    • sampling, clamping, censoring, DPSU

Output Structure

The generated YAML follows SmartNoise SQL format:

"": 
  schema_name:
    table_name:
      max_ids: ...
      rows: ...
      sample_max_ids: ...
      censor_dims: ...
      clamp_counts: ...
      clamp_columns: ...
      use_dpsu: ...
      column_name:
        name: ...
        type: ...
        nullable: ...
        lower: ...
        upper: ...
        private_id: ...

CLI Usage

Basic conversion

python csvw_to_smartnoise_sql.py \
  --input metadata.json \
  --output snsql_metadata.yaml

With custom schema and table

python csvw_to_smartnoise_sql.py \
  --input metadata.json \
  --output snsql_metadata.yaml \
  --schema MySchema \
  --table MyTable

With DP configuration options

python csvw_to_smartnoise_sql.py \
  --input metadata.json \
  --output snsql_metadata.yaml \
  --sample_max_ids True \
  --censor_dims True \
  --clamp_columns True

7. csvw_to_opendp_context.py

Purpose

Create an OpenDP Context from CSVW-SAFE metadata and a dataset.

This script bridges CSVW-SAFE metadata with the OpenDP library by:

  • Converting metadata into OpenDP margins
  • Defining privacy units and privacy loss
  • Building a ready-to-use OpenDP Context for DP queries

The resulting OpenDP Context includes:

  • Privacy unit (based on max_contributions)
  • Privacy loss:
    • ε-DP (Laplace)
    • ρ-DP / zCDP (Gaussian)
  • Margins derived from CSVW metadata
  • Dataset (as a Polars LazyFrame)

Supported Privacy Models

Model Parameter
Laplace DP epsilon
Gaussian / zCDP rho
Approximate DP delta

You must provide either epsilon OR rho, not both.

CLI Usage

Basic conversion

import polars as pl
from csvw_safe.csvw_to_opendp_context import csvw_to_opendp_context

data = pl.scan_csv("data.csv")

context = csvw_to_opendp_context(
    csvw_meta=metadata,
    data=data,
    epsilon=1.0,
)

Typical Workflow

Via CLI

  1. Generate baseline metadata from the original dataset:
python make_metadata_from_data.py data.csv --id user_id --mode fine
  1. Review manually with a data expert and approve metadata for safety and governance compliance. Optionnaly after removing private information, run (to validate metadata format)
python scripts/validate_metadata_shacl.py metadata.json csvw-safe-constraints.ttl

and with shacl constraints:

python validate_metadata.py metadata.json --shacl csvw-safe-constraints.ttl
  1. Generate a dummy dataset from the approved metadata:
python make_dummy_from_metadata.py metadata.json --rows 1000 --output dummy.csv
  1. Verify that the dummy matches the original structure:
python assert_same_structure.py data.csv dummy.csv

Python API Workflow

import pandas as pd
from csvw_safe.make_metadata_from_data import make_metadata_from_data

df = pd.read_csv("data.csv")

# Generate metadata
metadata = make_metadata_from_data(df, csv_url="data.csv", individual_col="user_id")

MANUAL REVIEW OF METADATA. VERIFY ONLY PUBLIC INFORMATION. REMOVE OTHERWISE.

from csvw_safe.validate_metadata import validate_metadata
from csvw_safe.validate_metadata_shacl import validate_metadata_shacl
from csvw_safe.make_dummy_from_metadata import make_dummy_from_metadata
from csvw_safe.assert_same_structure import assert_same_structure

# Validate metadata
errors = validate_metadata(metadata)
errors = validate_metadata_shacl(metadata)

# Generate dummy dataset
dummy_df = make_dummy_from_metadata(metadata, nb_rows=500)

# Assert structure
assert_same_structure(df, dummy_df)

Directory Structure

examples/
└─ Notebooks.ipynb                      # Example notebooks demonstrating CSVW-SAFE workflows

src/csvw_safe/
    ├─ __init__.py                          # Package initializer for CSVW-SAFE library

    ├─ make_metadata_from_data.py          # Generate CSVW-SAFE metadata automatically from a dataset
    ├─ make_dummy_from_metadata.py         # Generate synthetic dummy datasets from CSVW-SAFE metadata
    ├─ validate_metadata.py                # Validate metadata using internal schema (TableMetadata model)
    ├─ validate_metadata_shacl.py          # Validate metadata using SHACL constraints via RDF graphs
    ├─ assert_same_structure.py            # Compare original and dummy CSVs for structural consistency

    ├─ csvw_to_opendp_context.py           # Convert CSVW-SAFE metadata into OpenDP analysis context
    ├─ csvw_to_opendp_margins.py           # Translate CSVW-SAFE metadata into OpenDP margin definitions
    ├─ csvw_to_smartnoise_sql.py           # Convert CSVW-SAFE metadata into SmartNoise SQL format

    ├─ generate_series.py                  # Generate synthetic column values based on metadata rules
    ├─ metadata_structure.py               # Core data models defining CSVW-SAFE metadata schema
    ├─ constants.py                        # Shared constants used across metadata pipeline
    ├─ datatypes.py                        # Datatype inference and CSVW-SAFE type utilities
    └─ utils.py                            # General helper utilities for metadata processing
tests/                                  # Unit and integration tests for CSVW-SAFE library

pyproject.toml                         # Project configuration and dependencies
README.md                              # Project overview and documentation entry point
run_linter.sh                          # Script to run linting and style checks

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

csvw_safe-0.0.1.tar.gz (55.1 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

csvw_safe-0.0.1-py3-none-any.whl (40.7 kB view details)

Uploaded Python 3

File details

Details for the file csvw_safe-0.0.1.tar.gz.

File metadata

  • Download URL: csvw_safe-0.0.1.tar.gz
  • Upload date:
  • Size: 55.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.12

File hashes

Hashes for csvw_safe-0.0.1.tar.gz
Algorithm Hash digest
SHA256 55f6e7f3d1a5deeb71152c301ee3b3fe675cdb8fbddf177c948f0493eebbc2fb
MD5 02b554c6a44a31d16f1a69462083c2dd
BLAKE2b-256 85ef46bd0e64adaebb4b341f58761d923c01aebb9abc28705b75ea84ef0e1844

See more details on using hashes here.

File details

Details for the file csvw_safe-0.0.1-py3-none-any.whl.

File metadata

  • Download URL: csvw_safe-0.0.1-py3-none-any.whl
  • Upload date:
  • Size: 40.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.12

File hashes

Hashes for csvw_safe-0.0.1-py3-none-any.whl
Algorithm Hash digest
SHA256 88e1488ff73979ebba6b60d138ab1bcbe72f34d77d80bbfd71b5472b33a92312
MD5 11fb0676c8de6afaad5d5e6a1a501639
BLAKE2b-256 179896eef65302d12f5762c8d580aa873c43c758143d8334ccef554ddb273d33

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page