microdata-tools

Tools for the microdata.no platform

Installation

microdata-tools can be installed from PyPI using pip:

pip install microdata-tools

Usage

Once you have your metadata and data files ready to go, they should be named and stored like this:

my-input-directory/
    MY_DATASET_NAME/
        MY_DATASET_NAME.csv
        MY_DATASET_NAME.json

The CSV file is optional in some cases.
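Before calling any of the functions below, it can help to confirm that a dataset directory follows this layout. The following is a minimal sketch using only the standard library; `describe_layout` is a hypothetical helper and not part of microdata-tools.

```python
from pathlib import Path


def describe_layout(input_directory: str, dataset_name: str) -> dict[str, bool]:
    """Report which parts of the expected layout exist on disk.

    Hypothetical helper for illustration only; not part of microdata-tools.
    """
    dataset_dir = Path(input_directory) / dataset_name
    return {
        # my-input-directory/MY_DATASET_NAME/
        "directory": dataset_dir.is_dir(),
        # my-input-directory/MY_DATASET_NAME/MY_DATASET_NAME.json
        "metadata": (dataset_dir / f"{dataset_name}.json").is_file(),
        # my-input-directory/MY_DATASET_NAME/MY_DATASET_NAME.csv (optional in some cases)
        "data": (dataset_dir / f"{dataset_name}.csv").is_file(),
    }
```

Calling `describe_layout("my-input-directory", "MY_DATASET_NAME")` returns a dictionary of booleans, so a missing CSV file can be told apart from a missing metadata file.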

Package dataset

The package_dataset() function will encrypt and package your dataset as a tar archive. The process is as follows:

  1. Generate the symmetric key for a dataset.
  2. Encrypt the dataset data (CSV) using the symmetric key and store the encrypted file as <DATASET_NAME>.csv.encr.
  3. Encrypt the symmetric key using the asymmetric RSA public key microdata_public_key.pem and store the encrypted file as <DATASET_NAME>.symkey.encr.
  4. Gather the encrypted CSV, encrypted symmetric key and metadata (JSON) file in one tar file.
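To make the last step concrete, here is a rough sketch of gathering the three files into one tar archive with the standard library. It is not the implementation used by package_dataset(): the helper name `gather_into_tar` is hypothetical, the encryption steps are left out (the encrypted files are assumed to exist already), and the flat, uncompressed archive layout is an assumption.

```python
import tarfile
from pathlib import Path


def gather_into_tar(dataset_name: str, source_dir: Path, output_dir: Path) -> Path:
    """Bundle already-encrypted files and the metadata file into <dataset_name>.tar.

    Hypothetical helper for illustration; package_dataset() does the real work.
    """
    output_dir.mkdir(parents=True, exist_ok=True)
    tar_path = output_dir / f"{dataset_name}.tar"
    suffixes = [".csv.encr", ".symkey.encr", ".json"]
    with tarfile.open(tar_path, "w") as tar:
        for suffix in suffixes:
            file_path = source_dir / f"{dataset_name}{suffix}"
            if file_path.is_file():
                # Assumption: members are stored by file name only (flat archive).
                tar.add(file_path, arcname=file_path.name)
    return tar_path
```

Files that are absent are skipped rather than treated as errors, which matches the note that the CSV file is optional in some cases.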

Unpackage dataset

The unpackage_dataset() function will untar and decrypt your dataset using the microdata_private_key.pem RSA private key.

The packaged file has to have the <DATASET_NAME>.tar extension. Its contents should be as follows:

<DATASET_NAME>.json : Required metadata file.

<DATASET_NAME>.csv.encr : Optional encrypted dataset file.

<DATASET_NAME>.symkey.encr : Optional encrypted file containing the symmetric key used to decrypt the dataset file. Required if the .csv.encr file is present.
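The content rules above can be checked before unpackaging. The sketch below uses only the standard library; `check_package_contents` is a hypothetical helper, not part of microdata-tools, and it assumes the archive members are stored by file name only, without subdirectories.

```python
import tarfile
from pathlib import Path


def check_package_contents(tar_path: Path) -> list[str]:
    """Return a list of problems found in a packaged dataset.

    Hypothetical helper for illustration; an empty list means no problems found.
    """
    if tar_path.suffix != ".tar":
        return [f"Expected a .tar file, got: {tar_path.name}"]
    dataset_name = tar_path.stem
    with tarfile.open(tar_path) as tar:
        names = set(tar.getnames())
    problems = []
    # The metadata file is always required.
    if f"{dataset_name}.json" not in names:
        problems.append(f"Missing required file: {dataset_name}.json")
    # The symmetric key is required only when an encrypted CSV is present.
    has_csv = f"{dataset_name}.csv.encr" in names
    has_key = f"{dataset_name}.symkey.encr" in names
    if has_csv and not has_key:
        problems.append(f"Missing {dataset_name}.symkey.encr for the encrypted CSV")
    return problems
```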

Decryption uses the RSA private key located at RSA_KEY_DIR.

The packaged file is then stored in output_dir/archive/unpackaged after a successful run or output_dir/archive/failed after an unsuccessful run.

Example

Python script that uses an RSA public key named microdata_public_key.pem and packages a dataset:

from pathlib import Path
from microdata_tools import package_dataset

RSA_KEYS_DIRECTORY = Path("tests/resources/rsa_keys")
DATASET_DIRECTORY = Path("tests/resources/input_package/DATASET_1")
OUTPUT_DIRECTORY = Path("tests/resources/output")

package_dataset(
    rsa_keys_dir=RSA_KEYS_DIRECTORY,
    dataset_dir=DATASET_DIRECTORY,
    output_dir=OUTPUT_DIRECTORY,
)

Validation

Once you have your metadata and data files ready to go, they should be named and stored like this:

my-input-directory/
    MY_DATASET_NAME/
        MY_DATASET_NAME.csv
        MY_DATASET_NAME.json

Note that filenames may only contain uppercase letters A-Z, digits 0-9 and underscores.
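The naming rule can be expressed as a regular expression. This is a small sketch, not the check that validate_dataset() performs internally; the helper name is hypothetical, and it assumes the rule applies to the dataset name, that is, the filename without its extension.

```python
import re

# Assumption: the rule covers the dataset name (the filename without .csv/.json).
DATASET_NAME_PATTERN = re.compile(r"[A-Z0-9_]+")


def is_valid_dataset_name(name: str) -> bool:
    """Return True if name consists only of A-Z, 0-9 and underscores."""
    # fullmatch requires the whole string to match, so empty names are rejected.
    return DATASET_NAME_PATTERN.fullmatch(name) is not None
```

Under this pattern `MY_DATASET_NAME` is accepted, while `my_dataset_name` and `MY-DATASET` are rejected.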

Import microdata-tools in your script and validate your files:

from microdata_tools import validate_dataset

validation_errors = validate_dataset(
    "MY_DATASET_NAME",
    input_directory="path/to/my-input-directory"
)

if not validation_errors:
    print("My dataset is valid")
else:
    print("Dataset is invalid :(")
    # You can print your errors like this:
    for error in validation_errors:
        print(error)

For a more in-depth explanation of usage visit the usage documentation.

Data format description

A dataset as defined in microdata consists of one data file and one metadata file.

The data file is a CSV file separated by semicolons. A valid example would be:

000000000000001;123;2020-01-01;2020-12-31;
000000000000002;123;2020-01-01;2020-12-31;
000000000000003;123;2020-01-01;2020-12-31;
000000000000004;123;2020-01-01;2020-12-31;
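As an illustration of the semicolon-separated layout, the example rows can be read with Python's csv module. This is only a sketch: `read_rows` is a hypothetical helper, and the meaning of each column is not assumed here (see the data format documentation for that).

```python
import csv
import io

SAMPLE = """000000000000001;123;2020-01-01;2020-12-31;
000000000000002;123;2020-01-01;2020-12-31;
"""


def read_rows(text: str) -> list[list[str]]:
    """Split semicolon-separated text into rows of string fields."""
    reader = csv.reader(io.StringIO(text), delimiter=";")
    # Skip completely empty lines; keep every field as a string so that
    # leading zeros in the first field are preserved.
    return [row for row in reader if row]


rows = read_rows(SAMPLE)
print(rows[0])
```

Because each line ends with a semicolon, every row has five fields and the last one is an empty string.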

Read more about the data format and columns in the documentation.

The metadata files should be in JSON format. The requirements for the metadata are best described through the Pydantic model, the examples, and the metadata model.

Contribute

Set up

To work on this repository you need to install uv:

# macOS / linux / BashOnWindows
curl -LsSf https://astral.sh/uv/install.sh | sh

# Windows powershell
powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"

Then install the virtual environment from the root directory:

uv sync

Running unit tests

Open a terminal, go to the root directory of the project and run:

uv run pytest

Pre-commit

There are currently 3 active rules: Ruff-format, Ruff-lint and sync lock file. Install pre-commit:

pip install pre-commit

If you've made changes to the .pre-commit-config.yaml file or it's a new project, install the hooks with:

pre-commit install

Now it should run when you do:

git commit

By default it only runs against changed files. To force the hooks to run against all files:

pre-commit run --all-files

If you don't have pre-commit installed on your system, you can run it through uv instead (but then it won't run automatically when you use the git CLI):

uv run pre-commit

Read more about pre-commit
