
A library for encryption/decryption and analysis of sensitive data.

Project description

Introduction

py4phi is a simple solution for the complex problem of dealing with sensitive data.

In the modern IT world, sharing a dataset with sensitive data is common, especially if a team working on it is wide. It can be used for various purposes, including building a ML/DL model, simple business analysis, etc. Of course, in most companies, different restrictions are applied on the data, including row-level security, column hashing, or encrypting, but this requires at least some knowledge of data engineering libraries and can be a challenging and time-consuming task. At the same time, employees with access to sensitive parts of the data may not have such expertise, which is where py4phi can be helpful.


Functionality

py4phi offers the following functionality to solve the problem mentioned above and more:

  • Encrypt a dataset column-wise.
  • Decrypt a dataset column-wise.
  • Encrypt any folder or machine learning model.
  • Decrypt any folder or machine learning model.
  • Perform principal component analysis on a dataset.
  • Perform correlation analysis for feature selection on a dataset.

You can use py4phi both in Python code and through your terminal via the convenient CLI interface.

Setup and prerequisites

To install the library from PyPI, just run

pip install py4phi

py4phi is compatible with the following engines for data processing and encryption: PySpark, Pandas, and Polars.

The default engine for the CLI is Pandas, whereas for the library it is PySpark.

NOTE: You can skip the steps below and still use the Pandas or Polars engines if that suits your needs.

Pyspark installation

You'll need a JDK installed and a JAVA_HOME environment variable set in order to work with the PySpark engine on UNIX-like systems.
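On a UNIX-like system that setup might look like the following sketch (the JDK path is a placeholder; point it at your actual installation):

```shell
# Placeholder path -- substitute the directory where your JDK actually lives.
export JAVA_HOME="/usr/lib/jvm/java-17-openjdk"
export PATH="$JAVA_HOME/bin:$PATH"
# PySpark picks JAVA_HOME up from the environment of the Python process:
echo "JAVA_HOME=$JAVA_HOME"
```

Putting these lines in your shell profile makes the variable available in every session.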

Windows pyspark installation

Apart from the JDK and the JAVA_HOME variable, in order to work with the PySpark engine you will have to:

  1. Download and extract Spark 3.5.0+.
  2. Download winutils.exe and hadoop.dll.
  3. Add these files to a hadoop/bin folder.
  4. Set the HADOOP_HOME environment variable to point to $absolute_path/hadoop/bin.
  5. You may also need to set the SPARK_HOME and PYSPARK_PYTHON environment variables.

For more detailed info, please go through this guide.

API help

py4phi documentation is still TBD; however, the API is fairly simple, and most parameters are covered in the Usage section below and in the /examples folder.

Library API

py4phi.core currently exposes four main functions:

  • from_path - initialize a py4phi Controller (see below) from a file.
  • from_dataframe - initialize from a dataframe (Pandas/Polars/PySpark).
  • encrypt_model - encrypt any folder or ML model IN PLACE.
  • decrypt_model - decrypt any folder or ML model.

The Controller itself offers a lot of functionality, including encryption/decryption of dataframes, feature selection, and principal component analysis.

Terminal API

As previously mentioned, you can perform all the same actions from your terminal. To get a list of available commands, enter

py4phi --help

To get the options/parameters of a specific command (e.g. encrypt), enter

py4phi encrypt --help

Usage

You can integrate py4phi in your existing pandas/pyspark/polars data pipeline by initializing from a DataFrame or loading from a file. Currently, CSV and Parquet file types are supported.

Encryption and decryption of the datasets are facilitated by the use of configs. Each column gets its own encryption key and a nonce, which are saved in the configs. These resulting files can be further encrypted for even more safety.

Therefore, you can encrypt only sensitive columns, send the outputs, for example, to the data analysis team, and keep the data safe. Later, data can be decrypted using configs on-demand. Moreover, you do not need deep knowledge of the underlying engines (pandas, etc.) and don't need to write long scripts to encrypt data and save the keys.
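The per-column config idea can be sketched as follows. This is an illustrative, stdlib-only sketch of the concept, not py4phi's actual internals: the function and file names here are hypothetical.

```python
import base64
import json
import os

def make_column_configs(columns):
    """Generate a fresh random key and nonce for each column to encrypt."""
    return {
        col: {
            "key": base64.b64encode(os.urandom(32)).decode(),    # 256-bit key
            "nonce": base64.b64encode(os.urandom(12)).decode(),  # 96-bit nonce
        }
        for col in columns
    }

# One key/nonce pair per sensitive column, persisted so the data can be
# decrypted later (the config file itself can be encrypted as well).
configs = make_column_configs(["Staff involved", "ACF"])
with open("decrypt_configs.json", "w") as f:
    json.dump(configs, f, indent=2)
```

Because every column has its own key, you can hand out decryption material for one column without exposing the others.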

The following example showcases the encryption of a dataset.csv file (you can find it in the /examples folder). The output dataset, along with the decryption configs, is then saved to the "test_folder" directory under the CWD.

from py4phi.core import from_path, PYSPARK

controller = from_path(
    './dataset.csv',
    'csv',
    engine=PYSPARK,
    log_level='DEBUG',
    header=True  # pyspark read option
)
controller.print_current_df()
controller.encrypt(columns_to_encrypt=['Staff involved', 'ACF'])
controller.print_current_df()
controller.save_encrypted(
    output_name='my_encrypted_file',
    save_location='./test_folder/',   # results will be saved under CWD/test_folder/py4phi_encrypted_outputs
    save_format='PARQUET',
)

To decrypt these outputs, you can use:

import pandas as pd
from py4phi.core import from_dataframe

df = pd.read_parquet('./test_folder/py4phi_encrypted_outputs/my_encrypted_file.parquet')
controller = from_dataframe(
    df,
    log_level='DEBUG'
)
controller.print_current_df()
controller.decrypt(
    columns_to_decrypt=['Staff involved', 'ACF'],
    configs_path='./test_folder/py4phi_encrypted_outputs', 
)
controller.print_current_df()
controller.save_decrypted(
    output_name='my_decrypted_file',
    save_location='./test_folder',
    save_format='csv'
)

This example also shows how to initialize py4phi from a (pandas, in this case) DataFrame.

Similar workflow through a terminal can be executed with the following CLI commands:

py4phi encrypt-and-save -i ./dataset.csv -c ACF -c 'Staff involved' -e pyspark -p -o ./ -r header True
py4phi decrypt-and-save -i ./py4phi_encrypted_outputs/output_dataset.csv -e pyspark -c ACF -c 'Staff involved' -p -o ./ -r header True

To encrypt and decrypt a folder or an ML/DL model, you can use:

from py4phi.core import encrypt_model, decrypt_model

encrypt_model(
    './test_folder',
    encrypt_config=False  # or True
)

decrypt_model(
    './test_folder',
    config_encrypted=False  # or True
)

After encryption, all files within the specified folder will be unreadable. This enables easy one-line model encryption.

The same actions can be taken in a terminal:

# encrypt a model/folder without encrypting the config. Note that encryption is
# done in place, so please save the original before encrypting.
py4phi encrypt-model -p ./py4phi_encrypted_outputs/ -d

# decrypt a model/folder when the config is not encrypted
py4phi decrypt-model -p ./py4phi_encrypted_outputs/ -c
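The in-place folder encryption described above can be illustrated with a toy, stdlib-only sketch. This is NOT py4phi's actual cipher (real encryption should use a vetted AEAD primitive); it only demonstrates the walk-the-tree-and-rewrite-in-place pattern:

```python
import hashlib
from pathlib import Path

def _keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    """Derive a deterministic byte stream from key+nonce (toy construction)."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:length])

def toggle_folder(path: str, key: bytes, nonce: bytes) -> None:
    """XOR every file under `path` with the keystream, rewriting in place.

    XOR is self-inverse, so calling this twice with the same key and nonce
    restores the original files.
    """
    for file in Path(path).rglob("*"):
        if file.is_file():
            data = file.read_bytes()
            stream = _keystream(key, nonce, len(data))
            file.write_bytes(bytes(a ^ b for a, b in zip(data, stream)))
```

After the first pass every file in the tree is garbled; a second pass with the same material makes them readable again, which mirrors the encrypt-model/decrypt-model pair.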

Analytics usage

Apart from the main encrypt/decrypt functionality, one may be interested in reducing the dimensionality of a dataset or performing correlation analysis of the features (feature selection). In a typical scenario, this requires a lot of effort from a data analyst. Instead, a person with access to the sensitive data can perform lightweight PCA/feature selection in a couple of lines of code or terminal commands.

NOTE: This functionality provides a quick, top-level analysis; diving deeper into a dataset's features will always yield more insight.

To perform principal component analysis with Python, use:

from py4phi.core import from_path, PYSPARK
controller = from_path('Titanic.parquet', file_type='parquet', engine=PYSPARK)
controller.perform_pca(
    target_feature='Survived',
    ignore_columns=['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked'],
    save_reduced=False
)

Via terminal:

py4phi perform-pca -i ./dataset.csv  --target 'Staff involved' -c ACF

NOTE: To suggest candidate features for dropping, correlation analysis is leveraged. For categorical features, the Cramér's V measure is used to calculate correlation. It is heavily impacted by the dataset's size, so please consider ignoring string columns for feature selection unless your data has at least 100-200 rows. In general, keep in mind that this analysis is top-level.
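For intuition, Cramér's V can be computed from the chi-squared statistic of a contingency table. The following stdlib-only sketch illustrates the measure; it is not py4phi's internal implementation:

```python
import math
from collections import Counter

def cramers_v(x, y):
    """Cramér's V between two equal-length categorical sequences, in [0, 1]."""
    n = len(x)
    obs = Counter(zip(x, y))            # observed contingency counts
    row, col = Counter(x), Counter(y)   # marginal totals per category
    # Chi-squared statistic summed over every (row, column) cell, zeros included.
    chi2 = sum(
        (obs.get((xi, yi), 0) - row[xi] * col[yi] / n) ** 2
        / (row[xi] * col[yi] / n)
        for xi in row
        for yi in col
    )
    # V = sqrt(chi2 / (n * (min(rows, cols) - 1)))
    return math.sqrt(chi2 / (n * (min(len(row), len(col)) - 1)))
```

Perfectly associated columns score 1.0 and independent columns score 0.0; on tiny samples the estimate is noisy, which is why the note above recommends at least 100-200 rows.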

To perform feature selection with Python, use:

from py4phi.core import from_path, POLARS
controller = from_path('Titanic.parquet', file_type='parquet', engine=POLARS)
controller.perform_feature_selection(
    target_feature='Survived',
    target_correlation_threshold=0.2,
    features_correlation_threshold=0.2,
    drop_recommended=False
)

Via terminal:

py4phi feature-selection -i ./Titanic.parquet --target Survived --target_corr_threshold 0.3 --feature_corr_threshold 0.55

Please look into the /examples folder for more examples. It also contains respective demo datasets.

