PySpark implementation of the Open Privacy Preserving Record Linkage protocol.

Project description

Spindle Token

The open source implementation of the Open Privacy Preserving Record Linkage (OPPRL) protocol, built on Spark.

Rationale

Privacy Preserving Record Linkage (PPRL) is a crucial component of data de-identification systems. PPRL obfuscates identifying attributes or other sensitive information about the subjects described in the records of a dataset while preserving the ability to link records pertaining to the same subject through the use of an encrypted token. This practice is sometimes referred to as "tokenization".

The task of PPRL is to replace the attributes of every record that denote Personally Identifiable Information (PII) with a token produced by a one-way cryptographic function. This prevents observers of the tokenized data from obtaining the PII. The tokens are produced deterministically, so input records with the same, or similar, PII attributes produce an identical token. This allows practitioners to associate records across datasets that are highly likely to belong to the same data subject without having access to the PII.

Tokenization is also used when data is shared between organizations to limit, or in some cases fully mitigate, the risk of subject re-identification in the event that an untrusted third party gains access to a dataset containing sensitive data. Each party produces encrypted tokens using a different secret key so that any compromised data asset is, at worst, only matchable to other datasets maintained by the same party. During data sharing transactions, a specific "transcode" data flow is used to re-encrypt the sender's tokens into ephemeral tokens that do not match tokens in any other dataset and can only be ingested using the recipient's secret key. At no point in the "transcode" data flow is the original PII used.
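
The two properties described above, deterministic one-way tokens and a separate secret key per party, can be illustrated with a minimal sketch that uses only the Python standard library. This is not the OPPRL algorithm and does not use the spindle-token API: HMAC-SHA256 is a stand-in for the protocol's cryptography, and the field names, normalization rules, and keys are hypothetical choices made for illustration. The "transcode" flow is not shown.

```python
import hashlib
import hmac
import unicodedata


def normalize(first_name: str, last_name: str, birth_date: str) -> bytes:
    """Collapse superficial differences so similar PII yields the same bytes."""
    parts = []
    for value in (first_name, last_name, birth_date):
        # Assumed normalization: trim whitespace, strip accents, upper-case.
        decomposed = unicodedata.normalize("NFKD", value.strip())
        ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
        parts.append(ascii_only.upper())
    return "|".join(parts).encode("utf-8")


def make_token(secret_key: bytes, first_name: str, last_name: str, birth_date: str) -> str:
    """Return a deterministic, keyed, one-way token (hex-encoded HMAC-SHA256)."""
    message = normalize(first_name, last_name, birth_date)
    return hmac.new(secret_key, message, hashlib.sha256).hexdigest()


# Hypothetical secret keys held by two different organizations.
key_a = b"organization-a-secret"
key_b = b"organization-b-secret"

# Same subject, written slightly differently, tokenized by organization A.
token_1 = make_token(key_a, "José", "Smith", "1980-01-31")
token_2 = make_token(key_a, "  jose ", "SMITH", "1980-01-31")

# Same subject tokenized by organization B with its own key.
token_3 = make_token(key_b, "José", "Smith", "1980-01-31")

print(token_1 == token_2)  # records link within one organization's tokens
print(token_1 == token_3)  # tokens from different keys do not match
```

In this sketch the two records held by organization A receive the same token, while the token produced under organization B's key differs, so a leaked dataset from one party cannot be joined directly against the other's.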

spindle-token is the canonical implementation of the Open Privacy Preserving Record Linkage (OPPRL) protocol. This protocol presents a standardized methodology for tokenization that can be implemented in any data system to increase interoperability. The spindle-token implementation is a Python library that uses Apache Spark to distribute tokenization workloads across multiple cores, or across multiple machines in a high-performance computing cluster, for efficient tokenization of datasets of any scale.

The pre-v1.0 versions of this library were published under the name "carduus" and the deprecated APIs can be found here.

Getting Started

See the getting started guide on the project's web page for a detailed explanation of how spindle-token is used, including example code snippets.

The full API and an example usage on Databricks are also provided on the project's web page.

Security Audit

This project has received a security audit from Echelon Risk + Cyber who provided the following statement. More details on this security audit can be obtained from Echelon Risk + Cyber at this link.

Echelon Risk + Cyber certifies that as of May 30, 2025, The Spindle Token implementation and Open Privacy Preserving Record Linkage (OPPRL) Protocol exhibit a high degree of alignment with secure cryptographic standards and secure development practices. The use of FIPS-compliant algorithms (AES-GCM-SIV, RSA-OAEP, SHA2 family), layered encryption, and privacy preserving design patterns indicate strong foundational security. Note: This certification is issued in good faith, based on the materials available to the Echelon team at the time of the review.

Contributing

Please refer to the spindle-token contributing guide for information on how to get started contributing to the project.

Organizations that have contributed to spindle-token

Spindle Health
Echelon Risk + Cyber

Individuals that have contributed to spindle-token

Brian Fallik - @bfallik

Download files

Source Distribution

spindle_token-1.0.0.tar.gz (21.0 kB)

Uploaded Source

Built Distribution

spindle_token-1.0.0-py3-none-any.whl (24.8 kB)

Uploaded Python 3

File details

Details for the file spindle_token-1.0.0.tar.gz.

File metadata

  • Download URL: spindle_token-1.0.0.tar.gz
  • Upload date:
  • Size: 21.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/2.1.2 CPython/3.10.1 Darwin/24.6.0

File hashes

Hashes for spindle_token-1.0.0.tar.gz
Algorithm Hash digest
SHA256 672660e3aab1cf97a3753a6641c1b0d3f331a9c638f337fa578898e653c6ac2a
MD5 e6059a9e0971b49eb62a18696fea9db8
BLAKE2b-256 1878f158a54ea90e9ab9b2c173cdf546db80ae48d6ed2d57f3bf6917325241bf

File details

Details for the file spindle_token-1.0.0-py3-none-any.whl.

File metadata

  • Download URL: spindle_token-1.0.0-py3-none-any.whl
  • Upload date:
  • Size: 24.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/2.1.2 CPython/3.10.1 Darwin/24.6.0

File hashes

Hashes for spindle_token-1.0.0-py3-none-any.whl
Algorithm Hash digest
SHA256 7efcf36b080ed903d205c7e0b1dd21143bffa11073daa9a58c82377eb5049f79
MD5 3b3f626c858a9aab610e1e108bf2fdc1
BLAKE2b-256 0cbba481862e3c41f22e75b949ce687f8b2b04b1e0a6cb93c55fcd3a62583d85
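
The published digests can be used to check a downloaded file before installing it. The following is a minimal sketch using Python's standard hashlib module; the helper names and the local file path are assumptions for illustration, and the expected value is the SHA256 digest listed above for the source distribution.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 65536) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read fixed-size chunks so large files are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_digest(path: Path, expected_hex: str) -> bool:
    """Compare a file's SHA256 digest with a published hex digest."""
    return sha256_of_file(path) == expected_hex.strip().lower()


# SHA256 digest listed above for spindle_token-1.0.0.tar.gz.
EXPECTED_SDIST_SHA256 = "672660e3aab1cf97a3753a6641c1b0d3f331a9c638f337fa578898e653c6ac2a"

# Assumed location of the downloaded archive; adjust to where the file was saved.
archive = Path("spindle_token-1.0.0.tar.gz")
if archive.exists():
    print(matches_published_digest(archive, EXPECTED_SDIST_SHA256))
```

A result of False means the local file differs from the published one and should not be installed.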
