PySpark implementation of the Open Privacy Preserving Record Linkage protocol.

Spindle Token

The open source implementation of the Open Privacy Preserving Record Linkage (OPPRL) protocol, built on Spark.

Rationale

Privacy Preserving Record Linkage (PPRL) is a crucial component of data de-identification systems. PPRL obfuscates identifying attributes or other sensitive information about the subjects described in the records of a dataset while still preserving the ability to link records pertaining to the same subject through the use of an encrypted token. This practice is sometimes referred to as "tokenization".

The task of PPRL is to replace the attributes of every record that denote Personally Identifiable Information (PII) with a token produced by a one-way cryptographic function. This prevents observers of the tokenized data from obtaining the PII. The tokens are produced deterministically so that input records with the same, or similar, PII attributes produce an identical token. This allows practitioners to associate records across datasets that are highly likely to belong to the same data subject without having access to PII.
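The idea of a deterministic, one-way token can be illustrated with a small sketch. This is not the OPPRL algorithm and not the spindle-token API: the `normalize` and `toy_token` functions, the choice of fields, and the use of plain SHA-256 are all assumptions made for illustration. An unkeyed hash like this would be open to dictionary attacks in practice, which is one reason real tokenization involves secret keys.

```python
import hashlib
import unicodedata


def normalize(value: str) -> str:
    """Reduce superficial differences: Unicode form, case, and whitespace."""
    folded = unicodedata.normalize("NFKC", value).casefold()
    return " ".join(folded.split())


def toy_token(first_name: str, last_name: str, birth_date: str) -> str:
    """Hash normalized PII fields into a hex token (illustrative only)."""
    fields = [normalize(first_name), normalize(last_name), normalize(birth_date)]
    # Join with a unit separator so field boundaries stay unambiguous.
    payload = "\x1f".join(fields).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


a = toy_token("Ada", "Lovelace", "1815-12-10")
b = toy_token("  ada ", "LOVELACE", "1815-12-10")  # same person, messier input
c = toy_token("Ada", "Byron", "1815-12-10")        # different last name

print(a == b)  # True: similar PII normalizes to the same token
print(a == c)  # False: different PII gives a different token
```

The two records for the same subject can be linked by comparing tokens, while the token itself does not reveal the name or birth date.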

Tokenization is also used when data is shared between organizations to limit, or in some cases fully mitigate, the risk of subject re-identification in the event that an untrusted third party gains access to a dataset containing sensitive data. Each party produces encrypted tokens using a different secret key so that any compromised data asset is, at worst, only matchable to other datasets maintained by the same party. During data sharing transactions, a specific "transcrypt" data flow is used to re-encrypt the sender's tokens into ephemeral tokens that do not match tokens in any other dataset and can only be ingested using the recipient's secret key. At no point in the "transcrypt" data flow is the original PII used.
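The key-isolation property described above can be sketched with a keyed hash. This example swaps in HMAC-SHA256 from the Python standard library purely for illustration. It does not reproduce the encryption used by OPPRL or the transcrypt flow, and the `keyed_token` function, the demo keys, and the record format are hypothetical.

```python
import hashlib
import hmac


def keyed_token(secret_key: bytes, normalized_pii: str) -> str:
    """Derive a token that depends on both the PII and the party's secret key."""
    return hmac.new(secret_key, normalized_pii.encode("utf-8"), hashlib.sha256).hexdigest()


# Hypothetical demo keys; real keys would be generated and stored securely.
party_a_key = b"party-a-demo-key"
party_b_key = b"party-b-demo-key"
record = "ada|lovelace|1815-12-10"

token_a1 = keyed_token(party_a_key, record)
token_a2 = keyed_token(party_a_key, record)
token_b = keyed_token(party_b_key, record)

print(token_a1 == token_a2)  # True: one party can link its own datasets
print(token_a1 == token_b)   # False: another party's tokens do not match
```

Because the tokens held by the two parties differ for the same subject, a leaked dataset from one party cannot be joined directly against the other party's data.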

spindle-token is the canonical implementation of the Open Privacy Preserving Record Linkage (OPPRL) protocol. This protocol presents a standardized methodology for tokenization that can be implemented in any data system to increase interoperability. The spindle-token implementation is a Python library that uses Apache Spark to distribute tokenization workloads across multiple cores or multiple machines in a high performance computing cluster, so that datasets of any scale can be tokenized efficiently.

For Spark-backed tokenization and transcrypt workflows, install the optional Spark extra:

pip install "spindle-token[spark]"

Use transcrypt_out() and transcrypt_in() for data-sharing workflows. The older transcode_out() and transcode_in() names remain available as deprecated compatibility aliases.

The base spindle-token package remains importable without Spark so non-Spark environments and serverless dependency checks do not need to pull in PySpark.
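An application that supports both modes might check for PySpark before choosing a code path. The sketch below uses only the standard library. The `spark_available` helper is a hypothetical name, not part of spindle-token.

```python
import importlib.util


def spark_available() -> bool:
    """Return True if PySpark can be found in this environment."""
    return importlib.util.find_spec("pyspark") is not None


if spark_available():
    print("PySpark found: Spark-backed workflows can be used.")
else:
    print('PySpark not found: install "spindle-token[spark]" for Spark-backed workflows.')
```

`importlib.util.find_spec` locates the package without importing it, so the check itself stays cheap in serverless or minimal environments.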

Applications that only need OPPRL token metadata can inspect supported V2 token names without Spark:

from spindle_token.opprl.metadata import get_opprl_v2_tokens

token_columns = {token.token_id: token.name for token in get_opprl_v2_tokens()}

The pre-v1.0 versions of this library were published under the name "carduus" and the deprecated APIs can be found here.

Getting Started

See the getting started guide on the project's web page for a detailed explanation of how spindle-token is used, including example code snippets.

The full API and an example usage on Databricks are also provided on the project's web page.

Migrating from OPPRL V1 to V2

New integrations should use OpprlV2. The current examples in this repository use V2 unless they are explicitly historical.

If you already have code that imports OpprlV1, the migration path is usually a small mechanical change:

  1. Replace OpprlV1 imports with OpprlV2.
  2. Keep the same attribute mapping and token selection.
  3. Re-run your tokenization and transcrypt tests against your existing data.

OpprlV2 preserves the V1 token structure and normalization rules, but it canonicalizes the private key before deriving the AES key. That means V2 is the right choice when you want the same logical protocol behavior with stable results across equivalent PEM encodings of the same RSA key.
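To see why canonicalization matters, consider a rough sketch of how two textually different PEM files can carry the same key bytes. This is an illustration under stated assumptions, not the derivation that OpprlV2 performs: the helper names, the placeholder key bytes, and the use of SHA-256 as the derivation step are all hypothetical.

```python
import base64
import hashlib


def pem_to_der(pem_text: str) -> bytes:
    """Drop armor lines and whitespace, then decode the Base64 body."""
    body = []
    for raw_line in pem_text.splitlines():
        line = raw_line.strip()
        if line and not line.startswith("-----"):
            body.append(line)
    return base64.b64decode("".join(body))


def derive_key(pem_text: str) -> bytes:
    """Toy derivation: SHA-256 over the decoded bytes yields 32 bytes."""
    return hashlib.sha256(pem_to_der(pem_text)).digest()


def make_pem(der: bytes, width: int, newline: str) -> str:
    """Build a PEM-style text with a chosen line width and newline style."""
    b64 = base64.b64encode(der).decode("ascii")
    lines = [b64[i:i + width] for i in range(0, len(b64), width)]
    armor = ["-----BEGIN PRIVATE KEY-----", *lines, "-----END PRIVATE KEY-----"]
    return newline.join(armor) + newline


fake_der = bytes(range(96))  # placeholder bytes, not a real RSA key
pem_64 = make_pem(fake_der, 64, "\n")
pem_76_crlf = make_pem(fake_der, 76, "\r\n")

print(pem_64 == pem_76_crlf)                          # False: the text differs
print(derive_key(pem_64) == derive_key(pem_76_crlf))  # True: same decoded bytes
```

Deriving a key from the raw PEM text would give different results for these two files, while deriving it from a canonical form gives the same result for both.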

If you need byte-for-byte compatibility with historical V1 token outputs, keep using OpprlV1 for those flows.

Independent Security Review

Spindle engaged Echelon Risk + Cyber, an independent, leading cybersecurity firm, to conduct an audit of the Spindle Token implementation. The review evaluated and confirmed alignment with industry best practices in secure software development, cryptographic algorithm selection, and tokenization methods.

View the full report.

This report reflects Echelon Risk + Cyber’s independent professional opinion as of the date of review and does not constitute a warranty, certification, or guarantee of future performance or security. The scope of the assessment was limited to the artifacts, documentation, and workshop discussions reviewed through May 30, 2025, and does not extend to undisclosed code, subsequent modifications, or third-party dependencies.

Contributing

Please refer to the spindle-token contributing guide for information on how to get started contributing to the project.

Organizations that have contributed to spindle-token

Spindle Health
Echelon Risk + Cyber
