PySpark implementation of the Open Privacy Preserving Record Linkage protocol.

Carduus

The open source implementation of the Open Privacy Preserving Record Linkage (OPPRL) protocol, built on Spark.

Warning: Carduus has not yet reached a v1.0 release, so its API and behavior are subject to change. See the contributing guide if you would like to help the project.

Rationale

Privacy Preserving Record Linkage (PPRL) is a crucial component of data de-identification systems. PPRL obfuscates identifying attributes and other sensitive information about the subjects described in the records of a dataset, while preserving the ability to link records pertaining to the same subject through the use of an encrypted token. This practice is sometimes referred to as "tokenization".

The task of PPRL is to replace the attributes of every record that denote Personally Identifiable Information (PII) with a token produced by a one-way cryptographic function. This prevents observers of the tokenized data from obtaining the PII. The tokens are produced deterministically, so input records with the same, or similar, PII attributes produce an identical token. This allows practitioners to associate records across datasets that are highly likely to belong to the same data subject without having access to PII.
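The idea of a deterministic, one-way token can be sketched in a few lines of plain Python. This is an illustration only: it is not the OPPRL token specification and does not use the carduus API. The function names `normalize` and `tokenize`, the choice of fields, the normalization rules, and the use of HMAC-SHA256 are all assumptions made for the example.

```python
import hashlib
import hmac
import unicodedata


def normalize(value: str) -> str:
    """Hypothetical normalization: strip accents, fold case, and drop
    punctuation and whitespace so that similar spellings collide."""
    decomposed = unicodedata.normalize("NFKD", value)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    return "".join(ch for ch in ascii_only.upper() if ch.isalnum())


def tokenize(secret_key: bytes, first_name: str, last_name: str, birth_date: str) -> str:
    """Return a keyed one-way token for a subject's PII (illustrative only).

    HMAC-SHA256 is deterministic for a given key and message, so equal
    normalized PII always yields the same token, and the digest cannot be
    reversed to recover the PII.
    """
    parts = (first_name, last_name, birth_date)
    message = "|".join(normalize(part) for part in parts)
    return hmac.new(secret_key, message.encode("utf-8"), hashlib.sha256).hexdigest()


if __name__ == "__main__":
    key = b"example-secret-key"  # placeholder; never hard-code a real key
    token_a = tokenize(key, "José", "O'Neil", "1990-01-02")
    token_b = tokenize(key, " jose ", "ONEIL", "19900102")
    print(token_a == token_b)  # both spellings normalize to the same message
```

Because the secret key is part of the computation, two parties using different keys produce unrelated tokens for the same subject, which is the property the next paragraph relies on.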

Tokenization is also used when data is shared between organizations to limit, or in some cases fully mitigate, the risk of subject re-identification in the event that an untrusted third party gains access to a dataset containing sensitive data. Each party produces encrypted tokens using a different secret key, so that any compromised data asset is, at worst, only matchable to other datasets maintained by the same party. During data sharing transactions, a specific "transcryption" data flow first re-encrypts the sender's tokens into ephemeral tokens that do not match tokens in any other dataset and can only be ingested using the recipient's secret key. At no point in the "transcryption" data flow is the original PII used.

Carduus is the first (and canonical) implementation of the Open Privacy Preserving Record Linkage (OPPRL) protocol. This protocol presents a standardized methodology for tokenization that can be implemented in any data system to increase interoperability. The Carduus implementation is a Python library that uses Apache Spark to distribute the tokenization workload across multiple cores, or across multiple machines in a high performance computing cluster, for efficient tokenization of datasets of any scale.

Why the name "Carduus"? Carduus is a genus of thistle plants that were used to brush fibrous materials so that the individual fibres align in preparation for spinning the material into thread or yarn. Today this process is known as "carding" and is done by specialized machines.

Getting Started

See the getting started guide on the project's web page for a detailed explanation of how Carduus is used, including example code snippets.

The full API reference and a usage example on Databricks are also provided on the project's web page.

Contributing

Please refer to the Carduus contributing guide for information on how to get started contributing to the project.

Organizations that have contributed to Carduus

Spindle Health
