Cape manages secure access to all of your data.
Cape Privacy offers data scientists and data engineers a policy-based interface for applying privacy-enhancing techniques across several popular libraries and frameworks to protect sensitive data throughout the data science life cycle.
Cape Python brings Cape's policy language to Pandas and Apache Spark, enabling you to collaborate on privacy-preserving policy at a non-technical level. The supported techniques include tokenization with linkability as well as perturbation and rounding. You can experiment with these techniques programmatically in Python or declaratively in human-readable policy files. Stay tuned for more privacy-enhancing techniques in the future!
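As a rough illustration of what "tokenization with linkability" means, the sketch below derives tokens from a keyed hash, so equal inputs map to equal tokens and records can still be joined after tokenization. It uses only the standard library. The function name `linkable_token` and the choice of HMAC-SHA256 are assumptions made for illustration and are not necessarily how Cape Python implements its tokenizer.

```python
import hashlib
import hmac


def linkable_token(value: str, key: bytes, max_token_len: int = 10) -> str:
    """Return a deterministic token for `value` under `key` (illustrative only)."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    # Truncating shortens the token but raises the chance of collisions.
    return digest[:max_token_len]


key = b"my secret"
names = ["alice", "bob", "alice"]
tokens = [linkable_token(n, key) for n in names]

# Equal inputs yield equal tokens, so records remain linkable.
print(tokens[0] == tokens[2])  # True
```

Changing the key changes every token, which is why the key should be treated as a secret.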
See below for instructions on how to get started or visit the documentation.
Getting Started
Cape Python is available via PyPI.
pip install cape-privacy
Support for Apache Spark is optional. If you plan on using the library together with Apache Spark, we suggest the following instead:
pip install cape-privacy[spark]
We recommend running it in a virtual environment, such as venv.
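After installing, a quick check like the following can confirm which pieces are importable in the active environment. This is a minimal sketch using only the standard library. It assumes the package imports as `cape_privacy` and that the Spark extra pulls in `pyspark`; the helper `available_modules` is hypothetical.

```python
import importlib.util


def available_modules(names):
    """Map each top-level module name to whether it can be found for import."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


status = available_modules(["cape_privacy", "pandas", "pyspark"])
for name, found in status.items():
    print(f"{name}: {'installed' if found else 'missing'}")
```

If `pyspark` shows as missing, the Pandas features should still be usable, since Spark support is optional.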
Installing from source
It is also possible to install the library from source.
git clone https://github.com/capeprivacy/cape-python.git
cd cape-python
make bootstrap
This will also install all dependencies, including Apache Spark. Make sure you have make installed before running the above.
Example
(this example is an abridged version of the tutorial found here)
To discover what the different transformations do and how you might use them, it is best to explore them via the transformations API:
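import pandas as pd
from cape_privacy.pandas import dtypes
from cape_privacy.pandas.transformations import NumericPerturbation
from cape_privacy.pandas.transformations import Tokenizer

df = pd.DataFrame({
    "name": ["alice", "bob"],
    "age": [34, 55],
    "birthdate": [pd.Timestamp(1985, 2, 23), pd.Timestamp(1963, 5, 10)],
})

# Replace names with keyed tokens and add bounded integer noise to ages.
tokenize = Tokenizer(max_token_len=10, key=b"my secret")
perturb_numeric = NumericPerturbation(dtype=dtypes.Integer, min=-10, max=10)
df["name"] = tokenize(df["name"])
df["age"] = perturb_numeric(df["age"])
print(df.head())
# >>
#          name  age  birthdate
# 0  f42c2f1964   34 1985-02-23
# 1  2e586494b2   63 1963-05-10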
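To build intuition for what perturbation and rounding do to a column, here is a standard-library sketch. It assumes perturbation means adding noise drawn uniformly from [min, max] and that a seed makes the noise reproducible. The helpers `perturb_integers` and `round_values` are hypothetical and are not Cape's implementations.

```python
import random


def perturb_integers(values, min_noise, max_noise, seed=None):
    """Add integer noise drawn uniformly from [min_noise, max_noise] to each value."""
    rng = random.Random(seed)
    return [v + rng.randint(min_noise, max_noise) for v in values]


def round_values(values, precision):
    """Round each value to `precision` decimal places."""
    return [round(v, precision) for v in values]


ages = [34, 55]
noisy = perturb_integers(ages, -10, 10, seed=4984)
# Each perturbed age stays within 10 of the original.
print(all(abs(n - a) <= 10 for n, a in zip(noisy, ages)))  # True
# The same seed reproduces the same noise.
print(noisy == perturb_integers(ages, -10, 10, seed=4984))  # True
print(round_values([3.14159, 2.71828], 2))  # [3.14, 2.72]
```

Without a seed the noise differs on every run, so the perturbed values in any printed output should be read as one possible result.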
These steps can be saved in policy files so you can share them and collaborate with your team:
# my-policy.yaml
label: my-policy
version: 1
rules:
  - match:
      name: age
    actions:
      - transform:
          type: numeric-perturbation
          dtype: Integer
          min: -10
          max: 10
          seed: 4984
  - match:
      name: name
    actions:
      - transform:
          type: tokenizer
          max_token_len: 10
          key: my secret
You can then load this policy and apply it to your data frame:
import cape_privacy as cape

# df can be a Pandas or Spark data frame
policy = cape.parse_policy("my-policy.yaml")
df = cape.apply_policy(policy, df)
print(df.head())
# >>
#          name  age  birthdate
# 0  f42c2f1964   34 1985-02-23
# 1  2e586494b2   63 1963-05-10
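Conceptually, applying a policy means walking its rules, matching each rule to a column by name, and running the listed transformations in order. The sketch below mimics that flow on plain dictionaries. The registry, `apply_rules`, and the toy `mask` and `clamp` transformations are hypothetical; they only loosely mirror the policy layout shown above and do not reflect Cape's internals.

```python
def mask(values, keep=1):
    """Keep the first `keep` characters of each string and star out the rest."""
    return [v[:keep] + "*" * (len(v) - keep) for v in values]


def clamp(values, low, high):
    """Limit each number to the range [low, high]."""
    return [max(low, min(high, v)) for v in values]


# Hypothetical registry mapping a transformation type to its function.
REGISTRY = {"mask": mask, "clamp": clamp}

policy = {
    "label": "toy-policy",
    "rules": [
        {"match": {"name": "name"},
         "actions": [{"transform": {"type": "mask", "keep": 1}}]},
        {"match": {"name": "age"},
         "actions": [{"transform": {"type": "clamp", "low": 18, "high": 50}}]},
    ],
}


def apply_rules(policy, columns):
    """Return a copy of `columns` with each matching rule's actions applied."""
    result = dict(columns)
    for rule in policy["rules"]:
        column = rule["match"]["name"]
        if column not in result:
            continue  # rule targets a column this data set does not have
        for action in rule["actions"]:
            args = dict(action["transform"])
            func = REGISTRY[args.pop("type")]
            result[column] = func(result[column], **args)
    return result


data = {"name": ["alice", "bob"], "age": [34, 55]}
print(apply_rules(policy, data))
# {'name': ['a****', 'b**'], 'age': [34, 50]}
```

Keeping the rules as data rather than code is what lets a policy live in a shareable file while the transformation logic stays in the library.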
You can see more examples and usage here or by visiting our documentation.
Contributing and Bug Reports
Please file feature requests and bug reports as GitHub issues.
License
Licensed under Apache License, Version 2.0 (see LICENSE or http://www.apache.org/licenses/LICENSE-2.0). Copyright as specified in NOTICE.
About Cape
Cape Privacy helps teams share data and make decisions for safer and more powerful data science. Learn more at capeprivacy.com.
Hashes for cape_privacy-0.1.1rc0-py3-none-any.whl
Algorithm | Hash digest |
---|---|
SHA256 | 5ec55d0c5acc3326ecd739048456ab0e05ca584da57bb549a7db468a19cfed6a |
MD5 | 9d0c2b8384fa4923db96897afa77d89d |
BLAKE2b-256 | 3e1e395238918b54259ef3967b39b84c7b279979ef419d764b685be164d049b0 |