
Cape manages secure access to all of your data.


Cape Dataframes


A Python library supporting data transformations and collaborative privacy policies for data science projects in Pandas and Apache Spark.

See below for instructions on how to get started or visit the documentation.

Getting started

Prerequisites

  • Python 3.6 or above, and pip
  • Pandas 1.0+
  • PySpark 3.0+ (if using Spark)
  • Make (if installing from source)
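The Python and Pandas prerequisites above can be checked with a few lines of Python before installing. This is a minimal sketch, not part of Cape Dataframes: the helper names `parse_version` and `meets_minimum` are hypothetical, and the version parsing is deliberately simple (it stops at the first non-numeric component).

```python
import sys


def parse_version(version):
    """Turn a version string such as '1.5.3' into a tuple of ints (1, 5, 3).

    Parsing stops at the first component that is not purely numeric,
    so '2.0.0rc1' becomes (2, 0).
    """
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def meets_minimum(version, minimum):
    """Return True if `version` is at least `minimum` (a tuple of ints)."""
    return parse_version(version) >= minimum


# Python 3.6 or above is listed as a prerequisite.
print("Python OK:", sys.version_info >= (3, 6))

# Pandas 1.0+ is listed as a prerequisite; report clearly if it is missing.
try:
    import pandas as pd
    print("Pandas OK:", meets_minimum(pd.__version__, (1, 0)))
except ImportError:
    print("Pandas is not installed")
```

The same `meets_minimum` call could be applied to `pyspark.__version__` with a minimum of `(3, 0)` if you plan to use Spark.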

Install with pip

Cape Dataframes is available through PyPI.

pip install cape-dataframes

Support for Apache Spark is optional. If you plan on using the library together with Apache Spark, we suggest the following instead:

pip install "cape-dataframes[spark]"

We recommend running it in a virtual environment, such as venv.

Install from source

It is possible to install the library from source. This installs all dependencies, including Apache Spark:

git clone https://github.com/capeprivacy/cape-dataframes.git
cd cape-dataframes
make bootstrap

Usage example

This example is an abridged version of the project's tutorial.

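import pandas as pd
# The import paths below assume the package is importable as cape_dataframes.
from cape_dataframes.pandas import dtypes
from cape_dataframes.pandas.transformations import NumericPerturbation, Tokenizer

df = pd.DataFrame({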
    "name": ["alice", "bob"],
    "age": [34, 55],
    "birthdate": [pd.Timestamp(1985, 2, 23), pd.Timestamp(1963, 5, 10)],
})

tokenize = Tokenizer(max_token_len=10, key=b"my secret")
perturb_numeric = NumericPerturbation(dtype=dtypes.Integer, min=-10, max=10)

df["name"] = tokenize(df["name"])
df["age"] = perturb_numeric(df["age"])

print(df.head())
# >>
#          name  age  birthdate
# 0  f42c2f1964   34 1985-02-23
# 1  2e586494b2   63 1963-05-10

These steps can be saved in policy files so you can share them and collaborate with your team:

# my-policy.yaml
label: my-policy
version: 1
rules:
  - match:
      name: age
    actions:
      - transform:
          type: numeric-perturbation
          dtype: Integer
          min: -10
          max: 10
          seed: 4984
  - match:
      name: name
    actions:
      - transform:
          type: tokenizer
          max_token_len: 10
          key: my secret

You can then load this policy and apply it to your data frame:

# Assumes the package is importable as cape_dataframes.
import cape_dataframes as cape

# df can be a Pandas or Spark data frame
policy = cape.parse_policy("my-policy.yaml")
df = cape.apply_policy(policy, df)

print(df.head())
# >>
#          name  age  birthdate
# 0  f42c2f1964   34 1985-02-23
# 1  2e586494b2   63 1963-05-10
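To make the rule structure concrete, the sketch below mimics how a policy's `match` and `transform` entries could be dispatched to columns. It uses a plain dict in place of the YAML file and a dict of lists in place of a data frame. This is a conceptual illustration only and not how `cape.apply_policy` is implemented; `apply_policy_sketch`, `TRANSFORMS`, and the `toy-round` and `toy-mask` transform types are hypothetical names invented for this example.

```python
def toy_round(values, precision):
    """Round each numeric value to `precision` decimal places."""
    return [round(v, precision) for v in values]


def toy_mask(values, keep):
    """Keep the first `keep` characters of each string and mask the rest."""
    return [v[:keep] + "*" * (len(v) - keep) for v in values]


# Registry mapping a transform `type` to the function that implements it.
TRANSFORMS = {"toy-round": toy_round, "toy-mask": toy_mask}


def apply_policy_sketch(policy, columns):
    """Return a copy of `columns` with each matching rule's transforms applied.

    `columns` maps column names to lists of values. Columns without a
    matching rule pass through unchanged.
    """
    result = {name: list(values) for name, values in columns.items()}
    for rule in policy["rules"]:
        column = rule["match"]["name"]
        if column not in result:
            continue  # the rule targets a column this data set lacks
        for action in rule["actions"]:
            spec = dict(action["transform"])
            func = TRANSFORMS[spec.pop("type")]
            # Remaining keys in the transform spec become keyword arguments.
            result[column] = func(result[column], **spec)
    return result


# Same layout as the YAML policy: label, version, and a list of rules.
policy = {
    "label": "toy-policy",
    "version": 1,
    "rules": [
        {"match": {"name": "age"},
         "actions": [{"transform": {"type": "toy-round", "precision": -1}}]},
        {"match": {"name": "name"},
         "actions": [{"transform": {"type": "toy-mask", "keep": 1}}]},
    ],
}

data = {"name": ["alice", "bob"], "age": [34, 57], "city": ["Oslo", "Rome"]}
masked = apply_policy_sketch(policy, data)
print(masked)
```

Keeping the rules as data rather than code is what makes a policy shareable: the same file can be reviewed, versioned, and applied by anyone on the team without changing the pipeline.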

You can see more examples and usage or read our documentation.

About Cape Privacy and Cape Dataframes

Cape Privacy empowers developers to easily encrypt data and process it confidentially. No cryptography or key management required. Learn more at capeprivacy.com.

Cape Dataframes brings Cape's policy language to Pandas and Apache Spark. The supported techniques include tokenization with linkability as well as perturbation and rounding. You can experiment with these techniques programmatically in Python or through human-readable policy files.
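Linkability means that the same input always maps to the same token under a given key, so tokenized columns can still be joined or grouped. The sketch below illustrates the idea with a keyed hash (HMAC-SHA256) from the Python standard library, plus a seeded integer perturbation. Both are assumptions chosen for illustration and are not necessarily the algorithms Cape Dataframes uses; `tokenize_sketch` and `perturb_sketch` are hypothetical names.

```python
import hashlib
import hmac
import random


def tokenize_sketch(value, key, max_token_len=10):
    """Map `value` to a hex token using HMAC-SHA256 keyed with `key`."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:max_token_len]


def perturb_sketch(values, low, high, seed=None):
    """Add uniform integer noise drawn from [low, high] to each value."""
    rng = random.Random(seed)
    return [v + rng.randint(low, high) for v in values]


key = b"my secret"
tokens = [tokenize_sketch(name, key) for name in ["alice", "bob", "alice"]]

# Linkability: both occurrences of "alice" receive the same token,
# while "bob" receives a different one.
print(tokens[0] == tokens[2], tokens[0] != tokens[1])

# A fixed seed makes the perturbation reproducible across runs.
first = perturb_sketch([34, 55], -10, 10, seed=4984)
second = perturb_sketch([34, 55], -10, 10, seed=4984)
print(first == second)
```

Using a secret key rather than a plain hash matters here: without the key, anyone could recompute tokens for guessed names and reverse the mapping by lookup.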

Project status and roadmap

Cape Python 0.1.1 was released on 24 June 2020. It is actively maintained and developed alongside other elements of the Cape ecosystem.

Upcoming features:

  • Reversible tokenization: allow a token to be reversed to reveal the raw value.
  • Expanded pipeline integrations: add support for Apache Beam, Apache Flink, Apache Arrow Flight or Dask as an additional pipeline, either as part of Cape Dataframes or as a separate project.

Help and resources

If you need help using Cape Dataframes, you can consult the documentation or ask in the Cape Community Discord.

Please file feature requests and bug reports as GitHub issues.

Contributing

View our contributing guide for more information.

Code of conduct

Our code of conduct is included on the Cape Privacy website. All community members are expected to follow it. Please refer to that page for information on how to report problems.

License

Licensed under Apache License, Version 2.0 (see LICENSE or http://www.apache.org/licenses/LICENSE-2.0). Copyright as specified in NOTICE.


