Cache Spark Dataframes for Jupyter

Project description

Defines a %%sparkcache cell magic for the IPython notebook that caches Spark DataFrames and the outputs of long-running computations in persistent Parquet files on Hadoop. Useful when some computations in a notebook take a long time and you want an easy way to save their results to a file.

Based on the ipycache module.
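For orientation, the manual pattern that the magic replaces can be sketched roughly as follows. This is only an illustration, not the module's actual implementation: `load_or_compute`, `compute`, and `path_exists` are hypothetical names, and the caller is assumed to supply a function that checks whether a path exists on the Hadoop file system. Only the standard PySpark calls `spark.read.parquet` and `DataFrame.write.mode(...).parquet` are used.

```python
def load_or_compute(spark, path, compute, path_exists):
    """Return a DataFrame cached at `path`, computing and saving it if needed.

    spark       -- an active SparkSession
    path        -- Parquet location, e.g. somewhere under /user/$USER/sparkcache
    compute     -- hypothetical zero-argument function building the DataFrame
    path_exists -- hypothetical function (path -> bool) supplied by the caller
    """
    if path_exists(path):
        # Cache hit: skip the long computation and read the Parquet data.
        return spark.read.parquet(path)

    # Cache miss: run the computation once and persist the result.
    df = compute()
    df.write.mode("overwrite").parquet(path)
    return df
```

With %%sparkcache, this bookkeeping is handled per cell for every variable named on the magic line.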

Installation

  • pip install isparkcache

Usage

  • In IPython/Jupyter:

    %load_ext isparkcache
    
  • Then, create a cell with:

    %%sparkcache df1 df2
    
    df = ...
    df1 = sql.createDataFrame(df)
    df2 = sql.createDataFrame(df)
    
  • The first time you execute this cell, the code runs and the dataframes df1 and df2 are saved in /user/$USER/sparkcache/mysparkapplication/df1 and /user/$USER/sparkcache/mysparkapplication/df2. When you execute the cell again, the code is skipped, the dataframes are loaded from the Parquet files and injected into the namespace, and the outputs are restored in the notebook.

  • Use the --force or -f option to force the cell’s execution and overwrite the file.

  • Use the --read or -r option to prevent the cell’s execution and always load the variables from the cache. An exception is raised if the file does not exist.

  • Use the --cachedir or -d option to specify the cache directory. Default directory: /user/$USER/sparkcache. You can specify a default directory in the IPython configuration file in your profile (typically ~/.ipython/profile_default/ipython_config.py) by adding the following line:

    c.SparkCacheMagics.cachedir = "/path/to/mycache"

If both a default cache directory and the --cachedir option are given, the latter is used.
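The option rules above can be summarised in a small, Spark-free sketch. It is an illustration of the documented behaviour, not the module's code: the function names are hypothetical, the middle path component is assumed to be the Spark application name (as in the mysparkapplication example), FileNotFoundError is an assumed exception type, and letting --force win when both --force and --read are given is an assumption the documentation does not cover.

```python
import posixpath

# Default cache directory as stated in the documentation above.
DEFAULT_CACHEDIR = "/user/$USER/sparkcache"


def resolve_cachedir(option=None, configured=None):
    """Pick the cache directory; the --cachedir option beats the configured default."""
    if option:
        return option
    if configured:
        return configured
    return DEFAULT_CACHEDIR


def cache_path(cachedir, app_name, var_name):
    """Build <cachedir>/<app_name>/<var_name> (assumed layout, see lead-in)."""
    return posixpath.join(cachedir, app_name, var_name)


def decide(cache_exists, force=False, read=False):
    """Return "execute" or "load" for a cell, following the documented options."""
    if force:
        # --force: always run the cell and overwrite the cache.
        # Assumption: --force takes priority if --read is also given.
        return "execute"
    if read:
        # --read: never run the cell; a missing cache is an error.
        if not cache_exists:
            raise FileNotFoundError("--read was given but the cache does not exist")
        return "load"
    # Default: load when cached, otherwise run the cell and save.
    return "load" if cache_exists else "execute"
```

For example, `cache_path(resolve_cachedir(), "mysparkapplication", "df1")` gives the first path quoted in the usage notes.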

Download files

Source Distribution

isparkcache-0.1.12.tar.gz (18.3 kB)
