A PySpark package used to expedite and standardise the data linkage process

DLH utils

A package produced by the linkage development team from the Data Linkage Hub, containing a set of functions used to expedite and streamline the data linkage process.

Thanks to everyone in the Data Linkage Hub and Methodology who has contributed to this repository.

If you find a problem or have a suggestion for an improvement, please log an issue on the issue board or contact any of the active contributors.

Installation steps

  • click the 'clone' button on the project homepage and copy the project's HTTP address
  • open a terminal session within CDSW and run git clone [http_address]
  • the project files will now be copied to your local file structure, within a folder called "dlh_utils"
  • install the package, typically by running either !pip3 install '/home/cdsw/dlh_utils' in a workbench/Jupyter notebook session, or pip3 install '/home/cdsw/dlh_utils' in a terminal. Note: the filepath shown in this example may differ depending on where you have cloned the project.
  • all finished! You can now import modules from the dlh_utils package like any other Python library
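To confirm that the installation steps above worked, you can ask Python which version of the package it can see. The sketch below is only an illustration: the helper name installed_version is hypothetical, and it assumes the distribution is registered with pip under the name dlh_utils.

```python
from importlib import metadata


def installed_version(package_name="dlh_utils"):
    """Return the installed version string, or None if the package is not found."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version()
    if version is None:
        print("dlh_utils is not installed in this environment")
    else:
        print(f"dlh_utils {version} is installed")
```

If the helper returns None, check that pip3 ran in the same Python environment as your session.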

This package is a work in progress! We will notify you of significant changes to the package. To upgrade to the latest version, clone the project from GitLab again and run either !pip3 install -U '[path_to_dlh_utils]' in workbench, or pip3 install -U '[path_to_dlh_utils]' in a terminal.
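If you upgrade often, the same pip call can be scripted from Python. This is a minimal sketch rather than part of the package: pip_install_command and install_from_clone are hypothetical helper names, and the path you pass in is whichever folder you cloned the project to.

```python
import subprocess
import sys


def pip_install_command(package_path, upgrade=False):
    """Build the pip command for installing a local clone of the package."""
    command = [sys.executable, "-m", "pip", "install"]
    if upgrade:
        command.append("-U")  # same flag as in the manual upgrade step
    command.append(package_path)
    return command


def install_from_clone(package_path, upgrade=True):
    """Run pip against the cloned folder and raise if pip reports an error."""
    subprocess.run(pip_install_command(package_path, upgrade), check=True)
```

Calling install_from_clone('/home/cdsw/dlh_utils') would then run the upgrade with the interpreter that is currently active, which avoids installing into a different Python environment by accident.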

Using the cluster function

The cluster function uses GraphFrames, which requires an extra JAR file dependency to be submitted to your Spark context before it can run.

At ONS, we have a graphframes-wrapper package that contains this JAR file. It is included in the package requirements as an optional dependency. To install it and use GraphFrames, run !pip3 install dlh_utils[full]
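After installing the optional dependency, a quick check that the GraphFrames Python module can be found may save a confusing error later. The sketch below assumes the module is importable under the name graphframes, which is the import name used by the open-source GraphFrames Python package. The helper module_available is hypothetical.

```python
from importlib.util import find_spec


def module_available(module_name):
    """Return True if a top-level module can be located without importing it."""
    return find_spec(module_name) is not None


if __name__ == "__main__":
    if module_available("graphframes"):
        print("graphframes module found")
    else:
        print("graphframes module not found; the cluster function is unlikely to run")
```

This only checks the Python side. The JAR file still has to reach the Spark context for GraphFrames to work.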

If you are outside ONS, you will need to submit the GraphFrames JAR file dependency to your Spark context yourself. The JAR file can be found here:

https://repos.spark-packages.org/graphframes/graphframes/0.6.0-spark2.3-s_2.11/graphframes-0.6.0-spark2.3-s_2.11.jar
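The JAR can also be fetched from a script. The sketch below uses only the Python standard library. The helper names jar_filename and download_jar are hypothetical, and it assumes your environment allows outbound HTTPS requests. Judging by its file name, this build targets Spark 2.3 and Scala 2.11.

```python
import os
import posixpath
from urllib.parse import urlsplit
from urllib.request import urlretrieve

GRAPHFRAMES_JAR_URL = (
    "https://repos.spark-packages.org/graphframes/graphframes/"
    "0.6.0-spark2.3-s_2.11/graphframes-0.6.0-spark2.3-s_2.11.jar"
)


def jar_filename(url=GRAPHFRAMES_JAR_URL):
    """Extract the file name from the last segment of the URL path."""
    return posixpath.basename(urlsplit(url).path)


def download_jar(target_dir=".", url=GRAPHFRAMES_JAR_URL):
    """Download the JAR into target_dir and return the local file path."""
    destination = os.path.join(target_dir, jar_filename(url))
    urlretrieve(url, destination)
    return destination
```

download_jar() is only defined here, not called, so nothing is downloaded until you invoke it yourself.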

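Once downloaded, the JAR can be submitted to your Spark context through the spark.jars setting: spark.conf.set('spark.jars', path_to_jar_file). Spark reads spark.jars when the context starts, so if your session is already running you may need to supply the setting when the session is built instead.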
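One way to make sure the JAR is picked up is to pass spark.jars while the session is being built. The following is a sketch under assumptions, not the package's own set-up code: graphframes_spark_conf and build_session are hypothetical helpers, the app name is arbitrary, and pyspark must be installed for build_session to work.

```python
import os


def graphframes_spark_conf(jar_path):
    """Return the Spark setting that ships the GraphFrames JAR with the session."""
    return {"spark.jars": os.fspath(jar_path)}


def build_session(jar_path, app_name="dlh_utils_cluster"):
    """Create (or reuse) a SparkSession configured with the GraphFrames JAR."""
    # Imported here so the helper above can be used without pyspark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in graphframes_spark_conf(jar_path).items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

Note that getOrCreate() returns any session that already exists, in which case the spark.jars value may not take effect. Stopping the existing session first is the usual workaround.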

Download files

Source Distribution

dlh_utils-0.2.0.tar.gz (52.2 kB)

Uploaded Source

Built Distribution

dlh_utils-0.2.0-py3-none-any.whl (56.7 kB)

Uploaded Python 3
