A PySpark package used to expedite and standardise the data linkage process
DLH utils
A package produced by the linkage development team from the Data Linkage Hub, containing a set of functions used to expedite and streamline the data linkage process.
Thanks to all those in the Data Linkage Hub and Methodology that have contributed towards this repository.
Please log an issue on the issue board, or contact any of the active contributors, with any issues or suggestions for improvement.
Installation steps
- click the 'clone' button on the project homepage and copy the project's HTTP address
- open a terminal session within CDSW and run
git clone [http_address]
- the project files will now be copied into your local file structure, within a folder called "dlh_utils"
- you can now install the package, typically by running either
!pip3 install '/home/cdsw/dlh_utils'
in a workbench/Jupyter notebook session, or
pip3 install '/home/cdsw/dlh_utils'
in a terminal.
Note: the filepath shown in this example may differ depending on where you have cloned the project.
- all finished! You can now import modules from the dlh_utils package like any other Python library
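To confirm the installation succeeded, you can check that Python can find the package before importing it. This is a minimal sketch using only the standard library; `dlh_utils` will only be found once the install step above has completed:

```python
import importlib.util

def is_installed(package_name):
    """Return True if `package_name` can be found on the current Python path."""
    return importlib.util.find_spec(package_name) is not None

# After a successful install this should report True:
print(is_installed("dlh_utils"))
```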
This package is a work in progress! We will notify you of significant changes to the package. To upgrade to the latest version, clone the project from GitLab again and run either !pip3 install -U '[path_to_dlh_utils]'
in workbench, or pip3 install -U '[path_to_dlh_utils]'
in a terminal.
Using the cluster function
The cluster function uses Graphframes, which requires an extra JAR file dependency to be submitted to your spark context in order for it to run.
At ONS, we have a graphframes-wrapper package that contains this JAR file. This is included in the package requirements
as an optional dependency. To install this and use graphframes, run !pip3 install 'dlh_utils[full]'
If outside of ONS, you will need to submit the graphframes JAR file dependency to your spark context yourself. The JAR can be downloaded from the graphframes package repository.
Once downloaded, the JAR must be supplied when the Spark session is built (setting spark.jars on an already-running session has no effect), for example: spark = SparkSession.builder.config('spark.jars', path_to_jar_file).getOrCreate()
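For context, the clustering step is a connected-components computation: records linked by any chain of pairwise matches receive the same cluster ID, which is what Graphframes provides at scale. A minimal pure-Python sketch of the same idea, using a hypothetical list of matched record-ID pairs (not the dlh_utils API itself):

```python
def cluster_ids(edges):
    """Assign a cluster ID to every record ID connected by the given match pairs.

    `edges` is an iterable of (id_a, id_b) matched pairs; returns a dict
    mapping each record ID to a representative cluster ID (the smallest
    ID in its connected component).
    """
    # Union-find with path compression.
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Attach the larger root under the smaller so the smallest ID wins.
            if root_b < root_a:
                root_a, root_b = root_b, root_a
            parent[root_b] = root_a

    return {node: find(node) for node in parent}

# Records 1-2 and 2-3 chain into one cluster; 4-5 form another.
clusters = cluster_ids([(1, 2), (2, 3), (4, 5)])
```

The cluster function runs this same computation distributed over Spark, which is why the Graphframes JAR must be on the classpath.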
Hashes for dlh_utils-0.2.0-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | a7aa3c12db6afd5718206450011956864f566965fbf2437beb77ea56f575d2d1
MD5 | f362f689e4e296187d5161776d561bf7
BLAKE2b-256 | d0efa16552240fdd0741c59aa9574ef484a01ef12a8ead63d0d84346773b81a3