A PySpark package used to expedite and standardise the data linkage process
DLH_utils
A Python package produced by the Linkage Development team from the Data Linkage Hub at the Office for National Statistics (ONS), containing a set of functions used to expedite and streamline the data linkage process.
Its key features include:
- scalability to large datasets, using Spark as a big-data backend
- profiling and flagging functions used to describe and highlight issues in data
- standardisation and cleaning functions to make data comparable ahead of linkage
- linkage functions to derive linkage variables and join data together efficiently
Please log an issue on the issue board or contact any of the active contributors with any issues or suggestions for improvements you have.
Installation steps
DLH_utils supports Python 3.6+. To install the latest version, simply run:
pip install dlh_utils
Using the cluster function
The cluster function uses GraphFrames, which requires an extra JAR file dependency to be submitted to your Spark context in order for it to run.
We have published a graphframes-wrapper package on PyPI that contains this JAR file. This is included in the package requirements as a dependency.
If you are outside of ONS and this dependency doesn't work, you will need to submit the GraphFrames JAR file dependency to your Spark context yourself. This can be found here:
Once downloaded, this can be submitted to your Spark context by adding this parameter to your SparkSession config:
spark.conf.set('spark.jars', path_to_jar_file)
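As a hedged illustration of the step above: Spark generally reads `spark.jars` when the context starts, so one option is to supply the path while the session is being built rather than afterwards. The sketch below assumes PySpark is installed and that you have a local copy of the JAR; `graphframes_spark_conf` and `build_session` are hypothetical helper names used only for this example and are not part of DLH_utils.

```python
from pathlib import Path


def graphframes_spark_conf(path_to_jar_file):
    """Build the Spark option that points at a local GraphFrames JAR.

    Hypothetical helper for illustration; it is not part of dlh_utils.
    """
    jar = Path(path_to_jar_file)
    if jar.suffix != ".jar":
        raise ValueError(f"expected a .jar file, got: {jar}")
    return {"spark.jars": str(jar)}


def build_session(path_to_jar_file, app_name="dlh_utils_cluster"):
    """Create (or reuse) a SparkSession with the JAR option applied."""
    # Imported here so the helper above can be used without PySpark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in graphframes_spark_conf(path_to_jar_file).items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

Calling `build_session(path_to_jar_file)` before any other session exists gives Spark the JAR path at start-up. If a session is already running, `getOrCreate()` returns that session, so it may need to be stopped first for the option to take effect.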
Thanks
Thanks to all those in the Data Linkage Hub, Data Engineering and Methodology at ONS who have contributed towards this repository.