
A PySpark package used to expedite and standardise the data linkage process


DLH_utils


A Python package produced by the Linkage Development team from the Data Linkage Hub at the Office for National Statistics (ONS). It contains a set of functions used to expedite and streamline the data linkage process.

Its key features include:

  • scalability to large datasets, using Spark as a big-data backend
  • profiling and flagging functions used to describe and highlight issues in data
  • standardisation and cleaning functions to make data comparable ahead of linkage
  • linkage functions to derive linkage variables and join data together efficiently

Please log an issue on the issue board, or contact any of the active contributors, if you have any problems or suggestions for improvement.

Installation steps

DLH_utils supports Python 3.6+. To install the latest version, simply run:

pip install dlh_utils

Or, if using CDSW, in a terminal session run:

pip3 install dlh_utils

The -U argument can be used to upgrade the package to its newest version:

pip3 install -U dlh_utils
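To confirm which version ended up in the active environment after installing or upgrading, a small check along the following lines may help. This is a minimal sketch: the helper name `installed_version` is illustrative and not part of dlh_utils, and it relies on `importlib.metadata`, which is only in the standard library from Python 3.8 onwards.

```python
from importlib import metadata  # standard library from Python 3.8


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        # The package is not installed in the environment running this code.
        return None
```

Calling `installed_version("dlh_utils")` returns the version string if the package is installed, and `None` otherwise.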

Demo

For a worked demonstration notebook of these functions being applied within a data linkage context, head over to our separate demo repository.

Contributing

This repository adheres to PEP 8 coding standards. The repository's pre-commit hooks can check for these automatically when you make new commits. To get this working:

  • pip install both 'flake8' and 'pre-commit'
  • install the git hook scripts by running pre-commit install
  • when you add new git commits, the pre-commit hooks will now run and suggest the changes needed to adhere to PEP 8

Common issues

When using the jaro/jaro_winkler functions the error "no module called Jellyfish found" is thrown

These functions depend on the Jellyfish package, which may not be installed on the executors used in your Spark session. Try submitting Jellyfish to your SparkContext via addPyFile(), or set the following environment variables in your CDSW engine settings (ONS only):

  • PYSPARK_DRIVER_PYTHON = /usr/local/bin/python3.6
  • PYSPARK_PYTHON = /opt/ons/virtualenv/miscMods_v4.04/bin/python3.6
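Before changing any settings, it can be useful to check whether the Python interpreter used by Spark tasks can actually find Jellyfish. The following is a minimal sketch, not part of dlh_utils: the helper names `module_available` and `executors_with_module` are illustrative, `spark` is assumed to be an existing SparkSession, and the tasks are not guaranteed to land on every executor, so treat the result as an indication only.

```python
import importlib.util


def module_available(module_name):
    """Return True if the current Python interpreter can find `module_name`."""
    return importlib.util.find_spec(module_name) is not None


def executors_with_module(spark, module_name="jellyfish", num_tasks=8):
    """Run `module_available` inside Spark tasks and collect one result per task.

    `spark` is assumed to be an existing SparkSession. Spark decides where
    each task runs, so this samples the executors rather than checking all.
    """
    rdd = spark.sparkContext.parallelize(range(num_tasks), num_tasks)
    return rdd.map(lambda _: module_available(module_name)).collect()
```

If any entry in the list returned by `executors_with_module(spark)` is `False`, at least one task ran in an interpreter that cannot import Jellyfish.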

Using the cluster function

The cluster function uses GraphFrames, which requires an extra JAR file dependency to be submitted to your Spark context in order for it to run.

We have published a graphframes-wrapper package on PyPI that contains this JAR file. This is included in the package requirements as a dependency.

If you are outside ONS and this dependency doesn't work, you will need to submit the GraphFrames JAR file to your Spark context yourself. The JAR can be found here:

https://repos.spark-packages.org/graphframes/graphframes/0.6.0-spark2.3-s_2.11/graphframes-0.6.0-spark2.3-s_2.11.jar

Once downloaded, this can be submitted to your Spark context by adding this parameter to your SparkSession config:

spark.conf.set('spark.jars', path_to_jar_file)
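Spark generally reads `spark.jars` when the session starts, so if the setting above appears to have no effect, one option is to supply it on the builder before the session is created. The following is a minimal sketch under that assumption: the helper names `spark_jars_value` and `build_session` and the application name are illustrative and not part of dlh_utils.

```python
import os


def spark_jars_value(*jar_paths):
    """Join local JAR paths into the comma-separated value that spark.jars expects."""
    missing = [path for path in jar_paths if not os.path.isfile(path)]
    if missing:
        raise FileNotFoundError("JAR file(s) not found: " + ", ".join(missing))
    return ",".join(os.path.abspath(path) for path in jar_paths)


def build_session(path_to_jar_file, app_name="dlh_utils_cluster"):
    """Create a SparkSession with the GraphFrames JAR attached at start-up."""
    # Imported here so the path helper above can be used without pyspark installed.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .appName(app_name)  # hypothetical application name
        .config("spark.jars", spark_jars_value(path_to_jar_file))
        .getOrCreate()
    )
```

Note that `getOrCreate()` returns any session that is already running, in which case the JAR setting may not be picked up; stopping the existing session first is one way around this.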

Thanks

Thanks to all those in the Data Linkage Hub, Data Engineering and Methodology at ONS who have contributed to this repository.

Any questions?

If you need any additional help, or have any feedback on the package, please contact the Data Linkage Hub at Linkage.Hub@ons.gov.uk.

