de_utils is a package that contains collections of useful functions grouped by theme (e.g. HDFS, Spark, Python) and a data engineering framework under the engineering_utils sub-package.
Project description
Documentation for the package (copy and paste the link below into Chrome or Edge):
file://fa1rpwapxx272/ons/ci/jenkins_home/userContent/de-utils/build/index.html
Introduction
de_utils is a package that contains collections of useful functions grouped by theme (e.g. HDFS, Spark, Python) and a data engineering framework under the engineering_utils sub-package.
This is an open-source project designed to be used by anyone; it does not follow the needs of a specific project or team. The functions here will be generalised for wider use. We will not make deprecations or backward-incompatible changes unless absolutely necessary, and any such change would be marked with a [major release](https://www.geeksforgeeks.org/introduction-semantic-versioning/).
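As a rough illustration of what a major release means under semantic versioning, the sketch below bumps a MAJOR.MINOR.PATCH version string. The helper name bump_version is hypothetical and is not part of de_utils, and the version number used is only an example input.

```python
def bump_version(version: str, part: str) -> str:
    """Return the next semantic version after bumping one part.

    Parameters
    ----------
    version : str
        A version string in MAJOR.MINOR.PATCH form, e.g. "0.11.1".
    part : str
        One of "major", "minor" or "patch".

    Returns
    -------
    str
        The bumped version string.

    Raises
    ------
    ValueError
        If part is not one of the three recognised names.
    """
    major, minor, patch = (int(piece) for piece in version.split("."))
    if part == "major":
        # Backward-incompatible change: minor and patch reset to zero.
        return f"{major + 1}.0.0"
    if part == "minor":
        # Backward-compatible new functionality: patch resets to zero.
        return f"{major}.{minor + 1}.0"
    if part == "patch":
        # Backward-compatible bug fix.
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"Unknown version part: {part!r}")


print(bump_version("0.11.1", "major"))  # 1.0.0
print(bump_version("0.11.1", "minor"))  # 0.12.0
print(bump_version("0.11.1", "patch"))  # 0.11.2
```

Under this scheme, a user pinning a version can expect that only a change in the first number may break existing code.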
How to get involved
To suggest an improvement or change, or to contribute code, please raise an issue or submit a merge request.
You don't have to submit a finished product; it can be:
- A function that is not fully built, either because you're stuck on a problem or because you want to collaborate with others on building it
- An idea for a function you need; someone else may already have built that function or be willing to help build it
GitLab issue example: it's worth keeping discussion of a GitLab issue in the comments section of the issue, rather than in a private chat on Teams, so that others can get involved.
Developer contribution guide
Setup SSH
This only needs to be done once across all CDSW projects, so skip this step if you've already done it.
- Open CDSW and navigate to the settings tab using the menu on the left side
- Click on the outbound SSH tab and copy your user public SSH key
- Open GitLab and navigate to preferences using the dropdown menu in the top right of the screen
- Click on the SSH Keys tab in the menu on the left side of the screen
- Paste your SSH key and give it a relevant title, e.g. cdsw
- Click Add key
Setup environment
- Clone the project by creating a new CDSW project, using git with the repo's SSH link for the initial setup
- Open the workbench and start a session
- Set up your environment variables for pip and Artifactory
- Open the terminal and run pip3 install -r requirements-dev.txt
- Run pre-commit install to install the pre-commit hooks (flake8, pydocstyle & darglint) used in development to ensure coding style consistency
Code Style Guide
Please follow the style of the existing codebase, which follows NumPy documentation style for Python code and docstrings.
The code should follow standard PEP8 guidelines. Further reading: how to write beautiful Python code.
Contributions should also:
- Use native Spark functions over UDFs where possible
- Use () for multi-line code instead of \ line continuation characters
- Use absolute imports and no wildcard imports, i.e. avoid from pandas import *
- Import pyspark functions as F
- Be designed to use the user's existing Spark session
- Document any library dependencies in the requirements.txt
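A minimal sketch of several of these conventions in plain Python: a NumPy-style docstring, parentheses rather than backslashes for a multi-line expression, and explicit, non-wildcard imports. The function count_by_theme is a hypothetical example and is not part of de_utils; the Spark-specific points are not covered here.

```python
from collections import Counter
from typing import Dict, List


def count_by_theme(
    function_names: List[str],
    separator: str = "_",
) -> Dict[str, int]:
    """Count public function names by their theme prefix.

    Parameters
    ----------
    function_names : list of str
        Names such as "hdfs_read", where the text before the separator
        is treated as the theme.
    separator : str
        The character that separates the theme from the rest of the name.

    Returns
    -------
    dict of str to int
        The number of public functions found for each theme.
    """
    # Parentheses let the expression span several lines without "\".
    themes = (
        name.split(separator)[0]
        for name in function_names
        if not name.startswith("_")  # skip private names
    )
    return dict(Counter(themes))


counts = count_by_theme(["hdfs_read", "hdfs_write", "spark_union", "_helper"])
print(counts)  # {'hdfs': 2, 'spark': 1}
```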
Development workflow
- Create a feature branch off the development (dev) branch and name it after the function or group of work being submitted:
  git checkout dev, git pull origin dev & git checkout -b your_branch
- Commit work regularly:
  git add -p filename or git add filename & git commit -m 'commit message' & git push origin
  Hint: git add -p ... allows you to cycle through individual changes before committing
- The pre-commit hooks will fail the commit if Flake8 or pydocstyle find issues with the code.
  - Each error will have a description and a code, which can be searched online for further information.
- If you have created any new functions or classes, make sure that they are publicly exposed.
  - In the module-level __init__.py file, import them and add a string representation of their name to __all__.
  - If you have created a new public module, include this in the __init__.py in the parent directory.
  - Ensure all private scripts and functions are named with a leading underscore.
  - Further reading around the use of the __init__.py file.
- Once the ticket is finished, create a merge request for 'your_branch' into development
- Go to GitLab to manually create a merge request, selecting the option that says "delete source branch once merge request is accepted"
- The merge request needs to be peer-reviewed by at least one colleague and have one 👍 before it is accepted
  - This means the changes have been carefully reviewed and the code is ready for merging
- Amend code in line with any comments from the peer review if needed
- Once merged into development, the core developers will perform a final review of submissions and merge development into master
- When merging to master, the package version number will be bumped up in line with semantic versioning
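The step above about exposing new functions can be sketched as follows. This is an illustration only: the package name demo_utils, the module strings.py and the functions clean_name and _strip are all hypothetical, and the package is written to a temporary directory purely so that the example runs on its own.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Hypothetical package created in a temporary directory for the demo.
package_root = Path(tempfile.mkdtemp())
package_dir = package_root / "demo_utils"
package_dir.mkdir()

# A public module with one public function and one private helper
# (the private helper is named with a leading underscore).
(package_dir / "strings.py").write_text(
    "def clean_name(name):\n"
    "    return _strip(name).lower()\n"
    "\n"
    "\n"
    "def _strip(name):\n"
    "    return name.strip()\n"
)

# The module-level __init__.py imports the public function with an
# absolute import and lists its name, as a string, in __all__.
(package_dir / "__init__.py").write_text(
    "from demo_utils.strings import clean_name\n"
    "\n"
    "__all__ = ['clean_name']\n"
)

# Make the temporary package importable and load it.
sys.path.insert(0, str(package_root))
demo_utils = importlib.import_module("demo_utils")

exposed = demo_utils.__all__
result = demo_utils.clean_name("  Sales_Region ")
print(exposed)  # ['clean_name']
print(result)   # sales_region
```

With this layout, users can call demo_utils.clean_name directly, while _strip stays out of the package's top-level namespace.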
Peer-reviews - things to look out for and ask yourself when reviewing code:
- Check code for any bugs and that the logic makes sense
- Can I easily understand what the code does?
- If I had to fix this, would I understand enough?
- Is the code sufficiently documented for me to understand it, and does it explain the why? "Code Tells You How, Comments Tell You Why"
- Do the naming conventions make sense, follow PEP8 and stay consistent with the rest of the project?
- Have any dependency changes been made, and has requirements.txt been updated?
Further reading on this topic
GitLab email notifications:
These are very useful but to save you from being distracted by GitLab emails from peer reviews and comments, create a rule to divert emails from gitlab@ons.gov.uk into their own folder. The easiest way to do this is:
- Select an email from GitLab and find the rules dropdown on the main options bar in Outlook.
- Click on always move emails from GitLab.
- A window will open with a tree view of all your Outlook folders. Select the main one right at the top and click the "New" button.
Be nice 😀
We'd like our contributors to feel valued, so if you have found something useful in this repo, whether a function or a new way of doing something, please let them know! They have used their time to share hard-earnt knowledge and we'd like them to come back.
(git blame shows the contributor - top right corner of each file on GitLab)
File details
Details for the file data_eng_utils-0.11.1.tar.gz.
File metadata
- Download URL: data_eng_utils-0.11.1.tar.gz
- Upload date:
- Size: 62.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.4.2 CPython/3.9.16 Linux/5.15.0-1037-azure
File hashes
Algorithm | Hash digest
---|---
SHA256 | 2137e532be9ca09b98fa48447795c949e69d882755e3f06866f3a87e6f4343db
MD5 | b0d0a3e2eb9cd8a4fe27d2b59e6b906a
BLAKE2b-256 | b8d3f4f7abdb97f810c35c7df6ea392b4711e644f16774824e2b311b8127303c
File details
Details for the file data_eng_utils-0.11.1-py3-none-any.whl.
File metadata
- Download URL: data_eng_utils-0.11.1-py3-none-any.whl
- Upload date:
- Size: 76.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.4.2 CPython/3.9.16 Linux/5.15.0-1037-azure
File hashes
Algorithm | Hash digest
---|---
SHA256 | 10b634dc08a94a52c37e74ca0d2c75d2871dd6364c1911e2dd193b8b764b891b
MD5 | 4de0246b380f4bcc7aa4b45e71b41c99
BLAKE2b-256 | e8ec3d9459d1405044d26c1a44351d599d82395c7546962e948670b9db6e8b79