A suite of PySpark, Pandas, and general pipeline utils for Reproducible Data Science and Analysis (RDSA) projects.

🧰 rdsa-utils

The RDSA team sits within the Economic Statistics Change Directorate, and uses cutting-edge data science and engineering skills to produce the next generation of economic statistics. Current priorities include overhauling legacy systems and developing new systems for key statistics. More information about work at RDSA can be found here: Using Data Science for Next-Gen Statistics.

rdsa-utils is a Python package that supports Python 3.8 and higher. It uses setup.py, setup.cfg, and pyproject.toml for dependency management and packaging.

📋 Prerequisites

  • Python 3.8 or higher

💾 Installation

rdsa-utils is available for installation via PyPI and can also be found on GitHub Releases for direct downloads and version history.

To install via pip, simply run:

pip install rdsa-utils
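
To confirm the installation, a minimal sketch along these lines can help. It uses only the Python standard library; the helper name `check_environment` is hypothetical and not part of rdsa-utils. It checks the interpreter against the stated Python 3.8 minimum and looks up the installed distribution version, if any.

```python
import sys
from importlib import metadata

MIN_PYTHON = (3, 8)  # minimum version stated in the prerequisites


def check_environment(package="rdsa-utils"):
    """Return (python_ok, installed_version), where the version is None if absent.

    Hypothetical helper for illustration; not part of rdsa-utils itself.
    """
    python_ok = sys.version_info >= MIN_PYTHON
    try:
        installed = metadata.version(package)
    except metadata.PackageNotFoundError:
        # The distribution is not installed in this environment.
        installed = None
    return python_ok, installed


python_ok, installed = check_environment()
print(f"Python >= 3.8: {python_ok}")
print(f"rdsa-utils version: {installed or 'not installed'}")
```

If the second line reports "not installed", the pip command above was probably run in a different environment from the one executing this script.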

🗂️ How the Project is Organised

The rdsa-utils package is designed to make it easy to work with different platforms like Cloudera Data Platform (CDP) and Google Cloud Platform (GCP), as well as handle general Python tasks. Here's a breakdown of how everything is organised:

  • General Utilities (Top-Level):

    • These are tools you can use for any project, regardless of the platform you're working on. They focus on common Python, PySpark, and Pandas tasks.
    • 📂 Helpers: Handy functions that simplify working with Python and PySpark.
    • 📂 IO: Functions for handling input and output, like reading configurations or saving results.
  • Platform-Specific Utilities:

    • CDP (Cloudera Data Platform):
      • 📂 Helpers: Functions that help you work with tools supported by CDP, such as HDFS, Impala, and AWS S3.
      • 📂 IO: Input/output functions specifically for CDP, such as managing data and logs in CDP environments.
    • GCP (Google Cloud Platform):
      • 📂 Helpers: Functions to help you interact with GCP tools like Google Cloud Storage and BigQuery.
      • 📂 IO: Input/output functions for managing data with GCP services.

This structure keeps the tools for each platform separate, so you can easily find what you need, whether you're working in a cloud environment or on general Python tasks.
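
As a rough illustration of this layout, the sketch below probes for subpackages that mirror the breakdown above. The dotted names in `CANDIDATE_SUBPACKAGES` are assumptions inferred from the description, not confirmed module paths, and `is_importable` is a hypothetical helper. Consult the documentation for the actual import paths.

```python
import importlib.util

# Assumed subpackage names inferred from the layout described above.
# These are NOT confirmed paths; check the documentation for the real ones.
CANDIDATE_SUBPACKAGES = [
    "rdsa_utils.helpers",      # general helpers (assumed path)
    "rdsa_utils.io",           # general IO (assumed path)
    "rdsa_utils.cdp.helpers",  # CDP helpers (assumed path)
    "rdsa_utils.cdp.io",       # CDP IO (assumed path)
    "rdsa_utils.gcp.helpers",  # GCP helpers (assumed path)
    "rdsa_utils.gcp.io",       # GCP IO (assumed path)
]


def is_importable(name):
    """Return True if a module spec can be found for `name`.

    find_spec imports parent packages but not the named module itself,
    so a missing parent or failed parent import is treated as "not found".
    """
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        return False


availability = {name: is_importable(name) for name in CANDIDATE_SUBPACKAGES}
for name, found in availability.items():
    print(f"{name}: {'found' if found else 'not found'}")
```

Because the probe never imports the leaf modules, it runs even where platform-specific dependencies (for example, a Spark or cloud SDK) are missing.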

📖 Documentation and Further Information

Our documentation is automatically generated using GitHub Actions and MkDocs. It uses the ons_mkdocs_theme package for a consistent ONS look and feel on GitHub Pages.

For an in-depth guide to rdsa-utils, including how to contribute, please refer to our MkDocs-generated documentation.

📘 Further Reading on Reproducible Analytical Pipelines

While rdsa-utils provides essential tools for data processing, it's just one part of the broader development process needed to build and maintain a robust, high-quality codebase. Following best practices and using the right tools are crucial for success.

We highly recommend checking out the following resources to learn more about creating Reproducible Analytical Pipelines (RAP), which focus on important areas such as version control, modular code development, unit testing, and peer review -- all essential for developing these pipelines:

  • Reproducible Analytical Pipelines (RAP) Resource - This resource offers an overview of Reproducible Analytical Pipelines, covering benefits, case studies, and guidelines on building a RAP. It discusses minimising manual steps, using open source software like R or Python, enhancing quality assurance through peer review, and ensuring auditability with version control. It also addresses challenges and considerations for implementing RAPs, such as data access restrictions or confidentiality, and underscores the importance of collaborative development.

  • Quality Assurance of Code for Analysis and Research - This book details methods and practices for ensuring high-quality coding in research and analysis, including unit testing and peer reviews.

  • PySpark Introduction and Training Book - An introduction to using PySpark for large-scale data processing.

Additionally, if you repeatedly have to set up new developers and users with a local Python environment, consider writing a batch file to carry out the setup for you. The easypipelinerun repo has a batch file that can be adapted to your project; it takes care of conda and pip setup as well as environment management.

📬 Contact

For questions, support, or feedback about rdsa-utils, please email RDSA.Support@ons.gov.uk.

🙌 Acknowledgements

Thanks to colleagues from the ONS Data Science Campus (DSC) and the ONS Methods and Quality Directorate (MQD) for their contributions to rdsa-utils.

🛡️ Licence

Unless stated otherwise, the codebase is released under the MIT License. This covers both the codebase and any sample code in the documentation.

The documentation is © Crown copyright and available under the terms of the Open Government Licence v3.0.
