helpers for quality checking data
Project description
dfeqa
Introduction
DfE-QA Python helper functions - covering a variety of checks primarily to report on the quality of data.
Getting Started
python -m pip install dfeqa
or you can install with additional dependencies for analysis using Quarto with:
python -m pip install dfeqa[user]
Using
dfeqa create
Once dfeqa has been installed you can create new project files at the shell:
dfeqa create data_quality my_dq_report.qmd
Type dfeqa create --help for a list of templates.
You don't need to use the templates at all - you can just import the functions into your own scripts as follows
from dfeqa import x [,y...]
for example:
from dfeqa import load_census, barchart as bc
dfeqa addenv
dfeqa v0.0.6 introduced addenv which will create a .env file in your working directory with some examples for connecting to SQL Server or Databricks databases. You can define several connections and then define one of them using the DEFAULT_CONN variable (as shown in the template) as your default so you don't need to state it explicitly every time you use it. It's worth being explicit in your scripts though, so use the default when exploring data or developing scripts, but use explicit connection references when finalising your scripts.
More details of the helper functions
You can find more information about the various helper functions and objects available to you at the dfeqa wiki.
Some functions you may find helpful to get started:
Data transformation and validation
- year_group() - predict a pupil national curriculum year group from their date of birth
- valid_name_regex() - identify unlikely names (single character, odd characters like question marks, etc.)
- relaxed_valid_name_regex() - identify unlikely names (relaxed version used for school names)
- valid_upn() - validate UPNs, which allows for identifying invalid ones
Summary functions
- fd() - calculate frequency distributions from multiple variables and compare the results
- barchart()
- status_summary() - create a high-level summary suitable for mapping to organisational goals
The Summary object
- wide_fd() - create multiple frequency distibutions for comparison in wide format
- long_fd() - create multiple frequency distibutions for comparison in long format
The Series object
- minmax() - summarise the contents of a series including length of columns and characters used
The DataFrame object
- set_header() - convenience function to change the column headings in an easy-read format
- minmax() - summarise the contents of a dataframe including length of columns and characters used
Database functions
- list_tables() - list tables in a database
- get_table() - pull contents of a complete table from a database
- get_table_metadata() - pull the description of a table from a database including data types
- query() - query a database and put the result in a dataframe
Contribute
You're very welcome to fork the repo or create a pull-request. If you're working within DfE, get in touch, and I'll provide what guidance I can on developing new functions and updating the library.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file dfeqa-0.0.6.tar.gz.
File metadata
- Download URL: dfeqa-0.0.6.tar.gz
- Upload date:
- Size: 3.1 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.10.11
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
070826bc4d9f9a8c126f02fbbc670f453ac1402165a054627cdc9a914f139ae5
|
|
| MD5 |
b462649f92f6cf252d7c11ec4dde935f
|
|
| BLAKE2b-256 |
ac44c2e6b432986693bad0236860ba081de90354127eebc8d9afdaa32e151013
|
File details
Details for the file dfeqa-0.0.6-py3-none-any.whl.
File metadata
- Download URL: dfeqa-0.0.6-py3-none-any.whl
- Upload date:
- Size: 3.1 MB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.10.11
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
19a99aec36a479c79f44a300890a28c0185ad73f79663109c7fb9003043ad19a
|
|
| MD5 |
57716da4e22c17ec8170fbea9e446175
|
|
| BLAKE2b-256 |
04c0c53f468401cb00c3687c2e05bd32f57009d986c04b3be5a3fbaaa66959c3
|