
A cleaner module for Annex A Data


Annex A Normaliser

This code was originally developed by Celine Gross, Chris Owen and Kaj Siebert at Social Finance as part of a grant funded programme to support Local Authorities to collaborate on data analysis. The programme was called the ‘Front Door Data Collaboration’. It was supported financially by the Christie Foundation and Nesta (through the ‘What Works Centre for Children’s Social Care’). The LAs whose staff guided its development were Bracknell Forest, West Berkshire, Southampton, and Surrey. It also benefitted from advice from the National Performance and Information Managers Group.

We are happy to share this code in the hope that other data analysts will benefit from a quick way to standardize Annex A data before conducting further analysis.

You can find more info about Social Finance on our website: https://www.socialfinance.org.uk/

What is this code about?

What if you could conduct analysis on a Local Authority's Annex A without worrying about cleaning the data first?

Conducting varied pieces of analysis on Annex A data required us to repeatedly clean the data of typos, inconsistencies, incorrect column labels, and many more fun things. We realised that there was value in writing a "cleaner" that would standardize the data, so that we could get on with the analysis without re-cleaning each time.

How to run this programme

To run this programme, you will need to have Python and Poetry installed.

Once that is done, follow the steps detailed below:

If you have Annex A data:

    1. Run the 10-annexa-MERGE step
    2. Run the 20-annexa-CLEAN step
    3. Run the 30-annexa-CUSTOM_CLEAN step (optional)

You're done!

Step 1: 10-annexa-MERGE

The 10-annexa-MERGE file loads all the Annex A files and merges them into a single one. It also works if you have only one file. In the process, it checks column titles and the data types within the columns. This programme will output three items:

  • The merged Annex A ("merged.xlsx"): a single merged Annex A file
  • The Annex A column report ("column_names.xlsx"): a list of "column_name" from the Annex A guidance matched with the "header_name" found in your file. You may see that some columns were not matched if their titles were not aligned with the Annex A guidance.
  • The Annex A error report: a list of values that were discarded because they didn't match the expected column type - e.g. a field containing "Yes" where a date was expected. (This report is currently disabled and will be added in a future version.)
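As a rough sketch of the kind of matching the column report describes, headers can be normalised before comparison so that near-matches line up. The guidance column names and helper functions below are illustrative placeholders, not part of this package:

```python
import re

# Illustrative guidance column names (not the real Annex A list)
GUIDANCE_COLUMNS = ["Child Unique ID", "Date of Birth", "Ethnicity"]

def normalise(name: str) -> str:
    """Lower-case and strip punctuation/whitespace so near-matches compare equal."""
    return re.sub(r"[^a-z0-9]", "", name.lower())

def match_columns(header_names, guidance=GUIDANCE_COLUMNS):
    """Map each guidance column_name to the header_name found in the file, or None."""
    lookup = {normalise(h): h for h in header_names}
    return {g: lookup.get(normalise(g)) for g in guidance}

report = match_columns(["child unique id", "DOB", "Ethnicity "])
# "Date of Birth" stays unmatched because "DOB" doesn't normalise to it
```

Columns reported as unmatched in "column_names.xlsx" are typically headers that, even after this kind of normalisation, don't correspond to a guidance column name.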

To run this step, open the 10-annexa-MERGE notebook and run all the cells. You can input the path to your Annex A files by:

  1. Giving a list of individual file names:
sources = find_sources('examples/example-A-2005.xls', 'examples/example-B-2004.xlsx', data_sources=data_sources)
  2. Giving a 'glob' pattern to find the files within a folder:
sources = find_sources('examples/example-*.*', data_sources=data_sources)
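The 'glob' pattern behaves like standard shell wildcards. As a rough illustration using Python's standard-library fnmatch (this shows the pattern semantics only, not the actual find_sources internals):

```python
from fnmatch import fnmatch

# Hypothetical file listing to demonstrate which names the pattern selects
files = [
    "examples/example-A-2005.xls",
    "examples/example-B-2004.xlsx",
    "examples/readme.txt",
]
matches = [f for f in files if fnmatch(f, "examples/example-*.*")]
# readme.txt is excluded: it doesn't start with "example-"
```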

You can follow the full, step-by-step walk-through of this step in docs/merger-components.ipynb.

Step 2: 20-annexa-CLEAN

The 20-annexa-CLEAN file goes over the merged Annex A ("merged.xlsx") created in step 10 and aligns the values within the columns with the 2019 Annex A guidance. E.g. 'White British' (Ethnicity column) will be converted to 'a) WBRI'. This programme will output two items:

  • The matching report ("matching_report.xlsx"): Excel table showing which original values were matched with Annex A-aligned values. Those that were not matched are shown as 'not matched'. The matching is done based on generic rules that should work for most users; however, you have the option to customise the matching in the 'custom clean' step.
  • The cleaned Annex A ("cleaned.xlsx"): new Annex A file with values aligned with the 2019 Annex A guidance. The values that were not matched are replaced by 'not matched'. You can change this behaviour in the 'custom clean' step.
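Conceptually, this step applies rule-based value mapping. A minimal sketch, assuming a simple lookup table (the rules and helper below are illustrative; the real cleaner's rule set is far more extensive):

```python
# Illustrative mapping rules in the spirit of the 2019 Annex A guidance
ETHNICITY_RULES = {
    "white british": "a) WBRI",
    "wbri": "a) WBRI",
}

def clean_value(raw: str, rules=ETHNICITY_RULES) -> str:
    """Return the guidance-aligned value, or 'not matched' (as written to cleaned.xlsx)."""
    return rules.get(raw.strip().lower(), "not matched")

clean_value("White British")  # 'a) WBRI'
clean_value("James Bond")     # 'not matched'
```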

To run this step, open the 20-annexa-CLEAN notebook and run all the cells. You can change the file paths if required.

Step 3: 30-annexa-CUSTOM_CLEAN

The 30-annexa-CUSTOM_CLEAN step enables you to customise the Annex A cleaning and output a new version of the cleaned Annex A. This programme will output one item:

  • The cleaned Annex A ("final_cleaned.xlsx"): new Annex A file including the edits you made in the matching report.

Go ahead and open the matching report ("matching_report.xlsx") generated by 20-CLEAN: you'll see that you can change how the original value ('former_value' column) is mapped to the Annex A-aligned value ('new_value' column).

Let's imagine that your data had a row with "Contact Source" : "James Bond". Our generic cleaning rules would not pick that up, and you would see the line "James Bond" : "not matched" in the matching report. You can edit this and change it into "James Bond" : "d) 1D: Individual". Make your edits and save the matching report. You're ready to go!
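Conceptually, the custom clean step layers your edited mappings on top of the generic rules. A minimal sketch, assuming the former_value/new_value columns described above (the report rows and helper are hypothetical, not this package's API):

```python
# Hypothetical rows from an edited matching_report.xlsx, as (former_value, new_value)
matching_report = [
    ("James Bond", "d) 1D: Individual"),  # the manual edit described above
    ("Anon", "not matched"),              # left unmatched on purpose
]

overrides = dict(matching_report)

def apply_custom_clean(raw: str) -> str:
    """Prefer the user's edited mapping; anything absent stays 'not matched'."""
    return overrides.get(raw, "not matched")
```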

To run this step, open the 30-annexa-CUSTOM_CLEAN notebook and run all the cells. You can change the file paths if required.

Step 4: 40-cincensus-CLEAN

The 40-cincensus-CLEAN step goes over one or several CIN Census files (XML) and performs a quick, low-level clean. This programme will output one item for each input item:

  • A "cleaned" version of each CIN Census file ("cleaned-{}.xml"): The cleaning consists of removing empty or incomplete tags, removing trailing spaces, checking that date fields are dates, checking that codes used are aligned with the CIN Census guidance. If a field is not of the correct type/code, the code will add a "Not in proper format" mention to the data field. The user can open the clean XML and search for "Not in proper format" and manually edit if needed.

To run this step, open the 40-cincensus-CLEAN notebook and run all the cells. You need to input the CIN folder filepath.

Step 5: 50-all-LOG

The 50-all-LOG step creates an event-based csv table pulling together all the data from Annex A and/or CIN Census. The code can combine Annex A and CIN Census data, but it also works with Annex A or CIN Census alone. This programme will output one item:

  • The log ("log.csv"): new csv file recapping all the data contained in Annex A and/or CIN Census.

Having a log file allows you to access all the child information in one place and easily find the events that occurred.
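An event-based log of this kind is just one row per event, regardless of which return the event came from. A minimal sketch with the standard csv module (the column names and event records below are hypothetical, not the actual log.csv schema):

```python
import csv
import io

# Hypothetical events pulled from Annex A and CIN Census for one child
events = [
    {"child_id": "123", "date": "2019-02-11", "type": "assessment_start", "source": "cin_census"},
    {"child_id": "123", "date": "2019-01-04", "type": "contact", "source": "annex_a"},
]

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["child_id", "date", "type", "source"])
writer.writeheader()
# One row per event, in date order, so a child's history reads top to bottom
writer.writerows(sorted(events, key=lambda e: e["date"]))
log_csv = buffer.getvalue()
```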

To run this step, open the 50-all-LOG notebook and run all the cells. You need to input the cleaned Annex A filepath and the cleaned CIN folder filepath. If you are not using Annex A or CIN Census, you need to change "include_annexa" or "include_cincensus" to False.

Caveats and assumptions

Annex A cleaning - We have focused on providing cleaning rules on the first 8 Annex A lists. If you're keen to add additional rules for the remaining lists, please get in touch and we'd be happy to collaborate.

Contributing

This is our first go at providing some quick code to simplify the analysis of statutory children services data. Much more could be done! If you'd like to contribute, head over to CONTRIBUTING.md.
