Skip to main content

No project description provided

Project description

Scrape NHS Conditions

Build Code style: black License: MIT

This package uses the NHS Website Developer portal and Scrapy to pull down the text content of the NHS Conditions website into text files for downstream use by data science projects.

This is a simplified version of the work found here: https://github.com/nhsx/language-corpus-tools

Contact

This repository is maintained by NHS England Data Science Team.

To contact us raise an issue on Github or via email.

See our (and our colleagues') other work here:

Description

There is a need for easy access to the text content of NHS Conditions, particularly given the useful work by CogStack in creating lists of NHS Conditions questions and answers.

The NHS Developer API is very useful, but requires some setup and training to use - overkill if all a data science project needs is the NHS Conditions text. Additionally, the outputs of the API need further processing to get just the textual components of each page.

This package aims to make this whole process easier, requiring the user to simply run:

  • run_nhs_conditions_scraper: to extract the HTML for each page
  • process_nhs_conditions_json: to extract the text for each page into txt files

An example of how these are used can be see in the scrape_nhs_conditions.ipynb notebook

Prerequisites

If applicable, list the items a user needs to be able to use your repository, such as a certain version of a programming language. It can be useful to link to documentation on how to install these items.

  • Python (> 3.0)

Getting Started

Tell the user how to get started (using a numbered list can be helpful). List one action per step with example code if possible.

  1. Clone the repository. To learn about what this means, and how to use Git, see the Git guide.
git clone <insert URL>
  1. Set up your environment using pip. For more information on how to use virtual environments and why they are important see the virtual environments guide.

Using pip

python -m venv .venv
.\.venv\Scripts\Activate.ps1
python -m pip install -r requirements.txt

For Visual Studio Code it is necessary that you change your default interpreter to the virtual environment you just created .venv. To do this use the shortcut Ctrl-Shift-P, search for Python: Select interpreter and select .venv from the list.

Project structure

Provide the user with an outline of your repository structure. This template is primarily designed for publications teams at NHS England. Projects with different requirements (e.g. more complex documentation and modelling) should look to DrivenData's cookiecutter project structure, as well as our Community of Practice for guidance.

|   .gitignore                        <- Files (& file types) automatically removed from version control for security purposes
|   config.toml                       <- Configuration file with parameters we want to be able to change (e.g. date)
|   requirements.txt                  <- Requirements for reproducing the analysis environment 
|   pyproject.toml                    <- Configuration file containing package build information
|   LICENSE                           <- License info for public distribution
|   README.md                         <- Quick start guide / explanation of your project
|
|   scrape_nhs_conditions.ipynb       <- Shows how to use the main functions to scrape NHS Conditions.     
|
+---src                               <- Contains project's codebase.
|   |       __init__.py               <- Makes the functions folder an importable Python module
|   |
|   +---utils                     <- Scripts relating to configuration and handling data connections e.g. importing data, writing to a database etc.
|   |       __init__.py               <- Makes the functions folder an importable Python module
|   | 
|   +---data_ingestion                <- Scripts with modules containing functions to preprocess read data i.e. perform validation/data quality checks, other preprocessing etc.
|   |       __init__.py               <- Makes the functions folder an importable Python module
|   |       simple_nhs_conditions_scrap.py <- Scrapes the HTML down from the NHS Conditions website.
|   |
|   +---processing                    <- Scripts with modules containing functions to process data i.e. clean and derive new fields
|   |       __init__.py               <- Makes the functions folder an importable Python module
|   |       process_html.py           <- processes the HTML files to make text files 
|   |
|   +---data_exports
|   |       __init__.py               <- Makes the functions folder an importable Python module
|   |
|
+---tests
|   |       __init__.py               <- Makes the functions folder an importable Python module
|   |
|   +---backtests                     <- Comparison tests for the old and new pipeline's outputs
|   |       __init__.py               <- Makes the functions folder an importable Python module
|   |
|   +---unittests                     <- Tests for the functional outputs of Python code
|   |       test_simple_nhs_conditions_scrape.py
|   |       __init__.py               <- Makes the functions folder an importable Python module

Licence

The LICENCE file will need to be updated with the correct year and owner

Unless stated otherwise, the codebase is released under the MIT License. This covers both the codebase and any sample code in the documentation.

Any HTML or Markdown documentation is © Crown copyright and available under the terms of the Open Government 3.0 licence.

Acknowledgements

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

scrape_nhs_conditions-1.0.4.tar.gz (11.7 kB view hashes)

Uploaded Source

Built Distribution

scrape_nhs_conditions-1.0.4-py3-none-any.whl (9.7 kB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page