Scrape NHS Conditions
This package uses the NHS Website Developer portal and Scrapy to pull down the text content of the NHS Conditions website into text files for downstream use by data science projects.
This is a simplified version of the work found here: https://github.com/nhsx/language-corpus-tools
Contact
This repository is maintained by NHS England Data Science Team.
To contact us, raise an issue on GitHub or send us an email.
See our (and our colleagues') other work here:
Description
There is a need for easy access to the text content of NHS Conditions, particularly given the useful work by CogStack in creating lists of NHS Conditions questions and answers.
The NHS Developer API is very useful, but requires some setup and training to use - overkill if all a data science project needs is the NHS Conditions text. Additionally, the outputs of the API need further processing to get just the textual components of each page.
This package aims to make this whole process easier, requiring the user to simply run:
- run_nhs_conditions_scraper: to extract the HTML for each page
- process_nhs_conditions_json: to extract the text for each page into txt files
An example of how these are used can be seen in the scrape_nhs_conditions.ipynb notebook.
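To make the second step more concrete, here is a minimal sketch of turning scraped HTML into plain text files. It is not the package's implementation: it uses only the Python standard library, and the names html_to_text and write_text_files, as well as the {"name", "html"} record layout, are assumptions made for illustration.

```python
from html.parser import HTMLParser
from pathlib import Path


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    _SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self._SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self._SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def html_to_text(html):
    """Return the visible text of an HTML string, one chunk per line."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


def write_text_files(records, out_dir):
    """Write one .txt file per record.

    `records` is an iterable of {"name": str, "html": str} dicts
    (a hypothetical layout, not the package's actual JSON schema).
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for record in records:
        path = out_dir / f"{record['name']}.txt"
        path.write_text(html_to_text(record["html"]), encoding="utf-8")
        written.append(path)
    return written
```

For the package's real entry points and their arguments, refer to the notebook mentioned above.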
Prerequisites
- Python (> 3.0)
Getting Started
- Clone the repository. To learn about what this means, and how to use Git, see the Git guide.
git clone <insert URL>
- Set up your environment using pip. For more information on how to use virtual environments and why they are important see the virtual environments guide.
Using pip
python -m venv .venv
.\.venv\Scripts\Activate.ps1
python -m pip install -r requirements.txt
In Visual Studio Code, you need to change the default interpreter to the virtual environment you just created (.venv). To do this, press Ctrl-Shift-P, search for "Python: Select Interpreter" and select .venv from the list.
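The activation command shown above is the Windows PowerShell form. On macOS and Linux, the venv module places its activation script under bin/ instead of Scripts\. The small helper below is only an illustrative sketch (activation_command is a hypothetical name, not part of this package) that returns the command for the current platform.

```python
import os


def activation_command(venv_dir=".venv", os_name=None):
    """Return the shell command that activates a venv for the given OS.

    os_name follows os.name: "nt" for Windows, "posix" for macOS/Linux.
    """
    os_name = os.name if os_name is None else os_name
    if os_name == "nt":
        # PowerShell activation script created by `python -m venv`
        return f".\\{venv_dir}\\Scripts\\Activate.ps1"
    # POSIX shells (bash, zsh) source the activate script
    return f"source {venv_dir}/bin/activate"


if __name__ == "__main__":
    print(activation_command())
```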
Project structure
| .gitignore <- Files (& file types) automatically removed from version control for security purposes
| config.toml <- Configuration file with parameters we want to be able to change (e.g. date)
| requirements.txt <- Requirements for reproducing the analysis environment
| pyproject.toml <- Configuration file containing package build information
| LICENSE <- License info for public distribution
| README.md <- Quick start guide / explanation of your project
|
| scrape_nhs_conditions.ipynb <- Shows how to use the main functions to scrape NHS Conditions.
|
+---src <- Contains project's codebase.
| | __init__.py <- Makes the functions folder an importable Python module
| |
| +---utils <- Scripts relating to configuration and handling data connections e.g. importing data, writing to a database etc.
| | __init__.py <- Makes the functions folder an importable Python module
| |
| +---data_ingestion <- Scripts with modules containing functions to preprocess read data i.e. perform validation/data quality checks, other preprocessing etc.
| | __init__.py <- Makes the functions folder an importable Python module
| | simple_nhs_conditions_scrap.py <- Scrapes the HTML down from the NHS Conditions website.
| |
| +---processing <- Scripts with modules containing functions to process data i.e. clean and derive new fields
| | __init__.py <- Makes the functions folder an importable Python module
| | process_html.py <- processes the HTML files to make text files
| |
| +---data_exports
| | __init__.py <- Makes the functions folder an importable Python module
| |
|
+---tests
| | __init__.py <- Makes the functions folder an importable Python module
| |
| +---backtests <- Comparison tests for the old and new pipeline's outputs
| | __init__.py <- Makes the functions folder an importable Python module
| |
| +---unittests <- Tests for the functional outputs of Python code
| | test_simple_nhs_conditions_scrape.py
| | __init__.py <- Makes the functions folder an importable Python module
Licence
Unless stated otherwise, the codebase is released under the MIT License. This covers both the codebase and any sample code in the documentation.
Any HTML or Markdown documentation is © Crown copyright and available under the terms of the Open Government Licence v3.0.
Acknowledgements