Package for importing, processing and visualising radiation exposure data
Datenanalyse Strahlenexposition
Getting Started
Prerequisites
- Python 3.10 or higher (and pip)
- Installation of Visual Studio Code is recommended
- Installation of the VS Code extension Database Client is recommended
Installation on Windows
- Create a folder that will contain both the repository and the KI-Labor_strukturiert data folder.
- Clone Repository or download and unzip source code
- Open repository folder in VS Code
- In VS Code, open a terminal (CMD) and run the following commands:
# Create virtual environment
python -m venv venv
# Activate it (CMD)
.\venv\Scripts\activate
# Install Poetry and dependencies from .lock file
pip install poetry
poetry install
# Remove WeasyPrint (only on Windows)
poetry remove weasyprint
- weasyprint (used for PDF report generation) requires additional setup on Windows. This approach uses prebuilt Windows binaries of the Unix libraries it depends on and is also documented here. Steps:
- Download and install MSYS2 from here
- Open MSYS2's shell (search for "MSYS2 MINGW64" in your Start Menu) and install Pango by executing:
pacman -S mingw-w64-x86_64-pango
Close the MSYS2 terminal.
- Open a new terminal in VS Code (make sure your virtual environment is activated) and install weasyprint using pip:
pip install --force-reinstall weasyprint==64.1
Common issues:
Step 4:
- If the venv activation command raises "running scripts is disabled on this system", open a PowerShell as administrator and run:
Set-ExecutionPolicy RemoteSigned
- If you run the commands in PowerShell instead of CMD, activate the environment by running:
.\venv\Scripts\Activate.ps1
- If python -m venv venv returns "python not found", try running py -m venv venv instead.
Run the application
Excel processing
To create the database and read/process the original Excel files, open the file pipeline.py (stored in src/strahlenexposition_uba/pipeline.py) and click Run. A folder named logs is created containing a log file. Check the log file for errors and warnings whenever you process new data.
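A quick way to scan the log folder for problems is a small helper like the one below (a sketch, assuming the standard Python logging level names WARNING/ERROR appear verbatim in each line; adjust the keywords if the project uses a custom log format):

```python
from pathlib import Path

def find_log_issues(log_dir="logs"):
    """Collect WARNING and ERROR lines from all .log files in log_dir."""
    issues = []
    for log_file in sorted(Path(log_dir).glob("*.log")):
        for line in log_file.read_text(encoding="utf-8").splitlines():
            if "WARNING" in line or "ERROR" in line:
                # Remember which file the problematic line came from
                issues.append((log_file.name, line))
    return issues
```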
Inspect the data
Inspect database after excel files have been processed successfully:
- open Database extension on the left sidebar
- Click add connection
- Select SQLite, enter any name, and in the Database Path field select the created database file (.db) in the database folder, which is created in the same folder where the repository and data are stored.
- Click save and connect. You can open data tables in the Tables section now.
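The same inspection can be done without the extension, using Python's built-in sqlite3 module (a sketch; point db_path at the created .db file, e.g. ./database/raw_strahlenexposition.db):

```python
import sqlite3

def list_tables(db_path):
    """Return the names of all tables in the SQLite database file."""
    with sqlite3.connect(db_path) as conn:
        # sqlite_master is SQLite's built-in catalog of schema objects
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name"
        ).fetchall()
    return [name for (name,) in rows]
```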
Create a report
To create a PDF report for selected years, open a terminal with the activated Python environment and run, e.g. for years 2018-2020:
- in CMD terminal
python src\strahlenexposition_uba\pipeline.py --skip-read-original-excel --pdf-report-years 2018 2019 2020
- or, if you are using PowerShell:
python .\src\strahlenexposition_uba\pipeline.py --skip-read-original-excel --pdf-report-years 2018 2019 2020
This might take a minute. Created reports (both pseudonymized and not pseudonymized) are saved to the output folder. See the next section for details on how to provide pseudonyms.
Folder structure
The minimal folder structure in the base path is the following. Some of the folder names are unfortunately hardcoded within the code; if you change the folder structure, you need to make sure that the code is adjusted accordingly. <placeholder> is used when precise folder names are not relevant.
KI-Labor_strukturiert/
├── 241216_R-Skripte_Vorlagen_U-Codes_und_Berichte/
│ └── 02-Untersuchungscodes_und_DRW.xlsx
└── 250122_Originalmeldungen/
└── <all_data_one_folder_per_year>
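Since these folder names are hardcoded, a small sanity check can catch a misplaced data folder early. The sketch below checks only the two dated subfolders shown above:

```python
from pathlib import Path

# Expected subfolders of the data directory (names are hardcoded in the code)
REQUIRED = [
    "241216_R-Skripte_Vorlagen_U-Codes_und_Berichte",
    "250122_Originalmeldungen",
]

def missing_folders(base_path):
    """Return the required subfolders of KI-Labor_strukturiert that are missing."""
    data_dir = Path(base_path) / "KI-Labor_strukturiert"
    return [name for name in REQUIRED if not (data_dir / name).is_dir()]
```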
Pseudonyms for pseudonymized report
To generate a report with pseudonymized "aerztliche_stelle", create a file "pseudonym_mapping.csv" somewhere inside the "KI-Labor_strukturiert" data directory. It must contain all aerztliche stellen formatted as
Aerztl_Stelle,pseudonym
name_aerztl_stelle,as_01
name_2_aerztl_stelle,as_02
Then run the same command as above to create the report with pseudonymization applied.
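A minimal sketch of generating such a mapping file programmatically (the stelle names below are hypothetical; only the two-column format matters):

```python
import csv

def write_pseudonym_mapping(stellen, out_path="pseudonym_mapping.csv"):
    """Write a pseudonym mapping CSV with sequential pseudonyms as_01, as_02, ..."""
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["Aerztl_Stelle", "pseudonym"])
        # Sort for a stable name-to-pseudonym assignment across runs
        for i, name in enumerate(sorted(stellen), start=1):
            writer.writerow([name, f"as_{i:02d}"])
```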
Database info
The schema of the database (tables, columns) is defined in './sql/schema.sql'
When you run the pipeline to read Excel files, it will only create a new database and new tables if no database file './database/raw_strahlenexposition.db' exists. Excel sheets that have already been successfully processed (db entry in table 'eingelesene_dateien' with success=1) will not be processed again. UCodes removed from the Untersuchungscode Excel file will not automatically be removed from the db table Untersuchungscodes when you run the pipeline again.
- if you manually changed data in the excel files and want to replace the existing data in the database, delete file './database/raw_strahlenexposition.db' and rerun pipeline.py
- if you want to exclude a specific UCode from reports you can remove the UCode from the Untersuchungscode excel file, delete the file './database/raw_strahlenexposition.db' and rerun the pipeline
- if you change database schema or processing logic in python code delete/rename './database/raw_strahlenexposition.db' and then run the pipeline
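To check which files are already marked as processed, you can query the 'eingelesene_dateien' table directly. This is a sketch: the file-name column is assumed here to be called 'dateiname', so check ./sql/schema.sql for the actual column name:

```python
import sqlite3

def processed_files(db_path):
    """List Excel files marked as successfully processed (success=1)."""
    with sqlite3.connect(db_path) as conn:
        # 'dateiname' is an assumed column name; verify it against schema.sql
        rows = conn.execute(
            "SELECT dateiname FROM eingelesene_dateien WHERE success = 1"
        ).fetchall()
    return [name for (name,) in rows]
```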
General information for running pipeline.py
- If there are any issues with the activated environment, try running
.\venv\Scripts\python.exe .\src\strahlenexposition_uba\pipeline.py --pdf-report-years 2018 2019 2020
instead.
- To see documentation for all parameters and flags in pipeline.py, run:
python src\strahlenexposition_uba\pipeline.py --help
Start interactive Dash
- in CMD terminal
python src\strahlenexposition_uba\pipeline.py --skip-read-original-excel --start-dash
Click on the URL in the terminal to open the local Dash app in your browser.
Data Science and Heatmaps
For data science tasks and heatmap visualisation, the following arguments can (but don't have to) be applied:
- the years for which the data science should be performed (if no years are provided, all data are used)
- the path to the base directory (if no path is provided, the grandparent folder is selected)
- the threshold for outlier detection, i.e. multiples of the DRW; e.g. --threshold 3 will mark all doses above 3x the DRW as outliers.
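The threshold logic can be sketched as follows (illustration of the idea only; the actual outlier detection lives in data_science.py and may differ in detail):

```python
def mark_outliers(doses, drw, threshold=3.0):
    """Return one True/False flag per dose: True if dose > threshold * DRW."""
    limit = threshold * drw
    return [dose > limit for dose in doses]
```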
For outlier analysis and clustering, run data_science.py. Example:
python src/strahlenexposition_uba/data_science.py --years 2018 2019 2020 --path <path/where/kilaborstrukturiert/is/stored> --threshold 4
For data visualization as heatmaps, run heatmaps.py. Example:
python src/strahlenexposition_uba/heatmaps.py --years 2018 2019 2020 --path <path/where/kilaborstrukturiert/is/stored> --threshold 4
Install and run via python wheel (.whl)
A Wheel (.whl) is a standard built distribution format for Python packages. It lets you install Python software quickly without needing to compile anything.
1. Install the package
Open terminal and run:
pip install path/to/wheel/strahlenexposition_uba-1.0.0-py3-none-any.whl
On Windows, you can ignore the warning about the kaleido version. Be sure to follow step 5 in Installation on Windows:
- 5.1 and 5.2: only if not done before
- 5.3: mandatory (pip install --force-reinstall weasyprint==64.1)
2. Run the Application
You can now execute the pipeline to:
- Read Excel files
- Write to a database
- Generate reports
The application will create subfolders (database/, output/, logs/) inside your base path (=argument passed to --path parameter) if they don’t exist.
Usage
View available options
python -m strahlenexposition_uba --help
Example: Read Excel files and create PDF reports
python -m strahlenexposition_uba --pdf-report-years 2018 2019 2020 --path <path/where/kilaborstrukturiert/is/stored>
You will find:
- Reports in: basepath/output/ (basepath=argument passed to --path parameter)
- Logs in: basepath/logs/
Example: Outlier analysis and clustering
python -m strahlenexposition_uba.data_science --years 2018 2019 2020 --path <path/where/kilaborstrukturiert/is/stored> --threshold 4
Example: Heatmaps
python -m strahlenexposition_uba.heatmaps --years 2018 2019 2020 --path <path/where/kilaborstrukturiert/is/stored> --threshold 4
Setup for Development
This project uses poetry for dependency management, virtual environments, building packages and publishing to PyPI; ruff for formatting and linting; and sphinx for documentation. See pyproject.toml for details.
python3 -m venv venv
source venv/bin/activate
pip install poetry
poetry install
pre-commit install
Optional: Install Ruff extension in VSCode https://marketplace.visualstudio.com/items?itemName=charliermarsh.ruff.
SQLite Database
Optional: To explore the database tables you can use the Database Client extension in VSCode. To connect, select the database file located at /database/<db_name>.db. If prompted, install SQLite on your system. Alternatively, you can use any other SQLite-compatible tool for database inspection.
Documentation
We use the Google formatting style for docstrings. To create the documentation from the current docs/source folder, run:
sphinx-build -M html docs/source docs/build
NOTE
Building the documentation is lazy, i.e. html pages are changed instead of deleted and re-created from scratch. This can lead to warnings. If you encounter atypical behaviour, try deleting the docs/build folder and re-run the above command.
Make sure you create a [module].rst file in the docs/source for each [module] in the package. Also include it in the modules.rst.
After running the above command, the documentation will be included in the folder docs/build/html. Click on 'index.html' and navigate through installation instructions, explanatory sections on how the code works and the code documentation of the modules.
Data pipeline
We visualised the processing steps, from reading the Excel files to writing to the SQLite database, in a flowchart. It contains most details, which should be helpful for finding errors in the Excel files, and shows how to adjust a file to make it machine readable, e.g. how to adjust the filename so that the processor correctly and uniquely identifies the aerztl. Stelle.
There are a number of reasons why a sheet might not be processed correctly. For better clarity, we only show the successful path.
Explanations
| Term | Explanation |
|---|---|
| Anchor cell | The formatter needs something for orientation within the sheet. Each formatter tries to find a cell which contains the same (or very similar) entries across all sheets using the same template. This point of orientation we call anchor cell. |
| Formatter | The formatter is a Python object which handles all the processing necessary to align the different Excel files. It aligns the column names, removes rows without data, and more. |
| Blacklist | The blacklist is a list of columns that need to be removed before processing to ensure that the remaining columns in the Excel sheet are uniquely identifiable. |
| Forward fill ID column | The ID column is usually called "ID_der_RX". If there are rows which include data but the ID column is empty, the forward fill operation fills that column with the entry from the row above. |
| Clean data | The details for these steps are included in the documentation under: Reference/Submodules/formatter/LongTableFormatter/_clean_data. |
| Duplicate mean values | If an Excel sheet contains (aggregated) mean values instead of raw values, we write them into the dataset repeatedly so that the total number of considered values is correct. |
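The forward-fill step described above can be sketched in pure Python (the actual formatter operates on whole Excel sheets, but the idea is the same):

```python
def forward_fill_ids(ids):
    """Forward-fill missing entries in an ID column (None -> value from the row above)."""
    filled, last = [], None
    for value in ids:
        if value is not None:
            # Remember the most recent non-empty ID
            last = value
        filled.append(last)
    return filled
```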