
Package for importing, processing and visualising radiation exposure data


Datenanalyse Strahlenexposition (radiation exposure data analysis)

Getting Started

Prerequisites

  • Python 3.10 or higher (and pip)
  • Installation of Visual Studio Code is recommended
  • Installation of the VS Code extension Database Client is recommended

Installation on Windows

  1. Create a folder that will contain both the repository and the KI-Labor_strukturiert data folder.
  2. Clone the repository, or download and unzip the source code
  3. Open the repository folder in VS Code
  4. In VS Code, open a terminal (CMD) and run the following commands
    # Create virtual environment
    python -m venv venv
    
    # Activate it (CMD)
    .\venv\Scripts\activate
    
    # Install Poetry and dependencies from .lock file
    pip install poetry
    poetry install
    
    # Remove WeasyPrint (only on Windows)
    poetry remove weasyprint
    
  5. weasyprint (used for PDF report generation) requires additional setup on Windows. This approach translates the Unix source code into a Windows binary and is also documented in the WeasyPrint installation instructions. Steps:
    1. Download and install MSYS2 (https://www.msys2.org)
    2. Open MSYS2’s shell (search for "MSYS2 MINGW64" in your Start Menu) and install Pango by executing:
      pacman -S mingw-w64-x86_64-pango
      
      Close the MSYS2 terminal.
    3. Open a new terminal in VS Code (make sure your virtual environment is activated) and install weasyprint using pip:
      pip install --force-reinstall weasyprint==64.1
      

Common issues:

step 4:

  • If the venv activation command raises "running scripts is disabled on this system", open a PowerShell as administrator and run Set-ExecutionPolicy RemoteSigned
  • If you run the commands in a PowerShell instead of CMD, activate the environment by running .\venv\Scripts\Activate.ps1
  • If python -m venv venv returns "python not found", try running py -m venv venv

Run the application

Excel processing

To create the database and read/process the original Excel files, open the file pipeline.py (stored in src/strahlenexposition_uba/pipeline.py) and click Run. A folder named logs should be created, containing a log file. Check the log file for errors and warnings whenever you process new data.

Inspect the data

Inspect the database after the Excel files have been processed successfully:

  1. Open the Database extension in the left sidebar
  2. Click Add Connection
  3. Select SQLite, type any name, and in the Database Path field select the created database file (.db) in the database folder, which was created in the same folder where the repository and the data are stored.
  4. Click save and connect. You can open data tables in the Tables section now.

Create a report

To create a PDF report for selected years, open a terminal with the Python environment activated and run, e.g. for the years 2018-2020:

  • in CMD terminal
    python src\strahlenexposition_uba\pipeline.py --skip-read-original-excel --pdf-report-years 2018 2019 2020
    
  • or if you are using PowerShell
    python .\src\strahlenexposition_uba\pipeline.py --skip-read-original-excel --pdf-report-years 2018 2019 2020
    

This might take a minute. Created reports (both pseudonymized and not pseudonymized) are saved to the output folder. See the Pseudonyms section below for details on how to provide pseudonyms.

Folder structure

The minimal folder structure in the base path is shown below. Some of the folder names are unfortunately hardcoded in the code; if you change the folder structure, you need to make sure the code is adjusted accordingly. <placeholder> is used when precise folder names are not relevant.

KI-Labor_strukturiert/
├── 241216_R-Skripte_Vorlagen_U-Codes_und_Berichte/
│   └── 02-Untersuchungscodes_und_DRW.xlsx
└── 250122_Originalmeldungen/
    └── <all_data_one_folder_per_year>

Pseudonyms for pseudonymized report

To generate a report with a pseudonymized "aerztliche_stelle", create a file "pseudonym_mapping.csv" anywhere inside the "KI-Labor_strukturiert" data directory. It must contain all aerztliche Stellen, formatted as

Aerztl_Stelle,pseudonym
name_aerztl_stelle,as_01
name_2_aerztl_stelle,as_02

Then run the same command as above to create the report with pseudonymization applied.
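Such a mapping file can be generated with a short script using Python's csv module; the Stellen names below are placeholders, not real entries.

```python
import csv
from pathlib import Path

# Placeholder names; replace with your actual aerztliche Stellen.
stellen = ["name_aerztl_stelle", "name_2_aerztl_stelle"]

path = Path("pseudonym_mapping.csv")
with path.open("w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["Aerztl_Stelle", "pseudonym"])  # header expected by the pipeline
    for i, name in enumerate(stellen, start=1):
        writer.writerow([name, f"as_{i:02d}"])  # as_01, as_02, ...

print(path.read_text(encoding="utf-8"))
```

Place the resulting file anywhere inside the KI-Labor_strukturiert directory before running the report command.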

Database info

The schema of the database (tables, columns) is defined in './sql/schema.sql'

When you run the pipeline to read Excel files, it will only create a new database and new tables if no database file './database/raw_strahlenexposition.db' exists. An Excel sheet that has already been processed successfully (db entry in table 'eingelesene_dateien' with success=1) will not be processed again. UCodes removed from the Untersuchungscode Excel file will not automatically be removed from the db table Untersuchungscodes when you run the pipeline again.

  • If you manually changed data in the Excel files and want to replace the existing data in the database, delete the file './database/raw_strahlenexposition.db' and rerun pipeline.py
  • If you want to exclude a specific UCode from reports, remove the UCode from the Untersuchungscode Excel file, delete the file './database/raw_strahlenexposition.db' and rerun the pipeline
  • If you change the database schema or the processing logic in the Python code, delete/rename './database/raw_strahlenexposition.db' and then run the pipeline
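The skip logic can be illustrated with Python's built-in sqlite3 module. The table name 'eingelesene_dateien' and the success flag come from the description above; the 'dateiname' column and file names are assumptions for this sketch.

```python
import sqlite3

# In-memory stand-in for ./database/raw_strahlenexposition.db.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE eingelesene_dateien (dateiname TEXT, success INTEGER)")
con.executemany(
    "INSERT INTO eingelesene_dateien VALUES (?, ?)",
    [("2018_meldung.xlsx", 1), ("2019_meldung.xlsx", 0)],
)

def already_processed(con, filename):
    """Return True if the file has a success=1 entry and would be skipped."""
    row = con.execute(
        "SELECT 1 FROM eingelesene_dateien WHERE dateiname = ? AND success = 1",
        (filename,),
    ).fetchone()
    return row is not None

print(already_processed(con, "2018_meldung.xlsx"))  # True: skipped on rerun
print(already_processed(con, "2019_meldung.xlsx"))  # False: will be retried
```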

General information for running pipeline.py

  • If there are any issues with the activated environment, try running .\venv\Scripts\python.exe .\src\strahlenexposition_uba\pipeline.py --pdf-report-years 2018 2019 2020 instead.
  • To see documentation for all parameters and flags in pipeline.py, run:
    python src\strahlenexposition_uba\pipeline.py --help
    

Start interactive Dash

  • in CMD terminal
    python src\strahlenexposition_uba\pipeline.py --skip-read-original-excel --start-dash
    

Click on the URL in the terminal to open the local Dash app in a browser.

Data Science and Heatmaps

For data science tasks and heatmap visualisation, the following optional arguments can be applied:

  • the years for which the data science shall be performed (if no years are provided, all data are used)
  • the path to the base directory (if no path is provided, the grandparent folder is selected)
  • the threshold for outlier detection, i.e. multiples of the DRW; e.g. --threshold 3 will mark all doses above 3x the DRW as outliers
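The threshold rule can be sketched in a few lines; the function and variable names here are illustrative, not the actual identifiers used in data_science.py.

```python
def is_outlier(dose: float, drw: float, threshold: float = 3.0) -> bool:
    """Return True if the dose exceeds `threshold` multiples of the DRW."""
    return dose > threshold * drw

# Hypothetical DRW for some Untersuchungscode and a few reported doses.
drw = 10.0
doses = [8.0, 25.0, 31.0, 40.0]
outliers = [d for d in doses if is_outlier(d, drw, threshold=3.0)]
print(outliers)  # doses above 3 x 10.0 = 30.0 → [31.0, 40.0]
```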

For outlier analysis and clustering, run data_science.py. Example:

python src/strahlenexposition_uba/data_science.py --years 2018 2019 2020 --path <path/where/kilaborstrukturiert/is/stored> --threshold 4

For data visualization as heatmaps, run heatmaps.py. Example:

python src/strahlenexposition_uba/heatmaps.py --years 2018 2019 2020 --path <path/where/kilaborstrukturiert/is/stored> --threshold 4

Install and run via python wheel (.whl)

A wheel (.whl) is a standard built-distribution format for Python packages. It lets you install Python software quickly without needing to compile anything.

1. Install the package

Open terminal and run:

pip install path/to/wheel/strahlenexposition_uba-1.0.0-py3-none-any.whl

On Windows, you can ignore the warning about the kaleido version. Be sure to follow step 5 of Installation on Windows:

  • 5.1 and 5.2: Only if not done before
  • 5.3: Mandatory (pip install --force-reinstall weasyprint==64.1)

2. Run the Application

You can now execute the pipeline to:

  • Read Excel files
  • Write to a database
  • Generate reports

The application will create subfolders (database/, output/, logs/) inside your base path (=argument passed to --path parameter) if they don’t exist.
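The folder creation behaviour can be sketched with pathlib; this is only an illustration of the expected base-path layout, not the pipeline's actual code.

```python
import tempfile
from pathlib import Path

# Stand-in for the base path passed via --path.
base = Path(tempfile.mkdtemp())

# The pipeline creates these subfolders if they don't exist yet.
for sub in ("database", "output", "logs"):
    (base / sub).mkdir(exist_ok=True)

print(sorted(p.name for p in base.iterdir()))  # ['database', 'logs', 'output']
```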

Usage

View available options

python -m strahlenexposition_uba --help

Example: Read Excel files and create PDF reports

python -m strahlenexposition_uba --pdf-report-years 2018 2019 2020 --path <path/where/kilaborstrukturiert/is/stored> 

You will find:

  • Reports in: basepath/output/ (basepath=argument passed to --path parameter)
  • Logs in: basepath/logs/

Example: Outlier analysis and clustering

python -m strahlenexposition_uba.data_science --years 2018 2019 2020 --path <path/where/kilaborstrukturiert/is/stored> --threshold 4

Example: Heatmaps

python -m strahlenexposition_uba.heatmaps --years 2018 2019 2020 --path <path/where/kilaborstrukturiert/is/stored> --threshold 4

Setup for Development

This project uses poetry for dependency management, virtual environments, building packages and publishing to PyPI; ruff for formatting and linting; and sphinx for documentation. See pyproject.toml for details.

python3 -m venv venv
source venv/bin/activate
pip install poetry
poetry install
pre-commit install

Optional: Install Ruff extension in VSCode https://marketplace.visualstudio.com/items?itemName=charliermarsh.ruff.

SQLite Database

Optional: To explore the database tables you can use the Database Client extension in VSCode. To connect, select the database file located at /database/<db_name>.db. If prompted, install SQLite on your system. Alternatively, you can use any other SQLite-compatible tool for database inspection.
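As a further alternative, the tables can be listed with Python's built-in sqlite3 module. The snippet below uses an in-memory database with a dummy table so it runs standalone; point the connection at your actual /database/<db_name>.db file instead.

```python
import sqlite3

# Replace ":memory:" with the path to your database file.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE demo (id INTEGER)")  # dummy table for illustration

# sqlite_master is SQLite's catalog of schema objects.
tables = [row[0] for row in con.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
)]
print(tables)
```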

Documentation

We use the Google formatting style for docstrings. To build the documentation from the current docs/source folder, run:

sphinx-build -M html docs/source docs/build

NOTE

Building the documentation is lazy, i.e. HTML pages are changed instead of being deleted and re-created from scratch. This can lead to warnings. If you encounter atypical behaviour, try deleting the docs/build folder and re-running the above command.


Make sure you create a [module].rst file in the docs/source for each [module] in the package. Also include it in the modules.rst.

After running the above command, the documentation will be included in the folder docs/build/html. Click on 'index.html' and navigate through installation instructions, explanatory sections on how the code works and the code documentation of the modules.

Data pipeline

We visualised the processing steps from reading the Excel files to writing to the SQLite database in a flowchart. It contains most details, which should be helpful for finding errors in the Excel files, and shows you how to adjust a file to make it machine readable, e.g. how to adjust the filename so that the processor correctly and uniquely identifies the aerztl. Stelle.

There are a number of reasons why a sheet might not be processed correctly. For better clarity, we only show the successful path.

Explanations

  • Anchor cell: The formatter needs something for orientation within the sheet. Each formatter tries to find a cell which contains the same (or very similar) entries across all sheets using the same template. We call this point of orientation the anchor cell.
  • Formatter: The formatter is a Python object which handles all the processing necessary to align the different Excel files. It aligns the column names, removes rows without data, and more.
  • Blacklist: The blacklist is a list of columns that need to be removed before processing to ensure that the remaining columns in the Excel sheet are uniquely identifiable.
  • Forward fill ID column: The ID column is usually called "ID_der_RX". If there are rows which include data but the ID column is empty, the forward fill operation fills that column with the entry from the row above.
  • Clean data: The details for these steps are included in the documentation under Reference/Submodules/formatter/LongTableFormatter/_clean_data.
  • Duplicate mean values: If an Excel sheet contains (aggregated) mean values instead of raw values, we write them into the dataset repeatedly so that the total number of considered values is correct.
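Two of the formatter steps above, forward filling the ID column and duplicating mean values, can be illustrated in plain Python; the real formatter may implement them differently, and the row layout here is an assumption.

```python
rows = [
    {"ID_der_RX": "rx1", "dose": 1.2},
    {"ID_der_RX": None, "dose": 1.4},   # row has data but the ID is missing
    {"ID_der_RX": "rx2", "dose": 2.0},
]

# Forward fill: an empty ID cell takes the entry from the row above.
last_id = None
for row in rows:
    if row["ID_der_RX"] is None:
        row["ID_der_RX"] = last_id
    last_id = row["ID_der_RX"]

print([r["ID_der_RX"] for r in rows])  # ['rx1', 'rx1', 'rx2']

# Duplicate mean values: an aggregated mean over n measurements is written
# n times, so the total number of considered values stays correct.
mean_value, n = 1.5, 3
expanded = [mean_value] * n
print(expanded)  # [1.5, 1.5, 1.5]
```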

Excel to SQLite pipeline
