Local data management for geophysical measurements
ubg_data_management
Introduction
Data management is hard.
Data management in small (work) groups is even harder, especially if data is very inhomogeneous, often incomplete, and data (format) standards are missing in a given field of research.
Goal: This repository presents work of the Geophysics Section of the University of Bonn to develop a set of guidelines, metadata entries, and Python helper scripts for basic research data management.
Problems with data management (DM)
- DM is tedious, and often does not lead to direct benefits for the researcher
- Often, guidelines and standards are missing for data management on the lowest level of research, i.e., at the level of data creation
- DM is resource intensive. Commonly, metadata is entered and stored in databases, which need to be set up and maintained, including front-end software for data input and validation
- In research environments, staff turnover is often high, which complicates long-term maintenance
Our approach
We aim to alleviate the issue of data management at the lowest level by
- defining a simple directory structure to store heterogeneous research data (the data tree)
- defining a simple set of metadata entries that are stored in human- and machine-readable .ini format within the directory structure
- providing a set of Python libraries and helper scripts for simple DM tasks, such as adding new data to a data tree, or listing all available measurements
Onion-shell principle
We recognize that DM requirements vary across institutions, and even between individual researchers.
We envision our DM practices as the smallest shell of a DM stack: a basic fallback that requires no special hardware or expert skills.
If resources are available, the metadata files stored in the data tree can be scanned and imported into a database, and built upon to create more sophisticated DM practices.
The data tree can also be used for easy export and subsequent import into larger-scale DM operations, such as those often operated by research projects or larger research institutions.
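The scan-and-import step mentioned above could look roughly like the following. This is only a minimal sketch using the Python standard library, not part of the toolbox: the function names `collect_metadata` and `import_into_sqlite` and the single-table layout are assumptions made for illustration.

```python
import configparser
import sqlite3
from pathlib import Path


def collect_metadata(data_root):
    """Return (directory, section, key, value) tuples for every metadata.ini."""
    rows = []
    for ini_file in sorted(Path(data_root).rglob("metadata.ini")):
        # Disable interpolation so that a literal '%' in a value is harmless
        parser = configparser.ConfigParser(interpolation=None)
        parser.read(ini_file, encoding="utf-8")
        for section in parser.sections():
            for key, value in parser[section].items():
                rows.append((str(ini_file.parent), section, key, value))
    return rows


def import_into_sqlite(rows, db_file=":memory:"):
    """Store the collected rows in one SQLite table (hypothetical schema)."""
    con = sqlite3.connect(db_file)
    con.execute(
        "CREATE TABLE IF NOT EXISTS metadata "
        "(directory TEXT, section TEXT, key TEXT, value TEXT)"
    )
    con.executemany("INSERT INTO metadata VALUES (?, ?, ?, ?)", rows)
    con.commit()
    return con
```

A key/value table like this keeps the import independent of which metadata entries a given measurement actually defines.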
Required hard- and software
- A directory tree can be created by hand if necessary, so only a computer and a file browser are required.
- In order to use the provided Python scripts and libraries, a working Python interpreter is required, as well as the following packages:
- numpy
- prompt_toolkit
- pandas
The data tree
A data tree consists of pre-defined levels, some of which are optional. Each directory level is uniquely identified by a short prefix of one or two characters, separated from the level name by an underscore. Some levels are restricted to a fixed set of level names (e.g., the target level only allows the values field or laboratory).
The following image visualizes the directory structure:
An example of a directory tree (with only one measurement) is:
└── dr_data
    └── tc_hydrogeophysics
        └── t_field
            └── s_Spiekeroog
                └── a_North
                    └── md_ERT
                        └── p_p_01_nor
                            └── m_01_p1_nor
                                ├── metadata.ini
                                └── RawData
                                    └── data.dat
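To illustrate the prefix convention, the following sketch splits a path into its levels. It is not the toolbox's implementation: the mapping from prefixes to level names is guessed from the example tree and the metadata keys in this document, and `split_levels` is a hypothetical helper.

```python
from pathlib import PurePosixPath

# Prefix -> level name. These names are assumptions inferred from the
# example tree, not the toolbox's own definitions.
LEVEL_PREFIXES = {
    "dr": "data_root",
    "tc": "theme_complex",
    "t": "target",
    "s": "site",
    "a": "area",
    "md": "method",
    "p": "profile",
    "m": "measurement",
}


def split_levels(path):
    """Map each directory of a POSIX-style path to its level via the prefix."""
    levels = {}
    for part in PurePosixPath(path).parts:
        # Only the first underscore separates prefix and level name
        prefix, sep, name = part.partition("_")
        if not sep or prefix not in LEVEL_PREFIXES:
            raise ValueError(f"Unknown directory level: {part!r}")
        levels[LEVEL_PREFIXES[prefix]] = name
    return levels
```

Because only the first underscore is treated as the separator, a directory such as p_p_01_nor yields the prefix p and the level name p_01_nor.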
Measurement directories
Measurement directories (starting with m_) can contain the following subdirectories (it is advised to not create empty directories):
- Analysis/ directories are "free-for-all", that is, no internal structure is prescribed. Use these directories to store relevant analysis steps (but in general, analysis should happen outside a data directory tree!)
- DataProcessed/ holds processed data, e.g., with certain corrections or clean-ups applied. Ensure proper documentation!
- DataRaw/ holds the raw data, as downloaded from the device
- Documentation/ contains all auxiliary data that cannot be included in the metadata file. This includes maps, PDFs, literature and external documentation
- Pictures/ holds relevant pictures of the measurement
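The advice against empty directories can be checked mechanically. The helper below is a small stand-alone sketch (the name `find_empty_dirs` is made up here); it is independent of the toolbox's own checks.

```python
from pathlib import Path


def find_empty_dirs(root):
    """Return all empty directories below root, as paths relative to root."""
    root = Path(root)
    return sorted(
        path.relative_to(root)
        for path in root.rglob("*")
        if path.is_dir() and not any(path.iterdir())
    )
```

Calling `find_empty_dirs("dr_data")` on a data tree returns an empty list when every directory holds at least one file or subdirectory.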
Special directories
- All levels are allowed to include a subdirectory Documentation. This directory can contain arbitrary information that is considered important for a given level/measurement. Use these directories to store auxiliary information, such as maps, notebook scans, programming information of measurement devices, etc.
- The dr_[DATA ROOT NAME] directory can contain a subdirectory .management. This directory is used to store temporary/caching data of the data toolbox. It can always be safely deleted without removing any relevant data. The directory is used, for example, to store a list of currently used IDs.
The metadata entries
A typical metadata.ini file can look like this:
[general]
label = 20240610_ert_p1_nor
person_responsible = Maximilian Weigand
person_email = mw@domain.com
theme_complex = Hydrogeophysics
datetime_start = 20240610_1200
description = A small test measurement
    Note that some entries are multi-line capable!
survey_type = field
method = ERT
completed = yes
[field]
site = Spiekeroog
area = north
profile = p_01
[geoelectrics]
profile_direction = normal
Metadata entries consist of "key=value" pairs, grouped by [sections].
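Because the format is plain .ini, such a file can be read with Python's standard configparser module, without the toolbox. The following sketch parses a shortened copy of the example above; it assumes that continuation lines of multi-line entries are indented, as configparser requires.

```python
import configparser

EXAMPLE = """\
[general]
label = 20240610_ert_p1_nor
description = A small test measurement
    Note that some entries are multi-line capable!
method = ERT

[field]
site = Spiekeroog
"""

# interpolation=None keeps a literal '%' in values from being interpreted
parser = configparser.ConfigParser(interpolation=None)
parser.read_string(EXAMPLE)

method = parser["general"]["method"]
# Indented continuation lines are joined to the value with newlines
description_lines = parser["general"]["description"].splitlines()
site = parser.get("field", "site")
```

For a file on disk, `parser.read("metadata.ini")` replaces the `read_string` call.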
The Python helper libraries and scripts
Adding new data
The command dm_add can be used to add data to an existing or new data tree. The command displays information in the terminal (command line) and asks for input.
Example:
$ dm_add -t dr_data -i walkthrough.qmd
--------------------------------------------------------------------------------
Input: ['walkthrough.qmd']
Output Data Tree: dr_data
--------------------------------------------------------------------------------
Filename with highest priority /home/mweigand/.data_toolbox/ub_geoph_dm.cfg
--------------------------------------------------------------------------------
Please enter required metadata entries:
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Delete last 100 characters: STRG - a
Ignore current input and go backwards: STRG - u
Commit current input and stop data input STRG - z
There are autocomplete values available (Press TAB).
--------------------------------------------------------------------------------
Field or laboratory measurements? Allowed values: field, laboratory
Enter value for general.survey_type: field
Checking an existing directory tree
$ dm_check_dirtree
Working in directory /home/mweigand/test/dr_data
################################################################################
Checking directory structure of directory: /home/mweigand/test/dr_data
................................................................................
Directory dr_data
Directory dr_data/tc_Hydrogeophysics
Directory dr_data/tc_Hydrogeophysics/t_field
Directory dr_data/tc_Hydrogeophysics/t_field/s_Spiekeroog
Directory dr_data/tc_Hydrogeophysics/t_field/s_Spiekeroog/a_HausAmMeer
Directory dr_data/tc_Hydrogeophysics/t_field/s_Spiekeroog/a_HausAmMeer/md_TEMP
Directory dr_data/tc_Hydrogeophysics/t_field/s_Spiekeroog/a_HausAmMeer/md_TEMP/p_profile_02
Directory dr_data/tc_Hydrogeophysics/t_field/s_Spiekeroog/a_HausAmMeer/md_TEMP/p_profile_02/m_2025.2asd
Check empty directories: ok
Check for required metadata.ini file: ok
Check for required metadata entries
Required entry [geoelectrics]-profile_direction is missing
Required entry [geoelectrics]-electrode_positions is missing
Check metadata contents
OK: [general][datetime_start]
Directory dr_data/tc_Hydrogeophysics/t_field/s_Spiekeroog/a_north
Directory dr_data/tc_Hydrogeophysics/t_field/s_Spiekeroog/a_north/md_ERT
Directory dr_data/tc_Hydrogeophysics/t_field/s_Spiekeroog/a_north/md_ERT/p_p_01
Directory dr_data/tc_Hydrogeophysics/t_field/s_Spiekeroog/a_north/md_ERT/p_p_01/m_very_important
Check empty directories: ok
Check for required metadata.ini file: ok
Check for required metadata entries
Required entry [general]-description is empty
Required entry [geoelectrics]-electrode_positions is empty
Check metadata contents
FAIL: [general][datetime_start] is not a valid date format!
################################################################################
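A content check like the datetime_start test in the output above could be sketched as follows. The expected format YYYYMMDD_HHMM is an assumption inferred from the example value 20240610_1200; the toolbox may accept other formats, and `is_valid_datetime_start` is a hypothetical name.

```python
import re
from datetime import datetime

# Assumed format, inferred from the example value 20240610_1200
DATETIME_FORMAT = "%Y%m%d_%H%M"
DATETIME_PATTERN = re.compile(r"\d{8}_\d{4}")


def is_valid_datetime_start(value):
    """Return True if value looks like YYYYMMDD_HHMM and is a real date."""
    value = value.strip()
    # strptime alone tolerates missing leading zeros, so check the shape first
    if not DATETIME_PATTERN.fullmatch(value):
        return False
    try:
        datetime.strptime(value, DATETIME_FORMAT)
    except ValueError:
        return False
    return True
```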
Installation
The easiest way to install the data toolbox is via the PyPI package:
pip install ubg_data_toolbox
You can also clone this repository and install from there:
git clone https://github.com/geophysics-ubonn/ubg_data_toolbox
cd ubg_data_toolbox
pip install .
Questions
- How do I merge two data trees? TODO
- Isn't adding all the metadata of a given site/location highly repetitive?
Yes, but it also keeps things simple. The metadata definition, and the Python library, already support metadata added to different levels of a data tree. This way, common metadata entries can be propagated downwards.
This approach, however, introduces some complications:
- How should inconsistent data be handled (this must be dealt with when integrating existing measurement directories into a data tree)?
- The tools to add new data files (e.g., dm_add) must be aware of this existing metadata to automatically include it (and point out inconsistencies)
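One possible way to propagate metadata downwards while surfacing inconsistencies is sketched below. This is an illustration only, not how the library handles it: `merge_levels` is a hypothetical function, and the rule that lower levels override higher ones is an assumption.

```python
def merge_levels(level_metadata):
    """Merge metadata dicts ordered from the data root down to the measurement.

    Returns (merged, conflicts). Lower levels win (an assumed policy); every
    overridden value is recorded so that it can be pointed out to the user.
    """
    merged = {}
    conflicts = []
    for level in level_metadata:
        for section, entries in level.items():
            target = merged.setdefault(section, {})
            for key, value in entries.items():
                if key in target and target[key] != value:
                    # (section, key, inherited value, overriding value)
                    conflicts.append((section, key, target[key], value))
                target[key] = value
    return merged, conflicts
```

Each dictionary maps section names to key/value entries, mirroring the layout of a metadata.ini file. Returning the conflicts instead of raising leaves the decision on how to resolve them to the calling tool.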