Local data management for geophysical measurements

ubg_data_management

Introduction

Data management is hard.

Data management in small (work) groups is even harder, especially if data is very inhomogeneous, often incomplete, and data (format) standards are missing in a given field of research.

Goal: This repository presents work of the Geophysics Section of the University of Bonn to develop a set of guidelines, metadata entries, and Python helper scripts for basic research data management.

Problems with data management (DM)

  • DM is tedious, and often does not lead to direct benefits for the researcher
  • Often, guidelines and standards are missing for data management on the lowest level of research, i.e., at the level of data creation
  • DM is resource intensive. Commonly, metadata is entered and stored in databases, which need to be set up and maintained, including front-end software for data input and validation
  • In research environments, staff turnover is often high, which complicates long-term maintenance

Our approach

We aim to alleviate the issue of data management at the lowest level by

  • defining a simple directory structure to store heterogeneous research data (the data tree)
  • defining a simple set of metadata entries that are stored in human- and machine-readable .ini format within the directory structure
  • providing a set of Python libraries and helper scripts for simple DM tasks, such as adding new data to a data tree, or listing all available measurements

Onion-shell principle

We recognize that DM requirements vary across institutions, even between individual researchers.

We envision our DM practices as the smallest shell of a DM stack: a basic fallback that requires no special hardware or expert skills.

If resources are available, the metadata files stored in the data tree can be scanned and imported into a database, and then built upon to create more sophisticated DM practices.

The data tree can also be used for easy export and subsequent import into larger-scale DM operations, such as those often operated by research projects or larger research institutions.
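The scan-and-import step described above can be sketched with the Python standard library alone. This is a minimal illustration, not part of the toolbox: the function names and the flat table schema are hypothetical, and it assumes the metadata.ini files can be read by Python's configparser module.

```python
import configparser
import sqlite3
from pathlib import Path


def scan_data_tree(data_root):
    """Yield (directory, section, key, value) for every metadata.ini found."""
    for ini_file in sorted(Path(data_root).rglob("metadata.ini")):
        # Interpolation is disabled so that '%' in values is kept literally.
        parser = configparser.ConfigParser(interpolation=None)
        parser.read(ini_file, encoding="utf-8")
        for section in parser.sections():
            for key, value in parser.items(section):
                yield str(ini_file.parent), section, key, value


def import_into_sqlite(data_root, connection):
    """Store all metadata entries in one flat table (hypothetical schema)."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS metadata "
        "(directory TEXT, section TEXT, key TEXT, value TEXT)"
    )
    rows = list(scan_data_tree(data_root))
    connection.executemany("INSERT INTO metadata VALUES (?, ?, ?, ?)", rows)
    connection.commit()
    return len(rows)
```

A call such as import_into_sqlite("dr_data", sqlite3.connect("metadata.db")) would then make all entries queryable with plain SQL, while the data tree itself remains the primary store.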

Required hard- and software

  • A directory tree can be created by hand, if required. Therefore, only a computer and a file browser are required

  • In order to use the provided Python scripts and libraries, a working Python interpreter is required, as well as the following packages:

    • numpy
    • prompt_toolkit
    • pandas

The data tree

A data tree consists of pre-defined levels, some of which are optional. Each directory level is uniquely identified by a short prefix of one or two characters, separated from the level name by an underscore. Some levels are restricted to a certain set of possible level names (e.g., the target level only allows the values field or laboratory).

The following image visualizes the directory structure:

dirstruc.jpg

An example of a directory tree (with only one measurement) is:

└── dr_data
  └── tc_hydrogeophysics
      └── t_field
          └── s_Spiekeroog
              └── a_North
                  └── md_ERT
                      └── p_p_01_nor
                          └── m_01_p1_nor
                              ├── metadata.ini
                              └── DataRaw
                                  └── data.dat
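To illustrate the prefix convention, the following sketch splits a data tree path into its levels. The prefixes are those visible in the example tree; the level names assigned to them (for example "theme_complex" for tc) are assumptions made for this illustration, and the function is not part of the toolbox.

```python
from pathlib import PurePosixPath

# Prefixes as they appear in the example tree; the level names are assumptions.
LEVEL_PREFIXES = {
    "dr": "data_root",
    "tc": "theme_complex",
    "t": "target",
    "s": "site",
    "a": "area",
    "md": "method",
    "p": "profile",
    "m": "measurement",
}

# The target level only allows these two values.
ALLOWED_TARGETS = {"field", "laboratory"}


def split_levels(path):
    """Map each directory of a data tree path to its level name."""
    levels = {}
    for part in PurePosixPath(path).parts:
        # Split at the first underscore only: "p_p_01_nor" -> ("p", "p_01_nor")
        prefix, separator, name = part.partition("_")
        if separator and prefix in LEVEL_PREFIXES:
            levels[LEVEL_PREFIXES[prefix]] = name
    if "target" in levels and levels["target"] not in ALLOWED_TARGETS:
        raise ValueError(f"invalid target level: {levels['target']}")
    return levels


example = (
    "dr_data/tc_hydrogeophysics/t_field/s_Spiekeroog/"
    "a_North/md_ERT/p_p_01_nor/m_01_p1_nor"
)
print(split_levels(example))
```

Splitting at the first underscore only is what allows level names such as p_01_nor to contain underscores themselves.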

Measurement directories

Measurement directories (starting with m_) can contain the following subdirectories (it is advised to not create empty directories):

  • Analysis/ directories are "free-for-all", that is, no internal structure is prescribed. Use these directories to store relevant analysis steps (but in general, analysis should happen outside a data directory tree!)
  • DataProcessed/ holds processed data, e.g., with certain corrections or clean-ups applied. Ensure proper documentation!
  • DataRaw/ holds the raw data, as downloaded from the device
  • Documentation/ contains all auxiliary data that cannot be included in the metadata file. This includes maps, PDFs, literature and external documentation
  • Pictures/ holds relevant pictures of the measurement
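The rules above can be checked with a few lines of Python. The sketch below is an illustration only (the function name is hypothetical and not part of the toolbox): it reports subdirectories that are not in the list above, and subdirectories that are empty.

```python
from pathlib import Path

# Subdirectory names taken from the list above.
ALLOWED_SUBDIRS = {"Analysis", "DataProcessed", "DataRaw", "Documentation", "Pictures"}


def inspect_measurement_dir(measurement_dir):
    """Return (unknown, empty) subdirectory names of a measurement directory."""
    subdirs = [p for p in Path(measurement_dir).iterdir() if p.is_dir()]
    unknown = sorted(p.name for p in subdirs if p.name not in ALLOWED_SUBDIRS)
    empty = sorted(p.name for p in subdirs if not any(p.iterdir()))
    return unknown, empty
```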

Special directories

  • All levels are allowed to include a subdirectory Documentation. This directory can contain arbitrary information that is considered important for a given level/measurement. Use these directories to store auxiliary information, such as maps, notebook scans, programming information of measurement devices, etc.
  • The dr_[DATA ROOT NAME] directory can contain a subdirectory .management. This directory is used to store temporary/caching data of the data toolbox. It can always be safely deleted without removing any relevant data. The directory is used, for example, to store a list of currently-used ids.

The metadata entries

A typical metadata.ini file can look like this:

[general]
label = 20240610_ert_p1_nor
person_responsible = Maximilian Weigand
person_email = mw@domain.com
theme_complex = Hydrogeophysics
datetime_start = 20240610_1200
description = A small test measurement
    Note that some entries are multi-line capable!
survey_type = field
method = ERT
completed = yes

[field]
site = Spiekeroog
area = north
profile = p_01

[geoelectrics]
profile_direction = normal

Metadata entries consist of "key = value" pairs, grouped into [sections].
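Because the format is plain .ini, such a file can be read without the toolbox. The following sketch assumes the files are compatible with Python's standard configparser module and uses a shortened copy of the example above; it also shows how the multi-line description is returned.

```python
import configparser

EXAMPLE = """\
[general]
label = 20240610_ert_p1_nor
description = A small test measurement
    Note that some entries are multi-line capable!
survey_type = field
method = ERT

[field]
site = Spiekeroog
"""

# Interpolation is disabled so that '%' in values is kept literally.
parser = configparser.ConfigParser(interpolation=None)
parser.read_string(EXAMPLE)

print(parser.sections())
print(parser["general"]["method"])
# Indented continuation lines are joined with newlines.
print(parser["general"]["description"].splitlines())
```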

The Python helper libraries and scripts

Adding new data

The command dm_add can be used to easily add data to an existing, or new, data tree. The command will display information in the terminal (command line) and ask for input.

Example:

    $ dm_add -t dr_data -i walkthrough.qmd
    --------------------------------------------------------------------------------
    Input: ['walkthrough.qmd']
    Output Data Tree: dr_data
    --------------------------------------------------------------------------------

    Filename with highest priority /home/mweigand/.data_toolbox/ub_geoph_dm.cfg
    --------------------------------------------------------------------------------
            Please enter required metadata entries:
    --------------------------------------------------------------------------------

    --------------------------------------------------------------------------------
    Delete last 100 characters: STRG - a
    Ignore current input and go backwards: STRG - u
    Commit current input and stop data input STRG - z
    There are autocomplete values available (Press TAB).
    --------------------------------------------------------------------------------
    Field or laboratory measurements? Allowed values: field, laboratory
    Enter value for general.survey_type: field

Checking an existing directory tree

$ dm_check_dirtree
Working in directory /home/mweigand/test/dr_data
################################################################################
Checking directory structure of directory: /home/mweigand/test/dr_data
................................................................................
Directory dr_data
    Directory dr_data/tc_Hydrogeophysics
        Directory dr_data/tc_Hydrogeophysics/t_field
            Directory dr_data/tc_Hydrogeophysics/t_field/s_Spiekeroog
                Directory dr_data/tc_Hydrogeophysics/t_field/s_Spiekeroog/a_HausAmMeer
                    Directory dr_data/tc_Hydrogeophysics/t_field/s_Spiekeroog/a_HausAmMeer/md_TEMP
                        Directory dr_data/tc_Hydrogeophysics/t_field/s_Spiekeroog/a_HausAmMeer/md_TEMP/p_profile_02
                            Directory dr_data/tc_Hydrogeophysics/t_field/s_Spiekeroog/a_HausAmMeer/md_TEMP/p_profile_02/m_2025.2asd
                                Check empty directories:  ok
                                Check for required metadata.ini file:  ok
                                Check for required metadata entries
                                    Required entry [geoelectrics]-profile_direction is missing
                                    Required entry [geoelectrics]-electrode_positions is missing

                                Check metadata contents
                                    OK: [general][datetime_start]

                Directory dr_data/tc_Hydrogeophysics/t_field/s_Spiekeroog/a_north
                    Directory dr_data/tc_Hydrogeophysics/t_field/s_Spiekeroog/a_north/md_ERT
                        Directory dr_data/tc_Hydrogeophysics/t_field/s_Spiekeroog/a_north/md_ERT/p_p_01
                            Directory dr_data/tc_Hydrogeophysics/t_field/s_Spiekeroog/a_north/md_ERT/p_p_01/m_very_important
                                Check empty directories:  ok
                                Check for required metadata.ini file:  ok
                                Check for required metadata entries
                                    Required entry [general]-description is empty
                                    Required entry [geoelectrics]-electrode_positions is empty

                                Check metadata contents
                                    FAIL: [general][datetime_start] is not a valid date format!
################################################################################
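The kind of checks reported above can be approximated as follows. This is a sketch under stated assumptions, not the implementation used by dm_check_dirtree: the list of required entries is only inferred from the example output, and the date format %Y%m%d_%H%M is inferred from the example value 20240610_1200.

```python
import configparser
from datetime import datetime

# Hypothetical list of required entries, inferred from the example output.
REQUIRED = [
    ("general", "description"),
    ("general", "datetime_start"),
    ("geoelectrics", "profile_direction"),
    ("geoelectrics", "electrode_positions"),
]
DATETIME_FORMAT = "%Y%m%d_%H%M"  # assumed from the example value 20240610_1200


def check_metadata(ini_text):
    """Return a list of messages for missing, empty, or malformed entries."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(ini_text)
    messages = []
    for section, key in REQUIRED:
        if not parser.has_option(section, key):
            messages.append(f"Required entry [{section}]-{key} is missing")
        elif not parser.get(section, key).strip():
            messages.append(f"Required entry [{section}]-{key} is empty")
    start = parser.get("general", "datetime_start", fallback="")
    try:
        datetime.strptime(start, DATETIME_FORMAT)
        messages.append("OK: [general][datetime_start]")
    except ValueError:
        messages.append("FAIL: [general][datetime_start] is not a valid date format!")
    return messages
```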

Installation

The easiest way to install the data toolbox is using the PyPI package:

pip install ubg_data_toolbox

You can also clone the repository and install from there:

git clone https://github.com/geophysics-ubonn/ubg_data_toolbox
cd ubg_data_toolbox
pip install .

Questions

  • How do I merge two data trees? TODO

  • Isn't adding all the metadata of a given site/location highly repetitive?

    Yes, but it also keeps things simple. The metadata definition, and the Python library, already support metadata added to different levels of a data tree. This way, common metadata entries can be propagated downwards.

    This approach, however, introduces quite some complications:

    • How to deal with inconsistent data (must be dealt with when integrating existing measurement directories into a data tree)?
    • The tools to add new data files (e.g., dm_add) must be aware of this existing metadata to automatically include it (and point out inconsistencies)
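As an illustration of downward propagation, the following sketch merges metadata from the outermost level to the measurement and records inconsistencies. It is not the library's implementation: the function name is hypothetical, and the rule that deeper levels override shallower ones is an assumption made for this example.

```python
def merge_levels(levels):
    """Merge metadata from the data root down to the measurement.

    levels: list of {section: {key: value}} dicts, outermost level first.
    Returns (merged, conflicts). Deeper levels override shallower ones
    (an assumption), and every overridden value is reported as a conflict.
    """
    merged = {}
    conflicts = []
    for level in levels:
        for section, entries in level.items():
            target = merged.setdefault(section, {})
            for key, value in entries.items():
                if key in target and target[key] != value:
                    conflicts.append((section, key, target[key], value))
                target[key] = value
    return merged, conflicts


site_level = {"field": {"site": "Spiekeroog"}}
measurement_level = {"general": {"method": "ERT"}, "field": {"area": "north"}}
merged, conflicts = merge_levels([site_level, measurement_level])
print(merged)
print(conflicts)
```

Reporting conflicts instead of silently overriding values leaves the decision about inconsistent entries to the person integrating the data.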
