
A Data Processing Tool to Standardize Publicly Available Clinical Diabetes Trial Data

Project description

BabelBetes

The BabelBetes project aims to standardize publicly available clinical trial data on continuous glucose monitoring (CGM) and insulin pump delivery, reducing the costs and time associated with data translation for researchers. Motivated by the challenges of inconsistent data formats, BabelBetes will streamline access to usable datasets, accelerating innovation in type 1 diabetes care.

Challenges with Publicly Available Clinical Trial Data

Data is the raw material from which models are developed, simulations are composed, and new therapies to reduce the burden of living with type 1 diabetes are created.

Clinical trials, performed at great time and expense and funded by Breakthrough T1D, HCT, and the NIH, have produced large volumes of granular data that is often stored publicly (www.jaeb.org) or otherwise readily accessible (OPEN Project, OpenAPS, Nightscout Data Commons).

Unfortunately, this is often the only data available to researchers and developers seeking to provide innovative solutions for people with type 1 diabetes. That puts them at a great disadvantage relative to leading medical device companies, which together gather more data per day than exists in the entire public domain (approximately 500,000 subject-days).

To add to this, publicly available data is not stored with consistent methods or formats. The result is a confusing array of file formats and data descriptors that must be translated, at great effort and with a high probability of error, by every researcher or developer hoping to gain insights.

Last Mile Problem

BabelBetes addresses this “last mile” problem by developing a publicly available set of tools to normalize clinical diabetes trial datasets, focusing on continuous glucose monitoring and insulin pump delivery. BabelBetes also provides recommendations on a normalized dataset format to ensure future activities provide shovel-ready data for researchers and developers.

This is the official project documentation.

Supported Studies

BabelBetes currently normalizes 9 datasets covering approximately 500,000 subject-days of paired CGM, basal, and bolus data. See the Supported Datasets overview for the full list with sources, versions, and known issues.

How to Contribute

BabelBetes was funded to be freely available, helping researchers and companies save costs and time, and supercharge innovation in diabetes care.

We’re incredibly excited for contributions that will expand its functionality and support even more datasets, making a bigger impact than ever before!

Learn more about how to contribute.

Key Features of the Toolbox

1. Analysis scripts and documentation: You can learn about the datasets and the challenges that came with normalizing them by consulting the dataset summaries. You might also review the Jupyter notebooks that document our analysis.

2. Python modules: You can use the Python modules to extract standardized continuous glucose monitor (CGM) and insulin pump data from the supported study datasets. Reuse the helper and drawing functions to work with the data.

  • Extend the functionality of existing study classes or add new implementations of the StudyDataset base class to support additional study datasets.

3. Recommendations: As guidance for investigators, we've summarized our learnings and challenges in a list of recommendations that we believe would dramatically improve the quality and usability of datasets published in the future.

Data Standardization

The ultimate purpose of this toolbox is to bring CGM, insulin, and demographic data into a common standardized format. We chose to abstract study datasets as objects. Each study class derives from the parent StudyDataset class and overrides methods to extract CGM, bolus, basal, and age data. The StudyDataset base class defines methods to extract all data types as standardized pandas dataframes.
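To make the subclassing pattern concrete, here is a minimal sketch. The base class below is a stand-in for illustration only; the real StudyDataset class lives in the babelbetes package, and ToyStudy and its hard-coded rows are hypothetical.

```python
from abc import ABC, abstractmethod
import pandas as pd

# Stand-in for the real StudyDataset base class (illustrative only).
class StudyDatasetSketch(ABC):
    @abstractmethod
    def extract_bolus_event_history(self) -> pd.DataFrame: ...
    @abstractmethod
    def extract_basal_event_history(self) -> pd.DataFrame: ...
    @abstractmethod
    def extract_cgm_history(self) -> pd.DataFrame: ...
    @abstractmethod
    def extract_age_data(self) -> pd.DataFrame: ...

class ToyStudy(StudyDatasetSketch):
    """Hypothetical study handler returning hard-coded example rows."""

    def extract_bolus_event_history(self):
        return pd.DataFrame({
            "patient_id": ["p1"],
            "datetime": [pd.Timestamp("2020-01-01 08:00")],
            "bolus": [2.5],                           # units
            "delivery_duration": [pd.Timedelta(0)],   # standard bolus
        })

    def extract_basal_event_history(self):
        return pd.DataFrame({
            "patient_id": ["p1"],
            "datetime": [pd.Timestamp("2020-01-01 00:00")],
            "basal_rate": [0.8],                      # units per hour
        })

    def extract_cgm_history(self):
        return pd.DataFrame({
            "patient_id": ["p1"],
            "datetime": [pd.Timestamp("2020-01-01 08:05")],
            "cgm": [112.0],                           # mg/dL
        })

    def extract_age_data(self):
        return pd.DataFrame({"patient_id": ["p1"], "age": [34.0]})
```

A real implementation would read the study's raw files in each method instead of returning literals, but the contract is the same: every override returns a dataframe with the standardized columns listed below.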

Supported Data Types

Bolus Data - extract_bolus_event_history():

  Column Name        Type          Description
  patient_id         str           Patient ID
  datetime           pd.Timestamp  Datetime of the bolus event
  bolus              float         Actual delivered bolus amount in units
  delivery_duration  pd.Timedelta  Duration of the bolus delivery

Basal Data - extract_basal_event_history():

  Column Name  Type          Description
  patient_id   str           Patient ID
  datetime     pd.Timestamp  Datetime of the basal rate start event
  basal_rate   float         Basal rate in units per hour

CGM Data - extract_cgm_history():

  Column Name  Type          Description
  patient_id   str           Patient ID
  datetime     pd.Timestamp  Datetime of the CGM measurement
  cgm          float         CGM value in mg/dL

Age Data - extract_age_data():

  Column Name  Type   Description
  patient_id   str    Patient ID
  age          float  Patient age at study enrollment/start

Refer to the Code Reference for more details.
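One payoff of this common format is that downstream analyses become short and study-agnostic. The sketch below computes per-patient time in range (70–180 mg/dL) from a CGM dataframe in the format above; the data is synthetic, and the 70–180 mg/dL target range is a common clinical convention, not something defined by BabelBetes.

```python
import pandas as pd

# Synthetic CGM data in the standardized format described above.
cgm = pd.DataFrame({
    "patient_id": ["p1", "p1", "p1", "p2", "p2"],
    "datetime": pd.to_datetime([
        "2020-01-01 08:00", "2020-01-01 08:05", "2020-01-01 08:10",
        "2020-01-01 08:00", "2020-01-01 08:05",
    ]),
    "cgm": [95.0, 250.0, 140.0, 60.0, 110.0],  # mg/dL
})

# Fraction of readings inside the 70-180 mg/dL target range, per patient.
tir = (cgm["cgm"].between(70, 180)
       .groupby(cgm["patient_id"])
       .mean())
print(tir)
```

Because every supported study yields the same columns, this snippet works unchanged on any dataframe returned by extract_cgm_history().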

How to use BabelBetes (Quickstart)

Here, we explain how to install the toolbox and how to use the run_functions.py script that batch processes all studies and extracts the standardized data.

Setup Python

  • Make sure you have a recent version of Python 3 installed.
  • We recommend using a python virtual environment (see using virtual environments)
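If you are unsure how to set one up, the standard-library venv module is enough; the `.venv` directory name below is just a common convention.

```shell
# Create a virtual environment in ./.venv and activate it
# (macOS/Linux; on Windows run .venv\Scripts\activate instead of "source").
python3 -m venv .venv
source .venv/bin/activate
```

Once activated, pip installs packages into `.venv` instead of your system Python.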

Installation

Installation via pip install

This repository can be used as a dependency in other projects with pip. Install by running:

pip install babelbetes

Example usage:

from babelbetes.studies import Flair
import os

current_dir = os.path.dirname(os.path.abspath(__file__))

# MODIFY SO THE PATH POINTS TO YOUR RAW DATA. THIS CAN BE EITHER THE .zip OR UNZIPPED FOLDER
study_path = os.path.join(current_dir, 'FLAIRPublicDataSet.zip')
flair = Flair(study_path)

basal_events = flair.extract_basal_event_history()
cgm = flair.extract_cgm_history()
boluses = flair.extract_bolus_event_history()
age_data = flair.extract_age_data() 

print("Basal events: ", basal_events.head())
print("CGM events: ", cgm.head())
print("Boluses: ", boluses.head())
print("Age data: ", age_data.head())

Developer

  1. Clone the repository:
    git clone git@github.com:nudgebg/babelbetes.git
    
  2. Install all dependencies:
  • In your terminal, navigate to the repository
  • (Optional) Activate your python virtual environment
  • Run this command to install all packages required by BabelBetes:
    pip install -r requirements.txt

Prepare the raw data

  1. Download the study data zip files from jaeb.org (see supported studies).
  2. Move the files into the data/raw directory. Zipped files can either be used directly or unzipped. Do not rename the files or folders; otherwise run_functions.py won't know how to process them.
  3. Depending on which studies you downloaded and whether you have .zip archives (or unzipped folders), the folder structure should look like this:
    babelbetes/
    ├── babelbetes/
    │   ├── studies/
    │   ├── src/
    │   └── run_functions.py
    ├── data/
    │   └── raw/
    │       ├── FLAIRPublicDataSet.zip
    │       ├── DCLP3 Public Dataset - Release 3 - 2022-08-04
    │       ├── IOBP2 RCT Public Dataset
    │       ├── T1DEXI - DATA FOR UPLOAD
    │       └── T1DEXIP - DATA FOR UPLOAD.zip
    ├── docs/
    ├── examples/
    └── tests/

Run run_functions.py to batch extract data

The run_functions.py script is the entry point for users who simply want to extract standardized data from the supported studies. It performs data extraction and standardization. For each folder in the data/raw directory the script:

  1. Identifies the appropriate handler class (see supported studies)
  2. Loads the study data
  3. Extracts bolus, basal, CGM event histories, and age data to a standardized format (see data standardization)
  4. Saves the extracted data in CSV/Parquet format
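Step 1 boils down to matching raw folder names against known patterns. The sketch below shows that idea with hypothetical patterns and a hypothetical identify_handler function; the actual matching logic and patterns live inside run_functions.py and may differ.

```python
import re

# Hypothetical folder-name patterns mapped to handler class names.
# Note: T1DEXIP is listed before T1DEXI so the longer prefix wins.
FOLDER_PATTERNS = {
    r"^FLAIR": "Flair",
    r"^DCLP3": "DCLP3",
    r"^IOBP2": "IOBP2",
    r"^T1DEXIP": "T1DEXIP",
    r"^T1DEXI": "T1DEXI",
    r"^REPLACE-BG": "ReplaceBG",
}

def identify_handler(folder_name: str):
    """Return the handler class name for a raw data folder, or None."""
    # Strip a trailing .zip so archives and unzipped folders match alike.
    stem = folder_name.removesuffix(".zip")
    for pattern, handler in FOLDER_PATTERNS.items():
        if re.match(pattern, stem):
            return handler
    return None
```

This is also why the folder names must not be changed after download: a renamed folder no longer matches any pattern and is silently skipped.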

Command Usage:

# Run from project root - extract all data types
python -m babelbetes.run_functions

# Extract specific data types only
python -m babelbetes.run_functions --data-types age cgm
python -m babelbetes.run_functions --data-types bolus basal

# Process specific studies only
python -m babelbetes.run_functions --studies Flair DCLP3

# Run in test mode with subset of data
python -m babelbetes.run_functions --test

Example terminal output:

> python -m babelbetes.run_functions
[15:26:22] Looking for study folders in /data/raw and saving results to /data/out
[15:26:22] Start processing supported study folders:
[15:26:22] 'T1DEXI' using T1DEXI class
[15:26:22] 'REPLACE-BG Dataset-79f6bdc8-3c51-4736-a39f-c4c0f71d45e5' using ReplaceBG class
...
[15:26:22] Processing T1DEXI ...
[15:26:56] [x] Data loaded
[15:26:56] [x] Boluses extracted
[15:27:00] [x] Basal extracted
[15:27:12] [x] CGM extracted
[15:27:13] [x] Age data extracted
[15:27:12] T1DEXI completed in 37.43 seconds.
...
Processing complete.

Extract specific data types:

# Extract only age data
> python -m babelbetes.run_functions --data-types age

# Extract multiple data types
> python -m babelbetes.run_functions --data-types cgm bolus age

# Extract all data types (default behavior)
> python -m babelbetes.run_functions --data-types cgm bolus basal age

Execution Times

These are approximate execution times on a MacBook Pro M3:

  Study       Time
  Flair       58 seconds
  IOBP2       26 seconds
  PEDAP       34 seconds
  DCLP3       15 seconds
  DCLP5       23 seconds
  T1DEXI      37 seconds
  T1DEXIP     7 seconds
  Replace BG  30 seconds
  Loop        151 seconds*
  Total       ~383 seconds

* The Loop raw data files are very large, which requires the use of Dask. Dask builds on pandas and processes chunks of the data in parallel. However, the routine that saves the data to CSV currently still requires the whole dataframe to be loaded into memory before storing it, which might fail if your machine has insufficient memory.
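One way around that memory bottleneck, in principle, is to append each chunk to the CSV as it is produced instead of concatenating everything first. The sketch below demonstrates the append pattern with synthetic pandas chunks; with Dask you would iterate over dataframe partitions instead, and this is not how run_functions.py currently saves its output.

```python
import pandas as pd

def chunks():
    """Hypothetical generator standing in for Dask partitions."""
    for i in range(3):
        yield pd.DataFrame({"patient_id": [f"p{i}"] * 2,
                            "cgm": [100.0 + i, 110.0 + i]})

out = "cgm_streamed.csv"
for i, chunk in enumerate(chunks()):
    # Write the header only once, then append; each chunk is freed
    # after writing, so peak memory stays at one chunk.
    chunk.to_csv(out, mode="w" if i == 0 else "a",
                 header=(i == 0), index=False)
```

Peak memory usage is then bounded by the largest single chunk rather than the full dataset.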

Troubleshooting

  • Ensure the raw data folders are named correctly to match the patterns in the script. You shouldn't need to rename the folders or zip archives after you downloaded the datasets.
  • Check the console output for any warning or error messages.

License

MIT License
