Automated BIDS standardization tool powered by LLM-first architecture

autobidsify

Automated Brain Imaging Data Structure (BIDS) standardization tool powered by LLM-first architecture.

License: MIT

Features

  • General compatibility: Handles diverse dataset structures (flat, hierarchical, multi-site)
  • Multi-modal support: MRI, fNIRS, EEG, and mixed modality datasets
  • Intelligent metadata extraction: Automatically extracts participant demographics from DICOM headers, documents, and filenames
  • Format conversion: DICOM→NIfTI, JNIfTI→NIfTI, .mat/.nirs→SNIRF, and more
  • Multi-LLM support: OpenAI (gpt-4o, gpt-5.1) and Qwen (via local Ollama, a remote Ollama REST API, or DashScope)
  • Evidence-based reasoning: Confidence scoring and provenance tracking for all decisions

Supported Formats

Input formats:

  • MRI: DICOM (.dcm), NIfTI (.nii, .nii.gz), JNIfTI (.jnii, .bnii)
  • fNIRS: SNIRF (.snirf), Homer3 (.nirs), MATLAB (.mat)
  • EEG: EDF/EDF+ (.edf), BrainVision (.vhdr), EEGLAB (.set), Biosemi (.bdf)
  • Documents: PDF, DOCX, TXT, Markdown

Output: Compliant with the BIDS specification (v1.10.0)
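
The input-format list can be read as an extension-to-modality table. The sketch below is a hypothetical helper written for illustration; it is not autobidsify's API, and the function name `guess_modality` is an assumption. It only shows why multi-part extensions such as .nii.gz need to be matched before shorter ones.

```python
from pathlib import Path
from typing import Optional

# Extension table copied from the "Input formats" list above.
# The lookup itself is a hypothetical helper, not part of autobidsify.
EXTENSION_MODALITY = {
    ".dcm": "mri", ".nii": "mri", ".nii.gz": "mri", ".jnii": "mri", ".bnii": "mri",
    ".snirf": "nirs", ".nirs": "nirs", ".mat": "nirs",
    ".edf": "eeg", ".vhdr": "eeg", ".set": "eeg", ".bdf": "eeg",
}


def guess_modality(filename: str) -> Optional[str]:
    """Return 'mri', 'nirs', or 'eeg' for a listed extension, else None."""
    name = Path(filename).name.lower()
    # Check the longest extensions first so ".nii.gz" is matched as a whole.
    for ext in sorted(EXTENSION_MODALITY, key=len, reverse=True):
        if name.endswith(ext):
            return EXTENSION_MODALITY[ext]
    return None
```

An extension alone is a weak signal: .mat is listed under fNIRS here, but MATLAB files can hold other data, so a guess like this is at most a first pass.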

Installation

pip install autobidsify

Optional dependencies:

# For BIDS validation
npm install -g bids-validator

# For DICOM conversion
pip install dcm2niix          # or: apt-get install dcm2niix / brew install dcm2niix

Set an API key:

# OpenAI
export OPENAI_API_KEY="your-key-here"

# Qwen via DashScope (optional cloud alternative to Ollama)
export DASHSCOPE_API_KEY="your-key-here"

Run all test datasets:

./run_all_tests.sh

Quick Start

# Full pipeline (one command)
autobidsify full \
  --input /path/to/your/data \
  --output outputs/my_dataset \
  --model gpt-4o \
  --modality mri \
  --nsubjects 10 \
  --id-strategy auto \
  --describe "Your dataset description here"

# Step-by-step execution
autobidsify ingest   --input data/ --output outputs/run
autobidsify evidence --output outputs/run --modality mri
autobidsify trio     --output outputs/run --model gpt-4o
autobidsify plan     --output outputs/run --model gpt-4o
autobidsify execute  --output outputs/run
autobidsify validate --output outputs/run
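
The step-by-step commands can also be driven from a script. The sketch below builds the six commands in the order shown above and runs them with `subprocess`; the function names are hypothetical, and stopping on the first failure assumes that `autobidsify` exits with a non-zero status when a stage fails, which the README does not state.

```python
import subprocess


def stage_commands(input_path, output_dir, modality="mri", model="gpt-4o"):
    """Build the six step-by-step commands in pipeline order."""
    out = ["--output", output_dir]
    return [
        ["autobidsify", "ingest", "--input", input_path, *out],
        ["autobidsify", "evidence", *out, "--modality", modality],
        ["autobidsify", "trio", *out, "--model", model],
        ["autobidsify", "plan", *out, "--model", model],
        ["autobidsify", "execute", *out],
        ["autobidsify", "validate", *out],
    ]


def run_pipeline(input_path, output_dir, **kwargs):
    """Run each stage in order; check=True raises on a non-zero exit code."""
    for cmd in stage_commands(input_path, output_dir, **kwargs):
        print("running:", " ".join(cmd))
        subprocess.run(cmd, check=True)
```

Calling `run_pipeline("data/", "outputs/run")` would reproduce the sequence above. Running stages separately makes it possible to inspect intermediate files between steps.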

Command Options

--input PATH            Input data (archive or directory)
--output PATH           Output directory
--model MODEL           LLM model (default: gpt-4o, maximum context 128000 tokens)
--modality TYPE         Data modality: mri | nirs | eeg | mixed
--nsubjects N           Number of subjects (optional, auto-detected if omitted)
--describe "TEXT"       Dataset description (recommended for metadata accuracy)
--id-strategy STRATEGY  Subject ID strategy: auto | numeric | semantic (default: auto)
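
As an illustration of how the options combine, the following sketch assembles an `autobidsify full` command line and leaves out the optional flags when they are not given. The builder function is hypothetical; only the flag names and allowed values come from the list above.

```python
MODALITIES = {"mri", "nirs", "eeg", "mixed"}
ID_STRATEGIES = {"auto", "numeric", "semantic"}


def build_full_command(input_path, output_dir, modality, model="gpt-4o",
                       nsubjects=None, describe=None, id_strategy="auto"):
    """Return the argument list for `autobidsify full`."""
    if modality not in MODALITIES:
        raise ValueError(f"modality must be one of {sorted(MODALITIES)}")
    if id_strategy not in ID_STRATEGIES:
        raise ValueError(f"id_strategy must be one of {sorted(ID_STRATEGIES)}")
    cmd = [
        "autobidsify", "full",
        "--input", str(input_path),
        "--output", str(output_dir),
        "--model", model,
        "--modality", modality,
        "--id-strategy", id_strategy,
    ]
    # --nsubjects is optional: omit it to rely on auto-detection.
    if nsubjects is not None:
        cmd += ["--nsubjects", str(nsubjects)]
    # --describe is optional but recommended for metadata accuracy.
    if describe:
        cmd += ["--describe", describe]
    return cmd
```

The resulting list can be passed to `subprocess.run` directly, which avoids shell quoting problems in the free-text description.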

Supported Models

OpenAI:

--model gpt-4o           # Recommended, stable
--model gpt-4o-mini      # Faster, cheaper
--model gpt-5.1          # Latest

Qwen (via local Ollama):

--model qwen3-coder-next:latest     # Recommended
--model qwen3-coder-careful:latest  # Recommended
--model qwen2.5-coder:7b            # Not recommended: slow and sometimes inaccurate

Qwen (via remote Ollama REST API):

export OLLAMA_BASE_URL=http://your-server.com:xxxx
--model qwen3-coder-next:latest

Qwen (via DashScope cloud API):

export DASHSCOPE_API_KEY="your-key-here"
--model qwen-max

Pipeline Stages

| Stage | Command  | Input           | Output                                                    | Purpose                                                                    |
|-------|----------|-----------------|-----------------------------------------------------------|----------------------------------------------------------------------------|
| 1     | ingest   | Raw data        | ingest_info.json                                          | Extract/reference data                                                     |
| 2     | evidence | All files       | evidence_bundle.json                                      | Analyze structure, detect subjects, scan auxiliary files                   |
| 3     | classify | Mixed data      | classification_plan.json, pool directories                | Separate MRI/fNIRS/EEG (optional, mixed only)                              |
| 4     | trio     | Evidence        | BIDS trio files                                           | Generate dataset_description.json, README, participants.tsv                |
| 5     | plan     | Evidence + trio | BIDSPlan.yaml                                             | Create conversion strategy, generate modality-specific mappings            |
| 6     | execute  | Plan            | bids_compatible/, conversion_log.json, BIDSManifest.yaml  | Execute conversions, generate BIDS sidecars                                |
| 7     | validate | BIDS dataset    | Validation report                                         | Check compliance (Tier 1: Python bids_validator, Tier 2: npm bids-validator) |

Output Structure

outputs/my_dataset/
├── bids_compatible/              # Final BIDS dataset
│   ├── dataset_description.json
│   ├── README.md
│   ├── participants.tsv
│   ├── sub-001/
│   │   ├── anat/
│   │   │   └── sub-001_T1w.nii.gz
│   │   ├── func/
│   │   │   └── sub-001_task-rest_bold.nii.gz
│   │   ├── nirs/
│   │   │   ├── sub-001_task-rest_nirs.snirf
│   │   │   ├── sub-001_task-rest_nirs.json
│   │   │   └── sub-001_optodes.tsv        # fNIRS only
│   │   └── eeg/
│   │       ├── sub-001_task-rest_eeg.edf
│   │       ├── sub-001_task-rest_eeg.json
│   │       ├── sub-001_task-rest_channels.tsv
│   │       ├── sub-001_electrodes.tsv     # EEG only
│   │       └── sub-001_coordsystem.json
│   └── derivatives/              # Unprocessed files (original structure)
└── _staging/                     # Intermediate files
    ├── evidence_bundle.json
    ├── BIDSPlan.yaml
    ├── mat_mapping.json           # fNIRS .mat datasets only
    ├── eeg_event_mapping.json     # EEG datasets with event files
    ├── eeg_aux_mapping.json       # EEG datasets with auxiliary metadata
    └── conversion_log.json
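
After a run, the layout above can be spot-checked with a few lines of Python. This is a minimal sketch, not a substitute for the `validate` stage: the file names are taken from the tree above, while the helper names and the choice of which files to treat as required are assumptions.

```python
from pathlib import Path

# Paths relative to the output directory, taken from the tree above.
# Which of them to treat as required is an assumption of this sketch.
EXPECTED_FILES = [
    "bids_compatible/dataset_description.json",
    "bids_compatible/participants.tsv",
    "_staging/evidence_bundle.json",
    "_staging/BIDSPlan.yaml",
    "_staging/conversion_log.json",
]


def missing_outputs(output_dir):
    """Return the expected files that are absent from output_dir."""
    root = Path(output_dir)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]


def subject_dirs(output_dir):
    """Return the sorted sub-* directory names in the BIDS dataset."""
    bids = Path(output_dir) / "bids_compatible"
    return sorted(p.name for p in bids.glob("sub-*") if p.is_dir())
```

For example, `missing_outputs("outputs/my_dataset")` returns an empty list when all five files exist, and `subject_dirs` gives a quick count to compare against the expected number of subjects.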

Examples

MRI dataset

autobidsify full \
  --input brain_scans/ \
  --output outputs/study1 \
  --model gpt-4o \
  --modality mri \
  --nsubjects 30 \
  --id-strategy numeric \
  --describe "Single-site T1w MRI study, 30 healthy adults"

fNIRS dataset

autobidsify full \
  --input fnirs_data/ \
  --output outputs/fnirs \
  --model gpt-4o \
  --modality nirs \
  --describe "Prefrontal fNIRS, 20 subjects, resting state and finger tapping"

EEG dataset

autobidsify full \
  --input eeg_data/ \
  --output outputs/eeg \
  --model gpt-4o \
  --modality eeg \
  --nsubjects 36 \
  --describe "EEG during mental arithmetic tasks, 36 subjects, EDF format"

Using Qwen (local, no API cost)

# Start the Ollama server first (it runs in the foreground, so use a separate terminal)
ollama serve
autobidsify full \
  --input data/ \
  --output outputs/run \
  --model qwen3-coder-next:latest \
  --modality mri

Architecture

LLM-First Design:

  • Python: Deterministic operations — file I/O, regex-based subject detection, format conversion, BIDS validation, standard 10-20 electrode lookup
  • LLM: Semantic understanding — dataset description, metadata extraction, scan type classification, license normalization, event file column mapping, auxiliary file analysis
  • Hybrid: Python analyzes ALL files for completeness; LLM sees representative samples for semantic decisions
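
The hybrid split can be pictured with a small sketch: deterministic code groups every file by subject, then only a handful of representative paths are selected for the LLM. Everything here is hypothetical, including the regular expression, the function names, and the sampling rule; autobidsify's actual detection logic is not documented in this README.

```python
import re
from collections import defaultdict

# Hypothetical pattern: matches "sub-01", "sub_01", "Subject12", and similar.
SUBJECT_RE = re.compile(r"sub(?:ject)?[-_]?(\d+)", re.IGNORECASE)


def group_by_subject(paths):
    """Deterministic pass: assign every path to a zero-padded subject label."""
    groups = defaultdict(list)
    for path in paths:
        match = SUBJECT_RE.search(path)
        key = f"sub-{int(match.group(1)):03d}" if match else "unassigned"
        groups[key].append(path)
    return dict(groups)


def representative_sample(groups, per_subject=2, max_subjects=3):
    """Sampling pass: pick a few files from a few groups for the LLM prompt."""
    sample = []
    for key in sorted(groups)[:max_subjects]:
        sample.extend(sorted(groups[key])[:per_subject])
    return sample
```

The point of the split is cost and completeness: the regex pass touches all files without any token budget, while the prompt stays small no matter how large the dataset is.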

Requirements

  • Python 3.10+
  • OpenAI API key (or a DashScope API key, or Ollama for local Qwen models)
  • bids-validator (npm) for full structural validation (optional)
  • dcm2niix for DICOM conversion (optional)
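
A short preflight check can report which of these requirements are met before a long run. The sketch below is an illustration only: the function name and the returned keys are assumptions, the environment variable names are the ones used elsewhere in this README, and the executable names (`ollama`, `dcm2niix`, `bids-validator`) are assumed to match what the install commands put on the PATH.

```python
import os
import shutil


def preflight(env=None, which=shutil.which):
    """Report which credentials and optional tools are available."""
    env = os.environ if env is None else env
    return {
        # LLM backends: at least one of these should be True.
        "openai": bool(env.get("OPENAI_API_KEY")),
        "dashscope": bool(env.get("DASHSCOPE_API_KEY")),
        "remote_ollama": bool(env.get("OLLAMA_BASE_URL")),
        "local_ollama": which("ollama") is not None,
        # Optional tools for DICOM conversion and full validation.
        "dcm2niix": which("dcm2niix") is not None,
        "bids-validator": which("bids-validator") is not None,
    }
```

Calling `preflight()` with no arguments inspects the current environment; the `env` and `which` parameters exist only so the check can be exercised without touching the real system.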

Current Status

Version: 0.9.6

Contributing

We need YOUR datasets to improve robustness. Please test and report issues at: https://github.com/cotilab/autobidsify/issues
