autobidsify

Automated Brain Imaging Data Structure (BIDS) standardization tool powered by LLM-first architecture.

License: MIT

Features

  • General compatibility: Handles diverse dataset structures (flat, hierarchical, multi-site)
  • Multi-modal support: MRI, fNIRS, and mixed modality datasets
  • Intelligent metadata extraction: Automatic participant demographics from DICOM headers, documents, and filenames
  • Format conversion: DICOM→NIfTI, JNIfTI→NIfTI, .mat/.nirs→SNIRF, and more
  • Multi-LLM support: OpenAI (gpt-4o, gpt-5.1, o1, o3) and Qwen (via Ollama or DashScope)
  • Evidence-based reasoning: Confidence scoring and provenance tracking for all decisions

Supported Formats

Input formats:

  • MRI: DICOM (.dcm), NIfTI (.nii, .nii.gz), JNIfTI (.jnii, .bnii)
  • fNIRS: SNIRF (.snirf), Homer3 (.nirs), MATLAB (.mat)
  • Documents: PDF, DOCX, TXT, Markdown

Output: Compliant with the BIDS specification (v1.10.0)

Installation

pip install autobidsify

Optional dependencies:

# For DICOM conversion
apt-get install dcm2niix          # Ubuntu/Debian
brew install dcm2niix             # macOS

# For BIDS validation
npm install -g bids-validator

# For Qwen models (local)
# Install Ollama from https://ollama.com/download
ollama pull qwen2.5-coder:7b
pip install ollama

Set API key:

# OpenAI
export OPENAI_API_KEY="your-key-here"

# Qwen via DashScope (optional cloud alternative to Ollama)
export DASHSCOPE_API_KEY="your-key-here"
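
Before launching a run, it can help to confirm that the key for the chosen provider is visible to the process. The following is a minimal sketch that assumes only the two environment variable names shown above; the helper `missing_keys` and the provider labels are hypothetical and not part of autobidsify:

```python
import os

# Environment variables named in this README, keyed by an illustrative provider label.
PROVIDER_KEYS = {
    "openai": "OPENAI_API_KEY",
    "dashscope": "DASHSCOPE_API_KEY",
}


def missing_keys(providers, env=None):
    """Return the env var names that are unset or empty for the given providers."""
    env = os.environ if env is None else env
    return [PROVIDER_KEYS[p] for p in providers if not env.get(PROVIDER_KEYS[p])]


if __name__ == "__main__":
    missing = missing_keys(["openai"])
    print("Missing:", ", ".join(missing) if missing else "none")
```

Passing `env` explicitly makes the check easy to exercise without touching the real environment.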

Quick Start

# Full pipeline (one command)
# With dataset description (recommended for better metadata extraction)
autobidsify full \
  --input /path/to/your/data \
  --output outputs/my_dataset \
  --model gpt-4o \
  --modality mri \
  --nsubjects 10 \
  --id-strategy numeric \
  --describe "Your dataset description here"

# Step-by-step execution
autobidsify ingest   --input data/ --output outputs/run
autobidsify evidence --output outputs/run --modality mri
autobidsify trio     --output outputs/run --model gpt-4o
autobidsify plan     --output outputs/run --model gpt-4o
autobidsify execute  --output outputs/run
autobidsify validate --output outputs/run
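
The step-by-step sequence can also be driven from a script. This sketch uses only the subcommands and flags listed above. It assumes, without confirmation from the tool's documentation, that each subcommand exits with a non-zero status on failure; `stage_commands` and `run_pipeline` are hypothetical helper names:

```python
import subprocess


def stage_commands(input_path, output_dir, model="gpt-4o", modality="mri"):
    """Build the six step-by-step commands from this README, in order."""
    return [
        ["autobidsify", "ingest", "--input", input_path, "--output", output_dir],
        ["autobidsify", "evidence", "--output", output_dir, "--modality", modality],
        ["autobidsify", "trio", "--output", output_dir, "--model", model],
        ["autobidsify", "plan", "--output", output_dir, "--model", model],
        ["autobidsify", "execute", "--output", output_dir],
        ["autobidsify", "validate", "--output", output_dir],
    ]


def run_pipeline(input_path, output_dir, **kwargs):
    """Run each stage in turn; check=True raises at the first non-zero exit code."""
    for cmd in stage_commands(input_path, output_dir, **kwargs):
        print("Running:", " ".join(cmd))
        subprocess.run(cmd, check=True)


# Example call (requires autobidsify on PATH):
# run_pipeline("data/", "outputs/run")
```

Keeping command construction separate from execution lets the command list be inspected or logged before anything runs.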

Command Options

--input PATH            Input data (archive or directory)
--output PATH           Output directory
--model MODEL           LLM model (default: gpt-4o)
--modality TYPE         Data modality: mri | nirs | mixed
--nsubjects N           Number of subjects (optional, auto-detected if omitted)
--describe "TEXT"       Dataset description (recommended for metadata accuracy)
--id-strategy STRATEGY  Subject ID strategy: auto | numeric | semantic (default: auto)

Supported Models

OpenAI:

--model gpt-4o           # Highly recommended, stable
--model gpt-4o-mini      # Faster, cheaper
--model gpt-5.1          # Latest, but less recommended

Qwen (via Ollama, local):

--model qwen3-coder-next:latest     # Recommended
--model qwen3-coder-careful:latest  # Recommended
--model qwen2.5-coder:7b            # Slow and sometimes inaccurate, not recommended

Qwen (via REST API on a remote Ollama server):

export OLLAMA_BASE_URL=http://your-server.com:xxxx
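
To confirm that a remote server is reachable before a run, one option is to query Ollama's model-listing endpoint (`GET /api/tags`). This is a rough sketch: it reads the `OLLAMA_BASE_URL` variable shown above and falls back to Ollama's default local address, but how autobidsify itself resolves the URL is not specified here. The function names are hypothetical:

```python
import json
import os
import urllib.request


def configured_base_url(env=None):
    """Read OLLAMA_BASE_URL, falling back to Ollama's default local address."""
    env = os.environ if env is None else env
    return env.get("OLLAMA_BASE_URL", "http://localhost:11434")


def tags_url(base_url):
    """Build the URL of Ollama's model-listing endpoint (GET /api/tags)."""
    return base_url.rstrip("/") + "/api/tags"


def list_remote_models(base_url, timeout=5):
    """Return the model names the server reports; raises on connection errors."""
    with urllib.request.urlopen(tags_url(base_url), timeout=timeout) as resp:
        payload = json.load(resp)
    return [m["name"] for m in payload.get("models", [])]


# Example call (requires a reachable server):
# print(list_remote_models(configured_base_url()))
```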

Pipeline Stages

| Stage | Command  | Input           | Output                   | Purpose                            |
|-------|----------|-----------------|--------------------------|------------------------------------|
| 1     | ingest   | Raw data        | ingest_info.json         | Extract/reference data             |
| 2     | evidence | All files       | evidence_bundle.json     | Analyze structure, detect subjects |
| 3     | classify | Mixed data      | classification_plan.json | Separate MRI/fNIRS (optional)      |
| 4     | trio     | Evidence        | BIDS trio files          | Generate metadata files            |
| 5     | plan     | Evidence + trio | BIDSPlan.yaml            | Create conversion strategy         |
| 6     | execute  | Plan            | bids_compatible/         | Execute conversions                |
| 7     | validate | BIDS dataset    | Validation report        | Check compliance                   |

Output Structure

outputs/my_dataset/
├── bids_compatible/              # Final BIDS dataset
│   ├── dataset_description.json
│   ├── README.md
│   ├── participants.tsv
│   ├── sub-001/
│   │   ├── anat/
│   │   │   └── sub-001_T1w.nii.gz
│   │   └── func/
│   │       └── sub-001_task-rest_bold.nii.gz
│   └── derivatives/              # Unprocessed files (original structure)
│       └── sub-001/
│           └── ...
└── _staging/                     # Intermediate files
    ├── evidence_bundle.json
    ├── BIDSPlan.yaml
    └── conversion_log.json
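
A quick sanity check after a run is to compare the `sub-*` directories against `participants.tsv`. The sketch below assumes the layout shown above and the standard BIDS `participant_id` column; the function names are illustrative and not part of autobidsify:

```python
import csv
from pathlib import Path


def subjects_on_disk(bids_root):
    """Return sorted sub-* directory names directly under the BIDS root."""
    return sorted(p.name for p in Path(bids_root).glob("sub-*") if p.is_dir())


def subjects_in_table(bids_root):
    """Return sorted participant_id values from participants.tsv."""
    with open(Path(bids_root) / "participants.tsv", newline="") as fh:
        reader = csv.DictReader(fh, delimiter="\t")
        return sorted(row["participant_id"] for row in reader)


def compare_subjects(bids_root):
    """Return (only_on_disk, only_in_table) as two sorted lists."""
    disk = set(subjects_on_disk(bids_root))
    table = set(subjects_in_table(bids_root))
    return sorted(disk - table), sorted(table - disk)


# Example call:
# print(compare_subjects("outputs/my_dataset/bids_compatible"))
```

Two empty lists mean every subject directory has a row in the table and vice versa.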

Examples

Example 1: Single-site MRI study

autobidsify full \
  --input brain_scans/ \
  --output outputs/study1 \
  --nsubjects 50 \
  --model gpt-4o \
  --modality mri

Example 2: Multi-site dataset with description

autobidsify full \
  --input camcan_data/ \
  --output outputs/camcan \
  --model gpt-4o \
  --modality mri \
  --id-strategy semantic \
  --describe "Cambridge Centre for Ageing and Neuroscience: 650 participants, ages 18-88, multi-site MRI study"

Example 3: fNIRS dataset

autobidsify full \
  --input fnirs_study/ \
  --output outputs/fnirs \
  --model gpt-4o \
  --modality nirs \
  --describe "Prefrontal cortex activation during cognitive tasks, 30 subjects"

Example 4: Using Qwen (local, no API cost)

# Start the Ollama server first (it runs in the foreground, so use a separate terminal)
ollama serve
autobidsify full \
  --input data/ \
  --output outputs/run \
  --model qwen2.5-coder:7b \
  --modality mri

Architecture

LLM-First Design:

  • Python: Deterministic operations — file I/O, regex-based subject detection, format conversion, BIDS validation
  • LLM: Semantic understanding — dataset description, metadata extraction, scan type classification, license normalization
  • Hybrid: Python analyzes ALL files for completeness; LLM sees representative samples for semantic decisions
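
The division of labor described above can be illustrated with a toy example: a deterministic pass groups every file by a regex-detected subject number, and only a small sample is set aside for semantic analysis. This is a simplified sketch and not autobidsify's actual implementation; the regex, function names, and sampling limits are all assumptions made for illustration:

```python
import re
from collections import defaultdict

# Hypothetical pattern: matches tokens such as "sub-01", "sub_01", or "Subject02".
SUBJECT_RE = re.compile(r"(?:subject|sub)[-_]?(\d+)", re.IGNORECASE)


def group_by_subject(paths):
    """Deterministic pass: scan every path and bucket it by detected subject number."""
    groups = defaultdict(list)
    for path in paths:
        match = SUBJECT_RE.search(path)
        key = int(match.group(1)) if match else None  # None = no subject detected
        groups[key].append(path)
    return dict(groups)


def representative_sample(groups, per_subject=2, max_subjects=3):
    """Pick a small, bounded set of files that could be shown to an LLM."""
    sample = []
    detected = sorted(k for k in groups if k is not None)
    for key in detected[:max_subjects]:
        sample.extend(sorted(groups[key])[:per_subject])
    return sample
```

Bounding the sample keeps prompt size independent of dataset size, while the grouping step still accounts for every file.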

Requirements

  • Python
  • OpenAI API key (or Ollama for local Qwen models)
  • dcm2niix for DICOM conversion
  • bids-validator for validation
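
Whether the external tools are on `PATH` can be checked up front. This is a small sketch using the executable names from the install commands in this README; `tool_report` is a hypothetical helper:

```python
import shutil

# Executable names taken from the install commands in this README.
TOOLS = {
    "dcm2niix": "DICOM conversion",
    "bids-validator": "BIDS validation",
    "ollama": "local Qwen models (optional)",
}


def tool_report(which=shutil.which):
    """Map each tool name to its resolved path, or None if it is not on PATH."""
    return {name: which(name) for name in TOOLS}


if __name__ == "__main__":
    for name, path in tool_report().items():
        print(f"{name:15} {path or 'NOT FOUND'}  ({TOOLS[name]})")
```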

Current Status

Version: 0.6.0

Tested datasets:

  • Visible Human Project (flat structure, DICOM CT)
  • CamCAN (hierarchical, multi-site, 30+ subjects)
  • 1-FRESH-Motor (fNIRS, existing BIDS format)
  • fNIRS tinnitus dataset (flat structure, .nirs files)

Known limitations:

  • Mixed modality classification (Stage 3) is experimental
  • .mat fNIRS conversion assumes Homer3-compatible variable naming
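
Given the second limitation, it may be worth inspecting a `.mat` file's variable names before conversion. The sketch below checks for the variables conventionally found in Homer `.nirs` files (`d`, `t`, `SD`, `s`). Which names autobidsify actually requires is not stated here, so the checklist is an assumption, as are the function names. Reading the file relies on SciPy's `whosmat`, which does not handle MATLAB v7.3 (HDF5) files:

```python
# Assumed checklist, based on the Homer .nirs convention; not confirmed for autobidsify.
HOMER_VARS = ("d", "t", "SD", "s")


def missing_homer_vars(var_names, required=HOMER_VARS):
    """Return the required variable names that are absent from var_names."""
    present = set(var_names)
    return [v for v in required if v not in present]


def mat_variable_names(path):
    """List top-level variable names in a MAT-file without loading the data."""
    from scipy.io import whosmat  # imported lazily so the checker works without SciPy

    return [name for name, _shape, _dtype in whosmat(path, appendmat=False)]


# Example call (requires SciPy and a real file):
# print(missing_homer_vars(mat_variable_names("recording.mat")))
```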

Contributing

We need YOUR datasets to improve robustness. Please test and report issues at: https://github.com/cotilab/autobidsify/issues
