Clinical ML Pipeline Toolkit — bridging raw clinical data and production-ready ML pipelines

clinops 🏥

Clinical ML Pipeline Toolkit — production-grade data loading, preprocessing, and time-series feature engineering for healthcare AI research.

PyPI version | Python 3.12+ | License: Apache 2.0 | Tests | Code style: black


Every healthcare AI project starts with the same two weeks of plumbing: loading MIMIC-IV tables without hitting memory limits, clipping physiologically impossible values before they corrupt your model, normalizing glucose from mmol/L to mg/dL across sites, building time-series windows that handle clinical missingness correctly, and splitting data without leaking patients across folds. clinops packages those hard-won patterns into a single, well-tested library so your first notebook is actual science.

Built from production experience in clinical and genomic data engineering across multi-cloud environments.


v0.1 Modules

| Module | What it does |
| --- | --- |
| clinops.ingest | Loaders for MIMIC-IV, FHIR R4, and flat CSV/Parquet with schema validation. Includes MimicTableLoader with pre-built schemas for the five tables researchers always need. |
| clinops.temporal | Sliding/tumbling windows, gap-aware imputation, lag features, cohort alignment |
| clinops.preprocess | Outlier clipping with physiological bounds, unit normalization (mg/dL ↔ mmol/L etc.), ICD-9→10 mapping |
| clinops.split | Temporal, patient-level, and stratified patient train/test splitting |

Roadmap: clinops.monitor (drift detection, data quality) and clinops.orchestrate (GCS/S3, Step Functions) are planned for v0.2.


Quickstart

pip install clinops

clinops.ingest

MimicTableLoader — pre-built schemas, no manual ColumnSpec required

MimicTableLoader wraps MimicLoader and exposes the five MIMIC-IV tables researchers use in every project with fully validated schemas out of the box. No ColumnSpec definitions, no schema boilerplate.

from clinops.ingest import MimicTableLoader

tbl = MimicTableLoader("/data/mimic-iv-2.2")

# ICU vitals — charttime parsed as datetime automatically
charts = tbl.chartevents(subject_ids=[10000032, 10000980])

# Lab results: reference range columns are dropped by default (sparse in most exports);
# with_ref_range=True keeps them
labs = tbl.labevents(subject_ids=[10000032], with_ref_range=True)

# Hospital admissions with mortality flag
adm = tbl.admissions(subject_ids=[10000032])

# ICD-9/10 diagnoses — primary_only keeps only seq_num == 1
dx = tbl.diagnoses_icd(subject_ids=[10000032], primary_only=True)

# ICU stays — with_los_band adds <1d / 1-3d / 3-7d / >7d column
stays = tbl.icustays(subject_ids=[10000032], with_los_band=True)

Audit a new MIMIC download without loading full tables into memory:

tbl.summary()
#          table  rows_sampled  columns  null_rate_pct
#    chartevents         10000       23           8.41
#      labevents         10000       12           4.17
#     admissions         10000       15           6.02
#  diagnoses_icd         10000        5           0.00
#       icustays         10000        8           2.31

MimicLoader — full control

For advanced filtering and chunk-based loading of large tables, use MimicLoader directly:

from clinops.ingest import MimicLoader

loader = MimicLoader("/data/mimic-iv-2.2")

charts = loader.chartevents(
    subject_ids=[10000032, 10000980],
    start_time="2150-01-01",
    end_time="2150-01-10",
)
labs  = loader.labevents(subject_ids=[10000032, 10000980])
stays = loader.icustays(subject_ids=[10000032, 10000980])

Load FHIR R4 resources

from clinops.ingest import FHIRLoader

loader   = FHIRLoader("/data/fhir_export")
obs      = loader.observations(category="vital-signs")
patients = loader.patients()

Validate any flat clinical export

from clinops.ingest import FlatFileLoader, ClinicalSchema, ColumnSpec

schema = ClinicalSchema(
    name="vitals",
    columns=[
        ColumnSpec("subject_id", nullable=False),
        ColumnSpec("heart_rate", min_value=0,  max_value=300),
        ColumnSpec("spo2",       min_value=50, max_value=100),
    ]
)
df = FlatFileLoader("vitals.csv", schema=schema).load()
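
To make the kind of check such a schema expresses concrete, here is a minimal, dependency-free sketch. `RangeRule` and `find_violations` are hypothetical names used only for illustration; they are not part of the clinops API, and the real ColumnSpec may behave differently (for example in how it reports or handles violations).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RangeRule:
    """Hypothetical stand-in for a column rule; not the clinops ColumnSpec."""
    name: str
    nullable: bool = True
    min_value: Optional[float] = None
    max_value: Optional[float] = None


def find_violations(rows, rules):
    """Return (row_index, column, reason) for every value that breaks a rule."""
    problems = []
    for i, row in enumerate(rows):
        for rule in rules:
            value = row.get(rule.name)
            if value is None:
                if not rule.nullable:
                    problems.append((i, rule.name, "null not allowed"))
                continue
            if rule.min_value is not None and value < rule.min_value:
                problems.append((i, rule.name, "below minimum"))
            if rule.max_value is not None and value > rule.max_value:
                problems.append((i, rule.name, "above maximum"))
    return problems


rules = [
    RangeRule("subject_id", nullable=False),
    RangeRule("heart_rate", min_value=0, max_value=300),
    RangeRule("spo2", min_value=50, max_value=100),
]
rows = [
    {"subject_id": 1, "heart_rate": 82, "spo2": 97},
    {"subject_id": None, "heart_rate": 410, "spo2": 42},
]
violations = find_violations(rows, rules)
```

Collecting every violation, rather than stopping at the first, makes it easier to audit a new export in one pass.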

clinops.preprocess

Clip physiologically impossible values

Standard statistical outlier methods (z-score, IQR) are a poor fit for clinical data: a heart rate of 180 in a patient with SVT is clinically meaningful and should not be removed. ClinicalOutlierClipper instead uses published physiological bounds to catch values that are impossible regardless of patient state, then clips, nulls, or flags them.

from clinops.preprocess import ClinicalOutlierClipper

clipper = ClinicalOutlierClipper(action="clip")  # or "null" or "flag"
clean_df = clipper.fit_transform(vitals_df)

print(clipper.report())
#      column  low_outliers  high_outliers  pct_outliers  bound_low  bound_high
#  heart_rate             0              3         0.012          0         300
#        spo2             1              0         0.004         50         100

Built-in bounds cover 20 vitals and labs (heart_rate, spo2, sbp, glucose, creatinine, ph, wbc, and more). Add site-specific ranges with add_bounds().
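
The difference between the three actions can be sketched in plain Python. This is an illustration under assumptions, not the library's implementation: the semantics of "clip", "null", and "flag" are inferred from their names, `apply_bounds` is a hypothetical helper, and the two bounds are the ones shown in the example report above.

```python
import math

# Illustrative bounds taken from the example report above; not the full built-in table.
BOUNDS = {"heart_rate": (0.0, 300.0), "spo2": (50.0, 100.0)}


def apply_bounds(value, low, high, action="clip"):
    """Handle one value against [low, high]; returns (new_value, was_outlier)."""
    if value is None or math.isnan(value):
        return value, False  # missing values pass through untouched
    if low <= value <= high:
        return value, False
    if action == "clip":
        return min(max(value, low), high), True
    if action == "null":
        return math.nan, True
    if action == "flag":
        return value, True  # keep the value; the caller records the flag
    raise ValueError(f"unknown action: {action!r}")


heart_rates = [72.0, 180.0, 999.0, math.nan]
low, high = BOUNDS["heart_rate"]
clipped = [apply_bounds(v, low, high, "clip")[0] for v in heart_rates]
flags = [apply_bounds(v, low, high, "flag")[1] for v in heart_rates]
```

Note that the heart rate of 180 passes through unchanged: it is unusual but physiologically possible, which is exactly the case a z-score filter would mishandle.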

Normalize units across sites

Multi-site studies routinely mix mg/dL and mmol/L for the same lab, or °F and °C for temperature. UnitNormalizer detects non-standard units via a companion unit column and converts in-place.

from clinops.preprocess import UnitNormalizer

# df has a "glucose" column and a "glucose_unit" column (mixed "mg/dL" / "mmol/L")
normalizer = UnitNormalizer(column_unit_map={"glucose": "glucose_unit"})
df = normalizer.transform(df)
# All glucose values now in mg/dL; glucose_unit column updated

print(normalizer.report())
#   column from_unit to_unit  n_converted
#  glucose    mmol/L   mg/dL          142

The registry ships with 30 conversions covering glucose, creatinine, bilirubin, haemoglobin, calcium, temperature, weight, and height.
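
The idea of driving conversion from a companion unit column can be sketched as follows. The registry layout and the `normalize` helper are hypothetical and only illustrate the technique. The glucose factor follows from its molar mass (about 180.16 g/mol, so 1 mmol/L is about 18.016 mg/dL); the factor clinops itself uses is not stated here and may be rounded differently.

```python
# (column, from_unit) -> function returning the value in the target unit.
CONVERSIONS = {
    ("glucose", "mmol/L"): lambda v: v * 18.016,
    ("temperature", "°F"): lambda v: (v - 32.0) * 5.0 / 9.0,
}
TARGET_UNITS = {"glucose": "mg/dL", "temperature": "°C"}


def normalize(rows, column, unit_column):
    """Convert rows whose unit has a registered conversion; return count converted."""
    n_converted = 0
    for row in rows:
        convert = CONVERSIONS.get((column, row[unit_column]))
        if convert is not None and row[column] is not None:
            row[column] = convert(row[column])
            row[unit_column] = TARGET_UNITS[column]
            n_converted += 1
    return n_converted


rows = [
    {"glucose": 5.5, "glucose_unit": "mmol/L"},
    {"glucose": 99.0, "glucose_unit": "mg/dL"},
]
n_converted = normalize(rows, "glucose", "glucose_unit")
```

Rows already in the target unit have no registered conversion and are left untouched, so running the function twice does not convert a value twice.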

Harmonize ICD-9 and ICD-10 codes

MIMIC-III uses ICD-9, MIMIC-IV mixes both versions, and many real-world datasets span the October 2015 transition. ICDMapper converts ICD-9-CM codes to ICD-10-CM and adds chapter-level groupings for ML features.

from clinops.preprocess import ICDMapper

mapper = ICDMapper()

# Map a mixed-version DataFrame to ICD-10 in-place
df = mapper.harmonize(df, code_col="icd_code", version_col="icd_version")

# Add chapter-level grouping (e.g. "Diseases of the circulatory system")
df["chapter"] = mapper.chapter_series(df["icd_code"])

# Map a single code
mapper.map_code("4280")   # → "I509"

Ships with ~60 curated high-frequency mappings. Load the full CMS GEM file (~72,000 mappings) with ICDMapper.from_gem_file(path).
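
As a rough illustration of what harmonization involves, the sketch below maps mixed-version codes to ICD-10 and derives a chapter label. The helper names are hypothetical, the mapping table holds only the single pair from the example above, and the letter-based chapter lookup is a simplification: real ICD-10 chapters are defined by code ranges, and some letters (such as D and H) span two chapters.

```python
# One mapping from the example above: ICD-9-CM 428.0 -> ICD-10-CM I50.9.
ICD9_TO_ICD10 = {"4280": "I509"}
CHAPTER_BY_LETTER = {
    "I": "Diseases of the circulatory system",
    "J": "Diseases of the respiratory system",
}


def harmonize_code(code, version):
    """Return an ICD-10 code, or None when an ICD-9 code has no known mapping."""
    code = code.replace(".", "").upper()
    if version == 10:
        return code
    return ICD9_TO_ICD10.get(code)


def chapter(icd10_code):
    """Chapter name for an ICD-10 code, or None if it is not in the lookup."""
    if not icd10_code:
        return None
    return CHAPTER_BY_LETTER.get(icd10_code[0])


records = [("428.0", 9), ("J189", 10), ("99999", 9)]
harmonized = [harmonize_code(c, v) for c, v in records]
chapters = [chapter(c) for c in harmonized]
```

Returning None for unmapped codes keeps gaps visible instead of silently passing an ICD-9 code through as if it were ICD-10.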


clinops.temporal

Build temporal feature windows

from clinops.temporal import TemporalWindower, ImputationStrategy

windower = TemporalWindower(
    window_hours=24,
    step_hours=6,
    imputation=ImputationStrategy.FORWARD_FILL,
    min_observations=3,
)

windows = windower.fit_transform(
    df=charts,
    id_col="subject_id",
    time_col="charttime",
    feature_cols=["heart_rate", "spo2", "resp_rate", "map"],
)
# → DataFrame: subject_id | window_start | window_end | heart_rate | spo2 | ...
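
For intuition about how window_hours, step_hours, and min_observations interact, here is a small standard-library sketch for a single patient. It is an assumption-laden illustration, not the TemporalWindower implementation: `sliding_windows` is a hypothetical helper, windows are taken as half-open intervals starting at the first observation, and sparse windows are simply skipped.

```python
from datetime import datetime, timedelta


def sliding_windows(times, window_hours, step_hours, min_observations=1):
    """Yield (start, end, indices) for half-open windows [start, end).

    `times` must be sorted ascending. Windows with fewer than
    `min_observations` points are skipped.
    """
    if not times:
        return
    window = timedelta(hours=window_hours)
    step = timedelta(hours=step_hours)
    start = times[0]
    while start <= times[-1]:
        end = start + window
        idx = [i for i, t in enumerate(times) if start <= t < end]
        if len(idx) >= min_observations:
            yield start, end, idx
        start += step


t0 = datetime(2150, 1, 1)
# One observation every 4 hours over two days (12 points).
times = [t0 + timedelta(hours=4 * k) for k in range(12)]
windows = list(
    sliding_windows(times, window_hours=24, step_hours=6, min_observations=3)
)
```

With a 6-hour step and a 24-hour window, consecutive windows overlap by 18 hours; setting step_hours equal to window_hours would give non-overlapping (tumbling) windows.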

Long-format input (MIMIC native itemid × valuenum)

windows = windower.fit_transform(
    df=charts,
    id_col="subject_id",
    time_col="charttime",
    item_col="itemid",    # auto-pivots to wide format
    value_col="valuenum",
)
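
The long-to-wide pivot can be pictured with a few lines of plain Python. This is only a sketch of the reshaping step: `pivot_wide` is a hypothetical helper, the itemid values are arbitrary placeholders rather than real MIMIC item IDs, and how clinops resolves duplicate measurements at the same timestamp is not specified here.

```python
from collections import defaultdict

# Long format: one row per (patient, time, item, value).
long_rows = [
    (10000032, "2150-01-01 08:00", 1, 88.0),
    (10000032, "2150-01-01 08:00", 2, 97.0),
    (10000032, "2150-01-01 09:00", 1, 92.0),
]


def pivot_wide(rows):
    """Group values by (subject_id, charttime); each itemid becomes a key."""
    wide = defaultdict(dict)
    for subject_id, charttime, itemid, valuenum in rows:
        # If an item repeats at the same timestamp, the last value wins here;
        # a real pipeline would choose an explicit aggregation (mean, last, ...).
        wide[(subject_id, charttime)][itemid] = valuenum
    return dict(wide)


wide = pivot_wide(long_rows)
```

Items that were not charted at a given time are simply absent from that row, which is where the missingness handled by the imputation strategies comes from.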

Add lag and rolling features

from clinops.temporal import LagFeatureBuilder

enriched = LagFeatureBuilder(
    lags=[1, 2, 4],
    rolling_windows=[4, 8],
    id_col="subject_id",
).fit_transform(windows)
# Adds: heart_rate_lag1, heart_rate_roll4_mean, heart_rate_roll4_std, ...
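
What a lag and a rolling mean compute can be shown on a short series. The helpers below are hypothetical and operate on one patient's windows in time order; whether LagFeatureBuilder requires a full window before emitting a rolling value is an assumption made here for simplicity.

```python
from statistics import mean


def lag(values, k):
    """Value k steps earlier; None where no earlier value exists."""
    return [None] * min(k, len(values)) + values[: max(len(values) - k, 0)]


def rolling_mean(values, window):
    """Mean over the current and previous window-1 values; None until full."""
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)
        else:
            out.append(mean(values[i + 1 - window : i + 1]))
    return out


# Window-level heart rate for ONE patient, in time order. With several patients,
# apply these per patient so values never cross patient boundaries.
heart_rate = [80.0, 84.0, 90.0, 86.0, 82.0]
heart_rate_lag1 = lag(heart_rate, 1)
heart_rate_roll4_mean = rolling_mean(heart_rate, 4)
```

Grouping by the patient identifier before shifting is the important part: without it, the first window of one patient would inherit the last value of the previous patient.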

Align a cohort to an anchor event (e.g. ICU admission)

from clinops.temporal import CohortAligner

aligned = CohortAligner(
    anchor_col="intime",
    max_hours_before=0,
    max_hours_after=48,
).align(events_df=charts, anchor_df=stays)
# → filtered to 48h post-admission, with hours_from_anchor column
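
The alignment step amounts to computing each event's offset from the anchor and filtering on it. The sketch below assumes one anchor per patient, which is a simplification since a patient can have several ICU stays; `align_to_anchor` is a hypothetical helper and not the CohortAligner implementation.

```python
from datetime import datetime


def align_to_anchor(events, anchors, max_hours_before=0, max_hours_after=48):
    """Keep events inside the window around each patient's anchor time.

    events:  list of (subject_id, event_time, value)
    anchors: dict subject_id -> anchor_time (e.g. ICU intime)
    Returns (subject_id, hours_from_anchor, value) tuples.
    """
    aligned = []
    for subject_id, event_time, value in events:
        anchor = anchors.get(subject_id)
        if anchor is None:
            continue  # no anchor event for this patient
        hours = (event_time - anchor).total_seconds() / 3600.0
        if -max_hours_before <= hours <= max_hours_after:
            aligned.append((subject_id, hours, value))
    return aligned


anchors = {10000032: datetime(2150, 1, 1, 12, 0)}
events = [
    (10000032, datetime(2150, 1, 1, 10, 0), 75.0),  # 2h before admission
    (10000032, datetime(2150, 1, 1, 18, 0), 88.0),  # 6h after admission
    (10000032, datetime(2150, 1, 4, 12, 0), 91.0),  # 72h after admission
]
aligned = align_to_anchor(events, anchors)
```

Expressing time as hours from the anchor puts every patient on the same clock, so "hour 6" means the same thing across the cohort.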

Imputation strategies

Clinical data has unique missingness patterns that standard ML windowing gets wrong. clinops provides strategies tuned for clinical context:

| Strategy | Best for |
| --- | --- |
| FORWARD_FILL | Slowly-changing vitals: carry last observation forward |
| BACKWARD_FILL | Values recorded with lag |
| LINEAR | Continuous signals with regular sampling |
| MEAN / MEDIAN | Fit on training set, apply to test (no leakage) |
| INDICATOR | Adds {col}_missing binary column so the model can learn from missingness |
| NONE | Leave NaN in place |

from clinops.temporal import Imputer, ImputationStrategy

imputer = Imputer(ImputationStrategy.MEAN, per_patient=True, id_col="subject_id")
imputer.fit(train_windows)
test_windows = imputer.transform(test_windows)
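
Three of the strategies can be sketched in a few lines to show why the fit/transform separation matters. The helper names are hypothetical and the code works on plain lists; it is an illustration of the techniques, not the Imputer implementation.

```python
from statistics import mean


def forward_fill(values):
    """Carry the last observed value forward; leading gaps stay None."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled


def missing_indicator(values):
    """1 where the original value was missing, else 0."""
    return [1 if v is None else 0 for v in values]


def fit_mean(train_values):
    """Learn the fill value from training data only."""
    return mean(v for v in train_values if v is not None)


def fill_with(values, fill_value):
    """Replace missing entries with a previously fitted value."""
    return [fill_value if v is None else v for v in values]


train = [80.0, None, 90.0, 100.0]
test = [None, 70.0, None]

train_mean = fit_mean(train)  # statistics come from train only
test_filled = fill_with(test, train_mean)
test_ffill = forward_fill(test)
test_missing = missing_indicator(test)
```

The test set is filled with 90.0, the training mean, even though its own observed mean is 70.0; computing the fill value on the test set would leak information about it into preprocessing.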

clinops.split

scikit-learn's train_test_split, with its default random row-level shuffle, is a poor fit for clinical ML: it leaks future observations into training and splits a patient's rows across train and test, so the model memorises patient-specific patterns rather than generalising.

Temporal split — no future leakage

from clinops.split import TemporalSplitter

result = TemporalSplitter(cutoff="2155-01-01", time_col="charttime").split(df)
# or auto-compute cutoff from the data:
result = TemporalSplitter(train_frac=0.8, time_col="charttime").split(df)

print(result.summary())
# Train: 38,400 rows (80.0%)
# Test:   9,600 rows (20.0%)
# cutoff: 2155-01-01 00:00:00
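
A cutoff-based split, including deriving the cutoff from a training fraction, can be sketched as below. `temporal_split` is a hypothetical helper; in particular, taking the cutoff as the timestamp at the train_frac position of the sorted rows is one reasonable choice and may not match how TemporalSplitter computes it.

```python
from datetime import datetime, timedelta


def temporal_split(rows, time_key, cutoff=None, train_frac=None):
    """Rows strictly before `cutoff` go to train, the rest to test.

    If `cutoff` is None, it is the timestamp found at the `train_frac`
    position of the time-sorted rows.
    """
    ordered = sorted(rows, key=lambda r: r[time_key])
    if not ordered:
        return [], [], cutoff
    if cutoff is None:
        if train_frac is None or not 0 < train_frac < 1:
            raise ValueError("provide cutoff or a train_frac in (0, 1)")
        cutoff = ordered[int(len(ordered) * train_frac)][time_key]
    train = [r for r in ordered if r[time_key] < cutoff]
    test = [r for r in ordered if r[time_key] >= cutoff]
    return train, test, cutoff


t0 = datetime(2150, 1, 1)
rows = [{"charttime": t0 + timedelta(hours=h), "hr": 80 + h} for h in range(10)]
train, test, cutoff = temporal_split(rows, "charttime", train_frac=0.8)
```

Because rows sharing the cutoff timestamp all land in the test set, the realised train fraction can be slightly below the requested one when timestamps repeat.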

Patient-level split — no leakage across admissions

from clinops.split import PatientSplitter

result = PatientSplitter(id_col="subject_id", test_size=0.2).split(df)

# Guaranteed: no patient appears in both splits
assert not set(result.train["subject_id"]) & set(result.test["subject_id"])
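
The guarantee comes from shuffling patient identifiers rather than rows. A minimal sketch of that idea, using only the standard library and a hypothetical `patient_split` helper (the seed parameter is an assumption added for reproducibility):

```python
import random


def patient_split(rows, id_key, test_size=0.2, seed=0):
    """Assign whole patients to train or test, never individual rows."""
    patient_ids = sorted({r[id_key] for r in rows})
    rng = random.Random(seed)
    rng.shuffle(patient_ids)
    n_test = max(1, round(len(patient_ids) * test_size))
    test_ids = set(patient_ids[:n_test])
    train = [r for r in rows if r[id_key] not in test_ids]
    test = [r for r in rows if r[id_key] in test_ids]
    return train, test


# 10 patients with 3 rows each.
rows = [{"subject_id": pid, "visit": v} for pid in range(10) for v in range(3)]
train, test = patient_split(rows, "subject_id", test_size=0.2)
```

Here test_size is a fraction of patients, not rows, so the row-level proportions can drift from it when patients contribute very different numbers of rows.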

Stratified patient split — preserves outcome rate

Critical for imbalanced clinical endpoints (in-hospital mortality is typically 5–15%). Stratifies on a binary outcome while respecting patient boundaries.

from clinops.split import StratifiedPatientSplitter

result = StratifiedPatientSplitter(
    id_col="subject_id",
    outcome_col="hospital_expire_flag",
    test_size=0.2,
).split(df)

print(result.summary())
# Train: 32,000 rows (80.0%)
# Test:   8,000 rows (20.0%)
# population_outcome_rate: 0.0821
# train_outcome_rate:      0.0819
# test_outcome_rate:       0.0826
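
Stratifying at the patient level can be sketched by splitting each outcome group separately and then assigning rows by patient. The helper below is hypothetical; it assumes a binary outcome and labels a patient positive if any of their rows is positive, which may differ from the rule StratifiedPatientSplitter applies.

```python
import random


def stratified_patient_split(rows, id_key, outcome_key, test_size=0.2, seed=0):
    """Split patients within each outcome group, then assign their rows."""
    # Patient-level label: positive if ANY row for the patient is positive.
    label = {}
    for r in rows:
        label[r[id_key]] = max(label.get(r[id_key], 0), r[outcome_key])
    rng = random.Random(seed)
    test_ids = set()
    for outcome in (0, 1):
        group = sorted(pid for pid, y in label.items() if y == outcome)
        rng.shuffle(group)
        test_ids.update(group[: round(len(group) * test_size)])
    train = [r for r in rows if r[id_key] not in test_ids]
    test = [r for r in rows if r[id_key] in test_ids]
    return train, test


def outcome_rate(split, outcome_key="hospital_expire_flag"):
    """Fraction of rows in the split with a positive outcome."""
    return sum(r[outcome_key] for r in split) / len(split)


# 100 patients, 2 rows each; patients 0-9 have the outcome (10% rate).
rows = [
    {"subject_id": pid, "hospital_expire_flag": int(pid < 10)}
    for pid in range(100)
    for _ in range(2)
]
train, test = stratified_patient_split(rows, "subject_id", "hospital_expire_flag")
```

With a purely random patient split, a rare outcome can by chance be over- or under-represented in a small test set; splitting each outcome group separately keeps the two rates aligned.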

Installation

Requires Python 3.12+.

pip install clinops             # core
pip install "clinops[fhir]"     # adds FHIR R4 loader
pip install "clinops[gcp]"      # adds GCP extras (for v0.2)
pip install -e ".[dev]"         # development

Supported sources

| Source | Format |
| --- | --- |
| MIMIC-IV v2.0–v2.2 | CSV, CSV.GZ, Parquet |
| FHIR R4 | JSON Bundle, NDJSON |
| Flat files | CSV, CSV.GZ, Parquet |

Contributing

See CONTRIBUTING.md. Run pytest tests/ -v and ruff check clinops/ before opening a PR.


Citation

@software{kasaraneni2026clinops,
  author  = {Kasaraneni, Chaitanya},
  title   = {clinops: Clinical ML Pipeline Toolkit},
  year    = {2026},
  url     = {https://github.com/chaitanyakasaraneni/clinops},
  version = {0.1.0}
}

A companion JOSS paper is in preparation.


License

Apache 2.0 — see LICENSE.
