Clinical ML Pipeline Toolkit — bridging raw clinical data and production-ready ML pipelines

clinops 🏥

Clinical ML Pipeline Toolkit — production-grade data loading, preprocessing, and time-series feature engineering for healthcare AI research.

Every healthcare AI project starts with the same two weeks of plumbing: loading MIMIC-IV tables without hitting memory limits, clipping physiologically impossible values before they corrupt your model, normalizing glucose from mmol/L to mg/dL across sites, building time-series windows that handle clinical missingness correctly, and splitting data without leaking patients across folds. clinops packages those hard-won patterns into a single, well-tested library so your first notebook is actual science.

Built from production experience in clinical and genomic data engineering across multi-cloud environments.

v0.1 Modules

Module	What it does
`clinops.ingest`	Loaders for MIMIC-IV, FHIR R4, and flat CSV/Parquet with schema validation. Includes `MimicTableLoader` with pre-built schemas for five commonly used tables: chartevents, labevents, admissions, diagnoses_icd, and icustays.
`clinops.temporal`	Sliding/tumbling windows, gap-aware imputation, lag features, cohort alignment
`clinops.preprocess`	Outlier clipping with physiological bounds, unit normalization (mg/dL ↔ mmol/L etc.), ICD-9→10 mapping
`clinops.split`	Temporal, patient-level, and stratified patient train/test splitting

Roadmap: clinops.monitor (drift detection, data quality) and clinops.orchestrate (GCS/S3, Step Functions) are planned for v0.2.

Quickstart

pip install clinops

clinops.ingest

MimicTableLoader — pre-built schemas, no manual ColumnSpec required

MimicTableLoader wraps MimicLoader and exposes the five MIMIC-IV tables researchers use in every project with fully validated schemas out of the box. No ColumnSpec definitions, no schema boilerplate.

from clinops.ingest import MimicTableLoader

tbl = MimicTableLoader("/data/mimic-iv-2.2")

# ICU vitals — charttime parsed as datetime automatically
charts = tbl.chartevents(subject_ids=[10000032, 10000980])

# Lab results: reference range columns are dropped by default; with_ref_range=True keeps them
labs = tbl.labevents(subject_ids=[10000032], with_ref_range=True)

# Hospital admissions with mortality flag
adm = tbl.admissions(subject_ids=[10000032])

# ICD-9/10 diagnoses — primary_only keeps only seq_num == 1
dx = tbl.diagnoses_icd(subject_ids=[10000032], primary_only=True)

# ICU stays — with_los_band adds <1d / 1-3d / 3-7d / >7d column
stays = tbl.icustays(subject_ids=[10000032], with_los_band=True)

Audit a new MIMIC download without loading full tables into memory:

tbl.summary()
#        table  rows_sampled  columns  null_rate_pct
#  chartevents         10000       23           8.41
#    labevents         10000       12           4.17
#   admissions         10000       15           6.02
# diagnoses_icd        10000        5           0.00
#     icustays         10000        8           2.31

MimicLoader — full control

For advanced filtering and chunk-based loading of large tables, use MimicLoader directly:

from clinops.ingest import MimicLoader

loader = MimicLoader("/data/mimic-iv-2.2")

charts = loader.chartevents(
    subject_ids=[10000032, 10000980],
    start_time="2150-01-01",
    end_time="2150-01-10",
)
labs  = loader.labevents(subject_ids=[10000032, 10000980])
stays = loader.icustays(subject_ids=[10000032, 10000980])
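
The chunk-based loading idea can be illustrated with plain pandas. The sketch below is not clinops code: `load_filtered` is a hypothetical helper, and the in-memory CSV stands in for a large chartevents export. It streams the file in chunks and keeps only the rows for the requested patients, so the full table never has to fit in memory.

```python
import io

import pandas as pd


def load_filtered(source, subject_ids, chunksize=100_000):
    """Stream a CSV in chunks, keeping only rows for the requested patients."""
    wanted = set(subject_ids)
    kept = []
    for chunk in pd.read_csv(source, chunksize=chunksize, parse_dates=["charttime"]):
        kept.append(chunk[chunk["subject_id"].isin(wanted)])
    return pd.concat(kept, ignore_index=True)


# Tiny in-memory stand-in for a chartevents export (all values are made up).
csv_text = """subject_id,charttime,itemid,valuenum
10000032,2150-01-01 08:00:00,220045,82
10000980,2150-01-01 09:00:00,220045,91
10000999,2150-01-01 10:00:00,220045,77
10000032,2150-01-02 08:00:00,220045,88
"""

# chunksize=2 forces two chunks so the streaming path is exercised.
charts = load_filtered(io.StringIO(csv_text), [10000032], chunksize=2)
print(charts)
```

Peak memory is bounded by the chunk size plus the rows that survive the filter, which is usually what matters for tables the size of chartevents.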

Load FHIR R4 resources

from clinops.ingest import FHIRLoader

loader   = FHIRLoader("/data/fhir_export")
obs      = loader.observations(category="vital-signs")
patients = loader.patients()

Validate any flat clinical export

from clinops.ingest import FlatFileLoader, ClinicalSchema, ColumnSpec

schema = ClinicalSchema(
    name="vitals",
    columns=[
        ColumnSpec("subject_id", nullable=False),
        ColumnSpec("heart_rate", min_value=0,  max_value=300),
        ColumnSpec("spo2",       min_value=50, max_value=100),
    ]
)
df = FlatFileLoader("vitals.csv", schema=schema).load()
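
To make the kind of check such a schema implies concrete, here is a minimal pandas sketch. `RangeRule` and `validate` are hypothetical names invented for this illustration and say nothing about how ClinicalSchema or ColumnSpec are implemented; the rules mirror the nullable and min/max constraints from the example above.

```python
from dataclasses import dataclass
from typing import Optional

import pandas as pd


@dataclass
class RangeRule:
    """Hypothetical stand-in for a column rule: nullability plus a value range."""
    column: str
    nullable: bool = True
    min_value: Optional[float] = None
    max_value: Optional[float] = None


def validate(df, rules):
    """Return a list of human-readable problems; an empty list means the frame passed."""
    problems = []
    for rule in rules:
        if rule.column not in df.columns:
            problems.append(f"{rule.column}: missing column")
            continue
        col = df[rule.column]
        if not rule.nullable and col.isna().any():
            problems.append(f"{rule.column}: {int(col.isna().sum())} null value(s)")
        if rule.min_value is not None and (col < rule.min_value).any():
            problems.append(f"{rule.column}: value(s) below {rule.min_value}")
        if rule.max_value is not None and (col > rule.max_value).any():
            problems.append(f"{rule.column}: value(s) above {rule.max_value}")
    return problems


rules = [
    RangeRule("subject_id", nullable=False),
    RangeRule("heart_rate", min_value=0, max_value=300),
    RangeRule("spo2", min_value=50, max_value=100),
]
bad = pd.DataFrame({"subject_id": [1, None], "heart_rate": [80, 350], "spo2": [98, 99]})
good = pd.DataFrame({"subject_id": [1, 2], "heart_rate": [80, 120], "spo2": [98, 99]})

problems = validate(bad, rules)
print(problems)
```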

clinops.preprocess

Clip physiologically impossible values

Standard statistical outlier methods (z-score, IQR) are a poor fit for clinical data: a heart rate of 180 in a patient with SVT is clinically meaningful and should not be removed. ClinicalOutlierClipper instead uses published physiological bounds and acts only on values that are impossible regardless of patient state.

from clinops.preprocess import ClinicalOutlierClipper

clipper = ClinicalOutlierClipper(action="clip")  # or "null" or "flag"
clean_df = clipper.fit_transform(vitals_df)

print(clipper.report())
#    column  low_outliers  high_outliers  pct_outliers  bound_low  bound_high
#  heart_rate             0              3         0.012          0         300
#        spo2             1              0         0.004         50         100

Built-in bounds cover 20 vitals and labs (heart_rate, spo2, sbp, glucose, creatinine, ph, wbc, and more). Add site-specific ranges with add_bounds().
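
The underlying technique is simple enough to sketch in plain pandas. The helper below is an illustration, not the library's implementation: `clip_to_bounds` is a hypothetical function, the bounds table holds only the two ranges shown in the report above, and the reading of the three actions (clip to the bound, replace with NaN, add a boolean `{col}_outlier` column) is an assumption.

```python
import pandas as pd

# Illustrative bounds only, taken from the report above.
BOUNDS = {"heart_rate": (0, 300), "spo2": (50, 100)}


def clip_to_bounds(df, bounds, action="clip"):
    """Apply fixed per-column bounds; nothing is estimated from the data."""
    out = df.copy()
    for col, (low, high) in bounds.items():
        if col not in out.columns:
            continue
        outside = (out[col] < low) | (out[col] > high)
        if action == "clip":
            out[col] = out[col].clip(lower=low, upper=high)
        elif action == "null":
            out[col] = out[col].mask(outside)
        elif action == "flag":
            out[f"{col}_outlier"] = outside  # hypothetical column name
        else:
            raise ValueError(f"unknown action: {action!r}")
    return out


vitals = pd.DataFrame({"heart_rate": [72, 180, 999], "spo2": [98, 45, 100]})
clipped = clip_to_bounds(vitals, BOUNDS, action="clip")
nulled = clip_to_bounds(vitals, BOUNDS, action="null")
flagged = clip_to_bounds(vitals, BOUNDS, action="flag")
print(clipped)
```

Note that the heart rate of 180 passes through untouched, while a z-score or IQR rule fitted on a mostly healthy cohort might well have removed it.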

Normalize units across sites

Multi-site studies routinely mix mg/dL and mmol/L for the same lab, or °F and °C for temperature. UnitNormalizer detects non-standard units via a companion unit column and converts in-place.

from clinops.preprocess import UnitNormalizer

# df has a "glucose" column and a "glucose_unit" column (mixed "mg/dL" / "mmol/L")
normalizer = UnitNormalizer(column_unit_map={"glucose": "glucose_unit"})
df = normalizer.transform(df)
# All glucose values now in mg/dL; glucose_unit column updated

print(normalizer.report())
#   column from_unit to_unit  n_converted
#  glucose    mmol/L   mg/dL          142

The registry ships with 30 conversions covering glucose, creatinine, bilirubin, haemoglobin, calcium, temperature, weight, and height.
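
For a single analyte, the conversion amounts to a masked multiply. The sketch below is a hand-rolled pandas illustration, not UnitNormalizer itself: `glucose_to_mgdl` is a hypothetical helper, and the factor follows from the molar mass of glucose (about 180.16 g/mol, so 1 mmol/L is about 18.016 mg/dL).

```python
import pandas as pd

# 1 mmol/L of glucose is about 18.016 mg/dL (molar mass about 180.16 g/mol).
MMOL_TO_MGDL_GLUCOSE = 18.016


def glucose_to_mgdl(df, value_col="glucose", unit_col="glucose_unit"):
    """Convert rows recorded in mmol/L to mg/dL and update the unit column."""
    out = df.copy()
    out[value_col] = out[value_col].astype(float)
    # Case-insensitive match; rows with a missing unit are left untouched.
    is_mmol = out[unit_col].str.lower() == "mmol/l"
    out.loc[is_mmol, value_col] = out.loc[is_mmol, value_col] * MMOL_TO_MGDL_GLUCOSE
    out.loc[is_mmol, unit_col] = "mg/dL"
    return out, int(is_mmol.sum())


labs = pd.DataFrame(
    {"glucose": [90.0, 5.0, 7.5], "glucose_unit": ["mg/dL", "mmol/L", "mmol/L"]}
)
converted, n_converted = glucose_to_mgdl(labs)
print(converted)
```

Conversion factors are analyte-specific because they depend on molar mass, which is why a registry keyed by analyte and unit pair is needed rather than one global factor.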

Harmonize ICD-9 and ICD-10 codes

MIMIC-III uses ICD-9, MIMIC-IV mixes both versions, and many real-world datasets span the October 2015 transition. ICDMapper converts ICD-9-CM codes to ICD-10-CM and adds chapter-level groupings for ML features.

from clinops.preprocess import ICDMapper

mapper = ICDMapper()

# Map a mixed-version DataFrame to ICD-10 in-place
df = mapper.harmonize(df, code_col="icd_code", version_col="icd_version")

# Add chapter-level grouping (e.g. "Diseases of the circulatory system")
df["chapter"] = mapper.chapter_series(df["icd_code"])

# Map a single code
mapper.map_code("4280")   # → "I509"

Ships with ~60 curated high-frequency mappings. Load the full CMS GEM file (~72,000 mappings) with ICDMapper.from_gem_file(path).
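
Conceptually, harmonization is a dictionary lookup applied only to the ICD-9 rows. The following pandas sketch assumes that reading and is not ICDMapper's implementation: `harmonize_icd` is a hypothetical helper, the three-entry lookup is illustrative, and leaving unmapped codes unchanged is a design choice of this sketch.

```python
import pandas as pd

# Tiny illustrative lookup; "4280" -> "I509" is the example used above.
ICD9_TO_ICD10 = {"4280": "I509", "4019": "I10", "25000": "E119"}


def harmonize_icd(df, code_col="icd_code", version_col="icd_version"):
    """Rewrite ICD-9 rows that have a known mapping; leave everything else as-is."""
    out = df.copy()
    is_v9 = out[version_col] == 9
    known = out.loc[is_v9, code_col].map(ICD9_TO_ICD10).dropna()
    out.loc[known.index, code_col] = known
    out.loc[known.index, version_col] = 10
    return out


dx = pd.DataFrame(
    {
        # "XXXX" is a placeholder for an ICD-9 code with no entry in the lookup.
        "icd_code": ["4280", "I10", "XXXX"],
        "icd_version": [9, 10, 9],
    }
)
harmonized = harmonize_icd(dx)
print(harmonized)
```

Keeping the version column accurate for unmapped rows makes coverage gaps easy to count before deciding whether the curated table is enough or the full GEM file is needed.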

clinops.temporal

Build temporal feature windows

from clinops.temporal import TemporalWindower, ImputationStrategy

windower = TemporalWindower(
    window_hours=24,
    step_hours=6,
    imputation=ImputationStrategy.FORWARD_FILL,
    min_observations=3,
)

windows = windower.fit_transform(
    df=charts,
    id_col="subject_id",
    time_col="charttime",
    feature_cols=["heart_rate", "spo2", "resp_rate", "map"],
)
# → DataFrame: subject_id | window_start | window_end | heart_rate | spo2 | ...
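
For readers who want to see the windowing mechanics, here is a simplified pandas sketch. It is an assumption-laden illustration rather than TemporalWindower's algorithm: `sliding_windows` is a hypothetical function, windows are anchored at each patient's first observation, features are aggregated by mean, and imputation is left out.

```python
import pandas as pd


def sliding_windows(df, id_col, time_col, feature_cols,
                    window_hours=24, step_hours=6, min_observations=3):
    """Per-patient sliding windows, aggregated by mean; sparse windows are dropped."""
    window = pd.Timedelta(hours=window_hours)
    step = pd.Timedelta(hours=step_hours)
    rows = []
    for pid, grp in df.sort_values(time_col).groupby(id_col):
        start, last = grp[time_col].min(), grp[time_col].max()
        while start <= last:
            end = start + window
            chunk = grp[(grp[time_col] >= start) & (grp[time_col] < end)]
            if len(chunk) >= min_observations:
                row = {id_col: pid, "window_start": start, "window_end": end}
                row.update(chunk[feature_cols].mean().to_dict())
                rows.append(row)
            start += step
    return pd.DataFrame(rows)


# Twelve hourly heart-rate readings for one made-up patient.
hours = list(range(12))
demo = pd.DataFrame({
    "subject_id": [1] * 12,
    "charttime": pd.Timestamp("2150-01-01") + pd.to_timedelta(hours, unit="h"),
    "heart_rate": [60.0 + h for h in hours],
})
demo_windows = sliding_windows(
    demo, "subject_id", "charttime", ["heart_rate"],
    window_hours=6, step_hours=3, min_observations=3,
)
print(demo_windows)
```

With a step shorter than the window, consecutive windows overlap (sliding); setting the step equal to the window gives non-overlapping (tumbling) windows.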

Long-format input (MIMIC native itemid × valuenum)

windows = windower.fit_transform(
    df=charts,
    id_col="subject_id",
    time_col="charttime",
    item_col="itemid",    # auto-pivots to wide format
    value_col="valuenum",
)
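
The long-to-wide step can be pictured as a pandas pivot. The sketch below uses `pivot_table` directly and is only an illustration of the reshaping; whether clinops pivots this way, and how it resolves duplicate readings at the same timestamp, is not stated in this document (the mean is used here as an assumption). The itemid values and readings are illustrative.

```python
import pandas as pd

long_df = pd.DataFrame({
    "subject_id": [1, 1, 1, 2],
    "charttime": pd.to_datetime([
        "2150-01-01 00:00", "2150-01-01 00:00",
        "2150-01-01 01:00", "2150-01-01 00:00",
    ]),
    "itemid": [220045, 220277, 220045, 220045],  # illustrative item identifiers
    "valuenum": [80.0, 97.0, 84.0, 70.0],
})

# One row per (patient, timestamp), one column per itemid.
wide = long_df.pivot_table(
    index=["subject_id", "charttime"],
    columns="itemid",
    values="valuenum",
    aggfunc="mean",
).reset_index()
wide.columns.name = None
print(wide)
```

Items that were not charted at a given timestamp come out as NaN, which is exactly the missingness the imputation strategies later in this section are meant to handle.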

Add lag and rolling features

from clinops.temporal import LagFeatureBuilder

enriched = LagFeatureBuilder(
    lags=[1, 2, 4],
    rolling_windows=[4, 8],
    id_col="subject_id",
).fit_transform(windows)
# Adds: heart_rate_lag1, heart_rate_roll4_mean, heart_rate_roll4_std, ...
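
The key detail in lag and rolling features is that they must be computed within each patient, never across patient boundaries. A minimal pandas sketch follows; `add_lag_and_rolling` is a hypothetical helper, the column naming imitates the pattern in the comment above, and rows are assumed to be sorted by time within each patient.

```python
import pandas as pd


def add_lag_and_rolling(df, id_col, feature_cols, lags=(1, 2), rolling_windows=(2,)):
    """Add per-patient lag and rolling mean/std columns (rows pre-sorted by time)."""
    out = df.copy()
    grouped = df.groupby(id_col, sort=False)
    for col in feature_cols:
        for lag in lags:
            out[f"{col}_lag{lag}"] = grouped[col].shift(lag)
        for w in rolling_windows:
            out[f"{col}_roll{w}_mean"] = grouped[col].transform(
                lambda s, w=w: s.rolling(w, min_periods=1).mean()
            )
            out[f"{col}_roll{w}_std"] = grouped[col].transform(
                lambda s, w=w: s.rolling(w, min_periods=2).std()
            )
    return out


demo = pd.DataFrame({
    "subject_id": [1, 1, 1, 2, 2],
    "heart_rate": [60.0, 70.0, 80.0, 100.0, 110.0],
})
demo_enriched = add_lag_and_rolling(demo, "subject_id", ["heart_rate"])
print(demo_enriched)
```

The first row of patient 2 gets NaN for `heart_rate_lag1` rather than patient 1's last value, which is the behaviour a global `shift` would get wrong.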

Align a cohort to an anchor event (e.g. ICU admission)

from clinops.temporal import CohortAligner

aligned = CohortAligner(
    anchor_col="intime",
    max_hours_before=0,
    max_hours_after=48,
).align(events_df=charts, anchor_df=stays)
# → filtered to 48h post-admission, with hours_from_anchor column
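
Anchor alignment boils down to a join followed by a time-difference filter. The pandas sketch below shows that idea under simplifying assumptions: `align_to_anchor` is a hypothetical helper, each patient has exactly one anchor row, and both ends of the window are inclusive. It is not CohortAligner's implementation.

```python
import pandas as pd


def align_to_anchor(events, anchors, id_col, time_col, anchor_col,
                    max_hours_before=0, max_hours_after=48):
    """Join events to each patient's anchor time and keep those inside the window."""
    merged = events.merge(anchors[[id_col, anchor_col]], on=id_col, how="inner")
    delta = merged[time_col] - merged[anchor_col]
    merged["hours_from_anchor"] = delta.dt.total_seconds() / 3600.0
    keep = merged["hours_from_anchor"].between(-max_hours_before, max_hours_after)
    return merged.loc[keep].reset_index(drop=True)


demo_stays = pd.DataFrame({
    "subject_id": [1],
    "intime": pd.to_datetime(["2150-01-01 00:00"]),
})
demo_events = pd.DataFrame({
    "subject_id": [1, 1, 1, 1],
    "charttime": pd.to_datetime([
        "2149-12-31 22:00",  # 2 h before admission
        "2150-01-01 01:00",  # 1 h after
        "2150-01-02 23:00",  # 47 h after
        "2150-01-03 12:00",  # 60 h after
    ]),
    "heart_rate": [88.0, 92.0, 75.0, 70.0],
})
demo_aligned = align_to_anchor(demo_events, demo_stays, "subject_id", "charttime", "intime")
print(demo_aligned)
```

Patients with several ICU stays would need a stay-level key in the join; otherwise every event is matched against every stay.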

Imputation strategies

Clinical data has unique missingness patterns that standard ML windowing gets wrong. clinops provides strategies tuned for clinical context:

Strategy	Best for
`FORWARD_FILL`	Slowly-changing vitals — carry last observation forward
`BACKWARD_FILL`	Values recorded with lag
`LINEAR`	Continuous signals with regular sampling
`MEAN` / `MEDIAN`	Fit on training set, apply to test (no leakage)
`INDICATOR`	Adds `{col}_missing` binary column — lets model learn from missingness
`NONE`	Leave NaN in place

from clinops.temporal import Imputer, ImputationStrategy

imputer = Imputer(ImputationStrategy.MEAN, per_patient=True, id_col="subject_id")
imputer.fit(train_windows)
test_windows = imputer.transform(test_windows)
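
Two of these strategies are easy to show in plain pandas. The sketch below is an illustration with hypothetical helper names, not the Imputer class: forward fill is done per patient with a `{col}_missing` indicator recorded before filling, and mean imputation learns its statistics on the training frame only and then applies them to the test frame.

```python
import pandas as pd


def ffill_with_indicator(df, id_col, cols):
    """Forward fill within each patient, recording what was originally missing."""
    out = df.copy()
    for col in cols:
        out[f"{col}_missing"] = out[col].isna().astype(int)
        out[col] = out.groupby(id_col)[col].ffill()
    return out


def fit_means(train, cols):
    """Learn column means on the training data only."""
    return train[cols].mean()


def apply_means(df, means):
    """Fill missing values with previously learned means."""
    return df.fillna(means)


vitals = pd.DataFrame({
    "subject_id": [1, 1, 2, 2],
    "heart_rate": [60.0, None, None, 100.0],
})
filled = ffill_with_indicator(vitals, "subject_id", ["heart_rate"])

train = pd.DataFrame({"subject_id": [1, 2, 3], "heart_rate": [60.0, None, 80.0]})
test = pd.DataFrame({"subject_id": [4, 5], "heart_rate": [None, 90.0]})
means = fit_means(train, ["heart_rate"])
test_filled = apply_means(test, means)
print(filled)
print(test_filled)
```

Patient 2's leading gap stays NaN after the forward fill because there is no earlier observation for that patient, and nothing is borrowed from patient 1.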

clinops.split

A plain random split with sklearn.model_selection.train_test_split is unsafe for clinical ML: it can leak future observations into training, and it can place rows from the same patient in both train and test, so the model memorises patient-specific patterns rather than generalising.

Temporal split — no future leakage

from clinops.split import TemporalSplitter

result = TemporalSplitter(cutoff="2155-01-01", time_col="charttime").split(df)
# or auto-compute cutoff from the data:
result = TemporalSplitter(train_frac=0.8, time_col="charttime").split(df)

print(result.summary())
# Train: 38,400 rows (80.0%)
# Test:   9,600 rows (20.0%)
# cutoff: 2155-01-01 00:00:00
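
A temporal split is just a comparison against a cutoff timestamp. The pandas sketch below illustrates both modes under stated assumptions: `temporal_split` is a hypothetical function, rows strictly before the cutoff go to train and the rest to test, and the automatic cutoff is taken as the timestamp at the `train_frac` position of the sorted times. TemporalSplitter may define these details differently.

```python
import pandas as pd


def temporal_split(df, time_col, cutoff=None, train_frac=0.8):
    """Rows before the cutoff go to train; rows at or after it go to test."""
    if cutoff is None:
        ordered = df[time_col].sort_values()
        position = min(int(len(ordered) * train_frac), len(ordered) - 1)
        cutoff = ordered.iloc[position]
    cutoff = pd.Timestamp(cutoff)
    train = df[df[time_col] < cutoff]
    test = df[df[time_col] >= cutoff]
    return train, test, cutoff


# Ten hourly observations (made-up values).
hours = list(range(10))
events = pd.DataFrame({
    "charttime": pd.Timestamp("2150-01-01") + pd.to_timedelta(hours, unit="h"),
    "heart_rate": [70.0 + h for h in hours],
})
train, test, cutoff = temporal_split(events, "charttime", train_frac=0.8)
print(len(train), len(test), cutoff)
```

When many rows share the cutoff timestamp, the realised train fraction can fall below `train_frac`, because tied rows all land on the test side.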

Patient-level split — no leakage across admissions

from clinops.split import PatientSplitter

result = PatientSplitter(id_col="subject_id", test_size=0.2).split(df)

# Guaranteed: no patient appears in both splits
assert not set(result.train["subject_id"]) & set(result.test["subject_id"])
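
The guarantee comes from splitting the list of patient IDs rather than the rows. A minimal sketch with NumPy and pandas follows; `patient_split` is a hypothetical function written for illustration and is not PatientSplitter's code.

```python
import numpy as np
import pandas as pd


def patient_split(df, id_col, test_size=0.2, seed=0):
    """Shuffle unique patient IDs, then assign whole patients to train or test."""
    ids = df[id_col].unique()
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(ids)
    n_test = max(1, int(round(len(ids) * test_size)))
    is_test = df[id_col].isin(shuffled[:n_test])
    return df[~is_test], df[is_test]


# Ten made-up patients with three rows each.
cohort = pd.DataFrame({
    "subject_id": np.repeat(np.arange(10), 3),
    "heart_rate": np.linspace(60.0, 118.0, 30),
})
train_rows, test_rows = patient_split(cohort, "subject_id", test_size=0.2, seed=42)
print(train_rows["subject_id"].nunique(), test_rows["subject_id"].nunique())
```

Because `test_size` applies to patients, the row-level proportion can drift from 20% when patients contribute very different numbers of rows.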

Stratified patient split — preserves outcome rate

Critical for imbalanced clinical endpoints (in-hospital mortality is typically 5–15%). Stratifies on a binary outcome while respecting patient boundaries.

from clinops.split import StratifiedPatientSplitter

result = StratifiedPatientSplitter(
    id_col="subject_id",
    outcome_col="hospital_expire_flag",
    test_size=0.2,
).split(df)

print(result.summary())
# Train: 32,000 rows (80.0%)
# Test:   8,000 rows (20.0%)
# population_outcome_rate: 0.0821
# train_outcome_rate:      0.0819
# test_outcome_rate:       0.0826
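
One way to get both properties is to assign each patient a single label and then split patient IDs separately within each label group. The sketch below takes that approach under explicit assumptions: `stratified_patient_split` is a hypothetical function, and a patient counts as positive if any of their rows has the outcome set. It is not StratifiedPatientSplitter's implementation.

```python
import numpy as np
import pandas as pd


def stratified_patient_split(df, id_col, outcome_col, test_size=0.2, seed=0):
    """Split patient IDs within each outcome class so both sides keep the rate."""
    labels = df.groupby(id_col)[outcome_col].max()  # one label per patient
    rng = np.random.default_rng(seed)
    test_ids = []
    for value in labels.unique():
        ids = labels.index[labels == value].to_numpy()
        shuffled = rng.permutation(ids)
        n_test = int(round(len(shuffled) * test_size))
        test_ids.extend(shuffled[:n_test])
    is_test = df[id_col].isin(test_ids)
    return df[~is_test], df[is_test]


# Fifty made-up patients, two rows each; the first ten have the outcome.
n_patients = 50
patient_ids = np.arange(n_patients)
cohort = pd.DataFrame({
    "subject_id": np.repeat(patient_ids, 2),
    "hospital_expire_flag": np.repeat((patient_ids < 10).astype(int), 2),
})
train_rows, test_rows = stratified_patient_split(
    cohort, "subject_id", "hospital_expire_flag", test_size=0.2, seed=7
)
print(train_rows["hospital_expire_flag"].mean(), test_rows["hospital_expire_flag"].mean())
```

With very rare outcomes, rounding inside the small positive group dominates, so the achieved test rate should be checked rather than assumed.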

Installation

Requires Python 3.12+.

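pip install clinops             # core
pip install "clinops[fhir]"     # adds FHIR R4 loader (quotes keep shells such as zsh from expanding the brackets)
pip install "clinops[gcp]"      # adds GCP extras (for v0.2)
pip install -e ".[dev]"         # development install, run from a source checkout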

Supported sources

Source	Format
MIMIC-IV v2.0–v2.2	CSV, CSV.GZ, Parquet
FHIR R4	JSON Bundle, NDJSON
Flat files	CSV, CSV.GZ, Parquet

Contributing

See CONTRIBUTING.md. Run pytest tests/ -v and ruff check clinops/ before opening a PR.

Citation

@software{kasaraneni2026clinops,
  author  = {Kasaraneni, Chaitanya},
  title   = {clinops: Clinical ML Pipeline Toolkit},
  year    = {2026},
  url     = {https://github.com/chaitanyakasaraneni/clinops},
  version = {0.1.0}
}

A companion JOSS paper is in preparation.

License

Apache 2.0 — see LICENSE.

Release history

Version	Released
0.2.2	Apr 8, 2026
0.2.1	Apr 8, 2026
0.1.7	Feb 24, 2026
0.1.6	Feb 24, 2026
0.1.5	Feb 23, 2026
0.1.4	Feb 23, 2026
0.1.3	Feb 23, 2026
0.1.2	Feb 23, 2026
0.1.1 (this version)	Feb 23, 2026
0.1.0	Feb 23, 2026

