
IatroCache (.iac): a lightweight medical data cache format


iatro-base-iac

IatroCache (.iac) is a lightweight, high-performance binary container format for offline caching of multimodal medical datasets (image tiles, feature vectors, clinical text, expert tokens, etc.).

This package provides the core container format and readers/writers for .iac files. It is designed for high-concurrency training pipelines, utilizing memory-mapped files (mmap) for thread-safe, zero-copy, lock-free random access.


Installation

pip install iatro-base-iac

Namespace Support

To facilitate transition and support a wider ecosystem of medical AI libraries, this package supports importing from two namespaces:

# 1. Primary Namespace (Recommended)
from iatro import iac
from iatro.iac import build_pack, PackReader

# 2. Compatibility Namespace
from iatro_base import iac
from iatro_base.iac import build_pack, PackReader

Format Layout

[ fixed header        ] 65536 bytes  — magic "IATROC", JSON header (codec, payload_type, offsets, ...)
[ slide table         ] Arrow IPC    — slide_idx / slide_id / patient_id
[ index table         ] Arrow IPC    — caller-defined columns + offset / length / crc32
[ data segment        ] raw bytes    — concatenated payloads, indexed by the index table
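
As a rough illustration of the fixed header block, the sketch below packs the magic bytes and a JSON header into a 65536-byte block and reads them back. It is not the library's implementation: placing the JSON directly after the magic and padding the rest with NUL bytes are assumptions made for this example, and encode_header / decode_header are hypothetical helper names. The real format may frame the JSON differently (for instance with a length prefix).

```python
import json

HEADER_SIZE = 65536   # fixed header size, from the layout above
MAGIC = b"IATROC"     # magic string, from the layout above


def encode_header(fields: dict) -> bytes:
    """Pack magic + JSON into a fixed-size block (NUL padding is an assumption)."""
    body = json.dumps(fields).encode("utf-8")
    if len(MAGIC) + len(body) > HEADER_SIZE:
        raise ValueError("header JSON does not fit in the fixed header block")
    return (MAGIC + body).ljust(HEADER_SIZE, b"\x00")


def decode_header(block: bytes) -> dict:
    """Check the magic, strip the assumed NUL padding, and parse the JSON."""
    if len(block) != HEADER_SIZE or not block.startswith(MAGIC):
        raise ValueError("not a valid header block")
    body = block[len(MAGIC):].rstrip(b"\x00")
    return json.loads(body.decode("utf-8"))


block = encode_header({"codec": "none", "payload_type": "raw_bytes"})
header = decode_header(block)
```

A fixed-size header means a reader can always fetch the first 65536 bytes in one read and learn where the remaining sections start, without scanning the file.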

Key Technical Features

  • Explicit Boundaries & Integrity: Each record carries offset / length / crc32. Payload boundaries are explicit, eliminating the need to scan for framing markers. This works seamlessly for codecs without self-delimiting frames (e.g. raw Brotli).
  • High Performance: PackReader uses mmap to map the file into virtual memory, allowing highly concurrent, lock-free random reads across worker threads/processes.
  • Metadata Flexibility: payload_type and codec are free-form header fields; the low-level container does not interpret the payload bytes directly.
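
The idea behind the first two features can be sketched with the standard library alone. The example below is not PackReader's actual code: it builds a toy data segment, records an (offset, length, crc32) triple per payload, and then reads records back through a read-only mmap from several threads. The file name, the in-memory index list, and the read_record helper are all assumptions made for illustration.

```python
import mmap
import os
import tempfile
import zlib
from concurrent.futures import ThreadPoolExecutor

payloads = [b"first_payload_data", b"second_payload_data", b"third"]

# Build the index: one (offset, length, crc32) triple per payload.
index = []
offset = 0
for p in payloads:
    index.append((offset, len(p), zlib.crc32(p)))
    offset += len(p)

# Write the concatenated payloads as a toy data segment.
path = os.path.join(tempfile.mkdtemp(), "segment.bin")
with open(path, "wb") as f:
    f.write(b"".join(payloads))


def read_record(mm: mmap.mmap, i: int) -> bytes:
    """Slice record i by its explicit bounds and verify its checksum."""
    off, length, crc = index[i]
    data = mm[off:off + length]  # slicing copies; memoryview(mm) would avoid the copy
    if zlib.crc32(data) != crc:
        raise ValueError(f"crc32 mismatch for record {i}")
    return data


# A read-only mapping is never mutated, so threads can slice it without locks.
with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
    with ThreadPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(lambda i: read_record(mm, i), range(len(payloads))))
```

Because each record's bounds come from the index, the reader never has to parse the payload bytes to find where one record ends and the next begins.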

Quick Start (Core API)

Below is an example of writing raw bytes to an .iac pack and reading them back:

import pyarrow as pa
from iatro.iac import build_pack, PackReader

# 1. Define metadata tables
slide_table = pa.table({
    "slide_idx": pa.array([0], pa.uint8()),
    "slide_id": ["s0"], 
    "patient_id": ["p0"]
})

# Offset, length, and crc32 columns are populated automatically
index_table = pa.table({
    "item_id": ["item_a", "item_b"]
})

# 2. Build the cache file
build_pack(
    filepath="out.iac",
    header={"payload_type": "raw_bytes", "codec": "none"},
    slide_table=slide_table,
    index_table=index_table,
    payloads=[b"first_payload_data", b"second_payload_data"]
)

# 3. Read a payload back by its index (row 1 of the index table)
reader = PackReader("out.iac")
print(reader.read_payload(1))  # Output: b"second_payload_data"
reader.close()

For large-scale or streaming datasets, refer to build_pack_streaming, build_pack_data_segment, and build_pack_data_segment_from_file.


Clinical Text Pair Adapter

iatro-base-iac includes domain-specific adapters such as clinical_text_pair. This adapter is designed to store paired datasets (e.g., raw clinical source text and compressed text for LLM distillation/training):

  • Organizes data such that one patient maps to one record.
  • Each document inside that patient record contains both source_text and compressed_text plus metadata.
  • Allows training loaders to retrieve all document pairs for a patient in a single random-access read.

from iatro.iac.adapters.text_pair import (
    ClinicalTextPairDoc,
    ClinicalTextPairReader,
    PatientTextPairs,
    build_clinical_text_pair_pack,
)

patients = [
    PatientTextPairs(
        patient_id="Patient_00000001",
        institution="XJ",
        docs=[
            ClinicalTextPairDoc(
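                # 入院记录 means "admission record"
                doc_id="Patient_00000001/2024-01-01/入院记录_20240101000000",
                source="XJ/Patient_00000001/2024-01-01/入院记录_20240101000000.txt",
                source_text="原始文书正文",  # "original document body"
                compressed_text="教师压缩正文",  # "teacher-compressed body"
                doc_type="入院记录",  # "admission record"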
                encounter="2024-01-01",
            )
        ],
    )
]
build_clinical_text_pair_pack("pairs.iac", patients)

# Read it back
reader = ClinicalTextPairReader("pairs.iac")
patient_data = reader.read_patient("Patient_00000001")
doc = patient_data.docs[0]
assert doc.source_text == "原始文书正文"
reader.close()
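
To make the one-patient-one-record layout more concrete, here is a minimal sketch of how all document pairs for a patient could be serialized into a single payload and recovered with one lookup. This is not the adapter's actual encoding: the JSON serialization, the DocPair dataclass, the encode_patient / decode_patient helpers, and the row_of mapping are all hypothetical names chosen for this example.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class DocPair:
    """Hypothetical stand-in for one source/compressed document pair."""
    doc_id: str
    source_text: str
    compressed_text: str


def encode_patient(patient_id: str, docs: list) -> bytes:
    """Serialize every document pair of one patient into a single payload."""
    record = {"patient_id": patient_id, "docs": [asdict(d) for d in docs]}
    return json.dumps(record, ensure_ascii=False).encode("utf-8")


def decode_patient(payload: bytes):
    """Recover the patient id and all of its document pairs from one payload."""
    record = json.loads(payload.decode("utf-8"))
    return record["patient_id"], [DocPair(**d) for d in record["docs"]]


docs = [
    DocPair("d1", "full admission note", "short admission note"),
    DocPair("d2", "full discharge summary", "short discharge summary"),
]
payloads = [encode_patient("Patient_00000001", docs)]
row_of = {"Patient_00000001": 0}  # patient_id -> payload row (assumed index)

pid, decoded = decode_patient(payloads[row_of["Patient_00000001"]])
```

Grouping by patient trades some write-time flexibility for read-time locality: a loader that samples patients touches one contiguous byte range per sample.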

Contributing & License

This project is licensed under the MIT License. Contributions and adapters (e.g., custom payload formats or codecs) are welcome.
