IatroCache (.iac): a lightweight medical data cache format

iatro-base-iac

IatroCache (.iac) — a lightweight, high-performance binary container format for offline caching of multimodal medical datasets (image tiles, feature vectors, clinical text, expert tokens, etc.).

This package provides the core container format and readers/writers for .iac files. It targets high-concurrency training pipelines: readers memory-map the file (mmap), so random access is zero-copy and needs no locks between threads.


Installation

pip install iatro-base-iac

Namespace Support

The package can be imported from two namespaces, which eases migration of existing code and use alongside other medical AI libraries:

# 1. Primary Namespace (Recommended)
from iatro import iac
from iatro.iac import build_pack, PackReader

# 2. Compatibility Namespace
from iatro_base import iac
from iatro_base.iac import build_pack, PackReader

Format Layout

[ fixed header        ] 65536 bytes  — magic "IATROC", JSON header (codec, payload_type, offsets, ...)
[ slide table         ] Arrow IPC    — slide_idx / slide_id / patient_id
[ index table         ] Arrow IPC    — caller-defined columns + offset / length / crc32
[ data segment        ] raw bytes    — concatenated payloads, indexed by the index table
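
As a rough illustration of the fixed header, the sketch below packs and parses a 65536-byte block that starts with the magic "IATROC" followed by JSON. The layout above does not say how the JSON is delimited or padded, so the zero-padding here is an assumption, and pack_header / parse_header are hypothetical helpers, not part of the package API.

```python
import json

HEADER_SIZE = 65536   # fixed header size stated in the layout
MAGIC = b"IATROC"     # magic string stated in the layout


def pack_header(fields: dict) -> bytes:
    """Serialize fields as magic + JSON, zero-padded to HEADER_SIZE (assumed encoding)."""
    body = MAGIC + json.dumps(fields).encode("utf-8")
    if len(body) > HEADER_SIZE:
        raise ValueError("header fields do not fit in the fixed header")
    return body.ljust(HEADER_SIZE, b"\x00")


def parse_header(block: bytes) -> dict:
    """Check the magic, strip the assumed zero padding, and decode the JSON."""
    if len(block) != HEADER_SIZE or not block.startswith(MAGIC):
        raise ValueError("not a valid header block")
    return json.loads(block[len(MAGIC):].rstrip(b"\x00").decode("utf-8"))


header = pack_header({"payload_type": "raw_bytes", "codec": "none"})
fields = parse_header(header)
```

Because the header has a fixed size, a reader can locate the tables that follow it without scanning the file; in the real format the offsets are presumably taken from the JSON fields.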

Key Technical Features

  • Explicit Boundaries & Integrity: Each record carries offset / length / crc32. Payload boundaries are explicit, eliminating the need to scan for framing markers. This works seamlessly for codecs without self-delimiting frames (e.g. raw Brotli).
  • High Performance: PackReader uses mmap to map the file into virtual memory, allowing highly concurrent, lock-free random reads across worker threads/processes.
  • Metadata Flexibility: payload_type and codec are free-form header fields; the low-level container does not interpret the payload bytes directly.
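
The first two points can be sketched with the standard library alone. The example below is not PackReader; it is a minimal stand-in that writes a toy data segment, keeps an in-memory list of (offset, length, crc32) tuples in place of the Arrow index table, and reads records through a read-only mmap from several threads. The file name, the index list, and read_record are all hypothetical.

```python
import mmap
import os
import tempfile
import zlib
from concurrent.futures import ThreadPoolExecutor

payloads = [b"first_payload_data", b"second_payload_data", b"third"]

# Build a toy data segment plus an index of (offset, length, crc32) tuples.
index = []
fd, path = tempfile.mkstemp(suffix=".bin")
with os.fdopen(fd, "wb") as f:
    for p in payloads:
        index.append((f.tell(), len(p), zlib.crc32(p)))
        f.write(p)


def read_record(mm: mmap.mmap, i: int) -> bytes:
    """Slice one payload out of the mapping and verify its CRC32."""
    offset, length, crc = index[i]
    # Slicing an mmap copies into bytes; memoryview(mm)[a:b] would avoid the copy.
    data = mm[offset:offset + length]
    if zlib.crc32(data) != crc:
        raise ValueError(f"crc32 mismatch for record {i}")
    return data


with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
    # Slice reads do not move a shared file cursor, so no lock is taken here.
    with ThreadPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(lambda i: read_record(mm, i), range(len(index))))

os.remove(path)
```

Since each record's boundaries come from the index, the reader never inspects payload bytes to find where a record ends, which is why a codec without framing markers poses no problem.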

Quick Start (Core API)

Below is an example of writing raw bytes to an .iac pack and reading them back:

import pyarrow as pa
from iatro.iac import build_pack, PackReader

# 1. Define metadata tables
slide_table = pa.table({
    "slide_idx": pa.array([0], pa.uint8()),
    "slide_id": ["s0"], 
    "patient_id": ["p0"]
})

# Offset, length, and crc32 columns are populated automatically
index_table = pa.table({
    "item_id": ["item_a", "item_b"]
})

# 2. Build the cache file
build_pack(
    filepath="out.iac",
    header={"payload_type": "raw_bytes", "codec": "none"},
    slide_table=slide_table,
    index_table=index_table,
    payloads=[b"first_payload_data", b"second_payload_data"]
)

# 3. Read a payload back by its index (random access)
reader = PackReader("out.iac")
print(reader.read_payload(1))  # Output: b"second_payload_data"
reader.close()

For large-scale or streaming datasets, refer to build_pack_streaming, build_pack_data_segment, and build_pack_data_segment_from_file.
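
The signatures of those streaming functions are not shown here, so the following is only a sketch of the general idea they suggest: payloads are appended one at a time while the offset / length / crc32 columns are accumulated, so the full dataset never has to sit in memory. append_payloads is a hypothetical helper and an in-memory buffer stands in for the output file.

```python
import io
import zlib
from typing import BinaryIO, Iterable


def append_payloads(out: BinaryIO, payloads: Iterable[bytes]) -> dict:
    """Write payloads one at a time and return offset/length/crc32 columns."""
    columns = {"offset": [], "length": [], "crc32": []}
    position = 0  # offset relative to the start of the data segment
    for payload in payloads:
        out.write(payload)
        columns["offset"].append(position)
        columns["length"].append(len(payload))
        columns["crc32"].append(zlib.crc32(payload))
        position += len(payload)
    return columns


segment = io.BytesIO()
# A generator stands in for a dataset too large to hold in memory.
columns = append_payloads(segment, (f"payload_{i}".encode() for i in range(3)))
```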


Clinical Text Pair Adapter

iatro-base-iac includes domain-specific adapters such as clinical_text_pair. This adapter is designed to store paired datasets (e.g., raw clinical source text and compressed text for LLM distillation/training):

  • Organizes data such that one patient maps to one record.
  • Each document inside that patient record contains both source_text and compressed_text plus metadata.
  • Allows training loaders to retrieve all document pairs for a patient in a single random-access read.

from iatro.iac.adapters.text_pair import (
    ClinicalTextPairDoc,
    ClinicalTextPairReader,
    PatientTextPairs,
    build_clinical_text_pair_pack,
)

# Sample values are Chinese clinical text:
#   入院记录 = "admission record", 原始文书正文 = "original document body",
#   教师压缩正文 = "teacher-compressed body"
patients = [
    PatientTextPairs(
        patient_id="Patient_00000001",
        institution="XJ",
        docs=[
            ClinicalTextPairDoc(
                doc_id="Patient_00000001/2024-01-01/入院记录_20240101000000",
                source="XJ/Patient_00000001/2024-01-01/入院记录_20240101000000.txt",
                source_text="原始文书正文",
                compressed_text="教师压缩正文",
                doc_type="入院记录",
                encounter="2024-01-01",
            )
        ],
    )
]

# Write one record per patient
build_clinical_text_pair_pack("pairs.iac", patients)

# Read it back
reader = ClinicalTextPairReader("pairs.iac")
patient_data = reader.read_patient("Patient_00000001")
doc = patient_data.docs[0]
assert doc.source_text == "原始文书正文"
reader.close()
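
To make the "one patient maps to one record" idea concrete, here is a small standard-library sketch that bundles all of a patient's document pairs into a single payload and decodes them again. How the adapter actually encodes a record is not described above; JSON is used purely as an assumption, and TextPairDoc, encode_patient, and decode_patient are hypothetical names rather than the adapter's classes.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class TextPairDoc:
    """Hypothetical stand-in for one document pair, not ClinicalTextPairDoc."""
    doc_id: str
    source_text: str
    compressed_text: str


def encode_patient(patient_id: str, docs: list) -> bytes:
    """Bundle every document pair of one patient into a single payload."""
    record = {"patient_id": patient_id, "docs": [asdict(d) for d in docs]}
    return json.dumps(record, ensure_ascii=False).encode("utf-8")


def decode_patient(payload: bytes):
    """Recover the patient id and all document pairs from one payload."""
    record = json.loads(payload.decode("utf-8"))
    return record["patient_id"], [TextPairDoc(**d) for d in record["docs"]]


docs = [
    TextPairDoc("p1/doc_a", "long source text A", "short A"),
    TextPairDoc("p1/doc_b", "long source text B", "short B"),
]
payload = encode_patient("p1", docs)
patient_id, restored = decode_patient(payload)
```

With this grouping, a loader pays for one index lookup and one read per patient instead of one per document.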

Contributing & License

This project is licensed under the MIT License. Contributions and adapters (e.g., custom payload formats or codecs) are welcome.
