IatroCache (.iac): a lightweight medical data cache format
iatro-base-iac
IatroCache (.iac) is a lightweight, high-performance binary container format for offline caching of multimodal medical datasets (image tiles, feature vectors, clinical text, expert tokens, and so on).
This package provides the core container format and readers/writers for .iac files. It is designed for high-concurrency training pipelines and uses memory-mapped files (mmap) for thread-safe, zero-copy, lock-free random access.
Installation
pip install iatro-base-iac
Namespace Support
To ease migration and support a wider ecosystem of medical AI libraries, this package can be imported from two namespaces:
# 1. Primary Namespace (Recommended)
from iatro import iac
from iatro.iac import build_pack, PackReader
# 2. Compatibility Namespace
from iatro_base import iac
from iatro_base.iac import build_pack, PackReader
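Code that must run against either namespace can resolve the import at runtime. The following is a minimal sketch using only the standard library; the helper name first_importable is hypothetical and is not part of this package.

```python
import importlib
from types import ModuleType
from typing import Iterable, Optional


def first_importable(names: Iterable[str]) -> Optional[ModuleType]:
    """Return the first module in `names` that imports cleanly, else None."""
    for name in names:
        try:
            return importlib.import_module(name)
        except ImportError:
            # Covers ModuleNotFoundError as well; try the next candidate.
            continue
    return None


# Prefer the primary namespace, fall back to the compatibility one.
iac = first_importable(["iatro.iac", "iatro_base.iac"])
if iac is None:
    print("iatro-base-iac is not installed")
```

For most projects a plain import from the primary namespace is enough; a fallback like this is only useful while older code still depends on the compatibility name.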
Format Layout
[ fixed header ] 65536 bytes — magic "IATROC", JSON header (codec, payload_type, offsets, ...)
[ slide table ] Arrow IPC — slide_idx / slide_id / patient_id
[ index table ] Arrow IPC — caller-defined columns + offset / length / crc32
[ data segment ] raw bytes — concatenated payloads, indexed by the index table
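To make the fixed header concrete, here is a small sketch of writing and parsing a 65536-byte header block. The layout above gives only the size, the magic string, and the fact that the header is JSON. The sketch therefore assumes that the magic sits at byte 0, that UTF-8 JSON follows it directly, and that the rest of the block is zero-padded. The real byte layout may differ, and pack_header and parse_header are hypothetical helpers, not functions of this package.

```python
import json

HEADER_SIZE = 65536   # fixed header size from the layout above
MAGIC = b"IATROC"     # magic string from the layout above


def pack_header(fields: dict) -> bytes:
    """Encode magic + JSON and zero-pad to HEADER_SIZE (assumed layout)."""
    body = MAGIC + json.dumps(fields).encode("utf-8")
    if len(body) > HEADER_SIZE:
        raise ValueError("header JSON does not fit in the fixed header")
    return body.ljust(HEADER_SIZE, b"\x00")


def parse_header(block: bytes) -> dict:
    """Check the magic, then decode the JSON that precedes the padding."""
    if len(block) != HEADER_SIZE:
        raise ValueError("expected exactly one fixed-size header block")
    if not block.startswith(MAGIC):
        raise ValueError("not an .iac file: bad magic")
    raw_json = block[len(MAGIC):].rstrip(b"\x00")
    return json.loads(raw_json.decode("utf-8"))


header = pack_header({"codec": "none", "payload_type": "raw_bytes"})
fields = parse_header(header)
print(len(header), fields["codec"], fields["payload_type"])
```

Because the header has a fixed size, a reader can fetch it with one read of known length before it looks at any table or payload.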
Key Technical Features
- Explicit Boundaries & Integrity: Each record carries offset/length/crc32. Payload boundaries are explicit, eliminating the need to scan for framing markers. This works seamlessly for codecs without self-delimiting frames (e.g. raw Brotli).
- High Performance: PackReader uses mmap to map the file into virtual memory, allowing highly concurrent, lock-free random reads across worker threads/processes.
- Metadata Flexibility: payload_type and codec are free-form header fields; the low-level container does not interpret the payload bytes directly.
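The first two features can be illustrated with the standard library alone. The sketch below builds a toy data segment, records offset/length/crc32 for each payload, then reads records back from several threads through a read-only mmap and verifies each checksum. It is not PackReader's implementation; the index is a plain Python list instead of an Arrow table, and read_record is a hypothetical helper.

```python
import mmap
import os
import tempfile
import zlib
from concurrent.futures import ThreadPoolExecutor

payloads = [b"first_payload_data", b"second_payload_data", b"third"]

# Build a toy data segment plus an index of (offset, length, crc32) rows.
index = []
segment = bytearray()
for p in payloads:
    index.append((len(segment), len(p), zlib.crc32(p)))
    segment += p


def read_record(mm: mmap.mmap, i: int) -> bytes:
    """Slice one payload by its explicit bounds and verify its checksum."""
    offset, length, crc = index[i]
    data = mm[offset:offset + length]  # slicing copies the bytes out
    if zlib.crc32(data) != crc:
        raise ValueError(f"crc32 mismatch for record {i}")
    return data


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "segment.bin")
    with open(path, "wb") as f:
        f.write(segment)

    with open(path, "rb") as f, \
            mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        # Workers only read from the shared mapping, so no lock is taken.
        with ThreadPoolExecutor(max_workers=4) as pool:
            results = list(
                pool.map(lambda i: read_record(mm, i), range(len(index)))
            )

print(results[1])
```

No framing markers are needed inside the segment: the index row alone says where a payload starts and ends, and the checksum catches truncated or corrupted bytes.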
Quick Start (Core API)
Below is an example of writing raw bytes to an .iac pack and reading them back:
import pyarrow as pa
from iatro.iac import build_pack, PackReader

# 1. Define metadata tables
slide_table = pa.table({
    "slide_idx": pa.array([0], pa.uint8()),
    "slide_id": ["s0"],
    "patient_id": ["p0"],
})
# Offset, length, and crc32 columns are populated automatically
index_table = pa.table({
    "item_id": ["item_a", "item_b"],
})

# 2. Build the cache file
build_pack(
    filepath="out.iac",
    header={"payload_type": "raw_bytes", "codec": "none"},
    slide_table=slide_table,
    index_table=index_table,
    payloads=[b"first_payload_data", b"second_payload_data"],
)

# 3. Read a payload back by its index
reader = PackReader("out.iac")
print(reader.read_payload(1))  # Output: b"second_payload_data"
reader.close()
For large-scale or streaming datasets, refer to build_pack_streaming, build_pack_data_segment, and build_pack_data_segment_from_file.
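The signatures of those streaming functions are not shown here, so the following does not use them. It is only a sketch of the general idea behind a streaming build: payloads are consumed from an iterator and appended to the data segment one at a time, while offset/length/crc32 rows are collected for the index. The names write_segment_streaming and generate_payloads are hypothetical.

```python
import io
import zlib
from typing import BinaryIO, Iterable, Iterator, List, Tuple


def write_segment_streaming(
    out: BinaryIO, payloads: Iterable[bytes]
) -> List[Tuple[int, int, int]]:
    """Append payloads to `out` one by one; return (offset, length, crc32) rows."""
    rows = []
    offset = 0
    for payload in payloads:
        out.write(payload)
        rows.append((offset, len(payload), zlib.crc32(payload)))
        offset += len(payload)
    return rows


def generate_payloads(n: int) -> Iterator[bytes]:
    """Stand-in for a loader that yields tiles or features lazily."""
    for i in range(n):
        yield f"payload-{i}".encode("utf-8")


buf = io.BytesIO()  # a real pipeline would write to a file opened in "wb" mode
rows = write_segment_streaming(buf, generate_payloads(3))
for offset, length, crc in rows:
    print(offset, length, crc)
```

Only one payload is held in memory at a time; the index rows are small and can be turned into a table once the iterator is exhausted.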
Clinical Text Pair Adapter
iatro-base-iac includes domain-specific adapters such as clinical_text_pair. This adapter is designed to store paired datasets (e.g., raw clinical source text and compressed text for LLM distillation/training):
- Organizes data such that one patient maps to one record.
- Each document inside that patient record contains both source_text and compressed_text plus metadata.
- Allows training loaders to retrieve all document pairs for a patient in a single random-access read.
from iatro.iac.adapters.text_pair import (
    ClinicalTextPairDoc,
    ClinicalTextPairReader,
    PatientTextPairs,
    build_clinical_text_pair_pack,
)

# The sample values are Chinese clinical-document strings:
# 入院记录 = "admission record"
patients = [
    PatientTextPairs(
        patient_id="Patient_00000001",
        institution="XJ",
        docs=[
            ClinicalTextPairDoc(
                doc_id="Patient_00000001/2024-01-01/入院记录_20240101000000",
                source="XJ/Patient_00000001/2024-01-01/入院记录_20240101000000.txt",
                source_text="原始文书正文",      # "original document body text"
                compressed_text="教师压缩正文",  # "teacher-compressed body text"
                doc_type="入院记录",
                encounter="2024-01-01",
            )
        ],
    )
]
build_clinical_text_pair_pack("pairs.iac", patients)

# Read it back
reader = ClinicalTextPairReader("pairs.iac")
patient_data = reader.read_patient("Patient_00000001")
doc = patient_data.docs[0]
assert doc.source_text == "原始文书正文"
reader.close()
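The "one patient maps to one record" layout can be sketched without the package. The adapter's actual on-disk encoding is not described here, so this example assumes JSON purely for illustration, and the Doc and PatientRecord classes are hypothetical stand-ins for ClinicalTextPairDoc and PatientTextPairs.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass
class Doc:  # hypothetical stand-in for one document pair
    doc_id: str
    source_text: str
    compressed_text: str


@dataclass
class PatientRecord:  # hypothetical stand-in for one patient's record
    patient_id: str
    docs: List[Doc] = field(default_factory=list)


def encode_patient(rec: PatientRecord) -> bytes:
    """Serialize a whole patient (all document pairs) into one payload."""
    return json.dumps(asdict(rec), ensure_ascii=False).encode("utf-8")


def decode_patient(payload: bytes) -> PatientRecord:
    """Rebuild the patient record from a single payload."""
    raw = json.loads(payload.decode("utf-8"))
    return PatientRecord(raw["patient_id"], [Doc(**d) for d in raw["docs"]])


patients = [
    PatientRecord("P1", [Doc("P1/a", "long note A", "short A"),
                         Doc("P1/b", "long note B", "short B")]),
    PatientRecord("P2", [Doc("P2/a", "long note C", "short C")]),
]

# One payload per patient, plus a patient_id -> record index lookup.
payloads = [encode_patient(p) for p in patients]
record_of: Dict[str, int] = {p.patient_id: i for i, p in enumerate(patients)}

# One lookup and one payload read return every pair for the patient.
restored = decode_patient(payloads[record_of["P1"]])
print(len(restored.docs))
```

Grouping by patient trades finer-grained access for fewer reads: a loader that samples by patient touches the data segment once per sample instead of once per document.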
Contributing & License
This project is licensed under the MIT License. Contributions and adapters (e.g., custom payload formats or codecs) are welcome.