tfd-utils
TensorFlow utilities for efficient TFRecord processing and random access.
Lightweight Python library for O(1) random access to TensorFlow TFRecord files and tar archives. No TensorFlow dependency required for the core library.
- Unified API for TFRecord and tar (read by key in O(1))
- Index built once and cached to disk; rebuilt automatically on file change
- 100% wire-compatible with tf.data.TFRecordDataset (read either direction)
- Multi-file / glob support, parallel index build
- tfd CLI: list, extract, get, convert, prebuild, install-skill
Installation
pip install tfd-utils
Core dependencies: numpy, protobuf, crc32c. TensorFlow is not required.
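Since v1.0.0 the installed version is exposed (sourced from package metadata), which gives a quick sanity check:
import tfd_utils
print(tfd_utils.__version__)  # e.g. '1.0.0'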
Quickstart
Read a TFRecord
from tfd_utils import TFRecordRandomAccess
reader = TFRecordRandomAccess("data.tfrecord")
# also accepts a list or glob: ["train_*.tfrecord", "val_*.tfrecord"]
image_bytes = reader.get_feature("record_1", "image")
record = reader["record_1"] # Example protobuf
print(len(reader), "records")
Write a TFRecord
from tfd_utils.writer import TFRecordWriter
from tfd_utils.pb2 import Example, Features, Feature, BytesList
# image_bytes: any raw bytes payload (e.g. an encoded JPEG)
with TFRecordWriter("data.tfrecord") as w:
    ex = Example(features=Features(feature={
        'key': Feature(bytes_list=BytesList(value=[b'record_1'])),
        'image': Feature(bytes_list=BytesList(value=[image_bytes])),
    }))
    w.write(ex.SerializeToString())
Read a tar archive
Tar members sharing a stem are grouped under the same key:
sa_000001.jpg + sa_000001.json → key sa_000001, features jpg / json.
from tfd_utils import TarRandomAccess
reader = TarRandomAccess("archive.tar") # also: "sa1b/*.tar"
jpg_bytes = reader.get_feature("sa_000001", "jpg")
json_bytes = reader.get_feature("sa_000001", "json")
record = reader["sa_000001"] # {'jpg': bytes, 'json': bytes}
.tar, .tar.gz, and .tar.bz2 are supported (autodetected). Tar is not O(1) — for training pipelines, convert to TFRecord first (see Converting tar → TFRecord).
Pre-build the index for large datasets
The first call to TFRecordRandomAccess(...) scans every shard to record byte offsets, then caches a <file>.index next to each shard. For thousands of shards or hundreds of millions of records, this first-time scan can take minutes to hours.
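For intuition, here is a minimal sketch of such an offset scan, assuming only the standard TFRecord framing (8-byte little-endian payload length, 4-byte length CRC, payload, 4-byte payload CRC); the library's actual index layout is an internal detail:
import struct

def scan_offsets(path):
    """Collect the byte offset of every record in one TFRecord shard."""
    offsets = []
    with open(path, "rb") as f:
        while True:
            offset = f.tell()
            header = f.read(8)            # uint64 LE: payload length
            if len(header) < 8:
                break                     # clean end of file
            (length,) = struct.unpack("<Q", header)
            f.seek(4 + length + 4, 1)     # skip length CRC, payload, payload CRC
            offsets.append(offset)
    return offsets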
Do not let your training script trigger the index build. Symptoms when you do:
- The job appears to hang with no progress, holding GPUs while doing pure CPU/IO work.
- Multi-rank launchers (torchrun, accelerate, …) race to build the same index from every rank, multiplying cost.
Recommended workflow — pre-build once on a CPU/login node, then launch training:
# Default — builds .index for every matching shard in parallel
tfd prebuild '/path/to/shards/*.tfrecord'
# Bump concurrency on a fat CPU node (default is 2x CPU count, min 32)
tfd prebuild '/path/to/shards/*.tfrecord' --workers 128
Verify the indexes exist before submitting the training job:
ls /path/to/shards/*.index | wc -l # should equal shard count
Subsequent runs reuse the cached .index files (mtime-checked) and start instantly.
Programmatic equivalent (only if you cannot run the CLI):
from tfd_utils import TFRecordRandomAccess
TFRecordRandomAccess('/path/to/shards/*.tfrecord', max_workers=128)
CLI
tfd list data.tfrecord # show features of the first record
tfd extract data.tfrecord <key> # extract a record (saves images to disk)
tfd get data.tfrecord:<key>:<feature> # extract a single feature
tfd prebuild '/path/to/shards/*.tfrecord' # build .index ahead of training
tfd prebuild '/path/to/shards/*.tfrecord' -w 128
tfd convert /path/to/sa1b/ -o /out/ # tar(s) → TFRecord(s)
tfd convert '/path/to/sa1b/sa_*.tar' -o /out/ -d -w 32
tfd install-skill # install the Claude Code skill
Converting tar → TFRecord
tfd convert reads each input tar and writes one TFRecord file per source tar. Each output record contains a key feature (the file stem) plus one bytes feature per file extension; a sketch of this mapping follows the table below.
tfd convert /path/to/archive.tar # single tar
tfd convert /path/to/sa1b/ --output-dir /out/ # directory of tars
tfd convert '/path/to/sa1b/sa_0000*.tar' --output-dir /out/ # glob
tfd convert /path/to/sa1b/ --output-dir /out/ --delete # delete sources on success
tfd convert /path/to/sa1b/ --output-dir /out/ --workers 32 # default is 16 workers
For SA-1B-style tars (paired .jpg + .json per image), each output record has:
| Feature | Type | Content |
|---|---|---|
| key | bytes | File stem, e.g. sa_226692 |
| jpg | bytes | Raw JPEG image bytes |
| json | bytes | Annotation JSON (masks, boxes…) |
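The mapping itself is straightforward. Below is an illustrative sketch of what such a conversion does, built from the writer API shown in the Quickstart (this is not the library's internal implementation, and it buffers each group in memory):
import os
import tarfile
from tfd_utils.writer import TFRecordWriter
from tfd_utils.pb2 import Example, Features, Feature, BytesList

def tar_to_tfrecord(tar_path, out_path):
    # Group members by file stem: sa_000001.jpg + sa_000001.json -> one record.
    groups = {}
    with tarfile.open(tar_path) as tar:
        for member in tar.getmembers():
            if not member.isfile():
                continue
            stem, ext = os.path.splitext(os.path.basename(member.name))
            groups.setdefault(stem, {})[ext.lstrip(".")] = tar.extractfile(member).read()
    with TFRecordWriter(out_path) as w:
        for stem, parts in groups.items():
            feature = {"key": Feature(bytes_list=BytesList(value=[stem.encode()]))}
            for ext, data in parts.items():
                feature[ext] = Feature(bytes_list=BytesList(value=[data]))
            w.write(Example(features=Features(feature=feature)).SerializeToString())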
API reference
Common API (both readers)
reader.get_record(key) # full record
reader.get_feature(key, feature_name) # single feature (bytes / int / float)
reader.get_feature_list(key, feature_name)
reader.get_keys() # all keys
reader.get_stats() # {'total_records': ..., 'total_files': ..., ...}
reader.contains_key(key)
reader.rebuild_index() # force rebuild
key in reader # __contains__
reader[key] # __getitem__ (raises KeyError if missing)
len(reader) # __len__
with TarRandomAccess("archive.tar") as r:  # context manager
    ...
Constructor options
# TFRecord: custom key feature name (default 'key')
TFRecordRandomAccess("file.tfrecord", key_feature_name="id")
# Both: custom index file location
TFRecordRandomAccess("file.tfrecord", index_file="my.index")
TarRandomAccess("archive.tar", index_file="my.tar_index")
# Both: control parallelism for index build
TFRecordRandomAccess("*.tfrecord", max_workers=128)
TarRandomAccess("*.tar", max_workers=8, use_multiprocessing=True)
Example: SA-1B
import json, io
from PIL import Image
from tfd_utils import TarRandomAccess
reader = TarRandomAccess("/path/to/sa1b/*.tar") # gzip tars supported
key = reader.get_keys()[0] # e.g. 'sa_226692'
image = Image.open(io.BytesIO(reader.get_feature(key, "jpg")))
annotation = json.loads(reader.get_feature(key, "json"))
print(f"{annotation['image']['width']}x{annotation['image']['height']},",
f"{len(annotation['annotations'])} masks")
TensorFlow interoperability
Files written by tfd_utils are byte-identical to TensorFlow's TFRecord format:
import tensorflow as tf
for record in tf.data.TFRecordDataset("data.tfrecord"):
ex = tf.train.Example()
ex.ParseFromString(record.numpy())
The reverse direction works too — TFRecordRandomAccess reads files written by tf.io.TFRecordWriter.
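For example, a minimal round-trip sketch (note the key feature, which TFRecordRandomAccess uses for lookup by default):
import tensorflow as tf
from tfd_utils import TFRecordRandomAccess

# Write with TensorFlow...
with tf.io.TFRecordWriter("tf_written.tfrecord") as w:
    ex = tf.train.Example(features=tf.train.Features(feature={
        "key": tf.train.Feature(bytes_list=tf.train.BytesList(value=[b"record_1"])),
        "payload": tf.train.Feature(bytes_list=tf.train.BytesList(value=[b"hello"])),
    }))
    w.write(ex.SerializeToString())

# ...and read it back by key, with no TensorFlow needed from here on.
reader = TFRecordRandomAccess("tf_written.tfrecord")
assert reader.get_feature("record_1", "payload") == b"hello"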
Claude Code skill
Install the bundled skill so Claude Code can assist with the library in any project:
tfd install-skill
Copies a versioned SKILL.md to ~/.claude/skills/tfd-utils/. Re-run after upgrading the library to refresh the skill content:
pip install -U tfd-utils && tfd install-skill
Version notes
v1.0.0
- New tfd prebuild CLI command: pre-build .index files for one or more shards before training, with high default concurrency (max(2x CPU, 32) workers, override with -w). Avoids the multi-rank race and the multi-minute training-startup hang on large datasets.
- tfd_utils.__version__ is now exposed (sourced from package metadata).
- tfd install-skill stamps the installed library version into the bundled Claude Code skill.
- README rewritten for clarity; pre-build workflow promoted to a top-level section.
v0.4.3 — concurrency fix (recommended upgrade)
Versions before 0.4.3 have a critical concurrency bug: when multiple processes built the index simultaneously, they could corrupt the .index file, causing _pickle.UnpicklingError on the next run.
v0.4.3 introduces:
- Exclusive build lock — only one process builds the index at a time; others wait and reuse the result.
- Atomic index write — index is written to .tmp and renamed into place; a killed process can never leave a half-written index behind (the pattern is sketched below).
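In outline, the write-to-temp-then-rename pattern looks like this (a generic sketch, not the library's exact code; pickle is assumed from the _pickle.UnpicklingError mentioned above):
import os
import pickle

def write_index_atomically(index, path):
    # Write to a sibling temp file, then rename into place. os.replace is
    # atomic on POSIX, so readers observe either the old index or the
    # complete new one, never a partially written file.
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(index, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)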
If you hit _pickle.UnpicklingError on an existing dataset, delete the stale .index files and upgrade:
rm /path/to/data/*.index
pip install --upgrade tfd-utils
License
MIT