TensorFlow utilities for efficient TFRecord processing and random access


tfd-utils

Lightweight Python library for O(1) random access to TensorFlow TFRecord files, plus keyed random access to tar archives. No TensorFlow dependency is required for the core library.

  • Unified API for TFRecord and tar (read any record by key; O(1) for TFRecord)
  • Index built once and cached to disk; auto-rebuilt when the shard count changes (count is encoded in the cache filename)
  • 100% wire-compatible with tf.data.TFRecordDataset (reads files written by TensorFlow, and vice versa)
  • Multi-file / glob support, parallel index build
  • tfd CLI: list, extract, get, convert, prebuild, install-skill

Installation

pip install tfd-utils

Core dependencies: numpy, protobuf, crc32c. TensorFlow is not required.


Quickstart

Read a TFRecord

from tfd_utils import TFRecordRandomAccess

reader = TFRecordRandomAccess("data.tfrecord")
# also accepts a list or glob: ["train_*.tfrecord", "val_*.tfrecord"]

image_bytes = reader.get_feature("record_1", "image")
record      = reader["record_1"]      # Example protobuf
print(len(reader), "records")

Write a TFRecord

from tfd_utils.writer import TFRecordWriter
from tfd_utils.pb2 import Example, Features, Feature, BytesList

with TFRecordWriter("data.tfrecord") as w:
    ex = Example(features=Features(feature={
        'key':   Feature(bytes_list=BytesList(value=[b'record_1'])),
        'image': Feature(bytes_list=BytesList(value=[image_bytes])),
    }))
    w.write(ex.SerializeToString())

Read a tar archive

Tar members sharing a stem are grouped under the same key: sa_000001.jpg + sa_000001.json → key sa_000001, features jpg / json.

from tfd_utils import TarRandomAccess

reader = TarRandomAccess("archive.tar")          # also: "sa1b/*.tar"
jpg_bytes  = reader.get_feature("sa_000001", "jpg")
json_bytes = reader.get_feature("sa_000001", "json")
record     = reader["sa_000001"]                  # {'jpg': bytes, 'json': bytes}
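The stem-grouping rule is simple enough to sketch in plain Python. The group_by_stem helper below is illustrative only; it is not part of the tfd_utils API:

```python
from collections import defaultdict
from pathlib import PurePosixPath

def group_by_stem(member_names):
    """Group tar member names by file stem; each extension becomes a feature.
    Illustrative sketch of the grouping rule, not tfd_utils internals."""
    groups = defaultdict(dict)
    for name in member_names:
        p = PurePosixPath(name)
        key = p.stem                      # 'sa_000001.jpg' -> 'sa_000001'
        feature = p.suffix.lstrip(".")    # '.jpg'          -> 'jpg'
        groups[key][feature] = name
    return dict(groups)

print(group_by_stem(["sa_000001.jpg", "sa_000001.json"]))
```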

.tar, .tar.gz, and .tar.bz2 are supported (autodetected). Tar access is not O(1) — for training pipelines, convert to TFRecord first (see Converting tar → TFRecord).


Pre-build the index for large datasets

The first call to TFRecordRandomAccess(...) scans every shard to record byte offsets, then caches a <file>.index next to each shard. For thousands of shards or hundreds of millions of records, this first-time scan can take minutes to hours.

Do not let your training script trigger the index build. Symptoms when you do:

  • The job appears to hang with no progress, holding GPUs while doing pure CPU/IO work.
  • Multi-rank launchers (torchrun, accelerate, …) race to build the same index from every rank, multiplying cost.

Recommended workflow — pre-build once on a CPU/login node, then launch training:

# Default — builds .index for every matching shard in parallel
tfd prebuild '/path/to/shards/*.tfrecord'

# Bump concurrency on a fat CPU node (default is 2x CPU count, min 32)
tfd prebuild '/path/to/shards/*.tfrecord' --workers 128

Verify the indexes exist before submitting the training job:

ls /path/to/shards/*.index | wc -l   # should equal shard count

Subsequent runs reuse the cached index file and start instantly. The auto-generated cache name encodes the shard count (all<N>.index for complete XXXXX_of_NNNNN.tfrecord shard sets, otherwise <first_stem>_unified_tot<N>.index), so a changed shard count automatically routes to a fresh path and triggers a rebuild — no mtime checks involved.
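The naming rule can be sketched as follows. index_cache_name is a hypothetical helper that mirrors the rule described above (including the documented N = NNNNN + 1 convention for complete shard sets); it is not the library's internal code:

```python
import re

def index_cache_name(shard_paths):
    """Derive the auto-generated index cache name from a shard list.
    Hypothetical sketch of the naming rule, not tfd_utils internals."""
    names = [p.rsplit("/", 1)[-1] for p in shard_paths]
    if len(names) == 1:                                  # single file -> <stem>.index
        return names[0].removesuffix(".tfrecord") + ".index"
    pat = re.compile(r"(\d+)_of_(\d+)\.tfrecord$")
    matches = [pat.search(n) for n in names]
    if all(matches) and len(names) == int(matches[0].group(2)) + 1:
        return f"all{len(names)}.index"                  # complete XXXXX_of_NNNNN set
    first_stem = names[0].removesuffix(".tfrecord")
    return f"{first_stem}_unified_tot{len(names)}.index"

print(index_cache_name([f"{i:05d}_of_00003.tfrecord" for i in range(4)]))  # all4.index
```

Because the shard count is baked into the name, adding or removing a shard changes the computed path, which is why a pure existence check is enough to detect staleness.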

Programmatic equivalent (only if you cannot run the CLI):

from tfd_utils import TFRecordRandomAccess
TFRecordRandomAccess('/path/to/shards/*.tfrecord', max_workers=128)

CLI

tfd list     data.tfrecord                          # show features of the first record
tfd extract  data.tfrecord <key>                    # extract a record (saves images to disk)
tfd get      data.tfrecord:<key>:<feature>          # extract a single feature

tfd prebuild '/path/to/shards/*.tfrecord'           # build .index ahead of training
tfd prebuild '/path/to/shards/*.tfrecord' -w 128

tfd convert  /path/to/sa1b/ -o /out/                # tar(s) → TFRecord(s)
tfd convert  '/path/to/sa1b/sa_*.tar' -o /out/ -d -w 32

tfd install-skill                                   # install the Claude Code skill

Converting tar → TFRecord

tfd convert reads each input tar and writes one TFRecord file per input tar. Each output record contains a key feature (the file stem) plus one bytes feature per file extension.

tfd convert /path/to/archive.tar                                  # single tar
tfd convert /path/to/sa1b/ --output-dir /out/                     # directory of tars
tfd convert '/path/to/sa1b/sa_0000*.tar' --output-dir /out/       # glob
tfd convert /path/to/sa1b/ --output-dir /out/ --delete            # delete sources on success
tfd convert /path/to/sa1b/ --output-dir /out/ --workers 32        # default is 16 workers

For SA-1B-style tars (paired .jpg + .json per image), each output record has:

Feature   Type    Content
key       bytes   File stem, e.g. sa_226692
jpg       bytes   Raw JPEG image bytes
json      bytes   Annotation JSON (masks, boxes…)

API reference

Common API (both readers)

reader.get_record(key)                 # full record
reader.get_feature(key, feature_name)  # single feature (bytes / int / float)
reader.get_feature_list(key, feature_name)
reader.get_keys()                      # all keys
reader.get_stats()                     # {'total_records': ..., 'total_files': ..., ...}
reader.contains_key(key)
reader.rebuild_index()                 # force rebuild

key in reader                          # __contains__
reader[key]                            # __getitem__ (raises KeyError if missing)
len(reader)                            # __len__

with TarRandomAccess("archive.tar") as r:   # context manager
    ...

Constructor options

# TFRecord: custom key feature name (default 'key')
TFRecordRandomAccess("file.tfrecord", key_feature_name="id")

# Both: custom index file location
TFRecordRandomAccess("file.tfrecord", index_file="my.index")
TarRandomAccess("archive.tar", index_file="my.tar_index")

# Both: control parallelism for index build
TFRecordRandomAccess("*.tfrecord", max_workers=128)
TarRandomAccess("*.tar", max_workers=8, use_multiprocessing=True)

Example: SA-1B

import json, io
from PIL import Image
from tfd_utils import TarRandomAccess

reader = TarRandomAccess("/path/to/sa1b/*.tar")    # gzip tars supported
key = reader.get_keys()[0]                          # e.g. 'sa_226692'

image      = Image.open(io.BytesIO(reader.get_feature(key, "jpg")))
annotation = json.loads(reader.get_feature(key, "json"))
print(f"{annotation['image']['width']}x{annotation['image']['height']},",
      f"{len(annotation['annotations'])} masks")

TensorFlow interoperability

Files written by tfd_utils are byte-identical to TensorFlow's TFRecord format:

import tensorflow as tf
for record in tf.data.TFRecordDataset("data.tfrecord"):
    ex = tf.train.Example()
    ex.ParseFromString(record.numpy())

The reverse direction works too — TFRecordRandomAccess reads files written by tf.io.TFRecordWriter.
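On disk, each record is framed as a little-endian uint64 payload length, a uint32 masked CRC-32C of that length, the payload bytes, and a uint32 masked CRC-32C of the payload. The sketch below walks that framing with dummy CRC fields and skips verification (which real readers perform):

```python
import struct

def iter_tfrecord_frames(buf):
    """Walk TFRecord framing: u64 length, u32 length-CRC, payload, u32 payload-CRC.
    Sketch only -- CRC verification is omitted."""
    off = 0
    while off < len(buf):
        (length,) = struct.unpack_from("<Q", buf, off)
        yield buf[off + 12 : off + 12 + length]   # skip 8-byte length + 4-byte CRC
        off += 12 + length + 4                    # advance past trailing payload CRC

def frame(payload):
    """Build one frame with zeroed CRC fields (dummy values, for illustration)."""
    return struct.pack("<QI", len(payload), 0) + payload + struct.pack("<I", 0)

buf = frame(b"hello") + frame(b"world")
print(list(iter_tfrecord_frames(buf)))  # [b'hello', b'world']
```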


Claude Code skill

Install the bundled skill so Claude Code can assist with the library in any project:

tfd install-skill

This copies a versioned SKILL.md to ~/.claude/skills/tfd-utils/. Re-run after upgrading the library to refresh the skill content:

pip install -U tfd-utils && tfd install-skill

Version notes

v1.2.0

  • Count-encoded index filenames — auto-generated index paths now embed the shard count. For a complete XXXXX_of_NNNNN.tfrecord shard set, the cache is named all<N>.index (where N = NNNNN + 1). Otherwise it falls back to <first_stem>_unified_tot<N>.index. Single-file readers still use <stem>.index. Validity is now a pure existence check on this path — mtime is no longer consulted.
  • HDFS-friendly build lock — the <index>.lock file now stores <owner>|<heartbeat_ns> as text content (filesystem mtime is not trusted on shared/network FS). The lock holder spawns a daemon heartbeat thread that refreshes the timestamp every 30s; other processes treat the lock as stale after 5 min without an update. After acquiring the lock, the holder waits 1s and re-reads the file to verify the owner field still matches its own — defends against non-atomic O_CREAT|O_EXCL on HDFS.
  • Stress-tested on HDFS with 168 concurrent workers across 6 rounds: every round saw exactly one builder, no overlapping build intervals, all workers loaded the complete key set.
  • Cache compatibility: existing v1.0.0 / v1.1.0 indexes will be orphaned by the new naming scheme and rebuilt once on first access. Old caches can be safely deleted: rm /path/to/data/*.index.
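The build-lock staleness rule described above reduces to a timestamp comparison on the heartbeat field. The sketch below is illustrative only, with constants taken from the description; it is not the library's implementation:

```python
import time

HEARTBEAT_S = 30          # holder refreshes the heartbeat this often
STALE_AFTER_S = 5 * 60    # others treat the lock as dead past this age

def parse_lock(text):
    """Lock file content is '<owner>|<heartbeat_ns>'; parse both fields."""
    owner, heartbeat_ns = text.rsplit("|", 1)
    return owner, int(heartbeat_ns)

def is_stale(text, now_ns=None):
    """A lock is stale once its heartbeat is older than STALE_AFTER_S."""
    now_ns = time.time_ns() if now_ns is None else now_ns
    _, heartbeat_ns = parse_lock(text)
    return (now_ns - heartbeat_ns) > STALE_AFTER_S * 1_000_000_000

fresh = f"rank0-pid123|{time.time_ns()}"
print(is_stale(fresh))                             # False: heartbeat just written
print(is_stale("rank0-pid123|0", now_ns=10**18))   # True: heartbeat long dead
```

Storing the timestamp in the file content, rather than relying on filesystem mtime, is what makes this safe on HDFS and other network filesystems.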

v1.0.0

  • New tfd prebuild CLI command — pre-build .index files for one or more shards before training, with high default concurrency (max(2x CPU, 32) workers, override with -w). Avoids the multi-rank race + multi-minute training-startup hang on large datasets.
  • tfd_utils.__version__ is now exposed (sourced from package metadata).
  • tfd install-skill stamps the installed library version into the bundled Claude Code skill.
  • README rewritten for clarity; pre-build workflow promoted to a top-level section.

v0.4.3 — concurrency fix (recommended upgrade)

Versions before 0.4.3 have a critical concurrency bug: when multiple processes built the index simultaneously, they could corrupt the .index file, causing _pickle.UnpicklingError on the next run.

v0.4.3 introduces:

  • Exclusive build lock — only one process builds the index at a time; others wait and reuse the result.
  • Atomic index write — index is written to .tmp and renamed into place; a killed process can never leave a half-written index behind.
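The second fix is the standard write-then-rename idiom. A minimal sketch, not the library's code:

```python
import os
import pickle
import tempfile

def atomic_pickle(obj, path):
    """Pickle obj to a temp file in the same directory, then rename into place.
    os.replace is atomic on POSIX, so readers never see a half-written file."""
    dirname = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=dirname, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(obj, f)
            f.flush()
            os.fsync(f.fileno())       # make sure bytes hit disk before the swap
        os.replace(tmp, path)          # atomic: either old file or complete new one
    except BaseException:
        os.unlink(tmp)
        raise

atomic_pickle({"record_1": (0, 1024)}, "demo.index")
```

The temp file must live in the same directory as the target, because rename is only atomic within a single filesystem.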

If you hit _pickle.UnpicklingError on an existing dataset, delete the stale .index files and upgrade:

rm /path/to/data/*.index
pip install --upgrade tfd-utils

License

MIT
