TensorFlow utilities for efficient TFRecord processing and random access


tfd-utils

Lightweight Python library for O(1) random access to TensorFlow TFRecord files, plus keyed random access to tar archives. No TensorFlow dependency is required for the core library.

  • Unified API for TFRecord and tar (read any record by key; O(1) for TFRecord)
  • Index built once and cached to disk; auto-rebuilt when the shard count changes (count is encoded in the cache filename)
  • 100% wire-compatible with tf.data.TFRecordDataset (reads files written by TensorFlow, and vice versa)
  • Multi-file / glob support, parallel index build
  • tfd CLI: list, extract, get, convert, prebuild, install-skill

Installation

pip install tfd-utils

Core dependencies: numpy, protobuf, crc32c. TensorFlow is not required.


Quickstart

Read a TFRecord

from tfd_utils import TFRecordRandomAccess

reader = TFRecordRandomAccess("data.tfrecord")
# also accepts a list or glob: ["train_*.tfrecord", "val_*.tfrecord"]

image_bytes = reader.get_feature("record_1", "image")
record      = reader["record_1"]      # Example protobuf
print(len(reader), "records")

Write a TFRecord

from tfd_utils.writer import TFRecordWriter
from tfd_utils.pb2 import Example, Features, Feature, BytesList

with TFRecordWriter("data.tfrecord") as w:
    ex = Example(features=Features(feature={
        'key':   Feature(bytes_list=BytesList(value=[b'record_1'])),
        'image': Feature(bytes_list=BytesList(value=[image_bytes])),
    }))
    w.write(ex.SerializeToString())

Read a tar archive

Tar members sharing a stem are grouped under the same key: sa_000001.jpg + sa_000001.json → key sa_000001, features jpg / json.

from tfd_utils import TarRandomAccess

reader = TarRandomAccess("archive.tar")          # also: "sa1b/*.tar"
jpg_bytes  = reader.get_feature("sa_000001", "jpg")
json_bytes = reader.get_feature("sa_000001", "json")
record     = reader["sa_000001"]                  # {'jpg': bytes, 'json': bytes}
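The stem-grouping rule is simple enough to sketch in plain Python. The group_by_stem helper below is illustrative only; it is not part of the tfd_utils API:

```python
from collections import defaultdict
from pathlib import PurePosixPath

def group_by_stem(member_names):
    """Group tar member names by file stem; each extension becomes a feature.
    Illustrative sketch of the grouping rule, not tfd_utils internals."""
    groups = defaultdict(dict)
    for name in member_names:
        p = PurePosixPath(name)
        key = p.stem                      # 'sa_000001.jpg' -> 'sa_000001'
        feature = p.suffix.lstrip(".")    # '.jpg'          -> 'jpg'
        groups[key][feature] = name
    return dict(groups)

print(group_by_stem(["sa_000001.jpg", "sa_000001.json"]))
```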

.tar, .tar.gz, and .tar.bz2 are supported (autodetected). Tar access is not O(1) — for training pipelines, convert to TFRecord first (see Converting tar → TFRecord).


Pre-build the index for large datasets

The first call to TFRecordRandomAccess(...) scans every shard to record byte offsets, then caches a <file>.index next to each shard. For thousands of shards or hundreds of millions of records, this first-time scan can take minutes to hours.

Do not let your training script trigger the index build. Symptoms when you do:

  • The job appears to hang with no progress, holding GPUs while doing pure CPU/IO work.
  • Multi-rank launchers (torchrun, accelerate, …) race to build the same index from every rank, multiplying cost.

Recommended workflow — pre-build once on a CPU/login node, then launch training:

# Default — builds .index for every matching shard in parallel
tfd prebuild '/path/to/shards/*.tfrecord'

# Bump concurrency on a fat CPU node (default is 2x CPU count, min 32)
tfd prebuild '/path/to/shards/*.tfrecord' --workers 128

Verify the indexes exist before submitting the training job:

ls /path/to/shards/*.index | wc -l   # should equal shard count

Subsequent runs reuse the cached index file and start instantly. The auto-generated cache name encodes the shard count (all<N>.index for complete XXXXX_of_NNNNN.tfrecord shard sets, otherwise <first_stem>_unified_tot<N>.index), so a changed shard count automatically routes to a fresh path and triggers a rebuild — no mtime checks involved.
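The naming rule can be sketched as follows. index_cache_name is a hypothetical helper that mirrors the rule described above (including the documented N = NNNNN + 1 convention for complete shard sets); it is not the library's internal code:

```python
import re

def index_cache_name(shard_paths):
    """Derive the auto-generated index cache name from a shard list.
    Hypothetical sketch of the naming rule, not tfd_utils internals."""
    names = [p.rsplit("/", 1)[-1] for p in shard_paths]
    if len(names) == 1:                                  # single file -> <stem>.index
        return names[0].removesuffix(".tfrecord") + ".index"
    pat = re.compile(r"(\d+)_of_(\d+)\.tfrecord$")
    matches = [pat.search(n) for n in names]
    if all(matches) and len(names) == int(matches[0].group(2)) + 1:
        return f"all{len(names)}.index"                  # complete XXXXX_of_NNNNN set
    first_stem = names[0].removesuffix(".tfrecord")
    return f"{first_stem}_unified_tot{len(names)}.index"

print(index_cache_name([f"{i:05d}_of_00003.tfrecord" for i in range(4)]))  # all4.index
```

Because the shard count is baked into the name, adding or removing a shard changes the computed path, which is why a pure existence check is enough to detect staleness.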

Programmatic equivalent (only if you cannot run the CLI):

from tfd_utils import TFRecordRandomAccess
TFRecordRandomAccess('/path/to/shards/*.tfrecord', max_workers=128)

CLI

tfd list     data.tfrecord                          # show features of the first record
tfd extract  data.tfrecord <key>                    # extract a record (saves images to disk)
tfd get      data.tfrecord:<key>:<feature>          # extract a single feature

tfd prebuild '/path/to/shards/*.tfrecord'           # build .index ahead of training
tfd prebuild '/path/to/shards/*.tfrecord' -w 128

tfd convert  /path/to/sa1b/ -o /out/                # tar(s) → TFRecord(s)
tfd convert  '/path/to/sa1b/sa_*.tar' -o /out/ -d -w 32

tfd install-skill                                   # install the Claude Code skill

Converting tar → TFRecord

tfd convert reads each input tar and writes one TFRecord file per input tar. Each output record contains a key feature (the file stem) plus one bytes feature per file extension.

tfd convert /path/to/archive.tar                                  # single tar
tfd convert /path/to/sa1b/ --output-dir /out/                     # directory of tars
tfd convert '/path/to/sa1b/sa_0000*.tar' --output-dir /out/       # glob
tfd convert /path/to/sa1b/ --output-dir /out/ --delete            # delete sources on success
tfd convert /path/to/sa1b/ --output-dir /out/ --workers 32        # default is 16 workers

For SA-1B-style tars (paired .jpg + .json per image), each output record has:

Feature   Type    Content
key       bytes   File stem, e.g. sa_226692
jpg       bytes   Raw JPEG image bytes
json      bytes   Annotation JSON (masks, boxes…)

API reference

Common API (both readers)

reader.get_record(key)                 # full record
reader.get_feature(key, feature_name)  # single feature (bytes / int / float)
reader.get_feature_list(key, feature_name)
reader.get_keys()                      # all keys
reader.get_stats()                     # {'total_records': ..., 'total_files': ..., ...}
reader.contains_key(key)
reader.rebuild_index()                 # force rebuild

key in reader                          # __contains__
reader[key]                            # __getitem__ (raises KeyError if missing)
len(reader)                            # __len__

with TarRandomAccess("archive.tar") as r:   # context manager
    ...

Constructor options

# TFRecord: custom key feature name (default 'key')
TFRecordRandomAccess("file.tfrecord", key_feature_name="id")

# Both: custom index file location
TFRecordRandomAccess("file.tfrecord", index_file="my.index")
TarRandomAccess("archive.tar", index_file="my.tar_index")

# Both: control parallelism for index build
TFRecordRandomAccess("*.tfrecord", max_workers=128)
TarRandomAccess("*.tar", max_workers=8, use_multiprocessing=True)

Example: SA-1B

import json, io
from PIL import Image
from tfd_utils import TarRandomAccess

reader = TarRandomAccess("/path/to/sa1b/*.tar")    # gzip tars supported
key = reader.get_keys()[0]                          # e.g. 'sa_226692'

image      = Image.open(io.BytesIO(reader.get_feature(key, "jpg")))
annotation = json.loads(reader.get_feature(key, "json"))
print(f"{annotation['image']['width']}x{annotation['image']['height']},",
      f"{len(annotation['annotations'])} masks")

TensorFlow interoperability

Files written by tfd_utils are byte-identical to TensorFlow's TFRecord format:

import tensorflow as tf
for record in tf.data.TFRecordDataset("data.tfrecord"):
    ex = tf.train.Example()
    ex.ParseFromString(record.numpy())

The reverse direction works too — TFRecordRandomAccess reads files written by tf.io.TFRecordWriter.
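On disk, each record is framed as a little-endian uint64 payload length, a uint32 masked CRC-32C of that length, the payload bytes, and a uint32 masked CRC-32C of the payload. The sketch below walks that framing with dummy CRC fields and skips verification (which real readers perform):

```python
import struct

def iter_tfrecord_frames(buf):
    """Walk TFRecord framing: u64 length, u32 length-CRC, payload, u32 payload-CRC.
    Sketch only -- CRC verification is omitted."""
    off = 0
    while off < len(buf):
        (length,) = struct.unpack_from("<Q", buf, off)
        yield buf[off + 12 : off + 12 + length]   # skip 8-byte length + 4-byte CRC
        off += 12 + length + 4                    # advance past trailing payload CRC

def frame(payload):
    """Build one frame with zeroed CRC fields (dummy values, for illustration)."""
    return struct.pack("<QI", len(payload), 0) + payload + struct.pack("<I", 0)

buf = frame(b"hello") + frame(b"world")
print(list(iter_tfrecord_frames(buf)))  # [b'hello', b'world']
```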


Claude Code skill

Install the bundled skill so Claude Code can assist with the library in any project:

tfd install-skill

This copies a versioned SKILL.md to ~/.claude/skills/tfd-utils/. Re-run after upgrading the library to refresh the skill content:

pip install -U tfd-utils && tfd install-skill

Version notes

v1.2.0

  • Count-encoded index filenames — auto-generated index paths now embed the shard count. For a complete XXXXX_of_NNNNN.tfrecord shard set, the cache is named all<N>.index (where N = NNNNN + 1). Otherwise it falls back to <first_stem>_unified_tot<N>.index. Single-file readers still use <stem>.index. Validity is now a pure existence check on this path — mtime is no longer consulted.
  • HDFS-friendly build lock — the <index>.lock file now stores <owner>|<heartbeat_ns> as text content (filesystem mtime is not trusted on shared/network FS). The lock holder spawns a daemon heartbeat thread that refreshes the timestamp every 30s; other processes treat the lock as stale after 5 min without an update. After acquiring the lock, the holder waits 1s and re-reads the file to verify the owner field still matches its own — defends against non-atomic O_CREAT|O_EXCL on HDFS.
  • Stress-tested on HDFS with 168 concurrent workers across 6 rounds: every round saw exactly one builder, no overlapping build intervals, all workers loaded the complete key set.
  • Cache compatibility: existing v1.0.0 / v1.1.0 indexes will be orphaned by the new naming scheme and rebuilt once on first access. Old caches can be safely deleted: rm /path/to/data/*.index.
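The build-lock staleness rule described above reduces to a timestamp comparison on the heartbeat field. The sketch below is illustrative only, with constants taken from the description; it is not the library's implementation:

```python
import time

HEARTBEAT_S = 30          # holder refreshes the heartbeat this often
STALE_AFTER_S = 5 * 60    # others treat the lock as dead past this age

def parse_lock(text):
    """Lock file content is '<owner>|<heartbeat_ns>'; parse both fields."""
    owner, heartbeat_ns = text.rsplit("|", 1)
    return owner, int(heartbeat_ns)

def is_stale(text, now_ns=None):
    """A lock is stale once its heartbeat is older than STALE_AFTER_S."""
    now_ns = time.time_ns() if now_ns is None else now_ns
    _, heartbeat_ns = parse_lock(text)
    return (now_ns - heartbeat_ns) > STALE_AFTER_S * 1_000_000_000

fresh = f"rank0-pid123|{time.time_ns()}"
print(is_stale(fresh))                             # False: heartbeat just written
print(is_stale("rank0-pid123|0", now_ns=10**18))   # True: heartbeat long dead
```

Storing the timestamp in the file content, rather than relying on filesystem mtime, is what makes this safe on HDFS and other network filesystems.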

v1.0.0

  • New tfd prebuild CLI command — pre-build .index files for one or more shards before training, with high default concurrency (max(2x CPU, 32) workers, override with -w). Avoids the multi-rank race + multi-minute training-startup hang on large datasets.
  • tfd_utils.__version__ is now exposed (sourced from package metadata).
  • tfd install-skill stamps the installed library version into the bundled Claude Code skill.
  • README rewritten for clarity; pre-build workflow promoted to a top-level section.

v0.4.3 — concurrency fix (recommended upgrade)

Versions before 0.4.3 have a critical concurrency bug: when multiple processes built the index simultaneously, they could corrupt the .index file, causing _pickle.UnpicklingError on the next run.

v0.4.3 introduces:

  • Exclusive build lock — only one process builds the index at a time; others wait and reuse the result.
  • Atomic index write — index is written to .tmp and renamed into place; a killed process can never leave a half-written index behind.
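The second fix is the standard write-then-rename idiom. A minimal sketch, not the library's code:

```python
import os
import pickle
import tempfile

def atomic_pickle(obj, path):
    """Pickle obj to a temp file in the same directory, then rename into place.
    os.replace is atomic on POSIX, so readers never see a half-written file."""
    dirname = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=dirname, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(obj, f)
            f.flush()
            os.fsync(f.fileno())       # make sure bytes hit disk before the swap
        os.replace(tmp, path)          # atomic: either old file or complete new one
    except BaseException:
        os.unlink(tmp)
        raise

atomic_pickle({"record_1": (0, 1024)}, "demo.index")
```

The temp file must live in the same directory as the target, because rename is only atomic within a single filesystem.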

If you hit _pickle.UnpicklingError on an existing dataset, delete the stale .index files and upgrade:

rm /path/to/data/*.index
pip install --upgrade tfd-utils

License

MIT
