
TensorFlow utilities for efficient TFRecord processing and random access


tfd-utils

A lightweight Python library for efficient random access to TensorFlow TFRecord files and tar archives, without requiring TensorFlow.

Upgrade Notice — v0.4.3

Versions before 0.4.3 have a critical concurrency bug: when multiple processes build the index simultaneously, they can corrupt the .index file, causing UnpicklingError on the next run.

v0.4.3 fixes this with two changes:

  • Exclusive build lock — only one process builds the index at a time; others wait and then reuse the result.
  • Atomic index write — the index is written to a .tmp file and renamed into place, so a killed process can never leave a half-written index behind.
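The two fixes can be pictured with a small stdlib-only sketch. The function and file names here are illustrative, not the library's internals: an exclusive flock serializes builders, and os.replace publishes the finished index atomically.

```python
# Illustrative sketch of the v0.4.3 fixes (not the library's actual code):
# an exclusive file lock serializes index builds, and os.replace makes the
# final write atomic, so a killed process never leaves a half-written index.
import fcntl
import os
import pickle


def build_index_safely(index_path, build_fn):
    lock_path = index_path + ".lock"
    with open(lock_path, "w") as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)   # other processes block here
        try:
            if os.path.exists(index_path):      # another process already built it
                with open(index_path, "rb") as f:
                    return pickle.load(f)
            index = build_fn()
            tmp_path = index_path + ".tmp"
            with open(tmp_path, "wb") as f:
                pickle.dump(index, f)
            os.replace(tmp_path, index_path)    # atomic rename: all-or-nothing
            return index
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)
```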

If you hit _pickle.UnpicklingError on an existing dataset, delete the stale .index files and upgrade:

rm /path/to/data/*.index
pip install --upgrade tfd-utils

Key Features

  • Unified API: Access TFRecord files and tar archives through the same interface.
  • Random Access: Access any record by key in O(1) time without reading the entire file.
  • Automatic Index Caching: Index is built once and cached to disk; rebuilt automatically when files change.
  • Lightweight & Standalone: TFRecord support requires only numpy, protobuf, and crc32c. No TensorFlow needed.
  • Full TensorFlow Compatibility: Write with tfd_utils, read with TensorFlow (or vice versa). 100% compatible.
  • Multiple File Support: Single files, lists of files, or glob patterns.
  • Tar-to-TFRecord Conversion: CLI tool to batch-convert tar archives to TFRecord format with parallel workers and optional source deletion.
  • Claude Code Skill: One-command install (tfd install-skill) to enable AI assistance with the library in any project.
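The O(1) random access above rests on a key-to-byte-offset index. The following is a generic sketch of that idea over an in-memory buffer, not the library's actual on-disk index format: a lookup is one seek plus one read, regardless of file size.

```python
# Minimal illustration of index-based random access (names are hypothetical):
# an index maps each key to (offset, length), so a lookup never scans the file.
import io

records = {"record_1": b"first", "record_2": b"second"}

# Write records back-to-back, remembering where each one starts.
buf = io.BytesIO()
index = {}
for key, payload in records.items():
    index[key] = (buf.tell(), len(payload))
    buf.write(payload)


def get(key):
    offset, length = index[key]
    buf.seek(offset)          # O(1): jump straight to the record
    return buf.read(length)
```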

Installation

pip install tfd-utils

Usage

TFRecord Random Access

from tfd_utils import TFRecordRandomAccess

reader = TFRecordRandomAccess("data.tfrecord")
# or multiple files / glob patterns
reader = TFRecordRandomAccess(["train_*.tfrecord", "val_*.tfrecord"])

image_bytes = reader.get_feature("record_1", "image")
record = reader["record_1"]   # all features as Example protobuf
print(f"Total records: {len(reader)}")

Tar Archive Random Access

Tar archives are expected to contain paired files sharing the same stem:

sa_000001.jpg   →  key='sa_000001', feature='jpg'
sa_000001.json  →  key='sa_000001', feature='json'

Both uncompressed (.tar) and compressed (.tar.gz, etc.) archives are supported.

from tfd_utils import TarRandomAccess

reader = TarRandomAccess("archive.tar")
# or glob / list of tars
reader = TarRandomAccess("sa1b/*.tar")

jpg_bytes  = reader.get_feature("sa_000001", "jpg")
json_bytes = reader.get_feature("sa_000001", "json")
record     = reader["sa_000001"]   # {'jpg': bytes, 'json': bytes}
print(f"Total records: {len(reader)}")

Member paths with subdirectory prefixes are handled automatically: ./subdir/foo.jpg → key subdir/foo.
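As a rough illustration of the naming convention above (the helper is hypothetical, not part of the tfd_utils API), a member path splits into key and feature at the last extension, after stripping a leading `./`:

```python
# Sketch of the key/feature convention for tar member names.
import posixpath


def split_member(name):
    if name.startswith("./"):          # normalize leading './'
        name = name[2:]
    stem, ext = posixpath.splitext(name)
    return stem, ext.lstrip(".")

# split_member("sa_000001.jpg")    -> ("sa_000001", "jpg")
# split_member("./subdir/foo.jpg") -> ("subdir/foo", "jpg")
```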

Example: SA-1B Dataset

SA-1B tars are gzip-compressed and contain paired .jpg / .json files per image:

import json
from PIL import Image
import io
from tfd_utils import TarRandomAccess

# Point to one or more SA-1B tar files (compressed tars are supported)
reader = TarRandomAccess("/path/to/sa1b/sa_000020.tar")
# or load multiple shards at once
reader = TarRandomAccess("/path/to/sa1b/*.tar")

# Each key is the image ID (e.g. 'sa_226692')
keys = reader.get_keys()
print(f"Images in this shard: {len(keys)}")

key = keys[0]

# Load the JPEG image
jpg_bytes = reader.get_feature(key, "jpg")
image = Image.open(io.BytesIO(jpg_bytes))

# Load the annotation (segmentation masks, bounding boxes, …)
json_bytes = reader.get_feature(key, "json")
annotation = json.loads(json_bytes)
print(f"Image size : {annotation['image']['width']}x{annotation['image']['height']}")
print(f"Masks      : {len(annotation['annotations'])}")

Writing TFRecords

from tfd_utils.writer import TFRecordWriter
from tfd_utils.pb2 import Example, Features, Feature, BytesList

with TFRecordWriter("data.tfrecord") as writer:
    example = Example(features=Features(feature={
        'key':   Feature(bytes_list=BytesList(value=[b'record_1'])),
        'image': Feature(bytes_list=BytesList(value=[b'<image bytes>'])),
    }))
    writer.write(example.SerializeToString())

Common API (both readers)

reader.get_record(key)                    # full record
reader.get_feature(key, feature_name)     # single feature
reader.get_feature_list(key, feature_name)
reader.get_keys()                         # all keys
reader.get_stats()                        # total_records, total_files, ...
reader.contains_key(key)
reader.rebuild_index()

key in reader                             # __contains__
reader[key]                               # __getitem__ (raises KeyError if missing)
len(reader)                               # __len__

with TarRandomAccess("archive.tar") as r: # context manager
    ...

Advanced Options

# TFRecord: custom key feature name (default 'key')
reader = TFRecordRandomAccess("file.tfrecord", key_feature_name="id")

# Both: custom index file location
reader = TFRecordRandomAccess("file.tfrecord", index_file="my.index")
reader = TarRandomAccess("archive.tar", index_file="my.tar_index")

# Both: control parallelism
reader = TarRandomAccess("*.tar", max_workers=8, use_multiprocessing=True)

CLI

tfd list    /path/to/data.tfrecord
tfd extract /path/to/data.tfrecord record_key
tfd get     /path/to/data.tfrecord:record_key:feature_name

Claude Code Skill

Install the tfd-utils skill for Claude Code to get AI assistance with the library in any project:

tfd install-skill

This copies a skill file to ~/.claude/skills/tfd-utils/SKILL.md, enabling Claude to assist with TFRecord and tar access, the CLI, and best practices.

Converting Tar Archives to TFRecord

The tfd convert command converts tar archive(s) to TFRecord files. Each record stores the paired member files as bytes features named after their extensions, plus a key feature containing the file stem.

# Convert a single tar
tfd convert /path/to/archive.tar

# Convert all tars in a directory, write to a different output directory
tfd convert /path/to/sa1b/ --output-dir /path/to/output/

# Glob pattern
tfd convert '/path/to/sa1b/sa_0000*.tar' --output-dir /path/to/output/

# Delete each source tar after successful conversion
tfd convert /path/to/sa1b/ --output-dir /path/to/output/ --delete

# Control parallelism (default: 16 workers)
tfd convert /path/to/sa1b/ --output-dir /path/to/output/ --workers 32

Each input foo.tar produces foo.tfrecord in the output directory (default: same directory as the source). A TFRecord produced from SA-1B tars contains these features per record:

Feature  Type   Content
key      bytes  File stem, e.g. sa_226692
jpg      bytes  Raw JPEG image bytes
json     bytes  Annotation JSON (masks, boxes…)

Read the converted records back with the standard reader:

from tfd_utils import TFRecordRandomAccess
import json

reader = TFRecordRandomAccess("/path/to/output/sa_000000.tfrecord")
jpg_bytes  = reader.get_feature("sa_226692", "jpg")
json_bytes = reader.get_feature("sa_226692", "json")
annotation = json.loads(json_bytes)

TensorFlow Interoperability

import tensorflow as tf

dataset = tf.data.TFRecordDataset("data.tfrecord")
for record in dataset:
    example = tf.train.Example()
    example.ParseFromString(record.numpy())
    key = example.features.feature['key'].bytes_list.value[0]

License

MIT License
