
TensorFlow utilities for efficient TFRecord processing and random access

TensorFlow TFRecord Utils

A lightweight Python library for efficient TensorFlow TFRecord processing with random access support, without requiring TensorFlow.

Key Features

  • Full TensorFlow Compatibility: Write with tfd_utils, read with TensorFlow, or vice versa. 100% compatible and verified in tests.
  • Random Access Support: Look up any record by key in O(1) time through a cached index, without scanning the whole file on each lookup.
  • Lightweight & Standalone: No TensorFlow installation required. Works with just numpy, protobuf, and crc32c.
  • Simple API: Ready to use with automatic index caching and zero configuration.
  • Multiple File Support: Handle single files, lists of files, or glob patterns seamlessly.
  • Memory Efficient: Only loads requested records into memory, not the entire dataset.
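
To make the random-access idea concrete, here is a minimal sketch of how a key-to-offset index can be built in one pass over TFRecord-style framing (8-byte little-endian length, 4-byte length CRC, payload, 4-byte payload CRC). It is not tfd_utils's implementation: the names frame, build_index, read_by_key and key_of are hypothetical, the CRC fields are zero-filled placeholders (so these bytes are not a valid TFRecord file), and the key is taken from a made-up "key=value" payload instead of a parsed Example feature.

```python
import io
import struct


def frame(payload: bytes) -> bytes:
    """Frame a payload in the TFRecord layout, with zeroed CRC placeholders."""
    return struct.pack("<Q", len(payload)) + b"\x00" * 4 + payload + b"\x00" * 4


def build_index(stream, key_of):
    """Scan the stream once and map each key to (payload_offset, payload_length)."""
    index = {}
    while True:
        header = stream.read(12)  # 8-byte length + 4-byte length CRC
        if len(header) < 12:
            break
        (length,) = struct.unpack("<Q", header[:8])
        offset = stream.tell()
        payload = stream.read(length)
        stream.seek(4, io.SEEK_CUR)  # skip the payload CRC
        index[key_of(payload)] = (offset, length)
    return index


def read_by_key(stream, index, key):
    """Seek straight to one record; no other record is read."""
    offset, length = index[key]
    stream.seek(offset)
    return stream.read(length)


# Hypothetical payloads of the form b"key=value" stand in for serialized Examples.
payloads = [b"record_1=cat", b"record_2=dog", b"record_3=bird"]
stream = io.BytesIO(b"".join(frame(p) for p in payloads))
index = build_index(stream, key_of=lambda p: p.split(b"=", 1)[0])
record = read_by_key(stream, index, b"record_2")
```

After the single indexing pass, each lookup is one dictionary access plus one seek and read, which is where the constant-time claim comes from.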

Installation

Install via pip:

pip install tfd-utils

Or install from source for development, with optional TensorFlow support:

git clone https://github.com/HarborYuan/tfd-utils.git
cd tfd-utils
pip install -e ".[dev]"

Usage

Writing TFRecords

Create TFRecord files that TensorFlow can read:

from tfd_utils.writer import TFRecordWriter
from tfd_utils.pb2 import Example, Features, Feature, BytesList

with TFRecordWriter("data.tfrecord") as writer:
    example = Example(features=Features(feature={
        'key': Feature(bytes_list=BytesList(value=[b'record_1'])),
        'image': Feature(bytes_list=BytesList(value=[b'your_image_bytes'])),
        'label': Feature(bytes_list=BytesList(value=[b'cat']))
    }))
    writer.write(example.SerializeToString())
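
For readers curious what ends up on disk, the sketch below frames one serialized payload the way the TFRecord format specifies: length, masked CRC32C of the length, payload, masked CRC32C of the payload. It is an illustration under assumptions, not the library's writer: crc32c, masked_crc and frame_record are names chosen here, and the checksum is computed with a slow pure-Python table so the example has no dependencies (the feature list above says the library itself relies on the crc32c package).

```python
import struct


def _make_crc32c_table():
    """Lookup table for CRC-32C (Castagnoli), reflected polynomial 0x82F63B78."""
    table = []
    for i in range(256):
        crc = i
        for _ in range(8):
            crc = (crc >> 1) ^ 0x82F63B78 if crc & 1 else crc >> 1
        table.append(crc)
    return table


_TABLE = _make_crc32c_table()


def crc32c(data: bytes) -> int:
    """Byte-at-a-time CRC-32C; fine for a demo, too slow for real datasets."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc = _TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


def masked_crc(data: bytes) -> int:
    """TFRecord stores a rotated, offset CRC instead of the raw value."""
    crc = crc32c(data)
    rotated = ((crc >> 15) | (crc << 17)) & 0xFFFFFFFF
    return (rotated + 0xA282EAD8) & 0xFFFFFFFF


def frame_record(payload: bytes) -> bytes:
    """Return length + masked CRC(length) + payload + masked CRC(payload)."""
    length = struct.pack("<Q", len(payload))
    return (
        length
        + struct.pack("<I", masked_crc(length))
        + payload
        + struct.pack("<I", masked_crc(payload))
    )


framed = frame_record(b"hello")
```

Each record therefore costs 16 bytes of framing on top of the payload, and a reader can skip from one record to the next using only the length field.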

Random Access Reading

Initialize with a single file, or with multiple files/patterns:

from tfd_utils.random_access import TFRecordRandomAccess

# Single file
reader = TFRecordRandomAccess("data.tfrecord")

# Multiple files/patterns
reader = TFRecordRandomAccess([
    "train_*.tfrecord",
    "validation_*.tfrecord"
])

# Access any record instantly by key
record = reader.get_record("record_1")
image_bytes = reader.get_feature("record_1", "image")

# Dictionary-like access
if "record_1" in reader:
    record = reader["record_1"]

# Get statistics
print(f"Total records: {len(reader)}")
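
When a reader is given a list that mixes literal paths and glob patterns, the list first has to be resolved to concrete files. The sketch below shows one plausible way to do that with the standard library; expand_sources is a hypothetical helper, and whether tfd_utils sorts or deduplicates matches in the same way is an assumption.

```python
import glob
import os
import tempfile


def expand_sources(sources):
    """Expand a path, a pattern, or a list of either into unique file paths."""
    if isinstance(sources, (str, os.PathLike)):
        sources = [sources]
    seen, paths = set(), []
    for source in sources:
        # glob returns a literal path unchanged if it exists, else no match.
        for path in sorted(glob.glob(os.fspath(source))):
            if path not in seen:
                seen.add(path)
                paths.append(path)
    return paths


with tempfile.TemporaryDirectory() as tmp:
    for name in ("train_1.tfrecord", "train_0.tfrecord", "validation_0.tfrecord"):
        open(os.path.join(tmp, name), "wb").close()

    base = glob.escape(tmp)  # protect any special characters in the temp path
    resolved = expand_sources([
        os.path.join(base, "train_*.tfrecord"),
        os.path.join(base, "validation_*.tfrecord"),
        os.path.join(base, "train_0.tfrecord"),  # duplicate, dropped below
    ])
    names = [os.path.basename(p) for p in resolved]
```

Sorting each pattern's matches keeps the file order stable between runs, which matters if an index is cached against that order.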

Command-Line Interface (CLI)

tfd-utils comes with a handy command-line tool, tfd, for quick inspection of TFRecord files.

Listing Keys

To list all keys in one or more TFRecord files:

tfd list /path/to/your/data.tfrecord

You can also use glob patterns:

tfd list 'data_part_*.tfrecord'

Extracting Records

To extract a single record by its key:

tfd extract /path/to/your/data.tfrecord your_record_key

The tool will attempt to automatically detect the content type:

  • Images (JPEG, PNG, GIF) are saved to a file (e.g., your_record_key_image_0.jpeg).
  • Text is printed to the console.
  • Other binary or numerical data is displayed in a readable format.
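
Content-type detection of this kind is usually done by checking the first few bytes of a value against well-known file signatures. The following is a minimal sketch of that approach, not the tfd tool's actual logic; sniff_image_type and classify are hypothetical names, and treating any valid UTF-8 as text is a simplifying assumption.

```python
def sniff_image_type(data: bytes):
    """Return 'jpeg', 'png' or 'gif' based on leading magic bytes, else None."""
    if data.startswith(b"\xff\xd8\xff"):
        return "jpeg"
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "png"
    if data.startswith((b"GIF87a", b"GIF89a")):
        return "gif"
    return None


def classify(data: bytes) -> str:
    """Classify a feature value as an image type, text, or opaque binary."""
    image_type = sniff_image_type(data)
    if image_type is not None:
        return image_type
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "binary"
    return "text"


kinds = [
    classify(b"\xff\xd8\xff\xe0" + b"\x00" * 16),      # JPEG signature
    classify(b"\x89PNG\r\n\x1a\n" + b"\x00" * 16),      # PNG signature
    classify(b"GIF89a" + b"\x00" * 16),                 # GIF signature
    classify(b"cat"),                                   # plain UTF-8 text
    classify(b"\x80\x81\x82"),                          # invalid UTF-8
]
```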

TensorFlow Interoperability

Read files written by tfd_utils with TensorFlow:

import tensorflow as tf

dataset = tf.data.TFRecordDataset("data.tfrecord")
for record in dataset:
    example = tf.train.Example()
    example.ParseFromString(record.numpy())
    # Process as usual...

Advanced Usage

Custom Key Feature

Use a different feature as the key (default is 'key'):

reader = TFRecordRandomAccess("file.tfrecord", key_feature_name="id")

Custom Index Caching

Specify a custom index location:

reader = TFRecordRandomAccess(
    "file.tfrecord",
    index_file="my_custom_index.cache"
)

# Force an index rebuild if the data changes (usually not needed)
reader.rebuild_index()
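
The README does not say how a cached index is judged stale. One common approach, sketched below purely as an assumption, is to store the data file's size and modification time next to the index and discard the cache when they no longer match. The names fingerprint, save_index and load_index, and the JSON layout, are hypothetical and do not describe the library's cache format.

```python
import json
import os
import tempfile


def fingerprint(path):
    """Cheap change detector: file size plus modification time."""
    st = os.stat(path)
    return {"size": st.st_size, "mtime_ns": st.st_mtime_ns}


def save_index(index_path, data_path, index):
    """Write the index together with the data file's current fingerprint."""
    with open(index_path, "w") as f:
        json.dump({"fingerprint": fingerprint(data_path), "index": index}, f)


def load_index(index_path, data_path):
    """Return the cached index, or None if it is missing, corrupt, or stale."""
    try:
        with open(index_path) as f:
            cached = json.load(f)
    except (FileNotFoundError, json.JSONDecodeError):
        return None
    if cached.get("fingerprint") != fingerprint(data_path):
        return None
    return cached["index"]


with tempfile.TemporaryDirectory() as tmp:
    data_path = os.path.join(tmp, "file.tfrecord")
    index_path = os.path.join(tmp, "my_custom_index.cache")
    with open(data_path, "wb") as f:
        f.write(b"\x00" * 32)  # placeholder bytes, not real records

    save_index(index_path, data_path, {"record_1": [12, 5]})
    fresh = load_index(index_path, data_path)

    with open(data_path, "ab") as f:
        f.write(b"\x00" * 8)  # the data file changes after indexing
    stale = load_index(index_path, data_path)
```

With a check like this, a caller would rebuild the index whenever load_index returns None, which is the situation an explicit rebuild is meant to cover.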

License

MIT License
