TensorFlow utilities for efficient TFRecord processing and random access

TensorFlow TFRecord Utils

A lightweight Python library for reading and writing TensorFlow TFRecord files with random access support, without requiring a TensorFlow installation.

Key Features

  • Full TensorFlow Compatibility: Files written with tfd_utils can be read by TensorFlow, and vice versa. Compatibility is verified in the test suite.
  • Random Access Support: Look up any record by key in O(1) time through a cached index, without reading the entire file.
  • Lightweight & Standalone: No TensorFlow installation required. Works with just numpy, protobuf, and crc32c.
  • Simple API: Ready to use with automatic index caching and zero configuration.
  • Multiple File Support: Handle single files, lists of files, or glob patterns seamlessly.
  • Memory Efficient: Only loads requested records into memory, not the entire dataset.
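
The O(1) lookup listed above depends on an index that maps each record to its byte position in the file. How tfd_utils builds its index is not described here, so the following is only a minimal sketch, assuming the standard TFRecord framing (an 8-byte little-endian length, a 4-byte CRC of the length, the payload, then a 4-byte CRC of the payload). The names `build_offset_index` and `read_payload` are hypothetical, and CRC verification is skipped for brevity.

```python
import struct
from typing import List, Tuple

# Assumed TFRecord framing (little-endian):
#   uint64 length | uint32 CRC of length | payload | uint32 CRC of payload
_HEADER = struct.Struct("<QI")  # length + length CRC = 12 bytes
_FOOTER_SIZE = 4                # payload CRC


def build_offset_index(path: str) -> List[Tuple[int, int]]:
    """Scan the file once and return (payload_offset, payload_length) per record.

    Hypothetical helper: CRCs are not checked here; a real reader should verify them.
    """
    index = []
    with open(path, "rb") as f:
        while True:
            header = f.read(_HEADER.size)
            if not header:
                break  # clean end of file
            if len(header) < _HEADER.size:
                raise ValueError("truncated record header")
            length, _length_crc = _HEADER.unpack(header)
            index.append((f.tell(), length))
            f.seek(length + _FOOTER_SIZE, 1)  # skip the payload and its CRC
    return index


def read_payload(path: str, offset: int, length: int) -> bytes:
    """Fetch one record's payload with a single seek and read."""
    with open(path, "rb") as f:
        f.seek(offset)
        data = f.read(length)
    if len(data) != length:
        raise ValueError("truncated record payload")
    return data
```

A key-based index would additionally parse each payload as an Example during the scan and store the key feature in a dictionary that points at these (offset, length) pairs, so that later lookups need only one dictionary access and one seek.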

Installation

Install via pip:

pip install tfd-utils

Or install from source for development, with optional TensorFlow support:

git clone https://github.com/HarborYuan/tfd-utils.git
cd tfd-utils
pip install -e ".[dev]"

Usage

Writing TFRecords

Create TFRecord files that TensorFlow can read:

from tfd_utils.writer import TFRecordWriter
from tfd_utils.pb2 import Example, Features, Feature, BytesList

with TFRecordWriter("data.tfrecord") as writer:
    example = Example(features=Features(feature={
        'key': Feature(bytes_list=BytesList(value=[b'record_1'])),
        'image': Feature(bytes_list=BytesList(value=[b'your_image_bytes'])),
        'label': Feature(bytes_list=BytesList(value=[b'cat']))
    }))
    writer.write(example.SerializeToString())
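
For readers curious about what ends up on disk, here is a rough sketch of how a single serialized Example could be framed in the TFRecord format, assuming the standard layout with masked CRC32C checksums. It is not taken from tfd_utils, which relies on the crc32c package for checksums; the slow pure-Python `crc32c` below and the name `frame_record` are illustrative stand-ins.

```python
import struct

_CRC32C_POLY = 0x82F63B78  # reflected Castagnoli polynomial
_MASK_DELTA = 0xA282EAD8   # constant used by the TFRecord CRC masking


def _build_table():
    table = []
    for i in range(256):
        crc = i
        for _ in range(8):
            crc = (crc >> 1) ^ _CRC32C_POLY if crc & 1 else crc >> 1
        table.append(crc)
    return table


_TABLE = _build_table()


def crc32c(data: bytes) -> int:
    """Table-driven CRC32C (illustrative; far slower than the crc32c package)."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc = _TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


def masked_crc32c(data: bytes) -> int:
    """Rotate the CRC right by 15 bits and add a constant, modulo 2**32."""
    crc = crc32c(data)
    rotated = ((crc >> 15) | (crc << 17)) & 0xFFFFFFFF
    return (rotated + _MASK_DELTA) & 0xFFFFFFFF


def frame_record(payload: bytes) -> bytes:
    """Wrap one payload as: length, CRC of length, payload, CRC of payload."""
    length = struct.pack("<Q", len(payload))
    return (
        length
        + struct.pack("<I", masked_crc32c(length))
        + payload
        + struct.pack("<I", masked_crc32c(payload))
    )
```

Under these assumptions, appending `frame_record(example.SerializeToString())` to a file for each Example would yield the same byte layout that a TFRecord writer produces.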

Random Access Reading

Initialize with a single file, or with multiple files/patterns:

from tfd_utils.random_access import TFRecordRandomAccess

# Single file
reader = TFRecordRandomAccess("data.tfrecord")

# Multiple files/patterns
reader = TFRecordRandomAccess([
    "train_*.tfrecord",
    "validation_*.tfrecord"
])

# Access any record instantly by key
record = reader.get_record("record_1")
image_bytes = reader.get_feature("record_1", "image")

# Dictionary-like access
if "record_1" in reader:
    record = reader["record_1"]

# Get statistics
print(f"Total records: {len(reader)}")
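
The calls above can be combined into a small batch helper. The sketch below uses only the `in` check and `get_feature` shown in this section; `fetch_feature_batch` itself is a hypothetical name and not part of tfd_utils. Because the behavior of `get_feature` for a missing key is not documented here, the helper checks membership first and reports missing keys separately.

```python
from typing import Any, Dict, Iterable, List, Tuple


def fetch_feature_batch(
    reader: Any, keys: Iterable[str], feature_name: str
) -> Tuple[Dict[str, Any], List[str]]:
    """Collect one feature for many keys from a reader-like object.

    `reader` is assumed to support `key in reader` and
    `reader.get_feature(key, feature_name)`, as in the example above.
    Returns (found, missing): the values by key, and the keys not present.
    """
    found: Dict[str, Any] = {}
    missing: List[str] = []
    for key in keys:
        if key in reader:
            found[key] = reader.get_feature(key, feature_name)
        else:
            missing.append(key)
    return found, missing
```

With a reader built as shown earlier, `found, missing = fetch_feature_batch(reader, ["record_1", "record_2"], "image")` would return the image values for the keys that exist and list the rest.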

Command-Line Interface (CLI)

tfd-utils comes with a handy command-line tool, tfd, for quick inspection of TFRecord files.

Listing Keys

To list all keys in one or more TFRecord files:

tfd list /path/to/your/data.tfrecord

You can also use glob patterns:

tfd list 'data_part_*.tfrecord'

Extracting Records

To extract a single record by its key:

tfd extract /path/to/your/data.tfrecord your_record_key

The tool will attempt to automatically detect the content type:

  • Images (JPEG, PNG, GIF) are saved to a file (e.g., your_record_key_image_0.jpeg).
  • Text is printed to the console.
  • Other binary or numerical data is displayed in a readable format.
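
The detection rules used by the tfd tool are not spelled out here. As a rough illustration of how such sniffing can work, the sketch below checks the well-known leading signature bytes of JPEG, PNG, and GIF files and falls back to a UTF-8 decode for text. The function name `sniff_content_type` and the returned labels are assumptions made for this example.

```python
def sniff_content_type(data: bytes) -> str:
    """Guess a coarse content type from the leading bytes (illustrative only)."""
    if data.startswith(b"\xff\xd8\xff"):
        return "jpeg"
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "png"
    if data.startswith((b"GIF87a", b"GIF89a")):
        return "gif"
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        return "binary"
    # Treat non-empty, fully printable (or whitespace) content as text.
    if text and all(ch.isprintable() or ch.isspace() for ch in text):
        return "text"
    return "binary"
```

A caller could then save `jpeg`, `png`, and `gif` payloads to files, print `text`, and show a hex or numeric dump for everything else.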

Getting a Specific Feature

To get a single feature from a record, pass the file path, record key, and feature name as one colon-separated argument:

tfd get /path/to/your/data.tfrecord:your_record_key:your_feature_name

The tool will attempt to automatically detect the content type, similar to the extract command.
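
The `path:key:feature` argument taken by `tfd get` can be split as sketched below. This is not the tool's actual parser; `parse_feature_spec` and `FeatureSpec` are hypothetical names. Splitting from the right is a design choice made here so that a path containing a colon (for example a Windows drive letter) still parses, at the cost of not supporting colons inside keys or feature names.

```python
from typing import NamedTuple


class FeatureSpec(NamedTuple):
    path: str
    key: str
    feature: str


def parse_feature_spec(spec: str) -> FeatureSpec:
    """Split 'path:key:feature' into its three parts, splitting from the right."""
    parts = spec.rsplit(":", 2)
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected 'path:key:feature', got {spec!r}")
    return FeatureSpec(*parts)
```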

TensorFlow Interoperability

Read tfd_utils files with TensorFlow:

import tensorflow as tf

dataset = tf.data.TFRecordDataset("data.tfrecord")
for record in dataset:
    example = tf.train.Example()
    example.ParseFromString(record.numpy())
    # Process as usual...

Advanced Usage

Custom Key Feature

Use a different feature as the key (default is 'key'):

reader = TFRecordRandomAccess("file.tfrecord", key_feature_name="id")

Custom Index Caching

Specify a custom index location:

reader = TFRecordRandomAccess(
    "file.tfrecord",
    index_file="my_custom_index.cache"
)

# Force rebuild index if data changes (usually not needed)
reader.rebuild_index()
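
Whether and how tfd_utils detects a stale index is not stated here. One common approach, shown purely as an assumption-laden sketch, is to store the data file's size and modification time next to the index and rebuild when they no longer match. The names `load_or_build_index` and `build_fn`, and the JSON cache layout, are hypothetical.

```python
import json
import os
from typing import Any, Callable


def load_or_build_index(
    data_path: str, index_path: str, build_fn: Callable[[str], Any]
) -> Any:
    """Return a cached index if the data file is unchanged, else rebuild it.

    `build_fn(data_path)` must return something JSON-serializable.
    """
    stat = os.stat(data_path)
    fingerprint = {"size": stat.st_size, "mtime_ns": stat.st_mtime_ns}

    try:
        with open(index_path, "r", encoding="utf-8") as f:
            cached = json.load(f)
    except (OSError, ValueError):
        cached = None  # no cache yet, or the cache file is unreadable

    if (
        isinstance(cached, dict)
        and cached.get("fingerprint") == fingerprint
        and "index" in cached
    ):
        return cached["index"]

    # Cache is missing or stale: rebuild and store it with the new fingerprint.
    index = build_fn(data_path)
    with open(index_path, "w", encoding="utf-8") as f:
        json.dump({"fingerprint": fingerprint, "index": index}, f)
    return index
```

A size and timestamp check is cheap but can miss an in-place edit that preserves both, which is one reason an explicit rebuild call remains useful.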

License

MIT License
