
TensorFlow utilities for efficient TFRecord processing and random access


TensorFlow TFRecord Utils

A lightweight Python library for efficient TFRecord processing with random access support. TensorFlow itself is not required.

🤖 Context for LLMs

What is this library? tfd_utils is a Python library that provides efficient random access to TensorFlow TFRecord files without requiring TensorFlow as a dependency. It's designed for scenarios where you need to quickly access specific records by key rather than sequentially reading entire files.

Key architectural concepts:

  • TFRecord Compatibility: Reads/writes files that are 100% compatible with TensorFlow's tf.data.TFRecordDataset and tf.io.TFRecordWriter - verified through comprehensive test suite
  • Random Access Index: Automatically builds and caches an index mapping record keys to file positions for O(1) lookup
  • Protocol Buffers: Uses protobuf definitions for TensorFlow's Example and Feature structures
  • Minimal Dependencies: Only requires numpy, protobuf, and crc32c - no TensorFlow installation needed
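For readers unfamiliar with the on-disk format, the sketch below frames and unframes a single record the way TFRecord files lay it out: an 8-byte little-endian length, a masked CRC32C of that length, the payload, and a masked CRC32C of the payload. It is an illustration only, not tfd_utils code; the function names are hypothetical, and the pure-Python CRC32C stands in for the crc32c package so the example has no dependencies.

```python
import struct

# CRC32C (Castagnoli) lookup table, reflected polynomial 0x82F63B78.
_CRC_TABLE = []
for _i in range(256):
    _crc = _i
    for _ in range(8):
        _crc = (_crc >> 1) ^ 0x82F63B78 if _crc & 1 else _crc >> 1
    _CRC_TABLE.append(_crc)


def crc32c(data: bytes) -> int:
    """Pure-Python CRC32C; slow, but enough for a demonstration."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc = _CRC_TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


def masked_crc32c(data: bytes) -> int:
    """Apply the rotate-and-add mask that TFRecord uses on every checksum."""
    crc = crc32c(data)
    rotated = ((crc >> 15) | (crc << 17)) & 0xFFFFFFFF
    return (rotated + 0xA282EAD8) & 0xFFFFFFFF


def frame_record(payload: bytes) -> bytes:
    """Wrap a serialized payload in TFRecord framing (hypothetical helper)."""
    header = struct.pack("<Q", len(payload))
    return (
        header
        + struct.pack("<I", masked_crc32c(header))
        + payload
        + struct.pack("<I", masked_crc32c(payload))
    )


def unframe_record(frame: bytes) -> bytes:
    """Check both checksums and return the payload (hypothetical helper)."""
    header = frame[:8]
    (length_crc,) = struct.unpack("<I", frame[8:12])
    if length_crc != masked_crc32c(header):
        raise ValueError("corrupt length header")
    (length,) = struct.unpack("<Q", header)
    payload = frame[12:12 + length]
    (data_crc,) = struct.unpack("<I", frame[12 + length:16 + length])
    if data_crc != masked_crc32c(payload):
        raise ValueError("corrupt payload")
    return payload
```

Each record therefore costs 16 bytes of framing on top of its payload, which is what makes it possible to skip from one record to the next without parsing the payload.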

Common usage patterns:

  1. Writing: Use TFRecordWriter to create TFRecord files with key-value structured data
  2. Random Reading: Use TFRecordRandomAccess to instantly access any record by its key
  3. Batch Processing: Process large datasets efficiently by accessing only needed records
  4. TensorFlow Interop: Seamlessly switch between this library and native TensorFlow readers/writers

File structure: The library is organized into modules for writing (writer/), random access (random_access.py), and protocol buffer definitions (pb2/).

🚀 Key Features

  • 🔄 Full TensorFlow Compatibility: Write with tfd_utils, read with TensorFlow (or vice versa) - 100% compatible, verified in tests
  • ⚡ Random Access Support: Access any record by key in O(1) time without reading the entire file
  • 🪶 Lightweight & Standalone: No TensorFlow installation required - works with just numpy, protobuf, and crc32c
  • 📦 Ready to Use: Simple API, automatic index caching, and zero configuration
  • 🗂️ Multiple File Support: Handle single files, lists of files, or glob patterns seamlessly
  • 💾 Memory Efficient: Only loads requested records into memory, not the entire dataset
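As a rough idea of what accepting single files, lists, or glob patterns can involve, the sketch below normalizes all three into one ordered file list using only the standard library. The helper name expand_sources and its error behavior are assumptions made for illustration; the library's actual handling may differ.

```python
import glob
import os
import tempfile


def expand_sources(sources):
    """Turn a path, a pattern, or a list of either into a list of files."""
    if isinstance(sources, (str, os.PathLike)):
        sources = [sources]
    files = []
    for source in sources:
        matches = sorted(glob.glob(os.fspath(source)))
        if not matches:
            raise FileNotFoundError(f"no files match {source!r}")
        files.extend(matches)
    # Drop duplicates while keeping first-seen order.
    return list(dict.fromkeys(files))


# Demonstration on throwaway files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("train_1.tfrecord", "train_0.tfrecord", "validation_0.tfrecord"):
        open(os.path.join(tmp, name), "wb").close()
    found = expand_sources([
        os.path.join(tmp, "train_*.tfrecord"),
        os.path.join(tmp, "validation_*.tfrecord"),
    ])
    found_names = [os.path.basename(path) for path in found]
    single = expand_sources(os.path.join(tmp, "train_0.tfrecord"))
    single_names = [os.path.basename(path) for path in single]
```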

Installation

Install via pip (lightweight, no TensorFlow dependency):

pip install tfd-utils

Or for development with optional TensorFlow support:

# Clone and install
git clone https://github.com/HarborYuan/tfd-utils.git
cd tfd-utils
pip install -e ".[dev]"

Why TFD Utils?

TensorFlow Compatible, TensorFlow Optional

  • Write TFRecords with tfd_utils, read with tf.data.TFRecordDataset (tested)
  • Write with tf.io.TFRecordWriter, read with tfd_utils (tested)
  • No TensorFlow installation required for basic usage

Random Access Made Simple

  • Traditional: Read entire file sequentially to find one record
  • TFD Utils: Jump directly to any record by key in O(1) time
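The contrast above can be sketched with an in-memory file and a plain dictionary from key to byte offset. This is a simplified model rather than the library's implementation: the helper names are hypothetical, the two checksum fields are zero-filled for brevity (so the bytes are not a valid TFRecord file), and keys are supplied directly instead of being parsed from each Example.

```python
import io
import struct


def append_record(stream, payload: bytes) -> int:
    """Append one length-prefixed record and return its starting offset."""
    start = stream.tell()
    stream.write(struct.pack("<Q", len(payload)))  # 8-byte length
    stream.write(b"\x00" * 4)                      # placeholder length checksum
    stream.write(payload)
    stream.write(b"\x00" * 4)                      # placeholder payload checksum
    return start


def read_record(stream, offset: int) -> bytes:
    """Seek straight to a record and read only that record's payload."""
    stream.seek(offset)
    (length,) = struct.unpack("<Q", stream.read(8))
    stream.seek(4, io.SEEK_CUR)  # skip the length checksum
    return stream.read(length)


buffer = io.BytesIO()
index = {}  # key -> byte offset; a dict lookup is O(1) on average
for key, payload in [("record_1", b"cat"), ("record_2", b"dog"), ("record_3", b"bird")]:
    index[key] = append_record(buffer, payload)

# One dict lookup plus one seek, regardless of how many records precede it.
third = read_record(buffer, index["record_3"])
```

Building the index still requires one pass over the file, which is why caching it between runs matters.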

Production Ready

  • Automatic index caching for performance
  • Robust error handling
  • Memory efficient design
  • Tested compatibility with TensorFlow 2.19.0 (see tests/ directory)

Quick Start

Writing TFRecords (TensorFlow Compatible)

from tfd_utils.writer import TFRecordWriter
from tfd_utils.pb2 import Example, Features, Feature, BytesList

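# Placeholder payload; in practice this would be encoded image data (e.g. JPEG bytes)
image_bytes = b"raw image bytes"

# Create TFRecord files that TensorFlow can read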
with TFRecordWriter("data.tfrecord") as writer:
    example = Example(features=Features(feature={
        'key': Feature(bytes_list=BytesList(value=[b'record_1'])),
        'image': Feature(bytes_list=BytesList(value=[image_bytes])),
        'label': Feature(bytes_list=BytesList(value=[b'cat']))
    }))
    writer.write(example.SerializeToString())

Random Access Reading

from tfd_utils.random_access import TFRecordRandomAccess

# Initialize with a single file
reader = TFRecordRandomAccess("data.tfrecord")

# Or with multiple files/patterns
reader = TFRecordRandomAccess([
    "train_*.tfrecord",
    "validation_*.tfrecord"
])

# Access any record instantly by key
record = reader.get_record("record_1")
image_bytes = reader.get_feature("record_1", "image")

# Dictionary-like access
if "record_1" in reader:
    record = reader["record_1"]

# Get statistics
print(f"Total records: {len(reader)}")

TensorFlow Interoperability

# Read tfd_utils files with TensorFlow
import tensorflow as tf

dataset = tf.data.TFRecordDataset("data.tfrecord")
for record in dataset:
    example = tf.train.Example()
    example.ParseFromString(record.numpy())
    # Process as usual...

Advanced Usage

Custom Key Feature

# Use different feature as the key (default is 'key')
reader = TFRecordRandomAccess("file.tfrecord", key_feature_name="id")

Custom Index Caching

# Specify custom index location
reader = TFRecordRandomAccess(
    "file.tfrecord",
    index_file="my_custom_index.cache"
)

# Force rebuild index if data changes (usually not needed)
reader.rebuild_index()
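To show why a cached index occasionally needs rebuilding, here is a minimal sketch of one possible caching scheme: the index is stored in a sidecar JSON file together with the data file's size and modification time, and is rebuilt when either changes. The function load_or_build_index, the JSON layout, and the staleness rule are all assumptions for illustration and say nothing about how tfd_utils stores or validates its own cache.

```python
import json
import os
import tempfile


def load_or_build_index(data_path, index_path, build_fn):
    """Return (index, rebuilt); reuse the cached index if the data file is unchanged."""
    stat = os.stat(data_path)
    fingerprint = {"size": stat.st_size, "mtime_ns": stat.st_mtime_ns}
    if os.path.exists(index_path):
        with open(index_path, "r", encoding="utf-8") as f:
            cached = json.load(f)
        if cached.get("fingerprint") == fingerprint:
            return cached["index"], False
    index = build_fn(data_path)  # the expensive full scan happens only here
    with open(index_path, "w", encoding="utf-8") as f:
        json.dump({"fingerprint": fingerprint, "index": index}, f)
    return index, True


# Demonstration with a stand-in builder that records how often it runs.
build_calls = []


def fake_build(path):
    build_calls.append(path)
    return {"record_1": 0}


with tempfile.TemporaryDirectory() as tmp:
    data_path = os.path.join(tmp, "data.tfrecord")
    index_path = data_path + ".index.json"
    with open(data_path, "wb") as f:
        f.write(b"first version")
    _, rebuilt_first = load_or_build_index(data_path, index_path, fake_build)
    _, rebuilt_second = load_or_build_index(data_path, index_path, fake_build)
    with open(data_path, "ab") as f:
        f.write(b" plus more")  # the size changes, so the cache is stale
    _, rebuilt_third = load_or_build_index(data_path, index_path, fake_build)
```

A size-and-mtime fingerprint is cheap but cannot detect an in-place edit that preserves both, which is one reason an explicit rebuild call is useful to have.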

License

MIT License


Ready to use, lightweight, and fully TensorFlow compatible! 🚀
