TensorFlow utilities for efficient TFRecord processing and random access
TensorFlow TFRecord Utils
A lightweight Python library for efficient TensorFlow TFRecord processing with random access support. It does not require TensorFlow.
🤖 Context for LLMs
What is this library? tfd_utils is a Python library that provides efficient random access to TensorFlow TFRecord files without requiring TensorFlow as a dependency. It's designed for scenarios where you need to quickly access specific records by key rather than sequentially reading entire files.
Key architectural concepts:
- TFRecord Compatibility: Reads and writes files that are 100% compatible with TensorFlow's tf.data.TFRecordDataset and tf.io.TFRecordWriter, verified through a comprehensive test suite
- Random Access Index: Automatically builds and caches an index mapping record keys to file positions for O(1) lookup
- Protocol Buffers: Uses protobuf definitions for TensorFlow's Example and Feature structures
- Minimal Dependencies: Only requires numpy, protobuf, and crc32c; no TensorFlow installation needed
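To make the random access index concrete, here is a minimal sketch of how an offset index over TFRecord framing could be built with the standard library alone. It is not tfd_utils's actual implementation, whose index format is not described here. The names build_offset_index, read_at, and key_of are hypothetical, and the CRC fields are skipped rather than verified.

```python
import struct
from typing import BinaryIO, Callable, Dict, Tuple

# TFRecord framing for each record:
#   uint64 length (little-endian)
#   uint32 masked CRC32C of the length
#   bytes  data[length]
#   uint32 masked CRC32C of the data
_HEADER = struct.Struct("<QI")  # length + length CRC
_FOOTER_SIZE = 4                # data CRC


def build_offset_index(
    stream: BinaryIO,
    key_of: Callable[[bytes], bytes],
) -> Dict[bytes, Tuple[int, int]]:
    """Scan the stream once and map each record key to (payload offset, payload length).

    key_of is a hypothetical caller-supplied function that extracts the key
    from a raw payload. CRC fields are read past without being checked.
    """
    index: Dict[bytes, Tuple[int, int]] = {}
    while True:
        header = stream.read(_HEADER.size)
        if not header:
            break  # clean end of file
        if len(header) < _HEADER.size:
            raise ValueError("truncated record header")
        length, _length_crc = _HEADER.unpack(header)
        data_offset = stream.tell()
        payload = stream.read(length)
        if len(payload) < length:
            raise ValueError("truncated record payload")
        if len(stream.read(_FOOTER_SIZE)) < _FOOTER_SIZE:
            raise ValueError("truncated record footer")
        index[key_of(payload)] = (data_offset, length)
    return index


def read_at(stream: BinaryIO, offset: int, length: int) -> bytes:
    """Fetch one payload with a single seek, without scanning earlier records."""
    stream.seek(offset)
    return stream.read(length)
```

After the one-time scan, each lookup costs one dictionary access plus one seek, which is where the O(1) claim comes from. A real key_of would parse the payload as an Example and return the bytes of the key feature.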
Common usage patterns:
- Writing: Use TFRecordWriter to create TFRecord files with key-value structured data
- Random Reading: Use TFRecordRandomAccess to instantly access any record by its key
- Batch Processing: Process large datasets efficiently by accessing only needed records
- TensorFlow Interop: Seamlessly switch between this library and native TensorFlow readers/writers
File structure: The library is organized into modules for writing (writer/), random access (random_access.py), and protocol buffer definitions (pb2/).
🚀 Key Features
- 🔄 Full TensorFlow Compatibility: Write with tfd_utils, read with TensorFlow (or vice versa); 100% compatible, verified in tests
- ⚡ Random Access Support: Access any record by key in O(1) time without reading the entire file
- 🪶 Lightweight & Standalone: No TensorFlow installation required; works with just numpy, protobuf, and crc32c
- 📦 Ready to Use: Simple API, automatic index caching, and zero configuration
- 🗂️ Multiple File Support: Handle single files, lists of files, or glob patterns seamlessly
- 💾 Memory Efficient: Only loads requested records into memory, not the entire dataset
Installation
Install via pip (lightweight, no TensorFlow dependency):
pip install tfd-utils
Or for development with optional TensorFlow support:
# Clone and install
git clone https://github.com/HarborYuan/tfd-utils.git
cd tfd-utils
pip install -e ".[dev]"
Why TFD Utils?
✅ TensorFlow Compatible, TensorFlow Optional
- Write TFRecords with tfd_utils, read with tf.data.TFRecordDataset ✅ [tested]
- Write with tf.io.TFRecordWriter, read with tfd_utils ✅ [tested]
- No TensorFlow installation required for basic usage
✅ Random Access Made Simple
- Traditional: Read entire file sequentially to find one record
- TFD Utils: Jump directly to any record by key in O(1) time
✅ Production Ready
- Automatic index caching for performance
- Robust error handling
- Memory efficient design
- Tested compatibility with TensorFlow 2.19.0 (see the tests/ directory)
Quick Start
Writing TFRecords (TensorFlow Compatible)
from tfd_utils.writer import TFRecordWriter
from tfd_utils.pb2 import Example, Features, Feature, BytesList

# Placeholder payload; in practice, the raw encoded bytes of an image file
image_bytes = b"..."

# Create TFRecord files that TensorFlow can read
with TFRecordWriter("data.tfrecord") as writer:
    # Each record is a serialized Example; the 'key' feature is what
    # TFRecordRandomAccess uses for lookups by default
    example = Example(features=Features(feature={
        'key': Feature(bytes_list=BytesList(value=[b'record_1'])),
        'image': Feature(bytes_list=BytesList(value=[image_bytes])),
        'label': Feature(bytes_list=BytesList(value=[b'cat']))
    }))
    writer.write(example.SerializeToString())
Random Access Reading
from tfd_utils.random_access import TFRecordRandomAccess

# Initialize with a single file
reader = TFRecordRandomAccess("data.tfrecord")

# Or with multiple files/patterns
reader = TFRecordRandomAccess([
    "train_*.tfrecord",
    "validation_*.tfrecord"
])

# Access any record instantly by key
record = reader.get_record("record_1")
image_bytes = reader.get_feature("record_1", "image")

# Dictionary-like access
if "record_1" in reader:
    record = reader["record_1"]

# Get statistics
print(f"Total records: {len(reader)}")
TensorFlow Interoperability
# Read tfd_utils files with TensorFlow
import tensorflow as tf

dataset = tf.data.TFRecordDataset("data.tfrecord")
for record in dataset:
    example = tf.train.Example()
    example.ParseFromString(record.numpy())
    # Process as usual...
Advanced Usage
Custom Key Feature
# Use different feature as the key (default is 'key')
reader = TFRecordRandomAccess("file.tfrecord", key_feature_name="id")
Custom Index Caching
# Specify custom index location
reader = TFRecordRandomAccess(
    "file.tfrecord",
    index_file="my_custom_index.cache"
)
# Force rebuild index if data changes (usually not needed)
reader.rebuild_index()
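When data files are regenerated in place, one way to decide whether a rebuild is warranted is to compare modification times. The sketch below is an assumption-laden illustration, not part of tfd_utils: index_is_stale is a hypothetical helper, and how (or whether) the library detects stale indexes itself is not described here.

```python
import os
from typing import Iterable


def index_is_stale(data_paths: Iterable[str], index_path: str) -> bool:
    """Return True if the index file is missing or older than any data file.

    Hypothetical helper: it relies only on filesystem modification times,
    so it cannot detect changes that preserve a file's mtime.
    """
    if not os.path.exists(index_path):
        return True
    index_mtime = os.path.getmtime(index_path)
    return any(os.path.getmtime(path) > index_mtime for path in data_paths)
```

A caller could run this check before opening a reader and invoke reader.rebuild_index() only when it returns True.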
License
MIT License
Ready to use, lightweight, and fully TensorFlow compatible! 🚀