TensorFlow utilities for efficient TFRecord processing and random access
TensorFlow TFRecord Utils
A lightweight Python library for efficient TensorFlow TFRecord processing with random access support, without requiring TensorFlow.
Key Features
- Full TensorFlow Compatibility: Write with tfd_utils, read with TensorFlow, or vice versa. 100% compatible and verified in tests.
- Random Access Support: Access any record by key in O(1) time without reading the entire file.
- Lightweight & Standalone: No TensorFlow installation required. Works with just numpy, protobuf, and crc32c.
- Simple API: Ready to use with automatic index caching and zero configuration.
- Multiple File Support: Handle single files, lists of files, or glob patterns seamlessly.
- Memory Efficient: Only loads requested records into memory, not the entire dataset.
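Keyed random access is possible because of the TFRecord on-disk framing: each record is an 8-byte little-endian length, a 4-byte CRC of that length, the payload, and a 4-byte CRC of the payload. A single scan can map each key to a byte offset, after which fetching a record is one seek. The sketch below illustrates the idea using only the standard library. It is not the actual tfd_utils implementation, it skips CRC verification, and `build_index`, `read_at`, and `key_fn` are hypothetical names chosen for illustration.

```python
import struct


def build_index(path, key_fn):
    """Scan a TFRecord-framed file once and map key -> (payload offset, length).

    key_fn is a hypothetical callable that extracts a key from the raw payload.
    """
    index = {}
    with open(path, "rb") as f:
        while True:
            header = f.read(12)  # uint64 length + uint32 CRC of the length
            if len(header) < 12:
                break  # end of file
            (length,) = struct.unpack("<Q", header[:8])
            offset = f.tell()  # the payload starts right after the header
            payload = f.read(length)
            f.seek(4, 1)  # skip the payload CRC (not verified in this sketch)
            index[key_fn(payload)] = (offset, length)
    return index


def read_at(path, offset, length):
    """Fetch one payload with a single seek, without touching other records."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)
```

Once the index exists, a lookup costs one dictionary access plus one seek, regardless of file size. A real reader would also need to parse the Example protobuf to find the key feature.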
Installation
Install via pip:
pip install tfd-utils
Or for development with optional TensorFlow support:
git clone https://github.com/HarborYuan/tfd-utils.git
cd tfd-utils
pip install -e ".[dev]"
Usage
Writing TFRecords
Create TFRecord files that TensorFlow can read:
from tfd_utils.writer import TFRecordWriter
from tfd_utils.pb2 import Example, Features, Feature, BytesList
with TFRecordWriter("data.tfrecord") as writer:
    example = Example(features=Features(feature={
        'key': Feature(bytes_list=BytesList(value=[b'record_1'])),
        'image': Feature(bytes_list=BytesList(value=[b'your_image_bytes'])),
        'label': Feature(bytes_list=BytesList(value=[b'cat']))
    }))
    writer.write(example.SerializeToString())
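For readers curious what makes the output readable by TensorFlow, the sketch below shows the record framing of the TFRecord format: a little-endian length, a masked CRC-32C of the length, the payload, and a masked CRC-32C of the payload. This is an illustration of the file format, not the code inside tfd_utils. The function names are hypothetical, and the pure-Python CRC is only there to keep the example self-contained; a compiled CRC implementation such as the crc32c package would be far faster in practice.

```python
import struct


def _make_table():
    """Build the lookup table for CRC-32C (Castagnoli, reflected polynomial)."""
    table = []
    for i in range(256):
        crc = i
        for _ in range(8):
            crc = (crc >> 1) ^ 0x82F63B78 if crc & 1 else crc >> 1
        table.append(crc)
    return table


_TABLE = _make_table()


def crc32c(data: bytes) -> int:
    """Plain CRC-32C of data, computed one byte at a time."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc = _TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


def masked_crc(data: bytes) -> int:
    """TFRecord stores a rotated and offset CRC rather than the raw value."""
    crc = crc32c(data)
    return (((crc >> 15) | (crc << 17)) + 0xA282EAD8) & 0xFFFFFFFF


def frame_record(payload: bytes) -> bytes:
    """Wrap one serialized payload in TFRecord framing."""
    length = struct.pack("<Q", len(payload))
    return (
        length
        + struct.pack("<I", masked_crc(length))
        + payload
        + struct.pack("<I", masked_crc(payload))
    )
```

Each record therefore carries 16 bytes of framing overhead in addition to its payload.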
Random Access Reading
Initialize with a single file, or with multiple files/patterns:
from tfd_utils.random_access import TFRecordRandomAccess
# Single file
reader = TFRecordRandomAccess("data.tfrecord")
# Multiple files/patterns
reader = TFRecordRandomAccess([
    "train_*.tfrecord",
    "validation_*.tfrecord"
])
# Access any record instantly by key
record = reader.get_record("record_1")
image_bytes = reader.get_feature("record_1", "image")
# Dictionary-like access
if "record_1" in reader:
    record = reader["record_1"]
# Get statistics
print(f"Total records: {len(reader)}")
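Because the reader supports `in` and `[]`, a small helper can fetch a batch of keys while tolerating missing ones. This is a sketch under assumptions: `fetch_many` is a hypothetical helper, not part of tfd_utils, and it checks membership first because the behavior of `reader[key]` for an unknown key is not described here.

```python
def fetch_many(reader, keys):
    """Return (found, missing) for any mapping-like reader.

    Relies only on `key in reader` and `reader[key]`.
    """
    found, missing = {}, []
    for key in keys:
        if key in reader:
            found[key] = reader[key]
        else:
            missing.append(key)
    return found, missing
```

For example, `found, missing = fetch_many(reader, ["record_1", "record_2"])` would separate the records that exist from the keys that do not.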
Command-Line Interface (CLI)
tfd-utils comes with a handy command-line tool, tfd, for quick inspection of TFRecord files.
Listing Keys
To list all keys in one or more TFRecord files:
tfd list /path/to/your/data.tfrecord
You can also use glob patterns:
tfd list 'data_part_*.tfrecord'
Extracting Records
To extract a single record by its key:
tfd extract /path/to/your/data.tfrecord your_record_key
The tool will attempt to automatically detect the content type:
- Images (JPEG, PNG, GIF) are saved to a file (e.g., your_record_key_image_0.jpeg).
- Text is printed to the console.
- Other binary or numerical data is displayed in a readable format.
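Content-type detection of this kind is commonly done by inspecting the leading "magic" bytes of a value. The following is a minimal sketch of that technique for the three image formats listed above. It is an assumption about how such detection could work, not the tool's actual logic, and `sniff_image_type` is a hypothetical name.

```python
def sniff_image_type(data: bytes):
    """Guess an image format from its file signature, or return None."""
    if data.startswith(b"\xff\xd8\xff"):
        return "jpeg"
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "png"
    if data.startswith((b"GIF87a", b"GIF89a")):
        return "gif"
    return None
```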
Getting a specific feature
To get a single feature from a record by its key:
tfd get /path/to/your/data.tfrecord:your_record_key:your_feature_name
The tool will attempt to automatically detect the content type, similar to the extract command.
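The argument to `tfd get` packs three fields into one colon-separated string. If you build or validate such arguments from a script, splitting from the right keeps any colons in the path (for example a Windows drive letter) intact. This sketch assumes that keys and feature names contain no colons. `parse_get_spec` is a hypothetical helper and not necessarily how the CLI parses its input.

```python
def parse_get_spec(spec: str):
    """Split 'PATH:KEY:FEATURE' into its three parts, splitting from the right."""
    parts = spec.rsplit(":", 2)
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected PATH:KEY:FEATURE, got {spec!r}")
    path, key, feature = parts
    return path, key, feature
```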
TensorFlow Interoperability
Read tfd_utils files with TensorFlow:
import tensorflow as tf
dataset = tf.data.TFRecordDataset("data.tfrecord")
for record in dataset:
    example = tf.train.Example()
    example.ParseFromString(record.numpy())
    # Process as usual...
Advanced Usage
Custom Key Feature
Use a different feature as the key (default is 'key'):
reader = TFRecordRandomAccess("file.tfrecord", key_feature_name="id")
Custom Index Caching
Specify a custom index location:
reader = TFRecordRandomAccess(
    "file.tfrecord",
    index_file="my_custom_index.cache"
)
# Force rebuild index if data changes (usually not needed)
reader.rebuild_index()
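If you regenerate data files in place and want an explicit check before calling `rebuild_index()`, comparing modification times is one simple heuristic. This is a sketch under assumptions: `index_is_stale` is a hypothetical helper, and nothing here describes how tfd_utils itself decides whether a cached index is still valid.

```python
import os


def index_is_stale(data_path, index_path):
    """Return True if the index is missing or older than the data file."""
    if not os.path.exists(index_path):
        return True
    return os.path.getmtime(data_path) > os.path.getmtime(index_path)
```

A caller could then write `if index_is_stale("file.tfrecord", "my_custom_index.cache"): reader.rebuild_index()`.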
License
MIT License