TensorFlow Record Reader with Random Access

tfrecords-reader

Fast TensorFlow TFRecords reader for Python with random access and Google Cloud Storage streaming support.

pip install "tfr-reader"
# + Google Storage support
pip install "tfr-reader[google]"

General Information

No TensorFlow dependency: this library implements its own TFRecord reader.
Protobuf is not required: this library includes a Cython decoder for TFRecord files.
Compressed TFRecord files are supported.

Fast random access to TFRecords: you can read any example from the dataset without reading the whole dataset, for example:

import tfr_reader as tfr
tfrds = tfr.TFRecordDatasetReader("/path/to/directory/with/tfrecords")
example = tfrds[42]
image_bytes: bytes = example["image/encoded"].value[0]

Installation

Base installation with minimum requirements:

pip install "git+https://github.com/kmkolasinski/tfrecords-reader.git"

For additional Google Cloud Storage support, use:

pip install "git+https://github.com/kmkolasinski/tfrecords-reader.git#egg=tfr-reader[google]"

Quick Start

import tensorflow_datasets as tfds
import tfr_reader as tfr
from PIL import Image
import ipyplot

dataset, dataset_info = tfds.load('oxford_flowers102', split='train', with_info=True)

def index_fn(feature: tfr.Feature):
    label = feature["label"].value[0]
    return {
        "label": label,
        "name": dataset_info.features["label"].int2str(label)
    }

tfrds = tfr.load_from_directory(
    dataset_info.data_dir,
    # indexing options, not required if index is already created
    filepattern="*.tfrecord*",
    index_fn=index_fn,
    override=True, # override the index if it exists
)

# example selection using polars SQL query API
rows, examples = tfrds.select("select * from index where name ~ 'rose' limit 10")
assert examples == tfrds[rows["_row_id"]]

samples, names = [], []
for k, example in enumerate(examples):
    image = Image.open(example["image"].bytes_io[0]).resize((224, 224))
    names.append(rows["name"][k])
    samples.append(image)

ipyplot.plot_images(samples, names)

Usage

Dataset Inspection

The inspect_dataset_example function returns a sample example from the dataset together with the key, type, and length of each of its features.

import tfr_reader as tfr
dataset_dir = "/path/to/directory/with/tfrecords"
example, types = tfr.inspect_dataset_example(dataset_dir)
types
>>> Out[1]:
[{'key': 'label', 'type': 'int64_list', 'length': 1},
 {'key': 'name', 'type': 'bytes_list', 'length': 1},
 {'key': 'image_id', 'type': 'bytes_list', 'length': 1},
 {'key': 'image', 'type': 'bytes_list', 'length': 1}]
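
The inspected types can be used to catch misspelled feature keys before they cause a lookup error later. The sketch below is only an illustration: missing_keys is a hypothetical helper, not part of tfr_reader, and the types list is copied from the sample output above.

```python
def missing_keys(types: list[dict], required: list[str]) -> list[str]:
    """Return the required feature keys that are absent from `types`."""
    available = {entry["key"] for entry in types}
    return [key for key in required if key not in available]


# Sample output copied from the inspection example above.
types = [
    {"key": "label", "type": "int64_list", "length": 1},
    {"key": "name", "type": "bytes_list", "length": 1},
    {"key": "image_id", "type": "bytes_list", "length": 1},
    {"key": "image", "type": "bytes_list", "length": 1},
]

# "image/id" is deliberately misspelled; the dataset stores "image_id".
absent = missing_keys(types, ["label", "name", "image/id"])
print(absent)  # ['image/id']
```

An empty result means every requested key exists in the sampled example.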

Dataset Indexing

Create an index of the dataset for fast access. The index is built by reading the dataset and parsing the examples, and it is saved in the dataset_dir directory. It stores one row per example: the TFRecord file name, the byte range of the example in that file, and any columns you choose to index. Use the indexed_cols_fn function to specify those columns; it should return a dictionary that maps column names to column values.

Note: Indexing works only for local files; remote files are not supported.

import tfr_reader as tfr
dataset_dir = "/path/to/directory/with/tfrecords"

def indexed_cols_fn(feature):
    # keys must match the feature keys reported by inspect_dataset_example
    return {
        "label": feature["label"].value[0],
        "name": feature["name"].value[0].decode(),
        "image_id": feature["image_id"].value[0].decode(),
    }

tfrds = tfr.TFRecordDatasetReader.build_index_from_dataset_dir(dataset_dir, indexed_cols_fn)

tfrds.index_df[:5]
>>> Out[2]:
shape: (5, 6)
┌───────────────────┬────────────────┬──────────────┬──────┬───────┬────────────┐
│ tfrecord_filename ┆ tfrecord_start ┆ tfrecord_end ┆ name ┆ label ┆ image_id   │
│ ---               ┆ ---            ┆ ---          ┆ ---  ┆ ---   ┆ ---        │
│ str               ┆ i64            ┆ i64          ┆ str  ┆ i64   ┆ str        │
╞═══════════════════╪════════════════╪══════════════╪══════╪═══════╪════════════╡
│ demo.tfrecord     ┆ 0              ┆ 79           ┆ cat  ┆ 1     ┆ image-id-0 │
│ demo.tfrecord     ┆ 79             ┆ 158          ┆ dog  ┆ 0     ┆ image-id-1 │
│ demo.tfrecord     ┆ 158            ┆ 237          ┆ cat  ┆ 1     ┆ image-id-2 │
│ demo.tfrecord     ┆ 237            ┆ 316          ┆ dog  ┆ 0     ┆ image-id-3 │
│ demo.tfrecord     ┆ 316            ┆ 395          ┆ cat  ┆ 1     ┆ image-id-4 │
└───────────────────┴────────────────┴──────────────┴──────┴───────┴────────────┘

Index columns:

tfrecord_filename: name of the TFRecord file
tfrecord_start: start byte position of the example in the TFRecord file
tfrecord_end: end byte position of the example in the TFRecord file
other columns: the columns returned by the indexed_cols_fn function
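
To illustrate why start and end byte positions are enough for random access, here is a minimal standard-library sketch of the TFRecord framing (8-byte little-endian length, 4-byte length CRC, payload, 4-byte payload CRC). It is not the library's implementation. It assumes the offsets span the whole framed record, and it zero-fills the CRC fields instead of computing the masked CRC32C that real files store. The helper names write_record, build_offsets, and read_record are hypothetical.

```python
import io
import struct


def write_record(stream, payload: bytes) -> None:
    """Append one record using the TFRecord framing layout."""
    stream.write(struct.pack("<Q", len(payload)))  # uint64 payload length
    stream.write(b"\x00" * 4)  # placeholder for the length CRC (assumption)
    stream.write(payload)
    stream.write(b"\x00" * 4)  # placeholder for the payload CRC (assumption)


def build_offsets(stream) -> list[tuple[int, int]]:
    """Scan the stream once and return (start, end) byte offsets per record."""
    offsets = []
    stream.seek(0)
    while True:
        start = stream.tell()
        header = stream.read(12)  # 8-byte length + 4-byte length CRC
        if len(header) < 12:
            break
        (length,) = struct.unpack("<Q", header[:8])
        stream.seek(length + 4, io.SEEK_CUR)  # skip payload and payload CRC
        offsets.append((start, stream.tell()))
    return offsets


def read_record(stream, start: int, end: int) -> bytes:
    """Read one payload directly from its byte range, without scanning."""
    stream.seek(start)
    framed = stream.read(end - start)
    (length,) = struct.unpack("<Q", framed[:8])
    return framed[12:12 + length]


buffer = io.BytesIO()
for payload in (b"cat", b"dog", b"bird"):
    write_record(buffer, payload)

offsets = build_offsets(buffer)          # one pass to build the index
second = read_record(buffer, *offsets[1])  # direct jump to record 1
print(offsets, second)
```

Once the offsets are known, reading any record costs one seek and one read, regardless of its position in the file.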

Dataset Reading

import tfr_reader as tfr

tfrds = tfr.TFRecordDatasetReader("/path/to/directory/with/tfrecords")
# assume that the dataset is indexed already
tfrds = tfr.TFRecordDatasetReader(
    "gs://bucket/path/to/directory/with/tfrecords",
    index_cache_dir="/tmp/tfr_index_cache" # Optional: caches remote index files locally
)
# selection API
selected_df, examples = tfrds.select("SELECT * FROM index WHERE name = 'cat' LIMIT 20")
# custom selection
selected_df = tfrds.index_df.sample(5)
examples = tfrds.load_records(selected_df)
# indexing API
for i in range(len(tfrds)):
    example = tfrds[i]
    # assuming image is encoded as bytes at key "image/encoded"
    image_bytes = example["image/encoded"].value[0]
    # label is encoded as int64 at key "label"
    label = example["label"].value[0]
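
Because the reader supports len() and integer indexing, examples can be visited in a globally shuffled order without a shuffle buffer. The sketch below shows the idea with a plain list standing in for the reader; shuffled_indices and iter_shuffled are hypothetical helpers, not part of tfr_reader.

```python
import random
from typing import Iterator, Sequence


def shuffled_indices(size: int, seed: int) -> list[int]:
    """Return a reproducible random permutation of range(size)."""
    order = list(range(size))
    random.Random(seed).shuffle(order)
    return order


def iter_shuffled(reader: Sequence, seed: int = 0) -> Iterator:
    """Yield every item of `reader` once, in shuffled order."""
    for i in shuffled_indices(len(reader), seed):
        yield reader[i]


# Stand-in for a dataset reader: anything with len() and integer indexing.
fake_reader = [f"example-{i}" for i in range(5)]
epoch = list(iter_shuffled(fake_reader, seed=123))
```

Changing the seed per epoch gives a different permutation while keeping each epoch reproducible.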

Custom Protobuf Decoder for TFRecord files

If protobuf is not installed, or if it uses the old and slow 'python' API decoder, this library falls back to a custom, specialized protobuf decoder written in Cython. To select the decoder for TFRecord files explicitly, call:

import tfr_reader as tfr
# to use custom protobuf decoder
tfr.set_decoder_type("cython")
# to use default protobuf decoder
tfr.set_decoder_type("protobuf")
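
The fallback rule described above can be sketched as follows. This is an assumption about how such a check might look, not the library's actual code: pick_decoder_type is a hypothetical helper, and it relies on api_implementation.Type() from google.protobuf.internal to report the active protobuf backend.

```python
def pick_decoder_type() -> str:
    """Return "protobuf" only when a compiled protobuf backend is available."""
    try:
        from google.protobuf.internal import api_implementation
    except ImportError:
        return "cython"  # protobuf is not installed
    # Type() reports the active backend, e.g. "python", "cpp" or "upb".
    if api_implementation.Type() == "python":
        return "cython"  # the pure-Python protobuf backend is slow
    return "protobuf"


decoder_type = pick_decoder_type()
print(decoder_type)
```

The returned string uses the same two values that tfr.set_decoder_type accepts in the example above.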

Release history

1.1.0 (this version): Mar 30, 2026
1.0.2: Mar 30, 2026
1.0.1: Feb 13, 2026
0.10.0: Nov 24, 2025
0.9.0: Nov 1, 2025
0.8.0: Apr 15, 2025
0.7.0: Apr 15, 2025
0.6.0: Apr 15, 2025
0.5.0: Apr 14, 2025
0.4.0: Apr 12, 2025
0.3.0: Apr 10, 2025
0.2.3: Apr 10, 2025

Download files

Source Distribution

tfr_reader-1.1.0.tar.gz (290.9 kB)

Uploaded Mar 30, 2026 Source

File details

Details for the file tfr_reader-1.1.0.tar.gz.

File metadata

Download URL: tfr_reader-1.1.0.tar.gz
Upload date: Mar 30, 2026
Size: 290.9 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.15

File hashes

Hashes for tfr_reader-1.1.0.tar.gz
Algorithm	Hash digest
SHA256	`8507cd102cbed9ff0085bc4496ecc851dec2c7df971019faa2c71a18edc017a7`
MD5	`b7c8f46804d5d790b7cd8a45c33337eb`
BLAKE2b-256	`ef9622d57fe15c727e12edd43e2c19b06315196c0f159b2cb4873477abe3bfe5`
