Feature I/O for TensorFlow
FIO
FIO is short for Feature I/O.
FIO is a concept stemming from TensorFlow's low-level API. More specifically, FIO is an attempt to unify the methods that deal with the protocol buffers used to write data to, and read data back from, the TF.Record format.
FIO works by defining your features once in a "schema", e.g. for "normal" features:
'my-features': {'length': 'fixed', 'dtype': tf.string, 'shape': []}
and we are done. Make this a tf.train.Example or a tf.train.SequenceExample, write it to a record, and read it back.
For rank 2 features:
'seq': {
    'length': 'fixed',
    'dtype': tf.int64,
    'shape': [3, 3],
    'encode': 'channels',
    'channel_names': ['Channel 1', 'Channel 2', 'Channel 3'],
    'data_format': 'channels_last'
}
and we are done again. I can now encode this as a tf.train.Example (with the optional, but here provided, channel names) or a tf.train.SequenceExample (without the names).
See the Demo notebook
Useful resources
S.O.
- How to decompose tensors for TF.Record
- How to recover TF Record data
Colab
- Playing with Rank 2 Tensors and TF Records
- Recovering TF Records
Motivation
Disclaimer: I am not familiar with the intricacies of TF, Protocol Buffers, and the like, so my motivation here may not make sense to the experts who actually designed this part of TF. From my uninformed view, I find parts of this interface cumbersome and under-documented, and FIO is as much an over-engineered solution to my specific problem (getting some data into and out of TF Records) as it is an exploration of what works regarding TF Records.
"Duplicate code"
Consider the exceptionally trivial task of writing
my_feature = 'hi'
to, and reading it back from, a TF.Record.
Example
To get this feature into a record we must first wrap this data as follows:
example = tf.train.Example(
    features=tf.train.Features(
        feature={
            'my-feature': tf.train.Feature(
                bytes_list=tf.train.BytesList(
                    value=[my_feature.encode()]  # wrap as list
                )  # end byteslist
            )  # end feature
        }
    )  # end features
)  # end example
which returns:
features {
  feature {
    key: "my-feature"
    value {
      bytes_list {
        value: "hi"
      }
    }
  }
}
and we can write this to a record with:
with tf.python_io.TFRecordWriter('my_record.tfrecord') as writer:
    writer.write(example.SerializeToString())
and we can retrieve it from this file with:
# assumes an active tf.Session() bound to `sess`
DATASET_FILENAMES = tf.placeholder(tf.string, shape=[None])
dataset = tf.data.TFRecordDataset(DATASET_FILENAMES).map(lambda r: parse_fn(r)).repeat().batch(1)
iterator = dataset.make_initializable_iterator()
next_element = iterator.get_next()

for _ in range(1):  # epochs
    sess.run(iterator.initializer, feed_dict={DATASET_FILENAMES: ['my_record.tfrecord']})
    for _ in range(1):  # batches
        recovered = sess.run(next_element)
Now it is very important to note that:
- the above code is a bit more sophisticated than the base example requires (given the iterator), which I will not go into detail about here, and
- the above code will not run because parse_fn is not yet defined.
Since we are using a TF Example, I will demonstrate how we can simply read the example to get the data back:
def parse_fn(record):
    features = {
        'my-feature': tf.FixedLenFeature([], dtype=tf.string)
    }
    parsed = tf.parse_single_example(record, features)
    # other things can be done here if needed
    return parsed
now running the above code yields:
{'my-feature': array([b'hi'], dtype=object)}
So if we want to get back to {'my-feature': 'hi'}, we still need to unwrap the list and decode the string:
recovered['my-feature'] = recovered['my-feature'][0].decode()
Note: this cannot be done inside the parsing function (at least not as I have done it here).
What I would like to solve
The point of showing all of this is: if so much care goes into converting our data for TF.Records, why must I then also write similar code to extract it back out?
A goal of FIO is to define a single schema that both gets data into a TF Record and recovers it (in the form we put it in); a rough sketch of the idea follows.
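A minimal sketch of that idea (this is not FIO's actual API; to_feature and to_parse_spec are names made up here): both the write-side tf.train.Feature and the read-side tf.FixedLenFeature are derived from the same schema dict, so the encode/decode information lives in one place.

schema = {'my-feature': {'length': 'fixed', 'dtype': tf.string, 'shape': []}}

def to_feature(value, spec):
    # write side: wrap a raw Python value as a tf.train.Feature
    if spec['dtype'] == tf.string:
        return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value.encode()]))
    return tf.train.Feature(int64_list=tf.train.Int64List(value=list(value)))

def to_parse_spec(spec):
    # read side: the matching tf.FixedLenFeature for tf.parse_single_example
    return tf.FixedLenFeature(spec['shape'], dtype=spec['dtype'])

example = tf.train.Example(features=tf.train.Features(
    feature={name: to_feature('hi', spec) for name, spec in schema.items()}))
parse_features = {name: to_parse_spec(spec) for name, spec in schema.items()}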
De/re-composing tensors
I will touch on the difference between tf.train.Example and tf.train.SequenceExample in the next section. Regardless of which you choose to use, any tensor of rank 2 or greater must be decomposed.
For simplicity consider the sequence:
seq = [
    # ch1, ch2, ch3
    [1, 1, 1],  # element 1
    [2, 2, 2],  # element 2
    [3, 3, 3],  # element 3
    [4, 5, 6]   # element 4
]
Both tf.train.Example and tf.train.SequenceExample require seq to be decomposed by channel, either as:
# for tf.train.Example
# with data_format 'channels_last', channel i is the i-th column of seq
tf.train.Features(
    feature={
        'channel 1': tf.train.Feature(int64_list=tf.train.Int64List(value=[row[0] for row in seq])),
        'channel 2': tf.train.Feature(int64_list=tf.train.Int64List(value=[row[1] for row in seq])),
        'channel 3': tf.train.Feature(int64_list=tf.train.Int64List(value=[row[2] for row in seq]))
    }
)
or as:
tf.train.FeatureLists(feature_list={
    'seq': tf.train.FeatureList(
        feature=[
            # one (unnamed) Feature per channel, i.e. per column of seq
            tf.train.Feature(int64_list=tf.train.Int64List(value=[row[i] for row in seq]))
            for i in range(len(seq[0]))  # number of channels
        ]
    )
})
The difference here is that for a SequenceExample the channels are unnamed features (the FeatureList itself is keyed, but the Features inside it are not). With many channels this is advantageous, as one need not worry about reassembling the sequence from all the named channels. However, if one has only a few channels (e.g. RGB), named channels may be useful: one can rearrange them prior to feeding the model, or, if the model has multiple inputs, feed them separately.
Either way, FIO aims to handle this part, both decomposing tensors and recomposing them (in the case of the Example), as sketched below.
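To make the recomposition concrete, here is a minimal sketch (not FIO itself; parse_channels_fn is a hypothetical name, and it assumes the named-channel Example layout above with 4 elements per channel): stacking the parsed channels along the last axis restores the channels_last, rank-2 tensor.

def parse_channels_fn(record):
    features = {
        'channel 1': tf.FixedLenFeature([4], dtype=tf.int64),
        'channel 2': tf.FixedLenFeature([4], dtype=tf.int64),
        'channel 3': tf.FixedLenFeature([4], dtype=tf.int64),
    }
    parsed = tf.parse_single_example(record, features)
    # shape [4, 3]: rows are elements, columns are channels, as in the original seq
    return tf.stack([parsed['channel 1'], parsed['channel 2'], parsed['channel 3']], axis=-1)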
Singular interface
As mentioned above, a core distinction between Example and SequenceExample is whether or not each feature is named. However, there is one other core distinction: a SequenceExample is a tuple. Rather, I should say, a SequenceExample allows one to store "context" (metadata), where the context consists of features that adhere to all the restrictions of those for an Example.
The example given in the docs is that the context might be the same across sequences while the (large) sequences themselves vary; thus it saves space.
I admit that, while that sounds like a reasonable structure, I do not understand how to utilize it in practice. A SequenceExample requires a context (although it can be just an empty dict). Thus, if one is storing examples individually, one would most likely store the context in each record anyway. Furthermore, given my current understanding, only sequence features can be stored in the feature_lists part of a SequenceExample.
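For concreteness, a minimal sketch (assuming the seq and channel-per-Feature layout from above; parse_seq_ex_fn is a made-up name) of building a SequenceExample with an empty context and reading it back with the standard TF 1.x parser:

# build: empty context plus the channel-per-Feature list from above
seq_ex = tf.train.SequenceExample(
    context=tf.train.Features(feature={}),  # context may simply be empty
    feature_lists=tf.train.FeatureLists(feature_list={
        'seq': tf.train.FeatureList(feature=[
            tf.train.Feature(int64_list=tf.train.Int64List(value=[row[i] for row in seq]))
            for i in range(len(seq[0]))  # one unnamed Feature per channel
        ])
    })
)

# parse: context and sequence features are handled separately
def parse_seq_ex_fn(record):
    context, sequence = tf.parse_single_sequence_example(
        record,
        context_features={},  # nothing stored in the context here
        sequence_features={'seq': tf.FixedLenSequenceFeature([4], dtype=tf.int64)})
    return sequence['seq']  # shape [3, 4]: one row per (unnamed) channel Feature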
From my uninformed perspective, it might have been nicer to just maintain a single Example interface and then, via something like (the imaginary) tf.train.SequenceFeature, allow everything to be stored together. Better yet, an (equally imaginary) tf.train.TensorFeature might solve a lot of issues.
Anyway, FIO aims to be a singular interface for handling both Example and SequenceExample, making it easy to decompose tensors into (un)named features and export / import them correspondingly.
Decomposing techniques
There are many ways to decompose a Tensor into its channels. I will only consider some of the possibilities for the aforementioned seq:
- separate each channel of seq and store as the corresponding numeric type, name it, and store in an Example
- separate each channel of seq, store as bytes after converting via numpy.ndarray.tostring(), name it, and store in an Example
- separate each channel of seq, convert each element to bytes and store as a BytesList, name it, and store in an Example
- store altogether by dumping seq to bytes via numpy.ndarray.tostring(), name it, and store in an Example
- separate each channel of seq and store as the corresponding numeric type, leave it unnamed, and store in a SequenceExample
- separate each channel of seq, store as bytes after converting via numpy.ndarray.tostring(), leave it unnamed, and store in a SequenceExample
- separate each channel of seq, convert each element to bytes and store as a BytesList, leave it unnamed, and store in a SequenceExample
- store altogether by dumping seq to bytes via numpy.ndarray.tostring() and wrap as a single-element bytes list
Which one of these is best? No idea. Maybe TF encodes Float and Int64 Features to bytes behind the scenes anyway.
I will note that sometimes, when the data is encoded as bytes, the recovered tensor comes back flattened: e.g. [[1,2],[3,4]] is restored as [1,2,3,4].
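As a sketch of the "dump the whole tensor to bytes" approach from the list above, and of the reshape needed because the decoded bytes come back flat (the shape [4, 3] is assumed to be known from the schema; parse_bytes_fn is a made-up name):

import numpy as np

# write side: the whole tensor becomes a single-element bytes list
seq_np = np.array(seq, dtype=np.int64)
seq_feature = tf.train.Feature(
    bytes_list=tf.train.BytesList(value=[seq_np.tostring()]))

# read side: decode the raw bytes (flattened) and restore the known shape
def parse_bytes_fn(record):
    parsed = tf.parse_single_example(
        record, {'seq': tf.FixedLenFeature([], dtype=tf.string)})
    flat = tf.decode_raw(parsed['seq'], tf.int64)  # e.g. [1, 1, 1, 2, 2, 2, ...]
    return tf.reshape(flat, [4, 3])                # shape must come from the schema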
Anyway, FIO aims to allow for tensors to be encoded and decoded via some of these methods, as I honestly do not know which is best (in terms of file size, read efficiency, etc).
Again, I think TF could do this better via a tf.train.TensorFeature.
P.S. If anyone knows why the classes for encoding features live under tf.train (e.g. tf.train.Feature(int64_list=tf.train.Int64List(value=value))) while the classes for decoding features live directly under tf (e.g. tf.FixedLenFeature), I would like to know. This seems odd to me...