Utils for encoding and data set reading
tensorflow-dataset-pipe
If you are having problems preparing data for a Keras model, this library may help.
Suppose you have a CSV file with data you would like to train on.
Example of a CSV file:
Product title, Category
Addias shoes, 1
Nike sneakers, 1
Wireless router, 2
....
Your input data is in the first column and your output data is in the second column. You need to encode both input and output. You could start like this:
from dataset_pipe.feeds.datasets import XYDataset
dataset = XYDataset("csv")
data = dataset.feed("file.csv")
# x is your input
# y is your output
for x, y in data:
    print(x, y)
Run this and you will see output like:
OrderedDict([('input', 'Addias shoes')]) OrderedDict([('output', 1)])
This way you can debug if the library is reading the right data.
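To make the shape of those pairs concrete, here is a minimal sketch that builds similar (x, y) pairs with only the standard library. It is not how the library reads files: `read_pairs` is a hypothetical helper, the sample rows are copied from the CSV example above, and the 'input' and 'output' keys are taken from the printed output. Note that the `csv` module returns every value as a string.

```python
import csv
import io
from collections import OrderedDict

# Stand-in for file.csv; the rows are copied from the example above.
SAMPLE_CSV = """Addias shoes, 1
Nike sneakers, 1
Wireless router, 2
"""

def read_pairs(handle):
    """Yield one (x, y) pair per CSV row (hypothetical helper, not library code)."""
    for row in csv.reader(handle, skipinitialspace=True):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        x = OrderedDict([("input", row[0])])
        y = OrderedDict([("output", row[1])])  # still a string at this point
        yield x, y

pairs = list(read_pairs(io.StringIO(SAMPLE_CSV)))
for x, y in pairs:
    print(x, y)
```

Printing the raw pairs like this before any mapping is a quick way to spot a wrong delimiter or swapped columns.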
Filtering and mapping
Now let's map and (optionally) filter the data. Mapping is necessary for the encoding process.
from dataset_pipe.feeds.datasets import XYDataset
def mapper(data):
    words = data[0].split()  # split first column into words
    if len(words) == 1:  # filter short title descriptions
        return None
    category = int(data[1])  # return category as int
    # return tuple of input and output
    return {'x': words}, {'y': category}
dataset = XYDataset("csv")
data = dataset.map(mapper).feed("file.csv")
for x, y in data:
    print(x, y)
If you run this, you should see ordered dictionaries of input and output data. The input should be split into words, and rows for which the mapper returns None are filtered out.
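The mapper is a plain function, so it can be checked on its own before it is attached to a dataset. The sketch below assumes each row arrives as a list of column strings and that a None result means the row is dropped; the sample rows are made up for illustration.

```python
def mapper(data):
    words = data[0].split()        # split first column into words
    if len(words) == 1:            # filter short title descriptions
        return None
    category = int(data[1])        # category as int
    return {'x': words}, {'y': category}

# Illustrative rows, each a list of column strings.
rows = [
    ["Addias shoes", "1"],
    ["Router", "2"],               # single-word title, expected to be filtered
    ["Wireless router", "2"],
]

mapped = [mapper(row) for row in rows]
kept = [pair for pair in mapped if pair is not None]
for x, y in kept:
    print(x, y)
```

Only the two multi-word titles remain in `kept`, each as a pair of dictionaries keyed by 'x' and 'y'.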
Encoding
To encode data you need an encoder. An encoder is a class that implements EncoderInterface.
Below is an example, OneHotEncoder. An encoder needs 4 methods:
- encode - this is where the data will be encoded
- shape - this method returns the shape of the encoded vector
- type - this method returns the tensorflow data type (tf.dtype) of encoded vector
- dim - returns dimension of the vector
from dataset_pipe.encoders.math.ops import zeros
class OneHotEncoder:
    def __init__(self, dim):
        self._dim = dim
        self._shape = (dim,)
        self._type = "float32"
    def encode(self, data):
        """
        Data is a single integer category id, e.g. 12
        Return dense one hot encoded vector
        """
        if not isinstance(data, int):
            raise ValueError("Param data must be integer. {} given of type {}".format(data, type(data)))
        vector = zeros(self._shape)
        vector[data] = 1.0
        return vector
    def shape(self):
        return self._shape
    def type(self):
        return self._type
    def dim(self):
        return self._dim
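The same four methods can carry a different encoding. As a rough sketch, the class below marks which known words occur in a list of words. `WordsToBinaryEncoder` is a hypothetical name, not a class from the library, it uses a plain Python list instead of the library's `zeros`, and ignoring unknown words and lowercasing are assumptions made for this example.

```python
class WordsToBinaryEncoder:
    """Hypothetical encoder: sets position i to 1.0 if a word with index i occurs."""

    def __init__(self, word_to_idx, dim):
        self._word_to_idx = word_to_idx
        self._dim = dim
        self._shape = (dim,)
        self._type = "float32"

    def encode(self, data):
        if not isinstance(data, list):
            raise ValueError(
                "Param data must be a list of words. {} given of type {}".format(data, type(data)))
        vector = [0.0] * self._dim           # plain list stands in for zeros()
        for word in data:
            idx = self._word_to_idx.get(word.lower())
            if idx is not None:              # unknown words are ignored (assumption)
                vector[idx] = 1.0
        return vector

    def shape(self):
        return self._shape

    def type(self):
        return self._type

    def dim(self):
        return self._dim


encoder = WordsToBinaryEncoder({"addias": 1, "nike": 2}, dim=4)
encoded = encoder.encode(["Nike", "sneakers"])
print(encoded, encoder.shape(), encoder.type(), encoder.dim())
```

Every index in the word dictionary has to be smaller than `dim`, otherwise the assignment in `encode` raises an IndexError.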
Basic encoders are included in the library, so you do not have to write them yourself. OneHotEncoder is one of them. Now we will use this encoder together with DictToBinaryEncoder to encode the mapped data.
from dataset_pipe.feeds.datasets import XYDataset
from dataset_pipe.encoders.dict_to_binary_encoder import DictToBinaryEncoder
from dataset_pipe.encoders.one_hot_encoder import OneHotEncoder
def mapper(data):
    words = data[0].split()  # split first column into words
    if len(words) == 1:  # filter short title descriptions
        return None
    category = int(data[1])  # return category as int
    # return tuple of input and output
    return {'x': words}, {'y': category}
bag_of_words_2_idx = {
    "addidas": 1,
    "nike": 2
}
dataset = XYDataset("csv")
dataset.map(mapper)
dataset.encode(
    {"x": DictToBinaryEncoder(bag_of_words_2_idx)},
    {"y": OneHotEncoder(10)})
for x, y in dataset.feed("file.csv").batch(10):
    print(x, y)
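For readers unfamiliar with batching, the sketch below shows the general idea of grouping encoded (x, y) pairs into fixed-size chunks. It is an illustration only: the `batch` function here is a hypothetical stand-in, and the library's own `batch` method may differ (for example in the container type it returns or in how it handles the last, shorter batch).

```python
from itertools import islice

def batch(pairs, size):
    """Group an iterable of (x, y) pairs into (xs, ys) lists of at most `size` items."""
    iterator = iter(pairs)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        xs = [x for x, _ in chunk]
        ys = [y for _, y in chunk]
        yield xs, ys

# Made-up encoded vectors: three samples, two features, two classes.
encoded = [
    ([0.0, 1.0], [1.0, 0.0]),
    ([1.0, 0.0], [0.0, 1.0]),
    ([1.0, 1.0], [1.0, 0.0]),
]

batches = list(batch(encoded, 2))
for xs, ys in batches:
    print(len(xs), xs, ys)
```

With three samples and a batch size of 2, this version yields one full batch and one batch holding the single remaining sample.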
If your model requires more than one input or output, map it this way:
def mapper(data):
    x1 = data[0].split()
    x2 = data[1]
    y1 = int(data[2])
    y2 = list(data[3])
    # return tuple of input and output
    return {'x1': x1, 'x2': x2}, {'y1': y1, 'y2': y2}
And encode it this way:
dataset.encode(
    {"x1": DictToBinaryEncoder(bag_of_words_2_idx), "x2": OneHotEncoder(10)},
    {"y1": OneHotEncoder(10), "y2": BinaryEncoder(10)})
Remember that order matters when mapping and encoding.
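One way to catch an ordering mistake early is to compare the keys returned by the mapper with the keys passed to the encoders. The helper below, `check_key_order`, is hypothetical and not part of the library. It relies on Python dictionaries keeping insertion order (Python 3.7+), and strings stand in for encoder instances so the sketch runs without the library installed.

```python
def check_key_order(mapped_pair, x_encoders, y_encoders):
    """Raise ValueError if mapper keys and encoder keys differ in name or order."""
    x, y = mapped_pair
    for name, values, encoders in (("input", x, x_encoders), ("output", y, y_encoders)):
        if list(values) != list(encoders):
            raise ValueError("{} keys {} do not match encoder keys {}".format(
                name, list(values), list(encoders)))
    return True

# One mapped row, shaped like the multi-input mapper's return value.
mapped = ({'x1': ['nike'], 'x2': 3}, {'y1': 1, 'y2': [0, 1]})

# Strings stand in for encoder instances.
x_encoders = {'x1': 'DictToBinaryEncoder', 'x2': 'OneHotEncoder'}
y_encoders = {'y1': 'OneHotEncoder', 'y2': 'BinaryEncoder'}
ok = check_key_order(mapped, x_encoders, y_encoders)

# Same keys in a different order are reported as a mismatch.
swapped = {'x2': 'OneHotEncoder', 'x1': 'DictToBinaryEncoder'}
try:
    check_key_order(mapped, swapped, y_encoders)
    order_error = None
except ValueError as error:
    order_error = str(error)
print(ok, order_error)
```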
train_dataset = dataset.feed("train.csv").batch(10)  # pass it to fit_generator method in keras
eval_dataset = dataset.feed("eval.csv").batch(10)
...
model.fit_generator(
    train_dataset,
    validation_data=eval_dataset,
    steps_per_epoch=10,
    validation_steps=5,
    epochs=10
)