Utils for encoding and data set reading

tensorflow-dataset-pipe

If you are having problems with data preparation for a Keras model, this library may help.

Suppose you have a CSV file with data you would like to train on.

Example of a CSV file:

    Product title, Category
    Adidas shoes, 1
    Nike sneakers, 1
    Wireless router, 2
    ....
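The library reads the file for you, but it can help to see what the raw rows look like. Here is a quick sketch using Python's standard csv module, with the sample above reproduced in memory (the file contents are just the illustrative data, not real training data):

```python
import csv
import io

# In-memory stand-in for file.csv (sample data from above)
raw = "Product title,Category\nAdidas shoes,1\nNike sneakers,1\nWireless router,2\n"

reader = csv.reader(io.StringIO(raw))
header = next(reader)   # skip the header row
rows = list(reader)

print(rows[0])  # ['Adidas shoes', '1'] -- columns arrive as strings
```

Note that every column comes back as a string, which is why the mapper below converts the category to an int.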

Your input data is in the first column and your output data is in the second column. You need to encode both input and output. You could start like this:

from dataset_pipe.feeds.datasets import XYDataset


dataset = XYDataset("csv")
data = dataset.feed("file.csv")

# x is your input
# y is your output

for x, y in data:
    print(x, y)

Run this and you will see:

OrderedDict([('input', 'Adidas shoes')]) OrderedDict([('output', 1)])

This way you can verify that the library is reading the right data.

Filtering and mapping

Now let's map and (optionally) filter the data. Mapping is necessary for the encoding process.

from dataset_pipe.feeds.datasets import XYDataset

def mapper(data):
    words = data[0].split()     # split first column into words

    if len(words)==1:           # filter short title descriptions
        return None

    category = int(data[1])     # return category as int
    # return tuple of input and output
    return (
        {'x': words},           # input
        {'y': category}         # output
    )


dataset = XYDataset("csv")
data = dataset.map(mapper).feed("file.csv")
for x, y in data:
    print(x, y)

If you run this you should see ordered dictionaries of input and output data. The input should be split into words.
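Because the mapper is plain Python, you can sanity-check it in isolation before wiring it into the pipeline:

```python
def mapper(data):
    words = data[0].split()     # split first column into words
    if len(words) == 1:         # filter single-word titles
        return None
    category = int(data[1])     # return category as int
    return {'x': words}, {'y': category}

print(mapper(["Adidas shoes", "1"]))   # ({'x': ['Adidas', 'shoes']}, {'y': 1})
print(mapper(["Router", "2"]))         # None: single-word title is filtered out
```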

Encoding

In order to encode data you need an encoder. An encoder is a class that implements EncoderInterface.

This is an example of a OneHotEncoder. An encoder needs 4 methods:

  • encode - this is where the data gets encoded
  • shape - returns the shape of the encoded vector
  • type - returns the tensorflow data type (tf.dtype) of the encoded vector
  • dim - returns the dimension of the vector

from dataset_pipe.encoders.math.ops import zeros


class OneHotEncoder:

    def __init__(self, dim):
        self._dim = dim
        self._shape = (dim,)
        self._type = "float32"

    def encode(self, data):
        """
        Data is a category id, e.g. 12.
        Returns a dense one-hot encoded vector.
        """
        if not isinstance(data, int):
            raise ValueError("Param data must be integer. {} given of type {}".format(data, type(data)))

        vector = zeros(self._shape)
        vector[data] = 1.0
        return vector

    def shape(self):
        return self._shape

    def type(self):
        return self._type

    def dim(self):
        return self._dim
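To see what encode produces without installing anything, the same idea can be sketched with a plain Python list standing in for the library's zeros helper (an assumption here: zeros returns a mutable zero-filled vector of the given shape):

```python
def one_hot(index, dim):
    # Plain-Python stand-in for OneHotEncoder.encode
    if not isinstance(index, int):
        raise ValueError("index must be an int, got {!r}".format(index))
    vector = [0.0] * dim
    vector[index] = 1.0
    return vector

print(one_hot(2, 5))  # [0.0, 0.0, 1.0, 0.0, 0.0]
```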

Basic encoders, including OneHotEncoder, are included in the library so you do not have to write them yourself. Now we will use this encoder together with DictToBinaryEncoder to encode the mapped data.

from dataset_pipe.feeds.datasets import XYDataset
from dataset_pipe.encoders.dict_to_binary_encoder import DictToBinaryEncoder
from dataset_pipe.encoders.one_hot_encoder import OneHotEncoder


def mapper(data):
    words = data[0].split()     # split first column into words

    if len(words)==1:           # filter short title descriptions
        return None

    category = int(data[1])     # return category as int

    # return tuple of input and output
    return {'x': words}, {'y': category}     

bag_of_words_2_idx = {
    "addidas": 1,
    "nike": 2
}


dataset = XYDataset("csv")
dataset.map(mapper)
dataset.encode(
    {"x": DictToBinaryEncoder(bag_of_words_2_idx)}, 
    {"y": OneHotEncoder(10)})


for x, y in dataset.feed("file.csv").batch(10):
    print(x, y)
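The exact behaviour of DictToBinaryEncoder is defined by the library; conceptually (an assumption on our part, not the library's code) it marks the index of every known word in a fixed-size binary vector, roughly like this:

```python
def multi_hot(words, word_to_idx, dim):
    # Illustrative sketch only: set 1.0 at the index of every known word
    vector = [0.0] * dim
    for word in words:
        idx = word_to_idx.get(word.lower())
        if idx is not None:
            vector[idx] = 1.0
    return vector

word_to_idx = {"adidas": 1, "nike": 2}
print(multi_hot(["Adidas", "shoes"], word_to_idx, 4))  # [0.0, 1.0, 0.0, 0.0]
```

Unknown words such as "shoes" are simply skipped, so the vocabulary dict controls which words contribute to the encoding.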

If your model requires more than one input or output, map it this way:

def mapper(data):
    x1 = data[0].split()
    x2 = data[1]
    y1 = int(data[2])    
    y2 = list(data[3])

    # return tuple of input and output
    return {'x1': x1, 'x2': x2}, {'y1': y1, 'y2': y2} 
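As before, this mapper can be checked on a made-up row (the column values here are hypothetical):

```python
def mapper(data):
    x1 = data[0].split()
    x2 = data[1]
    y1 = int(data[2])
    y2 = list(data[3])
    return {'x1': x1, 'x2': x2}, {'y1': y1, 'y2': y2}

x, y = mapper(["Nike sneakers", "footwear", "1", "ab"])
print(x)  # {'x1': ['Nike', 'sneakers'], 'x2': 'footwear'}
print(y)  # {'y1': 1, 'y2': ['a', 'b']}
```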

And encode it this way:

dataset.encode(
    {"x1": DictToBinaryEncoder(bag_of_words_2_idx), "x2": OneHotEncoder(10)},
    {"y1": OneHotEncoder(10), "y2": BinaryEncoder(10)})

Remember that order matters when mapping and encoding.

train_dataset = dataset.feed("train.csv").batch(10)  # pass it to fit_generator method in keras
eval_dataset = dataset.feed("eval.csv").batch(10)

...

model.fit_generator(
            train_dataset,
            validation_data=eval_dataset,
            steps_per_epoch=10,
            validation_steps=5,
            epochs=10
)
