Data generator with Keras

Tram Ho

In this article I will show how to create a custom Data Generator with Keras.

What will I write in this article?

  • Why Data Generator
  • Practice
  • Conclusion
  • Reference

Why Data Generator

In practice, not everyone can afford a high-end machine, and the data we need to train on often takes up more RAM than our machine actually has. The problem appears when the dataset is large: there is not enough RAM to load all of it at once, split it into train and test sets, and then train the model. To solve this, we split the dataset into small parts and load one part at a time while the model trains. We can take the instant-noodle route and use the ImageDataGenerator that ships with Keras. Or we can cook our own dish exactly the way we want it with a custom Data Generator.

In this article, I will walk through a hands-on example with MNIST.
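The core idea, handing over one small part at a time instead of the whole dataset, can be sketched with a plain Python generator. This is only an illustration: `iter_batches` is a hypothetical helper name, and the list of integers stands in for real samples.

```python
def iter_batches(samples, batch_size):
    """Yield consecutive slices of `samples`, one batch at a time (hypothetical helper)."""
    for start in range(0, len(samples), batch_size):
        # Only this slice needs to be materialized at any moment
        yield samples[start:start + batch_size]


# Ten stand-in samples, consumed four at a time
batches = list(iter_batches(list(range(10)), batch_size=4))
print(batches)
```

Note that this sketch keeps the final, smaller batch, which is a design choice and not something every generator does.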

Practice

Making a custom Data Generator

Keras provides a Sequence class, and our own generator class can inherit from it.

First, we need to load the MNIST dataset.

import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.utils import Sequence, to_categorical
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, Dense, Flatten, MaxPooling2D, Dropout

# Load the MNIST data
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()

Our MNIST dataset includes 60000 images for the train set and 10000 images for the test set. Each image is 28×28 pixels. Stored as float32, each pixel takes 4 bytes, so holding all the images plus their one-hot labels needs about 4 * (28 * 28) * 70000 + (70000 * 10) bytes, roughly 220 MB of RAM. That is the calculated figure; in reality we will probably use more. So choosing a Data Generator is reasonable.
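The estimate above can be checked with a few lines of arithmetic. This is a rough sketch: the variable names are mine, and counting one byte per one-hot label entry is the article's simplification, not a measurement.

```python
# Back-of-the-envelope RAM estimate for MNIST held in memory as float32
n_images = 60_000 + 10_000        # train + test
pixels_per_image = 28 * 28
bytes_per_pixel = 4               # float32
n_classes = 10

image_bytes = bytes_per_pixel * pixels_per_image * n_images
label_bytes = n_images * n_classes    # assumption: 1 byte per one-hot entry
total_bytes = image_bytes + label_bytes

print(f"images: {image_bytes / 1e6:.1f} MB")
print(f"total:  {total_bytes / 1e6:.1f} MB")
```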

Data Generator

The __init__() initialization function

    def __init__(self,
                 img_paths,
                 labels, 
                 batch_size=32,
                 dim=(224, 224),
                 n_channels=3,
                 n_classes=4,
                 shuffle=True):
        self.dim = dim
        self.batch_size = batch_size
        self.labels = labels
        self.img_paths = img_paths
        self.n_channels = n_channels
        self.n_classes = n_classes
        self.shuffle = shuffle
        self.img_indexes = np.arange(len(self.img_paths))
        self.on_epoch_end()

shuffle: whether the data is shuffled after each epoch.

on_epoch_end()

This function runs once when the generator is created and again at the end of every epoch. It rebuilds the index array and shuffles it if shuffle is enabled.

    def on_epoch_end(self):
        'Updates indexes after each epoch'
        self.indexes = np.arange(len(self.img_paths))
        if self.shuffle:
            np.random.shuffle(self.indexes)
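To see what this does to the sample order, here is a small standalone sketch. `epoch_indexes` is a hypothetical helper, and I pass in a seeded NumPy `Generator` only so the run is repeatable; the class above uses `np.random.shuffle` directly.

```python
import numpy as np


def epoch_indexes(n_samples, shuffle, rng=None):
    """Return the order in which samples are visited in one epoch (hypothetical helper)."""
    indexes = np.arange(n_samples)
    if shuffle:
        if rng is None:
            rng = np.random.default_rng()
        rng.shuffle(indexes)      # in-place shuffle, same idea as np.random.shuffle
    return indexes


ordered = epoch_indexes(10, shuffle=False)
shuffled = epoch_indexes(10, shuffle=True, rng=np.random.default_rng(0))
print(ordered)
print(shuffled)
```

Shuffling only changes the order: every sample index still appears exactly once per epoch.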

__len__()

    def __len__(self):
        'Denotes the number of batches per epoch'
        return int(np.floor(len(self.img_indexes) / self.batch_size))

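Returns the number of batches in one epoch. len() is a built-in Python function, and defining __len__ makes it work on our generator. We set the value to:

number of batches = floor(number of samples / batch size)

This is the number of steps per epoch that we will see when training the model.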
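Plugging the MNIST sizes into that floor division gives the step counts for a batch size of 32. This is a small sketch with a hypothetical helper name, `batches_per_epoch`; it also shows that the floor leaves a few samples unused when the dataset size is not a multiple of the batch size.

```python
import math


def batches_per_epoch(n_samples, batch_size):
    """Same floor division as __len__ (hypothetical helper)."""
    return int(math.floor(n_samples / batch_size))


train_steps = batches_per_epoch(60_000, 32)
val_steps = batches_per_epoch(10_000, 32)

# Samples that do not fill a whole batch are left out of the epoch
skipped_val = 10_000 - val_steps * 32

print(train_steps, val_steps, skipped_val)
```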

__getitem__()

    def __getitem__(self, index):
        'Generate one batch of data'
        # Create the indexes for this batch
        indexes = self.indexes[index*self.batch_size:(index+1)*self.batch_size]

        # Get the list of IDs in this batch
        list_IDs_temps = [self.img_indexes[k] for k in indexes]

        # Generate data
        X, y = self.__data_generation(list_IDs_temps)
        return X, y

Given a batch number, this function returns the corresponding batch of data, following the order stored in self.indexes.
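The slicing expression is the heart of this function, so here it is on its own with plain Python lists. `batch_slice` is a hypothetical helper and the ten integers stand in for sample indexes.

```python
def batch_slice(indexes, batch_index, batch_size):
    """Pick the sample indexes that belong to one batch (hypothetical helper)."""
    return indexes[batch_index * batch_size:(batch_index + 1) * batch_size]


indexes = list(range(10))
batch_size = 4
n_batches = len(indexes) // batch_size    # floor, as in __len__

first = batch_slice(indexes, 0, batch_size)
second = batch_slice(indexes, 1, batch_size)
print(n_batches, first, second)
```

With ten samples and a batch size of four there are two full batches, so the last two indexes are never requested in that epoch.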

__data_generation()

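    def __data_generation(self, list_IDs_temps):
        'Build the image and label arrays for one batch'
        X = np.empty((self.batch_size, *self.dim), dtype='float32')
        y = []
        for i, ID in enumerate(list_IDs_temps):
            X[i,] = self.img_paths[ID]
            y.append(self.labels[ID])
        # Scale pixels to [0, 1] once, after the whole batch is filled.
        # Dividing inside the loop would rescale earlier images again and again.
        X = X / 255
        # Add the channel axis: (batch, height, width) -> (batch, height, width, 1)
        X = X[:, :, :, np.newaxis]
        return X, keras.utils.to_categorical(y, num_classes=self.n_classes)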

__data_generation() is called directly from __getitem__() and does the main work: reading the images, preprocessing them, and returning the data in the form the model expects for training.
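The same preprocessing can be tried outside the class. The sketch below is a standalone approximation, not the class method itself: `make_batch` is a hypothetical helper, the all-white images are made-up test data, and I use `np.eye` in place of `keras.utils.to_categorical` so that it runs without TensorFlow.

```python
import numpy as np


def make_batch(images, labels, ids, n_classes):
    """Assemble one normalized batch with one-hot labels (hypothetical helper)."""
    # Stack the selected images, then scale to [0, 1] once for the whole batch
    X = np.stack([images[i] for i in ids]).astype("float32") / 255.0
    # Add the channel axis: (batch, 28, 28) -> (batch, 28, 28, 1)
    X = X[..., np.newaxis]
    # One-hot encode by picking rows of the identity matrix (stand-in for to_categorical)
    y = np.eye(n_classes, dtype="float32")[[labels[i] for i in ids]]
    return X, y


# Made-up data: five all-white 28x28 images with arbitrary labels
images = np.full((5, 28, 28), 255, dtype="uint8")
labels = np.array([3, 1, 4, 1, 5])

X, y = make_batch(images, labels, ids=[0, 2], n_classes=10)
print(X.shape, X.dtype, y.shape)
```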

DataGenerator class

Putting the functions above together gives the complete class below.

class DataGenerator(Sequence):
    def __init__(self,
                 img_paths,
                 labels,
                 batch_size=32,
                 dim=(224, 224),
                 n_channels=3,
                 n_classes=4,
                 shuffle=True):
        self.dim = dim
        self.batch_size = batch_size
        self.labels = labels
        # In this MNIST example, img_paths holds the image arrays themselves
        self.img_paths = img_paths
        self.n_channels = n_channels
        self.n_classes = n_classes
        self.shuffle = shuffle
        self.img_indexes = np.arange(len(self.img_paths))
        self.on_epoch_end()

    def __len__(self):
        'Denotes the number of batches per epoch'
        return int(np.floor(len(self.img_indexes) / self.batch_size))

    def __getitem__(self, index):
        'Generate one batch of data'
        # Generate indexes of the batch
        indexes = self.indexes[index*self.batch_size:(index+1)*self.batch_size]

        # Find list of IDs
        list_IDs_temps = [self.img_indexes[k] for k in indexes]

        # Generate data
        X, y = self.__data_generation(list_IDs_temps)
        return X, y

    def on_epoch_end(self):
        'Updates indexes after each epoch'
        self.indexes = np.arange(len(self.img_paths))
        if self.shuffle:
            np.random.shuffle(self.indexes)

    def __data_generation(self, list_IDs_temps):
        'Build the image and label arrays for one batch'
        X = np.empty((self.batch_size, *self.dim), dtype='float32')
        y = []
        for i, ID in enumerate(list_IDs_temps):
            X[i,] = self.img_paths[ID]
            y.append(self.labels[ID])
        # Scale pixels to [0, 1] once, after the whole batch is filled.
        # Dividing inside the loop would rescale earlier images again and again.
        X = X / 255
        # Add the channel axis: (batch, height, width) -> (batch, height, width, 1)
        X = X[:, :, :, np.newaxis]
        return X, keras.utils.to_categorical(y, num_classes=self.n_classes)

Initialize data and Training model

Here I just use the simple classification model below.

Initialize the model

n_classes = 10
input_shape = (28, 28)
model = Sequential()
model.add(Conv2D(32, kernel_size=(3, 3),
                 activation='relu',
                 input_shape=(28, 28 , 1)))
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(n_classes, activation='softmax'))
model.compile(loss=keras.losses.categorical_crossentropy,
              optimizer=keras.optimizers.Adam(),
              metrics=['accuracy'])

Initialize the train_generator and val_generator data

train_generator = DataGenerator(x_train, y_train, batch_size = 32, dim = input_shape,
 n_classes=10, shuffle=True)
val_generator = DataGenerator(x_test, y_test, batch_size=32, dim = input_shape, 
n_classes= n_classes, shuffle=True)

Next, we train the model. model.fit accepts a Sequence directly and takes the number of steps per epoch from its length; the older fit_generator method is deprecated in TensorFlow 2.x.

model.fit(
 train_generator,
 epochs=10,
 validation_data=val_generator)
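To make it clearer how the three methods of the generator are used during training, here is a deliberately simplified picture in plain Python. It is not the actual Keras training loop, which does much more; `ToySequence` and `run_epochs` are hypothetical names, and the batches are just lists of integers.

```python
class ToySequence:
    """Hypothetical stand-in exposing the same three methods as DataGenerator."""

    def __init__(self, n_samples, batch_size):
        self.n_samples = n_samples
        self.batch_size = batch_size
        self.epochs_finished = 0

    def __len__(self):
        # Number of full batches per epoch
        return self.n_samples // self.batch_size

    def __getitem__(self, index):
        # A batch is just the list of sample positions it covers
        start = index * self.batch_size
        return list(range(start, start + self.batch_size))

    def on_epoch_end(self):
        self.epochs_finished += 1


def run_epochs(sequence, epochs):
    """Ask for every batch once per epoch, then signal the end of the epoch."""
    batches_seen = 0
    for _ in range(epochs):
        for step in range(len(sequence)):
            batch = sequence[step]    # a real training step would consume this batch
            batches_seen += 1
        sequence.on_epoch_end()
    return batches_seen


toy = ToySequence(n_samples=10, batch_size=4)
seen = run_epochs(toy, epochs=3)
print(seen, toy.epochs_finished)
```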

Conclusion

Thank you for reading my article. If anything is not right, I look forward to your suggestions!

Reference

https://stanford.edu/~shervine/blog/keras-how-to-generate-data-on-the-fly

Source : Viblo