In this article I will show how to write a custom Data Generator with Keras.
What will I write in this article?
- Why Data Generator
- Practice
- Conclusion
- Reference
Why Data Generator
In practice, not everyone can afford a powerful machine, and the data we need for training may occupy more RAM than the machine actually has. The problem appears when the dataset is too large to load into RAM all at once, split into train and test sets, and then train on. To solve it, we split the dataset into small batches and load only one batch at a time while training the model. We can take the instant-noodle route and use the ImageDataGenerator that ships with Keras, or we can cook our own dish the way we like it by writing a custom Data Generator.
In this article, I will walk through a practical example with MNIST.
Practice
To make a custom Data Generator, Keras provides a Sequence class that our own classes can inherit from.
First, we need to load the MNIST dataset.
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.utils import Sequence, to_categorical
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, Dense, Flatten, MaxPooling2D, Dropout

# Load the MNIST data
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
Our MNIST set includes 60,000 images in the train set and 10,000 images in the test set. Each image is 28×28 pixels. Stored as float32, each pixel takes 4 bytes, so we will need about 4 * (28 * 28) * 70000 + (70000 * 10) bytes ~ 220 MB of RAM on paper, and in reality probably more. So choosing a Data Generator is reasonable.
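The estimate above can be reproduced with a few lines of Python. This is only a rough sketch: it assumes float32 pixels (4 bytes each) and, like the formula in the text, counts one byte per one-hot label entry. The variable names are mine.

```python
# Rough memory estimate for MNIST held in RAM.
# Assumptions: 4 bytes per pixel (float32), 1 byte per one-hot label entry.
n_images = 60000 + 10000      # train + test
pixels_per_image = 28 * 28
bytes_per_pixel = 4           # float32
n_classes = 10

image_bytes = bytes_per_pixel * pixels_per_image * n_images
label_bytes = n_images * n_classes
total_bytes = image_bytes + label_bytes

print(f"images: {image_bytes / 1e6:.1f} MB")   # images: 219.5 MB
print(f"labels: {label_bytes / 1e6:.1f} MB")   # labels: 0.7 MB
print(f"total:  {total_bytes / 1e6:.1f} MB")   # total:  220.2 MB
```

Almost all of the memory goes to the pixels, so the image size and dtype are what matter when estimating whether a dataset fits in RAM.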
Data Generator
__init__() initialization function
def __init__(self,
             img_paths,
             labels,
             batch_size=32,
             dim=(224, 224),
             n_channels=3,
             n_classes=4,
             shuffle=True):
    super().__init__()  # let the Sequence base class set itself up
    self.dim = dim
    self.batch_size = batch_size
    self.labels = labels
    self.img_paths = img_paths
    self.n_channels = n_channels
    self.n_classes = n_classes
    self.shuffle = shuffle
    self.img_indexes = np.arange(len(self.img_paths))
    # Build the (optionally shuffled) sample order for the first epoch
    self.on_epoch_end()
img_paths: the images (in this article, the image arrays themselves)
shuffle: whether to shuffle the data after each epoch
on_epoch_end()
This function is called once at initialization and again at the end of every epoch. It rebuilds the sample order and decides whether to shuffle it.
def on_epoch_end(self):
    'Updates indexes after each epoch'
    self.indexes = np.arange(len(self.img_paths))
    if self.shuffle:
        np.random.shuffle(self.indexes)
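To see the effect in isolation, here is a minimal sketch of the same idea outside of Keras. The helper name `make_epoch_indexes` is hypothetical, and it uses Python's standard `random` module with a fixed seed instead of `np.random.shuffle` so that it runs without any dependencies.

```python
import random

def make_epoch_indexes(n_samples, shuffle, rng):
    """Return the sample order for one epoch (illustrative helper, mirrors on_epoch_end)."""
    indexes = list(range(n_samples))
    if shuffle:
        rng.shuffle(indexes)  # shuffles the list in place
    return indexes

rng = random.Random(0)  # fixed seed so the run is repeatable
ordered = make_epoch_indexes(8, shuffle=False, rng=rng)
shuffled = make_epoch_indexes(8, shuffle=True, rng=rng)

print(ordered)   # [0, 1, 2, 3, 4, 5, 6, 7]
print(shuffled)  # the same eight indexes in a different order
```

Shuffling the indexes rather than the data itself is cheap: the images stay where they are and only the order in which they are visited changes.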
__len__()
def __len__(self):
    'Denotes the number of batches per epoch'
    return int(np.floor(len(self.img_indexes) / self.batch_size))
Returns the number of batches in one epoch. __len__() is the method that Python's built-in len() function calls. We set the value to floor(number of samples / batch_size), so a last, incomplete batch is dropped.
It is the number of steps per epoch that we will see while training the model.
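As a quick check of the batch count, the sketch below applies the same floor division to the MNIST sizes with a batch size of 32. The function name `batches_per_epoch` is only for illustration.

```python
import math

def batches_per_epoch(n_samples, batch_size):
    """Number of full batches per epoch; an incomplete last batch is dropped."""
    return math.floor(n_samples / batch_size)

train_steps = batches_per_epoch(60000, 32)
test_steps = batches_per_epoch(10000, 32)
leftover = 10000 - test_steps * 32   # test samples that never appear in a batch

print(train_steps, test_steps, leftover)  # 1875 312 16
```

With the test set, 16 images are left over in every epoch. If shuffling is on, a different 16 are skipped each time.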
__getitem__()
def __getitem__(self, index):
    'Generate one batch of data'
    # Generate the indexes of this batch
    indexes = self.indexes[index*self.batch_size:(index+1)*self.batch_size]
    # Get the list of IDs in this batch
    list_IDs_temps = [self.img_indexes[k] for k in indexes]
    # Generate data
    X, y = self.__data_generation(list_IDs_temps)
    return X, y
This function generates the batch at the given position: batch number index covers the samples from index * batch_size up to (index + 1) * batch_size in the current epoch order.
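The slicing can be tried on a plain list. This is a small sketch with a made-up epoch order of 10 samples and a hypothetical helper `batch_ids`; it is not part of the generator class.

```python
def batch_ids(indexes, batch_index, batch_size):
    """Return the sample IDs that belong to batch number `batch_index`."""
    start = batch_index * batch_size
    return indexes[start:start + batch_size]

# Hypothetical epoch order of 10 samples (as if already shuffled).
epoch_order = [7, 2, 9, 0, 4, 1, 8, 3, 6, 5]

print(batch_ids(epoch_order, 0, 4))  # [7, 2, 9, 0]
print(batch_ids(epoch_order, 1, 4))  # [4, 1, 8, 3]
```

With 10 samples and a batch size of 4 there are only two full batches, so samples 6 and 5 are not used in this epoch.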
__data_generation()
def __data_generation(self, list_IDs_temps):
    'Build one batch of images and one-hot labels'
    X = np.empty((self.batch_size, *self.dim), dtype='float32')
    y = []
    for i, ID in enumerate(list_IDs_temps):
        X[i,] = self.img_paths[ID]
        y.append(self.labels[ID])
    # Scale pixels to [0, 1] once, after the whole batch is filled
    X = X / 255
    # Add the channel axis: (batch, H, W) -> (batch, H, W, 1)
    X = X[:, :, :, np.newaxis]
    return X, keras.utils.to_categorical(y, num_classes=self.n_classes)
__data_generation() is called directly from __getitem__() and does the main work: reading the images, processing them, and returning the data in the form the model expects for training.
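The same steps can be checked on synthetic data without Keras. The sketch below is an illustration under a few assumptions: `make_batch` is a hypothetical helper, the images are random uint8 arrays standing in for MNIST, and `np.eye` is used for one-hot encoding in place of `keras.utils.to_categorical`.

```python
import numpy as np

def make_batch(images, labels, ids, n_classes):
    """Build one (X, y) batch from in-memory arrays (illustrative helper)."""
    # Gather the selected images and scale pixels to [0, 1]
    X = np.stack([images[i] for i in ids]).astype("float32") / 255.0
    # Add the channel axis: (batch, H, W) -> (batch, H, W, 1)
    X = X[..., np.newaxis]
    # One-hot encode: row k of the identity matrix encodes class k
    y = np.eye(n_classes, dtype="float32")[[labels[i] for i in ids]]
    return X, y

# Synthetic stand-in for MNIST: 5 random 28x28 uint8 images.
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(5, 28, 28), dtype=np.uint8)
labels = np.array([3, 0, 9, 1, 3])

X, y = make_batch(images, labels, ids=[4, 0, 2], n_classes=10)
print(X.shape, y.shape)  # (3, 28, 28, 1) (3, 10)
```

Note that the scaling is applied once to the finished batch. Dividing inside the per-image loop would rescale images that were already filled in.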
DataGenerator class
Putting the functions above together, we get the complete code below.
class DataGenerator(Sequence):
    def __init__(self,
                 img_paths,
                 labels,
                 batch_size=32,
                 dim=(224, 224),
                 n_channels=3,
                 n_classes=4,
                 shuffle=True):
        super().__init__()  # let the Sequence base class set itself up
        self.dim = dim
        self.batch_size = batch_size
        self.labels = labels
        self.img_paths = img_paths
        self.n_channels = n_channels
        self.n_classes = n_classes
        self.shuffle = shuffle
        self.img_indexes = np.arange(len(self.img_paths))
        self.on_epoch_end()

    def __len__(self):
        'Denotes the number of batches per epoch'
        return int(np.floor(len(self.img_indexes) / self.batch_size))

    def __getitem__(self, index):
        'Generate one batch of data'
        # Generate indexes of the batch
        indexes = self.indexes[index*self.batch_size:(index+1)*self.batch_size]
        # Find list of IDs
        list_IDs_temps = [self.img_indexes[k] for k in indexes]
        # Generate data
        X, y = self.__data_generation(list_IDs_temps)
        return X, y

    def on_epoch_end(self):
        'Updates indexes after each epoch'
        self.indexes = np.arange(len(self.img_paths))
        if self.shuffle:
            np.random.shuffle(self.indexes)

    def __data_generation(self, list_IDs_temps):
        'Build one batch of images and one-hot labels'
        X = np.empty((self.batch_size, *self.dim), dtype='float32')
        y = []
        for i, ID in enumerate(list_IDs_temps):
            X[i,] = self.img_paths[ID]
            y.append(self.labels[ID])
        # Scale pixels to [0, 1] once, after the whole batch is filled
        X = X / 255
        # Add the channel axis: (batch, H, W) -> (batch, H, W, 1)
        X = X[:, :, :, np.newaxis]
        return X, keras.utils.to_categorical(y, num_classes=self.n_classes)
Initialize the data and train the model
Here I just use a simple classification model. Initialize the model:
n_classes = 10
input_shape = (28, 28)
model = Sequential()
model.add(Conv2D(32, kernel_size=(3, 3),
                 activation='relu',
                 input_shape=(28, 28, 1)))
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(n_classes, activation='softmax'))
model.compile(loss=keras.losses.categorical_crossentropy,
              optimizer=keras.optimizers.Adam(),
              metrics=['accuracy'])
Initialize train_generator and val_generator:
train_generator = DataGenerator(x_train, y_train, batch_size=32, dim=input_shape,
                                n_classes=n_classes, shuffle=True)
val_generator = DataGenerator(x_test, y_test, batch_size=32, dim=input_shape,
                              n_classes=n_classes, shuffle=True)
Next, train the model.
# model.fit accepts a Sequence directly; fit_generator is deprecated in TensorFlow 2.x
model.fit(
    train_generator,
    steps_per_epoch=len(train_generator),
    epochs=10,
    validation_data=val_generator,
    validation_steps=len(val_generator))
Conclusion
Thank you for reading my article. If anything is not right, I look forward to your suggestions!
Reference
https://stanford.edu/~shervine/blog/keras-how-to-generate-data-on-the-fly