Loading Data and Data Processing#
There are different types of data that need to be loaded and processed:

- Texts
  - CSV files
  - Human-readable text files
- Images
- Audio

Data may come from different sources:

- Locally available data
- The URL of a dataset
- Datasets available from TensorFlow and Kaggle
- NumPy arrays and Pandas DataFrames
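The last of these sources is the simplest: data that already lives in memory as NumPy arrays (or Pandas DataFrames) can be wrapped directly in a tf.data.Dataset. The short sketch below uses made-up toy values only to illustrate the call; the sections that follow cover the other sources in detail.
# A minimal sketch with hypothetical toy values: wrapping in-memory NumPy
# arrays in a tf.data.Dataset using from_tensor_slices.
import numpy as np
import tensorflow as tf

toy_features = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
toy_labels = np.array([0, 1, 0])
toy_dataset = tf.data.Dataset.from_tensor_slices((toy_features, toy_labels))
for features, label in toy_dataset.take(1):
    print(features.numpy(), label.numpy())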
Texts#
CSV files#
Loading CSV files#
In the following exercise, we see how data from a CSV file, whose URL is known to us, is obtained and loaded into TensorFlow.
In the first approach, we will load the data directly from the URL.
import pandas as pd # importing the pandas library
csv_data = pd.read_csv(
"https://storage.googleapis.com/download.tensorflow.org/data/abalone_train.csv",
names=[
"Length",
"Diameter",
"Height",
"Whole weight",
"Shucked weight",
"Viscera weight",
"Shell weight",
"Age",
],
)
Print the values of the data frame
csv_data
In the second approach, we will download the data to our disk and create a Pandas dataframe.
from tensorflow.keras import utils # We use utils for downloading the data
url = "https://storage.googleapis.com/download.tensorflow.org/data/abalone_train.csv"
dataset_dir = utils.get_file(origin=url, cache_dir="./")
Verify that the file has been downloaded. If the download above completed, the data will be in the datasets directory inside the current folder.
from os import listdir
files = list(listdir("./datasets"))
print(files)
csv_data = pd.read_csv(
"./datasets/abalone_train.csv",
names=[
"Length",
"Diameter",
"Height",
"Whole weight",
"Shucked weight",
"Viscera weight",
"Shell weight",
"Age",
],
)
csv_data
Preprocessing CSV files#
Suppose that our goal is to predict the age from the other features.
We will first see the unique values of Age.
print(csv_data["Age"].unique())
Now we will create a separate dataframe for the features, popping the Age column off as the labels.
features = csv_data.copy()
labels = features.pop("Age")
features
labels
We see only numerical values in the features. We will now create a NumPy array of these features.
import numpy as np
np_features = np.array(features)
print(np_features)
Note: It is a good practice to normalize the data before training a TensorFlow model.
We will use the Normalization layer from TensorFlow for this purpose.
Note that adapt should be run only on the training data, never on the test data.
from tensorflow.keras import layers
normalize = layers.Normalization()
normalize.adapt(np_features)
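As an optional sanity check (a minimal sketch reusing the np_features array from above), we can apply the adapted layer to the training features and verify that the output has roughly zero mean and unit variance in each column.
# Optional sanity check: the adapted layer should produce features with
# roughly zero mean and unit variance per column.
normalized_features = normalize(np_features)
print(np.mean(normalized_features.numpy(), axis=0))
print(np.std(normalized_features.numpy(), axis=0))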
Using the data for training#
Now we are ready to proceed with the training: we have both the features and the labels.
### Building a model
import tensorflow as tf
model = tf.keras.Sequential(
[
normalize, ## Note the use of Normalization layer
layers.Dense(64),
layers.Dense(1),
]
)
model.compile(loss=tf.losses.MeanSquaredError(), optimizer=tf.optimizers.Adam())
## Starting the training
history = model.fit(np_features, labels, epochs=10)
import matplotlib.pyplot as plt
metrics = history.history
plt.plot(history.epoch, metrics["loss"])
plt.legend(["loss"])
plt.show()
## Printing the model summary
model.summary()
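Once trained, the model can be used for prediction. The following is a small illustrative sketch (not part of the exercise above) that predicts the age for the first three rows of the training features; because the Normalization layer is part of the model, raw feature values can be fed in directly.
# Illustrative only: predict the age for the first three training examples
# and compare with the true labels.
predictions = model.predict(np_features[:3])
print(predictions)
print(labels[:3].values)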
Text files#
Now we will focus on the text files.
We will use the Stack Overflow dataset for this part. Note that it is distributed as a gzipped tar archive (.tar.gz).
## Download, unzip and untar the dataset
url = "https://storage.googleapis.com/download.tensorflow.org/data/stack_overflow_16k.tar.gz"
dataset_dir = utils.get_file(
origin=url, untar=True, cache_dir="./datasets", cache_subdir=""
)
We will now see the contents of the downloaded file(s).
from os import listdir
files = list(listdir("./datasets"))
print(files)
If the download is done correctly, we can see the train and test folders.
files = list(listdir("./datasets/train"))
print(files)
files = list(listdir("./datasets/train/javascript"))
print(files[:5]) # print the name of first five files
files = list(listdir("./datasets/test"))
print(files)
As we can see, there are only two folders: train and test. It is a good practice to create separate datasets for training, validation and testing.
In the following code, we load the training data from the train folder and use 20% of it for validation.
In the CSV example we loaded the complete data at once. Here we will instead load the text data in batches of size 32.
batch_size = 32
validation_size_percentage = 0.2 # To ensure 80:20 training:validation
seed = 40  # To ensure a reproducible shuffle and split
train_batch = utils.text_dataset_from_directory(
"./datasets/train",
batch_size=batch_size,
validation_split=validation_size_percentage,
subset="training",
seed=seed,
)
We will now look at the data in the first batch: each text and its associated label.
for text_batch, label_batch in train_batch.take(1):  # Taking into consideration the first batch
for i in range(batch_size):
print(f"Label: {label_batch[i].numpy()} for the text: {text_batch[i].numpy()}")
But you may have observed that the labels are not ‘csharp’, ‘python’, ‘java’, ‘javascript’, but rather the numbers 0, 1, 2, 3.
Now, we will display the associated class names.
for i, label in enumerate(train_batch.class_names):
print("Label", i, "corresponds to", label)
Now, we will create the validation dataset.
Recall that we used the value ‘training’ for the parameter subset. This time we will use the value ‘validation’.
validation_batch = utils.text_dataset_from_directory(
"./datasets/train",
batch_size=batch_size,
validation_split=validation_size_percentage,
subset="validation",
seed=seed,
)
And finally, we will create the test batch.
test_batch = utils.text_dataset_from_directory("./datasets/test", batch_size=batch_size)
However, these datasets cannot yet be used for training, since TensorFlow models work on numeric vectors rather than raw strings.
Our next goal is to convert the text data into vectors.
We will use two approaches:

- Binary vectorization (a multi-hot, one-hot style encoding)
- ‘int’ vectorization (an integer index for each token)

We will start with binary vectorization.
from tensorflow.keras.layers import TextVectorization
vocabulary_size = 10000
binary_vectorize_layer = TextVectorization(
max_tokens=vocabulary_size, output_mode="binary"
)
train_text = train_batch.map(lambda text, labels: text)
binary_vectorize_layer.adapt(train_text)
for text_batch, label_batch in train_batch.take(1):
for i in range(1):
print("Text", text_batch[0].numpy())
print("Label", label_batch[0].numpy())
print("Binary Vectorization:")
print(binary_vectorize_layer(text_batch[0])) # One-hot encoding
The above output is the binary (multi-hot) encoding of the text.
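To make this concrete, the short sketch below (reusing binary_vectorize_layer and the text_batch from the cell above) maps the non-zero positions of the binary vector back to vocabulary terms, i.e. the tokens that occur in this text.
# Map the non-zero entries of the binary vector back to vocabulary terms.
binary_vector = tf.reshape(binary_vectorize_layer(text_batch[0]), [-1])
vocabulary = binary_vectorize_layer.get_vocabulary()
present_indices = tf.where(binary_vector > 0)[:, 0]
print([vocabulary[i] for i in present_indices.numpy()[:20]])  # first 20 terms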
We will now move on to ‘int’ vectorization.
Unlike binary vectorization, here we also need a maximum sequence length.
Longer texts will be truncated to this maximum length.
from tensorflow.keras.layers import TextVectorization
vocabulary_size = 10000
max_sequence_length = 200
int_vectorize_layer = TextVectorization(
max_tokens=vocabulary_size,
output_sequence_length=max_sequence_length,
output_mode="int",
)
train_text = train_batch.map(lambda text, labels: text)
int_vectorize_layer.adapt(train_text)
for text_batch, label_batch in train_batch.take(1):
for i in range(1):
print("Text", text_batch[0].numpy())
print("Label", label_batch[0].numpy())
        print("Int Vectorization:")
print(int_vectorize_layer(text_batch[0])) # Check the length of the sequence
To understand what the above sequence signifies, we will look up each index in the vocabulary.
for i in int_vectorize_layer(text_batch[0]):
print(f"{i}: ", int_vectorize_layer.get_vocabulary()[i])
Now, we create functions to apply binary or ‘int’ vectorization to the dataset.
def binary_vectorize(text, label):
text = tf.expand_dims(text, -1)
return binary_vectorize_layer(text), label
def int_vectorize(text, label):
text = tf.expand_dims(text, -1)
return int_vectorize_layer(text), label
binary_train_batch = train_batch.map(binary_vectorize)
binary_validation_batch = validation_batch.map(binary_vectorize)
binary_test_batch = test_batch.map(binary_vectorize)
int_train_batch = train_batch.map(int_vectorize)
int_validation_batch = validation_batch.map(int_vectorize)
int_test_batch = test_batch.map(int_vectorize)
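The map calls above re-run vectorization every time a batch is drawn, so it is common practice (and recommended in the TensorFlow text guides) to cache the vectorized datasets and prefetch batches. Below is a minimal optional sketch for the binary datasets; the same pattern applies to the ‘int’ ones.
# Optional performance configuration: cache the vectorized batches and
# overlap preprocessing with training via prefetch.
AUTOTUNE = tf.data.AUTOTUNE
binary_train_batch = binary_train_batch.cache().prefetch(AUTOTUNE)
binary_validation_batch = binary_validation_batch.cache().prefetch(AUTOTUNE)
binary_test_batch = binary_test_batch.cache().prefetch(AUTOTUNE)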
# Training using binary vectors
from tensorflow.keras import losses
num_labels = 4
binary_model = tf.keras.Sequential([layers.Dense(num_labels)])
binary_model.compile(
loss=losses.SparseCategoricalCrossentropy(from_logits=True),
optimizer="adam",
metrics=["accuracy"],
)
history = binary_model.fit(
binary_train_batch, validation_data=binary_validation_batch, epochs=10
)
binary_model.summary()
metrics = history.history
plt.plot(history.epoch, metrics["loss"], metrics["val_loss"])
plt.legend(["loss", "val_loss"])
plt.show()
num_labels = 4
vocab_size = vocabulary_size + 1
int_model = tf.keras.Sequential(
[
layers.Embedding(vocab_size, 64, mask_zero=True),
layers.Conv1D(64, 5, padding="valid", activation="relu", strides=2),
layers.GlobalMaxPooling1D(),
layers.Dense(num_labels),
]
)
int_model.compile(
loss=losses.SparseCategoricalCrossentropy(from_logits=True),
optimizer="adam",
metrics=["accuracy"],
)
history = int_model.fit(int_train_batch, validation_data=int_validation_batch, epochs=5)
int_model.summary()
metrics = history.history
plt.plot(history.epoch, metrics["loss"], metrics["val_loss"])
plt.legend(["loss", "val_loss"])
plt.show()
binary_loss, binary_accuracy = binary_model.evaluate(binary_test_batch)
int_loss, int_accuracy = int_model.evaluate(int_test_batch)
print("Binary model accuracy: {:2.2%}".format(binary_accuracy))
print("Int model accuracy: {:2.2%}".format(int_accuracy))
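Both trained models still expect pre-vectorized input. If we want a model that accepts raw strings directly, one common pattern (shown here as a sketch only) is to wrap the vectorization layer and the trained model into a new Sequential model and evaluate it on the raw test batch.
# Sketch of an end-to-end model that takes raw strings as input.
export_model = tf.keras.Sequential(
    [binary_vectorize_layer, binary_model, layers.Activation("softmax")]
)
export_model.compile(
    loss=losses.SparseCategoricalCrossentropy(from_logits=False),
    optimizer="adam",
    metrics=["accuracy"],
)
loss, accuracy = export_model.evaluate(test_batch)
print("Raw-string model accuracy: {:2.2%}".format(accuracy))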
Images#
Loading Image dataset#
url = "https://storage.googleapis.com/download.tensorflow.org/example_images/flower_photos.tgz"
data_dir = tf.keras.utils.get_file(
origin=url, cache_dir="./datasets", cache_subdir="", untar=True
)
We will now check the contents of the downloaded data.
from os import listdir
files = list(listdir("./datasets"))
print(files)
files = list(listdir("./datasets/flower_photos"))
print(files)
Image data processing for building Tensorflow models#
For the processing of images, we need the values of the batch size and the image size (height and width).
batch_size = 32
img_height = 180
img_width = 180
For preparing the batches, we make use of image_dataset_from_directory. Note the parameter image_size.
We will first create the training batch.
batch_size = 32
validation_size_percentage = 0.2 # To ensure 80:20 training:validation
seed = 40  # To ensure a reproducible shuffle and split
train_batch = utils.image_dataset_from_directory(
"./datasets/flower_photos",
batch_size=batch_size,
validation_split=validation_size_percentage,
subset="training",
image_size=(img_height, img_width),
seed=seed,
)
Next, we will create the validation batch.
validation_batch = utils.image_dataset_from_directory(
"./datasets/flower_photos",
batch_size=batch_size,
validation_split=validation_size_percentage,
subset="validation",
image_size=(img_height, img_width),
seed=seed,
)
Let’s see how the class names have been identified by TensorFlow.
class_names = train_batch.class_names
print(class_names)
Next, we will plot the images as well as their labels.
import matplotlib.pyplot as plt
plt.figure(figsize=(10, 10))
for images, labels in train_batch.take(1):
for i in range(9):
ax = plt.subplot(3, 3, i + 1)
plt.imshow(images[i].numpy().astype("uint8"))
plt.title(class_names[labels[i]])
plt.axis("off")
We will now see the details of the first image from a batch: the shape of the image as well as the shape of the labels.
for images, labels in train_batch.take(1):
print(images[0].shape)
print(labels.shape)
break
As we have seen before, it is very important to normalize the images. Here we rescale the pixel values from the [0, 255] range to [0, 1].
normalization_layer = tf.keras.layers.Rescaling(1.0 / 255)
We now create the normalized batch.
normalized_batch = train_batch.map(lambda x, y: (normalization_layer(x), y))
Let’s now check the pixel value range of an image from the normalized batch.
for images, labels in normalized_batch.take(1):
print(np.min(images[0]), np.max(images[0]))
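Before training, it is also good practice to configure the datasets for performance by caching, shuffling and prefetching, as recommended in the TensorFlow image loading guide. A minimal optional sketch:
# Optional performance configuration for the image datasets.
AUTOTUNE = tf.data.AUTOTUNE
train_batch = train_batch.cache().shuffle(1000).prefetch(buffer_size=AUTOTUNE)
validation_batch = validation_batch.cache().prefetch(buffer_size=AUTOTUNE)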
We will now build a TensorFlow model for image classification.
We add the rescaling (normalization) layer inside this model.
num_classes = 5
model = tf.keras.Sequential(
[
tf.keras.layers.Rescaling(1.0 / 255),
tf.keras.layers.Conv2D(32, 3, activation="relu"),
tf.keras.layers.MaxPooling2D(),
tf.keras.layers.Conv2D(32, 3, activation="relu"),
tf.keras.layers.MaxPooling2D(),
tf.keras.layers.Conv2D(32, 3, activation="relu"),
tf.keras.layers.MaxPooling2D(),
tf.keras.layers.Flatten(),
tf.keras.layers.Dense(128, activation="relu"),
tf.keras.layers.Dense(num_classes),
]
)
model.compile(
optimizer="adam",
loss=tf.losses.SparseCategoricalCrossentropy(from_logits=True),
metrics=["accuracy"],
)
We now perform the training of the model.
history = model.fit(train_batch, validation_data=validation_batch, epochs=3)
model.summary()
metrics = history.history
plt.plot(history.epoch, metrics["loss"], metrics["val_loss"])
plt.legend(["loss", "val_loss"])
plt.show()
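As with the earlier models, the trained classifier can now be used for inference. The sketch below (illustrative only) takes one batch from the validation data and prints the predicted and true class of the first image.
# Illustrative inference on a single validation image.
for images, image_labels in validation_batch.take(1):
    logits = model.predict(images)
    predicted = tf.argmax(logits, axis=1)
    print("Predicted:", class_names[predicted[0].numpy()])
    print("Actual:", class_names[image_labels[0].numpy()])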
Audio#
Loading Audio data#
url = "http://storage.googleapis.com/download.tensorflow.org/data/mini_speech_commands.zip"
data_dir = tf.keras.utils.get_file(
origin=url, cache_dir="./datasets", cache_subdir="", extract=True
)
from os import listdir
files = list(listdir("./datasets"))
print(files)
files = list(listdir("./datasets/mini_speech_commands"))
commands = []
for c in files:
if c != "README.md":
commands.append(c)
print(commands)
from glob import glob
filenames = glob("./datasets/mini_speech_commands/*/*")
filenames = tf.random.shuffle(filenames)
print(f"Number of examples: {len(filenames)}")
train_files = filenames[:6400]
val_files = filenames[6400 : 6400 + 800]
test_files = filenames[-800:]
test_file = tf.io.read_file(train_files[0])
test_audio, _ = tf.audio.decode_wav(contents=test_file)
test_audio.shape
def decode_audio(audio_binary):
audio, _ = tf.audio.decode_wav(contents=audio_binary)
return tf.squeeze(audio, axis=-1)
import os
def get_label_from_filepath(file_path):
parts = tf.strings.split(input=file_path, sep=os.path.sep)
print(parts[-2])
return parts[-2]
t = get_label_from_filepath(train_files[0])
print(type(t))
print(t.numpy())
def get_waveform_and_label(file_path):
    label = get_label_from_filepath(file_path)
    audio_binary = tf.io.read_file(file_path)
    waveform = decode_audio(audio_binary)
    return waveform, label
AUTOTUNE = tf.data.AUTOTUNE
files_ds = tf.data.Dataset.from_tensor_slices(train_files)
waveform_ds = files_ds.map(map_func=get_waveform_and_label, num_parallel_calls=AUTOTUNE)
import matplotlib.pyplot as plt
import numpy as np
rows = 3
cols = 3
n = rows * cols
fig, axes = plt.subplots(rows, cols, figsize=(10, 12))
for i, (audio, label) in enumerate(waveform_ds.take(n)):
r = i // cols
c = i % cols
ax = axes[r][c]
ax.plot(audio.numpy())
ax.set_yticks(np.arange(-1.2, 1.2, 0.2))
label = label.numpy().decode("utf-8")
ax.set_title(label)
plt.show()
def get_spectrogram(waveform):
# Zero-padding for an audio waveform with less than 16,000 samples.
input_len = 16000
waveform = waveform[:input_len]
zero_padding = tf.zeros([16000] - tf.shape(waveform), dtype=tf.float32)
# Cast the waveform tensors' dtype to float32.
waveform = tf.cast(waveform, dtype=tf.float32)
# Concatenate the waveform with `zero_padding`, which ensures all audio
# clips are of the same length.
equal_length = tf.concat([waveform, zero_padding], 0)
# Convert the waveform to a spectrogram via a STFT.
spectrogram = tf.signal.stft(equal_length, frame_length=255, frame_step=128)
# Obtain the magnitude of the STFT.
spectrogram = tf.abs(spectrogram)
# Add a `channels` dimension, so that the spectrogram can be used
# as image-like input data with convolution layers (which expect
    # shape (`batch_size`, `height`, `width`, `channels`)).
spectrogram = spectrogram[..., tf.newaxis]
return spectrogram
from IPython import display
for waveform, label in waveform_ds.take(1):
label = label.numpy().decode("utf-8")
spectrogram = get_spectrogram(waveform)
print("Label:", label)
print("Waveform shape:", waveform.shape)
print("Spectrogram shape:", spectrogram.shape)
print("Audio playback")
display.display(display.Audio(waveform, rate=16000))
def plot_spectrogram(spectrogram, ax):
if len(spectrogram.shape) > 2:
assert len(spectrogram.shape) == 3
spectrogram = np.squeeze(spectrogram, axis=-1)
# Convert the frequencies to log scale and transpose, so that the time is
# represented on the x-axis (columns).
# Add an epsilon to avoid taking a log of zero.
log_spec = np.log(spectrogram.T + np.finfo(float).eps)
height = log_spec.shape[0]
width = log_spec.shape[1]
X = np.linspace(0, np.size(spectrogram), num=width, dtype=int)
Y = range(height)
ax.pcolormesh(X, Y, log_spec, shading="auto")
fig, axes = plt.subplots(2, figsize=(12, 8))
timescale = np.arange(waveform.shape[0])
axes[0].plot(timescale, waveform.numpy())
axes[0].set_title("Waveform")
axes[0].set_xlim([0, 16000])
plot_spectrogram(spectrogram.numpy(), axes[1])
axes[1].set_title("Spectrogram")
plt.show()
def get_spectrogram_and_label_id(audio, label):
spectrogram = get_spectrogram(audio)
label_id = tf.argmax(label == commands)
return spectrogram, label_id
spectrogram_ds = waveform_ds.map(
map_func=get_spectrogram_and_label_id, num_parallel_calls=AUTOTUNE
)
rows = 3
cols = 3
n = rows * cols
fig, axes = plt.subplots(rows, cols, figsize=(10, 10))
for i, (spectrogram, label_id) in enumerate(spectrogram_ds.take(n)):
r = i // cols
c = i % cols
ax = axes[r][c]
plot_spectrogram(spectrogram.numpy(), ax)
ax.set_title(commands[label_id.numpy()])
ax.axis("off")
plt.show()
def preprocess_dataset(files):
files_ds = tf.data.Dataset.from_tensor_slices(files)
output_ds = files_ds.map(
map_func=get_waveform_and_label, num_parallel_calls=AUTOTUNE
)
output_ds = output_ds.map(
map_func=get_spectrogram_and_label_id, num_parallel_calls=AUTOTUNE
)
return output_ds
train_ds = spectrogram_ds
val_ds = preprocess_dataset(val_files)
test_ds = preprocess_dataset(test_files)
batch_size = 64
train_ds = train_ds.batch(batch_size)
val_ds = val_ds.batch(batch_size)
train_ds = train_ds.cache().prefetch(AUTOTUNE)
val_ds = val_ds.cache().prefetch(AUTOTUNE)
for spectrogram, _ in spectrogram_ds.take(1):
input_shape = spectrogram.shape
print("Input shape:", input_shape)
num_labels = len(commands)
from tensorflow.keras import layers
from tensorflow.keras import models
# Instantiate the `tf.keras.layers.Normalization` layer.
norm_layer = layers.Normalization()
# Fit the state of the layer to the spectrograms
# with `Normalization.adapt`.
norm_layer.adapt(data=spectrogram_ds.map(map_func=lambda spec, label: spec))
model = models.Sequential(
[
layers.Input(shape=input_shape),
# Downsample the input.
layers.Resizing(32, 32),
# Normalize.
norm_layer,
layers.Conv2D(32, 3, activation="relu"),
layers.Conv2D(64, 3, activation="relu"),
layers.MaxPooling2D(),
layers.Dropout(0.25),
layers.Flatten(),
layers.Dense(128, activation="relu"),
layers.Dropout(0.5),
layers.Dense(num_labels),
]
)
model.summary()
model.compile(
optimizer=tf.keras.optimizers.Adam(),
loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
metrics=["accuracy"],
)
EPOCHS = 10
history = model.fit(
train_ds,
validation_data=val_ds,
epochs=EPOCHS,
callbacks=tf.keras.callbacks.EarlyStopping(verbose=1, patience=2),
)
metrics = history.history
plt.plot(history.epoch, metrics["loss"], metrics["val_loss"])
plt.legend(["loss", "val_loss"])
plt.show()
test_audio = []
test_labels = []
for audio, label in test_ds:
test_audio.append(audio.numpy())
test_labels.append(label.numpy())
test_audio = np.array(test_audio)
test_labels = np.array(test_labels)
y_pred = np.argmax(model.predict(test_audio), axis=1)
y_true = test_labels
test_acc = sum(y_pred == y_true) / len(y_true)
print(f"Test set accuracy: {test_acc:.0%}")
import seaborn as sns
confusion_mtx = tf.math.confusion_matrix(y_true, y_pred)
plt.figure(figsize=(10, 8))
sns.heatmap(
confusion_mtx, xticklabels=commands, yticklabels=commands, annot=True, fmt="g"
)
plt.xlabel("Prediction")
plt.ylabel("Label")
plt.show()
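Finally, we can run the trained model on a single audio example. The sketch below (illustrative, reusing the preprocessing pipeline defined above) takes one file from the test split and displays the class scores as probabilities.
# Run inference on one example from the test split and plot the class
# probabilities predicted by the model.
sample_file = test_files[0]
sample_ds = preprocess_dataset([sample_file])
for spectrogram, label in sample_ds.batch(1):
    prediction = model(spectrogram)
    probabilities = tf.nn.softmax(prediction[0]).numpy()
    plt.bar(commands, probabilities)
    plt.title(f"Predictions for '{commands[label[0].numpy()]}'")
    plt.show()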