Hands-on Tutorials
Authors: Andrea Yoss and Caroline Harrison
Introduction
Since many of the best models use millions of training instances and take weeks to train on robust computational resources, it is difficult for the everyday deep learning enthusiast to train comparable models from scratch. Fortunately, we can incorporate parts of those models into a completely different, domain-specific model.
By using a pre-trained model, one can effectively transfer the learning from one model to another – a technique known as Transfer Learning – often used for domain adaptation and for improving the accuracy of a model that will be trained on a smaller dataset. There are a few different ways to apply transfer learning, depending on factors such as how similar your data is to the dataset used to train the pre-trained model, the size of your dataset, and your computational resources. For instance, the more similar your dataset is to the one used to train a particular model, the more likely it is that the model’s learned parameters and architecture will work just as well for your dataset.
In this article, we explain how we used transfer learning to build two convolutional neural networks (CNNs) to classify articles of clothing. In particular, we explore a form of transfer learning in which you adapt a pre-trained model to your dataset by keeping certain parameters fixed and retraining others from scratch. In our first model, we train the parameters of an AlexNet model from scratch on our training data. In our second model, instead of retraining the entire model on the Fashion-MNIST dataset, we fine-tune the AlexNet model pre-trained on the ImageNet dataset: we replace and retrain only the parameters of the final fully-connected output layer, while freezing all of the other layers.
Background
Convolutional Neural Networks
A Convolutional Neural Network (CNN) is a type of neural network that specializes in image processing and classification. It takes the pixel values of an image in vector/matrix form as its input, runs them through a sequence of layers, and outputs a classification for the image.
The layers of Convolutional Neural Networks are typically made up of four types:
- A Convolutional Layer captures patterns in the image by running its representative matrix through a set of learnable filters, or kernels, that represent different visual features in the image. These filters slide over the image a specified number of pixels at a time, known as the stride. During these convolutions, the filters each produce their own feature map; the final output of this layer is a transformation of the original image, consisting of all the feature maps stacked on top of each other.
- The Rectified Linear Unit Layer, or ReLU, applies the non-linear activation function f(x) = max(x, 0). Without changing the shape of its input, ReLU maps the elements of the convolutional layer's output into the range [0, ∞) by replacing any negative values with 0.
- A Pooling Layer performs downsampling along the spatial dimensions of the image, reducing the size of the image representation. By reducing the number of features in the CNN, the model increases its computational efficiency while retaining most of the defining features of the images. It also makes the network less likely to overfit.
- In a Fully-Connected Layer, each node is connected to all nodes in the previous layer, in contrast to the local connections a convolutional layer makes. A minimal code sketch of these four layer types appears below.
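To make these layers concrete, here is a minimal sketch of a toy CNN in MXNet Gluon (the library we use later in this article); the layer sizes are illustrative rather than tuned:

```python
from mxnet import nd
from mxnet.gluon import nn

# A toy CNN with the four layer types described above
net = nn.Sequential()
net.add(
    nn.Conv2D(channels=16, kernel_size=3),  # convolutional layer: 16 learnable 3x3 filters
    nn.Activation('relu'),                  # ReLU layer: replaces negative values with 0
    nn.MaxPool2D(pool_size=2, strides=2),   # pooling layer: halves the spatial dimensions
    nn.Flatten(),
    nn.Dense(10)                            # fully-connected layer: one node per class
)
net.initialize()

# A fake batch of one 28x28 grayscale image produces 10 class scores
out = net(nd.zeros((1, 1, 28, 28)))
print(out.shape)  # (1, 10)
```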
Evolution of Pre-Trained Models
Prior to the development of CNNs, image processing mainly consisted of edge detection and other feature-extraction methods using the raw pixel information. Since then, major advances in CNN architectures and computer processing power have greatly improved the accuracy of CNNs for image processing. However, not every model is equal, as there is often a trade-off between a model’s accuracy and the number of operations it requires, as shown in Figure 1 below.

Unlike the typical process of building a machine learning model from scratch, deep learning libraries such as Apache MXNet and PyTorch allow you to import a pre-built CNN architecture that has already been trained on the ImageNet dataset. Used for the annual ImageNet Large Scale Visual Recognition Challenge (ILSVRC), the full ImageNet dataset contains over 15 million images and 22,000 class labels, far larger than a typical training dataset. This alone can produce a fairly accurate classifier when working with images of ordinary objects that the ImageNet dataset may have already seen; in conjunction with transfer learning for domain adaptation, it can dramatically increase the accuracy of your model.
While there are a number of pre-trained models, we chose to focus on AlexNet because it was the first prominent model to incorporate the convolutional layers that define a CNN, which would give us a better understanding of how CNN models are built and operate. We were also limited by our computational capabilities, so training a larger and deeper model such as VGG16 would have required more computational power than we had available. However, using a transfer learning method significantly reduces how computationally expensive it is to build and train a CNN.
AlexNet
AlexNet was developed in 2012 and was a major breakthrough in CNN development. Not only did it win that year’s ImageNet Large Scale Visual Recognition Challenge (ILSVRC), but it achieved roughly half the error rate of its nearest competitor. Its major innovations included training on multiple GPUs, training on augmented versions of the image data, the ReLU activation function, overlapping pooling, and dropout.
The architecture of AlexNet contains roughly 60 million parameters across 8 layers: five convolutional layers and three fully-connected layers. Training was split across two GPUs, and the network was trained on augmented versions (flipped, scaled, noised, etc.) of the images. Additionally, the model used ReLU (Rectified Linear Unit) activation functions rather than tanh (hyperbolic tangent), which was standard at the time; this helped reduce the training time of the network and mitigated the "vanishing gradient" problem. The pooling layers also used a stride (2 pixels in AlexNet) smaller than the pooling window (3×3 pixels), so that neighboring local receptive fields overlapped, which significantly reduced the error of the model.

AlexNet also introduced innovative methods of reducing overfitting. The first was data augmentation: the authors artificially enlarged the dataset using label-preserving transformations, generating image translations and horizontal reflections by extracting random 224×224 patches and training the network on those patches, and altering the intensities of the RGB channels in the training images using PCA. The second method was "dropout", in which randomly selected neurons are temporarily removed during training so that they contribute neither to the feedforward pass nor to backpropagation. This reduces complex co-adaptations of neurons and forces the model to learn more robust features.
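MXNet’s Gluon transforms can reproduce this style of label-preserving augmentation; the sketch below is an illustration of the idea rather than AlexNet’s exact pipeline, and the lighting alpha value is arbitrary:

```python
from mxnet.gluon.data.vision import transforms

# Label-preserving augmentations in the spirit of AlexNet's training pipeline
augment = transforms.Compose([
    transforms.RandomResizedCrop(224),   # random 224x224 patch of the image
    transforms.RandomFlipLeftRight(),    # horizontal reflection with probability 0.5
    transforms.RandomLighting(0.1),      # AlexNet-style PCA-based RGB intensity noise
    transforms.ToTensor()                # HWC uint8 image -> CHW float32 in [0, 1]
])
```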
Modeling
Datasets


- We used the Fashion-MNIST dataset created by Zalando Research, containing 60,000 training and 10,000 test/validation grayscale images, with each image labeled as one of ten types of clothing (such as coat, dress, sneaker, etc.). Sample images for each of the ten classes are displayed in Figure 3 above.
- The second dataset we used was our "In the Wild" dataset, which we scraped from online retailers such as J.Crew, Forever21, Brooks Brothers, and L.L. Bean using BeautifulSoup, following the pattern sketched below. This dataset consists of 3,348 labeled images (Figure 4) and serves as our final test dataset. The distribution of this dataset, as well as the Fashion-MNIST dataset, is summarized in Figure 5 below.
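Each retailer’s page structure differed, but the scraping followed the usual BeautifulSoup pattern; in this sketch the URL and the CSS selector are placeholders, not the selectors we actually used:

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical example: the URL and the "img.product-photo" selector are placeholders
url = "https://www.example-retailer.com/mens/sneakers"
page = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
soup = BeautifulSoup(page.text, "html.parser")

# Download every product image found on the page
for i, img in enumerate(soup.select("img.product-photo")):
    src = img.get("src")
    if src:
        with open(f"sneaker_{i}.jpg", "wb") as f:
            f.write(requests.get(src).content)
```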

Training/Implementation
We imported our training and validation data directly from MXNet’s Gluon API, and then converted our datasets to DataLoaders, which divided our training data into mini-batches of 64 images per batch.
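Concretely, that step looks roughly like the following (a minimal sketch; in practice the images must also be resized, and repeated across three channels, to match the 3×224×224 input AlexNet expects):

```python
from mxnet import gluon
from mxnet.gluon.data.vision import FashionMNIST, transforms

# Download Fashion-MNIST through the Gluon API
train_data = FashionMNIST(train=True)
valid_data = FashionMNIST(train=False)

# Convert each 28x28 uint8 image to a float32 tensor
transformer = transforms.ToTensor()

# Wrap the datasets in DataLoaders yielding mini-batches of 64 images
batch_size = 64
train_loader = gluon.data.DataLoader(train_data.transform_first(transformer),
                                     batch_size=batch_size, shuffle=True)
valid_loader = gluon.data.DataLoader(valid_data.transform_first(transformer),
                                     batch_size=batch_size, shuffle=False)
```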
Next, we imported the AlexNet pre-trained model from MXNet’s library of pre-trained CNNs, which can be found in the Model Zoo. Note that when you import one of these pre-trained models, you have the option of importing just the model architecture (pretrained = False) or both the architecture and the trained parameter values (pretrained = True).
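A sketch of that import, together with the fine-tuning setup described in the introduction (freezing the pre-trained layers and replacing the output layer with a fresh 10-class one):

```python
from mxnet import init
from mxnet.gluon import nn
from mxnet.gluon.model_zoo import vision

# pretrained=True downloads both the architecture and the ImageNet weights;
# pretrained=False would download the architecture alone
net = vision.alexnet(pretrained=True)

# Freeze every layer in the pre-trained feature extractor...
net.features.collect_params().setattr('grad_req', 'null')

# ...and replace the 1000-class ImageNet output layer with a new
# 10-class layer (one per Fashion-MNIST label), trained from scratch
net.output = nn.Dense(10)
net.output.initialize(init.Xavier())
```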
Figure 6 provides details on our model’s layers and 9,354 parameters.

We trained both models for 10 epochs using Adam (adaptive moment estimation) as our optimizer, a constant learning rate of 0.001, and softmax cross-entropy loss as our cost function.
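In Gluon, that training configuration corresponds roughly to the loop below (a simplified sketch that reuses net, train_loader, and batch_size from the earlier snippets; our actual loop also tracked validation accuracy each epoch):

```python
from mxnet import autograd, gluon

loss_fn = gluon.loss.SoftmaxCrossEntropyLoss()
trainer = gluon.Trainer(net.collect_params(), 'adam',
                        {'learning_rate': 0.001})

for epoch in range(10):
    cumulative_loss = 0.0
    for data, label in train_loader:
        with autograd.record():        # record the forward pass
            output = net(data)
            loss = loss_fn(output, label)
        loss.backward()                # backpropagate through the unfrozen layers
        trainer.step(batch_size)       # one Adam update
        cumulative_loss += loss.mean().asscalar()
    print(f"epoch {epoch + 1}, avg loss {cumulative_loss / len(train_loader):.4f}")
```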
Results
Overall, our pre-trained model outperformed our model trained exclusively on the Fashion-MNIST dataset. Figure 7 shows the training, validation, and testing accuracies for both models.

The training and validation accuracies of our MNIST-trained model (Figure 8) started at around 65% and converged at around 75% over the course of 10 epochs, making the most significant improvements in the first 3–4 epochs. The model appears to have been properly fit: the loss was beginning to flatten out by the end of training, indicating it was not underfit, while the curves are generally smooth and the validation accuracy kept pace with the training accuracy, suggesting it was not overfit.

However, from Figure 7, we know that the accuracy on our "wild" dataset was significantly below this range, at 12%, which we discuss in the Analysis section of this article.
The training and validation accuracies of our pre-trained, fine-tuned model were considerably higher than in our original model, which indicates that the transfer learning process was successful with our training and validation datasets.

Looking at the curve progression in Figure 9, it seems there may have been room for additional improvement had we let the model run for several more epochs: the curves in both the accuracy and loss graphs had not flattened out as much as they had in the original model.
The final accuracy of our fine-tuned model on the test dataset, 16.19%, was higher than our original model’s score, as we would expect given the increase in training and validation accuracies, but not high enough for us to feel confident using this model with other "wild" test data in the future. We go into more detail about the final testing accuracies in the Analysis section.
Analysis
The performance of our models may have been affected by the nature of the training data. Even though the Fashion-MNIST dataset is a clean and perfectly labeled dataset of clothing images, its images are small, noise-free, and grayscale, while the images in our "In the Wild" dataset are in color and feature clothing worn by people, with (potentially) multiple articles of clothing per image depending on how it was cropped. While these images are certainly related, the model likely could not learn enough from the features in the training dataset to recognize the "wild" clothing images presented for testing.

An example of how our testing dataset may have presented issues for our model is shown in Figure 10 above, which compares three sneaker images: the first from our MNIST training dataset, and the other two scraped from online retailers. You can clearly see how a model trained on the clean grayscale image may not have been robust enough to accurately predict most of our "wild" dataset.
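For instance, just to feed a scraped product photo to the MNIST-trained model, the image must be collapsed into Fashion-MNIST’s format, discarding its color and most of its detail; a sketch of that preprocessing (the file path is a placeholder):

```python
import mxnet as mx

# Hypothetical path to one of the scraped product photos
img = mx.image.imread("wild/sneaker_0.jpg", flag=0)  # flag=0 loads as grayscale
img = mx.image.imresize(img, 28, 28)                 # shrink to Fashion-MNIST size
batch = img.transpose((2, 0, 1)).astype('float32') / 255.0
batch = batch.expand_dims(axis=0)                    # shape (1, 1, 28, 28), ready for the model
```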
Future Work
To improve the performance of our classifier and generalize it to the "wild" data, we may want to take advantage of additional augmented data, similar to how the AlexNet model was trained on the ImageNet dataset, as well as add noisy "wild" images to the training dataset, which may help the model classify busy and noisy images.
References
Ives, Zachary. "Notable CNN Architectures." CIS 545 Big Data Analytics, 18 Nov. 2020, University of Pennsylvania. Microsoft PowerPoint presentation.
Russakovsky, Olga, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. "ImageNet Large Scale Visual Recognition Challenge." IJCV, 2015.
Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "ImageNet Classification with Deep Convolutional Neural Networks." Communications of the ACM 60, no. 6 (June 2017): 84–90. https://doi.org/10.1145/3065386