### 1. Introduction

Until recently, the dominant approaches to object recognition relied on machine learning, largely because the available datasets of labeled images were relatively small. Machine learning does well on small datasets; for example, the best error rate on the MNIST digit-recognition task approaches human performance. Later, the shortcomings of small datasets were recognized, and people collected large image datasets such as ImageNet, which contains millions of labeled high-resolution images.

Thus, we need a model with large learning capacity, and convolutional neural networks (CNNs) provide that capacity through variable depth and breadth. Moreover, current GPUs, paired with highly optimized implementations of 2D convolution, are powerful enough to train large CNN models. This paper presents a large convolutional neural network trained on subsets of ImageNet, which achieved the best results in the ILSVRC-2010 and ILSVRC-2012 competitions.

### 2. Related Work

Convolutional neural networks originate from LeNet, and the CNN in this paper is much larger and deeper. ReLU comes from another paper by Hinton's group, and this nonlinearity enables faster learning in this CNN.

### 3. Architecture

The convolutional neural network in this paper contains eight learned layers: five convolutional and three fully-connected. The ReLU nonlinearity is applied in the model to reduce training time. Multi-GPU training, local response normalization, and overlapping pooling are used in the network for better performance. Because of the network's size, overfitting is a significant problem; the paper counters it with two data augmentation methods (generating image translations and altering the intensities of the RGB channels) and the "dropout" technique.

### 4. Strong Points

(1) ReLU Nonlinearity

This paper replaces the standard neuron output function tanh (or sigmoid) with Rectified Linear Units (ReLUs). When training with gradient descent, saturating nonlinearities are slower than the non-saturating nonlinearity f(x) = max(0, x). A four-layer convolutional neural network with ReLUs reaches a 25% training error rate on CIFAR-10 six times faster than an equivalent network with tanh neurons.
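As a rough illustration (my own NumPy sketch, not code from the paper), the gradient of tanh vanishes for large inputs while ReLU's gradient stays at exactly 1 wherever the unit is active:

```python
import numpy as np

def relu(x):
    # Non-saturating nonlinearity: f(x) = max(0, x)
    return np.maximum(0.0, x)

x = np.array([-3.0, -0.5, 0.0, 2.0, 10.0])

# tanh saturates: its gradient 1 - tanh(x)^2 shrinks toward 0 for large |x|
tanh_grad = 1.0 - np.tanh(x) ** 2

# ReLU's gradient is exactly 1 wherever the unit is active
relu_grad = (x > 0).astype(float)

print(relu(x))    # [ 0.  0.  0.  2. 10.]
print(relu_grad)  # [0. 0. 0. 1. 1.]
```

The constant unit gradient on the active side is what makes the error signal propagate without shrinking, which matches the faster training the paper reports.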

(2) Overlapping Pooling

Pooling layers summarize the outputs of neighboring groups of neurons in the same kernel map. Traditionally, pooling is non-overlapping; with overlapping pooling, the top-1 and top-5 error rates are reduced by 0.4% and 0.3%, respectively.
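A toy 1-D example (my own sketch; AlexNet actually pools 2-D feature maps with window z = 3 and stride s = 2) shows how windows of size z > s overlap, so each activation can contribute to more than one pooled output:

```python
def max_pool_1d(x, size, stride):
    # Slide a window of `size` with step `stride`, taking the max in each window.
    out = []
    i = 0
    while i + size <= len(x):
        out.append(max(x[i:i + size]))
        i += stride
    return out

x = [1, 5, 2, 8, 3, 9, 4, 7]

# Traditional non-overlapping pooling: stride == size (z = s = 2)
non_overlap = max_pool_1d(x, size=2, stride=2)

# AlexNet-style overlapping pooling: stride < size (z = 3, s = 2)
overlap = max_pool_1d(x, size=3, stride=2)

print(non_overlap)  # [5, 8, 9, 7]
print(overlap)      # [5, 8, 9]
```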

(3) Data Augmentation

The most common method to reduce overfitting on image data is to artificially enlarge the dataset using image transformations.

The first form of data augmentation generates image translations and horizontal reflections. The paper extracts random 224 × 224 patches (and their horizontal reflections) from the 256 × 256 images and trains on these extracted patches. At test time, the network makes a prediction by extracting ten 224 × 224 patches (the four corner patches and the center patch, together with their horizontal reflections) and averaging the softmax predictions over the ten patches.
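The train-time and test-time schemes can be sketched as follows (a hypothetical NumPy implementation; the function names are mine, and real inputs would be decoded images rather than random arrays):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_crop_and_flip(image, crop=224):
    # Training: pick a random crop offset within the 256x256 image.
    h, w = image.shape[:2]
    top = rng.integers(0, h - crop + 1)
    left = rng.integers(0, w - crop + 1)
    patch = image[top:top + crop, left:left + crop]
    # Reflect horizontally with probability 0.5.
    if rng.random() < 0.5:
        patch = patch[:, ::-1]
    return patch

def ten_crops(image, crop=224):
    # Testing: four corner patches plus the center patch...
    h, w = image.shape[:2]
    offsets = [(0, 0), (0, w - crop), (h - crop, 0),
               (h - crop, w - crop), ((h - crop) // 2, (w - crop) // 2)]
    patches = [image[t:t + crop, l:l + crop] for t, l in offsets]
    # ...together with the horizontal reflections of all five.
    patches += [p[:, ::-1] for p in patches]
    return patches

image = rng.random((256, 256, 3))
patch = random_crop_and_flip(image)
print(patch.shape)        # (224, 224, 3)
print(len(ten_crops(image)))  # 10
```

The network's final prediction would then average the softmax outputs over the ten test patches.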

The second form of data augmentation alters the intensities of the RGB channels in training images. The paper performs PCA on the set of RGB pixel values over the training set and adds multiples of the principal components to each image. This scheme reduces the top-1 error rate by over 1%.
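A sketch of this scheme, assuming toy stand-in pixel data: PCA is run on the 3 × 3 covariance of RGB values, and each image receives a per-pixel shift of Σ α_i λ_i p_i, with α drawn from N(0, 0.1) as in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in "training set" pixels: N pixels x 3 RGB values (toy data).
pixels = rng.random((10000, 3))

# PCA over RGB values: eigenvectors/eigenvalues of the 3x3 covariance matrix.
cov = np.cov(pixels, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)

def pca_color_shift(image, eigvals, eigvecs, sigma=0.1):
    # Draw one alpha per principal component (redrawn each time the
    # image is used for training).
    alphas = rng.normal(0.0, sigma, size=3)
    # Shift = sum_i alpha_i * lambda_i * p_i, added to every pixel.
    shift = eigvecs @ (alphas * eigvals)
    return image + shift

image = rng.random((256, 256, 3))
augmented = pca_color_shift(image, eigvals, eigvecs)
print(augmented.shape)  # (256, 256, 3)
```

Because the shift is the same for every pixel of one image, it changes the overall color cast while preserving object identity.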

(4) Dropout

Combining the predictions of many models is a way to reduce test error, but it is expensive for large neural networks. A very efficient form of model combination is "dropout", which costs only about a factor of two during training. The method zeroes the output of each hidden neuron with probability 0.5. Dropped-out neurons contribute to neither the forward pass nor back-propagation. Each pass therefore samples a different architecture, and all of these architectures share weights. The paper uses dropout in the first two fully-connected layers; without dropout, the network overfits substantially. Dropout roughly doubles the number of iterations required to converge.
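The forward pass of this scheme can be sketched as follows (my own NumPy version; the paper applies it to the fully-connected layers, and at test time multiplies outputs by 0.5 to match the training-time expectation):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_forward(activations, p=0.5, train=True):
    if not train:
        # Test time: use all neurons, but scale outputs by p so the
        # expected activation matches training.
        return activations * p
    # Training: zero each hidden neuron's output with probability p.
    mask = rng.random(activations.shape) >= p
    return activations * mask

h = np.ones((4, 8))
train_out = dropout_forward(h, p=0.5, train=True)
test_out = dropout_forward(h, p=0.5, train=False)
print(test_out[0, 0])  # 0.5
```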

### 5. Weak Points

(1) ReLU Disadvantage

I have read articles arguing that ReLU tends to blow up activations, since there is no mechanism to constrain the neuron's output (the activation "a" itself is the output). It may therefore be unsuitable for some models. Moreover, a ReLU closes the gate during back-propagation if and only if it closed the gate during forward propagation, so a dead ReLU is likely to stay dead.

(2) GPU Training

In this paper, the authors trained the convolutional neural network on two GTX 580 GPUs, because the 3 GB of memory on a single GTX 580 limits the maximum size of network that can be trained on it; they therefore had to spread the network across two GPUs. The test results show that the two-GPU network reduces the top-1 and top-5 error rates by 1.7% and 1.2% compared with the one-GPU network. This comparison is biased in favor of the one-GPU net, since it is bigger than "half the size" of the two-GPU net. Also, in Figure 3, the 48 kernels on GPU 1 are color-agnostic while the 48 kernels on GPU 2 are color-specific, and the authors do not offer any explanation for this.

(3) Normal Distribution Disadvantage

Initializing the weights of the neural network from a normal distribution cannot effectively solve the problem of vanishing gradients.
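A small NumPy experiment (my own, using a toy deep tanh network rather than the paper's architecture) illustrates the point: a fixed small-standard-deviation Gaussian init shrinks activations layer by layer, while a fan-in-scaled (Xavier-style) init keeps their magnitude roughly stable:

```python
import numpy as np

rng = np.random.default_rng(0)

def forward_std(init_std, layers=20, width=256):
    # Push random inputs through `layers` tanh layers and report the
    # standard deviation of the final activations.
    x = rng.normal(size=(64, width))
    for _ in range(layers):
        w = rng.normal(0.0, init_std, size=(width, width))
        x = np.tanh(x @ w)
    return x.std()

# Fixed small-std Gaussian init: activations collapse toward zero,
# so the gradients that flow back through them vanish as well.
small = forward_std(0.01)

# Xavier/Glorot-style scaling (std = 1/sqrt(fan_in)) keeps activation
# magnitudes roughly stable across depth.
scaled = forward_std(1.0 / np.sqrt(256))

print(small < 1e-6)  # True at this depth
print(scaled)
```

This is why later initialization schemes tie the weight scale to the layer's fan-in (or fan-out) rather than using one fixed standard deviation.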

(4) Local Response Normalization not Useful Enough

Many types of normalization layers have been proposed for use in CNN architectures, sometimes with the intentions of implementing inhibition schemes observed in the biological brain. However, these layers have recently fallen out of favor because in practice their contribution has been shown to be minimal, if any.
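For reference, the paper's response-normalization formula, b^i = a^i / (k + α Σ_j (a^j)^2)^β with the sum taken over n adjacent kernel maps (k = 2, n = 5, α = 1e-4, β = 0.75), can be sketched in NumPy as follows (my own loop implementation, not optimized):

```python
import numpy as np

def local_response_norm(a, k=2.0, n=5, alpha=1e-4, beta=0.75):
    # a has shape (channels, height, width); each activation is divided by
    # a sum of squares over up to `n` adjacent channels at the same position.
    c = a.shape[0]
    b = np.empty_like(a)
    for i in range(c):
        lo = max(0, i - n // 2)
        hi = min(c, i + n // 2 + 1)
        denom = (k + alpha * np.sum(a[lo:hi] ** 2, axis=0)) ** beta
        b[i] = a[i] / denom
    return b

rng = np.random.default_rng(0)
a = rng.random((8, 4, 4))
b = local_response_norm(a)
print(b.shape)  # (8, 4, 4)
```

Since the denominator is always at least k^β > 1, the layer only ever damps activations, which is consistent with its small measured contribution.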

### 6. New Idea and Future Work

Later convolutional neural networks such as VGG, GoogLeNet, and ResNet have become deeper and deeper, and their kernel sizes are designed to be smaller. Since deeper CNNs are known to perform better, one direction is simply to make the network deeper. Besides, as noted above, initializing weights from a normal distribution cannot effectively solve the vanishing-gradient problem; a better initialization scheme applied to CNNs might give better performance. Finally, a ReLU closes the gate during back-propagation if and only if it closed the gate during forward propagation; a new ReLU variant may be able to solve this problem.
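One such variant is Leaky ReLU, which keeps a small negative slope so the gate never fully closes and a "dead" unit can still receive gradient; a minimal sketch:

```python
import numpy as np

def leaky_relu(x, slope=0.01):
    # A small slope on the negative side keeps a nonzero gradient for
    # x < 0, so the unit can recover instead of staying dead.
    return np.where(x > 0, x, slope * x)

def leaky_relu_grad(x, slope=0.01):
    # Gradient is 1 on the positive side and `slope` (not 0) elsewhere.
    return np.where(x > 0, 1.0, slope)

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(leaky_relu(x))
print(leaky_relu_grad(x))
```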
