Large convolutional network models achieve impressive performance, but there is no clear understanding of why they perform so well or how they might be improved. This paper introduces a visualization technique that reveals the input stimuli that excite individual feature maps at any layer of the model. The method lets us observe how features evolve during training and helps to diagnose problems with the model.
(1) To examine a given activation, all other activations in the layer are set to zero and the feature maps are mapped back to input pixel space by successively unpooling, rectifying and filtering (using the same parameters as the convnet).
(2) Unpooling: The max pooling operation is non-invertible, but we can obtain an approximate inverse by recording the locations of the maxima within each pooling region in a set of switch variables. In the deconvnet, the unpooling operation uses these switches to place the reconstructions from the layer above into appropriate locations, preserving the structure of the stimulus.
(3) Rectification: The convnet uses ReLU non-linearities, which rectify the feature maps and thus ensure that they are always non-negative. To obtain valid feature reconstructions at each layer (which should also be non-negative), the deconvnet passes the reconstructed signal through a ReLU non-linearity.
(4) Filtering: The convnet uses learned filters to convolve the feature maps from the previous layer. The deconvnet uses transposed versions of the same filters, but applies them to the rectified maps, not to the output of the layer beneath.
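The four steps above can be sketched for a single-channel layer with one filter. This is only a minimal NumPy illustration under stated assumptions, not the paper's implementation: the function names are hypothetical, pooling is non-overlapping 2x2, and the "transposed filter" is implemented as the adjoint of a valid cross-correlation.

```python
import numpy as np

def correlate_valid(x, f):
    """Forward filtering: valid cross-correlation of image x with filter f."""
    fh, fw = f.shape
    oh, ow = x.shape[0] - fh + 1, x.shape[1] - fw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(x[i:i + fh, j:j + fw] * f)
    return out

def correlate_transposed(y, f):
    """Transposed filtering: scatter each value of y back through f."""
    fh, fw = f.shape
    out = np.zeros((y.shape[0] + fh - 1, y.shape[1] + fw - 1))
    for i in range(y.shape[0]):
        for j in range(y.shape[1]):
            out[i:i + fh, j:j + fw] += y[i, j] * f
    return out

def max_pool_with_switches(x, k=2):
    """k x k max pooling that also records where each maximum came from."""
    h, w = x.shape
    # Rearrange so the last axis holds the k*k values of one pooling region.
    regions = x.reshape(h // k, k, w // k, k).transpose(0, 2, 1, 3)
    regions = regions.reshape(h // k, w // k, k * k)
    return regions.max(axis=-1), regions.argmax(axis=-1)

def unpool(pooled, switches, k=2):
    """Approximate inverse of pooling: put each value back at its switch."""
    ph, pw = pooled.shape
    regions = np.zeros((ph, pw, k * k), dtype=pooled.dtype)
    rows, cols = np.indices((ph, pw))
    regions[rows, cols, switches] = pooled
    regions = regions.reshape(ph, pw, k, k).transpose(0, 2, 1, 3)
    return regions.reshape(ph * k, pw * k)

def forward(x, f):
    """Convnet layer: filter, rectify, pool (keeping the switches)."""
    rectified = np.maximum(correlate_valid(x, f), 0)
    return max_pool_with_switches(rectified)

def project_to_pixels(pooled, switches, f, keep):
    """Deconvnet pass for the single activation at index `keep`."""
    isolated = np.zeros_like(pooled)
    isolated[keep] = pooled[keep]           # zero all other activations
    unpooled = unpool(isolated, switches)   # (2) unpooling
    rectified = np.maximum(unpooled, 0)     # (3) rectification
    return correlate_transposed(rectified, f)  # (4) filtering

# Toy usage: a random 5x5 "image" and a random 2x2 filter (both made up).
rng = np.random.default_rng(0)
image = rng.standard_normal((5, 5))
filt = rng.standard_normal((2, 2))
pooled, switches = forward(image, filt)
reconstruction = project_to_pixels(pooled, switches, filt, keep=(0, 0))
```

The reconstruction has the same shape as the input and is non-zero only in the patch that fed the chosen activation, which is what makes the projection readable as "the stimulus that excited this feature". A real deconvnet repeats this pass layer by layer and sums over feature maps.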
The paper’s model outperformed the architecture of Krizhevsky et al. (2012), beating their single-model result by 1.7%. The other reported test results are also strong.
2. Related Work
(1) Erhan, D., Bengio, Y., Courville, A., and Vincent, P. Visualizing higher-layer features of a deep network. In Technical report, University of Montreal, 2009.
(2) Le, Q. V., Ngiam, J., Chen, Z., Chia, D., Koh, P., and Ng, A. Y. Tiled convolutional neural networks. In NIPS, 2010.
(3) Donahue, J., Jia, Y., Vinyals, O., Hoffman, J., Zhang, N., Tzeng, E., and Darrell, T. DeCAF: A deep convolutional activation feature for generic visual recognition. In arXiv:1310.1531, 2013.
3. Strong Points
(1) The paper provides a technique that reveals the input stimuli that excite individual feature maps at any layer. We can observe how features evolve during training. It shows that the features are far from random, uninterpretable patterns.
(2) The visualizations show that the learned features have many intuitively desirable properties, such as compositionality, increasing invariance and class discrimination.
(3) The method can help to debug problems in a model. For example, the paper used the visualizations to refine the architecture of Krizhevsky et al. (2012), and the resulting model beat their single-model result by 1.7%.
(4) The paper’s analysis shows that small transformations have a large effect on the lower layers and a lesser effect on the higher layers. The model is fairly stable to translation and scaling, but less so to rotation.
4. Weak Points
(1) When the paper increased the size of the middle convolutional layers of the ImageNet model while also enlarging the fully connected layers, the model overfit.
(2) On the PASCAL 2012 dataset, the paper’s result was 3.2% lower than the leading result. The paper suggests this is because PASCAL images can contain multiple objects, whereas the model provides only a single prediction per image.
(3) The intuition behind the choice of smaller filters was not convincing to me.
I think this paper has had many follow-ups, and many papers cite it. For example, Hinton’s paper “Dynamic Routing Between Capsules” includes a reconstruction module that visualizes the image and also improves the results, an idea similar to this paper’s.
6. New Ideas
(1) Add a deconvnet to deeper convolutional neural networks to see whether it leads to better performance.
(2) Build a visualization module for other kinds of neural networks, which may help to debug problems in those models.