1. Introduction

This paper proposes a novel neural network architecture, the capsule network (CapsNet), which combines a group of neurons into a capsule and dynamically connects capsules across layers according to the agreement between their outputs.

The paper is inspired by the human visual system, which resembles a hierarchical tree structure: neurons pass information only to the part of the brain best able to handle it. The shortcomings of CNNs are a further motivation. Traditional CNNs use multiple layers of filters to achieve translation invariance, but their max-pooling layers throw away a lot of information that is necessary to describe an object, such as positions and orientations.

The capsule network uses “activity vectors” to learn the pose of an object, such as its position and orientation, whereas a CNN throws these low-level details away in its max-pooling layers and learns mainly shapes and colors. Iterative dynamic routing also offers an alternative way of computing how strongly a capsule is activated, based on the properties of local features.

The paper uses the “squashing” function as the non-linear activation, because it operates on whole vectors rather than on individual scalars: short vectors are shrunk to almost zero length and long vectors to a length slightly below 1, so the length of a capsule’s output can be read as a probability. The iterative dynamic routing process then determines the coupling coefficients that weight the prediction vectors.
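As a minimal NumPy sketch (the formula is the paper’s; the example vector is illustrative), the squashing non-linearity can be written as:

    import numpy as np

    def squash(s, axis=-1, eps=1e-8):
        # v = (||s||^2 / (1 + ||s||^2)) * (s / ||s||): short vectors shrink
        # toward length 0, long vectors saturate just below length 1.
        norm_sq = np.sum(s ** 2, axis=axis, keepdims=True)
        norm = np.sqrt(norm_sq + eps)
        return (norm_sq / (1.0 + norm_sq)) * (s / norm)

    # Example: an 8-D capsule output of length 5 squashes to length 25/26.
    s = np.array([3.0, 0.0, 0.0, 0.0, 4.0, 0.0, 0.0, 0.0])
    print(np.linalg.norm(squash(s)))  # ~0.9615

Because the squashed length always lies in [0, 1), it can be read directly as the probability that the capsule’s entity is present.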

CapsNet consists of three layers plus an additional reconstruction network trained with a reconstruction loss. The first layer is ReLU Conv1, a convolutional layer with 9x9 kernels, 256 output channels, stride 1, no padding, and ReLU activation. The second layer is PrimaryCapsules, a convolutional capsule layer with 9x9 kernels and stride 2 (no padding) that outputs 32x6x6 capsules of dimension 8. The final layer is DigitCaps: each output capsule vj (j from 1 to 10) is computed by routing over the prediction vectors Wij ui, where each Wij is a 16x8 matrix and i runs from 1 to 32x6x6. The reconstruction network consists of FC1 (fully connected with ReLU, size 512), FC2 (fully connected with ReLU, size 1024), and a final fully connected sigmoid layer of size 784 (a 28x28 image).
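To make the DigitCaps computation concrete, here is a minimal NumPy sketch of routing-by-agreement between PrimaryCapsules and DigitCaps with the shapes given above (32x6x6 = 1152 input capsules of dimension 8, 10 output capsules of dimension 16). The random W and u are placeholders for trained weights and real activations, not values from the paper:

    import numpy as np

    def squash(s, axis=-1, eps=1e-8):
        n2 = np.sum(s ** 2, axis=axis, keepdims=True)
        return (n2 / (1.0 + n2)) * (s / np.sqrt(n2 + eps))

    def routing(u_hat, num_iters=3):
        # u_hat: prediction vectors u_hat[i, j] = Wij @ ui,
        # shape (num_in, num_out, out_dim).
        num_in, num_out, _ = u_hat.shape
        b = np.zeros((num_in, num_out))               # routing logits start at 0
        for _ in range(num_iters):
            e = np.exp(b - b.max(axis=1, keepdims=True))
            c = e / e.sum(axis=1, keepdims=True)      # coupling coefficients (softmax over j)
            s = np.einsum('ij,ijk->jk', c, u_hat)     # weighted sum of predictions
            v = squash(s)                             # squash each output capsule
            b = b + np.einsum('ijk,jk->ij', u_hat, v) # agreement raises the logits
        return v

    u = np.random.randn(1152, 8)                 # 32x6x6 primary capsule outputs
    W = np.random.randn(1152, 10, 16, 8) * 0.01  # one 16x8 matrix Wij per (i, j) pair
    u_hat = np.einsum('ijab,ib->ija', W, u)
    v = routing(u_hat)                           # (10, 16); vector lengths score the 10 digits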

The proposed capsule network architecture was trained on both MNIST and MultiMNIST, a dataset of images of overlapping digits. In both experiments it achieved better performance than CNN models with a similar number of parameters.

2. Related Work

(1) “Learning to parse images” by Geoffrey E Hinton et al.

(2) “Transforming auto-encoders” by Geoffrey E Hinton et al.

(3) “Adam: A method for stochastic optimization” by Kingma and Ba.

(4) “A parallel computation that assigns canonical object-based frames of reference” by Geoffrey E Hinton.

(5) “Crowding is unlike ordinary masking: Distinguishing feature integration from detection” by Denis G Pelli et al.

(6) “Spatial transformer networks” by Max Jaderberg et al.

(7) “A neurobiological model of visual attention and invariant pattern recognition based on dynamic routing of information” by Bruno A Olshausen et al.

3. Strong Points

(1) The paper uses capsules in place of ordinary neural layers. A capsule takes vectors as input and produces a vector as output, and the length of the output vector represents the probability that the entity the capsule represents is present in the current input.

(2) The squashing function provides the capsule non-linearity: unlike an ordinary activation function, which is applied to a scalar, squashing is applied to an entire vector.

(3) A margin loss is applied instead of cross entropy, which lets the network detect multiple digits at once (see the sketch after this list).

(4) CapsNet uses only three layers yet reaches a low test error of 0.25%, a figure previously achieved only by deeper networks.

(5) CapsNet is moderately robust to affine transformations: it achieves 79% accuracy on the affNIST test set (each example is an MNIST digit with a small random affine transformation), 13 percentage points better than the traditional CNN baseline.

(6) Adding the reconstruction regularizer boosts routing performance by enforcing the pose encoding in the capsule vectors.
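As a concrete reference for point (3), here is a NumPy sketch of the margin loss with the constants the paper uses (m+ = 0.9, m- = 0.1, lambda = 0.5); the example lengths and targets are made up:

    import numpy as np

    def margin_loss(v_len, targets, m_pos=0.9, m_neg=0.1, lam=0.5):
        # L_k = T_k * max(0, m+ - ||v_k||)^2
        #       + lam * (1 - T_k) * max(0, ||v_k|| - m-)^2
        pos = targets * np.maximum(0.0, m_pos - v_len) ** 2
        neg = lam * (1.0 - targets) * np.maximum(0.0, v_len - m_neg) ** 2
        return (pos + neg).sum(axis=1).mean()

    # Multi-hot targets are what let the loss handle overlapping digits:
    v_len   = np.array([[0.95, 0.80, 0.05, 0.02, 0.10, 0.03, 0.04, 0.02, 0.01, 0.05]])
    targets = np.array([[1.0,  1.0,  0.0,  0.0,  0.0,  0.0,  0.0,  0.0,  0.0,  0.0]])
    print(margin_loss(v_len, targets))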

4. Weak Points

(1) The paper would have been more complete if it had discussed the limitations of the CapsNet design.

(2) The paper could have explained why more capsule layers were not added to improve performance.

(3) The paper does not mention how many runs were performed on each dataset. The results would have been more convincing had they been supported by a significant number of repeated experiments rather than a single run.

5. Follow-ups

Hinton and collaborators later published a follow-up paper, “Matrix Capsules with EM Routing”, which introduces new ideas such as representing a capsule with a matrix and a new EM-based routing algorithm.

6. New Ideas

(1) Remove the convolutional layers in CapsNet and use capsules in their place.

(2) Improve the routing algorithm, since the original one is too simple for larger neural networks.

(3) Use routing-by-agreement to improve clustering in unsupervised learning.
