SELU — Make FNNs Great Again (SNN)

Elior Cohen
July 21, 2017

Last month I came across a recent article (published June 22nd, 2017) presenting a new concept called self-normalizing neural networks (SNNs).
In this post I will review what’s different about them and show some comparisons.
Link to the article: Klambauer et al.
The code for this post is taken from bioinf-jku’s GitHub repository.

The Idea

Before we get into what SNNs are, let’s talk about the motivation for creating them.
Right in the abstract of the article the authors make a good point: while neural networks are succeeding in many domains, the main stage seems to belong to convolutional networks and recurrent networks (LSTM, GRU), while feed-forward neural networks (FNNs) are left behind in the beginner tutorial sections.
They also note that the FNNs that did manage to get winning results on Kaggle were at most 4 layers deep.

When using very deep architectures, networks become prone to gradient issues, which is exactly why batch normalization became standard. This is where the authors locate the weak link of FNNs: their sensitivity to normalization during training.
SNNs are a way to avoid external normalization techniques (like batch norm); instead, the normalization occurs inside the activation function.
To make it clear: instead of normalizing the output of the activation function, the suggested activation function (SELU, the scaled exponential linear unit) outputs normalized values.
For SNNs to work, they need two things: a custom weight initialization method and the SELU activation function.

Meet SELU

Before we explain it, let’s take a look at what it’s all about.

Figure 1 The scaled exponential linear unit, taken from the article
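
In case the figure does not render, here is the definition as I read it from the article, written out in LaTeX (α and λ are the two constants discussed below):

```latex
\mathrm{selu}(x) = \lambda
\begin{cases}
x & \text{if } x > 0 \\
\alpha e^{x} - \alpha & \text{if } x \le 0
\end{cases}
```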

SELU is a kind of ELU, but with a little twist.
α and λ are two fixed parameters, meaning they are not learned during training and they are not hyperparameters to make decisions about.
α and λ are derived from the distribution the inputs are assumed to follow: they are chosen so that the desired mean and variance are a fixed point of the layer mapping. I will not go into this, but you can see the math for yourself in the article (which has a 93-page appendix :O, for math).
For standardized inputs (mean 0, stddev 1), the values are α=1.6732~, λ=1.0507~.
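
To make the definition concrete, here is a minimal NumPy sketch of SELU using the constants above with more digits. The function name `selu` and the constant names are my own choices for illustration, not taken from the GitHub code.

```python
import numpy as np

# Constants for standardized inputs (mean 0, stddev 1), with more digits
# than the rounded values quoted in the text.
ALPHA = 1.6732632423543772
LAMBDA = 1.0507009873554805


def selu(x):
    """Scaled exponential linear unit, applied element-wise."""
    x = np.asarray(x, dtype=np.float64)
    # Clamp to <= 0 before exp so the unused positive branch cannot overflow.
    negative_branch = ALPHA * (np.exp(np.minimum(x, 0.0)) - 1.0)
    return LAMBDA * np.where(x > 0, x, negative_branch)
```

For positive inputs the function is just a line with slope λ, and for very negative inputs it saturates near -λα, which is roughly -1.758.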
Let’s plot it and see what it looks like for these values.


Figure 2 SELU plotted for α=1.6732~, λ=1.0507~

It looks pretty similar to leaky ReLU, but wait until you see its magic.


Weight Initialization

SELU can’t make it work alone, so a custom weight initialization technique is used.
SNNs initialize weights with zero mean and a standard deviation equal to the square root of 1/(size of input).
In code this looks as follows (taken from the GitHub repository mentioned in the opening):

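import numpy as np
import tensorflow as tf  # tf.random_normal is the TensorFlow 1.x API

# Standard (fully connected) layer: stddev = sqrt(1 / size of input)
tf.Variable(tf.random_normal([n_input, n_hidden_1], stddev=np.sqrt(1.0 / n_input)))
# Convolution layer: 5x5 kernel with 1 input channel, so size of input = 25
tf.Variable(tf.random_normal([5, 5, 1, 32], stddev=np.sqrt(1.0 / 25)))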
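
To see why the activation and the initialization belong together, here is a small NumPy-only sketch (not from the GitHub code) that pushes standardized random inputs through a stack of untrained layers and records the mean and standard deviation of the activations after each one. The function `propagate`, the layer width, the depth and the batch size are all arbitrary choices of mine for illustration.

```python
import numpy as np

ALPHA = 1.6732632423543772
LAMBDA = 1.0507009873554805


def selu(x):
    # Clamp to <= 0 before exp so the unused positive branch cannot overflow.
    return LAMBDA * np.where(x > 0, x, ALPHA * (np.exp(np.minimum(x, 0.0)) - 1.0))


def propagate(n_layers=16, width=512, batch=1024, seed=0):
    """Forward random inputs through untrained SELU layers, collecting stats."""
    rng = np.random.default_rng(seed)
    h = rng.standard_normal((batch, width))  # standardized inputs
    stats = []
    for _ in range(n_layers):
        # Zero-mean weights with stddev = sqrt(1 / size of input)
        w = rng.standard_normal((width, width)) * np.sqrt(1.0 / width)
        h = selu(h @ w)
        stats.append((h.mean(), h.std()))
    return stats


stats = propagate()
final_mean, final_std = stats[-1]
print("after last layer: mean=%.3f std=%.3f" % (final_mean, final_std))
```

On a run like this the mean should stay close to 0 and the standard deviation close to 1 all the way down the stack, with no batch norm involved. This only looks at the forward pass at initialization; it says nothing about what happens during training.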

So now that we understand the initialization and activation methods, let’s put them to work.

Performance

Let’s examine how SNNs, using the specified initialization and the SELU activation function, do on the MNIST and CIFAR-10 datasets.
First, let’s see whether the outputs really stay normalized, using TensorBoard on a 2-layer SNN trained on MNIST (both hidden layers have 784 nodes).
I plotted the activation function outputs of layer 1 and the weights of layer 2.
The plotting of layer1_act is not present in the GitHub code; I added it for the sake of this histogram.


Figure 3 SELU’s output after the first layer in the MLP from the GitHub repository

Figure 4 Weights of the second layer.

This lives up to expectations: both the activations of the first layer and the resulting weights of the second layer have an almost perfectly zero mean (I got 0.000201 on my run).
Trust me, the histogram of the first layer’s weights looks pretty much the same.

More importantly, SNNs seem to perform better, as you can see from the plots taken from the mentioned GitHub repository, which compare 3 convolutional networks with identical architectures that differ only in their activation function and initialization: SELU vs ELU vs ReLU.

It seems like SELU converges faster and reaches better accuracy on the test set.

Notice that using SELU plus the mentioned initialization we got improved accuracy and faster convergence on a CNN, so don’t hesitate to try it on architectures that are not pure FNNs, as it seems able to boost performance there as well.

Conclusion and Further Reading

It does seem like SNNs can find their place in the world of neural networks, maybe squeezing out a bit of extra accuracy in a bit less time, but we’ll have to wait and see what results they yield by themselves and, more importantly, when incorporated into joint architectures (like the conv-SNN above).
Maybe we’ll meet them in competition-winning architectures, who knows.

There are some things in the article that I didn’t go over here, like the proposed “alpha dropout”, a dropout technique that fits the SNN concept and is also implemented in the mentioned GitHub repository, so you’re more than welcome to look into it.
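
For the curious, here is a rough NumPy sketch of alpha dropout as I understand it from the article. Dropped units are set to SELU’s negative saturation value -λα instead of 0, and the result is then rescaled and shifted so that standardized inputs keep mean 0 and variance 1. The function name and signature are my own, and the GitHub implementation may differ in its details.

```python
import numpy as np

ALPHA = 1.6732632423543772
LAMBDA = 1.0507009873554805
ALPHA_PRIME = -LAMBDA * ALPHA  # value SELU saturates to for very negative inputs


def alpha_dropout(x, rate, rng):
    """Drop units with probability `rate`, keeping mean 0 / variance 1.

    Assumes x is already roughly standardized (mean 0, variance 1).
    """
    keep_prob = 1.0 - rate
    keep = rng.random(x.shape) < keep_prob
    # Dropped units take the saturation value instead of zero.
    dropped = np.where(keep, x, ALPHA_PRIME)
    # Affine correction: a restores unit variance, b restores zero mean.
    a = (keep_prob + ALPHA_PRIME ** 2 * keep_prob * rate) ** -0.5
    b = -a * ALPHA_PRIME * rate
    return a * dropped + b


rng = np.random.default_rng(0)
x = rng.standard_normal(200_000)
y = alpha_dropout(x, rate=0.1, rng=rng)
print("mean=%.3f std=%.3f" % (y.mean(), y.std()))
```

Like regular dropout, this would only be applied during training.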

Whether SNNs are a thing or not I really can’t tell, but they are another tool to add to your kit. I hope you enjoyed reading this and learnt something new :)