Sunday 11 January 2015

Why Must Activation Functions be Non-Linear?

In the previous post we looked at the basic building block of a neural network, the node, and how it works. For simplicity we used a linear function as the activation function - but for real neural networks we can't do this.

Why not? Let's work through the reasoning step by step.


Q1: Why do we need many nodes, arranged in a mesh?

  • We don't have a neural network of just one node, we have many, often arranged into layers. This is because a single node has limited expressive power - it can't do more than its activation function allows it to. If that activation function is a simple linear y=ax+b then all it will ever do is learn to separate the world into two by a straight line. Even a more complex sigmoid activation function will only ever separate the world by a single line, albeit one curved like a sigmoid (see the sketch after this list).
  • So one node is not enough. Perhaps many nodes, working together as a network, can learn to model the world in more sophisticated ways? The answer is yes if you connect them as a network, with layers of nodes. The answer is no if you simply line them up, connected serially one after another. So the arrangement - the topology - of a neural network matters.
  • OK - so we've established that the nodes must be connected as a mesh, or web, to stand a chance of modelling a world more complex than the individual activation functions could allow.
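To make the first point concrete, here is a minimal Python sketch of a single node with a sigmoid activation. The weight and bias values are made up purely for illustration - the point is that whatever values they take, the output traces a single S-shaped curve, one dividing line and nothing more.

    import math

    # one node with a single input: weighted sum, then a sigmoid activation
    # (the weight and bias are made-up values, purely for illustration)
    def single_node(x, weight=1.0, bias=0.0):
        return 1.0 / (1.0 + math.exp(-(weight * x + bias)))

    # whatever the weight and bias, the output is one S-shaped curve in x -
    # a single dividing line, nothing more
    for x in (-4, -2, 0, 2, 4):
        print(x, round(single_node(x), 3))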

Now, let's go back to our original question.

 
Q2: Why do we need activation functions to be non-linear?

Why isn't a simple linear function like y=ax+b useful?

The reason is that the overall effect of a neural network, composed of nodes with simple linear functions, is itself a simple linear function. You've lost the benefit of having lots of nodes, and the benefit of arranging them in a mesh. You may as well just have a single node, because you don't have any more expressive power than a single node.

How can this shocking conclusion be possible?

If you consider any single node in a mesh of nodes, all it is doing is taking a linear combination of the outputs from other nodes and applying the activation function to arrive at its output. The linear combination is simply the weighted sum we are familiar with. If the activation function is linear too, then the output is a linear function of the outputs of the nodes which feed into it. That is output = linear_function(inputs).
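Here's a minimal sketch of that in Python - the weights and the constants a and b are arbitrary, chosen just for illustration. Expanding the algebra, the output is another weighted sum of the inputs plus a constant, which is to say a linear function of the inputs.

    # a single node: weighted sum of its inputs, then a linear activation y = a*x + b
    # (the weights, a and b are arbitrary values, chosen only for illustration)
    def node_output(inputs, weights, a=2.0, b=1.0):
        weighted_sum = sum(w * i for w, i in zip(weights, inputs))
        return a * weighted_sum + b

    # expanding the algebra: output = (a*w1)*in1 + (a*w2)*in2 + ... + b
    # so the node's output is still just a linear function of its inputs
    print(node_output([0.5, -1.0, 2.0], [0.3, 0.8, -0.1]))   # prints -0.7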

Now if we expand our thinking from one node to two nodes, we have the output of the second node being a linear function of the output of the first node, which itself was a linear function of that first node's inputs. So the overall output is still linear with the inputs.

If we keep going with this thinking, expanding our mental picture to more nodes, even connected as a mesh, we find that the overall output from a neural network is a linear function of the network's inputs. And this can be modelled with a single node.

So a network of nodes is equivalent to a single node if the activation function is linear. And so you can't learn to model anything more complex than a single node with a linear activation function.
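Here's a small numerical sketch of that collapse, assuming a tiny made-up network with two layers of weights, a purely linear activation f(x) = x, and no biases just to keep it short. The two layers of weights combine into a single equivalent weight matrix, exactly what a single layer could do on its own.

    import numpy as np

    # made-up weights for a tiny network: 3 inputs -> 4 hidden nodes -> 2 outputs,
    # every node using the linear activation f(x) = x (and no biases, for brevity)
    W1 = np.random.rand(4, 3)
    W2 = np.random.rand(2, 4)
    x = np.random.rand(3)

    two_layer_output = W2 @ (W1 @ x)     # signal passed through both layers

    W_single = W2 @ W1                   # both layers collapsed into one weight matrix
    one_layer_output = W_single @ x

    print(np.allclose(two_layer_output, one_layer_output))   # True - same result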

This reduction isn't possible if the activation function is non-linear. Normally we like things to be simple and reducible, but here we don't, because that preserved complexity is exactly what the network needs to model more complex worlds.

Just to be super-clear .. a function of a function - f(f(x)) is linear if f is linear. And it's not linear if f is not linear. Linear functions are a special case.

To illustrate:
  • If f(x) = ax+b .. then f(f(x)) = a(ax+b) + b = a²x + ab + b .. still linear with respect to the input x.
  • However if f(x) is non-linear, say f(x) = ax² + bx + c, then f(f(x)) = a(ax² + bx + c)² + b(ax² + bx + c) + c, which is of order x⁴ .. definitely not linear.
  • Linear functions are a special case .. a function of a function of a function .. etc .. f(f(f(f(...)))) is linear only if f is linear.
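The same point can be checked numerically. This is just a throwaway sketch with arbitrary coefficients, using a simple test that any straight line must pass: for f(x) = ax+b we always have f(x) + f(y) - f(0) = f(x+y).

    # a handy test: any straight line f(x) = ax + b satisfies f(x) + f(y) - f(0) == f(x + y)

    f = lambda x: 3*x + 2                 # linear, a=3, b=2
    ff = lambda x: f(f(x))                # = 9x + 8, still a straight line
    print(ff(1.0) + ff(2.0) - ff(0.0) == ff(3.0))     # True

    g = lambda x: x**2 + 1                # quadratic
    gg = lambda x: g(g(x))                # = (x² + 1)² + 1, order x⁴
    print(gg(1.0) + gg(2.0) - gg(0.0) == gg(3.0))     # False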

The sigmoid function commonly used in neural networks, 1/(1+exp(-x)), is similarly non-linear.
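And a quick numerical check of that, using the same straight-line test as above (again just a sketch):

    import math

    sigmoid = lambda x: 1.0 / (1.0 + math.exp(-x))

    # a straight line would satisfy s(1) + s(2) - s(0) == s(3); the sigmoid does not
    print(sigmoid(1.0) + sigmoid(2.0) - sigmoid(0.0))   # about 1.112
    print(sigmoid(3.0))                                 # about 0.953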