Sunday 11 January 2015

Why Must Activation Functions Be Non-Linear?

In the previous post we looked at the basics of a neural network, the node, and how it works. For simplicity we used a linear function as the activation function - but for real neural networks we can't do this.

Why not? Let's step through the reasoning step by step.


Q1: Why do we need many nodes, arranged in a mesh?

  • We don't have a neural network of just one node, we have many, often arranged into layers. This is because a single node has limited expressive power - it can't do more than its activation function allows it to. If that activation function is a simple linear y=ax+b then all it will ever do is learn to separate the world into two by a straight line. Even a more complex sigmoid activation function will only ever separate the world by a single line, albeit curved like a sigmoid.
  • So one node is not enough. Perhaps many nodes, working together as a network, can learn to model the world in more sophisticated, complex ways? The answer is yes if you connect them as a network, with layers of nodes. The answer is no if you simply line them up, connected serially one after another. So the arrangement - the topology - of a neural network matters.
  • OK - so we've established that the nodes must be connected as a mesh, or web, to stand a chance of modelling a world more complex than the individual activation functions could allow.

Now, let's go back to our original question.

 
Q2: Why do we need activation functions to be non-linear?

Why isn't a simple linear function like y=ax+b useful?

The reason is that the overall effect of a neural network, composed of nodes with simple linear functions, is itself a simple linear function. You've lost the benefit of having lots of nodes, and the benefit of arranging them in a mesh. You may as well just have a single node, because you don't have any more expressive power than a single node.

How can this shocking conclusion be possible?

If you consider any single node in a mesh of nodes, all it is doing is taking a linear combination of the outputs from other nodes and applying the activation function to arrive at its output. The linear combination is simply the weighted sum we are familiar with. If the activation function is linear too, then the output is a linear function of the outputs of the nodes which feed into it. That is output = linear_function(inputs).
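As a tiny illustration of that idea (the numbers here are just made up for the sketch), here is a single node in Python: a weighted sum of its inputs, followed by a linear activation. The output is still just a linear function of the inputs.

# a single node: weighted sum of the incoming signals, then a linear activation
inputs  = [0.3, 0.7, 0.2]        # outputs arriving from other nodes (made-up values)
weights = [0.9, -0.4, 1.5]       # one weight per incoming connection

weighted_sum = sum(w * i for w, i in zip(weights, inputs))

a, b = 2.0, 0.1                  # a simple linear activation function y = a*x + b
output = a * weighted_sum + b    # still just a linear function of the inputs
print(output)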

Now if we expand our thinking from one node to two nodes, we have the output of the second node being a linear function of the output of the first node, which itself was a linear combination of that first node's inputs. So the overall output is still linear with the inputs.

If we keep going with this thinking, expanding our mental picture to more nodes, even connected as a mesh, we find that the overall output from a neural network is a linear function of the network's inputs. And this can be modelled with a single node.

So a network of nodes is equivalent to a single node if the activation function is linear. And so you can't learn to model anything more complex than a single node with a linear activation function.
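To make this collapse concrete, here is a minimal sketch (assuming numpy, with made-up layer sizes and random weights) showing that two stacked linear layers are exactly equivalent to one:

import numpy as np

# made-up example: 3 inputs -> 4 hidden nodes -> 2 outputs, all with linear activations
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)   # first layer weights and biases
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)   # second layer weights and biases

x = rng.normal(size=3)                                  # some input signals

# push the input through both linear layers
two_layer_output = W2 @ (W1 @ x + b1) + b2

# the same network collapsed into a single linear layer
W, b = W2 @ W1, W2 @ b1 + b2
single_layer_output = W @ x + b

print(np.allclose(two_layer_output, single_layer_output))   # True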

This reduction isn't possible if the activation function is non-linear. Normally we like things to be simple and reducible, but here we don't, because that retained complexity is exactly what's needed to ensure the network can model more complex worlds.

Just to be super-clear .. a function of a function - f(f(x)) - is linear if f is linear. And in general it's not linear if f is not linear. Linear functions are a special case.

To illustrate:
  • If f(x) = ax+b .. then f(f(x)) = a(ax+b) + b = a²x + ab + b .. still linear with respect to the input x.
  • However if f(x) is non-linear, say f(x) = ax² + bx + c, then f(f(x)) = a(ax² + bx + c)² + b(ax² + bx + c) + c, which is of order x⁴ .. definitely not linear.
  • Linear functions are a special case .. a function of a function of a function .. etc .. f(f(f(f(...)))) is linear only if f is linear - a quick symbolic check of this is sketched below.
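A quick way to verify both bullet points is with sympy (my own choice of tool for this sketch, not something the book relies on):

import sympy as sp

x, a, b, c = sp.symbols('x a b c')

# linear f: composing it with itself stays linear (degree 1) in x
f_lin = a*x + b
print(sp.expand(f_lin.subs(x, f_lin)))        # a**2*x + a*b + b

# non-linear (quadratic) f: composing it with itself gives a degree 4 polynomial in x
f_quad = a*x**2 + b*x + c
composed = sp.expand(f_quad.subs(x, f_quad))
print(sp.degree(composed, x))                 # 4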

The sigmoid function commonly used in neural networks, 1/(1+exp(-x)), is similarly non-linear.

Sunday 4 January 2015

The Workings of A Neural Node

It's always worth breaking complex things down into simpler things.

Neural networks are made up of nodes, which all behave in roughly the same way. The following diagram shows a node, and what's going on inside it:


The input is just whatever is incoming to that neural node. It could be input from the outside world, the stimulus the neural network is working on, or the output from another node inside the neural network.

The output is the signal that the node pushes out, after having gone through a few steps, which we'll talk about next. The output could be part of the final set of answers that the whole neural network emits, or it could go to another node as that node's input.

To calculate the output, the node could be really simple and just pass the input through unmodified. That would be ok but it wouldn't be a useful node - its existence wouldn't be needed. So the node needs to do something.

The first thing the node does is apply a weight to the input. This is usually done by simply multiplying the input by a number to reduce or magnify it. This is particularly useful when a node has many inputs (from stimuli or from other nodes) and each is weighted individually to give some more importance than others.

The node takes the input, modified by the weight, and applies an activation function to it before emitting the result as its output. This could be any function, but because neural networks were inspired by biological brains, the function is usually one which tries to mimic how real neurons would work. Real neurons appear not to fire if the input isn't large enough. They only fire once the input is strong enough. A step function, like the Heaviside step function, would seem to be good enough, but because of the maths involved when training a neural network, the smoother sigmoid function is used instead. What's more, the smoother sigmoid function is closer to how a biological system responds, compared to the square, unnatural, robotic step function.
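As a rough sketch (my own illustration, not code from the book), the step and sigmoid activations described above might look like this in Python:

import math

def step(x, threshold=0.0):
    # Heaviside-style step: fire (output 1) only once the input reaches the threshold
    return 1.0 if x >= threshold else 0.0

def sigmoid(x):
    # smooth S-shaped curve: close to 0 for low inputs, close to 1 for high inputs
    return 1.0 / (1.0 + math.exp(-x))

print(step(-2), step(2))          # 0.0 1.0    - an abrupt jump
print(sigmoid(-2), sigmoid(2))    # ~0.12 ~0.88 - a gradual, smooth transition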

The following is a visual representation of a sigmoid function from here. Don't worry too much about the mathematical expression for now. You can see that for low input values (horizontal axis) the response is zero, and for high input values the response is 1, with a gradual smooth natural transition between the two.


It's always much easier to see an example working:
  • Let's imagine we have a really simple activation function f(x) = x² to keep this example clear.
  • Let's imagine we have only one input, and that is 0.5.
  • The weight is set to 0.8
  • So the weighted input is 0.5 * 0.8 = 0.4
  • The activation function f(x) = x² becomes 0.4² = 0.16, and that's the output - the same steps are sketched in Python below.
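Here is the same calculation in a few lines of Python, keeping the deliberately simple f(x) = x² activation from the example:

def activation(x):
    # deliberately simple activation from the example: f(x) = x squared
    return x ** 2

input_signal = 0.5
weight = 0.8

weighted_input = input_signal * weight   # 0.5 * 0.8 = 0.4
output = activation(weighted_input)      # 0.4 squared = 0.16
print(output)                            # 0.16 (up to floating point rounding)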

Friday 2 January 2015

Welcome!

Welcome to Make Your Own Neural Network!

This blog will follow the development of my new ebook on neural networks, discuss interesting ideas and your feedback.

My previous ebook Make Your Own Mandelbrot was a surprising success, and this ebook will follow the same core philosophy: to take a gentle journey through the mathematical concepts needed to understand neural networks, and also to introduce just enough Python to make your own.

The critical thing for me is to be able to explain the ideas in a way that any teenager can understand - because I really believe too many other guides unnecessarily put people off with their terrible explanations and jargon.

You can follow the other blog at http://makeyourownmandelbrot.blogspot.co.uk/ and on twitter @myomandelbrot