Wednesday, 6 July 2016

Error Backpropagation Revisited

A great question from Alex J has prompted a deeper look at how we take the error at the output layer of a neural network and propagate it back into the network.


Reminder: Error Informs Weight Updates

Here's a reminder why we care about the error:
  • In a neural network, it is the link weights that do the learning. They are adjusted again and again in an attempt to better match the training data.
  • This refinement of the weights is informed by the error associated with a node in the network. A small error means we don't need to change the weights much. 
  • The error at the output layer is easy - it's simply the difference between the desired target and actual output of the network.
  • However the error associated with internal hidden layer nodes is not obvious.


What's The Error Inside The Network?

There isn't a mathematically perfect answer to this question.

So we use approaches that make sense intuitively, even if there isn't a mathematically pure and precise derivation for them. These kinds of approaches are called heuristics.

These "rule of thumb" heuristics are fine ... as long as they actually help the network learn!

The following illustrates what we're trying to achieve - use the error at the output layer to work out, somehow, the error inside the network.



Previously, and in the book, we considered three ideas. An extra one is added here:
  • split the error equally amongst the connected nodes, recombine at the hidden node
  • split the error in proportion to the link weights, recombine at the hidden node
  • simply multiply the error by the link weights, recombine at the hidden node
  • the same as above but attempt to normalise by dividing by the number of hidden nodes

Let's look at these in turn, before we try them to see what performance each approach gives us.


1. Split Error Equally Amongst Links

We split the error at each output node, dividing it equally amongst the number of connected incoming links. We then recombine these pieces at each hidden layer node to arrive at an internal error.


Mathematically, and in matrix form, this looks like the following. $N$ is the number of links from hidden layer nodes into an output node - that is, the number of hidden layer nodes.

$$
e_{hidden} =
\begin{pmatrix}
1/N & 1/N & \cdots \\
1/N & 1/N & \cdots  \\
\vdots & \vdots & \ddots \\
\end{pmatrix}
\cdot e_{output}
$$

Remember that a matrix form is really useful because Python's numpy can do the calculations efficiently (quickly) and we can write very concise code.
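Here's a minimal numpy sketch of this equal split, with made-up sizes (3 hidden nodes, 2 output nodes) rather than the actual network code:

```python
import numpy as np

# Made-up sizes: N = 3 hidden nodes feeding 2 output nodes.
N = 3
e_output = np.array([[0.8], [0.5]])  # example errors at the output layer

# Every entry of the splitting matrix is 1/N, so each output error
# is shared equally amongst its N incoming links before the pieces
# are recombined at the hidden nodes.
split = np.full((N, 2), 1.0 / N)
e_hidden = split @ e_output
print(e_hidden)  # every hidden node gets (0.8 + 0.5) / 3
```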


2. Split Error In Proportion To Link Weights

We split the error, not equally, but in proportion to the link weights. The reason for this is that those links with larger weights contributed more to the error at the output layer. That makes intuitive sense - small weights contribute smaller signals to the final output layer, and should be blamed less for the overall error. These proportional bits are recombined again at the hidden layer nodes.


Again, in matrix form, this looks like the following.

$$
e_{hidden} =
\begin{pmatrix}
\frac{w_{11}}{w_{11} + w_{21} + \cdots} & \frac{w_{12}}{w_{12} + w_{22} + \cdots} & \cdots \\
\frac{w_{21}}{w_{11} + w_{21} + \cdots} & \frac{w_{22}}{w_{12} + w_{22} + \cdots} & \cdots \\
\vdots & \vdots & \ddots \\
\end{pmatrix}
\cdot e_{output}
$$

The problem is ... we can't easily write this as a simple combination of matrices we already have, like the weight matrix and the output error matrix. To code this, we'd lose the benefits of numpy being able to accelerate the calculations. Even so, let's try it to see how well it performs.
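A sketch of this proportional split, again with made-up numbers. The normalising division is exactly the step that stops this being a single multiplication by a matrix we already have:

```python
import numpy as np

# Made-up weights: w[i, j] is the link from hidden node j to output
# node i (2 output nodes, 3 hidden nodes).
w = np.array([[0.9, 0.1, 0.4],
              [0.2, 0.8, 0.5]])
e_output = np.array([[0.8], [0.5]])

# Normalise each row so the links into each output node sum to 1,
# then recombine the proportional pieces at the hidden nodes.
proportions = w / w.sum(axis=1, keepdims=True)
e_hidden = proportions.T @ e_output
print(e_hidden)  # the pieces still sum to the total output error
```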


3. Error Simply Multiplied By Link Weights

We don't split the error, but simply multiply it by the link weights. This is much simpler than the previous idea but retains the key intuition that larger weights contribute more to the network's error at the output layer.

You can see from the previous expression that the output errors are multiplied by the weights, and there is also a kind of normalising division. Here we don't have that normalisation.

In matrix form this looks like the following - it is very simple!

$$
e_{hidden} = w^{T} \cdot e_{output}
$$

Let's try it - and if it works, we have a much simpler heuristic, and one that can be accelerated by numpy's ability to do matrix multiplications efficiently.
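A sketch with made-up numbers - the whole heuristic really is one matrix multiplication:

```python
import numpy as np

# Made-up weights: w[i, j] is the link from hidden node j to output
# node i (2 output nodes, 3 hidden nodes).
w = np.array([[0.9, 0.1, 0.4],
              [0.2, 0.8, 0.5]])
e_output = np.array([[0.8], [0.5]])

# No splitting, no normalisation - just one multiplication,
# which numpy can accelerate.
e_hidden = w.T @ e_output
print(e_hidden)
```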


4. Same as Above But "Normalised"

This additional heuristic is the same as the previous very simple one - but with an attempt to apply some kind of normalisation. We want to see if the lack of a normalisation in the simple heuristic has a negative effect on performance. 

The expression is still simple: it's the previous expression divided by the number of hidden nodes $N$.

$$
e_{hidden} = \frac{w^{T}}{N} \cdot e_{output}
$$

You can imagine this goes some way to allaying fears that the previous approach magnifies the error unduly. This fear goes away if you realise the weights can be $<1$ and so can have a shrinking effect, not just a growing effect.
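A sketch with made-up numbers - the only change from the unnormalised heuristic is the division by $N$:

```python
import numpy as np

w = np.array([[0.9, 0.1, 0.4],   # made-up weights: w[i, j] links
              [0.2, 0.8, 0.5]])  # hidden node j to output node i
e_output = np.array([[0.8], [0.5]])

N = w.shape[1]  # number of hidden nodes
e_hidden = (w.T / N) @ e_output  # same as before, scaled down by N
print(e_hidden)
```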


Results!

The above heuristics were coded and compared using the MNIST challenge. We keep the number of hidden nodes at 100, and the learning rate at 0.1. We do vary the number of learning epochs over 1, 2 and 5.

The following shows the results.


We can make some interesting conclusions from these results.


Conclusions

  • Naively splitting the error equally among links doesn't work. At all! The performance of 0.1 or 10% accuracy is what you'd get randomly choosing an answer from a possible 10 answers (the digits 0-9).
  • There is no real difference between the sophisticated error splitting and the much simpler multiplication by the link weights. This is important - it means we can safely use the much simpler method and benefit from accelerated matrix multiplication.
  • Trying to normalise the simple method actually reduces performance, because dividing by $N$ shrinks the weight updates - effectively a lower learning rate. You can see performance recover as you increase the number of learning epochs.

All this explains why we, and others, choose the simpler heuristic. It's simple, it works really well, and it can benefit from technology that accelerates matrix multiplication ... software like Python's numpy, and hardware like GPUs through OpenCL and CUDA.



I'll update the book so readers can benefit from a better explanation of the choice of heuristic. All ebooks can be updated for free by asking Amazon Kindle support.

18 comments:

  1. It would be easy to just multiply wT by the matrix:
    |x 0|
    |0 y|,

    where x = normalizing factor for the left column and y = normalizing factor for the right column. That doesn't seem to add too much to a calculation

  2. On page 81 of your book (maybe somewhere near that page), you have written - if we throw out that kind of normalization factor, the matrix multiplication is easy to spot. And you have given the same matrix before the transpose and after the transpose. And the "before" matrix looks correct itself - can you verify please?

    Replies
    1. I'd love some insight into this as well! The matrix, while being labeled as transposed, seems to be the same. Is this intentional?

  3. Good question!

    The expression you see at the bottom of that page is

    error_hidden = W^T . e

    Why is that a W^T and not just simply W?

    Well .. earlier in the book .... and throughout the book we have W as

    | w_1,1 w_2,1 w_3,1 ... |
    | w_1,2 w_2,2 w_3,2 ... |
    | w_1,3 w_2,3 w_3,3 ... |

    but the expression we arrive at for the error at a hidden layer involves


    | w_1,1 w_1,2 w_1,3 ... |
    | w_2,1 w_2,2 w_2,3 ... |
    | w_3,1 w_3,2 w_3,3 ... |

    which is W^T not W.

    Earlier when we did the calculation for feed-forward signals, the expression was W not W^T ... O = W.I

    The key is to stick to a convention for which way round you make the matrices for weights. I got lost several times when I wasn't careful and consistent about which way around my matrix was. Some textbooks or guides will have it the other way around so be aware when reading different texts and blogs.
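    A tiny numpy check of the shapes, with made-up numbers (not from the book):

```python
import numpy as np

# The book's convention: row i of W holds the weights into node i.
W = np.array([[0.1, 0.2, 0.3],
              [0.4, 0.5, 0.6]])      # 3 nodes feeding 2 nodes

I = np.array([[0.9], [0.1], [0.8]])  # input signals, column vector
O = W @ I                            # feed-forward uses W

e = np.array([[0.5], [0.2]])         # made-up output errors
e_hidden = W.T @ e                   # back-propagation uses W^T

print(O.shape, e_hidden.shape)       # shapes (2, 1) and (3, 1)
```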

    Let me know if this doesn't help and we'll have another go!

    Replies
    1. .. and please do get back to me if this still doesn't make sense .. I really want to get to the bottom of any issues or errors!

  4. Hi!

    One thing I don't fully understand is why we use gradient descent to adjust our weights. Why don't we use the proportionate backpropagated error directly to change the associated weight?

    Regards/Per

    Replies
    1. hi - the reason for using gradient descent is that the relationship between the inputs to a node (incl the weights) and the output of that node is not simple (linear) .. it is by design non-linear.

      if the relationship was simple and linear - you could use the error directly .. just like we do earlier in the book with the linear predictors .. where we use a trivial shuffling of the algebra to use the error directly to update the weight.

      also - the gradient descent means we "smooth" any errors in the example data and also avoid over-fitting ...

    2. if I've misunderstood let me know ... how did you want to use the error?

    3. Thanks for your answer! :)

      I'm really bad at math so I'm sorry if I've misunderstood the explanation of the gradient descent but
      what I guess it all boils down to lowering or increasing the weights right?

      When I look at the final expression for updating the weight (-error * sigmoid(sumofinput)*(1-sigmoid(sumofinput)) * ot) it looks like "error" is the only variable that drives the direction to change the weight in (+/-) (since sigmoid(x) will always be in the range 0-1) and "ot" is an output that's also in the range 0-1.

      As I said math is really not my strong suit and I've been scratching my head for days trying to figure out why gradient descent is needed and why we can't use the backpropagated error to change the weights.

      I guess the overfitting and smoothing might not be applicable if done this way?

    4. hi - I'm glad you persisted with this .. it is a good question!

      The main reason for gradient descent is to be able to find out where a minimum for a complicated function is. That is - if it was easy, like y=x^2 .. we wouldn't necessarily need gradient descent. However, the expression relating a network's outputs to the internal weights is complicated ... have a look at slide 60 of tutorial slides at https://goo.gl/JKsb62 .. that's a horrible expression .. which we can't easily analyse in a pure algebraic way .. so we need approximate methods like gradient descent.

      The expression you refer to ... also shown on slide 65 ... is the expression for the gradient.

      So back to your question. Why can we use an expression for gradient descent which looks manageable .. but not a direct solution?

      Because .. luckily ... the expression for the gradient of the error was much easier to handle than the full expression for error.

      Why can't we simply substitute the local node errors and get an expression for the correct weight? Because following a gradient only takes you in the direction of the desired minimum .. it doesn't always take you to the exact spot.

      On this last point - the book uses examples in the appendix showing gradients for simple curves like y=x^2 ... the gradient at a point on that curve will point you in the direction of the minimum but doesn't point to the exact spot.

      The second main reason is, as you say, to reduce over-fitting and smoothing ...

      Does this help? If not do get back to me again!

    5. Hi again and thanks for replying. I really appreciate your help! :)

      I understand that directly trying to find a solution is extremely hard because of the complexity of the net. What I'm struggling with is some of the components of the final expression for updating the weight. You might've already answered this and I just haven't gotten it through my thick skull :P

      If we split the below expression into its components:
      (-error * sigmoid(i)*(1-sigmoid(i)) * ot)


      1: -error <- Can be -/+

      2: sigmoid(i) <- will always be 0-1

      3: (1-sigmoid(i)) <- will always be 0-1

      4: ot <- will always be 0-1

      So "-error" is the only component that drives whether or not to increase or decrease the weight. The rest of the components will "only" weight "-error".

      Wouldn't it be enough to use -error * learning rate to update the weights?

      Regards/Per

    6. Just to be clear. With -error I mean the proportionate error of each weight like those discussed in this blog post.

    7. aha - yes the weights are changed in proportion to the error.

      so yes - you could just try to have (-error * learning_rate) ..

      so what's missing from this? what's missing is the other factors which scale that adjustment ..

      i haven't done extensive experiments but I suspect that the outcome could be anywhere between (1) slower convergence, and (2) failure to converge.

      i suspect that factor O(1-O)Ot is important in providing a scale that's *relative* to other nodes which may have the same error but need different weight adjustments
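      here's a tiny numpy sketch of that relative scaling (made-up numbers) - two nodes with identical errors end up with very different weight adjustments:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Made-up values: two nodes with identical errors but different
# summed inputs, so the sigmoid slope factor O(1-O) differs.
e = np.array([0.5, 0.5])       # same error at both nodes
x = np.array([0.1, 4.0])       # summed inputs to each node
o_prev = np.array([0.9, 0.9])  # signal arriving on the incoming link

grad = e * sigmoid(x) * (1 - sigmoid(x)) * o_prev
print(grad)  # same error, very different weight adjustments
```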

      .... i like your question .. it uncovers things most guides gloss over...

      are you convinced? if you do experiments i'd love to share them here and via twitter

    8. Hi again!

      Indeed. The weight scaling might be important to the speed of the convergence. I'll have to run some tests and see what kind of impact only using -error as a heuristic would cause.

      Thanks so much for your help. I'll be sure to post my results.
      And thanks again for your great book. It's been invaluable for me to really understand backpropagation.

      Regards/Per

    9. I can inform you that it didn't work :)
      I was running my NN on a XOR training set and what it did was it converged all outputs towards 0.5.
      One guess is that it moves the resulting outputs towards the mean value of the expected outputs.

      Regards/Per

    10. thanks for sharing your experiment results!

      I too had many questions early on about backprop .. it was sooo badly explained in many texts .. and some of my own tests showed a simpler idea didn't work ..

      interestingly I spent ages trying to work out why people used a squared error (t-o)^2 and not just the (t-o) .. there's a blog post on this which talks about some really revealing insight for me anyway!

    11. I couldn't agree more. A lot of texts just follow previous examples and never go into the detail about why things are done a certain way.

      Yes! Exactly!! The (t-o)^2 thing had me very confused until I read your book. It was another one of those "Why didn't they just tell me that in the first place"-moments :P
