I am really pleased that this happens - it means people are interested, that they care, and want to share their insights.

A few suggestions had built up over recent weeks - and I've updated the content. This is a bigger update than normal.

### Thanks

Thanks go to Prof A Abu-Hanna, "His Divine Shadow", Andy, Joshua, Luther, ... and many others who provided valuable ideas and fixes for errors, including in the blog comments sections.

### Key Updates

Some of the key updates worth mentioning are:

- An error in the calculus introduction appendix, in the example explaining how to differentiate $s = t^3$. The second line of working out on page 204 shows $\frac{6 t^2 \Delta x + 4 \Delta x^3}{2\Delta x}$, which should be $\frac{6 t^2 \Delta x + 2 \Delta x^3}{2\Delta x}$. That 4 should be a 2 (a sketch of the working follows this list).
- Another error in the calculus appendix section on functions of functions ... showed $(x^2 +x)$ which should have been $(x^3 + x)$.
- Small error on page 65 where $w_{3,1}$ is said to be 0.1 when it should be 0.4.
- Page 99 shows the summarised update expression as $\Delta{w_{jk}} = \alpha \cdot sigmoid(O_k) \cdot (1 - sigmoid(O_k)) \cdot O_j^T$ .. it should have been the much simpler ..
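
For the first item above, here is a sketch of the corrected working for $s = t^3$, using the same $\Delta x$ notation quoted in the errata (my own reconstruction of the relevant step, not the book's full derivation):

$$
\frac{(t+\Delta x)^3 - (t-\Delta x)^3}{2\Delta x}
= \frac{6 t^2 \Delta x + 2 \Delta x^3}{2\Delta x}
= 3 t^2 + \Delta x^2
\longrightarrow 3 t^2 \quad \text{as } \Delta x \to 0
$$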

### Worked Examples Using Output Errors - Updated!

A few readers noticed that the error value used in the example illustrating the weight update process is not realistic. Why? How? Here is an example diagram used in the book.

The output error from the first output layer node (top right) is shown as **1.5**.

Since the output of that node comes from a sigmoid function, it must be between 0 and 1 (and not including 0 or 1). The target values must also be within this range. That means the error .. the difference between the actual and target values .. can't be as large as 1.5. The error can't be bigger than 0.99999... at the very worst. That's why $e_1 = 1.5$ is unrealistic.
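
To make the bound concrete, here is a minimal Python sketch (my own illustration, not code from the book) showing that sigmoid outputs stay strictly inside (0, 1), so the difference between a target and an actual output is always smaller than 1:

```python
import numpy as np

def sigmoid(x):
    # logistic function: output is always strictly between 0 and 1
    return 1.0 / (1.0 + np.exp(-x))

outputs = sigmoid(np.array([-10.0, -2.0, 0.0, 2.0, 10.0]))
print(outputs)  # roughly [0.000045, 0.12, 0.5, 0.88, 0.99995]

# with targets also inside (0, 1), the error |target - output| is below 1
print(np.abs(0.99 - outputs).max())  # about 0.99, never as large as 1.5
```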

The calculations illustrating how we do backpropagation are still ok. The error values were chosen at random ... but it would be better if we had chosen a more realistic error.

The examples in the book have been updated with a new output error of 0.8.

Not sure if this error has been taken into account already. On Page 98, it says that weights are increased when the slope is positive. It should be the other way around.

You are right! I've made the correction to the book.

For other readers .. here is the text:

new_w = old_w - (a * dE/dw)

The updated weight w_jk is the old weight adjusted by the negative of the error slope we just worked out. It’s negative because we want to increase the weight if we have a positive slope, and decrease it if we have a negative slope, as we saw earlier. The symbol alpha, 𝛂, is a factor which moderates the strength of these changes to make sure we don’t overshoot. It’s often called a learning rate.

That should say "we want to decrease the weight if we have a positive slope, and increase it if we have a negative slope". The equation itself is correct.
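
To see the corrected behaviour in action, here is a minimal Python sketch of the update rule above, with made-up numbers (my own illustration, not code from the book):

```python
alpha = 0.1  # learning rate

def update(old_w, slope):
    # new weight = old weight minus (learning rate * error slope)
    return old_w - alpha * slope

print(update(0.5, +2.0))  # positive slope -> weight decreases to 0.3
print(update(0.5, -2.0))  # negative slope -> weight increases to 0.7
```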

Thank you, this is what I have been stuck with!

Thank you for getting in touch ... and helping everyone else too!

I am not sure if it is another book error, but computing the hidden node errors (Oe1 = .8, Oe2 = .5) via matrices produces: Hidden-1 error as 2.1, and Hidden-2 error as 4.4. Calculating the same manually, Excel assisted, produces Hidden-1 error of .42 and Hidden-2 error of .88.

ReplyDeleteHw1,1 = 2, Hw1,2 = 1 Oe1 = .8

Hw2,1 = 3, Hw2,2 = 4 Oe2 = .5

(Hw1,1 * Oe1) + (Hw1,2 * Oe2) = He1 = 2.1

(Hw2,1 * Oe1) + (Hw2,2 * Oe2) = He2 = 4.4

Feedback would be appreciated.

Good question. Pages 80-81 of the book explain this, but here is a summary.

With forward propagation of the signal, the maths calculations can easily be written as a matrix multiplication of the signals and weights.

With the back propagation of the error, this is not quite so convenient. The splitting of the error according to the weights isn't symmetric enough to make the matrix expression simple. But we can simplify it so the error is split in proportion to the weights .. and we lose the normalising factor. This lets us write a simple matrix multiplication .. and the loss of the correct normalising factor doesn't matter so much over many iterations.
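
Using the numbers from the question above, here is a minimal Python sketch (my own illustration, not code from the book) comparing the two versions of the calculation:

```python
import numpy as np

# weights from the comment above: rows are hidden nodes, columns are output nodes
W = np.array([[2.0, 1.0],
              [3.0, 4.0]])
output_errors = np.array([0.8, 0.5])

# simplified version: errors taken as proportional to the weights, written as a
# single matrix multiplication with no normalising factor
simplified = W @ output_errors
print(simplified)  # [2.1 4.4]

# "full" version: each output error is split by that weight's share of all the
# weights feeding the same output node (the normalising factor)
normalised = (W / W.sum(axis=0)) @ output_errors
print(normalised)  # [0.42 0.88]
```

The simplified matrix version just drops the normalising factor, which is why the two sets of numbers differ even though both are computed from the same weights and output errors.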

If you don't have the book - get in touch with me by email and we'll fix it.

Hey, thanks for doing all the work on this book. It has really helped me get some ideas on ways to improve my current research. I did happen to find a small error that I don't believe has been pointed out yet. On page 121, the second to last paragraph, the end of a sentence reads "..which it does by default, trying to [be] helpful". The bracketed [be] is how I believe it is meant to read.

Jack - many thanks! I will update the source content and a future updated text will have this change.

Congratulations on a very well thought-out and written book. On page 99 the equation which expresses delta W (also reproduced in this blog post above) contains a capital E; should it be a small "e"? According to your notation, the error is ek = tk - ok (e.g. the graph on p.93) and E = (tk - ok)^2 is the quadratic error function, as seen e.g. on p.94. The same question goes for the 2nd equation on p.98 (the matrix of "values from next layer").

On p.97 a small "e" is used to express the backpropagated error (not an E).

Hi LA - great question.

I pondered this as it's been a while since I had all the maths in my head .. and I think you're right.

E should be the quadratic error and small "e" should be the error at a node.

I will double check this and update the book for the next updated edition.

Thanks for taking the time to raise this.

Hello! I don't understand why we have this result:

1) dE/dWjk = dE/dOk * dOk/dWjk

2) dE/dWjk = -2(Tk - Ok) * dSigmoid(SUMj(Wjk*Oj))/dWjk

3) The rule for the derivative of the sigmoid is dSigmoid(X)/dX = sigmoid(X)(1 - sigmoid(X))

4) But then we have a new strange term at the end of the equation (I marked this strange term with !!!)

5) dE/dWjk = -2(Tk - Ok) * sigmoid(SUMj(Wjk*Oj)) * (1 - sigmoid(SUMj(Wjk*Oj))) * !!! d(SUMj(Wjk*Oj))/dWjk !!!

In the derivative rule (in this book) we don't need to take another derivative of X, but in our equation we add this extra derivative. Please explain this step.

good question

the reason is the chain rule again

the sigmoid referred to in dSigmoid(SUMj(Wjk*Oj))/dWjk is itself a function of Wjk .. in this case Wjk*Oj .. which is why we need the extra term

it's easier to see it with simpler expressions

dSigmoid(X)/dX = sigmoid(X)(1 - sigmoid(X))

but

dSigmoid(2X)/dX = sigmoid(2X)(1 - sigmoid(2X)) * !!! d(2X)/dX !!!

= sigmoid(2X)(1 - sigmoid(2X)) * !!! 2 !!!

I hope that helps
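
If it helps to see this numerically, here is a minimal Python sketch (my own check, not code from the book) comparing a finite-difference estimate with the chain-rule expression above:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = 0.7   # arbitrary test point
h = 1e-6  # small step for the numerical derivative

# numerical derivative of sigmoid(2X) with respect to X
numeric = (sigmoid(2 * (x + h)) - sigmoid(2 * (x - h))) / (2 * h)

# chain rule: d sigmoid(2X)/dX = sigmoid(2X) * (1 - sigmoid(2X)) * 2
analytic = sigmoid(2 * x) * (1 - sigmoid(2 * x)) * 2

print(numeric, analytic)  # the two values agree to several decimal places
```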

Thank you so much! Your answer is very clear, this equation is just another application of the chain rule!
