Sunday 15 March 2015

Backpropagation Part 2/3

This post continues from Part 1/3, and aims to explain in more detail how each weight in a neural network is refined during its training.

Last time we covered the overarching idea that, during training, the network's overall output error is what guides how each internal weight must be tuned to reduce that error.


Simpler Case First: Linear Neuron
Because the maths usually gets in the way of most explanations, let's illustrate the idea using a simpler model of a neuron.

Remember the neuron we described in an earlier post? We now consider an artificially simple node which has a linear activation function. Linear means the output function is of the form ax + b ... that is, (weight × input) + constant, or o = w.i + c to be more concise.


Ignore for now the fact that useful neural networks can't have such simple activation functions.

Imagine we start with an untrained neuron, with

  • weight w=0.8 
  • and for simplicity set the constant c as zero. 
  • Imagine also we have a training data item with input = 1.0 and desired output = -1.0.
Applying this input gives the output as o = w.i + c = (0.8 × 1.0) + 0 = 0.8. Now 0.8 is not the desired -1.0. The error is target - output = -1.0 - 0.8 = -1.8.
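Just to make that concrete, here is the same calculation as a few lines of Python (plain arithmetic, nothing more):

w = 0.8          # untrained weight
c = 0.0          # constant, zero for simplicity
i = 1.0          # training input
target = -1.0    # desired output

o = w * i + c            # actual output: 0.8
error = target - o       # -1.0 - 0.8 = -1.8
print(error)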

So what do we do with this error of -1.8? We said in the last post that this is the value we use to refine the network's weights to improve the result.


Derivatives: How Outputs Depend on Other Factors
To answer this we need to dig a little into the activation function which turns the input into the output and uses the weight to do so. This makes intuitive sense - if we're to tweak the weight, we need to understand the function which uses the weight to produce, hopefully, a better output. This function is of course the activation function o = w.i + c.

Look at the following graphs showing o = w.i for different w. Larger w have larger slopes, and negative w have slopes going in the opposite direction. So w has a direct effect on the output. You can see that w = -0.7 would be a better weight for our training data, because with it the output would be -0.7, not so far from the desired -1.0. Much better than the 0.8 above.
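(If you'd like to draw graphs like these yourself, a few lines of matplotlib will do it. This snippet is just an illustration, not part of the original working.)

import numpy
import matplotlib.pyplot as plt

inputs = numpy.linspace(-2, 2, 100)       # a range of input values
for slope in [0.8, 0.3, -0.7]:            # example weights, including -0.7
    plt.plot(inputs, slope * inputs, label="w = " + str(slope))
plt.xlabel("input i")
plt.ylabel("output o")
plt.legend()
plt.show()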

But we need to be able to express this connection more precisely than handwaving at a graph.



As is all too common in mathematics, you can get an insight into a function by looking at how it depends on a variable. You'll remember this is simply calculus - how does X change as a result of Y changing? The following shows the idea applied to a general activation function.


So let's do this with o = w.i + c and see how it varies with w, because we're interested in tweaking w. The derivative of o with respect to w is Δo/Δw, and here it is simply Δo/Δw = i. This is simple school-level calculus.
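If you want to check that derivative without doing the calculus by hand, the sympy library (an extra here, not something this blog otherwise uses) can confirm it symbolically:

import sympy

w_, i_, c_ = sympy.symbols('w i c')
print(sympy.diff(w_ * i_ + c_, w_))    # prints: i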

Let's rearrange that expression so that we have the output and weight changes on different sides of the equation. That becomes Δo = Δw.i, which simply says that a small change in w multiplied by i gives the small change in o. Or, rearranging slightly, Δw = Δo/i.

This is looking hopeful! The Δw is the change in weight we want to work out. We know i, because for the training data item it was 1.0. So what is Δo? It is the change in o, the output, that we want to result from that change in w. And what should Δo be? It's the error, the difference between the desired and current output. So, using the above training data item, we have error = Δo = -1.8. This deliberately simplistic function and its derivative tell us that Δw should be -1.8 too (since Δw = Δo/i = -1.8/1.0 = -1.8). The weight w was 0.8, but changing it by -1.8 gives the new value -1.0.

Hey presto! The new weight of -1.0 gives us a new, improved output of -1.0 ... perfectly matching the desired target output.
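Continuing the little Python sketch from earlier, the whole update is just:

delta_o = error          # the desired change in output is the error, -1.8
delta_w = delta_o / i    # from delta_w = delta_o / i, gives -1.8
w = w + delta_w          # 0.8 + (-1.8) = -1.0

print(w * i + c)         # new output: -1.0, matching the target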


Step Back: The Process
Ok - phew! That was a long-winded way of tweaking a really simple weight to get a better output ... but the important thing is the process - the process of using the derivative of the activation function to understand how the output depends on the weight. Or, more precisely, how small changes in output Δo relate to small changes in weight Δw. We then use this relationship to work out Δw, the required weight change.

This is the important idea: that we improve the weights by calculating the small change Δw based on the desired change in output, Δo. And the relationship between Δw and Δo is derived using calculus on the activation function.


Moderation
In reality, we don't change the weight by the full value of Δw. Instead we moderate how much of the Δw change we apply, using a factor often called a learning rate. We might choose to apply only half of each Δw we calculate, for example. Why do this?

We moderate the "training", or changes in weight recommended by the calculation of Δw, because the training data won't necessarily perfectly fit a real world model, and trying to aim at it too strongly means risking over-fitting. Over-fitting is when a system is too closely matched to imperfect training data, and as a result performs badly on new data. In short, it doesn't generalise but merely memorises the training data.
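In code, moderation is just a multiplier applied to Δw. The value 0.5 below is an arbitrary example:

learning_rate = 0.5                  # apply only half of each calculated change
w = w + learning_rate * delta_w      # moderated weight update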


Next Time
This time we looked at a deliberately, extremely simple neural network consisting of one neuron with a simple activation function, just to see more clearly how we can refine the weights with each training example.

Next time we'll have to work out how to do this with a more complex network, with hidden layers.

Friday 13 March 2015

Backpropagation Part 1/3

Backpropagation is the core idea behind training neural networks. For some it's an easy algorithm to understand; for others, none of the many, many texts and online explanations seem to make it clear.

We'll try to talk through the idea in simple terms, to get the overall gist, and then later in a second post we'll consider some of the details.

Backpropagation Overview
The main idea is this, and it really is this simple:
  1. We refine (train) a neural network by using example data (training data) where for each example we know the question (input) and the answer (output).
  2. For each example we use the error - the difference between the right answer and what the neural network outputs - to determine how much we refine the internals of the neural network.
  3. We keep doing this for other examples, until we're happy that the neural network gets enough answers right, or close enough. 

In slightly more detail:
  • We start with an untrained network, and a training set of examples to learn from - which consists of inputs and the desired outputs.
  • To train the network we apply the input of each example to the untrained network. This could be the human handwritten characters we're interested in for our project.
  • The untrained network will, of course, produce an output. But because it is untrained, the output is very likely incorrect. 
  • We know what the output should be, because we're using a training example set. We can compare the desired output with the actual but incorrect output - that's the error.
  • Rather than disregard this error, we use it, because it gives us a clue as to how to refine the neural network to improve the output. This refinement is done by adjusting the "weights" in the network, which we saw in an earlier post.
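Putting those bullets together, the overall training loop looks something like this sketch. It uses the single linear neuron and weight update rule that Part 2 works through in detail; the two training examples are made up for illustration:

training_data = [(1.0, -1.0), (0.5, -0.5)]   # made-up (input, target) examples

w = 0.8               # start with an untrained weight
learning_rate = 0.5   # moderate each change (see Part 2)

for i, target in training_data:
    output = w * i                        # apply the input to the network
    error = target - output               # desired output minus actual output
    w = w + learning_rate * (error / i)   # refine the weight using the error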




Look at the following diagram showing an input layer of nodes (A, B), an internal so-called hidden layer (C, E) and an output node (F). The diagram also shows the weights between the nodes, for example WAC for the weight between nodes A and C. Also feeding into C is node B, with weight WBC.

Now the question that took me ages to find the answer to is this: given the error in the output of the network, how do you use this error to update the internal weights? I understood you needed to, of course, but not exactly how. Do you update only one weight? Do you consider the error out of internal nodes to be the same error as that from the entire neural network?


There are probably other ways to refine the network weights, but one way - the backpropagation algorithm - splits the error in proportion to the weights. So if WCF is twice as large as WEF, then it should take twice as much of the error. If the error was, say, 6, then WCF should be updated as if the error from C was 4, and WEF updated as if the error from E was 2 - because 4 is twice 2.
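Here's that split as a few lines of Python. The actual weight values 2.0 and 1.0 are assumptions, chosen only so that WCF is twice WEF:

w_CF = 2.0     # assumed: twice as large as w_EF
w_EF = 1.0
error = 6.0    # error at the output node F

# split the error in proportion to the weights feeding into F
error_C = error * w_CF / (w_CF + w_EF)   # 6 × 2/3 = 4
error_E = error * w_EF / (w_CF + w_EF)   # 6 × 1/3 = 2
print(error_C, error_E)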

We'll look at how we actually update the weights themselves in part 2 of this blog. For now, we just want to understand the overall approach.


The diagrams above and below show the same idea applied again to nodes further back from the final output and error.


Once the error has been propagated from the output layer back through the middle (known as "hidden", silly name) layers, and back to the weights from the input layer - we have completed one training session. Of course training needs to happen with lots of examples, which has the effect of slowly but surely refining the weights of the neural network to a point where it gets better and better at predicting the right output.

Next time we'll look at the detail of using the error to update a weight. That involves some mathematics, but we'll introduce it gently.

Sunday 1 March 2015

The MNIST Dataset of Handwritten Digits

In the machine learning community, common datasets have emerged. Having common datasets is a good way of making sure that different ideas can be tested and compared in a meaningful way - because the data they are tested against is the same.

You may have come across the famous iris flower dataset which is very common for clustering or classification algorithms.

With our neural network, we eventually want it to classify human handwritten numbers. So we'd want to train it on a dataset of handwritten numbers, with labels to tell us what the numbers should be. There is in fact a very popular such dataset called the MNIST dataset. It's a big database, with 60,000 training examples, and 10,000 for testing.

The format of the MNIST database isn't the easiest to work with, so others have created simpler CSV files, such as this one. The CSV files are:



The format of these is easy to understand:

  • The first value is the "label", that is, the actual digit that the handwriting is supposed to represent, such as a "7" or a "9". This is the answer the neural network is aspiring to produce when classifying.
  • The subsequent values, all comma separated, are the pixel values of the handwritten digit. The size of the pixel array is 28 by 28, so there are 784 values after the label.


These files are still big, and it would be nicer to work with much smaller ones whilst we experiment and develop our code for visualising handwriting and the neural networks themselves.

So here are smaller versions of the above CSV files, but with 100 training and 10 test items:
Next we'll try to load these files into Python and see if we can display the handwritten characters.

We can load the data easily from a file as follows:

# read all lines of the small test file into a list of strings
with open("mnist_test_10.csv", 'r') as f:
    a = f.readlines()

We can then split each line at the commas to get the label and the pixel values, from which we build an array. We do need to reshape the linear array of 784 values to 28x28 before we display it.

import numpy
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(15,15))
count = 1
for line in a:
    linebits = line.split(',')
    # the first value is the label, the remaining 784 are the pixel values
    imarray = numpy.asfarray(linebits[1:]).reshape((28,28))
    plt.subplot(5, 5, count)
    plt.subplots_adjust(hspace=0.5)
    count += 1
    plt.title("Label is " + linebits[0])
    plt.imshow(imarray, cmap='Greys', interpolation='None')
    pass

The output in IPython is a series of images. You can check that the label matches the handwritten image:



Cool - we can now import handwritten image data from the MNIST dataset and work with it in Python!

PS Yes, calling the "subplots_adjust" command for each loop iteration isn't efficient, but I didn't have time to work out how to do the subplots properly.



UPDATE: The book is out! - and provides examples of working with the MNIST data set, as well as using your own handwriting to create a test dataset.