### Reasons for Not Using Frameworks

I avoided these frameworks because the main thing I wanted to do was to learn how neural networks actually work. That includes learning about the core concepts and the maths too. By creating our own neural network code, from scratch, we can really start to understand them, and the issues that emerge when trying to apply them to real problems. We don't get that learning and experience if we only learn how to use someone else's library.

### Reasons for Using Frameworks - GPU Acceleration

But there are some good reasons for using such frameworks, after you've learned about how neural networks actually work. One reason is that you want to take advantage of the special hardware in some computers, called a GPU, to accelerate the core calculations done by a neural network. The **GPU** - graphics processing unit - was traditionally used to accelerate calculations to support rich and intricate graphics, but recently that same special hardware has been used to accelerate machine learning.

The normal brain of a computer, the **CPU**, is good at doing all kinds of tasks. But if your tasks are matrix multiplications, and lots of them in parallel, for example, then a GPU can do that kind of work much faster. That's because GPUs have lots and lots of computing cores, and very fast access to locally stored data. Nvidia has a page explaining the advantage, with a fun video too - link. But remember, GPUs are not good for general purpose work, they're just really fast at a few specific kinds of jobs.

The following illustrates a key difference between general purpose CPUs and GPUs with many, more task-specific, compute cores:

GPUs have hundreds of cores, compared to a CPU's 2, 4 or maybe 8.

Writing code to directly take advantage of GPUs is not fun, currently. In fact, it is extremely complex and painful. And very unlike the joy of easy coding with Python.

This is where the neural network frameworks can help - they allow you to imagine a much simpler world - and write code in that world, which is then translated into the complex, detailed, and low-level nuts-n-bolts code that the GPUs need.

There are quite a few neural network frameworks out there .. but comparing them can be confusing. There are a few good comparisons and discussions on the web like this one - link.

### PyTorch

I'm going to use **PyTorch** for three main reasons:

- It's largely vendor independent. TensorFlow has a lot of momentum and interest, but is very much a Google product.
- It's designed to be Python - not an ugly and ill-fitting Python wrapper around something that really isn't Python. Debugging is also massively easier if what you're debugging is Python itself.
- It's simple and light - preferring simplicity in design, working naturally with things like the ubiquitous numpy arrays, and avoiding hiding too much stuff as magic, something I really don't like.

Some more discussion of PyTorch can be found here - link.

### Working With PyTorch

To use PyTorch, we have to understand how it wants to be worked with. This will be a little different to the normal Python and numpy world we're used to. The main ideas are:

- we build up the network architecture using the building blocks provided by PyTorch - these are things like layers of nodes and activation functions.
- we let PyTorch automatically work out how to back propagate the error - it can do this for any of the building blocks it provides, which is really convenient.
- we train the network in the normal way, and measure accuracy as usual, but PyTorch provides functions for doing this.
- to make use of the GPU, we configure a setting and push the neural network weight matrices to the GPU, and work on them there.
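To make that last idea a bit more concrete, here is a minimal sketch of what "pushing to the GPU" can look like. It assumes a reasonably recent PyTorch (the `torch.device` and `.to()` calls arrived in versions later than the one this post was written with), and the layer sizes are arbitrary stand-ins, not our real network. If no GPU is visible, the same code simply stays on the CPU.

```python
import torch

# pick the GPU if PyTorch can see one, otherwise stay on the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# a stand-in for a network: one linear layer with arbitrary sizes
layer = torch.nn.Linear(4, 3, bias=False).to(device)

# the data must live on the same device as the weights
inputs = torch.rand(4, device=device)
outputs = layer(inputs)

print(device, outputs.shape)
```

The point is only that the weights and the data both have to be moved, and after that the code reads the same as it does on the CPU.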

A key part of this is auto differentiation. Let's look at that next.

### Auto Differentiation

A powerful and central part of PyTorch is the ability to create neural networks, chaining together different elements - like activation functions, convolutions, and error functions - and for PyTorch to work out the error gradients for the various parameters we want to improve. That's quite cool if it works!

Let's see it working. Imagine a simple parameter $y$ which depends on another input variable $x$. Imagine that

$$ y = x^2 + 5x + 2 $$

Let's encode this in PyTorch:

    import torch
    from torch.autograd import Variable

    x = Variable(torch.Tensor([2.0]), requires_grad=True)
    y = (x**2) + (5*x) + 2

Let's look at that more slowly. First we import torch, and also **Variable** from torch.autograd, the auto differentiation library. Variable is important because we need to wrap normal Python variables with it, so that PyTorch can do the differentiation. It can't do it with normal Python variables like a = 10, or b = 5*a.

**Variables** include links to where the variables came from - so that if one depends on another, PyTorch can do the correct differentiation.

We then create **x** as a **Variable**. You can see that it is a simple tensor of trivial size, just a single number, 2.0. We also signal that it requires a gradient to be calculated.

A **tensor**? Think of it as just a fancy name for multi-dimensional matrices. A 2-dimensional tensor is a matrix that we're all familiar with, like numpy arrays. A 1-dimensional tensor is like a list. A 0-dimensional one is just a single number. When we create a torch.Tensor([2.0]) we're just creating a tensor that holds a single number.
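A quick way to get a feel for these dimensions is to ask a few tensors how many they have. This is just an illustrative sketch using `torch.tensor()`, the lower-case constructor from later PyTorch versions, and the variable names are made up for the example.

```python
import torch

single = torch.tensor(2.0)                        # 0-dimensional, a bare number
a_list = torch.tensor([2.0])                      # 1-dimensional, with one element
matrix = torch.tensor([[1.0, 2.0], [3.0, 4.0]])   # 2-dimensional, a 2x2 matrix

# dim() reports the number of dimensions, shape reports the size along each
print(single.dim(), a_list.dim(), matrix.dim())
print(a_list.shape, matrix.shape)
```

Strictly speaking, wrapping the number in a list gives a 1-dimensional tensor with a single element, which for our purposes behaves like a single number.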

We then create the next **Variable** called **y**. That looks like a normal Python variable by the way we've created it .. but it isn't, because it is made from **x**, which is a PyTorch **Variable**. Remember, the magic that **Variable** brings is that when we define **y** in terms of **x**, the definition of **y** remembers this, so we can do proper differentiation on it with respect to **x**.

So let's do the differentiation!

    y.backward()

That's it. That's all that is required to ask PyTorch to use what it knows about **y** and all the **Variable**s it depends on to work out how to differentiate it.

Let's see if it did it correctly. Differentiating gives $\frac{dy}{dx} = 2x + 5$, and remember that $x=2$, so we're asking for

$$ \frac{dy}{dx}\Big|_{x=2} = 2(2) + 5 = 9 $$

This is how we ask for that to be done.

    x.grad

Let's see how all that works out:

It works! You can also see how **y** is shown as type **Variable**, not just **x**.

So that's cool. And that's how we define our neural network, using elements that PyTorch provides us, so it can automatically work out error gradients.
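If you're on a later version of PyTorch (0.4 onwards), tensors and Variables were merged, so the same experiment can be written without the wrapper. This is a sketch of that equivalent, not the code from the original notebook:

```python
import torch

# requires_grad=True on the tensor itself replaces the Variable wrapper
x = torch.tensor([2.0], requires_grad=True)
y = (x**2) + (5*x) + 2

# work out dy/dx at x = 2
y.backward()

# the gradient is stored on x
print(x.grad)
```

The gradient that comes back is 2x + 5 evaluated at x = 2, which is 9, matching the sum we did by hand.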

### Let's Describe Our Simple Neural Network

Let's look at some super-simple skeleton code which is a common starting point for many, if not all, PyTorch neural networks.

    import torch
    import torch.nn

    class NeuralNetwork(torch.nn.Module):

        def __init__(self):
            # ....
            pass

        def forward(self, inputs):
            # ....
            return outputs

    net = NeuralNetwork()

**Inheritance**

The neural network class is derived from **torch.nn.Module** which brings with it the machinery of a neural network including the training and querying functions - see here for the documentation.

There is a tiny bit of boilerplate code we have to add to our initialisation function **__init__()** .. and that's calling the initialisation of the class it was derived from. That should be the __init__() belonging to torch.nn.Module. The clean way to do this is to use **super()**:

    def __init__(self):
        # call the base class's initialisation too
        super().__init__()
        pass

We're not finished yet. When we create an object from the NeuralNetwork class, we need to tell it at that time what shape it will be. We're sticking with a simple 3-layer design .. so we need to specify how many nodes there are at the input, hidden and output layers. Just like our pure Python example, we pass this information to the **__init__()** function. We might as well create these layers during the initialisation. Our **__init__()** now looks like this:

    # note: nn here is torch.nn, so this needs "import torch.nn as nn"
    def __init__(self, inodes, hnodes, onodes):
        # call the base class's initialisation too
        super().__init__()

        # define the layers and their sizes, turn off bias
        self.linear_ih = nn.Linear(inodes, hnodes, bias=False)
        self.linear_ho = nn.Linear(hnodes, onodes, bias=False)

        # define activation function
        self.activation = nn.Sigmoid()
        pass

The **nn.Linear()** module is the thing that creates the relationship between one layer and another and combines the network signals in a linear way .. which is what we did in our pure Python code. Because this is PyTorch, that **nn.Linear()** creates a parameter that can be adjusted .. the link weights that we're familiar with. You can read more about **nn.Linear()** here.

We also create the activation function we want to use, in this case the logistic sigmoid function. Note, we're using the one provided by **torch.nn**, not making our own.

Note that we're not using these PyTorch elements yet, we're just defining them because we have the information about the number of input, hidden and output nodes.
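We can peek at what one of these linear elements actually holds. The sketch below uses the 784 and 200 node sizes from our MNIST network purely as an example. With the bias turned off, the only adjustable parameter is the weight matrix, and it is stored with one row per output node.

```python
import torch

# one link layer: 784 input nodes feeding 200 hidden nodes, no bias
linear_ih = torch.nn.Linear(784, 200, bias=False)

# with bias off there is exactly one adjustable parameter, the weight matrix
params = list(linear_ih.parameters())
print(len(params), linear_ih.weight.shape)

# the weights are flagged for gradient calculation automatically
print(linear_ih.weight.requires_grad)
```

That weight matrix plays the same role as the input-to-hidden link weights in our pure Python code.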

**Forward**

Next we need a **forward()** function in our neural network class. Remember that **backward()** is provided automatically, but it can only work if PyTorch knows how we've designed our neural network - how many layers, what those layers are doing with activation functions, what the error function is, etc.

So let's create a simple **forward()** function **which is the description of the network architecture**. Our example will be really simple, just like the one we created with pure Python to learn the MNIST dataset.

    def forward(self, inputs_list):
        # convert list to Variable
        inputs = Variable(inputs_list)

        # combine input layer signals into hidden layer
        hidden_inputs = self.linear_ih(inputs)
        # apply sigmoid activation function
        hidden_outputs = self.activation(hidden_inputs)

        # combine hidden layer signals into output layer
        final_inputs = self.linear_ho(hidden_outputs)
        # apply sigmoid activation function
        final_outputs = self.activation(final_inputs)

        return final_outputs

You can see the first thing we do is convert the list of numbers, a Python list, into a PyTorch **Variable**. We must do this, otherwise PyTorch won't be able to calculate the error gradient later.

The next section is very familiar, the combination of signals at each node, in each layer, followed immediately by the activation function. Here we're using the **nn.Linear()** elements we defined above, and the activation function we defined earlier, using the **torch.nn.Sigmoid()** provided by PyTorch.
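Putting the initialisation and forward pieces together gives something we can actually run. The following is a sketch for later PyTorch versions, where a plain tensor can be fed in directly, so it drops the Variable conversion. The random "image" is a made-up stand-in for a real MNIST record.

```python
import torch
import torch.nn as nn

class NeuralNetwork(nn.Module):

    def __init__(self, inodes, hnodes, onodes):
        # call the base class's initialisation too
        super().__init__()
        # define the layers and their sizes, turn off bias
        self.linear_ih = nn.Linear(inodes, hnodes, bias=False)
        self.linear_ho = nn.Linear(hnodes, onodes, bias=False)
        # define activation function
        self.activation = nn.Sigmoid()

    def forward(self, inputs):
        # input layer to hidden layer, then sigmoid
        hidden_outputs = self.activation(self.linear_ih(inputs))
        # hidden layer to output layer, then sigmoid
        final_outputs = self.activation(self.linear_ho(hidden_outputs))
        return final_outputs

# same shape as our MNIST network: 784 inputs, 200 hidden, 10 outputs
net = NeuralNetwork(784, 200, 10)

# a fake 28x28 image flattened to 784 values between 0 and 1
fake_image = torch.rand(784)
outputs = net(fake_image)
print(outputs.shape)
```

Calling `net(fake_image)` runs our `forward()` for us, and because of the final sigmoid every one of the 10 outputs sits between 0 and 1.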

**Error Function**

Now that we've defined the network, we need to define the error function. This is an important bit of information because it defines how we judge the correctness of the neural network, and that wrong-ness is used to update the internal parameters during training.

There are many error functions that people use, some better for some kinds of problems than others. We'll use the really simple one we developed for the pure Python network, the squared error function. It looks like the following.

    error_function = torch.nn.MSELoss(size_average=False)

We've set the size_average parameter to False to avoid the error function dividing by the size of the output and target vectors, so we get the sum of the squared errors and not their average.
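To check what that setting really does, we can compare the error function against the sum worked out by hand. A caveat: later PyTorch versions deprecate `size_average` in favour of a `reduction` argument, so this sketch uses `reduction='sum'` as the equivalent of `size_average=False`. The output and target numbers are invented for the example.

```python
import torch

# made-up network output and desired target for three output nodes
output = torch.tensor([0.8, 0.1, 0.3])
target = torch.tensor([1.0, 0.0, 0.0])

# summed squared error, the newer spelling of size_average=False
summed = torch.nn.MSELoss(reduction='sum')(output, target)

# the default behaviour averages over the number of elements
averaged = torch.nn.MSELoss(reduction='mean')(output, target)

# the same sum worked out by hand
by_hand = ((output - target) ** 2).sum()

print(summed, averaged, by_hand)
```

The summed version matches the hand calculation, and the averaged version is that same number divided by 3, the size of the vectors.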

### Optimiser

We're almost there. We've just defined the error function, which means we know how far wrong the neural network is during training. We know that PyTorch can calculate the error gradients for each parameter. When we created our simple neural network, we didn't think too much about different ways of improving the parameters based on the error function and error gradients. We simply descended down the gradients a small bit. And that is simple, and powerful.

Actually there are many refined and sophisticated approaches to doing this step. Some are designed to avoid false minimum traps, others designed to converge as quickly as possible, etc. We'll stick to the simple approach we took, and the closest in the PyTorch toolset is stochastic gradient descent:

    optimiser = torch.optim.SGD(net.parameters(), lr=0.1)

We feed this optimiser the adjustable parameters of our neural network, and we also specify the familiar learning rate as **lr**.
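It's worth convincing ourselves that this optimiser really does the same small step down the gradient that we coded by hand. Here is a toy sketch with a made-up two-number parameter `w` and a made-up loss, assuming plain SGD with no momentum or other extras, which is the default.

```python
import torch

# a toy adjustable parameter, standing in for the link weights
w = torch.tensor([1.0, -2.0], requires_grad=True)
optimiser = torch.optim.SGD([w], lr=0.1)

# a toy loss, the sum of squares, whose gradient is 2*w
loss = (w ** 2).sum()
loss.backward()

# what we'd expect from our hand-written rule: new = old - lr * gradient
expected = w.detach() - 0.1 * w.grad

# let the optimiser do the update
optimiser.step()

print(w.detach(), expected)
```

The two printed tensors agree, so in this simple form the optimiser is doing exactly the update we're used to.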

### Finally, Doing the Update

Finally, we can talk about doing the update - that is, updating the neural network parameters in response to the error seen with each training example. Here's how we do that **for each training example**:

- calculate the **output** for a training data example
- use the **error function** to calculate the difference (the **loss**, as people call it)
- **zero gradients** of the optimiser which might be hanging around from a previous iteration
- perform **automatic differentiation** to calculate new gradients
- use the optimiser to **update parameters** based on these new gradients

In code this will look like:

    for inputs, target in training_set:
        output = net(inputs)

        # Compute and print loss
        loss = error_function(output, target)
        # on PyTorch 0.4 and later, use loss.item() instead of loss.data[0]
        print(loss.data[0])

        # Zero gradients, perform a backward pass, and update the weights.
        optimiser.zero_grad()
        loss.backward()
        optimiser.step()

It is a common error not to zero the gradients during each iteration, so keep an eye out for that. I'm not really sure why the default is not to clear them ...
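You can see what goes wrong with a tiny experiment, reusing the earlier $y = x^2 + 5x + 2$ example. This sketch is written for later PyTorch versions, and calls `backward()` twice without clearing anything in between.

```python
import torch

x = torch.tensor([2.0], requires_grad=True)

# first backward pass, the gradient is the expected 9
y = (x**2) + (5*x) + 2
y.backward()
first = x.grad.clone()

# second pass without zeroing, the new gradient is added to the old one
y = (x**2) + (5*x) + 2
y.backward()
second = x.grad.clone()

# zero the gradient, then a third pass gives the correct value again
x.grad.zero_()
y = (x**2) + (5*x) + 2
y.backward()
third = x.grad.clone()

print(first, second, third)
```

The second value comes out as 18, not 9, because PyTorch adds each new gradient to whatever is already stored. That is what `optimiser.zero_grad()` protects us from in the training loop.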

### The Final Code

Now that we have all the elements developed and understood, we can rewrite the pure Python neural network we developed in the course of Make Your Own Neural Network and throughout this blog. You can find the code as a notebook on GitHub:

The only unusual thing I had to work out was that during the evaluation of performance, we keep a scorecard list, and append a 1 to it if the network's answer matches the known correct answer from the test data set. This comparison needs the actual number to be extracted from the PyTorch tensor via numpy, as follows. We couldn't just say label == correct_label.

    if (label.data[0][0] == correct_label):
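For anyone on a later PyTorch, the same scorecard idea can be sketched with `argmax()` and `item()`, which hand back a plain Python number. This is an alternative under that assumption, not the code from the notebook, and the output values below are invented to stand in for a network's answer.

```python
import torch

# made-up outputs from the 10 output nodes, the biggest is at index 3
outputs = torch.tensor([0.02, 0.01, 0.05, 0.91, 0.03,
                        0.01, 0.02, 0.04, 0.01, 0.02])
correct_label = 3

# index of the largest output, pulled out of the tensor as a Python int
label = outputs.argmax().item()

# append 1 for a correct answer, 0 for a wrong one
scorecard = []
scorecard.append(1 if label == correct_label else 0)

print(label, scorecard)
```

Once the label is an ordinary Python integer, the comparison with the known correct label is an ordinary Python comparison.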

The results seem to match our pure python code for performance - no major difference, and we expected that because we've tried to architect the network to be the same.

### Performance Comparison On a Laptop

Let's compare performance between our simple pure Python (with numpy) code and the PyTorch version. As a reminder, here are the details of the architecture and data:

- MNIST training data with 60,000 examples of 28x28 images
- neural network with 3 layers: 784 nodes in input layer, 200 in hidden layer, 10 in output layer
- learning rate of 0.1
- stochastic gradient descent with mean squared error
- 5 training epochs (that is, repeat training data 5 times)
- no batching of training data

The timing was done with the following python notebook magic command in the cell that contains only the code to train the network. The options ensure only one run of the code, and the -c option ensures unix user time is used to account for other tasks taking CPU time on the same machine.

**%%timeit -n1 -r1 -c**

The results from doing this twice on a MacBook Pro 13 (early 2015), which has no GPU for accelerating the tensor calculations, are:

- **home-made simple pure python - 440 seconds, 458 seconds**
- **simple PyTorch version - 841 seconds, 834 seconds**

**Amazing!** Our own home-made code is about 1.9 times faster .. **roughly twice as fast!**

### GPU Accelerated Performance

One of the key reasons we chose to invest time learning a framework like PyTorch is that it makes it easy to take advantage of GPU acceleration. So let's try it.

TODO