
Created Thursday 21 November 2013

Motivation for the Backpropagation Algorithm

Derivation of the Backpropagation Algorithm (Sort of)

The Gradient of the total error `E`

Consider a single-layer network with only a single neuron `(bb w, phi)` and a single training example `(bb x, bb t)`. Since the NN has a single neuron, `bb t` is a one-dimensional vector, i.e., `bb t = t`, a scalar.

Then the total error is `E=E[1]`.

But wait, there's more! Since the NN has only a single neuron, the output is just the value of the neuron's impulse function `f`, so

`\ E = 1/2 ||f(bb x) - t||^2 = 1/2 ||phi(bb w * bb x) - t||^2 = 1/2 (phi(bb w * bb x) - t)^2`

where E is a function of `bb w`.

The gradient of E is

`\ grad E = ((dE)/(dw_0), ..., (dE)/(dw_k))`.

Let's compute a single component `(dE)/(dw_j)`.

`\ (dE)/(dw_j) = (d)/(dw_j) 1/2 [phi(sum_k w_k x_k) - t]^2`

`t` and each `x_k` are constants. Also, each `w_i` with `i != j` is held constant, since we are only varying `w_j`. So the expression is really of the form

`\ (dE)/(dw_j) = (d)/(dw_j) 1/2 (phi(C + w_j x_j) - D)^2`

where `C` and `D` are constants. Applying the chain rule twice (once on the squared term, where the factor of 2 cancels the 1/2, and then on `phi`)

`\ \ \ (dE)/(dw_j)`
`\ \ = (phi(C + w_j x_j) - D) xx (d)/(dw_j)(phi(C + w_j x_j) - D)`
`\ \ = (phi(C + w_j x_j) - D) xx phi'(C + w_j x_j) xx x_j`
`\ \ = [phi(sum_k w_k x_k) - t] xx phi'(sum_k w_k x_k) xx x_j`

Phew! Let's make a simplification. Write `a = sum_k w_k x_k` for the neuron's net input, so that `phi(a)` is just the output `o` of the neuron. Then

`\ \ (dE)/(dw_j) = (o - t) phi'(a) x_j`
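
To make this concrete, here is a small Python sketch (my own illustration, not part of the original notes) that evaluates this gradient for a single neuron. The logistic sigmoid for `phi` and the specific numbers are assumptions; the `j = 0` component is checked against a finite-difference approximation of `E`.

```python
import numpy as np

def phi(a):
    """Assumed activation: logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-a))

def phi_prime(a):
    """Derivative of the sigmoid, evaluated at the net input a."""
    s = phi(a)
    return s * (1.0 - s)

def error(w, x, t):
    """E = 1/2 (phi(w . x) - t)^2 for a single training example."""
    return 0.5 * (phi(w @ x) - t) ** 2

w = np.array([0.5, -0.3, 0.8])   # example weights (made up)
x = np.array([1.0, 2.0, -1.0])   # example input (made up)
t = 0.7                          # example target (made up)

a = w @ x                        # net input
o = phi(a)                       # neuron output
grad = (o - t) * phi_prime(a) * x    # the derived formula, one entry per w_j

# Finite-difference check of the j = 0 component
eps = 1e-6
w_plus = w.copy()
w_plus[0] += eps
numeric = (error(w_plus, x, t) - error(w, x, t)) / eps
print(grad[0], numeric)          # the two values should agree closely
```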

The Backpropagation Algorithm

The gradient points in the direction of steepest ascent, so if we want to reach a minimum, we just travel backwards along the gradient. In other words, add a negative multiple of `(dE)/(dw_j)` to the current weight `w_j` to move towards a minimum of `E`:

`\ \ w_j = w_j - eta (o - t) phi'(a) x_j`

(`eta` is called the learning rate. The factor of `1/2` in `E` was chosen precisely so that the 2 from differentiating the square cancels; any such constant would in any case just be absorbed into `eta`.)
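
As a rough illustration of this update (again assuming a sigmoid `phi` and an arbitrary `eta = 0.1`; none of these values come from the notes), a single gradient-descent step could look like:

```python
import numpy as np

def phi(a):
    return 1.0 / (1.0 + np.exp(-a))          # assumed sigmoid activation

def phi_prime(a):
    s = phi(a)
    return s * (1.0 - s)

def update(w, x, t, eta=0.1):
    """One gradient-descent step on E = 1/2 (phi(w . x) - t)^2."""
    a = w @ x                                 # net input
    o = phi(a)                                # output
    return w - eta * (o - t) * phi_prime(a) * x   # the rule above, applied to every w_j at once

w = np.array([0.5, -0.3, 0.8])
x = np.array([1.0, 2.0, -1.0])
print(update(w, x, t=0.7))
```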

Let's step back for a minute. We're adjusting the weights (which, from `E`'s point of view, are the real inputs) so that the error function `E` ends up at its global minimum (hopefully). The problem is: how can we be sure it's a global minimum? How can we be sure the adjustment doesn't overshoot the minimum? How can we be sure we don't oscillate near the minimum?

Enter the momentum term.

`\ \ w_j = w_j - eta (o - t) phi'(a) x_j + alpha Delta w_j[prev]`

where `alpha` is the momentum parameter and `Delta w_j[prev]` is the weight change that was applied to `w_j` in the previous iteration of the backpropagation algorithm.

The backpropagation algorithm repeats this process until the error E is smaller than a preset value.
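
A hedged sketch of what that training loop might look like, with the momentum term written in terms of the previous weight change and with assumed values for `eta`, `alpha`, the tolerance, and a sigmoid `phi`:

```python
import numpy as np

def phi(a):
    return 1.0 / (1.0 + np.exp(-a))          # assumed sigmoid activation

def phi_prime(a):
    s = phi(a)
    return s * (1.0 - s)

def train(w, x, t, eta=0.5, alpha=0.9, tol=1e-6, max_iters=10000):
    delta_prev = np.zeros_like(w)            # previous weight change; zero on the first step
    for _ in range(max_iters):
        a = w @ x
        o = phi(a)
        if 0.5 * (o - t) ** 2 < tol:         # stop once the error E is small enough
            break
        delta = -eta * (o - t) * phi_prime(a) * x + alpha * delta_prev
        w = w + delta
        delta_prev = delta                   # remembered for the momentum term next time
    return w

w = train(np.array([0.5, -0.3, 0.8]), np.array([1.0, 2.0, -1.0]), t=0.7)
print(w)
```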

More generally, we can apply the same reasoning to an NN with more than one neuron and more than one layer. In the general case, the gradient for a weight in one layer is a function of the gradients in the next layer. Hence the name backpropagation: you must start at the output layer and work backwards to compute the weight updates.
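
To show what "working backwards" means in practice, here is a minimal sketch (my own illustration, not the author's code) of backpropagation through a tiny two-layer network. The layer sizes, sigmoid activations, and training data are assumptions; the point is that the hidden-layer deltas are computed from the output-layer delta.

```python
import numpy as np

def phi(a):
    return 1.0 / (1.0 + np.exp(-a))    # assumed sigmoid activation for both layers

rng = np.random.default_rng(0)
x = np.array([1.0, 2.0, -1.0])     # single training example (made up)
t = 0.7                            # target (made up)
W1 = rng.normal(size=(4, 3))       # hidden layer: 4 neurons, 3 inputs
W2 = rng.normal(size=(1, 4))       # output layer: 1 neuron, 4 hidden inputs
eta = 0.5

for _ in range(1000):
    # Forward pass
    a1 = W1 @ x;  h = phi(a1)              # hidden net inputs and outputs
    a2 = W2 @ h;  o = phi(a2)              # output net input and output

    # Backward pass: output delta first, then propagate it to the hidden layer
    delta2 = (o - t) * o * (1 - o)         # (o - t) * phi'(a2) for the sigmoid
    delta1 = (W2.T @ delta2) * h * (1 - h) # hidden deltas depend on the next layer's deltas

    # Gradient-descent updates (outer products give dE/dW for each weight matrix)
    W2 -= eta * np.outer(delta2, h)
    W1 -= eta * np.outer(delta1, x)

print(phi(W2 @ phi(W1 @ x)), t)    # the network output should end up close to the target
```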

