Data Mining, by Mehmed Kantardzic
Formalization of the backpropagation algorithm starts with the assumption that an error signal exists at the output of a neuron j at iteration n (i.e., presentation of the nth training sample). This error is defined by
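ej(n) = dj(n) − yj(n)

where dj(n) is the desired response and yj(n) is the actual output of neuron j for the nth training sample.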
We define the instantaneous value of the error energy for neuron j as (1/2) ej²(n). The total error energy for the entire network is obtained by summing instantaneous values over all neurons in the output layer. These are the only “visible” neurons for which the error signal can be calculated directly. We may thus write
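E(n) = (1/2) Σj∈C ej²(n)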
where the set C includes all neurons in the output layer of the network. Let N denote the total number of samples contained in the training set. The average squared error energy is obtained by summing E(n) over all n and then normalizing it with respect to size N, as shown by
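Eav = (1/N) Σn=1…N E(n)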
The average error energy Eav is a function of all the free parameters of the network. For a given training set, Eav represents the cost function as a measure of learning performance. The objective of the learning process is to adjust the free parameters of the network so as to minimize Eav. To do this minimization, the weights are updated on a sample-by-sample basis until one epoch, that is, one complete presentation of the entire training set to the network, has been dealt with.
To obtain the minimization of the function Eav, we have to use two additional relations for node-level processing, which have been explained earlier in this chapter:
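vj(n) = Σi=1…m wji(n) xi(n)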
and
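yj(n) = φ(vj[n])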
where m is the number of inputs for the jth neuron. Also, we use the symbol v as shorthand notation for the previously defined variable net. The backpropagation algorithm applies a correction Δwji(n) to the synaptic weight wji(n), which is proportional to the partial derivative ∂E(n)/∂wji(n). Using the chain rule of differentiation, this partial derivative can be expressed as
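∂E(n)/∂wji(n) = [∂E(n)/∂ej(n)] · [∂ej(n)/∂yj(n)] · [∂yj(n)/∂vj(n)] · [∂vj(n)/∂wji(n)]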
The partial derivative ∂E(n)/∂wji(n) represents a sensitivity factor, determining the direction of search in weight space. Knowing that the following relations
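∂E(n)/∂ej(n) = ej(n)
∂ej(n)/∂yj(n) = −1
∂yj(n)/∂vj(n) = φ′(vj[n])
∂vj(n)/∂wji(n) = xi(n)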
are valid, we can express the partial derivative ∂E(n)/∂wji(n) in the form
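∂E(n)/∂wji(n) = −ej(n) φ′(vj[n]) xi(n)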
The correction Δwji(n) applied to wji(n) is defined by the delta rule
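Δwji(n) = −η [∂E(n)/∂wji(n)]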
where η is the learning-rate parameter of the backpropagation algorithm. The use of the minus sign accounts for gradient descent in weight space, that is, a direction for weight change that reduces the value E(n). The need for the derivative φ′(vj[n]) in the learning process also explains why we prefer continuous, differentiable functions such as the log-sigmoid and hyperbolic tangent as standard activation functions at the node level. Using the notation δj(n) = ej(n) · φ′(vj[n]), where δj(n) is the local gradient, the final equation for wji(n) corrections is
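Δwji(n) = η δj(n) xi(n)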
The local gradient δj(n) points to the required changes in synaptic weights. According to its definition, the local gradient δj(n) for output neuron j is equal to the product of the corresponding error signal ej(n) for that neuron and the derivative φ′(vj[n]) of the associated activation function.
The derivative φ′(vj[n]) can be easily computed for a standard activation function, where differentiability is the only requirement for the function. If the activation function is the sigmoid, given in the form
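yj(n) = 1 / (1 + exp(−vj[n]))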
the first derivative is
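φ′(vj[n]) = yj(n) [1 − yj(n)]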
and a final weight correction is
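Δwji(n) = η ej(n) yj(n) [1 − yj(n)] xi(n)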
The final correction Δwji(n) is proportional to the learning rate η, the error value at this node ej(n), and the corresponding input and output values xi(n) and yj(n). Therefore, the process of computation for a given sample n is relatively simple and straightforward.
If the activation function is a hyperbolic tangent, a similar computation will give the final value for the first derivative φ′(vj[n]):
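φ′(vj[n]) = [1 − yj(n)] [1 + yj(n)] = 1 − yj²(n)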
and
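Δwji(n) = η ej(n) [1 − yj²(n)] xi(n)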
Again, the practical computation of Δwji(n) is very simple because the local-gradient derivatives depend only on the output value of the node yj(n).
In general, we may identify two different cases of computation for Δwji(n), depending on where in the network neuron j is located. In the first case, neuron j is an output node. This case is simple to handle because each output node of the network is supplied with a desired response, making it a straightforward matter to calculate the associated error signal. All previously developed relations are valid for output nodes without any modifications.
In the second case, neuron j is a hidden node. Even though hidden neurons are not directly accessible, they share responsibility for any error made at the output of the network. We may redefine the local gradient δj(n) for a hidden neuron j as the product of the associated derivative φ′(vj[n]) and the weighted sum of the local gradients computed for the neurons in the next layer (hidden or output) that are connected to neuron j
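δj(n) = φ′(vj[n]) Σk∈D δk(n) wkj(n)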
where D denotes the set of all nodes in the next layer that are connected to node j. Going backward, all δk(n) for the nodes in the next layer are known before the computation of the local gradient δj(n) for a given node in a layer closer to the inputs.
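The two cases can be pulled together in a short sketch. The following code is a hypothetical illustration, not taken from the book: it performs one sample-by-sample weight update for a network with a single hidden layer and log-sigmoid activations, with bias terms omitted, and the names backprop_step, W1, W2, x, d, and eta are assumptions made for the example only.

import numpy as np

def sigmoid(v):
    # Log-sigmoid activation: phi(v) = 1 / (1 + exp(-v))
    return 1.0 / (1.0 + np.exp(-v))

def backprop_step(x, d, W1, W2, eta=0.1):
    # One weight update for a single training sample (x, d), assuming a
    # network with one hidden layer, no biases, and sigmoid activations.
    # W1: hidden-layer weights, W2: output-layer weights, eta: learning rate.

    # Forward pass: compute function signals neuron by neuron, layer by layer.
    v1 = W1 @ x              # net inputs of hidden neurons
    y1 = sigmoid(v1)         # outputs of hidden neurons
    v2 = W2 @ y1             # net inputs of output neurons
    y2 = sigmoid(v2)         # outputs of output neurons

    # Backward pass: error signals and local gradients.
    e = d - y2                                   # ej(n) = dj(n) - yj(n)
    delta2 = e * y2 * (1.0 - y2)                 # output nodes: ej * phi'(vj)
    delta1 = (W2.T @ delta2) * y1 * (1.0 - y1)   # hidden nodes: phi'(vj) * sum of deltak * wkj

    # Delta rule: wji <- wji + eta * deltaj * xi
    W2 += eta * np.outer(delta2, y1)
    W1 += eta * np.outer(delta1, x)
    return W1, W2

Calling such a routine once for every sample, and cycling through the training set repeatedly, corresponds to the sample-by-sample updating of weights described earlier.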
Let us analyze once more the application of the backpropagation-learning algorithm, in which two distinct passes of computation are distinguished for each training example. In the first pass, which is referred to as the forward pass, the function signals of the network are computed on a neuron-by-neuron basis, starting with the nodes in the first hidden layer (the input layer has no computational nodes), then the