Basis of AI Backprop Hypertext Documentation

Copyright (c) 1990-97 by Donald R. Tveter

The Algorithm Menu Window

The items in this menu window deal with parameters that affect the numerical operations of the training process. A number of different variations on the original back-propagation algorithm have been proposed in order to speed up convergence, and some of these variations are included in this program.

Activation Functions

The two choices for activation functions in this version are the standard smooth sigmoid, 1/(1+exp(-x)), where x is the input to a node, and the linear activation function, where the value is simply the input. For classification problems, using the standard smooth sigmoid in both layers normally works best. On the other hand, the sigmoid limits the range of the output unit values to 0 to 1, so if you have a function approximation problem you should make the hidden layer sigmoid and the output layer linear. If you have large output values, say outside the range of about -10 to 10, you should consider scaling them by the mean and standard deviation before submitting them to the program. If the output values take on a very large range, you should consider taking the log of each value (or log(1+t), where t is the target value).
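
As a rough illustration (this is not code from the program, and the function names are only for this sketch), here are the two activation functions and the suggested target preprocessing written out in C:

#include <math.h>

/* The two activation functions (illustrative sketch only). */
double sigmoid(double x) { return 1.0 / (1.0 + exp(-x)); }   /* output limited to (0,1) */
double linear(double x)  { return x; }                       /* output unrestricted */

/* Suggested preprocessing for large target values in function
   approximation problems: scale by the mean and standard deviation,
   or compress a very large range with log(1 + t). */
double standardize(double t, double mean, double stddev)
{
    return (t - mean) / stddev;
}

double log_compress(double t)
{
    return log(1.0 + t);   /* assumes t is non-negative */
}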

Training a network to output real values with the linear activation function can also be unstable; the program may start printing out the string "NaN" (not a number). This problem can be minimized by using either Quickprop or Delta-Bar-Delta.

There is a button that brings up a menu to change the output layer activation function. There is also a button that changes the hidden layer activation function, but there is no advantage to using the linear activation function on the hidden layer, at least in terms of the types of solutions you can represent: a non-linear hidden layer adds a certain magic that you can't get from a linear hidden layer.

The typed commands to make the output layer linear or sigmoid are:

a aol
a aos
These are derived as follows: the first a is for algorithm, the second a is for activation function, o is for output, and l is for linear or s is for sigmoid.

The Derivatives

The correct derivative for the standard activation function is s(1-s), where s is the activation value of a unit; however, when s is near 0 or 1 this term will permit only very small weight changes during the learning process. To counter this problem Scott Fahlman proposed the following term for the output layer:

0.1 + s(1-s)

and it produces faster training almost all the time.

Besides Fahlman's derivative and the original one, there is the differential step size method (a bad name) by Chen and Mars, which drops the derivative term in the formula for the output layer and uses the correct derivative term for all other layers. The name "differential step size" comes from the fact that the hidden layer learning rate should then be much smaller than the output layer learning rate.

A paper by Schiffmann and Joost shows that simply dropping the derivative term in the output layer is appropriate for classification problems since that is what you get if you use the cross entropy error function rather than the squared difference error function.

The typed commands are:

a dc   * use the correct derivative for whatever function
a dd   * use the differential step size derivative (default)
a df   * use Fahlman's derivative in only the output layer
a do   * use the original derivative (same as `c' above)
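
As a rough sketch of the three choices (this is not code from the program; s stands for the activation value of an output unit):

/* The term used in place of the derivative for the output layer
   (illustrative sketch only). */
double deriv_original(double s) { return s * (1.0 - s); }        /* a dc, a do */
double deriv_fahlman(double s)  { return 0.1 + s * (1.0 - s); }  /* a df */
double deriv_dropped(double s)  { (void)s; return 1.0; }         /* a dd */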

Update Methods

The choices are the periodic (batch) method, the "right" continuous (online) method, the "wrong" continuous (online) method, delta-bar-delta and quickprop. There are details on these methods in the D, G and Q menu windows; in most cases Quickprop will give the fastest training. The typed commands to set the update methods are listed below (a sketch contrasting the periodic and continuous methods follows the list):

a uC   * for the "right" continuous update method
a uc   * for the "wrong" continuous update method
a ud   * for the delta-bar-delta method
a up   * for the original periodic update method (default)
a uq   * for the quickprop algorithm
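
To make the distinction between the periodic and continuous methods concrete, here is a C sketch for a single weight; grad() is a hypothetical stand-in for the back-propagated gradient and is not part of the program:

/* Stand-in for the gradient of the error on pattern p with respect to
   weight w; in the real program this value comes from back-propagation. */
double grad(int p, double w) { return 0.0; /* placeholder */ }

/* Periodic (batch) update: accumulate the gradient over every pattern,
   then change the weight once per pass through the training set. */
void periodic_update(double *w, int npatterns, double eta)
{
    double total = 0.0;
    for (int p = 0; p < npatterns; p++)
        total += grad(p, *w);
    *w -= eta * total;
}

/* Continuous (online) update: change the weight after every pattern. */
void continuous_update(double *w, int npatterns, double eta)
{
    for (int p = 0; p < npatterns; p++)
        *w -= eta * grad(p, *w);
}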

Tolerance

The program will stop training when the output values are close enough to the target values. "Close enough" is defined by the t command, as in:

t 0.1
where every output unit for every pattern must be within 0.1 of its target value. Another, looser standard is to simply make the average error smaller than some value, but this program does not implement that. In practice, in classification problems you only care about the right answer getting the largest output value. In the A menu window there is an entry box where you can type in a new tolerance value.
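
As an illustration of this stopping test (not code from the program; the argument names are assumptions), a check that every output unit of every pattern is within the tolerance could look like this:

#include <math.h>

/* Returns 1 when every output unit of every pattern is within tol of
   its target value, which is when training would stop (sketch only). */
int all_within_tolerance(const double *output, const double *target,
                         int npatterns, int nout, double tol)
{
    for (int p = 0; p < npatterns; p++)
        for (int j = 0; j < nout; j++) {
            int i = p * nout + j;
            if (fabs(output[i] - target[i]) >= tol)
                return 0;   /* at least one unit is not close enough */
        }
    return 1;
}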