CS 441/541 F19: Neural Nets

Perceptron

Recall the Perceptron
```
h = sum[i] w[i] x[i] + w0
y = h > 0
```
The Perceptron is trained using an error term c - y, where c is the target output and a is the learning rate
```
w[i] += a (c - y) x[i]
w0 += a (c - y)
```
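The forward rule and the update rule above can be sketched together in a few lines of Python. This is a minimal sketch, not the course's reference code: the names `predict` and `train`, the learning rate `a=1.0`, the epoch count, and the AND training set are all assumptions chosen for illustration.

```python
def predict(w, w0, x):
    """Perceptron output: y = (sum_i w[i] x[i] + w0) > 0, returned as 0 or 1."""
    h = sum(wi * xi for wi, xi in zip(w, x)) + w0
    return 1 if h > 0 else 0

def train(samples, a=1.0, epochs=20):
    """Apply w[i] += a (c - y) x[i] and w0 += a (c - y) to each sample in turn."""
    n = len(samples[0][0])
    w, w0 = [0.0] * n, 0.0
    for _ in range(epochs):
        for x, c in samples:
            err = c - predict(w, w0, x)          # the error term c - y
            w = [wi + a * err * xi for wi, xi in zip(w, x)]
            w0 += a * err
    return w, w0

# Hypothetical training set: logical AND, which is linearly separable.
AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, w0 = train(AND)
print([predict(w, w0, x) for x, _ in AND])
```

When y already equals c the error term is zero and the weights do not move, so training settles once every sample is classified correctly.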

Note: The inner nonlinearity is critical. Without it, a two-layer net computes a weighted sum of weighted sums, and simple algebra gives

  (w21 w111 + w22 w121) x1
  + (w21 w112 + w22 w122) x2
  + (w21 w10 + w22 w20 + w0) > 0

which has the form

  a1 x1 + a2 x2 + a0 > 0

and we're back where we started: a single Perceptron.
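The algebra can be checked numerically. The sketch below assumes two hidden units feeding one output with no squashing in between, using the same weight names as the expression above; the integer weight values are arbitrary and chosen only for illustration.

```python
def two_layer_linear(x1, x2, p):
    """Two hidden units feeding one output, with no nonlinearity in between."""
    h1 = p["w111"] * x1 + p["w112"] * x2 + p["w10"]
    h2 = p["w121"] * x1 + p["w122"] * x2 + p["w20"]
    return p["w21"] * h1 + p["w22"] * h2 + p["w0"]

def collapse(p):
    """Fold both layers into the single-layer coefficients (a1, a2, a0)."""
    a1 = p["w21"] * p["w111"] + p["w22"] * p["w121"]
    a2 = p["w21"] * p["w112"] + p["w22"] * p["w122"]
    a0 = p["w21"] * p["w10"] + p["w22"] * p["w20"] + p["w0"]
    return a1, a2, a0

# Arbitrary integer weights (illustration only).
p = {"w111": 2, "w112": -1, "w10": 3,
     "w121": 4, "w122": 5, "w20": -2,
     "w21": -3, "w22": 1, "w0": 7}
a1, a2, a0 = collapse(p)
for x1, x2 in [(0, 0), (1, 0), (0, 1), (1, 1), (-2, 5)]:
    # The layered net and the collapsed single layer agree on every input.
    assert two_layer_linear(x1, x2, p) == a1 * x1 + a2 * x2 + a0
print(a1, a2, a0)
```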

Obvious plan: just adjust all weights in the system by the error
This isn't a workable plan: it distributes error equally across all weights, including those that should go "the other way"
More traditional plan: distribute error according to effect on the output using ∂y/∂h
Sadly, the derivative of y = h > 0 is zero everywhere

The hyperbolic tangent function is
```
s(h) = (e**(2h) - 1) / (e**(2h) + 1)
```
You can plot it with gnuplot
You can adjust the constant to make it steeper or shallower,
```
s(h) = (e**(ah) - 1) / (e**(ah) + 1)
```
but customarily just pick 1, which gives tanh(h/2)
```
s(h) = (e**h - 1) / (e**h + 1)
```
s(h) is smooth and has a nonzero derivative everywhere, so we can distribute error more fairly
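For readers who prefer Python to gnuplot, here is a minimal sketch of the squashing function and its slope. The function names `s` and `ds` and the default `a=2.0` (the plain tanh form) are assumptions for illustration; the slope formula follows from s(h) = tanh(a h / 2).

```python
import math

def s(h, a=2.0):
    """Squashing function (e**(a h) - 1) / (e**(a h) + 1); a = 2 is tanh(h)."""
    # Note: math.exp overflows for very large a*h; math.tanh is the robust choice.
    e = math.exp(a * h)
    return (e - 1.0) / (e + 1.0)

def ds(h, a=2.0):
    """Slope of s with respect to h: (a / 2) * (1 - s(h)**2)."""
    return 0.5 * a * (1.0 - s(h, a) ** 2)

for h in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(h, s(h), ds(h))
```

The slope is largest at h = 0 and shrinks toward zero as |h| grows, but it never reaches zero, unlike the slope of the hard threshold.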

Learning XOR is impossible for a single Perceptron

Cannot choose w1, w2 such that

x1 * w1 + x2 * w2 > 0

matches x1 XOR x2 for all four input pairs (adding a bias w0 does not help)
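A brute-force search makes the claim concrete. This sketch assumes 0/1 inputs, includes a bias w0, and tries every weight on a coarse grid; a grid search is not a proof, only an illustration. For contrast, `xor_net` shows a two-layer threshold net whose weights were picked by hand for this example.

```python
from itertools import product

CASES = [(0, 0), (0, 1), (1, 0), (1, 1)]

def best_score(target, grid):
    """Most cases (out of 4) that any single perceptron on the grid gets right."""
    best = 0
    for w1, w2, w0 in product(grid, repeat=3):
        hits = sum((x1 * w1 + x2 * w2 + w0 > 0) == target(x1, x2)
                   for x1, x2 in CASES)
        best = max(best, hits)
    return best

def step(h):
    """Hard threshold y = h > 0, as 0 or 1."""
    return 1 if h > 0 else 0

def xor_net(x1, x2):
    """Hand-picked two-layer weights: an OR unit, an AND unit, then OR-but-not-AND."""
    h_or = step(x1 + x2 - 0.5)
    h_and = step(x1 + x2 - 1.5)
    return step(h_or - h_and - 0.5)

grid = [k / 2 for k in range(-10, 11)]            # -5.0 .. 5.0 in steps of 0.5
print("AND:", best_score(lambda p, q: bool(p and q), grid))
print("XOR:", best_score(lambda p, q: bool(p ^ q), grid))
print("two-layer XOR:", [xor_net(x1, x2) for x1, x2 in CASES])
```

The single perceptron tops out at three of the four XOR cases, while the hand-built two-layer net gets all four.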

Our tanh squashing function is OK, but slow to compute
Saturation is an issue: for large |h| the derivative is nearly zero
Here's the new hotness in squashing functions: "Rectified Linear Unit". We will use a "leaky" ReLU, which avoids the "stuck neuron" problem
```
  s(h) = 0.1 h (h < 0), h (otherwise)
  ds/dh = 0.1 (h < 0), 1 (otherwise)
```
Inspired by biology; asymmetric; many clever variations
Leads to sparsely-activated nets
Avoids the vanishing gradient problem where a neuron gets "stuck"

We have a discriminator that can XOR, but how to pick the right weights?
Use training (or reinforcement) to adjust the weights as with the Perceptron
Weight assignment by error backpropagation: gradient descent on error
Assume
```
E = c - y
```
where E is error, c is target response, y is output response

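Change

E[j-1] = sum[k] (w[k][i] E[k])
w[j][i] += a E[j] y[j-1][i] dy/dx
w0[j][i] += a E[j] dy/dx

(or thereabouts; subscripts get fiddly). Here a is the learning rate and dy/dx is the slope of the squashing function at the neuron's weighted sum, i.e. ds/dh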
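One way to pin the subscripts down is to write the update as code. The sketch below is an assumption-laden illustration, not the course's implementation: it uses a leaky ReLU, stores each neuron as a dict with hypothetical keys `"w"` and `"w0"`, and follows the common textbook form in which each neuron's error is scaled by its slope before being passed to the layer below. The starting weights are arbitrary.

```python
def s(h):
    """Leaky ReLU squashing function."""
    return h if h >= 0 else 0.1 * h

def ds(h):
    """Slope of the leaky ReLU."""
    return 1.0 if h >= 0 else 0.1

def forward(net, x):
    """Return (hs, ys): per-layer weighted sums and outputs; ys[0] is the input."""
    hs, ys = [], [list(x)]
    for layer in net:
        h = [sum(w * y for w, y in zip(n["w"], ys[-1])) + n["w0"] for n in layer]
        hs.append(h)
        ys.append([s(v) for v in h])
    return hs, ys

def train_step(net, x, c, a=0.1):
    """One backpropagation update toward target c; returns the output before the update."""
    hs, ys = forward(net, x)
    E = [ci - yi for ci, yi in zip(c, ys[-1])]           # E = c - y at the output
    for j in reversed(range(len(net))):
        # Scale each neuron's error by the slope of its squashing function.
        delta = [e * ds(h) for e, h in zip(E, hs[j])]
        # Error for the layer below, using the weights before they change.
        E = [sum(n["w"][i] * d for n, d in zip(net[j], delta))
             for i in range(len(ys[j]))]
        for n, d in zip(net[j], delta):
            n["w"] = [w + a * d * y for w, y in zip(n["w"], ys[j])]
            n["w0"] += a * d
    return ys[-1]

# Arbitrary starting weights: 2 inputs, 2 hidden neurons, 1 output (illustration only).
net = [
    [{"w": [0.5, -0.5], "w0": 0.1}, {"w": [-0.3, 0.8], "w0": 0.2}],
    [{"w": [0.7, -0.4], "w0": 0.05}],
]
x, c = [1.0, 0.0], [1.0]
before = train_step(net, x, c)[0]
after = forward(net, x)[1][-1][0]
print("error before:", abs(c[0] - before), "error after:", abs(c[0] - after))
```

A single step on a single sample only shows that the error shrinks; learning XOR would mean repeating the step over all four cases many times, and the outcome depends on the starting weights and learning rate.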

Typical architecture: "input layer" → hidden layers → output layer
- Input layer is just the inputs, not actual neurons
- More / wider (and fancier) hidden layers = more generalization ("deep learning")
- Training gets slow quickly
Use multicore CPUs, SIMD graphics cards, Tensor Processing Units to do weight adjustments on a whole layer in parallel

What is backgammon?
Why is BG hard?
Gerry Tesauro 199x: TD(lambda) NN reinforcement learning can learn to play good BG from rules alone!
Add shallow search for tactics, tune: best BG player ever!
Humans now learn BG from TD(lambda) players, closing the circle
Recently Go, Chess, etc

Can build a neural net that
- Given: training data about time-on-job
- Predicts: expected time-on-job
Ethically, must use only socially sanctioned inputs, e.g. not race!
But
- Net can infer race from e.g. eye color, average salary, traffic stops by location, the MMPI (!)
- Net is nondeclarative: how does it use the inputs?
Net may inadvertently "discriminate" based on hidden undesired variables
Worse, a malicious person may adjust input choice and training data (even weights?) to discriminate
- Hidden Delta theorems: NP-complete to find intentionally-induced errors
Conclusions?

Last modified: Friday, 15 November 2019, 10:53 PM