## Neural Nets

## Perceptron

Recall the Perceptron

`h = sum[i] w[i] x[i] + w0 y = h > 0`

The Perceptron is trained using an error term

`w[i] += a (c - y) x[i] w0 += a (c - y)`

## "Neural" Nets

Idea: Glue PCTS together into a DAG so that they can learn more complicated functions

`x1 ^ x2 = (x1 | x2) & !(x1 & x2) w21 (w111 x1 + w112 x2 + w10 > 0) + w22 (w121 x1 + w122 x2 + w20 > 0) + w0 > 0`

*w111*and*w112*positive*w121*and*w122*negative*w21*positive*w22*negative*w10*and*w20*and*w0*zero

*Note:*The inner nonlinearity is critical. Otherwise simple algebra gives`(w21 w111 + w22 w121) x1 + (w21 w112 + w22 w122) x2 + (w21 w10 + w22 w20 + w0) > 0 a1 x1 + a2 x2 + a0 > 0`

and we're back where we started.

## Training A Net

Obvious plan: just adjust all weights in the system by the error

This isn't a workable plan: it distributes error equally across all weights, including those that should go "the other way"

More traditional plan: distribute error according to effect on the output using

*∂y/∂h*Sadly, the derivative of

`y = h > 0`

is zero everywhere

## htan Squashing Function

The hyperbolic tangent function is

`s(h) = (e**(2h) - 1) / (e**(2h) + 1)`

You can plot it with

`gnuplot`

You can adjust the constant to make it steeper or shallower,

`s(h) = (e**(ah) - 1) / (e**(ah) + 1)`

but customarily just pick 1

`s(h) = (e**h - 1) / (e**h + 1)`

`s(h)`

has a reasonable derivative everywhere, so we can distribute error more fairly

## Beyond Perceptrons

One PCT is kind of a poor learner

Learning XOR is impossible

`x1 x2 y 0 0 0 0 1 1 1 0 1 1 1 0`

Cannot choose

*w1*,*w2*such that`x1 * w1 + x2 * w2 > 0`

matches the sign

## Leaky ReLU

Our htan squashing function is OK, but slow to compute

Saturation is an issue

Here's the new hotness in squashing functions: "Rectified Linear Unit". We will use a "leaky" ReLU, which avoids the "stuck neuron" problem

`s(h) = 0.1 h (h < 0), h (otherwise) ds/dh = 0.1 (h < 0), 1 (otherwise)`

Inspired by biology; asymmetric; many clever variations

Leads to sparsely-activated nets

Avoids the

*vanishing gradient*problem where a neuron gets "stuck"

## Weight Assignment

We have a discriminator that can XOR, but how to pick the right weights?

Use training (or reinforcement) to adjust the weights as with PCT

Weight assignment by error

*backpropagation*: gradient descent on errorAssume

`E = c - y`

where

*E*is error,*c*is target response,*y*is output responseChange

`E[j-1] = sum[k] (w[k][i] E[k]) w[j][i] += a E[j] y[j-i][i] dy/dx w0[j][i] += a E[j] dy/dx`

(or thereabouts; subscripts get fiddly)

Input layer is handled specially since there are no weights to adjust there

## Big Nets

Typical architecture: "input layer" → hidden layers → output layer

Input layer is just the inputs, not actual neurons

More / wider (and fancier) hidden layers = more generalization ("deep learning")

Training gets slow quickly

Use multicore CPUs, SIMD graphics cards, Tensor Processing Units to do weight adjustments on a whole layer in parallel

## ML Example: Neural Net Backgammon

What is backgammon?

Why is BG hard?

Gerry Tesauro 199x:

*TD(lambda)*NN reinforcement learning can learn to play good BG from rules alone!Add shallow search for tactics, tune: best BG player ever!

Humans learn BG from

*TD(lambda)*, close circleRecently Go, Chess, etc

## CS Ethics: Neural Net Actuarials?

Can build a neural net that

- Given: training data about time-on-job
- Predicts: expected time-on-job

Ethically, must use only socially sanctioned inputs, e.g. not race!

*But*Net can infer race from e.g. eye color, average salary, traffic stops by location, the MMPI (!)

Net is nondeclarative:

*how does it use the inputs*?

Net may inadvertantly "discriminate" based on hidden undesired variables

Worse, malicious person may adjust input choice and training data (even weights?) to discriminate

- Hidden Delta theorems: NPC to find intentionally-induced errors

Conclusions?