## Machine Learning Methods

- Paper on Spam detection from long ago: http://web.cecs.pdx.edu/~bart/papers/spam.pdf

## Brute Force

Collect training instances in a bag.

Pick matching training instance from bag, and classify target identically

Problems:

- no/multiple matches
- huge bag
- slow classification

## Minimum Distance Voting (k-Nearest Neighbor)

Help with some of the problems of brute force

Pick

*k*"closest" instances from bagMetric for boolean features is usually "Hamming Distance"

`H(v1, v2) = sum[i](v1[i] ^ v2[i])`

Vote the instances

## Naive Bayesian learning

(We've already looked at this)

Binary setting: Count the number of occurrences of each feature in positive and negative setting

Compare underestimates of probabilities using products

Take logs to turn products into sums

Use m-separation to get accurate products

## Decision Trees and ID3

Want to build a "binary decision tree" that splits training set on binary features for a binary class

ID3 (Quinlan) idea:

Greedily pick a feature that splits the training set "as well as possible" into positive and negative subsets.

For each subset, recurse: pick a remaining feature to try to improve the split

Stop when the current subset is (almost) all one class

## Information Gain

Select next feature

*f*in tree to maximize information gainRecall

`U(S) = sum[x in 0, 1] -pr(x in S) log pr(x in S)`

where

`pr(x in S) = |S[c=x]| / |S|`

Now compute information gain

*Δu*for each feature*f*`S+ = S[f=1] S- = S[f=0] Δu = u(S) - (|S+|/|S|) u(S+) - (|S-|/|S|) u(S-)`

Avoid overfitting (gain is probably just training set anomaly)

Stop when gain from best split is below some threshold

Stop when statistical significance of split is low (

*chi^2*test)

Greedy is not optimal: mild independence assumption

## Perceptron

"Artificial Neuron" (Papert

*et al*): basis of neural netsHandles continuous inputs and outputs well: we will binarize anyway because binary

Idea: predict the binary class as a thresholded weighted sum of the features

`c = sum[i] w[i] x[i] + w0 > 0`

(we will use unit weights and features:

`x[i], w[i] in [0..1]`

Training consists of learning appropriate weights

*w*Assign some initial weights (random?)

Feed each training instance through the perceptron

Adjust the weights "toward the true classification"

`w[i] += a (c - y) x[i] w0 += a (c - y)`

where

*y*is the unthresholded output. Remember that*c*and*x*are 0 or 1.*a*is the "learning rate": smalller (a < 0.1) means more reliable convergence, larger can mean faster learning or divergenceRun all the training instances repeatedly until the average accuracy isn't getting better