CS 441/541 F19: Machine Learning Methods

Brute Force

Help with some of the problems of brute force
Pick k "closest" instances from bag
- Metric for boolean features is usually "Hamming Distance"
```
H(v1, v2) = sum[i](v1[i] ^ v2[i])
```
Vote the instances

(We've already looked at this)
Binary setting: Count the number of occurrences of each feature in positive and negative setting
Compare underestimates of probabilities using products
Take logs to turn products into sums
Use m-separation to get accurate products

Want to build a "binary decision tree" that splits training set on binary features for a binary class
ID3 (Quinlan) idea:
- Greedily pick a feature that splits the training set "as well as possible" into positive and negative subsets.
- For each subset, recurse: pick a remaining feature to try to improve the split
- Stop when the current subset is (almost) all one class

Select next feature f in tree to maximize information gain

Recall

U(S) = sum[x in 0, 1] -pr(x in S) log pr(x in S)

where

pr(x in S) = |S[c=x]| / |S|

Now compute information gain Δu for each feature f

S+ = S[f=1]
S- = S[f=0]
Δu = u(S) - (|S+|/|S|) u(S+) - (|S-|/|S|) u(S-)

Avoid overfitting (gain is probably just training set anomaly)
- Stop when gain from best split is below some threshold
- Stop when statistical significance of split is low (chi^2 test)
Greedy is not optimal: mild independence assumption

"Artificial Neuron" (Papert et al): basis of neural nets
Handles continuous inputs and outputs well: we will binarize anyway because binary
Idea: predict the binary class as a thresholded weighted sum of the features
```
c = sum[i] w[i] x[i] + w0 > 0
```
(we will use unit weights and features: x[i], w[i] in [0..1]
Training consists of learning appropriate weights w
- Assign some initial weights (random?)
- Feed each training instance through the perceptron
- Adjust the weights "toward the true classification"
```
w[i] += a (c - y) x[i]
w0 += a (c - y)
```
  where y is the unthresholded output. Remember that c and x are 0 or 1. a is the "learning rate": smalller (a < 0.1) means more reliable convergence, larger can mean faster learning or divergence
- Run all the training instances repeatedly until the average accuracy isn't getting better

Last modified: Friday, 15 November 2019, 10:52 PM