CS 441/541 F19: Machine Learning Issues

Practical Issues In ML

Induction with only a few samples is a fool's errand
How much is enough? There's a whole theory for this, which is outside the scope of this course
Worse, some of the samples need to be held out for evaluation. Tradeoff: more training samples = better accuracy (probably), but fewer held-out samples = a less reliable estimate of that accuracy

Imagine training with all instances and then evaluating performance against those same instances
A brute-force learner that simply memorizes every instance would score perfectly
Need to measure generalization across as-yet-unknown instances
Typical method: hold out an evaluation set
- ugh, less data for training
- What if we are unlucky in our choice of evaluation set? The training and evaluation sets may no longer be comparable
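
A minimal sketch of the hold-out method, assuming instances are stored in a Python list. The function name `holdout_split`, the 20% default, and the fixed seed are illustrative choices, not something the course prescribes. Shuffling before splitting is one simple guard against an unlucky, non-comparable evaluation set.

```python
import random

def holdout_split(instances, eval_fraction=0.2, seed=0):
    """Shuffle a copy of the instances and split off an evaluation set.

    eval_fraction and seed are illustrative defaults (assumptions).
    Returns (training_set, evaluation_set).
    """
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)
    n_eval = int(len(shuffled) * eval_fraction)
    return shuffled[n_eval:], shuffled[:n_eval]

# Hypothetical data: (feature, class) pairs.
instances = [(i, i % 2) for i in range(10)]
train, evaluation = holdout_split(instances)
```

The two sets are disjoint and together cover every instance, so nothing used for evaluation was seen during training.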

For our binary case, with p the predicted class and c the actual class:

p c  name
0 0  true negative
1 1  true positive
0 1  false negative
1 0  false positive

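Once we have counted each of these, we can form various sums and ratios depending on what we want to measure
- Accuracy: (tn+tp)/|S|, where |S| = tn+tp+fn+fp is the number of evaluated instances
- Precision: tp/(tp+fp), the fraction of predicted positives that are actually positive
- Recall: tp/(tp+fn), the fraction of actual positives that are predicted positive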
https://towardsdatascience.com/precision-vs-recall-386cf9f89488
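
The counts and ratios above can be sketched as follows. The function names are illustrative, and returning 0.0 when a denominator is zero is an assumed convention, not something the notes specify.

```python
def confusion_counts(predicted, actual):
    """Count (tn, tp, fn, fp) for parallel lists of 0/1 values."""
    tn = tp = fn = fp = 0
    for p, c in zip(predicted, actual):
        if p == 1 and c == 1:
            tp += 1
        elif p == 0 and c == 0:
            tn += 1
        elif p == 0 and c == 1:
            fn += 1
        else:
            fp += 1
    return tn, tp, fn, fp

def accuracy(tn, tp, fn, fp):
    return (tn + tp) / (tn + tp + fn + fp)

def precision(tp, fp):
    # Assumed convention: 0.0 when nothing was predicted positive.
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp, fn):
    # Assumed convention: 0.0 when there are no actual positives.
    return tp / (tp + fn) if tp + fn else 0.0

# Hypothetical predictions and actual classes for eight instances.
predicted = [1, 1, 0, 0, 1, 0, 0, 0]
actual    = [1, 0, 0, 1, 1, 0, 0, 1]
tn, tp, fn, fp = confusion_counts(predicted, actual)
# Here tn=3, tp=2, fn=2, fp=1: accuracy 0.625, precision 2/3, recall 0.5.
```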

Never enough data
Learner "masters" the training set, building a model that predicts it quite accurately
- This mastery includes all the peculiarities of the data set: outliers, over-represented features, etc.
- This degree of accuracy may reduce generalization, making the predictor worse on new instances

Decrease amount of data in training set (force model to generalize)? Probably not
Use some principled measure of fit (Naive Bayes, Decision Trees)
Use a validation set: hold out more of the data, and train on the training set until performance on the validation set starts to get worse
This is what is done for Perceptron
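
A minimal sketch of validation-based stopping for a perceptron. The details are assumptions for illustration: the `patience` parameter (how many non-improving epochs to tolerate), the update rule with a learning rate of 1, and the toy data are not taken from the course's implementation.

```python
def predict(weights, bias, x):
    return 1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else 0

def error_count(weights, bias, data):
    return sum(1 for x, c in data if predict(weights, bias, x) != c)

def train_with_early_stopping(train, validation, max_epochs=100, patience=3):
    """Train on `train`; keep the weights that did best on `validation`.

    Stops once validation error has failed to improve for `patience`
    consecutive epochs (patience is an illustrative assumption).
    """
    weights, bias = [0.0] * len(train[0][0]), 0.0
    best = (error_count(weights, bias, validation), list(weights), bias)
    stale = 0
    for _ in range(max_epochs):
        for x, c in train:
            delta = c - predict(weights, bias, x)  # -1, 0, or +1
            if delta:
                weights = [w + delta * xi for w, xi in zip(weights, x)]
                bias += delta
        err = error_count(weights, bias, validation)
        if err < best[0]:
            best, stale = (err, list(weights), bias), 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best[1], best[2]

# Hypothetical data: class is 1 when the first feature exceeds the second.
train = [((2, 0), 1), ((0, 2), 0), ((3, 1), 1), ((1, 3), 0)]
validation = [((4, 1), 1), ((1, 4), 0), ((2, 1), 1), ((1, 2), 0)]
weights, bias = train_with_early_stopping(train, validation)
```

Returning the best weights seen, rather than the last ones, means the result is never worse on the validation set than the starting point.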

Think of the feature vector as residing in an n-dimensional space
A "linear" learner can find an (n-1)-dimensional hyperplane in that space that best separates positive and negative training instances
A "nonlinear" learner can find more complicated boundaries
Linear: Naive Bayes, Perceptron
Nonlinear: Decision Trees, k-Nearest Neighbor
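
To illustrate what a linear boundary can and cannot do, here is a small sketch that searches a coarse grid of weights for a separating hyperplane over two Boolean features. The grid search and the function names are illustrative assumptions. It is a demonstration rather than a proof, although XOR is in fact not separable by any line.

```python
from itertools import product

def linear_predict(weights, bias, x):
    return 1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else 0

def separable_on_grid(data, grid):
    """True if some (w0, w1, bias) drawn from grid classifies all of data."""
    for w0, w1, b in product(grid, repeat=3):
        if all(linear_predict((w0, w1), b, x) == c for x, c in data):
            return True
    return False

AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
grid = [g / 2 for g in range(-6, 7)]  # -3.0 to 3.0 in steps of 0.5

and_ok = separable_on_grid(AND, grid)  # e.g. weights (1, 1), bias -1.5
xor_ok = separable_on_grid(XOR, grid)  # no line separates XOR
```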

Real-world training instances will have:
- Wrong classification
- Mis-measured features
- Missing features
Algorithms need to be able to cope with this

Rare for a real-world inductive ML problem to come with instances that have a vector of Boolean features
Choosing the right features makes a huge difference
- Summarize the information useful for classification
- Leave out features that can confuse the learner or kill performance
  - Consider a "random feature" that is computed for each instance by flipping a coin
  - On small datasets this feature will be accidentally correlated with the classification to some degree, so the learner will try to use it
  - It won't generalize well at all
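
The coin-flip feature can be simulated with a short sketch. It measures how far the agreement between a random feature and random classes strays from the 50% expected by chance. The function names, trial count, and dataset sizes are illustrative assumptions.

```python
import random

def agreement(n, rng):
    """Fraction of n instances where a coin-flip feature equals the class."""
    labels = [rng.randint(0, 1) for _ in range(n)]
    coin = [rng.randint(0, 1) for _ in range(n)]
    return sum(1 for f, c in zip(coin, labels) if f == c) / n

def mean_spurious_signal(n, trials=200, seed=0):
    """Average distance of the agreement rate from 0.5 over many trials."""
    rng = random.Random(seed)
    return sum(abs(agreement(n, rng) - 0.5) for _ in range(trials)) / trials

small = mean_spurious_signal(10)    # tiny datasets
large = mean_spurious_signal(1000)  # larger datasets
```

The apparent signal is noticeably larger for the small datasets, which is why a learner trained on few instances is tempted to use a feature that carries no information.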

Boolean features work with all algorithms, but may lose information
Set-valued features only work with some algorithms, and require more data to exploit (larger hypothesis space)
Scalar features only work with a few algorithms, but provide a lot of information (sometimes)
Can always Booleanize a feature
- Set-valued: characteristic vector (one Boolean per possible value)
- Scalar: above/below the mean or median
- Scalar: above/below the split point with the best information gain
Not always a good idea
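
Two of the Booleanizations above, as a minimal sketch. It assumes a set-valued feature takes exactly one value from a known list of possible values, and it thresholds scalars strictly above the median. Both function names are illustrative.

```python
from statistics import median

def characteristic_vector(value, universe):
    """One Boolean per possible value; 1 marks the value that is present."""
    return [1 if value == member else 0 for member in universe]

def booleanize_by_median(values):
    """1 for values strictly above the median of the column, else 0."""
    m = median(values)
    return [1 if v > m else 0 for v in values]

# Hypothetical set-valued and scalar feature columns.
colors = ["red", "green", "blue"]
green_bits = characteristic_vector("green", colors)          # [0, 1, 0]
size_bits = booleanize_by_median([1.0, 5.0, 2.0, 8.0, 3.0])  # [0, 1, 0, 1, 0]
```

The scalar example shows the information loss: 5.0 and 8.0 become indistinguishable after thresholding.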

Last modified: Tuesday, 5 November 2019, 10:44 PM