# Machine Learning Issues

## Practical Issues In ML

- Sample Size
- Evaluation
- Overfitting
- Linearity
- Bad Data
- Feature Selection

## Sample Size

Induction with only a few samples is a fool's errand

How much is enough? There's a whole theory for this, which is outside the scope of this course

Worse, some of the samples need to be held out for evaluation. Tradeoff: more training samples = better accuracy (probably), but fewer held-out samples = a less reliable estimate of that accuracy

## Evaluation

Imagine training with *all* instances and then evaluating performance against all instances

Brute force learner would be *perfect*

Need to measure *generalization* across as-yet-unknown instances

Typical method: hold out an *evaluation set*; ugh, less data for training

What if we are unlucky in our choice of evaluation set? Maybe training and evaluation set are not comparable anymore?
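A minimal sketch of holding out an evaluation set, assuming instances are stored in a plain Python list. The function name `holdout_split` and its parameters are hypothetical, chosen for illustration. Shuffling before the split reduces (but does not eliminate) the chance of an unlucky, non-comparable evaluation set.

```python
import random

def holdout_split(instances, eval_fraction=0.2, seed=0):
    """Shuffle a copy of the instances, then split off an evaluation set."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)
    # Hold out at least one instance for evaluation.
    n_eval = max(1, round(len(shuffled) * eval_fraction))
    return shuffled[n_eval:], shuffled[:n_eval]  # (training, evaluation)

# Toy data: (feature, class) pairs.
data = [(x, x % 2) for x in range(10)]
training, evaluation = holdout_split(data, eval_fraction=0.2)
print(len(training), len(evaluation))
```

Changing `seed` gives a different split; comparing results across a few seeds is a cheap way to see how much the choice of evaluation set matters.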

## Cross-Validation

Idea: Partition the data *S* into *n* equal subsets

For each subset *S[i]*, train on *S - S[i]* and evaluate on *S[i]*

Do statistics on these *n* runs to get some kind of min/max/average accuracy

Limiting case: "Leave-one-out" Cross-Validation; let *n = |S|*

Cross-Validation is *n*× as expensive
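The procedure above can be sketched as follows. This is an illustration under assumptions, not a reference implementation: `cross_validate`, `train_majority`, and `accuracy` are hypothetical names, and the "learner" is a toy that always predicts the majority class of its training set.

```python
def cross_validate(instances, n, train_fn, accuracy_fn):
    """Return the accuracy from each of the n train/evaluate runs."""
    folds = [instances[i::n] for i in range(n)]  # n roughly equal subsets S[i]
    accuracies = []
    for i, held_out in enumerate(folds):
        # Train on S - S[i], evaluate on S[i].
        training = [x for j, fold in enumerate(folds) if j != i for x in fold]
        model = train_fn(training)
        accuracies.append(accuracy_fn(model, held_out))
    return accuracies

def train_majority(training):
    """Toy learner: the 'model' is just the majority class (ties go to 1)."""
    ones = sum(c for _, c in training)
    return 1 if 2 * ones >= len(training) else 0

def accuracy(model, held_out):
    return sum(1 for _, c in held_out if c == model) / len(held_out)

# Toy data: 15 positive and 5 negative instances.
data = [(i, 1 if i % 4 else 0) for i in range(20)]

scores = cross_validate(data, 5, train_majority, accuracy)
print(min(scores), max(scores), sum(scores) / len(scores))

# Leave-one-out: n = |S|, so every instance is its own evaluation set.
loo_scores = cross_validate(data, len(data), train_majority, accuracy)
print(sum(loo_scores) / len(loo_scores))
```

Note that `train_fn` is called once per fold, which is where the *n*× cost comes from.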

## Measures Of Accuracy

For our binary case

| p (predicted) | c (actual class) | name |
|---|---|---|
| 0 | 0 | true negative |
| 1 | 1 | true positive |
| 0 | 1 | false negative |
| 1 | 0 | false positive |

Once we have counted each of these, we can form various sums and ratios depending on what we want to do

Accuracy: *(tn+tp)/|S|*

Precision: *tp/(tp+fp)*

Recall: *tp/(tp+fn)*

https://towardsdatascience.com/precision-vs-recall-386cf9f89488
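A small sketch of counting the four outcomes and forming the ratios above, assuming predictions and actual classes are given as parallel lists of 0/1 values. The function names are illustrative, and returning 0.0 when a denominator is zero is an assumption of this sketch rather than a universal convention.

```python
def confusion_counts(predicted, actual):
    """Count true negatives, true positives, false negatives, false positives."""
    tn = tp = fn = fp = 0
    for p, c in zip(predicted, actual):
        if p == 1 and c == 1:
            tp += 1
        elif p == 0 and c == 0:
            tn += 1
        elif p == 0 and c == 1:
            fn += 1
        else:  # p == 1 and c == 0
            fp += 1
    return tn, tp, fn, fp

def accuracy(tn, tp, fn, fp):
    return (tn + tp) / (tn + tp + fn + fp)  # denominator is |S|

def precision(tp, fp):
    # Assumption: report 0.0 if nothing was predicted positive.
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp, fn):
    # Assumption: report 0.0 if there are no actual positives.
    return tp / (tp + fn) if tp + fn else 0.0

predicted = [1, 1, 0, 0, 1, 0, 1, 0]
actual    = [1, 0, 0, 1, 1, 0, 1, 0]
tn, tp, fn, fp = confusion_counts(predicted, actual)
print(accuracy(tn, tp, fn, fp), precision(tp, fp), recall(tp, fn))
```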

## Overfitting

Never enough data

Learner "masters" the training set, building a model that predicts it quite accurately

This mastery includes all the peculiarities of the data set: outliers, over-represented features, etc.

This degree of accuracy may *reduce* generalization, making the predictor *worse* on new instances

## Controlling Overfitting

Decrease amount of data in training set (force model to generalize)? Probably not

Have some principled measure of fit (Naive Bayes, Decision Trees)

Use a validation set. Hold out more of the data and train on the training set until the performance on the validation set starts to get worse

This is what is done for Perceptron
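One way the validation-set idea might look for a Perceptron, as a sketch only: train one epoch at a time, score the validation set after each epoch, and stop as soon as the score drops, keeping the last weights from before the drop. The function names, the learning rate, and the toy data are all assumptions made for illustration.

```python
def predict(weights, bias, features):
    """Perceptron output: 1 if the weighted sum is positive, else 0."""
    return 1 if sum(w * x for w, x in zip(weights, features)) + bias > 0 else 0

def accuracy(weights, bias, instances):
    return sum(predict(weights, bias, f) == c for f, c in instances) / len(instances)

def train_with_early_stopping(training, validation, max_epochs=100, rate=0.1):
    weights, bias = [0.0] * len(training[0][0]), 0.0
    best_score, best_weights, best_bias = accuracy(weights, bias, validation), weights, bias
    for _ in range(max_epochs):
        for features, c in training:  # one epoch of perceptron updates
            error = c - predict(weights, bias, features)
            weights = [w + rate * error * x for w, x in zip(weights, features)]
            bias += rate * error
        score = accuracy(weights, bias, validation)
        if score < best_score:
            break  # validation performance got worse: stop training
        best_score, best_weights, best_bias = score, weights, bias
    return best_score, best_weights, best_bias

# Toy data: class is (feature 0 OR feature 1); feature 2 is irrelevant.
training = [((0, 0, 0), 0), ((0, 1, 0), 1), ((1, 0, 0), 1), ((1, 1, 0), 1)]
validation = [((1, 1, 1), 1), ((0, 0, 1), 0), ((1, 0, 1), 1)]
score, weights, bias = train_with_early_stopping(training, validation)
print(score)
```

On this clean toy data the validation score never drops, so training simply runs to `max_epochs`; with noisy real data the `break` is what keeps the model from memorizing the training set.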

## Linearity

Think of the feature vector as residing in an n-dimensional space

A "linear" learner can find an (n-1)-dimensional hyperplane in that space that best separates positive and negative training instances

A "nonlinear" learner can find more complicated boundaries

Linear: Naive Bayes, Perceptron

Nonlinear: Decision Trees, k-Nearest Neighbor
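To make the distinction concrete, here is a hedged sketch of a basic Perceptron trained on two tiny Boolean problems. AND can be separated by a line in the 2-dimensional feature space; XOR cannot, so no linear learner gets every instance right. The function names and the integer learning rate are choices made for this illustration.

```python
def predict(weights, bias, features):
    """1 if the instance falls on the positive side of the hyperplane, else 0."""
    return 1 if sum(w * x for w, x in zip(weights, features)) + bias > 0 else 0

def train_perceptron(instances, epochs=50, rate=1):
    weights, bias = [0] * len(instances[0][0]), 0
    for _ in range(epochs):
        for features, c in instances:
            error = c - predict(weights, bias, features)
            weights = [w + rate * error * x for w, x in zip(weights, features)]
            bias += rate * error
    return weights, bias

def training_accuracy(instances):
    weights, bias = train_perceptron(instances)
    return sum(predict(weights, bias, f) == c for f, c in instances) / len(instances)

AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

and_accuracy = training_accuracy(AND)  # linearly separable
xor_accuracy = training_accuracy(XOR)  # not linearly separable
print(and_accuracy, xor_accuracy)
```

A nonlinear learner such as a decision tree can represent XOR by testing one feature and then the other.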

## Bad Data

Real-world training instances will have:

- Wrong classification
- Mis-measured features
- Missing features

Algorithms need to be able to cope with this

## Feature Selection

Rare for a real-world inductive ML problem to come with instances that have a vector of Boolean features

Choosing the right features makes a huge difference

Summarize the information useful for classification

*Leave out* features that can confuse the learner or kill performance

Consider a "random feature" that is computed for each instance by flipping a coin

This feature will be *accidentally correlated* with classification on small datasets, so learner will try to use it

It won't generalize well at all
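The coin-flip effect can be seen in a short simulation. This sketch makes some assumptions beyond the text: classes are themselves random 0/1 values, and instead of a single random feature it draws several and reports the one that happens to agree with the class most often. `best_chance_agreement` is a hypothetical name.

```python
import random

def best_chance_agreement(n_instances, n_coin_features, rng):
    """Highest fraction of instances on which any coin-flip feature matches the class."""
    classes = [rng.randint(0, 1) for _ in range(n_instances)]
    best = 0.0
    for _ in range(n_coin_features):
        feature = [rng.randint(0, 1) for _ in range(n_instances)]
        agreement = sum(f == c for f, c in zip(feature, classes)) / n_instances
        best = max(best, agreement)
    return best

rng = random.Random(0)
small = best_chance_agreement(10, 50, rng)    # tiny dataset
large = best_chance_agreement(2000, 50, rng)  # larger dataset
print(small, large)
```

On the tiny dataset some coin-flip feature typically looks like a strong predictor; on the larger one every such feature stays close to 50% agreement, which is all it can deliver on new instances.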

## Feature Types

Boolean features allow all algorithms, but may lose information

Set-valued features are only OK with some algorithms, require more data to exploit (hypothesis-space size)

Scalar features only work with a few algorithms, but provide a lot of information (sometimes)

Can always Booleanize a feature

- Characteristic vector for set values
- Scalar above/below mean, median
- Scalar by gain splitpoint

Not always a good idea
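The Booleanization options above might be sketched like this. Two assumptions are worth flagging: "gain" is read here as information gain (as used in decision-tree learning), and candidate splitpoints are taken as midpoints between adjacent distinct values. All function names are illustrative.

```python
from math import log2
from statistics import median

def characteristic_vector(value_set, universe):
    """One Boolean feature per possible element: is it in the set?"""
    return [element in value_set for element in universe]

def above_median(values):
    """Boolean feature: is each scalar strictly above the median?"""
    m = median(values)
    return [v > m for v in values]

def entropy(classes):
    """Entropy in bits of a list of 0/1 classes."""
    if not classes:
        return 0.0
    p = sum(classes) / len(classes)
    return -sum(q * log2(q) for q in (p, 1 - p) if q > 0)

def gain_splitpoint(values, classes):
    """Threshold with the highest information gain (assumed meaning of 'gain')."""
    pairs = sorted(zip(values, classes))
    base = entropy(classes)
    best_gain, best_threshold = -1.0, None
    for (a, _), (b, _) in zip(pairs, pairs[1:]):
        if a == b:
            continue  # no threshold fits between equal values
        threshold = (a + b) / 2
        below = [c for v, c in pairs if v <= threshold]
        above = [c for v, c in pairs if v > threshold]
        remainder = (len(below) * entropy(below) + len(above) * entropy(above)) / len(pairs)
        if base - remainder > best_gain:
            best_gain, best_threshold = base - remainder, threshold
    return best_threshold

colors = characteristic_vector({"red", "blue"}, ["red", "green", "blue"])
values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
classes = [0, 0, 0, 1, 1, 1]
is_high = above_median(values)
split = gain_splitpoint(values, classes)
print(colors, is_high, split)
```

Each of these throws information away (a scalar becomes a single bit), which is one reason Booleanizing is not always a good idea.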