## Naïve Bayesian Reasoning

## Bayesian Classification

A form of machine learning

Given a bunch of evidence for and against a thing, decide whether the thing

Formulate as a classic probability problem

`pr(H|E1=v1, E2=v2, ... En=vn)`

We can use Bayes Rule to turn this around

`pr(E1=v1, E2=v2, ... En=vn|H) pr(H) = ----------------------------------- pr(E1=v1, E2=v2, ... En=vn)`

Now we "just" need the probabilities: perhaps we can examine a bunch of classified examples and compute them

## Binary Naïve Bayes

Problem: There's a huge space of possible values here; likely no way we'll get enough examples to accurately estimate probabilities

Let's reduce the state space:

Binarize the hypothesis and evidence. Now the state space size is

*2^(n+1)*for*n*featuresNaively assume that all the features are independent (they aren't). Then we can just replace the conjunction with a product

`pr(E1=v1|H) pr(E2=v2|H) ... pr(En=vn|H) pr(H) = --------------------------------------------- pr(E1=v1) pr(E2=v2) ... pr(En=vn)`

Sadly, the naive assumption means that the probabilities will be too low, since interactions won't be counted: but comparable for H true and false. So

`H iff pr(H|E) > pr(not H|E)`

Note that this comparison drops the denominator, which is nice

Games might be played to make the computation more tractable

## m-Estimation

What we have so far:

`pr(E1=v1|H) ... pr(En=vn|H) pr(H) pr(H|E1=v1, ...) = --------------------------------- pr(E1=v1) ... pr(En=vn)`

Consider the case where

`pr(Ei=vi|H)=0`

. If we try to classify an instance of this form, this will make the whole product 0, regardless of what the other terms tell us(Worse yet, consider

`pr(Ei=vi)=0`

. Oops. We drop the denominator in the comparison, but…)The training instances are

*sampled*from a notionally large space. If we see no count in*n*training instances, that means either that the frequency is less than*1/n*or we got unluckyAssume that there is some fraction

*m*associated with the undersampling — usually use ½Now adjust the probabilities:

`|S[Ei=vi|H]+m| pr(Ei=vi|H) = -------------- |S[H]+m|`

Zeros are gone, calculation

*may*be more accurate

## Numerics

We are asked to compute a product of potentially many terms:

`pr(E1=v1|H) ... pr(En=vn|H) pr(H)`

All terms are ≤ 1

When

*n*is large, this product will be*small*. Maybe too small. Floating point underflow could happenWe are comparing

`pr(E1=v1|H) ... pr(En=vn|H) pr(H) > pr(E1=v1|not H) ... pr(En=vn|not H) pr(not H) ?`

Take

*log*on both sides:`log(pr(E1=v1|H) ... pr(En=vn|H) pr(H)) > log(pr(E1=v1|not H) ... pr(En=vn|not H) pr(not H)) ? log(pr(E1=v1|H)) + ... + log(pr(En=vn|H) + log(pr(H))) > log(pr(E1=v1|not H)) + ... + log(pr(En=vn|not H) + log(pr(not H))) ?`

These sums should not underflow

## Heart Anomalies: The HW2 Dataset

We have a bunch of evidence of heart anomalies obtained from radiography

The evidence consists of a fixed list of

*features*that are present in each radiogramFurther, the radiograms have been accurately classified into one of two categories: normal or abnormal

Imagine we are given a radiogram of unknown category but with all the features measured

Our job: Is this a radiogram of a normal or abnormal heart?

See Homework 2