CS 441/541 F20: Naïve Bayesian Reasoning

Bayesian Classification

A form of machine learning
Given a bunch of evidence for and against a thing, decide whether the thing
Formulate as a classic probability problem
```
 pr(H|E1=v1, E2=v2, ... En=vn)
```

We can use Bayes Rule to turn this around

   pr(E1=v1, E2=v2, ... En=vn|H) pr(H)
 = -----------------------------------
     pr(E1=v1, E2=v2, ... En=vn)

Now we "just" need the probabilities: perhaps we can examine a bunch of classified examples and compute them

Binary Naïve Bayes

Problem: There's a huge space of possible values here; likely no way we'll get enough examples to accurately estimate probabilities
Let's reduce the state space:
- Binarize the hypothesis and evidence. Now the state space size is 2^(n+1) for n features
- Naively assume that all the features are independent (they aren't). Then we can just replace the conjunction with a product
```
   pr(E1=v1|H) pr(E2=v2|H) ... pr(En=vn|H) pr(H)
 = ---------------------------------------------
      pr(E1=v1) pr(E2=v2) ... pr(En=vn)
```
Sadly, the naive assumption means that the probabilities will be too low, since interactions won't be counted: but comparable for H true and false. So
```
H iff pr(H|E) > pr(not H|E)
```
Note that this comparison drops the denominator, which is nice
Games might be played to make the computation more tractable

m-Estimation

What we have so far:

                    pr(E1=v1|H) ... pr(En=vn|H) pr(H)
 pr(H|E1=v1, ...) = ---------------------------------
                      pr(E1=v1) ... pr(En=vn)

Consider the case where pr(Ei=vi|H)=0. If we try to classify an instance of this form, this will make the whole product 0, regardless of what the other terms tell us
(Worse yet, consider pr(Ei=vi)=0. Oops. We drop the denominator in the comparison, but…)
The training instances are sampled from a notionally large space. If we see no count in n training instances, that means either that the frequency is less than 1/n or we got unlucky
Assume that there is some fraction m associated with the undersampling — usually use ½

Now adjust the probabilities:

              |S[Ei=vi|H]+m|
pr(Ei=vi|H) = --------------
                 |S[H]+m|

Zeros are gone, calculation may be more accurate

Numerics

We are asked to compute a product of potentially many terms:
```
pr(E1=v1|H) ... pr(En=vn|H) pr(H)
```
All terms are ≤ 1
When n is large, this product will be small. Maybe too small. Floating point underflow could happen

We are comparing

pr(E1=v1|H) ... pr(En=vn|H) pr(H) >
pr(E1=v1|not H) ... pr(En=vn|not H) pr(not H) ?

Take log on both sides:

log(pr(E1=v1|H) ... pr(En=vn|H) pr(H)) >
log(pr(E1=v1|not H) ... pr(En=vn|not H) pr(not H)) ?

log(pr(E1=v1|H)) +  ... + log(pr(En=vn|H) + log(pr(H))) >
log(pr(E1=v1|not H)) +  ... + log(pr(En=vn|not H) + log(pr(not H))) ?

These sums should not underflow

Last modified: Monday, 2 November 2020, 12:19 AM