Monday, January 21, 2013

A Simple Introduction to Naive Bayesian Filters

In the field of data analysis and machine learning algorithms, naive Bayesian filters remain popular since they're relatively easy to implement yet surprisingly effective.  Most introductions to Bayesian filtering dive straight into mathematics, but Bayesian filters are simple enough that you can intuitively see how they work by stepping through an example.

Let's say we've got a box of chocolates, and some of them have nuts inside, and we don't like nuts. We'd like to be able to tell which chocolates have nuts in them.  The chocolates are all different shapes, sizes and colors; some are wrapped and some aren't. The only way to tell if a chocolate contains nuts is to take a bite and see. By sampling the box of chocolates and keeping track of the shape, size, color and wrapping of all the ones we tried, we could come up with a pretty good idea of which ones are most likely to have nuts.

You could make a list of characteristics: size, color, shape, and wrapping. Then start eating chocolates. Every time you find one with nuts, you put an 'X' beside each of the matching characteristics (large, round, dark, etc). Every time you got a chocolate without nuts, put an "O" beside it's matching characteristics.  By the time you got through half the box, the list might look something like this:

Small:       OOOOO
Medium:      XXOOO
Large:       XXXOO
---
Round:       XXXOO
Square:      XXOOO
Long:        OOOOO
---
Dark:        XXXXX
Light:       OOOOO
White:       OOOOO
---
Wrapping:    XOOO
No Wrapping: XXXX OOOOOOO

The X's tell you where the nuts were.  You can see that large, round, dark chocolates with no wrapping tend to contain nuts.  For each of the characteristics, we can count up the X's and come up with a score for how often they contained nuts. The scores would look something like this:

Small:       0/5 (0%)
Medium:      2/5 (40%)
Large:       3/5 (60%)
---
Round:       3/5 (60%)
Square:      2/5 (40%)
Long:        0/5 (0%)
---
Dark:        5/5 (100%)
Light:       0/5 (0%)
White:       0/5 (0%)
---
Wrapping:    1/4 (25%)
No Wrapping: 4/11 (~36%)

You can see that none of the small chocolates had nuts in them.  At least, not the ones we tasted. But that doesn't necessarily mean small chocolates never contain nuts. Look at the dark chocolates, for example. All the dark chocolates had nuts. So, how about a small, dark chocolate? What are the chances it has nuts? 0%? 100%? Or somewhere in between? What if its round? What if it had a wrapper?

Looking at our chart for small, round, dark chocolates with a wrapper, we see 0%, 60%, 100%, and 25%. How do we take this set of probabilities and turn it into a final score?  As you might guess, when dealing with probabilities, we just multiply them together.  The "0" value presents a problem, though. No matter how high the other values are, if there's a zero anywhere in the list, when we multiply it out, the whole result just goes straight to zero. So we fudge the numbers a bit. Instead of zero, we use a really small positive number (i.e. 0.01).

You've probably noticed that the more probabilities you multiply together, the smaller the final result becomes. This leads to weird-looking results. Even if a chocolate has all the hallmarks of nuts (large, round, dark and unwrapped) the final score comes out to only 0.13 (0.6 * 0.6 * 1 * 0.36 = 0.13).  That doesn't sound right at all! Intuitively, we're almost certain it contains nuts.  It ought to have a really high probability, right? Are we really saying it's only got a 13% chance of containing nuts? Not really. If we calculate a final score for this same chocolate not containing nuts (by multiplying the inverses) it's only 0.1% (one tenth of one percent)!  We get this value by looking up the probabilities for nuts and subtracting from 1, to get the probability of not having nuts.

Here's our table of probabilities again, this time with an extra column for "No Nuts".

             P(Nuts)     P(No Nuts)
Small:       0/5 (0%)    5/5 (100%)
Medium:      2/5 (40%)   3/5 (60%)
Large:       3/5 (60%)   2/5 (40%)
---
Round:       3/5 (60%)   2/5 (40%)
Square:      2/5 (40%)   3/5 (60%)
Long:        0/5 (0%)    5/5 (100%)
---
Dark:        5/5 (100%)  0/5 (0%)
Light:       0/5 (0%)    5/5 (100%)
White:       0/5 (0%)    5/5 (100%)
---
Wrapping:    1/4 (25%)   3/4 (75%)
No Wrapping: 4/11(36%)   7/11(64%)

The probabilities for "no nuts" give us 0.4 * 0.4 * 0.01 * 0.64 = 0.001, or 0.1%.  Comparing the odds of 13 to 0.1, it suddenly seems a lot more likely that this chocolate contains nuts, than not.

Instead of looking at a final value like 13% or 0.1%, we could ask "how much more likely is it that this particular chocolate contains nuts, than not?" In this case, the chocolate is 130 times more likely to contain nuts, than not.

At the end of the day, what we really want is a way to pick any chocolate and automatically label it "nuts" or "no nuts".  If we get this correct most of the time, we'll be happy.  The procedure is simple: pick a chocolate, look at it's characteristics, and calculate a value for "nuts" vs. "no nuts". Whichever value is higher, that's how we label it. That, basically, is a naive Bayesian binary classifier.

No comments:

Productivity and Note-taking

I told a friend of mine that I wasn't really happy with the amount of time that gets taken up by Slack and "communication and sched...