One thing I love to do to practice machine learning is code an algorithm myself. I mean without external packages, just trying to implement all the math and everything else myself.
Naive Bayes is a great algorithm for this, because it’s conceptually simple and it gets pretty decent results. But before we can talk about Naive Bayes, we have to talk about conditional probability.
To understand most statistics, we need to understand conditional probability. Bayes’ theorem states that

P(A|B) = P(B|A) · P(A) / P(B)

where P(A|B) is the probability that A is true, given that B is true. Bayesian analysis is popular because it gives us a way to inject prior knowledge – practical subjectivity – into a model.
For example, what is the probability someone will have heart problems? Well, if we take into account whether or not the person smokes, drinks, exercises, and eats healthy foods, we will get a better estimate. This is just a brief explanation of Bayes’ theorem, and if this is your first time seeing it, you can learn more here.
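As a quick sanity check on the formula, here’s the heart-problems example in code; every number below is invented purely for illustration:

```python
# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
# A = "has heart problems", B = "smokes" -- all numbers are made up.
p_a = 0.10          # P(A): base rate of heart problems
p_b_given_a = 0.50  # P(B|A): probability someone with heart problems smokes
p_b = 0.20          # P(B): overall smoking rate

p_a_given_b = p_b_given_a * p_a / p_b
print(p_a_given_b)  # 0.25 -- knowing the person smokes raises our estimate
```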
Naive Bayes is an algorithm that uses basic probability to classify. Let’s define a few probabilities in the context of an example so it’s clearer. In this example we are classifying students as accepted or rejected to some school (the data is from UCLA!).
The prior – The probability of the class. In this case, if A = accepted, then

P(A) = N_A / N

where N_A = number of accepted students in the data, and N = number of total students in the data.
Likelihood – This is a conditional probability. If we assume some student (some vector x) is accepted, then we can calculate the probability of their having a GPA as low or as high as they did, given they were accepted.
If we are calculating this for real values, we use a Normal distribution. For binary values, a Bernoulli distribution, and for categorical values, a multinoulli (categorical) distribution. We can use a different distribution for each variable within the same model.
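For instance, for a real-valued feature like GPA, the likelihood is just the Normal density evaluated at the observed value. A minimal sketch in Python (the post uses R, but notes Python works just as well; the function name and the numbers here are mine):

```python
import math

def gaussian_likelihood(x, mean, sd):
    """Density of Normal(mean, sd^2) at x: the likelihood of a
    real-valued feature value given a class."""
    return math.exp(-((x - mean) ** 2) / (2 * sd ** 2)) / (sd * math.sqrt(2 * math.pi))

# e.g. likelihood of a 3.5 GPA if accepted students had mean 3.4, sd 0.3
# (hypothetical parameters, not estimates from the UCLA data)
print(gaussian_likelihood(3.5, 3.4, 0.3))
```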
Posterior – The prior times the likelihood, normalized. In math this looks like the prior times the likelihood, divided by the probability of the vector x. When we’re writing a Naive Bayes algorithm, however, we don’t care about the normalizing, i.e. dividing by the probability of vector x. That denominator is constant across classes and has no impact on our classification, so we can simply ignore it.
When calculating the likelihood, we have to do it for each feature. We can simply multiply the per-feature probabilities together, because the model assumes conditional independence between the features given the class – a strong, likely false assumption. This is why the algorithm is called “naive”.
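Putting that together, the unnormalized posterior for a class is the prior times the product of the per-feature likelihoods. A sketch under the naive assumption, with Gaussian likelihoods and placeholder parameters (not estimates from the UCLA data):

```python
import math

def gaussian(x, mean, sd):
    # Normal density at x
    return math.exp(-((x - mean) ** 2) / (2 * sd ** 2)) / (sd * math.sqrt(2 * math.pi))

def unnormalized_posterior(features, prior, params):
    """prior * product of per-feature Gaussian likelihoods.
    params is a list of (mean, sd) pairs, one per feature."""
    score = prior
    for x, (mean, sd) in zip(features, params):
        score *= gaussian(x, mean, sd)
    return score

# Two features (GPA, GRE) with made-up class-conditional parameters
score = unnormalized_posterior([3.5, 600.0], prior=0.32,
                               params=[(3.4, 0.3), (580.0, 115.0)])
```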
Which students get accepted to UCLA?
We have two classes – accepted and rejected. We have three variables – GRE score, GPA, and rank. We can use each of these to write a classifier.
I’m using R. I chose it over Python for this one because it felt so statistical, but you could just as easily do this in Python.
I’ll give the important statistical ideas behind the model right here, but to do this yourself or to see every line of code, go to my github.
‘train’ is the training data, which I randomly sampled by class.
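One way to do that by-class (stratified) sampling, sketched in Python with a hypothetical helper name:

```python
import random

def stratified_indices(labels, train_frac=0.7, seed=0):
    """Split row indices into train/test, sampling randomly *within*
    each class so the class proportions are preserved."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in set(labels):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        cut = int(len(idx) * train_frac)
        train_idx += idx[:cut]
        test_idx += idx[cut:]
    return train_idx, test_idx
```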
First we need to calculate the prior probabilities for each class. My favorite way to do this is with a subset. If you want to save memory, you can compress this into two lines.
note – sorry it’s a picture and not code! I really dislike WordPress for code snippets.
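Since the snippet is an image, here’s roughly the same prior calculation as text; I’m sketching it in Python rather than transcribing the R, and the helper name is mine:

```python
def priors(labels):
    """Class priors: the count of each class divided by the total."""
    n = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return {cls: c / n for cls, c in counts.items()}

# e.g. priors([1, 0, 0, 1, 0]) -> {1: 0.4, 0: 0.6}
```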
Next, we need to calculate the likelihood. This is a bit more complicated, as we have to use a Gaussian distribution for each variable. I made a dictionary to keep my code concise, rather than defining three different means and standard deviations. I recommend you explore dictionaries in R a bit.
I go on to test this function and make sure it works properly. It does.
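In the same spirit as the dictionary idea, the per-class Gaussian parameters can be kept in one lookup structure instead of six separate variables. A Python sketch (the names and the plain-list layout are my own, not the post’s R code):

```python
import math

def fit_params(rows, labels):
    """For each class, store a (mean, sd) pair for every feature column.
    rows: list of feature lists; labels: parallel list of class labels."""
    params = {}
    for cls in set(labels):
        # transpose the rows belonging to this class into columns
        cols = list(zip(*[r for r, y in zip(rows, labels) if y == cls]))
        params[cls] = []
        for col in cols:
            mean = sum(col) / len(col)
            var = sum((v - mean) ** 2 for v in col) / (len(col) - 1)  # sample variance
            params[cls].append((mean, math.sqrt(var)))
    return params
```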
And lastly, we need to predict for each row in the test data! We do this simply by taking the class with the maximum posterior probability.
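The prediction step can be sketched as “score every class, keep the argmax”; I work in log space here to avoid underflow from multiplying many small densities (the helper names are mine, not the post’s):

```python
import math

def gaussian(x, mean, sd):
    # Normal density at x
    return math.exp(-((x - mean) ** 2) / (2 * sd ** 2)) / (sd * math.sqrt(2 * math.pi))

def predict(features, priors, params):
    """Return the class with the largest (unnormalized) posterior.
    priors: {class: prior}; params: {class: [(mean, sd), ...]}."""
    best_cls, best_score = None, -math.inf
    for cls, prior in priors.items():
        score = math.log(prior)
        for x, (mean, sd) in zip(features, params[cls]):
            score += math.log(gaussian(x, mean, sd))
        if score > best_score:
            best_cls, best_score = cls, score
    return best_cls
```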
Then, we just test the accuracy. You’ll have to look at my github for that code.
One last thing – It’s been a while since I’ve posted; I’ve been really busy with school. I’ll try to write more in the next few months! 🙂