Our training data consists of the set $D = \{(\mathbf{x}_1, y_1), \dots, (\mathbf{x}_n, y_n)\}$ drawn from some
unknown distribution $P(X, Y)$. Because all pairs are sampled i.i.d.,
we obtain
$$P(D) = P\big((\mathbf{x}_1, y_1), \dots, (\mathbf{x}_n, y_n)\big) = \prod_{i=1}^{n} P(\mathbf{x}_i, y_i).$$
If we do have enough data, we could
estimate $P(X, Y)$ similar to the coin example in the
previous lecture, where we imagine a
gigantic die that has one side for each possible value of
$(\mathbf{x}, y)$. We can estimate the probability that one specific
side comes up through counting:
$$\hat P(\mathbf{x}, y) = \frac{\sum_{i=1}^{n} I(\mathbf{x}_i = \mathbf{x} \wedge y_i = y)}{n},$$
where $I(\mathbf{x}_i = \mathbf{x} \wedge y_i = y) = 1$ if $\mathbf{x}_i = \mathbf{x}$ and $y_i = y$, and $0$ otherwise.
Of course, if we are primarily interested in predicting the label
$y$ from the features $\mathbf{x}$, we may estimate $P(y \mid \mathbf{x})$ directly
instead of $P(\mathbf{x}, y)$. We can then use the Bayes Optimal Classifier,
$h(\mathbf{x}) = \operatorname*{argmax}_y \hat P(y \mid \mathbf{x})$, for a
specific $\mathbf{x}$ to make predictions.
So how can we estimate $\hat P(y \mid \mathbf{x})$? Previously we have
derived that $P(y \mid \mathbf{x}) = \frac{P(\mathbf{x}, y)}{P(\mathbf{x})}$. Similarly,
$\hat P(\mathbf{x}, y) = \frac{\sum_{i=1}^{n} I(\mathbf{x}_i = \mathbf{x} \wedge y_i = y)}{n}$ and
$\hat P(\mathbf{x}) = \frac{\sum_{i=1}^{n} I(\mathbf{x}_i = \mathbf{x})}{n}$. We can put these two together:
$$\hat P(y \mid \mathbf{x}) = \frac{\hat P(\mathbf{x}, y)}{\hat P(\mathbf{x})} = \frac{\sum_{i=1}^{n} I(\mathbf{x}_i = \mathbf{x} \wedge y_i = y)}{\sum_{i=1}^{n} I(\mathbf{x}_i = \mathbf{x})}.$$
The Venn diagram illustrates that the MLE method estimates
$\hat P(y \mid \mathbf{x})$ as the fraction of training points with features identical to $\mathbf{x}$ that also carry the label $y$.
Problem: There is a big problem with this method. The MLE
estimate is only good if there are many training vectors with exactly the
same features as $\mathbf{x}$! In
high dimensional spaces (or with continuous $\mathbf{x}$), this
never happens! So both counts are almost always zero, i.e. $\hat P(\mathbf{x}, y) = 0$ and $\hat P(\mathbf{x}) = 0$, and the estimate is useless.
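To see how quickly exact matches disappear, here is a minimal sketch (not part of the original notes; the sample size and dimensions are made up) that counts how many training points coincide exactly with a fresh test point as the number of binary features grows.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000                                       # made-up number of training points

for d in [5, 10, 20, 30]:                        # number of binary features
    X_train = rng.integers(0, 2, size=(n, d))    # i.i.d. binary feature vectors
    x_test = rng.integers(0, 2, size=d)          # one new test point
    matches = np.sum(np.all(X_train == x_test, axis=1))
    print(f"d={d:2d}: {matches} of {n} training points match x exactly")
# For d=30 the expected number of exact matches is n / 2^30, i.e. essentially zero,
# so both the numerator and the denominator of the MLE estimate vanish.
```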
Naive Bayes
We can approach this dilemma with a simple trick and an additional
assumption. The trick part is to estimate $P(y)$ and $P(\mathbf{x} \mid y)$ instead, since, by Bayes rule,
$$P(y \mid \mathbf{x}) = \frac{P(\mathbf{x} \mid y)\, P(y)}{P(\mathbf{x})}.$$
Recall from
Estimating Probabilities from Data
that estimating $P(y)$ and $P(\mathbf{x} \mid y)$ is called
generative learning.
Estimating $P(y)$ is easy. For example, if $Y$ takes on discrete
binary values, estimating $P(Y)$ reduces to coin tossing. We simply need
to count how many times we observe each outcome (in this case each class):
$$\hat P(y = c) = \frac{\sum_{i=1}^{n} I(y_i = c)}{n} = \hat\pi_c.$$
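As a quick sketch (the labels below are made up, not from the notes), this counting estimate is a couple of lines of Python:

```python
import numpy as np

y = np.array([1, 1, -1, 1, -1, -1, 1])        # toy labels
classes, counts = np.unique(y, return_counts=True)
pi_hat = counts / len(y)                      # estimate of P(y=c): (count of class c) / n
for c, p in zip(classes, pi_hat):
    print(c, p)                               # -1 -> 3/7, 1 -> 4/7
```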
Estimating $P(\mathbf{x} \mid y)$, however, is not easy! The additional
assumption that we make is the Naive Bayes assumption.
Naive Bayes Assumption:
$$P(\mathbf{x} \mid y) = \prod_{\alpha=1}^{d} P(x_\alpha \mid y), \quad \text{where } x_\alpha = [\mathbf{x}]_\alpha \text{ is the value of the } \alpha\text{-th feature of } \mathbf{x},$$
i.e., feature values are independent given the label! This is a very
bold assumption.
For example, a setting where the Naive Bayes classifier is often used is
spam filtering. Here, the data are emails and the label is spam or
not-spam. The Naive Bayes assumption implies that the words in an
email are conditionally independent, given that you know whether the email is
spam or not. Clearly this is not true: neither spam nor not-spam emails
have their words drawn independently at random. However, the resulting
classifiers can work well in practice even if this assumption is violated.
Illustration behind the Naive Bayes algorithm. We estimate
$P(x_\alpha \mid y)$ independently in each dimension (middle two images)
and then obtain an estimate of the full data distribution by assuming
conditional independence (rightmost image).
So, for now, let's pretend the Naive Bayes assumption holds. Then the
Bayes Classifier can be defined as
$$h(\mathbf{x}) = \operatorname*{argmax}_y P(y \mid \mathbf{x}) = \operatorname*{argmax}_y \frac{P(\mathbf{x} \mid y)\, P(y)}{P(\mathbf{x})} = \operatorname*{argmax}_y P(\mathbf{x} \mid y)\, P(y) = \operatorname*{argmax}_y \prod_{\alpha=1}^{d} P(x_\alpha \mid y)\, P(y) = \operatorname*{argmax}_y \sum_{\alpha=1}^{d} \log P(x_\alpha \mid y) + \log P(y).$$
(Here we dropped the denominator $P(\mathbf{x})$ because it does not depend on $y$, and took the logarithm, which is monotonic and therefore does not change the argmax.)
Estimating $\log P(x_\alpha \mid y)$ is easy as we only need to consider one
dimension. And estimating $P(y)$ is not affected by the assumption.
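The decision rule above is just an argmax over per-class scores: the log-prior plus the sum of per-dimension log-likelihoods. A minimal generic sketch (the helper names, and the assumption that the per-dimension log-likelihoods are supplied as a function, are ours rather than from the notes):

```python
def naive_bayes_predict(x, log_prior, log_likelihood):
    """Return argmax_c [ log P(y=c) + sum_alpha log P(x_alpha | y=c) ].

    log_prior:      dict mapping class c -> estimated log P(y=c)
    log_likelihood: function (alpha, value, c) -> estimated log P(x_alpha = value | y=c),
                    produced by one of the estimators discussed below
    """
    scores = {c: lp + sum(log_likelihood(a, v, c) for a, v in enumerate(x))
              for c, lp in log_prior.items()}
    return max(scores, key=scores.get)
```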
Estimating $P(x_\alpha \mid y)$
Now we know how to use our assumption to make the estimation of
$P(y \mid \mathbf{x})$ tractable. There are three notable cases in which we can use
our Naive Bayes classifier.
Case #1: Categorical features
Illustration of categorical NB. For $d$-dimensional data,
there exist $d$ independent dice for each class. Each feature
has one die per class. We assume training samples were generated
by rolling one die after another. The value in dimension $\alpha$
corresponds to the outcome that was rolled with the
$\alpha$-th die.
Features: Each feature $x_\alpha$ falls into one of $K_\alpha$ categories, i.e.
$x_\alpha \in \{1, 2, \dots, K_\alpha\}$. (Note that the case
with binary features is just a specific case of this, where $K_\alpha = 2$.) An example of such a setting may be medical data where one feature
could be marital status (single / married).
Model $P(x_\alpha \mid y)$:
$$P(x_\alpha = j \mid y = c) = [\theta_{jc}]_\alpha \quad \text{and} \quad \sum_{j=1}^{K_\alpha} [\theta_{jc}]_\alpha = 1,$$
where $[\theta_{jc}]_\alpha$ is the probability of feature $\alpha$ having
the value $j$, given that the label is $c$. The constraint indicates
that $x_\alpha$ must take on exactly one of the $K_\alpha$ categories.
Parameter estimation:
$$[\hat\theta_{jc}]_\alpha = \frac{\sum_{i=1}^{n} I(y_i = c)\, I(x_{i\alpha} = j) + l}{\sum_{i=1}^{n} I(y_i = c) + l\, K_\alpha},$$
where $x_{i\alpha} = [\mathbf{x}_i]_\alpha$ is the $\alpha$-th feature value of the $i$-th training sample and $l$ is a
smoothing parameter. By setting $l = 0$ we get an MLE estimator; $l > 0$
leads to MAP. If we set $l = +1$ we get
Laplace smoothing.
In words (without the hallucinated smoothing samples) this means essentially the following:
the categorical feature model associates a special die with each feature
and label. The generative model that we are assuming is that the data was
generated by first choosing the label (e.g. "healthy person"). That
label comes with a set of "dice", one for each dimension. The
generator picks each die, rolls it, and fills in the feature value with
the outcome of the roll. So if there are $C$ possible labels and $d$
dimensions we are estimating $C \times d$ "dice" from the data.
However, per data point only $d$ dice are rolled (one for each
dimension). Die $\alpha$ (for any label) has $K_\alpha$ possible
"sides". Of course this is not how the data is generated in reality - but
it is a modeling assumption that we make. We then learn these models from
the data and during test time see which model is more likely given the
sample.
Prediction:
$$h(\mathbf{x}) = \operatorname*{argmax}_c \; \hat P(y = c \mid \mathbf{x}) = \operatorname*{argmax}_c \; \hat\pi_c \prod_{\alpha=1}^{d} [\hat\theta_{x_\alpha c}]_\alpha.$$
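Putting the estimation and prediction formulas together, here is a minimal sketch in Python (our own illustration, assuming features are encoded as integers $0, \dots, K_\alpha - 1$ and using Laplace smoothing $l = 1$; all function and variable names and the toy data are made up):

```python
import numpy as np

def fit_categorical_nb(X, y, K, l=1.0):
    """Estimate pi_c and theta[c][alpha][j] ~ P(x_alpha = j | y = c) with smoothing l."""
    classes = np.unique(y)
    n, d = X.shape
    pi = {c: np.mean(y == c) for c in classes}
    theta = {c: [(np.array([np.sum((y == c) & (X[:, a] == j)) for j in range(K[a])]) + l)
                 / (np.sum(y == c) + l * K[a])
                 for a in range(d)]
             for c in classes}
    return pi, theta

def predict_categorical_nb(x, pi, theta):
    """argmax_c  log(pi_c) + sum_alpha log theta[c][alpha][x_alpha]."""
    scores = {c: np.log(pi[c]) + sum(np.log(theta[c][a][x[a]]) for a in range(len(x)))
              for c in pi}
    return max(scores, key=scores.get)

# toy usage: two binary features encoded as 0/1, labels in {0, 1}
X = np.array([[0, 1], [1, 1], [0, 0], [1, 0]])
y = np.array([0, 0, 1, 1])
pi, theta = fit_categorical_nb(X, y, K=[2, 2])
print(predict_categorical_nb(np.array([0, 1]), pi, theta))   # predicts class 0
```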
Case #2: Multinomial features
Illustration of multinomial NB. There are only as many dice as
classes. Each die has $d$ sides. The value of the $\alpha$-th
feature shows how many times this particular side was rolled.
If feature values don't represent categories (e.g. single/married) but
counts, we need to use a different model. E.g., in text document
categorization, the feature value $x_\alpha = j$ means that in this particular
document the $\alpha$-th word in my dictionary appears $j$
times. Let us consider the example of spam filtering. Imagine the
$\alpha$-th word is indicative of being "spam". Then if $x_\alpha = 10$ this
means that this email is likely spam (as word $\alpha$
appears 10 times in it). And another email with $x'_\alpha = 20$ should
be even more likely to be spam (as the spammy word appears twice as
often). With categorical features this is not guaranteed. It could be that
the training set does not contain any email that contains word $\alpha$
exactly 20 times. In this case you would simply get the hallucinated
smoothing values for both spam and not-spam - and the signal is lost. We
need a model that incorporates our knowledge that features are counts -
this will help us during estimation (you don't have to see a training
email with exactly the same number of word occurrences) and during
inference/testing (as you will obtain these monotonicities that one might
expect). The multinomial distribution does exactly that.
Features: Each feature value $x_\alpha \in \{0, 1, 2, \dots, m\}$ represents
a count, where $m = \sum_{\alpha=1}^{d} x_\alpha$ is the length of the sequence. An example of this could be the
count of a specific word $\alpha$ in a document of length $m$, where $d$
is the size of the vocabulary.
Model $P(\mathbf{x} \mid y)$:
Use the multinomial distribution
$$P(\mathbf{x} \mid m, y = c) = \frac{m!}{x_1!\, x_2! \cdots x_d!} \prod_{\alpha=1}^{d} \left(\theta_{\alpha c}\right)^{x_\alpha},$$
where $\theta_{\alpha c}$ is
the probability of selecting word $\alpha$ given class $c$, and $\sum_{\alpha=1}^{d} \theta_{\alpha c} = 1$. So, we can use this to generate a spam email, i.e.,
a document $\mathbf{x}$ of class $y = \text{spam}$, by picking $m$
words independently at random from the vocabulary of $d$ words using
$P(\mathbf{x} \mid y = \text{spam})$.
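For illustration (a sketch with made-up numbers, not from the notes), sampling one such "document" from this generative model is a single call to numpy's multinomial sampler:

```python
import numpy as np

rng = np.random.default_rng(0)
theta_spam = np.array([0.5, 0.2, 0.1, 0.1, 0.1])   # theta_{alpha,spam} over d = 5 words, sums to 1
m = 20                                             # document length
x = rng.multinomial(m, theta_spam)                 # word-count vector of one "spam" document
print(x, x.sum())                                  # d counts that sum to m
```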
Parameter estimation:
$$\hat\theta_{\alpha c} = \frac{\sum_{i=1}^{n} I(y_i = c)\, x_{i\alpha} + l}{\sum_{i=1}^{n} I(y_i = c)\, m_i + l\, d},$$
where $m_i = \sum_{\beta=1}^{d} x_{i\beta}$ denotes the number of words in
document $i$. The numerator sums up all counts for feature $x_\alpha$
and the denominator sums up all counts of all features across all data
points. E.g., the estimate for word $\alpha$ in the spam class is (up to smoothing) the number of times word $\alpha$ appears across all spam emails, divided by the total number of words in all spam emails. Again, $l$
is the smoothing parameter.
Prediction:
$$h(\mathbf{x}) = \operatorname*{argmax}_c \; \hat P(y = c \mid \mathbf{x}) = \operatorname*{argmax}_c \; \hat\pi_c \prod_{\alpha=1}^{d} \hat\theta_{\alpha c}^{\,x_\alpha}.$$
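A minimal end-to-end sketch of the multinomial estimator and prediction rule (our own illustration; X holds raw word counts, and the names and the toy data are made up):

```python
import numpy as np

def fit_multinomial_nb(X, y, l=1.0):
    """theta[c, alpha] = (counts of word alpha in class c + l) / (total words in class c + l*d)."""
    classes = np.unique(y)
    d = X.shape[1]
    pi = np.array([np.mean(y == c) for c in classes])
    theta = np.array([(X[y == c].sum(axis=0) + l) / (X[y == c].sum() + l * d)
                      for c in classes])
    return classes, pi, theta

def predict_multinomial_nb(x, classes, pi, theta):
    """argmax_c  log(pi_c) + sum_alpha x_alpha * log(theta[c, alpha])."""
    scores = np.log(pi) + x @ np.log(theta).T
    return classes[np.argmax(scores)]

# toy usage: word counts over a 3-word vocabulary, label 1 = spam, 0 = not-spam
X = np.array([[3, 0, 1], [4, 1, 0], [0, 2, 3], [1, 3, 2]])
y = np.array([1, 1, 0, 0])
classes, pi, theta = fit_multinomial_nb(X, y)
print(predict_multinomial_nb(np.array([5, 0, 1]), classes, pi, theta))  # predicts 1 (spam)
```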
Case #3: Continuous features (Gaussian Naive Bayes)
Illustration of Gaussian NB. Each class-conditional feature
distribution $P(x_\alpha \mid y = c)$ is assumed to originate from an
independent Gaussian distribution with its own mean $\mu_{\alpha c}$
and variance $\sigma^2_{\alpha c}$.
Features: $x_\alpha \in \mathbb{R}$ (each feature takes on a real value).
Model $P(x_\alpha \mid y)$: Use a Gaussian distribution,
$$P(x_\alpha \mid y = c) = \mathcal{N}\!\left(\mu_{\alpha c}, \sigma^2_{\alpha c}\right) = \frac{1}{\sqrt{2\pi}\,\sigma_{\alpha c}} \exp\!\left(-\frac{(x_\alpha - \mu_{\alpha c})^2}{2\,\sigma^2_{\alpha c}}\right).$$
Note that the model specified above is based on
our assumption about the data - that each feature $\alpha$ comes from a
class-conditional Gaussian distribution. The full distribution is
$P(\mathbf{x} \mid y = c) \sim \mathcal{N}(\boldsymbol{\mu}_c, \Sigma_c)$, where
$\Sigma_c$ is a diagonal covariance matrix with
$[\Sigma_c]_{\alpha,\alpha} = \sigma^2_{\alpha c}$.
Parameter estimation:
$$\mu_{\alpha c} \leftarrow \frac{1}{n_c} \sum_{i=1}^{n} I(y_i = c)\, x_{i\alpha}, \qquad \sigma^2_{\alpha c} \leftarrow \frac{1}{n_c} \sum_{i=1}^{n} I(y_i = c)\, (x_{i\alpha} - \mu_{\alpha c})^2, \qquad \text{where } n_c = \sum_{i=1}^{n} I(y_i = c).$$
As always, we estimate the parameters of the distributions for each
dimension and class independently. Gaussian distributions only have two
parameters, the mean and the variance. The mean $\mu_{\alpha c}$ is
estimated by the average feature value of dimension $\alpha$ over all
samples with label $c$. The variance $\sigma^2_{\alpha c}$ is simply the
average squared deviation from this mean, again computed only over the samples with label $c$.
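These estimates translate directly into numpy (a sketch under our own naming; the small constant added to the variance is an implementation detail to avoid division by zero, not part of the notes):

```python
import numpy as np

def fit_gaussian_nb(X, y, eps=1e-9):
    """Estimate pi_c, mu[c, alpha] and var[c, alpha] per class and dimension."""
    classes = np.unique(y)
    pi = np.array([np.mean(y == c) for c in classes])
    mu = np.array([X[y == c].mean(axis=0) for c in classes])          # shape (C, d)
    var = np.array([X[y == c].var(axis=0) + eps for c in classes])    # shape (C, d)
    return classes, pi, mu, var

def predict_gaussian_nb(x, classes, pi, mu, var):
    """argmax_c  log(pi_c) + sum_alpha log N(x_alpha; mu[c, alpha], var[c, alpha])."""
    log_lik = -0.5 * (np.log(2 * np.pi * var) + (x - mu) ** 2 / var).sum(axis=1)
    return classes[np.argmax(np.log(pi) + log_lik)]

# toy usage with made-up two-dimensional data
X = np.array([[1.0, 2.0], [1.2, 1.8], [3.0, 0.5], [2.8, 0.7]])
y = np.array([0, 0, 1, 1])
print(predict_gaussian_nb(np.array([1.1, 1.9]), *fit_gaussian_nb(X, y)))  # predicts 0
```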
Naive Bayes is a linear classifier
Naive Bayes leads to a linear decision boundary in many common
cases. Illustrated here is the case where $P(x_\alpha \mid y)$ is
Gaussian and where $\sigma_{\alpha c}$ is identical for all $c$
(but can differ across dimensions $\alpha$). The boundaries of the
ellipsoids indicate regions of equal probability
$P(\mathbf{x} \mid y)$. The red decision line indicates the decision
boundary where $P(y = 1 \mid \mathbf{x}) = P(y = 2 \mid \mathbf{x})$.
1. Suppose that $y_i \in \{-1, +1\}$ and features are multinomial.
We can show that
$$h(\mathbf{x}) = \operatorname*{argmax}_y \, P(y) \prod_{\alpha=1}^{d} P(x_\alpha \mid y) = \operatorname{sign}(\mathbf{w}^\top \mathbf{x} + b).$$
That is, $\mathbf{w}^\top \mathbf{x} + b > 0 \Longleftrightarrow h(\mathbf{x}) = +1$.
As before, we define $P(x_\alpha \mid y = +1) \propto \theta_{\alpha+}^{x_\alpha}$, $P(y = +1) = \pi_+$, and set
$$[\mathbf{w}]_\alpha = \log(\theta_{\alpha+}) - \log(\theta_{\alpha-}), \qquad b = \log(\pi_+) - \log(\pi_-).$$
If we use the above to do classification, we can compute $\mathbf{w}^\top \mathbf{x} + b$. Simplifying this further leads to
$$\mathbf{w}^\top \mathbf{x} + b > 0
\;\Longleftrightarrow\;
\sum_{\alpha=1}^{d} x_\alpha \big(\log(\theta_{\alpha+}) - \log(\theta_{\alpha-})\big) + \log(\pi_+) - \log(\pi_-) > 0
\;\Longleftrightarrow\;
\log\frac{P(\mathbf{x} \mid y = +1)\,\pi_+}{P(\mathbf{x} \mid y = -1)\,\pi_-} > 0
\;\Longleftrightarrow\;
P(y = +1 \mid \mathbf{x}) > P(y = -1 \mid \mathbf{x}).$$
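To make this concrete, the following sketch (made-up parameters, not from the notes) builds $\mathbf{w}$ and $b$ from multinomial estimates and checks that $\operatorname{sign}(\mathbf{w}^\top \mathbf{x} + b)$ reproduces the Naive Bayes decision:

```python
import numpy as np

# made-up multinomial parameters for two classes +1 / -1 over d = 4 words
theta_pos = np.array([0.4, 0.3, 0.2, 0.1])
theta_neg = np.array([0.1, 0.2, 0.3, 0.4])
pi_pos, pi_neg = 0.6, 0.4

w = np.log(theta_pos) - np.log(theta_neg)     # [w]_alpha = log(theta_alpha+) - log(theta_alpha-)
b = np.log(pi_pos) - np.log(pi_neg)           # b = log(pi_+) - log(pi_-)

x = np.array([3, 1, 0, 2])                    # a word-count vector
nb_score_pos = np.log(pi_pos) + x @ np.log(theta_pos)
nb_score_neg = np.log(pi_neg) + x @ np.log(theta_neg)

print(np.sign(w @ x + b))                     # linear classifier decision
print(np.sign(nb_score_pos - nb_score_neg))   # Naive Bayes decision, identical by construction
```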
2. In the case of continuous features (Gaussian Naive Bayes), we can show
that
$$P(y \mid \mathbf{x}) = \frac{1}{1 + e^{-y(\mathbf{w}^\top \mathbf{x} + b)}}.$$
This model is also known as logistic regression. NB
and LR produce asymptotically the same model if the Naive Bayes assumption
holds.
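A brief sketch of why this holds (our own derivation outline, under the shared-variance assumption $\sigma_{\alpha+} = \sigma_{\alpha-} = \sigma_\alpha$ illustrated in the figure above): writing out the posterior for $y = +1$ and cancelling the normalization constants of the Gaussians gives
$$P(y = +1 \mid \mathbf{x}) = \frac{P(\mathbf{x} \mid y = +1)\,\pi_+}{P(\mathbf{x} \mid y = +1)\,\pi_+ + P(\mathbf{x} \mid y = -1)\,\pi_-} = \frac{1}{1 + e^{-(\mathbf{w}^\top \mathbf{x} + b)}},
\quad \text{with } [\mathbf{w}]_\alpha = \frac{\mu_{\alpha+} - \mu_{\alpha-}}{\sigma_\alpha^2}, \quad b = \log\frac{\pi_+}{\pi_-} + \sum_{\alpha=1}^{d} \frac{\mu_{\alpha-}^2 - \mu_{\alpha+}^2}{2\,\sigma_\alpha^2},$$
which is exactly the logistic form above for $y = +1$.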