All models are wrong, but some are useful
In reality, a doctor doesn’t diagnose the flu just based on whether you have a fever; she takes a whole bunch of symptoms into account, including whether you have a cough, a sore throat, a runny nose, a headache, chills, and so on. So what we really need to compute is P(flu | fever, cough, sore throat, runny nose, headache, chills, … ) . By Bayes’ theorem, we know that this is proportional to P(fever, cough, sore throat, runny nose, headache, chills, … | flu) . But now we run into a problem. How are we supposed to estimate this probability? If each symptom is a Boolean variable (you either have it or you don’t) and the doctor takes n symptoms into account, a patient could have 2 n possible combinations of symptoms. If we have, say, twenty symptoms and a database of ten thousand patients, we’ve only seen a small fraction of the roughly one million possible combinations. Worse still, to accurately estimate the probability of a particular combination, we need at least tens of observations of it, meaning the database would need to include tens of millions of patients. Add another ten symptoms, and we’d need more patients than there are people on Earth. With a hundred symptoms, even if we were somehow able to magically get the data, there wouldn’t be enough space on all the hard disks in the world to store all the probabilities. And if a patient walks in with a combination of symptoms we haven’t seen before, we won’t know how to diagnose him. We’re face-to-face with our old foe: the combinatorial explosion.
Therefore we do what we always have to do in life: compromise. We make simplifying assumptions that whittle the number of probabilities we have to estimate down to something manageable. A very simple and popular assumption is that all the effects are independent given the cause. This means that, for example, having a fever doesn’t change how likely you are to also have a cough, if we already know you have the flu. Mathematically, this is saying that P(fever, cough | flu) is just P(fever | flu) × P(cough | flu) . Lo and behold: each of these is easy to estimate from a small number of observations. In fact, we did it for fever in the previous section, and it would be no different for cough or any other symptom. The number of observations we need no longer goes up exponentially with the number of symptoms; in fact, it doesn’t go up at all.
Notice that we’re only saying that fever and cough are independent given that you have the flu, not overall. Clearly, if we don’t know whether you have the flu, fever and cough are highly correlated, since you’re much more likely to have a cough if you already have a fever. P(fever, cough) is not equal to P(fever) × P(cough) . All we’re saying is that, if we know you have the flu, knowing whether you have a fever gives us no additional information about whether you have a cough. Likewise, if you don’t know the sun is about to rise and you see the stars fade, your expectation that the sky will lighten increases; but if you already know that sunrise is imminent, seeing the stars fade makes no difference.
Notice also that it’s only thanks to Bayes’ theorem that we were able to pull off this trick. If we wanted to directly estimate P(flu | fever, cough, etc.) , without first turning it into P(fever, cough, etc. | flu) using the theorem, we’d still need an exponential number of probabilities, one for each combination of symptoms and flu/not flu.
A learner that uses Bayes’ theorem and assumes the effects are independent given the cause is called a Naïve Bayes classifier. That’s because, well, that’s such a naïve assumption. In reality, having a fever makes having a cough more likely, even if you already know you have the flu, because (for example) it makes you more likely to have a bad flu. But machine learning is the art of making false assumptions and getting away with it. As the statistician George Box famously put it: “All models are wrong, but some are useful.” An oversimplified model that you have enough data to estimate is better than a perfect one that you don’t. It’s astonishing how simultaneously very wrong and very useful some models can be. The economist Milton Friedman even argued in a highly influential essay that the best theories are the most oversimplified, provided their predictions are accurate, because they explain the most with the least. That seems to me like a bridge too far, but it illustrates that, counter to Einstein’s dictum, science often progresses by making things as simple as possible, and then some.
No one is sure who invented the Naïve Bayes algorithm. It was mentioned without attribution in a 1973 pattern recognition textbook, but it only took off in the 1990s, when researchers noticed that, surprisingly, it was often more accurate than much more sophisticated learners. I was a graduate student at the time, and when I belatedly decided to include Naïve Bayes in my experiments, I was shocked to find it did better than all the other algorithms I was comparing, save one-luckily, the algorithm I was developing for my thesis, or I might not be here now.
Naïve Bayes is now very widely used. For example, it forms the basis of many spam filters. It all began when David Heckerman, a prominent Bayesian researcher who is also a medical doctor, had the idea of treating spam as a disease whose symptoms are the words in the e-mail: Viagra is a symptom, and so is free , but your best friend’s first name probably signals a legit e-mail. We can then use Naïve Bayes to classify e-mails into spam and nonspam, provided spammers generate e-mails by picking words at random. That’s a ridiculous assumption, of course: it would only be true if sentences had no syntax and no content. But that summer Mehran Sahami, then a Stanford graduate student, tried it out during an internship at Microsoft Research, and it worked great. When Bill Gates asked Heckerman how this could be, he pointed out that to identify spam you don’t need to understand the details of the message; it’s enough to get the gist of it by seeing which words it contains.
A basic search engine also uses an algorithm quite similar to Naïve Bayes to decide which web pages to return in answer to your query. The main difference is that, instead of spam/not-spam, it’s trying to predict relevant/not-relevant. The list of prediction problems Naïve Bayes has been applied to is practically endless. Peter Norvig, director of research at Google, told me at one point that it was the most widely used learner there, and Google uses machine learning in every nook and cranny of what it does. It’s not hard to see why Naïve Bayes would be popular among Googlers. Surprising accuracy aside, it scales great; learning a Naïve Bayes classifier is just a matter of counting how many times each attribute co-occurs with each class and takes barely longer than reading the data from disk.
You could even use Naïve Bayes, tongue-in-cheek, on a much larger scale than Google’s: to model the whole universe. Indeed, if you believe in an omnipotent God, then you can model the universe as a vast Naïve Bayes distribution where everything that happens is independent given God’s will. The catch, of course, is that we can’t read God’s mind, but in Chapter 8 we’ll investigate how to learn Naïve Bayes models even when we don’t know the classes of the examples.
It might not seem so at first, but Naïve Bayes is closely related to the perceptron algorithm. The perceptron adds weights and Naïve Bayes multiplies probabilities, but if you take a logarithm, the latter reduces to the former. Both can be seen as generalizations of simple If … then … rules, where each antecedent can count more or less toward the conclusion instead of being “all or none.” This is just one example of the deeper connections among learners that hint at a Master Algorithm. You may not consciously know Bayes’ theorem (well, now you do), but in a way every one of the ten billion neurons in your brain is a tiny instance of it.
Читать дальше