Organizing the world into objects and categories is second nature to an adult but not to an infant, and even less to Robby the robot. We could endow him with a visual cortex in the form of a multilayer perceptron and show him labeled examples of all the objects and categories in the world (here’s Mommy close up, here’s Mommy far away), but we’d never be done. What we need is an algorithm that will spontaneously group together similar objects, or different images of the same object. This is the problem of clustering, and it’s one of the most intensively studied in machine learning.
A cluster is a set of similar entities, or at a minimum, a set of entities that are more similar to each other than to members of other clusters. It’s human nature to cluster things, and it’s often the first step on the road to knowledge. When we look up at the night sky, we can’t help seeing clusters of stars, and then we fancifully name them after shapes they resemble. Noticing that certain sets of elements had very similar chemical properties was the first step in discovering the periodic table. Each of those sets is now a column in it. Everything we perceive is a cluster, from friends’ faces to speech sounds. Without them, we’d be lost: children can’t learn a language before they learn to identify the characteristic sounds it’s made of, which they do in their first year of life, and all the words they then learn mean nothing without the clusters of real things they refer to. Confronted with big data (a very large number of objects), our first recourse is to group them into a more manageable number of clusters. A whole market is too coarse, and individual customers are too fine, so marketers divide markets into segments, which is their word for clusters. Even objects themselves are at bottom clusters of their observations, from all the different angles light falls on Mommy’s face to all the different sound waves baby hears as the word mommy. And we can’t think without objects, which is perhaps why quantum mechanics is so unintuitive: we want to visualize the subatomic world as particles colliding, or waves interfering, but it’s not really either.
We can represent a cluster by its prototypical element: the image of your mother that you see with your mind’s eye or the quintessential cat, sports car, country house, or tropical beach. Peoria, Illinois, is the average American town, according to marketing lore. Bob Burns, a fifty-three-year-old building maintenance supervisor in Windham, Connecticut, is America’s most ordinary citizen, at least if you believe Kevin O’Keefe’s book The Average American. For entities described by numeric attributes (say, people’s heights, weights, girths, shoe sizes, hair lengths, and so on), the average member is easy to compute: his height is the average height of all the cluster members, his weight the average of all the weights, and so on. For categorical attributes, like gender, hair color, zip code, or favorite sport, the “average” is simply the most frequent value. The average member described by this set of attributes may or may not be a real person, but either way it’s a useful reference to have: if you’re brainstorming how to market a new product, picturing Peoria as the town where you’re launching it or Bob Burns as your target customer beats thinking of abstract entities like “the market” or “the consumer.”
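As a concrete illustration, here is a minimal Python sketch of computing such an “average member” from a handful of made-up records: take the mean of each numeric attribute and the most frequent value of each categorical one. The attribute names and values are hypothetical, not from the book.

```python
from statistics import mean, mode

# Hypothetical records mixing numeric and categorical attributes.
people = [
    {"height_in": 70, "weight_lb": 180, "hair": "brown", "sport": "baseball"},
    {"height_in": 65, "weight_lb": 140, "hair": "black", "sport": "soccer"},
    {"height_in": 68, "weight_lb": 165, "hair": "brown", "sport": "baseball"},
]

def average_member(records):
    """Average for numeric attributes, most frequent value for categorical ones."""
    prototype = {}
    for attr in records[0]:
        values = [r[attr] for r in records]
        if isinstance(values[0], (int, float)):
            prototype[attr] = mean(values)
        else:
            prototype[attr] = mode(values)  # the most common value
    return prototype

print(average_member(people))
# {'height_in': 67.66..., 'weight_lb': 161.66..., 'hair': 'brown', 'sport': 'baseball'}
```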
As useful as such averages are, we can do even better; indeed the whole point of big data and machine learning is to avoid thinking at such a coarse level. Our clusters can be very specialized sets of people or even different aspects of the same person: Alice buying books for work, for leisure, or as Christmas presents; Alice in a good mood versus Alice with the blues. Amazon would like to distinguish the books Alice buys for herself from the ones she buys for her boyfriend, as this would allow it to make appropriate recommendations at appropriate times. Unfortunately, purchases don’t come labeled with “self-gift” or “for Bob,” and Amazon needs to figure out how to group them.
Suppose the entities in Robby’s world fall into five clusters (people, furniture, toys, food, and animals), but we don’t know which things belong to which clusters. This is the type of problem that Robby faces when we switch him on. One simple option for sorting entities into clusters is to pick five random objects as the cluster prototypes, compare each entity with each prototype, and assign it to the most similar prototype’s cluster. (As in analogical learning, the choice of similarity measure is important. If the attributes are numeric, it can be as simple as Euclidean distance, but there are many other options.) We now need to update the prototypes. After all, a cluster’s prototype is supposed to be the average of its members, and although that was necessarily the case when each cluster had only one member, it generally won’t be after we have added a bunch of new members to each cluster. So for each cluster, we compute the average properties of its members and make that the new prototype. At this point, we need to update the cluster memberships again: since the prototypes have moved, the closest prototype to a given entity may also have changed. Let’s imagine the prototype of one cluster was a teddy bear and the prototype of another was a banana. Perhaps on our first run we grouped an animal cracker with the bear, but on the second we grouped it with the banana. An animal cracker initially looked like a toy, but now it looks more like food. Once we reclassify animal crackers into the banana group, perhaps the prototypical item for that group also changes, from a banana to a cookie. This virtuous cycle, with entities assigned to better and better clusters, continues until the assignment of entities to clusters doesn’t change (and therefore neither do the cluster prototypes).
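This assign-then-average loop is short enough to sketch directly. Below is a minimal Python version, assuming each entity is just a tuple of numeric attributes and using Euclidean distance as the similarity measure; the function names and the tiny data set at the end are illustrative, not anything prescribed by the algorithm.

```python
import math
import random
from statistics import mean

def euclidean(a, b):
    """Distance between two numeric attribute vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def k_means(entities, k, max_iters=100):
    """Cluster entities (tuples of numbers) around k prototypes."""
    # Pick k random objects as the initial cluster prototypes.
    prototypes = random.sample(entities, k)
    assignments = None
    for _ in range(max_iters):
        # Assign each entity to the most similar (closest) prototype.
        new_assignments = [
            min(range(k), key=lambda c: euclidean(e, prototypes[c]))
            for e in entities
        ]
        if new_assignments == assignments:
            break  # memberships didn't change, so the prototypes won't either
        assignments = new_assignments
        # Make each prototype the average of its current members.
        for c in range(k):
            members = [e for e, a in zip(entities, assignments) if a == c]
            if members:
                prototypes[c] = tuple(mean(dim) for dim in zip(*members))
    return prototypes, assignments

# Toy example: two obvious groups on a line.
points = [(1.0,), (1.2,), (0.8,), (8.0,), (8.3,), (7.9,)]
protos, labels = k_means(points, k=2)
print(protos, labels)
```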
This algorithm is called k -means, and its origins go back to the fifties. It’s nice and simple and quite popular, but it has several shortcomings, some of which are easier to solve than others. For one, we need to fix the number of clusters in advance, but in the real world, Robby is always running into new kinds of objects. One option is to let an object start a new cluster if it’s too different from the existing ones. Another is to allow clusters to split and merge as we go along. Either way, we probably want the algorithm to include a preference for fewer clusters, lest we wind up with each object as its own cluster (hard to beat if we want clusters to consist of similar objects, but clearly not the goal).
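One way to picture the first of these options: compare a new object to the existing prototypes and, if even the nearest one is farther away than some threshold, let the object found a cluster of its own. The sketch below is only an illustration of that idea; the threshold and the helper names are made up, and a real system would still need the preference for fewer clusters mentioned above.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def assign_or_create(entity, prototypes, threshold):
    """Assign to the nearest prototype, or start a new cluster if it's too different."""
    if prototypes:
        nearest = min(range(len(prototypes)),
                      key=lambda c: euclidean(entity, prototypes[c]))
        if euclidean(entity, prototypes[nearest]) <= threshold:
            return nearest
    prototypes.append(entity)  # the newcomer seeds a brand-new cluster
    return len(prototypes) - 1

prototypes = []
for obj in [(1.0, 1.0), (1.1, 0.9), (8.0, 8.0)]:
    print(assign_or_create(obj, prototypes, threshold=3.0))
# 0, 0, 1 -- the third object is too far from the first cluster, so it starts its own
```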
A bigger issue is that k -means only works if the clusters are easy to tell apart: each cluster is roughly a spherical blob in hyperspace, the blobs are far from each other, and they all have similar volumes and include a similar number of objects. If any of these fails, ugly things can happen: an elongated cluster is split into two different ones, a smaller cluster is absorbed into a larger one nearby, and so on. Luckily, there’s a better option.
Suppose we decide that letting Robby roam around in the real world is too slow and cumbersome a way to learn. Instead, like a would-be pilot learning in a flight simulator, we’ll have him look at computer-generated images. We know what clusters the images come from, but we’re not telling Robby. Instead, we create each image by first choosing a cluster at random (toys, say) and then synthesizing an example of that cluster (small, fluffy, brown teddy bear with big black eyes, round ears, and a bow tie). We also choose the properties of the example at random: the size comes from a normal distribution with a mean of ten inches, the fur is brown with 80 percent probability and white otherwise, and so on. After Robby has seen lots of images generated in this way, he should have learned to cluster them into people, furniture, toys, and so on, because images of people are more like other images of people than like images of furniture, and so on. But the interesting question is: If we look at it from Robby’s point of view, what’s the best algorithm to discover the clusters? The answer is surprising: Naïve Bayes, which we first met as an algorithm for supervised learning. The difference is that now Robby doesn’t know the classes, so he’ll have to guess them!
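Here is a minimal Python sketch of that image-generating process under the same kind of assumptions: a cluster is chosen at random, and then each attribute is sampled from that cluster’s own distributions (a normal distribution for size, a weighted coin flip for fur color). The cluster names and parameters are made up for illustration; Robby would see only the attributes, never the hidden label.

```python
import random

# Hypothetical cluster-conditional distributions: each cluster has its own
# mean size (in inches) and its own probability of brown fur.
clusters = {
    "toy":       {"size_mean": 10, "size_sd": 2,  "p_brown": 0.8},
    "furniture": {"size_mean": 40, "size_sd": 10, "p_brown": 0.3},
    "animal":    {"size_mean": 20, "size_sd": 6,  "p_brown": 0.6},
}

def generate_example():
    """First choose a cluster at random, then synthesize an example of it."""
    label = random.choice(list(clusters))  # hidden from Robby
    params = clusters[label]
    size = random.gauss(params["size_mean"], params["size_sd"])
    fur = "brown" if random.random() < params["p_brown"] else "white"
    return label, {"size_inches": round(size, 1), "fur": fur}

for _ in range(3):
    hidden_label, observed = generate_example()
    print(observed)  # Robby sees only this; he has to guess the cluster
```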