The solution is to have a set of features and learn their weights, as in Markov networks. For every person X, we can have the feature X has the flu; for every pair of acquaintances X and Y, the feature X and Y both have the flu; and so on. As in Markov networks, the maximum-likelihood weights are the ones that make each feature occur with the frequency observed in the data. The weight of X has the flu will be high if a lot of people have the flu. The weight of X and Y both have the flu will be high if, when person X has the flu, the odds that acquaintance Y also has the flu are higher than for a randomly chosen member of the network. If 40 percent of people have the flu and so do 16 percent of all acquaintance pairs, then the weight of X and Y both have the flu will be zero, because we don’t need that feature to correctly reproduce the data’s statistics (0.4 × 0.4 = 0.16). But if the feature has a positive weight, flu is more likely to occur in clumps than to just infect people at random, and you’re more likely to have the flu if your acquaintances do.
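To make the arithmetic concrete, here is a minimal Python sketch of my own (not from the text), using the log ratio of observed to expected co-occurrence as a stand-in for the learned pairwise weight; the exact maximum-likelihood weight requires iterative fitting, but it is zero in exactly the same situation.

```python
import math

# Illustrative numbers from the text: 40 percent of people have the flu,
# and 16 percent of acquaintance pairs both have it.
p_flu = 0.40        # frequency of the feature "X has the flu"
p_pair = 0.16       # frequency of the feature "X and Y both have the flu"

# If acquaintances caught the flu independently, the pair frequency would
# just be the product of the individual frequencies.
expected_if_independent = p_flu * p_flu        # 0.4 * 0.4 = 0.16

# Stand-in for the pairwise weight: log of observed vs. expected
# co-occurrence. It is zero exactly when the data looks independent and
# positive when flu clumps among acquaintances. (The true maximum-likelihood
# weight is found by iterative fitting, but it vanishes in the same case.)
print(math.log(p_pair / expected_if_independent))   # 0.0 -> feature adds nothing
print(math.log(0.25 / expected_if_independent))     # ~0.45 if 25% of pairs were both sick
```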
Notice that the network has a separate feature for each pair of people: Alice and Bob both have the flu, Alice and Chris both have the flu, and so on. But we can’t learn a separate weight for each pair, because we only have one data point per pair (whether it’s infected or not), and we wouldn’t be able to generalize to members of the network we haven’t diagnosed yet (do Yvette and Zach both have the flu?). What we can do instead is learn a single weight for all features of the same form, based on all the instances of it that we’ve seen. In effect, X and Y have the flu is a template for features that can be instantiated with each pair of acquaintances (Alice and Bob, Alice and Chris, etc.). The weights for all the instances of a template are “tied together,” in the sense that they all have the same value, and that’s how we can generalize despite having only one example (the whole network). In nonrelational learning, the parameters of a model are tied in only one way: across all the independent examples (e.g., all the patients we’ve diagnosed). In relational learning, every feature template we create ties the parameters of all its instances.
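As an illustration of parameter tying, here is a small sketch with made-up names and weights: every instance of a feature template shares one weight when scoring an assignment of flu/no-flu to a toy network.

```python
# A minimal sketch of parameter tying: one weight per feature *template*,
# shared by every instance of that template (hypothetical names and values).
acquaintances = [("Alice", "Bob"), ("Alice", "Chris"), ("Bob", "Chris")]

weights = {
    "has_flu": 0.3,        # template: X has the flu
    "both_have_flu": 0.9,  # template: X and Y both have the flu
}

def log_score(has_flu, weights, acquaintances):
    """Unnormalized log-score of one assignment of flu/no-flu to everyone.
    Every pair instance uses the same tied weight, so a single number is
    learned from many instances rather than one weight per pair."""
    score = sum(weights["has_flu"] for person, sick in has_flu.items() if sick)
    score += sum(weights["both_have_flu"]
                 for x, y in acquaintances
                 if has_flu[x] and has_flu[y])
    return score

state = {"Alice": True, "Bob": True, "Chris": False}
print(log_score(state, weights, acquaintances))   # 0.3 + 0.3 + 0.9 = 1.5
```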
We’re not limited to pairwise or individual features. Facebook wants to predict who your friends are so it can recommend them to you. It can use the rule Friends of friends are likely to be friends for that, but each instance of it involves three people: if Alice and Bob are friends, and Bob and Chris are also friends, then Alice and Chris are potential friends. H. L. Mencken’s quip that a man is wealthy if he makes more than his wife’s sister’s husband involves four people. Each of these rules can be turned into a feature template in a relational model, and a weight for it can be learned based on how often the feature occurs in the data. As in Markov networks, the features themselves can also be learned from the data.
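For the friends-of-friends template, a toy sketch (hypothetical data, my own illustration) that counts, for each non-friend pair, how many instances of the template would fire: pairs with more common friends collect more instances of the shared, positively weighted feature and so become better link candidates.

```python
from itertools import combinations

# Toy friendship network with made-up names.
friendships = {("Alice", "Bob"), ("Bob", "Chris"), ("Chris", "Dana"),
               ("Bob", "Dana")}

def are_friends(a, b):
    return (a, b) in friendships or (b, a) in friendships

people = sorted({p for pair in friendships for p in pair})

# For every non-friend pair, count the common friends, i.e., how many
# instances of "friends of friends are likely to be friends" would fire.
candidates = {}
for a, b in combinations(people, 2):
    if not are_friends(a, b):
        common = sum(1 for c in people
                     if c not in (a, b) and are_friends(a, c) and are_friends(b, c))
        if common:
            candidates[(a, b)] = common

# More common friends -> more instances of the template -> a higher score
# under its single shared weight.
print(sorted(candidates.items(), key=lambda kv: -kv[1]))
```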
Relational learners can generalize from one network to another (e.g., learn a model of how flu spreads in Atlanta and apply it in Boston). They can also learn on more than one network (e.g., Atlanta and Boston, assuming, unrealistically, that no one in Atlanta is ever in contact with anyone in Boston). But unlike “regular” learning, where all examples must have exactly the same number of attributes, in relational learning networks can vary in size; a larger network will just have more instances of the same templates than a smaller one. Of course, the generalization from a smaller network to a larger one may or may not be accurate, but the point is that nothing prevents it; and large networks often do behave locally like small ones.
The neatest trick a relational learner can do is to turn a sporadic teacher into an assiduous one. For an ordinary classifier, examples without classes are useless. If I’m given a patient’s symptoms, but not the diagnosis, that doesn’t help me learn to diagnose. But if I know that some of the patient’s friends have the flu, that’s indirect evidence that he may have the flu as well. Diagnosing a few people in a network and then propagating those diagnoses to their friends, and their friends’ friends, is the next best thing to diagnosing everyone. The inferred diagnoses may be noisy, but the overall statistics of how symptoms correlate with the flu will probably be a lot more accurate and complete than if I had only a handful of isolated diagnoses to draw on. Children are very good at making the most of the sporadic supervision they get (provided they don’t choose to ignore it). Relational learners share some of that ability.
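A minimal sketch of the propagation idea (my own toy scheme, not the inference a relational learner would actually run): a few hard diagnoses are held fixed, and every undiagnosed person’s estimate is repeatedly set to the average of their contacts’, so suspicion fades with distance from the known cases.

```python
# Toy contact network and two known diagnoses (made-up names and data).
neighbors = {
    "Alice": ["Bob"],
    "Bob":   ["Alice", "Chris"],
    "Chris": ["Bob", "Dana"],
    "Dana":  ["Chris", "Eve"],
    "Eve":   ["Dana"],
}
known = {"Alice": 1.0, "Eve": 0.0}    # Alice has the flu, Eve doesn't
belief = {p: known.get(p, 0.5) for p in neighbors}

for _ in range(50):                   # iterate until the estimates settle
    for person in neighbors:
        if person in known:
            continue                  # hard diagnoses stay fixed
        belief[person] = sum(belief[n] for n in neighbors[person]) / len(neighbors[person])

print({p: round(b, 2) for p, b in belief.items()})
# {'Alice': 1.0, 'Bob': 0.75, 'Chris': 0.5, 'Dana': 0.25, 'Eve': 0.0}
```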
All this power comes at a cost, however. In an ordinary classifier, such as a decision tree or a perceptron, inferring an entity’s class from its attributes is a matter of a few lookups and a bit of arithmetic. In a network, each node’s class depends indirectly on all the others’, and we can’t infer it in isolation. We can resort to the same kinds of inference techniques we used for Bayesian networks, like loopy belief propagation or MCMC, but the scale is different. A typical Bayesian network has perhaps thousands of variables, but a typical social network has millions of nodes or more. Luckily, because the model of the network consists of many repetitions of the same features with the same weights, we can often condense the network into “supernodes,” each consisting of many nodes that we know will have the same probabilities, and solve a much smaller problem with the same result.
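Here is a minimal sketch of that grouping step, with hypothetical data; real lifted inference also merges the messages passed between supernodes, but the key observation is that nodes with identical evidence under the tied weights must end up with identical probabilities.

```python
from collections import defaultdict

# Evidence for each undiagnosed person: how many diagnosed-sick acquaintances
# they have (made-up names and counts).
sick_neighbors = {
    "Bob": 2, "Chris": 2,
    "Dana": 0, "Eve": 0, "Frank": 0,
    "Grace": 5,
}

# Nodes whose local evidence looks the same under the shared template weights
# get the same probability, so we only reason about one representative each.
supernodes = defaultdict(list)
for person, count in sick_neighbors.items():
    supernodes[count].append(person)

print(dict(supernodes))
# {2: ['Bob', 'Chris'], 0: ['Dana', 'Eve', 'Frank'], 5: ['Grace']}
# Inference then runs over a handful of supernodes instead of millions of
# individual nodes, and each member inherits its supernode's probability.
```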
Relational learning has a long history, going back to at least the seventies and symbolist techniques like inverse deduction. But it acquired a new impetus with the advent of the Internet. Suddenly networks were everywhere, and modeling them was urgent. One phenomenon I found particularly intriguing was word of mouth. How does information propagate in a social network? Can we measure each member’s influence and target just enough of the most influential members to set off a wave of word of mouth? With my student Matt Richardson, I designed an algorithm that did just that. We applied it to Epinions, a product review site that allowed members to say whose reviews they trusted. We found, among other things, that marketing a product to the single most influential member (trusted by many followers who were in turn trusted by many others, and so on) was as good as marketing to a third of all the members in isolation. An avalanche of other research on this problem followed. Since then, I’ve applied relational learning to many other problems, including predicting who will form links in a social network, integrating databases, and enabling robots to build maps of their surroundings.
If you want to understand how the world works, relational learning is a good tool to have. In Isaac Asimov’s Foundation, the scientist Hari Seldon manages to mathematically predict the future of humanity and thereby save it from decadence. Paul Krugman, among others, has confessed that this seductive dream was what made him become an economist. According to Seldon, people are like molecules in a gas, and the law of large numbers ensures that even if individuals are unpredictable, whole societies aren’t. Relational learning reveals why this is not the case. If people were independent, each making decisions in isolation, societies would indeed be predictable, because all those random decisions would add up to a fairly constant average. But when people interact, larger assemblies can be less predictable than smaller ones, not more. If confidence and fear are contagious, each will dominate for a while, but every now and then an entire society will swing from one to the other. It’s not all bad news, though. If we can measure how strongly people influence each other, we can estimate how long it will be before a swing occurs, even if it’s the first one: another way in which black swans are not necessarily unpredictable.