Judea Pearl’s pioneering work on Bayesian networks appears in his book Probabilistic Reasoning in Intelligent Systems* (Morgan Kaufmann, 1988). “Bayesian networks without tears,”* by Eugene Charniak (AI Magazine, 1991), is a largely nonmathematical introduction to them. “Probabilistic interpretation for MYCIN’s certainty factors,”* by David Heckerman (Proceedings of the Second Conference on Uncertainty in Artificial Intelligence, 1986), explains when sets of rules with confidence estimates are and aren’t a reasonable approximation to Bayesian networks. “Module networks: Identifying regulatory modules and their condition-specific regulators from gene expression data,” by Eran Segal et al. (Nature Genetics, 2003), is an example of using Bayesian networks to model gene regulation. “Microsoft virus fighter: Spam may be more difficult to stop than HIV,” by Ben Paynter (Fast Company, 2012), tells how David Heckerman took inspiration from spam filters and used Bayesian networks to design a potential AIDS vaccine. The probabilistic or “noisy” OR is explained in Pearl’s book.* “Probabilistic diagnosis using a reformulation of the INTERNIST-1/QMR knowledge base,” by M. A. Shwe et al. (Parts I and II, Methods of Information in Medicine, 1991), describes a noisy-OR Bayesian network for medical diagnosis. Google’s Bayesian network for ad placement is described in Section 26.5.4 of Kevin Murphy’s Machine Learning* (MIT Press, 2012). Microsoft’s player rating system is described in “TrueSkill™: A Bayesian skill rating system,”* by Ralf Herbrich, Tom Minka, and Thore Graepel (Advances in Neural Information Processing Systems 19, 2007).
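For readers who want the gist of the noisy-OR before consulting Pearl’s book: each cause independently triggers the effect with its own probability, so the effect fails to occur only if every active cause fails. A minimal sketch (hypothetical illustration, not code from any of the cited systems):

```python
def noisy_or(cause_probs, active):
    """P(effect) = 1 - prod over active causes of (1 - p_i).

    cause_probs: probability that each cause, alone, produces the effect.
    active: which causes are present in this case.
    """
    p_all_fail = 1.0
    for p, is_on in zip(cause_probs, active):
        if is_on:
            p_all_fail *= 1.0 - p  # this cause fails to trigger the effect
    return 1.0 - p_all_fail

# Two of three possible causes present: 1 - (1 - 0.8)(1 - 0.5) ~= 0.9
p = noisy_or([0.8, 0.5, 0.3], [True, True, False])
```

The appeal, as in the INTERNIST-1/QMR network, is that a disease with n possible symptoms needs only n parameters instead of a full 2^n conditional table.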
Modeling and Reasoning with Bayesian Networks,* by Adnan Darwiche (Cambridge University Press, 2009), explains the main algorithms for inference in Bayesian networks. The January/February 2000 issue* of Computing in Science and Engineering, edited by Jack Dongarra and Francis Sullivan, has articles on the top ten algorithms of the twentieth century, including MCMC. “Stanley: The robot that won the DARPA Grand Challenge,” by Sebastian Thrun et al. (Journal of Field Robotics, 2006), explains how the eponymous self-driving car works. “Bayesian networks for data mining,”* by David Heckerman (Data Mining and Knowledge Discovery, 1997), summarizes the Bayesian approach to learning and explains how to learn Bayesian networks from data. “Gaussian processes: A replacement for supervised neural networks?,”* by David MacKay (NIPS tutorial notes, 1997; online at www.inference.eng.cam.ac.uk/mackay/gp.pdf), gives a flavor of how the Bayesians co-opted NIPS.
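The MCMC idea honored in that top-ten list fits in a few lines. Below is a toy Metropolis sampler for a one-dimensional standard normal target, assuming symmetric uniform proposals; it is a hypothetical sketch, unrelated to any system cited here:

```python
import math
import random

def metropolis(n_samples, step=1.0, seed=0):
    """Metropolis sampler for a standard normal target density."""
    rng = random.Random(seed)
    x = 0.0
    samples = []
    for _ in range(n_samples):
        proposal = x + rng.uniform(-step, step)
        # Accept with probability min(1, target(proposal) / target(x));
        # for a standard normal the ratio is exp((x^2 - proposal^2) / 2).
        if rng.random() < math.exp((x * x - proposal * proposal) / 2):
            x = proposal
        samples.append(x)  # on rejection, the current state repeats
    return samples

samples = metropolis(50_000)
# The empirical mean and variance should approach 0 and 1.
```

The trick is that only a ratio of target densities is ever needed, so the sampler works even when the normalizing constant is unknown, which is exactly the situation in Bayesian-network inference.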
The need for weighting the word probabilities in speech recognition is discussed in Section 9.6 of Speech and Language Processing,* by Dan Jurafsky and James Martin (2nd ed., Prentice Hall, 2009). My paper on Naïve Bayes, with Mike Pazzani, is “On the optimality of the simple Bayesian classifier under zero-one loss”* (Machine Learning, 1997; expanded journal version of the 1996 conference paper). Judea Pearl’s book,* mentioned above, discusses Markov networks along with Bayesian networks. Markov networks in computer vision are the subject of Markov Random Fields for Vision and Image Processing,* edited by Andrew Blake, Pushmeet Kohli, and Carsten Rother (MIT Press, 2011). Markov networks that maximize conditional likelihood were introduced in “Conditional random fields: Probabilistic models for segmenting and labeling sequence data,”* by John Lafferty, Andrew McCallum, and Fernando Pereira (International Conference on Machine Learning, 2001).
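The classifier the optimality paper analyzes can be sketched in a few lines: a multinomial Naïve Bayes text classifier with Laplace smoothing. This is a hypothetical toy, not code from the paper:

```python
import math
from collections import Counter

def train(docs):
    """docs: list of (label, word_list). Returns counts for prediction."""
    class_counts = Counter()
    word_counts = {}          # per-class word frequency tables
    vocab = set()
    for label, words in docs:
        class_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(words)
        vocab.update(words)
    return class_counts, word_counts, vocab

def predict(model, words):
    """Pick the class maximizing log P(class) + sum log P(word | class),
    with add-one (Laplace) smoothing to avoid zero probabilities."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best, best_lp = None, float("-inf")
    for label, n in class_counts.items():
        lp = math.log(n / total_docs)
        denom = sum(word_counts[label].values()) + len(vocab)
        for w in words:
            lp += math.log((word_counts[label][w] + 1) / denom)
        if lp > best_lp:
            best, best_lp = label, lp
    return best

model = train([("spam", ["win", "cash", "now"]),
               ("ham", ["meeting", "at", "noon"])])
predict(model, ["win", "cash"])  # -> "spam"
```

The paper’s point, roughly, is that this classifier can pick the right class even when its independence assumption, and hence its probability estimates, are badly wrong.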
The history of attempts to combine probability and logic is surveyed in a 2003 special issue* of the Journal of Applied Logic devoted to the subject, edited by Jon Williamson and Dov Gabbay. “From knowledge bases to decision models,”* by Michael Wellman, John Breese, and Robert Goldman (Knowledge Engineering Review, 1992), discusses some of the early AI approaches to the problem.
Frank Abagnale details his exploits in his autobiography, Catch Me If You Can, cowritten with Stan Redding (Grosset & Dunlap, 1980). The original technical report on the nearest-neighbor algorithm by Evelyn Fix and Joe Hodges is “Discriminatory analysis: Nonparametric discrimination: Consistency properties”* (USAF School of Aviation Medicine, 1951). Nearest Neighbor (NN) Norms,* edited by Belur Dasarathy (IEEE Computer Society Press, 1991), collects many of the key papers in this area. Locally linear regression is surveyed in “Locally weighted learning,”* by Chris Atkeson, Andrew Moore, and Stefan Schaal (Artificial Intelligence Review, 1997). The first collaborative filtering system based on nearest neighbors is described in “GroupLens: An open architecture for collaborative filtering of netnews,”* by Paul Resnick et al. (Proceedings of the 1994 ACM Conference on Computer-Supported Cooperative Work, 1994). Amazon’s collaborative filtering algorithm is described in “Amazon.com recommendations: Item-to-item collaborative filtering,”* by Greg Linden, Brent Smith, and Jeremy York (IEEE Internet Computing, 2003). (See Chapter 8’s further readings for Netflix’s.) Recommender systems’ contribution to Amazon and Netflix sales is referenced in, among others, Mayer-Schönberger and Cukier’s Big Data and Siegel’s Predictive Analytics (cited earlier). The 1967 paper by Tom Cover and Peter Hart on nearest-neighbor’s error rate is “Nearest neighbor pattern classification”* (IEEE Transactions on Information Theory).
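The algorithm Fix and Hodges introduced is short enough to state in code: classify a query point by majority vote among its k closest training examples. A minimal sketch in the spirit of their nonparametric discrimination (a toy illustration, not their original procedure):

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """train: list of (feature_tuple, label) pairs.
    Majority vote among the k training points nearest to query."""
    nearest = sorted(train, key=lambda pt: math.dist(pt[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

points = [((0, 0), "a"), ((0, 1), "a"), ((1, 0), "a"),
          ((5, 5), "b"), ((5, 6), "b"), ((6, 5), "b")]
knn_predict(points, (1, 1))    # -> "a"
knn_predict(points, (5, 5.5))  # -> "b"
```

Nearest-neighbor collaborative filtering, as in GroupLens, applies the same idea with people as points: your predicted rating comes from users whose past ratings are closest to yours.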
The curse of dimensionality is discussed in Section 2.5 of The Elements of Statistical Learning,* by Trevor Hastie, Rob Tibshirani, and Jerry Friedman (2nd ed., Springer, 2009). “Wrappers for feature subset selection,”* by Ron Kohavi and George John (Artificial Intelligence, 1997), compares attribute selection methods. “Similarity metric learning for a variable-kernel classifier,”* by David Lowe (Neural Computation, 1995), is an example of a feature weighting algorithm.
“Support vector machines and kernel methods: The new generation of learning machines,”* by Nello Cristianini and Bernhard Schölkopf (AI Magazine, 2002), is a mostly nonmathematical introduction to SVMs. The paper that started the SVM revolution was “A training algorithm for optimal margin classifiers,”* by Bernhard Boser, Isabel Guyon, and Vladimir Vapnik (Proceedings of the Fifth Annual Workshop on Computational Learning Theory, 1992). The first paper applying SVMs to text classification was “Text categorization with support vector machines,”* by Thorsten Joachims (Proceedings of the Tenth European Conference on Machine Learning, 1998). Chapter 5 of An Introduction to Support Vector Machines,* by Nello Cristianini and John Shawe-Taylor (Cambridge University Press, 2000), is a brief introduction to constrained optimization in the context of SVMs.
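The kernel trick at the heart of these papers can be shown without the constrained optimization: below is a toy kernel perceptron with an RBF kernel, which shares the SVM’s decision-function form f(x) = sum_i alpha_i y_i K(x_i, x) but picks the coefficients by simple mistake-driven updates rather than maximum margin. A hypothetical sketch, not code from any cited work:

```python
import math

def rbf(x, y, gamma=1.0):
    """Gaussian (RBF) kernel: similarity that decays with distance."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def train_kernel_perceptron(data, epochs=10):
    """data: list of (x, y) with y in {-1, +1}.
    Returns mistake counts alpha_i for f(x) = sum_i alpha_i y_i K(x_i, x)."""
    alpha = [0] * len(data)
    for _ in range(epochs):
        for j, (xj, yj) in enumerate(data):
            f = sum(a * yi * rbf(xi, xj)
                    for a, (xi, yi) in zip(alpha, data))
            if yj * f <= 0:      # misclassified: reinforce this example
                alpha[j] += 1
    return alpha

data = [((0, 0), -1), ((0, 1), -1), ((3, 3), 1), ((4, 3), 1)]
alpha = train_kernel_perceptron(data)
# sign(f(x)) classifies new points; swapping in a different kernel
# changes the decision boundary without touching the algorithm.
```

An actual SVM would instead solve the constrained optimization problem introduced in the Boser, Guyon, and Vapnik paper to find the maximum-margin alphas, typically leaving most of them zero (the non-support vectors).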
Case-Based Reasoning,* by Janet Kolodner (Morgan Kaufmann, 1993), is a textbook on the subject. “Using case-based retrieval for customer technical support,”* by Evangelos Simoudis (IEEE Expert, 1992), explains its application to help desks. IPsoft’s Eliza is described in “Rise of the software machines” (Economist, 2013) and on the company’s website. Kevin Ashley explores case-based legal reasoning in Modeling Legal Arguments* (MIT Press, 1991). David Cope summarizes his approach to automated music composition in “Recombinant music: Using the computer to explore musical style” (IEEE Computer, 1991). Dedre Gentner proposed structure mapping in “Structure mapping: A theoretical framework for analogy”* (Cognitive Science, 1983). “The man who would teach machines to think,” by James Somers (Atlantic, 2013), discusses Douglas Hofstadter’s views on AI.