This discussion is not meant to start a “turf war” over the priority of human intellectual invention. Far from it. If we were to do that, then we would also have to acknowledge that though Newton and Leibniz put the final touches on the calculus, the claim that they “invented” it, in the truest sense of the word, is a stretch. Priority disputes in the history of human discovery usually prove futile and virtually impossible to resolve, even among those historians who study the most ancient roots of intellectual invention on a full-time basis. That is, even assigning priority to ancient discoveries of intellectual concepts is exceedingly difficult (especially without lawyers!), which further suggests that “modern” concepts are often not modern at all. As another example, the concept of a computer may not be a modern invention. Historians have shown that its primitive origins go back at least to Charles Babbage and the “Analytical Engine,” and the concept probably goes back further still (Green, 2005). As the saying goes, the only things we do not know are the history we are unaware of; or, as Mark Twain once remarked, few if any ideas are original, and most can be traced back to earlier ones.
1.9 “Training” and “Testing” Models: What “Statistical Learning” Means in the Age of Machine Learning and Data Science
One aspect of the “data revolution,” with data science and machine learning leading the way, has been the emphasis on the concept of statistical learning. As mentioned, simply because we assign a new word or phrase to something does not necessarily mean that it represents something entirely new. The phrase “statistical learning” is testimony to this. In its simplest and most direct form, statistical learning simply means fitting a model or algorithm to data. The model then “learns” from the data. But what does it learn exactly? It essentially learns what its estimators should be (in terms of selecting optimal values for them) in order to maximize or minimize a function. For example, in the case of a simple linear regression, if we take a model of the type yi = α + βxi + εi and fit it to some data, the model “learns” from the data what the best values for a and b are, which are estimators for α and β, respectively. Given that we are using ordinary least-squares as our method of estimation, the regression model “learns” what a and b should be such that the sum of squared errors is kept to a minimum value (if you don’t know what all this means, no worries, you’ll learn it in Chapter 7). The point here for this discussion is that the “learning” or “training” consists simply of selecting scalars for a and b such that the sum of squared errors is minimized. Once that occurs, the model is said to have learned or been “trained” from the data. This is, at its most essential and rudimentary level, what statistical learning actually means in many (not all) contexts. If we subject that model to new data after that, thus “sharpening” its scalars, the model “updates” what its estimators should be in order to continue optimizing a function. Note that this more or less parallels the idea of human learning, in that the model (or “you”) is “learning from experience” as each new experience is incorporated into knowledge.
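To make this concrete, here is a minimal sketch of this “learning” step in Python, using invented data for illustration. The closed-form least-squares solution picks the scalars a and b that minimize the sum of squared errors:

```python
import numpy as np

# Hypothetical data: x is a predictor, y a response (values invented for illustration)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Closed-form ordinary least-squares estimates for the model y_i = a + b*x_i
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()

# The "learned" scalars are exactly those that minimize the sum of squared errors
residuals = y - (a + b * x)
sse = np.sum(residuals ** 2)
print(a, b, sse)
```

Nothing mystical is happening: the “training” is the arithmetic above, and the fitted a and b constitute everything the model has “learned” from these data.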
For example, a worker learns how to maximize his or her potential in a job through trial and error, otherwise known as “experience.” If one day his or her boss corrects him or her, that new “data” is incorporated into the learning mechanism. If on another day the individual is reinforced for doing something right, that is also incorporated into the learning mechanism. Of course, we cannot see the scalars or estimators (they are largely metaphorical in this case), but you get the idea. Learning “optimizes” some function through exposure to new experience. In classical learning theory in psychology, for instance, the rat in a Skinner box learns that if he presses the lever, he will receive a pellet of food. If he doesn’t press the lever, he doesn’t receive food. The rat is optimizing the function (it is in his little brain, and it is metaphorical, we can’t see it) that will allow him to distinguish which response gets the food. This is learning! When the rat is “trained” enough, he starts making predictions nearly perfectly, with very few errors. So it also is with the statistical model; it does an increasingly good job at “getting it right” as it is trained on increasingly more data (i.e. more “experience”). It also “learns” from what it did wrong, just as the rat learns that if he doesn’t press the lever, he doesn’t eat.
Is any of this “new”? Of course not! In a very real way, pioneers of regression in the 1890s, such as Karl Pearson and George Udny Yule (see Denis and Docherty, 2007), were computing these same regression coefficients on their own data, though not with the use of computers. However, back then it was not referred to as a model learning from data; it was simply seen as a novel statistical method that could help address a social problem of the day. Even earlier than that, Legendre and Gauss in the early nineteenth century (1800s) were developing the method of least-squares that would eventually be used in applying regression methods later that century. Again, they were not called statistical learning methods back then. The idea of calling them learning methods seems to have arisen mostly in statistics, but is now center stage in data science and machine learning. However, a lot of this is due to the zeitgeist of the times, where “zeitgeist” means the “spirit of the times” we are in, which is one of computers, artificial intelligence, and the idea that if we supply a model with enough data, it can eventually “fly itself,” so to speak. Hence the idea of “training” as well. This idea is very popular in digit recognition, in that the model is supplied with enough data that it “learns” to discriminate whether a number is a “2,” for instance, or a “4,” by learning its edges and most of the rest of what makes these numbers distinct from one another. Of course, the training of a model is not always done via ordinary least-squares regression. Other models are used, and the process can get quite complex and will not always follow this simple regression idea. Sometimes an algorithm is designed to search for patterns in data, in which case the statistical method is considered to be unsupervised, because it has no a priori group structure to guide it, as in so-called supervised learning.
Principal components, exploratory factor analysis, and cluster analysis are examples of this. However, even in these cases, optimization criteria are applied. For example, in principal components analysis, we are still selecting optimal values for scalars, but instead of minimizing the sum of squared (vertical) errors, we are instead maximizing the variance in the original variables subjected to the procedure (how this is done will become clear when we survey PCA later in the book).
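As a small preview, here is a hedged sketch in Python of that variance-maximization idea, again on invented data. The leading eigenvector of the covariance matrix gives the weights whose resulting component scores have maximal variance; that maximal variance is the leading eigenvalue:

```python
import numpy as np

# Hypothetical data matrix: 5 observations on 2 correlated variables (values invented)
X = np.array([[2.0, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0]])

# Center the variables, then eigendecompose the sample covariance matrix.
# The leading eigenvector holds the weights (loadings) that maximize
# the variance of the resulting component scores.
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)

# eigh returns eigenvalues in ascending order; the last column is the first component
first_pc = eigvecs[:, -1]
scores = Xc @ first_pc
print(eigvals[-1], np.var(scores, ddof=1))  # the two quantities coincide
```

So PCA is “learning” in exactly the sense described above: it selects scalars (the loadings) that optimize a function, only here the function is maximized variance rather than minimized squared error.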
Now, in the spirit of statistical learning and “training,” validating a model has become equally emphasized, in the sense that after a model is trained on one set of data, it should be applied to a similar set of data to estimate the error rate on that new set. But what does this mean? How can we understand this idea? Easily! Here are some easy examples of where this occurs:
The pilot learns in the simulator or test flights and then his or her knowledge is “validated” on a new flight. The pilot was “trained” in landing in a thunderstorm yesterday and now that knowledge (model) will be evaluated in a new flight on a new storm.
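The same train-then-validate logic can be sketched for the regression model discussed earlier. The data below are simulated (not from any real study): the coefficients are estimated from a “training” set only, and the error rate is then assessed on a separate “test” set the model has never seen:

```python
import numpy as np

# Simulated data (invented for illustration): the true relationship is
# y = 2 + 1.5x plus noise; we pretend we don't know this
rng = np.random.default_rng(0)
x_train = rng.uniform(0, 10, 40)
y_train = 2.0 + 1.5 * x_train + rng.normal(0, 1, 40)
x_test = rng.uniform(0, 10, 20)
y_test = 2.0 + 1.5 * x_test + rng.normal(0, 1, 20)

# "Train": least-squares estimates computed from the training data only
b = np.sum((x_train - x_train.mean()) * (y_train - y_train.mean())) / np.sum((x_train - x_train.mean()) ** 2)
a = y_train.mean() - b * x_train.mean()

# "Validate": mean squared error of the trained model on the new data
mse_test = np.mean((y_test - (a + b * x_test)) ** 2)
print(mse_test)
```

The test-set error is the analogue of the pilot's new flight: it tells us how well the “trained” model performs on conditions it was not fitted to, which is a more honest gauge of learning than performance on the training data itself.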