Rafael Nadal, the tennis player, learns from his previous match how to avoid errors when returning the ball. That learning is evaluated on new data, which is a new tennis match.
A student in a statistics class learns from the first test how to adjust his or her study strategies. That knowledge is validated on test 2 to see how much was learned.
Of course, we could go on and on. The point is that the idea of statistical learning, including the concepts of machine learning, is meant to exemplify the zeitgeist we find ourselves in, one of increased automation, computing power, artificial intelligence, and machines that are becoming more and more self-sufficient, learning on their own how to make "optimal" decisions (e.g. self-driving cars). However, what is really going on "behind the scenes" is essentially mathematics, and usually a problem of optimizing a scalar quantity subject to particular constraints imposed on the problem.
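To make this "behind the scenes" point a bit more concrete, the following minimal sketch (in Python, using simulated data and made-up parameter values purely for illustration) shows that "learning" even a simple linear model amounts to nothing more than minimizing a scalar loss function over the model's parameters:

```python
import numpy as np
from scipy.optimize import minimize

# Simulated data: one predictor x and a response y (values are hypothetical)
rng = np.random.default_rng(seed=1)
x = rng.normal(size=50)
y = 2.0 + 0.5 * x + rng.normal(scale=1.0, size=50)

# "Learning" the model is simply minimizing a scalar loss:
# here, the sum of squared errors as a function of the intercept and slope
def sse(params):
    intercept, slope = params
    return np.sum((y - (intercept + slope * x)) ** 2)

result = minimize(sse, x0=[0.0, 0.0])   # numerical optimization of the loss
print(result.x)                         # estimates should be near (2.0, 0.5)
```

The optimizer has no notion of "learning" in any human sense; it simply searches for the parameter values that make the scalar objective as small as possible.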
In this book, while it can be said that we do "train" models by fitting them, we do not cross-validate them on new data. Since the book is essentially an introduction and primer, we do not take that additional step. You should know, however, that cross-validation is often a good step to take if you have the data at your disposal to make it feasible.

In many cases, scientists may not yet have such cross-validation data available, and "splitting the sample" into a training set and a test set may not be feasible given the size of the data. That does not necessarily mean testing cannot be done. It can be, on a new data set assumed to be drawn from the same population as the original data. Techniques for cross-validation also exist that minimize the need to collect very large validation samples (e.g. see James et al., 2013). Further, to use one of our earlier metaphors, validating the pilot's skill may be delayed until a new storm is available; it does not necessarily have to be done today.

Hence, in general, when you fit a model, you should always have it in mind to validate that model on new data, data that was not used in training the model. Why is this last point important? Quite simply because if the pilot is tested on the same storm in which he or she was trained, it is hardly a test at all; the pilot already knows that particular storm and its intricacies, so the exercise is less a test of new skill than a test of how well he or she remembers how to deal with that specific storm and, returning to our statistical discussion, a capitalization on chance factors. This is why, if you are to cross-validate a model, it should be done on new "test" data, never on the original training data. If you do not cross-validate the model, you can generally expect the fit on the training data to be optimistic, such that the model will appear to fit "better" than it actually would on new data. This is the primary reason cross-validation is so strongly encouraged.

Either way, clear communication of your results is the goal. If you fit a model to training data and do not cross-validate it on test data, inform your audience of this so they know what you have done; if you do cross-validate it, likewise inform them. In this respect, it is not "essential" that you cross-validate immediately, but it is essential that you are honest and open about what you have done with your data and that you clearly communicate why the current estimates of model fit are likely to be somewhat inflated by not yet testing the model on new data. In your next study, if you are able to collect a sample from the same population, evaluate your model on the new data to see how well it fits; that will give you a more honest assessment of how good your model really is. For further details on cross-validation, see James et al. (2013), and for a more thorough and deeper theoretical treatment, see Hastie et al. (2009).
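For readers curious about what a simple train/test evaluation looks like in practice, the sketch below uses simulated data and the scikit-learn library (assumed to be installed; the variable names and simulated values are ours, chosen purely for illustration). It illustrates the optimism discussed above: the fit on the training data is typically better than the fit on held-out test data, and k-fold cross-validation offers a way to estimate out-of-sample fit without collecting a separate large validation sample:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, cross_val_score

# Simulated data standing in for a real study (all values are hypothetical)
rng = np.random.default_rng(seed=1)
X = rng.normal(size=(100, 5))                                  # 100 cases, 5 predictors
y = X @ np.array([1.0, 0.5, 0.0, 0.0, 0.0]) + rng.normal(scale=2.0, size=100)

# Hold out a test set so the model can be evaluated on data not used in training
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = LinearRegression().fit(X_train, y_train)
print("R^2 on training data:", model.score(X_train, y_train))  # typically optimistic
print("R^2 on test data:    ", model.score(X_test, y_test))    # a more honest assessment

# k-fold cross-validation reuses the available sample rather than requiring
# a separate, large validation sample (see James et al., 2013)
print("5-fold CV R^2:", cross_val_score(LinearRegression(), X, y, cv=5).mean())
```

In a real application, the held-out test fit (or the cross-validated estimate) is the figure to report and to treat as the more honest assessment of how well the model is likely to perform on new data drawn from the same population.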
1.10 Where We Are Going From Here: How to Use This Book
This introductory chapter has surveyed a few of the more salient concepts related to applied statistics. We have reviewed the logic of statistical inference, discussed why inference is necessary even in the age of "big data," and considered some of the fundamental principles, both mathematical and philosophical, on which applied statistics is based. In the following chapter, we begin our discussion of Python, the software used to demonstrate many of the methods surveyed in this book. The software, however, like any software, is in no way a panacea. As we will discuss in the following chapter, what is most essential is to first understand the statistical procedures and concepts that underlie the code used to communicate and run those procedures.
Review Exercises

1 Discuss what is meant by an "axiom" in mathematics, and why such axioms are important to building the structure of theoretical and applied statistics.
2 Explore whether you consider probability to be a relative frequency or degree of belief. How would you define probability? Why would you define it in this way? Explain.
3 Summarize the overall purpose of inferential statistics. What is the "big picture" behind why statistical inference is necessary, even in the age of "big data"?
4 Why and how are measurement issues so important in science? How does this importance differ, if at all, between sciences such as physics and biology vs. psychology and economics? What are the issues at play here?
5 Explore whether self-reports actually tell you anything about what you are seeking to measure from an individual. Can you think up a situation where you can have full confidence that it is a valid measure? Can you think up a different situation where a self-report may not be measuring the information you seek to know from a research participant? Explain and explore.
6 Brainstorm and highlight a few of the measurement issues involved in the COVID-19 pandemic. What are some ways in which statistics could be misleading in reporting the status of the pandemic? Discuss as many of these as you can come up with.
7 Give a description and summary of how null hypothesis significance testing works, using a COVID-19 example in your discussion to highlight its concepts and logic.
8 Distinguish between a type I and type II error and explain why virtually all decisions made in science have error rates. Which error rate would you consider the most important to minimize? Why?
9 Distinguish between point estimation and a confidence interval. Theoretically at least, when does a confidence interval become a point estimate?
10 Why is a basic understanding of the philosophy of science mandatory knowledge for the student or researcher in the applied sciences? Why is simply learning statistics and research design not enough?
11 Distinguish between deductive logic and inductive logic. Why is science necessarily inductive?
12 Explore and discuss why causation can be such a “sticky” topic in the sciences. Why is making causal statements so difficult if not impossible? Use an example or two in your exploration of this important issue.
13 Distinguish between a continuous vs. a discrete variable. Why is continuity in a research variable controversial? Explain.
14 How is it the case that numerical differences do not necessarily translate to equivalent differences on the physical variable being measured? Explore this issue with reference to scales of measurement.
15 Explore differences that may exist between data analysis, data science, machine learning, and big data. What might be the differences between these areas of investigation?