2.1.1 What Is Statistics?
The term statistics is commonly used in two ways. On the one hand, we use the term statistics in day‐to‐day communication when we refer to a collection of numbers or facts. What follows are some examples of statistics:
1 In 2000, the salaries of CEOs from 10 selected companies ranged from $2 million to $5 million.
2 On average, the starting salary of engineers is 40% higher than that of technicians.
3 In 2007, over 45 million people in the United States did not have health insurance.
4 In 2008, the average tuition of private colleges soared to over $40,000.
5 In the United States, seniors spend a significant portion of their income on health care.
6 The R&D budget of the pharmaceutical division of a company is higher than the R&D budget of its biomedical division.
7 In December 2009, a total of 43 states reported rising jobless rates.
On the other hand, statistics is a scientific subject that provides techniques for collecting, organizing, summarizing, and analyzing data and for interpreting the results as input to make appropriate decisions. In a broad sense, the subject of statistics can be divided into two parts: descriptive statistics and inferential statistics.
Descriptive statistics uses techniques to organize, summarize, analyze, and interpret the information contained in a data set to draw conclusions that do not go beyond the boundaries of the data set. Inferential statistics uses techniques that allow us to draw conclusions about a large body of data based on the information obtained by analyzing a small portion of these data. In this book, we study both descriptive statistics and inferential statistics. This chapter discusses the topics of descriptive statistics. Chapters 3 through 7 are devoted to building the tools needed to study inferential statistics, and the rest of the chapters are mostly dedicated to inferential statistics.
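The distinction can be illustrated with a minimal Python sketch. The data values are made up for illustration, and the interval uses the normal critical value 1.96 as a simplifying assumption (a small sample would ordinarily call for a t critical value, a topic developed in later chapters).

```python
import math
import statistics

# Hypothetical sample: breaking strengths (arbitrary units) of 8 tested parts.
sample = [52.1, 49.8, 50.5, 51.2, 48.9, 50.0, 51.7, 49.4]

# Descriptive statistics: statements about the data set itself.
sample_mean = statistics.mean(sample)
sample_sd = statistics.stdev(sample)  # sample standard deviation (n - 1 divisor)

# Inferential statistics: a statement about the population the sample came from.
# Rough 95% interval for the population mean; 1.96 is an assumed normal critical value.
standard_error = sample_sd / math.sqrt(len(sample))
lower = sample_mean - 1.96 * standard_error
upper = sample_mean + 1.96 * standard_error

print(f"sample mean = {sample_mean:.2f}, sample sd = {sample_sd:.2f}")
print(f"approximate 95% interval for the population mean: ({lower:.2f}, {upper:.2f})")
```

The first two quantities describe only the eight observed values; the interval is a claim that reaches beyond them, which is what makes it inferential.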
2.1.2 Population and Sample in a Statistical Study
In a very broad sense, statistics may be defined as the science of collecting and analyzing data. The tradition of collecting data is centuries old. In European countries, numerous government agencies started keeping records on births, deaths, and marriages about four centuries ago. However, scientific methods of analyzing such data are not old. Most of the advanced techniques of analyzing data have in fact been developed only in the twentieth century, and routine use of these techniques became possible only after the invention of modern computers.
During the last four decades, the use of advanced statistical techniques has increased exponentially. The collection and analysis of various kinds of data have become essential in agriculture, pharmaceuticals, business, medicine, engineering, manufacturing, and product distribution, as well as in government and nongovernment agencies. In a typical field, there is often a need to collect quantitative information on all elements of interest, which together are usually referred to as the population. The problem with collecting values on all elements, however, is that populations are usually so large that examining each element is not feasible. For instance, suppose that we are interested in determining the breaking strength of the filament in a type of electric bulb manufactured by a particular company. Clearly, in this case, examining each and every bulb means that we have to wait until each bulb dies, so it is unreasonable to collect data on all the elements of interest. In other cases, we cannot examine all the elements because doing so would be too expensive, too time‐consuming, or both. Thus, we usually end up examining only a small portion of a population, which is referred to as a sample. More formally, we may define population and sample as follows:
A population is a collection of all elements that possess a characteristic of interest.
Populations can be finite or infinite. A population whose elements are easily countable may be considered finite, and a population whose elements are not easily countable may be considered infinite. For example, a production batch of ball bearings may be considered a finite population, whereas all the ball bearings that may be produced from a certain manufacturing line are considered conceptually infinite.
A portion of a population selected for study is called a sample.
The target population is the population about which we want to make inferences based on the information contained in a sample.
The population from which a sample is being selected is called a sampled population.
Usually, these two populations coincide, since every effort should be made to ensure that the sampled population is the same as the target population. However, for reasons such as financial or time constraints, part of the population not being easily accessible, or the unexpected loss of part of the population, we may have situations where the sampled population is not equivalent to the whole target population. In such cases, conclusions made about the sampled population are not usually applicable to the target population.
In almost all statistical studies, the conclusions about a population are based on the information drawn from a sample. In order to obtain useful information about a population by studying a sample, it is important that the sample be a representative sample; that is, the sample should possess the characteristics of the population under investigation. For example, if we are interested in studying the family incomes in the United States, then our sample must consist of representative families that are very poor, poor, middle class, rich, and very rich. One way to achieve this goal is by taking a random sample.
A sample is called a simple random sample if each element of the population has the same chance of being included in the sample.
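As a minimal sketch of this definition, the following Python snippet draws a simple random sample from a finite population. The population of 1,000 numbered ball bearings, the sample size of 20, and the seed are all assumptions chosen for illustration.

```python
import random

# Hypothetical finite population: serial numbers of 1,000 ball bearings.
population = list(range(1, 1001))

# A fixed seed is used only so that the sketch is reproducible.
rng = random.Random(42)

# random.sample draws without replacement, so no bearing appears twice,
# and every bearing has the same chance of being included.
srs = rng.sample(population, k=20)

print(sorted(srs))
```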
There are several techniques for selecting a random sample, but the concept that each element of the population has the same chance of being included in a sample forms the basis of all of them: simple random sampling, systematic random sampling, stratified random sampling, and cluster random sampling. These four types of sampling schemes are usually referred to as sample designs.
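The three designs other than simple random sampling can be sketched in Python as follows. This is an illustrative outline under stated assumptions, not a definitive implementation: the population of 120 parts, the `line` and `box` labels, and the helper function names are all hypothetical, and the stratified version assumes equal allocation across strata of equal size.

```python
import random

rng = random.Random(0)  # fixed seed only for reproducibility

# Hypothetical population: 120 parts, each tagged with a production line and a box.
population = [
    {"id": i, "line": "ABC"[i % 3], "box": i // 10}
    for i in range(120)
]

def systematic_sample(units, n, rng):
    """Pick a random start, then take every k-th unit, where k = len(units) // n."""
    k = len(units) // n
    start = rng.randrange(k)
    return units[start::k][:n]

def stratified_sample(units, key, n_per_stratum, rng):
    """Draw a simple random sample of fixed size within each stratum."""
    strata = {}
    for u in units:
        strata.setdefault(u[key], []).append(u)
    chosen = []
    for members in strata.values():
        chosen.extend(rng.sample(members, n_per_stratum))
    return chosen

def cluster_sample(units, key, n_clusters, rng):
    """Randomly select whole clusters and keep every unit inside them."""
    cluster_ids = sorted({u[key] for u in units})
    picked = set(rng.sample(cluster_ids, n_clusters))
    return [u for u in units if u[key] in picked]

systematic = systematic_sample(population, 12, rng)      # every 10th part
stratified = stratified_sample(population, "line", 4, rng)  # 4 parts per line
clustered = cluster_sample(population, "box", 3, rng)    # 3 whole boxes
```

In this sketch, the systematic and stratified samples each contain 12 parts, while the cluster sample contains all 30 parts from the 3 selected boxes.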
Since collecting each data point costs time and money, it is important that in taking a sample, some balance be kept between the sample size and the resources available. Too small a sample may not provide much useful information, but too large a sample may result in a waste of resources. Thus, it is very important that in any sampling procedure, an appropriate sample design be selected. In this section, we will review, very briefly, the four sample designs mentioned previously.
Before taking any sample, we need to divide the target population into nonoverlapping units, usually known as sampling units. It is important to recognize that the sampling units in a given population may not always be the same. Sampling units are in fact determined by the sample design chosen. For example, in sampling voters in a metropolitan area, the sampling units might be individual voters, all voters in a family, all voters living in a town block, or all voters in a town. Similarly, in sampling parts from a manufacturing plant, the sampling units might be an individual part or a box containing several parts.
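The voter example can be sketched in Python to show how the same population splits into different nonoverlapping sampling units depending on the level chosen. The voter records, the household and block sizes, and the `sampling_units` helper are hypothetical and serve only as an illustration.

```python
import random
from collections import defaultdict

# Hypothetical voter records: (voter_id, household_id, block_id).
# Assumed layout: 3 voters per household, 12 voters per block.
voters = [(v, v // 3, v // 12) for v in range(48)]

def sampling_units(records, level):
    """Partition voter ids into nonoverlapping units at the chosen level."""
    index = {"voter": 0, "household": 1, "block": 2}[level]
    units = defaultdict(list)
    for rec in records:
        units[rec[index]].append(rec[0])
    return dict(units)

by_voter = sampling_units(voters, "voter")          # each unit is one voter
by_household = sampling_units(voters, "household")  # each unit is one household
by_block = sampling_units(voters, "block")          # each unit is one block

# Sampling households (rather than individual voters) as the units.
rng = random.Random(1)  # fixed seed only for reproducibility
chosen_households = rng.sample(sorted(by_household), k=4)
sampled_voters = [v for h in chosen_households for v in by_household[h]]
```

Whatever level is chosen, every voter belongs to exactly one unit, which is the nonoverlapping property described above.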