In addition, the increasing computerization of administrative records has facilitated more extensive linking of previously disconnected administrative databases to create more comprehensive information. Methods to link databases within administrative units based on common identifiers are easy to implement (see Chapter 9 for more details). In the United States, which does not have a legal national identifier or ID document, the increased use of the Social Security Number (SSN) has facilitated linkage across government databases and among commercial data providers. In many European countries, individuals have national identifiers, and efforts are underway to allow for cross-border linkages within the European Union, in order to improve statistics on the workforce and the businesses of its common economic area. However, even when common identifiers are not available, linkage is possible (see Chapter 15).
The result has been that data on individuals, households, and businesses have become richer, collected from an increasing variety of sources: designed surveys and censuses as well as organically created “administrative” data. The desire to allow policy makers and researchers to leverage the rich linked data has been held back, however, by the concerns of citizens and businesses about privacy. In the 1960s in the United States, researchers proposed a “National Data Bank” with the goal of combining survey and administrative data for use by researchers. Congress held hearings on the matter, and ultimately the project did not go forward (Kraus 2013). Instead, and partially as a consequence, privacy laws were formalized in the 1970s. The U.S. “Privacy Act” (Public Law 93-579, 5 U.S.C. § 552a), passed in 1974, specifically prohibited “matching” programs, that is, programs linking data from different agencies. More recently, the 2016 Australian Census elicited substantial controversy when the Australian Bureau of Statistics (ABS) decided to keep identifiable data collected through the census for a substantially longer time period, with the explicit goal of enabling linkages between the census and administrative data, as well as linkages across historical censuses (Australian Bureau of Statistics 2015; Karp 2016).
Subsequent decades saw a decline in public availability of highly detailed microdata on people, households, and firms, and the emergence of new access mechanisms and data protection algorithms. This chapter will provide an overview of the methods that have been developed and implemented to safeguard privacy, while providing researchers the means to draw valid conclusions from protected data. The protection mechanisms we will describe are both physical and statistical (or algorithmic). They exist because of the need to balance the privacy of the respondents, including the confidentiality protection their data receive, against society’s need and desire for ever more detailed, timely, and accurate statistics.
2.2 Paradigms of Protection
There are no methods for disclosure limitation and confidentiality protection specifically designed for linked data. Protecting data constructed by linking administrative records, survey responses, and “found” transaction records relies on the same methods as might be applied to each source individually. It is the richness inherent in the linkages, and in the administrative information available to some potential intruders, that poses novel challenges.
Statistical confidentiality can be viewed as “a body of principles, concepts, and procedures that permit confidentiality to be afforded to data, while still permitting its use for statistical purposes” (Duncan, Elliot, and Salazar-González 2011, p. 2). In order to protect the confidentiality of the data they collect, NSOs and survey organizations (henceforth referred to generically as data custodians) employ many methods. Very often, data are released to the public as tabular summaries. Many of the protection mechanisms in use today evolved to protect published tables against disclosure. Generically, the idea is to limit the publication of cells with “too few” respondents, where the notion of “too few” is assessed heuristically.
We will not provide a detailed history or taxonomy of statistical disclosure limitation (SDL) and formal privacy models, and instead refer the reader to other publications on the topic (Duncan, Elliot, and Salazar-González 2011; Dwork and Roth 2014; FCSM 2005). We do need to set up the problem, which we will do by reviewing suppression, coarsening, swapping, and noise infusion (input and output). These are widely used techniques, and the main issues that arise in applications to linked data can be understood with reference to these methods.
Suppression is widely used to protect published tables against statistical disclosure. Suppression describes the removal of sub-tables, cells, or items in a cell from a published collection of tables if the item’s publication would pose a high risk of disclosure. This method attempts to forge a middle ground between the users of tabular summaries, who want increasingly detailed disaggregation, and publication rules based on cell count thresholds. The Bureau of Labor Statistics (BLS) uses suppression as its primary SDL technique for data releases based on business establishment censuses and surveys. From the outset, it was understood that primary suppression (not publishing easily identified data items) did not protect anything if the agency published the rest of the data, including summary statistics. Users could infer the missing items from what was published (Fellegi 1972). The BLS, and other agencies that rely on suppression, make “complementary suppressions” to reduce the probability that a user can infer the sensitive items from the published data (Holan et al. 2010). But there is no single optimal complementary suppression technology: there are usually multiple complementary suppression strategies that achieve the same protection.
Researchers, however, are not indifferent to these strategies. A researcher who needs detailed geographic variation will benefit from data in which the complementary suppressions are based on removing detailed industries. A researcher who needs detailed industry variation will prefer data with complementary suppression based on geography. Ultimately, the committee that chooses the complementary suppression strategy will determine which research uses are possible and which are ruled out.
But the problem is deeper than this: suppression is a very ineffective SDL technique. Researchers working with the cooperation of the BLS have shown that the suppression strategy used in major BLS business data publications provides almost no protection if it is applied, as is currently the case, to each data release separately (Holan et al. 2010). Some agencies may use cumulative suppression strategies in their sequential data releases. In this case, once an item has been designated for either primary or complementary suppression, it would disappear from the release tables until the entire product is redesigned.
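The mechanics of primary and complementary suppression discussed above can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the 3 x 3 table of counts, the threshold of 3, and the function names are hypothetical and do not reproduce any agency's actual rules. The sketch shows that a single primary suppression can be undone by subtraction from published totals, and that more than one complementary pattern blocks that particular attack.

```python
from typing import List, Optional, Sequence, Tuple

Cell = Optional[int]  # None marks a suppressed cell


def primary_suppress(counts: Sequence[Sequence[int]], threshold: int) -> List[List[Cell]]:
    """Blank every cell whose count is positive but below the threshold."""
    return [[None if 0 < c < threshold else c for c in row] for row in counts]


def suppress_cells(table: Sequence[Sequence[Cell]],
                   cells: Sequence[Tuple[int, int]]) -> List[List[Cell]]:
    """Return a copy of the table with the listed (row, col) cells also blanked."""
    out = [list(row) for row in table]
    for i, j in cells:
        out[i][j] = None
    return out


def recover_from_margins(table: Sequence[Sequence[Cell]],
                         row_totals: Sequence[int],
                         col_totals: Sequence[int]) -> List[List[Cell]]:
    """Repeatedly fill in any blank that is the only gap in its row or column."""
    out = [list(row) for row in table]
    changed = True
    while changed:
        changed = False
        for i in range(len(out)):
            gaps = [j for j in range(len(out[i])) if out[i][j] is None]
            if len(gaps) == 1:
                known = sum(c for c in out[i] if c is not None)
                out[i][gaps[0]] = row_totals[i] - known
                changed = True
        for j in range(len(out[0])):
            gaps = [i for i in range(len(out)) if out[i][j] is None]
            if len(gaps) == 1:
                known = sum(out[i][j] for i in range(len(out)) if out[i][j] is not None)
                out[gaps[0]][j] = col_totals[j] - known
                changed = True
    return out


# Hypothetical establishment counts: rows are industries, columns are counties.
counts = [[12, 2, 9],
          [7, 15, 8],
          [10, 6, 11]]
row_totals = [sum(row) for row in counts]        # published alongside the table
col_totals = [sum(col) for col in zip(*counts)]  # published alongside the table

# Primary suppression alone: the single blank is recovered by subtraction.
primary_only = primary_suppress(counts, threshold=3)
print(recover_from_margins(primary_only, row_totals, col_totals) == counts)

# Two alternative complementary patterns, each leaving at least two blanks
# in every row and column that contains a blank.
pattern_a = suppress_cells(primary_only, [(0, 2), (1, 1), (1, 2)])
pattern_b = suppress_cells(primary_only, [(0, 0), (2, 0), (2, 1)])
for pattern in (pattern_a, pattern_b):
    print(recover_from_margins(pattern, row_totals, col_totals) == pattern)
```

Both patterns defeat this simple subtraction attack, yet they hide different cells, so the choice between them determines which detail remains available to users. The check is deliberately naive: it only tests for exact recovery by subtraction within a single table, and does not rule out narrow bounds on the suppressed values or inference across multiple releases.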
Many social scientists believe that suppression can be complemented by restricted access agreements that allow the researcher to use all of the confidential data but limit what can be published from the analysis. Such a strategy is not a complete solution because SDL must still be applied to the output of the analysis, which quickly brings the problem of which output to suppress back to the forefront.
Custom tabulations and data enclaves. Another traditional response by data custodians to the demand by researchers for more extensive and detailed summaries of confidential data was to create a custom tabulation: a table not previously published, but generated by data custodian staff with access rights to the confidential data, and typically subject to the same suppression rules. As these requests increased, the tabulation and analysis work was offloaded onto researchers by providing them with access to protected microdata. This approach has expanded rapidly in the last two decades, and is widely used around the world. We discuss it in detail later in this chapter.