Box 2.2 Sidebox: Do-It-Yourself Noise Infusion
The interested user might consult a simple example (with fake data) at https://github.com/labordynamicsinstitute/rampnoise (Vilhuber 2017) that illustrates this mechanism.
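As a rough illustration of what such a do-it-yourself exercise might look like, the sketch below multiplies each value by a random factor whose distance from 1 is drawn from a ramp-shaped density. It is not taken from the repository cited above. The function names, the bounds a and b, their default values, and the exact shape of the density are assumptions made here for illustration only.

```python
import random


def draw_ramp_factor(a, b, rng):
    """Draw one multiplicative noise factor (hypothetical helper).

    The factor is 1 + m or 1 - m with equal probability, where the
    magnitude m lies in [a, b) and has a linearly declining ("ramp")
    density that is highest at a and falls to zero at b.
    """
    if not 0 < a < b < 1:
        raise ValueError("require 0 < a < b < 1")
    u = rng.random()
    # Inverse-CDF draw: F(m) = 1 - ((b - m) / (b - a)) ** 2 on [a, b].
    magnitude = b - (b - a) * (1.0 - u) ** 0.5
    sign = 1 if rng.random() < 0.5 else -1
    return 1.0 + sign * magnitude


def infuse_noise(values, a=0.1, b=0.25, seed=12345):
    """Return a noisy copy of `values`; a, b and seed are assumed defaults."""
    rng = random.Random(seed)
    return [v * draw_ramp_factor(a, b, rng) for v in values]


if __name__ == "__main__":
    fake_counts = [120.0, 45.0, 300.0]  # fake data, as in the sidebox
    print(infuse_noise(fake_counts))
```

Because the magnitude is never smaller than a, every value is distorted by at least that proportion, while the bound b caps the distortion. Both bounds would be kept confidential in a real application.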
2.4 Physical and Legal Protections
The provision of very detailed micro-tabulations or public-use microdata may not be sufficient to inform certain types of research questions. In particular, for business data the thresholds that trigger SDL suppression methods are met far more often than for individuals or households. In those cases, the research community needs controlled access to confidential microdata. Three key reasons why access to microdata may be beneficial are:
(i) microdata permit policy makers to pose and analyze complex questions. In economics, for example, analysis of aggregate statistics does not give a sufficiently accurate view of the functioning of the economy to allow analysis of the components of productivity growth;
(ii) access to microdata permits analysts to calculate marginal rather than just average effects. For example, microdata enable analysts to do multivariate regressions whereby the marginal impact of specific variables can be isolated;
(iii) broadly speaking, widely available access to microdata enables replication of important research (United Nations 2007, p. 4).
As we’ve outlined above, many confidentiality concerns have either led to the removal of public-use microdata versions of linked files or prevented their creation, increasing the need to provide alternate access to the confidential microdata.
NSOs and survey organizations usually provide access to confidential linked data within restricted-access data centers. In the United States, this means either using one of 30 secure sites managed by the Census Bureau as part of the Federal Statistical Research Data Center System (FSRDC), 12 or going to the headquarters of the statistical agency. Similarly, in other countries, access is usually restricted to headquarters of NSOs. Secure enclaves managed by NSOs used to be rare. In the 1990s and early 2000s, existing networks expanded and new, alternate methods of accessing data housed in secure enclaves were created in several countries. Access may be through physical travel, remote submission, or remote processing. However, all methods rely on two fundamental elements. First, the researchers accessing the data are mostly free to choose their modeling strategy, and are not restricted to the tables or queries that the data curator has used for published statistics. Second, the output from such models is then analyzed to avoid unauthorized disclosure, and subsequently released to the researcher for publication.
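The second element, review of output before release, can be sketched in a few lines. The example below assumes one simple and widely used kind of check, a minimum cell size for tabular output. The function name, the threshold of 10, and the data layout are hypothetical and do not describe the rules of any particular agency. Actual disclosure review covers far more than cell counts.

```python
def review_output(cell_counts, min_cell_size=10):
    """Split tabulated cells into releasable and flagged ones.

    `cell_counts` maps a cell label to the number of contributing units.
    `min_cell_size` is an assumed threshold, not an official rule.
    """
    released, flagged = {}, {}
    for cell, n_units in cell_counts.items():
        if n_units >= min_cell_size:
            released[cell] = n_units
        else:
            # Too few contributors: hold back for manual review.
            flagged[cell] = n_units
    return released, flagged


if __name__ == "__main__":
    table = {"manufacturing": 250, "mining": 3, "retail": 10}  # fake data
    ok, held = review_output(table)
    print("released:", ok)
    print("held for review:", held)
```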
Several methods are currently used by NSOs and other data collecting agencies to provide access to confidential data. Sections 2.4.1–2.4.5 will describe each of them in turn. 13
2.4.1 Statistical Data Enclaves
Statistical data enclaves, or Research Data Centers, are secure computing facilities that provide researchers with access to confidential microdata, while putting restrictions on the content that can be removed from the facility. The different advisory committees of the two largest professional associations (the ASA and the American Economic Association, AEA) pushed for easier and broader access for researchers as far back as the 1960s, though the emphasis then was on avoiding the cost of making special tabulations. The AEA suggested creating Census data centers at selected universities (Kraus 2013). In the 1990s and early 2000s, similar networks started in other countries. In Canada, the Canada Foundation for Innovation (CFI) awarded a number of grants to open research data centers, with the first opening at McMaster University (Hamilton, Ontario) in 2000. 14 The creation of the RDCs was specifically motivated by the inability to ensure confidentiality while providing usability of longitudinally linked survey data (Currie and Fortin 2015).
In the United States, a 2004 grant by the National Science Foundation laid the groundwork for subsequent expansion of the (then Census) Research Data Center network from 8 locations, open since the mid-1990s, to over 30 locations in 2017. One of the key motivations was to make the newly available linked administrative data at LEHD accessible to researchers. The network operates under physical security constraints managed by the Census Bureau and the IRS, in locations that are considered part of the Census Bureau itself, and staffed by Census Bureau employees.
Statistical data enclaves can be central locations, in which a single location at the statistical agency is made available to approved researchers. In the United States, NCHS and BLS follow this model, in addition to using the FSRDC network. In Canada, business data can be accessed at Statistics Canada headquarters, while other data may be accessed both there and at the geographically dispersed RDCs, which obtain physical copies of the confidential data.
Some facilities are hybrid facilities. The statistical processing occurs at a central location, but the secure remote access facilities are distributed geographically. The U.S. FSRDCs have worked this way since the early 2000s. A central computing facility is housed in the Census Bureau’s primary data center. Secure remote access is provided to approved researchers at designated sites throughout the country, namely the FSRDCs. Each of the FSRDC sites is a secure Census Bureau facility that is physically located on controlled premises provided by the partner organization, often a university or Federal Reserve Bank. The German IAB locates certified thin clients in dedicated rooms at partner institutions. Secure spaces are costly to build and certify. Recently, institutions in the United Kingdom have attempted to reduce the cost by commoditizing such secure spaces (Raab, Dibben, and Burton 2015). In France, the Centre d’accès sécurisé distant aux données (CASD) has a secure central computing facility, and allows for remote access through custom secure devices from designated but otherwise ordinary university offices, which satisfy certain physical requirements, but are not dedicated facilities. Similar arrangements are used by Scandinavian NSOs, as well as by survey organizations such as the HRS. Remote access to full desktop environments within the secure data enclave, commonly referred to as “virtual desktop infrastructure” (VDI), from regular laptops or workstations, is increasingly common.
The location of remote access points is often limited to the country of the data provider (United States, Canada), or to countries with reciprocal or common enforcement mechanisms (within the European Union, for European NSOs). Cross-border access, even within the European Union, remains exceedingly rare, with only a handful of cross-border secure remote access points open in the European Union. The most prolific user of cross-border secure remote access points, as of this writing, is the German IAB, with multiple data access points in the United States and a recently opened one in the United Kingdom.
Two alternative remote access mechanisms are also often used: manual and automatic remote processing. Manual remote processing occurs when the remote “processor” is a staff member of the data provider. This can be as simple as sending programs in by email, or finding a co-author who is an employee of the data provider. The U.S. NCHS, German IAB, and Statistics Canada provide this type of access. Generally, the costs of manual remote processing are paid by the users.
More sophisticated mechanisms automate some or all of the data flow. For instance, programs may be executed automatically based on email or web submission, but disclosure review is performed manually. This method is used by the IAB’s JoSuA (Institute for Employment Research 2016). Fully automated mechanisms, such as LISSY (Luxembourg), ANDRE (U.S. NCHS), DAS (U.S. NCES), Australia’s Remote Access Data Laboratory (RADL), and Canada’s Real Time Remote Access (RTRA), generally restrict the command set of the allowed statistical programming languages (SAS, Stata, and SPSS) and limit users to the statistical procedures and languages for which automated disclosure limitation procedures have been implemented.
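The idea of restricting the command set can be illustrated with a minimal sketch that screens a submitted program against an allowlist before it is run. The code does not reproduce how LISSY, ANDRE, DAS, RADL, or RTRA work. The allowlist, the function name, and the Stata-like syntax rules (one command per line, comments starting with `*`) are all assumptions made for illustration.

```python
# Hypothetical allowlist: commands for which automated disclosure
# limitation is assumed to exist in this toy system.
ALLOWED_COMMANDS = {"use", "regress", "summarize", "tabulate"}


def check_program(source, allowed=ALLOWED_COMMANDS):
    """Return (line_number, command) pairs that are not on the allowlist.

    Assumes one command per line, with the command as the first word,
    and treats blank lines and lines starting with '*' as ignorable.
    """
    rejected = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        stripped = line.strip()
        if not stripped or stripped.startswith("*"):
            continue
        command = stripped.split()[0].lower()
        if command not in allowed:
            rejected.append((lineno, command))
    return rejected


if __name__ == "__main__":
    submitted = "use jobs\n* fit the model\nregress wage tenure\nlist in 1/10\n"
    problems = check_program(submitted)
    if problems:
        print("program refused:", problems)
    else:
        print("program accepted for execution")
```

In this toy example the `list` command, which would print individual records, is refused, while the regression is allowed through. A production system would need a real parser for the language in question rather than a first-word check.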