Secure multiparty computing may be one solution to this problem (Sanil et al. 2004; Karr et al. 2005, 2006, 2009). However, implementation of such methods, at least in the domain of the social and medical sciences cooperating with NSOs, is in its infancy (Raab, Dibben, and Burton 2015). The typical limitations are the throughput of the secure interconnection between the sources and the requirement of manual model output checking. These limitations drastically slow down any iterative procedure.
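To make the idea concrete, the following is a minimal sketch of a secure sum based on additive secret sharing, one standard building block of secure multiparty computation. It is an illustration under stated assumptions, not the protocol of the works cited above: the function names, the modulus, and the three counts are hypothetical, and the parties are simulated in a single process rather than over a network.

```python
import secrets

# Public modulus; assumed to be larger than any true total (hypothetical choice).
MODULUS = 2**61 - 1


def make_shares(value, n_parties, modulus=MODULUS):
    """Split a non-negative value into additive shares that sum to it mod modulus."""
    shares = [secrets.randbelow(modulus) for _ in range(n_parties - 1)]
    last = (value - sum(shares)) % modulus
    return shares + [last]


def secure_sum(private_values, modulus=MODULUS):
    """Simulate one round in which each party learns only the overall total."""
    n = len(private_values)
    # Row i holds the shares created by party i; column j is what party j receives.
    share_matrix = [make_shares(v, n, modulus) for v in private_values]
    # Each party j adds the shares it received and publishes only that partial sum.
    partial_sums = [sum(row[j] for row in share_matrix) % modulus for j in range(n)]
    return sum(partial_sums) % modulus


# Hypothetical counts held by three data custodians.
counts = [120, 45, 310]
total = secure_sum(counts)
```

In this sketch every party sends one share to every other party in each round, so an iterative estimator that needs many such rounds multiplies the traffic over the secure interconnection accordingly.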
The goal of this chapter has been to illustrate how confidentiality protection methods can be and have been applied to linked administrative data. Our examples provide a guide to best practices for data custodians endeavoring to walk the fine line between making data accessible and protecting individual privacy and confidentiality. Our examples also illustrate different paradigms of protection ranging from the more traditional approach of physical security to more modern formal privacy systems and the provision of synthetic data.
In concluding, we note that from a theoretical perspective, there does not appear to be a clear distinction between the threats to confidentiality in linked data relative to unlinked data, or in survey data relative to administrative data. Richly detailed data pose disclosure risks, irrespective of whether that richness is inherent in the data design, or comes from linkages of variables from multiple sources. Likewise, there are no special methods to protect confidentiality in linked versus unlinked data. Any data with a network, relational, panel or hierarchical structure poses special challenges to data providers to protect confidentiality while preserving analytical validity. Our example of the QWI shows one way this challenge has been successfully managed in a linked data setting, but the same tools could be effective in application to the QCEW, which uses the same frame, but does not involve worker-firm linkages.
However, from a legal perspective, linking two datasets can change the nature of confidentiality protection in practice. Any output must conform to the strongest privacy protections required across all of the linked datasets. For example, when the LEHD program links SSA data on individuals to IRS data on firms, any downstream research must comply with the confidentiality demands of all three agencies: SSA, IRS, and the Census Bureau. Likewise, the data must conform to the U.S. Census Bureau publication thresholds for data involving individuals and firms. Hence, linking data can produce a maze of confidentiality requirements that are difficult to articulate, comply with, and monitor. Harmonizing or standardizing such requirements and practices across data providers, both public and private, and across jurisdictions would be helpful. Privacy and confidentiality issues also invite updated and continuing research on the demand for privacy from citizens and businesses, as well as the social benefit that arises from the dissemination of data.
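The rule that output must satisfy the strongest protection among the linked sources can be sketched as follows. This is a simplified illustration only: the source names, the minimum cell counts, and the helper functions are hypothetical and do not reproduce any agency's actual disclosure rules, which typically involve more than a single count threshold.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DisclosureRule:
    source: str
    min_cell_count: int  # smallest publishable cell size (hypothetical value)


def strictest_min_count(rules):
    """The binding threshold is the largest minimum across all linked sources."""
    return max(rule.min_cell_count for rule in rules)


def publishable_cells(cells, rules):
    """Keep only the cells that satisfy every source's minimum cell count."""
    threshold = strictest_min_count(rules)
    return {name: n for name, n in cells.items() if n >= threshold}


# Hypothetical rules for three linked sources.
rules = [
    DisclosureRule("source_a", 3),
    DisclosureRule("source_b", 10),
    DisclosureRule("source_c", 5),
]

# Hypothetical tabulation cells with their underlying counts.
cells = {"county_1": 12, "county_2": 4, "county_3": 10}
released = publishable_cells(cells, rules)
```

Taking the maximum keeps the composition simple to audit: adding a new linked source can only tighten the binding threshold, never loosen it.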
2.A Appendix: Technical Terms and Acronyms
ACS – American Community Survey, a large survey conducted continuously by the U.S. Census Bureau on topics such as jobs and occupations, educational attainment, veterans, and housing characteristics (https://www.census.gov/programs-surveys/acs/)
BDS – Business Dynamics Statistics, produced by the U.S. Census Bureau; see https://www.census.gov/programs-surveys/bds.html for more details.
CBP – County Business Patterns, produced by the U.S. Census Bureau; see www.census.gov/programs-surveys/cbp.html for more details.
COEP – Canadian Out-of-Employment Panel, a survey initially conducted by McMaster University in Canada, subsequently taken over by Statistics Canada (Browning, Jones, and Kuhn 1995)
COMPUSTAT – a commercial database maintained by Standard and Poor’s, with information on companies in the United States and around the world (http://www.compustat.com/).
HRS – Health and Retirement Study, a long-running survey run by the Institute for Social Research at the University of Michigan in the United States on aging in the United States population (http://hrsonline.isr.umich.edu/)
LEHD – Longitudinal Employer-Household Dynamics Program at the U.S. Census Bureau, which links data provided by 51 state administrations to data from federal agencies and surveys (https://lehd.ces.census.gov/)
LODES – LEHD Origin-Destination Employment Statistics describe the geographic distribution of jobs according to the place of employment and the place of worker residence, in part through the flagship webapp OnTheMap (https://onthemap.ces.census.gov/)
QWI – Quarterly Workforce Indicators, a set of local statistics of employment and earnings, produced by the Census Bureau’s LEHD program (https://lehd.ces.census.gov/data/)
SIPP – Survey of Income and Program Participation is conducted by the U.S. Census Bureau on topics such as economic well-being, health insurance, and food security (https://www.census.gov/sipp/).
SSB – the SIPP Synthetic Beta File, also known as “SIPP/SSA/IRS Public Use File”
2.A.1 Other Abbreviations
ABS – Australian Bureau of Statistics, the Australian NSO (http://abs.gov.au/)
AEA – American Economic Association (https://www.aeaweb.org)
ASA – American Statistical Association (https://www.amstat.org)
BLS – Bureau of Labor Statistics, the NSO in the United States providing data on “labor market activity, working conditions, and price changes in the economy.” (https://bls.gov)
CASD – Centre d’accès sécurisé distant aux données, the French remote access system to most administrative data files (https://casd.eu)
Census Bureau – the largest statistical agency in the United States (https://census.gov)
CMS – Centers for Medicare & Medicaid Services, which administers US government health programs such as Medicare, Medicaid, and others (https://cms.gov/)
EIA – Energy Information Administration, collecting and disseminating information on energy generation and consumption in the United States (https://eia.gov).
FICA – Federal Insurance Contributions Act, the law regulating the system of social security benefits in the United States
FSRDC – Federal Statistical Research Data Centers were originally created as the U.S. Census Bureau Research Data Centers. They provide secure facilities for authorized remote access to restricted-use government microdata, and are structured as partnerships between federal statistical agencies and research institutions (https://www.census.gov/fsrdc)
IAB – Institute for Employment Research at the German Ministry of Labor (http://iab.de/en/iab-aktuell.aspx)
IRS – Internal Revenue Service handles tax collection for the US government (https://irs.gov)
NCHS – National Center for Health Statistics, the US NSO charged with collecting and disseminating information on health and well-being (https://www.cdc.gov/nchs/)
NSO – National statistical office. Most countries have a single national statistical agency, but some (for example, the United States and Germany) have multiple statistical agencies
OASDI – Old Age, Survivors and Disability Insurance program, the official name for Social Security in the United States
QCEW – Quarterly Census of Employment and Wages is a program run by the BLS, collecting firm-level reports of employment and wages, and publishing quarterly estimates for about 95% of US jobs (https://www.bls.gov/cew/)
SER – Summary Earnings Records in SSA data
SSA – Social Security Administration, administers government-provided retirement, disability, and survivors benefits in the United States (https://ssa.gov)
SSN – Social Security Number, an identification number in the United States, originally used for management of benefits administered by the SSA, but since expanded and serving as a quasi-national identification number