Overview
Government agencies collect and manage voluminous datasets, which have great potential to support both research and economic development. The rise of data science and AI-based technologies has created new, and as yet unrecognized, opportunities for use of this data. Unfortunately, much of this data is or may be sensitive, which limits public access. Data sensitivity arises in various ways. Some of the more obvious sources include:
- Individual privacy. Data may reflect confidential information about people. Even if the data alone does not pose significant privacy risks, combining it with other data sources may create substantial risks of misuse.
- Corporate confidentiality. Data may contain important business information that is not public, and release could have substantial economic impact on the entities the data is about.
- Political sensitivity. Datasets held by agencies may contain information that could be misinterpreted if disclosed. For example, data on individuals used by the Department of Justice for statistical analysis could be misconstrued as surveillance if viewed out of context.
There are also less obvious, but still significant, impediments to sharing data widely. These
include 1) regulatory and policy-based restrictions, 2) technological impediments, such as how the
data is formatted and stored, and 3) budget-imposed constraints, such as lack of public-facing
metadata and documentation on the source, quality, and semantics of data.
Existing fora focus on narrow aspects of these problems, such as technological or policy approaches to data privacy, or user groups for specific types of data. The goal of this workshop is
to look more broadly at impediments to wider use of government data. To accomplish this
goal, we aim to bring together data stewards from across the federal government with researchers
in policy and technology to identify promising directions and open challenges in addressing these
impediments.
The outcome of the workshop will be a report, authored primarily by the organizers but open
to contribution by workshop participants, detailing research challenges identified and outlining
proposed research directions to address them.
Background
The federal government has a long history of data collection. The collection of potentially sensitive
information on the citizenry dates back to the Constitution (while not specifically required by the
Constitution, the first decennial census asked about gender, race, relationships, slaveholder status,
and in some states, occupation), and the requirement that some government-collected information
be shared dates back at least to the Patent Act of 1790. Concern with the privacy of that data grew
from the late 19th century onward, perhaps most famously expressed in the 1890 article “The Right
to Privacy”, leading to a patchwork of laws, regulations, and policies to restrict access to and
misuse of government-held data.
The 21st century has seen a growing awareness of the need to better utilize this data to support
public safety, leading to efforts such as the Terrorism Information Awareness program that was
established after the 9/11 attack and the 1/21/2021 Executive Order on Ensuring a Data-Driven
Response to COVID-19 and Future High-Consequence Public Health Threats. At the same time,
there has been growing concern that such programs, even using already collected data shared only
within the government, could be misused, leading to substantial public harm. The rise in data
breach incidents (and the recognition that government is not immune) has made clear the external
risk of misuse of the data; more comprehensive datasets increase the potential damage of such a
breach. As a result, efforts to improve data sharing and use have proven challenging, and have lagged
behind private enterprise creation of large and comprehensive (and largely unregulated or lightly
regulated) datasets.
Recent years have seen significant research advances in data sharing and data privacy. For
context, we outline a few areas here, but recognize that these fields are substantially larger and more
diverse.
Data Sharing
Advances in schema integration and record linkage promise substantial improvement
in our ability to combine data from multiple sources. There has been substantial progress in areas
such as data provenance and self-describing data formats such as XML. There have also been
advances in privacy-preserving record linkage.
While these have been motivated by the needs above, they have as yet had limited practical
impact on a broad scale. Efforts such as the Department of Defense Core Architecture Data
Model have evolved into the DoD Architecture Framework Version 2, which aims to support “an
architectural description consistent with specific project or mission objectives” rather than a
DoD-wide logical data model. The growth in data complexity is making problems associated with
sharing more challenging.
Privacy Technology
Several advances have given us new ways to use data with reduced privacy risk. These broadly fall into:
- Formal privacy techniques. By injecting randomness into the results of data analysis, formal privacy methods limit the extent to which information specific to individuals can be reliably ascertained from the results. These methods have shown some practical success, but applying them to more complex data has proven more difficult.
- Computing with Encrypted Data. Applications of Secure Multiparty Computation, and more recently Fully Homomorphic Encryption, provide new opportunities to use data while controlling the risk of disclosure. The basic idea, that the data remains encrypted and thus not at risk of disclosure while it is being used, seems an ideal solution. Research successes include techniques for machine learning and encrypted database management systems. However, real-world deployments, although they do exist, have been limited.
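As a minimal illustration of the formal privacy idea in the list above, the following Python sketch adds Laplace noise to a count query, the textbook mechanism for epsilon-differential privacy. The function names, the example records, and the choice of epsilon are assumptions made for illustration and are not drawn from any agency system. Floating-point noise sampling of this kind is also known to be unsuitable for production releases.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    # If X and Y are independent Exp(rate = 1/scale), X - Y ~ Laplace(0, scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_count(records, predicate, epsilon: float, rng: random.Random) -> float:
    """Return a count perturbed for epsilon-differential privacy.

    Adding or removing one record changes a count by at most 1
    (sensitivity 1), so the Laplace scale is 1 / epsilon.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    true_count = sum(1 for record in records if predicate(record))
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Hypothetical survey records: (age, employed)
records = [(34, True), (51, False), (29, True), (62, True), (45, False)]
rng = random.Random(0)  # fixed seed only so the sketch is repeatable
released = noisy_count(records, lambda r: r[1], epsilon=0.5, rng=rng)
print(f"noisy count of employed respondents: {released:.2f}")
```

A smaller epsilon gives a larger noise scale and therefore stronger protection at the cost of accuracy. A single count is the easy case; the complex, high-dimensional data discussed in this section is where such mechanisms become difficult to apply.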
The next wave of privacy technology research needs to focus on the real-world challenges that are not yet addressed by the technology. In many cases, the data that agencies would like to share is highly complex: for example, survey data with hundreds or thousands of variables of different types and a complex survey design, which makes it very challenging to apply emerging technologies. This workshop aims to identify key cross-cutting challenges and spur research to address them.
Organizers
Speakers
Agenda
All times are listed in Eastern Daylight Time (EDT)
May 21, 2021
Opening Remarks
Erwin Gianchandani, National Science Foundation
Lynne Parker, White House Office of Science and Technology Policy
Keynote: Confidentiality-Utility Tradeoffs and Stakeholder Communication
John Abowd, U.S. Census Bureau
Differential Privacy in Practice
Ashwin Machanavajjhala, Duke University
Some Elements of the Interface of Privacy Protection with Data Quality, Risk and Cost
John Eltinge, U.S. Census Bureau
Break
Country-scale deployments of cryptographic security and privacy technologies
Dan Bogdanov, Cybernetica
The Perspective of the Chief Data Officer
Dan Morgan, Department of Transportation
COVID-19 Case Data Publication
Brian Lee, U.S. Centers for Disease Control and Prevention
Synthetic Data Generation
Jerry Reiter, Duke University
Break
Secure Multiparty Computation in the Boston Women's Workforce Network
Mayank Varia, Boston University
Multi-tier Access and Data User Needs
Barbara Downs, U.S. Census Bureau
Adversarial Modeling
Bradley Malin, Vanderbilt University
The TREC Datasets
Ellen Voorhees, NIST
Discussion and Planning for Day 2
May 26, 2021
Opening Remarks
Margaret Martonosi, National Science Foundation
Elham Tabassi, National Institute of Standards and Technology
Organizers