Joint NSF-NIST Workshop

To Develop a Roadmap for Greater Public Use of Privacy-Sensitive Government Data



Overview

Government agencies collect and manage voluminous datasets, which have great potential to support both research and economic development. The rise of data science and AI-based technologies has created new, and as yet unrecognized, opportunities for use of this data. Unfortunately, much of this data is or may be sensitive, which limits public access. Data sensitivity arises in various ways. Some of the more obvious include:

  • Individual privacy. Data may reflect confidential information about people. Even if the data does not pose significant privacy risks on its own, combining it with other data sources may result in substantial risks of misuse.
  • Corporate confidentiality. Data may contain important business information that is not public, and release could have substantial economic impact on the entities the data is about.
  • Political sensitivity. Datasets held by agencies may contain information whose disclosure could be misinterpreted. For example, data on individuals used by the Department of Justice for statistical analysis could be misconstrued as surveillance if viewed out of context.

There are also less obvious, but still significant, impediments to sharing data widely. These include 1) regulatory and policy-based restrictions, 2) technological impediments, such as how the data is formatted and stored, and 3) budget-imposed constraints, such as lack of public-facing metadata and documentation on the source, quality, and semantics of data.

Existing fora focus on narrow aspects of these problems, such as technological or policy approaches to data privacy, or user groups for specific types of data. The goal of this workshop is to focus more broadly on impediments to broader use of government data. To accomplish this goal, we aim to bring together data stewards from across the federal government with researchers in policy and technology to identify promising directions and open challenges in addressing these impediments.

The outcome of the workshop will be a report, authored primarily by the organizers but open to contribution by workshop participants, detailing research challenges identified and outlining proposed research directions to address them.

Background

The federal government has a long history of data collection. The collection of potentially sensitive information on the citizenry dates back to the Constitution (while not specifically required by the Constitution, the first decennial census asked about gender, race, relationships, slaveholder status, and in some states, occupation), and the requirement that some government-collected information be shared dates back at least to the Patent Act of 1790. The late 19th and 20th centuries saw growing concern with privacy of that data, perhaps most famously expressed in the 1890 article “The Right to Privacy”, leading to a patchwork of laws, regulations, and policy to restrict access to and misuse of government-held data.

The 21st century has seen a growing awareness of the need to better utilize this data to support public safety, leading to efforts such as the Terrorism Information Awareness program that was established after the 9/11 attacks and the 1/21/2021 Executive Order on Ensuring a Data-Driven Response to COVID-19 and Future High-Consequence Public Health Threats. At the same time, there has been growing concern that such programs, even when using already collected data shared only within the government, could be misused, leading to substantial public harm. The rise in data breach incidents (and the recognition that government is not immune) has made clear the external risk of misuse of the data; more comprehensive datasets increase the potential damage of such a breach. As a result, efforts to improve data sharing and use have proven challenging, and have lagged behind private enterprise creation of large and comprehensive (and largely unregulated or lightly regulated) datasets.

Recent years have seen significant research advances in data sharing and data privacy. For context, we outline a few areas here, but recognize that these fields are substantially larger and more diverse.

Data Sharing

Advances in schema integration and record linkage promise substantial improvement in our ability to combine data from multiple sources. There has been substantial progress in areas such as data provenance and self-describing data formats such as XML. There have also been advances in privacy-preserving record linkage.

While these have been motivated by the needs above, they have as yet had limited practical impact on a broad scale. Efforts such as the Department of Defense Core Architecture Data Model have evolved into the DoD Architecture Framework Version 2, which aims to support “an architectural description consistent with specific project or mission objectives” rather than a DoD-wide logical data model. The growth in data complexity is making problems associated with sharing more challenging.

Privacy Technology

Several advances have given us new ways to use data with reduced privacy risk. These broadly fall into:

  • Formal privacy techniques. Through incorporating non-determinism into the results of data analysis, formal privacy methods limit the extent to which information specific to individuals can be reliably ascertained from the results. While these methods have shown some practical success, more complex data have proven more difficult to handle.
  • Computing with Encrypted Data. Applications of Secure Multiparty Computation, and more recently Fully Homomorphic Encryption, provide new opportunities to use data while controlling the risk of disclosure. The basic idea, that the data remains encrypted and thus not at risk of disclosure while it is being used, seems an ideal solution. Research successes include techniques for machine learning and encrypted database management systems. However, real-world deployments have been limited (although they do exist).
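
As a rough illustration of the two ideas above, the following Python sketch shows (1) a noisy count, where Laplace noise is added to a counting query in the style of differential privacy, and (2) a sum computed over additive secret shares, one simple building block of secure multiparty computation. This is a minimal sketch under stated assumptions, not a description of any agency's or speaker's method: the function names, the modulus, and the example data are hypothetical, and a real deployment would use a cryptographically secure random source and a vetted library.

```python
import random

# Hypothetical modulus for additive secret sharing (an assumption for this sketch).
MODULUS = 2**61 - 1


def noisy_count(records, predicate, epsilon, rng):
    """Return a count with Laplace noise added (counting queries have sensitivity 1)."""
    true_count = sum(1 for r in records if predicate(r))
    rate = epsilon  # Laplace scale is 1/epsilon, so the exponential rate is epsilon
    # The difference of two i.i.d. exponential draws is Laplace-distributed.
    noise = rng.expovariate(rate) - rng.expovariate(rate)
    return true_count + noise


def share(value, n_parties, rng):
    """Split value into n_parties additive shares that sum to value mod MODULUS."""
    shares = [rng.randrange(MODULUS) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % MODULUS)
    return shares


def secure_sum(private_values, rng):
    """Sum the parties' values using only shares; no single share reveals an input."""
    n = len(private_values)
    # all_shares[i][j] is the share of party i's value that is sent to party j.
    all_shares = [share(v, n, rng) for v in private_values]
    # Each party j adds up the shares it received and publishes only that partial sum.
    partials = [sum(all_shares[i][j] for i in range(n)) % MODULUS for j in range(n)]
    return sum(partials) % MODULUS


if __name__ == "__main__":
    # random.Random is used for reproducibility only; it is NOT secure for real use.
    rng = random.Random(2021)
    ages = [23, 35, 41, 67, 70, 18]  # made-up example data
    print("noisy count of age >= 65:", noisy_count(ages, lambda a: a >= 65, 1.0, rng))
    print("secure sum of three inputs:", secure_sum([10, 20, 30], rng))
```

A smaller epsilon adds more noise and so gives stronger protection at the cost of accuracy, which is one concrete form of the confidentiality-utility tradeoff. In the secret-sharing sketch each individual share is uniformly random, so the inputs are protected only as long as the parties do not pool the shares they receive.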

The next wave of privacy technology research needs to focus on the real-world challenges that are not yet addressed by the technology. In many cases, the data that agencies would like to share are highly complex, for example, survey data with hundreds or thousands of variables of different types and a complex survey design, which makes it very challenging to apply emerging technologies. This workshop aims to identify key cross-cutting challenges and spur research to address them.

Agenda

All times are listed in Eastern Daylight Time (EDT)

May 21, 2021
11:00am - 11:30am

Opening Remarks

Erwin Gianchandani, National Science Foundation
Lynne Parker, White House Office of Science and Technology Policy

11:30am - 12:00pm

Keynote: Confidentiality-Utility Tradeoffs and Stakeholder Communication

John Abowd, U.S. Census Bureau

12:00pm - 12:30pm

Differential Privacy in Practice

Ashwin Machanavajjhala, Duke University

12:30pm - 1:00pm

Some Elements of the Interface of Privacy Protection with Data Quality, Risk and Cost

John Eltinge, U.S. Census Bureau

1:00pm - 1:15pm

Break

1:15pm - 1:45pm

Country-scale deployments of cryptographic security and privacy technologies

Dan Bogdanov, Cybernetica

1:45pm - 2:15pm

The Perspective of the Chief Data Officer

Dan Morgan, Department of Transportation

2:15pm - 2:45pm

COVID-19 Case Data Publication

Brian Lee, U.S. Centers for Disease Control and Prevention

2:45pm - 3:15pm

Synthetic Data Generation

Jerry Reiter, Duke University

3:15pm - 3:30pm

Break

3:30pm - 3:50pm

Secure Multiparty Computation in the Boston Women's Workforce Network

Mayank Varia, Boston University

3:50pm - 4:10pm

Multi-tier Access and Data User Needs

Barbara Downs, U.S. Census Bureau

4:10pm - 4:30pm

Adversarial Modeling

Bradley Malin, Vanderbilt University

4:30pm - 4:50pm

The TREC Datasets

Ellen Voorhees, NIST

4:50pm - 5:00pm

Discussion and Planning for Day 2

May 26, 2021
11:00am - 11:30am

Opening Remarks

Margaret Martonosi, National Science Foundation
Elham Tabassi, National Institute of Standards and Technology
Workshop Organizers

11:30am - 1:00pm

Breakout / Working Session 1

1:00pm - 1:45pm

Break

1:45pm - 2:15pm

Breakout Status Reports

2:15pm - 3:00pm

Panel on Government - Academia Interaction?

3:00pm - 3:15pm

Break

3:15pm - 4:30pm

Breakout / Working Session 2

4:30pm - 5:00pm

Breakout Summaries