Higher Education, Data Transparency, and the Limits of Data Anonymization

by Reihan Salam

The editors of Bloomberg View tout various state-level initiatives to contain the rising cost of higher education. Most of these ideas are perfectly reasonable, and likely to do at least some good. But I am increasingly convinced that unless governments do a better job of measuring student learning and labor market outcomes, any reform efforts will be of limited use. Advocates of higher education reform often argue that we ought to reward the most effective institutions, i.e., the institutions that do the most to improve student outcomes per dollar spent. The problem, however, is that we don’t have very good tools for assessing outcomes. Andrew P. Kelly and Daniel K. Lautzenheiser of the American Enterprise Institute offer two ideas for how states might address this data vacuum:

States should require institutions to measure student learning outcomes in a rigorous, reliable, and comparable way. This is not to suggest that states should coerce all institutions to use the same standardized test or that these tests should be a requirement for graduation. Rather, institutions should have the opportunity to choose from a menu of assessments, with the results made public. Administering an exam twice during a student’s tenure can allow institutions to measure the value added by the institution as a whole, providing less-selective institutions with an opportunity to showcase the gains their students make while in attendance. At the two-year level, policymakers should also report the proportion of remedial students who went on to successfully complete credit-bearing courses.
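The "test twice" value-added idea can be illustrated with a toy calculation. The institutions and scores below are invented for illustration; the point is that mean gain, not absolute score, is what showcases less-selective institutions:

```python
# Hypothetical illustration of measuring value added by testing students
# at entry and at exit. All names and numbers are invented.

def value_added(scores):
    """Mean gain between entry and exit assessments for one institution."""
    gains = [exit_score - entry_score for entry_score, exit_score in scores]
    return sum(gains) / len(gains)

# (entry score, exit score) pairs for each student
selective_u = [(85, 90), (88, 91), (90, 93)]    # high scores, small gains
open_access_c = [(55, 72), (60, 74), (58, 70)]  # lower scores, large gains

# The less-selective institution shows the larger mean gain.
print(value_added(selective_u))
print(value_added(open_access_c))
```

On this measure the open-access college outperforms the selective university even though its students' absolute scores are lower throughout.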

Kelly and Lautzenheiser also call for giving students, parents, and taxpayers more useful and reliable information on labor market outcomes:


Similarly, states should take steps to link data on postsecondary experience with earnings and employment information. Some institutions already try to measure employment and earnings using graduate surveys, but these are expensive to conduct and often suffer from low response rates. Linking administrative data from postsecondary and wage records is likely to be more informative and less expensive in the long run (despite start-up costs). With these data systems in hand, states would ideally be able to connect average earnings to both institutions and degree programs. Done right, this would enable prospective students to say, “If I am an accounting major at Eastern State University, the average wage one year after graduation is $50,000,” and then compare that to accounting programs at other institutions.

Students who are unsure about what type of program to pursue or what to major in may find this useful as well, as it will provide them with a sense of what credentials are likely to lead to a good job. These data can be particularly helpful to combat the myth that a bachelor’s degree is the only path to the middle class. Evidence suggests that graduates from some two-year and certificate programs outearn those with bachelor’s degrees, at least in the near term.
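The linkage Kelly and Lautzenheiser describe is, at bottom, a join on a shared student identifier. A minimal sketch, with invented records and figures (the student IDs, institutions, and wages below are not real data):

```python
# Sketch of linking postsecondary records to wage records on a shared
# student ID, then averaging first-year earnings by institution and
# program. All data invented for illustration.
from collections import defaultdict

# administrative postsecondary records: student_id -> (institution, major)
graduates = {
    1: ("Eastern State University", "Accounting"),
    2: ("Eastern State University", "Accounting"),
    3: ("Western State University", "Accounting"),
}

# wage records: student_id -> earnings one year after graduation
wages = {1: 48000, 2: 52000, 3: 45000}

earnings_by_program = defaultdict(list)
for sid, (school, major) in graduates.items():
    if sid in wages:  # keep only students found in both systems
        earnings_by_program[(school, major)].append(wages[sid])

avg_earnings = {k: sum(v) / len(v) for k, v in earnings_by_program.items()}
print(avg_earnings[("Eastern State University", "Accounting")])  # 50000.0
```

This is exactly the comparison the report envisions: a prospective accounting major can put Eastern State's $50,000 average next to the figure for other institutions.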

Kelly and Lautzenheiser have many other promising ideas for state governments in their report, “Taking Charge: A State-Level Agenda for Higher Education Reform.” One of them, a call for “Charter Universities,” might appeal both to conservatives who hope to encourage business model innovation and to liberals critical of for-profit higher education, which largely exists to meet the needs of nontraditional students ill-served by higher education incumbents. My concern is that while at least some states are inching towards the kind of learning and labor market assessments Kelly and Lautzenheiser champion, state-level efforts will be of limited utility compared with a federal effort to create a student unit record system, which could leverage the wage data collected by the Social Security Administration to give a clearer picture of how graduates of various higher education programs fare. But back in May, Kelly observed the following:

The federal government is uniquely positioned to collect data that could help students make better choices. The feds have already invested $500 million in state longitudinal data systems that could provide more comprehensive student success measures. And the Social Security Administration already collects wage data on all workers; a simple match could link labor market information to post-secondary experience.

Unfortunately, in 2008 Congress went in the opposite direction, explicitly banning the federal government from collecting individual-level data on college students. Some higher education interests argued that the ban on a student-unit record system was critical to protect student privacy. It coincidentally helps to protect colleges and universities from the wrath of better-informed consumers. This is not an argument to plug new data into ham-handed accountability measures, but to empower consumers to vote with their tuition dollars.

The fundamental challenge is that linking individual-level labor market information to post-secondary experience while protecting privacy requires data anonymization, yet there are real questions about whether true data anonymization is even possible. Pete Warden addressed this question in 2011:

Precisely because there are now so many different public datasets to cross-reference, any set of records with a non-trivial amount of information on someone’s actions has a good chance of matching identifiable public records. Arvind first demonstrated this when he and his fellow researcher took the “anonymous” dataset released as part of the first Netflix prize, and demonstrated how he could correlate the movie rentals listed with public IMDB reviews. That let them identify some named individuals, and then gave access to their complete rental histories. More recently, he and his collaborators used the same approach to win a Kaggle contest by matching the topography of the anonymized and a publicly crawled version of the social connections on Flickr. They were able to take two partial social graphs, and like piecing together a jigsaw puzzle, figure out fragments that matched and represented the same users in both.

Warden recommends a variety of strategies that might mitigate this problem, e.g., limiting the detail of the information released, though doing so would also limit the usefulness of the underlying data. My own view is that we ought to take Warden’s second recommendation seriously: acknowledge and accept the risk of de-anonymization in light of the benefits that greater data transparency would provide, while doing what we realistically can to limit that risk.
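"Limiting the detail" corresponds to generalization techniques such as k-anonymity: coarsen quasi-identifiers until every released record is shared by at least k people. A minimal sketch, with invented records and an illustrative threshold (the fields and k=2 are assumptions, not from Warden):

```python
# Sketch of generalization for k-anonymity: exact birth years make each
# record unique, but coarsening years to decades makes records
# indistinguishable. All data invented.
from collections import Counter

records = [
    ("Eastern State", "Accounting", 1987),
    ("Eastern State", "Accounting", 1981),
    ("Eastern State", "Accounting", 1989),
]

def coarsen(record):
    """Replace an exact birth year with its decade."""
    school, major, birth_year = record
    decade = (birth_year // 10) * 10
    return (school, major, f"{decade}s")

def is_k_anonymous(rows, k):
    """True if every distinct row appears at least k times."""
    return all(count >= k for count in Counter(rows).values())

released = [coarsen(r) for r in records]
print(is_k_anonymous(records, 2))   # False: exact birth years are unique
print(is_k_anonymous(released, 2))  # True: all three records now match
```

The trade-off Warden flags is visible even here: the coarsened release protects the graduates but also destroys any analysis that needed exact birth years.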