De-Identifying Medical Patient Data Doesn’t Protect Our Privacy

A Stanford researcher makes the case that de-identifying health records used for research doesn’t offer anonymity and hinders the learning health system.

By law, medical records used for research must be scrubbed of personally identifiable information. That doesn't guarantee anonymization. | Phil McCarten

When we visit a doctor in the United States, we sign a privacy statement acknowledging that although our health status and mental health issues are private, our health information can be shared for various legitimate purposes including public health and research.

What the privacy statement doesn’t explain is that, when our health data is used for these specific purposes, it must be “de-identified” in compliance with the federal Health Insurance Portability and Accountability Act (HIPAA). That is, key information like names, birthdates, and other identifying details must be removed. But that doesn’t mean our records are actually kept private.

“There’s a mismatch between what we think happens to our health data and what actually happens to it,” says Nigam Shah, professor of medicine (biomedical informatics) and of biomedical data science at Stanford University and an affiliated faculty member of the Stanford Institute for Human-Centered Artificial Intelligence. “We need to maintain privacy when analyzing healthcare data, but de-identification is an imperfect way to do that,” Shah says. 

Here, Shah calls out de-identification’s imperfections and proposes a path toward simpler privacy protections. He argues that de-identified data can easily be re-identified when combined with other datasets, and that the only protection against re-identification right now is the data recipient’s agreement not to attempt it. Furthermore, the rules have spawned a problematic free market in de-identified data and propagated an expensive and overly technical system that hinders research that could use electronic health record (EHR) data to improve care.

Health Record Privacy: What It Is and Isn’t

Health record privacy dates to at least 400 B.C., when Hippocrates wrote his now-famous oath promising to keep information about his patients secret.

The motivations behind protecting patient privacy have remained the same since Hippocrates’ day: It promotes honest communication between the patient and physician, which is essential for quality care; and it protects patients from embarrassment, discrimination or economic harm that might result from disclosure of their health status. 

Under HIPAA, healthcare providers and insurers must make sure patients’ healthcare information is not disclosed improperly. Still, HIPAA does allow medical providers to share patient information for multiple purposes, including for coordinating patient care (consultation with other doctors, for example), for running the medical organization smoothly, for billing, and for public health and research.

But when EHRs are used for public health or research purposes, the data must be de-identified in one of two ways: by obtaining an expert’s certification that the data are de-identified; or by using the “safe harbor” approach, which entails removing 18 different types of patient-identifying information including names, ages over 89, addresses, email addresses, URLs, dates of service, and lots of numbers: Social Security numbers, telephone numbers, insurance IDs, and medical record numbers.
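
To make the safe harbor idea concrete, here is a minimal sketch of field-level scrubbing. It is an illustration only, not a compliant implementation: the field names are hypothetical, the set covers only some of the identifier types mentioned above, and real records also carry identifiers in free-text notes that this approach would miss. The age handling assumes the rule's treatment of ages over 89, which are grouped into a single "90 or older" category.

```python
# Hypothetical field names for illustration; a real EHR export will differ,
# and this set covers only part of the safe harbor list.
DIRECT_IDENTIFIER_FIELDS = {
    "name", "address", "email", "url", "date_of_service",
    "ssn", "phone", "insurance_id", "medical_record_number",
}


def strip_identifiers(record: dict) -> dict:
    """Return a copy of `record` without the direct-identifier fields."""
    cleaned = {
        field: value
        for field, value in record.items()
        if field not in DIRECT_IDENTIFIER_FIELDS
    }
    # Assumption: ages over 89 are grouped into one "90+" bucket.
    age = cleaned.get("age")
    if isinstance(age, int) and age > 89:
        cleaned["age"] = "90+"
    return cleaned


# Invented example record, not real patient data.
record = {
    "name": "Person A",
    "ssn": "000-00-0000",
    "phone": "555-0100",
    "age": 47,
    "sex": "F",
    "diagnosis": "asthma",
}
cleaned = strip_identifiers(record)
print(cleaned)
```

Note what survives the scrub in this sketch: age, sex, and diagnosis all remain, which is what makes the data useful for research and also what leaves room for re-identification.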

Once EHRs have been de-identified, the dataset is no longer protected under HIPAA: It can be freely shared or bought and sold.

In practice, Shah says, this means that Stanford Hospital and every other research institution around the country spend millions of dollars annually de-identifying medical records so they can use them for research and share them with other research institutions. Some (though not Stanford) also sell them on the data market.

After health data is de-identified, the question remains: Is patient privacy actually protected?

De-Identification Is Imperfect

In truth, Shah says, it is never possible to guarantee that de-identified data can’t or won’t be re-identified. That’s because de-identification is not anonymization. Since 1997, researchers have repeatedly demonstrated the potential for re-identification of de-identified patient records by combining de-identified data with other data sources such as voter rolls or (in a recent case) public newspaper reports of car accidents or illnesses. 
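
The kind of linkage described above can be sketched with a toy example. Everything here is invented for illustration: the records, the names, and the choice of a coarse ZIP prefix, birth year, and sex as the attributes shared between the two datasets are all assumptions of this sketch, not details from any actual study. The point is only that a record becomes re-identifiable when its combination of leftover attributes matches exactly one person in an outside list.

```python
from collections import defaultdict

# Attributes assumed to appear in both datasets (hypothetical choice).
QUASI_IDENTIFIERS = ("zip3", "birth_year", "sex")

# Invented "de-identified" records: no names, but some attributes remain.
deidentified = [
    {"zip3": "943", "birth_year": 1961, "sex": "F", "diagnosis": "diabetes"},
    {"zip3": "943", "birth_year": 1987, "sex": "M", "diagnosis": "asthma"},
    {"zip3": "941", "birth_year": 1987, "sex": "M", "diagnosis": "fracture"},
]

# Invented outside list with names (standing in for something like a voter roll).
public_list = [
    {"name": "Person A", "zip3": "943", "birth_year": 1961, "sex": "F"},
    {"name": "Person B", "zip3": "943", "birth_year": 1987, "sex": "M"},
    {"name": "Person C", "zip3": "943", "birth_year": 1987, "sex": "M"},
    {"name": "Person D", "zip3": "941", "birth_year": 1990, "sex": "M"},
]


def quasi_key(row: dict) -> tuple:
    """Build the tuple of shared attributes used to join the datasets."""
    return tuple(row[field] for field in QUASI_IDENTIFIERS)


def link(records: list, outside: list) -> list:
    """Return (name, diagnosis) pairs for records matching exactly one person."""
    names_by_key = defaultdict(list)
    for person in outside:
        names_by_key[quasi_key(person)].append(person["name"])

    matches = []
    for rec in records:
        candidates = names_by_key.get(quasi_key(rec), [])
        if len(candidates) == 1:  # a unique match means re-identification
            matches.append((candidates[0], rec["diagnosis"]))
    return matches


reidentified = link(deidentified, public_list)
print(reidentified)
```

In this toy data only the first record links to a single name; the second matches two people and the third matches none. Adding more outside data sources tends to shrink those ambiguous groups, which is why removing a fixed list of fields cannot by itself guarantee anonymity.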

In addition, since HIPAA was passed in 1996, artificial intelligence has only gotten better at identifying people using facial recognition, genetic information, iris scans, and even gait. “We still practice de-identification as interpreted in 1996 and then complain that the removal of 18 identifiers selected years ago is not working to maintain privacy in 2021,” Shah says. 

Various technical solutions to improve de-identification have been proposed, but Shah says these are just a distraction from the goal of using research to provide better care. “If we go too far down the rabbit hole of technical solutions, we end up with anonymization, which might render the data useless for research.”

A few states, such as California, have passed laws banning re-identification of de-identified data, and healthcare providers who share data with other institutions typically require recipients to sign agreements forbidding re-identification. But these are piecemeal solutions that don’t address the fundamental fact that de-identification is an imperfect means to ensure privacy, Shah says.

Ultimately, “If the goal is to maintain privacy,” Shah says, “we should not be insisting on this legally defined imperfect procedure while pretending that it should give us privacy when it was never designed to do so – and being outraged when privacy is lost.”

The Market in De-Identified Data

Because de-identified EHRs are no longer considered protected under HIPAA, these datasets can be (and often are) freely bought and sold by companies that can combine them with other information for various purposes that were never anticipated by Congress when it passed HIPAA. Indeed, the aggregation of de-identified health data is now a multibillion-dollar industry.

And that industry can harm patients, note several researchers from Harvard and Duke in a recent New England Journal of Medicine perspective. For example, pharmaceutical companies might use such data to target physicians who are likely to prescribe a particular medicine (a practice known as pharmaceutical detailing), which could drive drug overuse and push up prices.

As Shah sees it, healthcare consumers are largely unaware of the market in their health data, which has allowed the market for de-identified data to thrive without legislative or regulatory intervention. 

A Barrier to a Learning Healthcare System?

Many clinical researchers aspire to develop a “learning healthcare system,” where data from doctor visits and hospital stays are aggregated and analyzed to gain new knowledge that would improve patient care. There’s nearly universal agreement that this type of feedback loop for improving care would serve the public good, Shah says. But the HIPAA Privacy Rule is a barrier to that.

Instead of de-identifying EHRs when they are used for public health and research, Shah says health record data should be kept private in the same way it is kept private when used by a patient’s medical team. “If we’re not worried when the doctor sends our detailed records to our physical therapist or nutritionist to provide care today, we should be OK with allowing it to be used for purposes that could improve our care in the future,” he says.

It’s also a matter of contributing to the common good, he says. “If I want to benefit from someone else’s data, I have a duty to also share my own.”

If we care about privacy, Shah says, we should come up with a legal solution rather than rely on an imperfect technical crutch. Since de-identification fails to guarantee privacy, allows unregulated data markets to exist, and hinders the use of health data to benefit patients in the learning healthcare system, he says, perhaps it’s time to find a less convoluted way for healthcare providers to fulfill their Hippocratic Oath. 
