Anonymous Genetic Profiles Aren't Completely Anonymous

Human genomes are a boon to medical research, but pose privacy risks. (Image credit: JohnGoode via Flickr)

(ISNS) -- Today it is easy for long-forgotten photos or personal information to live online indefinitely. But what if the most personal data about you – your genetic makeup – lived online? An individual's genome contains a vast amount of information about inherited diseases and physical traits, all stored in strands of DNA. The consequences of being able to search, cross-reference, and analyze this information are profound, experts say.

Hundreds of thousands of people have already had their genomes mapped in the U.S., either for research studies or through one of several private companies offering this service. In many cases, people want to know their risk of medical maladies like heart attack or breast cancer, or to identify the specific gene causing a disorder in their family. What these pioneers of personal genome mapping might not know, though, is how easily re-identifiable their anonymous data can be. And if that is the case, the question might not be whether to share, but rather how to regulate and protect what is being shared.

“We are entering an era of ubiquitous genetic information,” said computational biologist Yaniv Erlich, speaking at the American Association for the Advancement of Science meeting in Chicago in February.

Erlich, who is a fellow at the Whitehead Institute for Biomedical Research in Cambridge, Mass., brings a unique but apt background to genetic privacy research: He is a former hacker who was hired to expose weaknesses in the security systems of banks and credit card companies. He and his team took a similar approach to illustrate vulnerabilities within genetic databases. Their study, published in Science last January, recovered the identities of nearly 50 anonymous participants in the 1000 Genomes Project, and they did it using free, publicly accessible Internet resources.

“We have shown that it is possible, in some cases, to take genetic sequencing data of males and infer the surname by inspecting the Y-chromosome of this person,” Erlich said, “with a success rate of about 12 percent.”

Their method relied on the code-like nature of genomes. On the Y-chromosome of every male, there is a type of distinct pattern made up of what are called short tandem repeats, or Y-STRs. Erlich’s team developed an algorithm to help identify these patterns, called Y-STR haplotypes, in a human genome.
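
To make the idea of a short tandem repeat concrete, here is a minimal Python sketch that counts how many times a short motif repeats back to back in a stretch of DNA. It is an illustration only, not the algorithm Erlich's team developed; the function name, the motif and the sequence are all made up for the example.

```python
import re

def longest_tandem_run(sequence: str, motif: str) -> int:
    """Return the largest number of back-to-back copies of `motif` in `sequence`."""
    # Match one or more consecutive copies of the motif.
    pattern = re.compile(f"(?:{re.escape(motif)})+")
    runs = [len(m.group(0)) // len(motif) for m in pattern.finditer(sequence)]
    return max(runs, default=0)

# Invented sequence: flanking bases, five copies of GATA in a row,
# more flanking bases, then one stray GATA.
toy_read = "CCT" + "GATA" * 5 + "CGTTAG" + "GATA" + "CC"

repeat_count = longest_tandem_run(toy_read, "GATA")
print(repeat_count)  # 5
```

The repeat count at one such site is a single number. A haplotype, in this simplified picture, is the list of those numbers across several sites on the Y-chromosome.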

A number of recreational genetic genealogy websites connect surnames to Y-STR haplotypes, with the intent of building family trees and reuniting distant relatives. Unintentionally, these databases make it possible to re-identify seemingly anonymous genomes.

By comparing anonymous data to genome data on two major public databases, Ysearch and SMGF, the researchers were able to find close matches, and further narrow them with other data such as surnames, ages, and states of residence.
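
The matching step can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the researchers' actual method and not the interface of Ysearch or SMGF: the marker names are real Y-STR sites, but every repeat count and surname below is invented, and "close match" is assumed to mean "differs at no more than one marker."

```python
# Hypothetical genealogy records: a surname plus repeat counts at five Y-STR markers.
# All counts and surnames are invented for illustration.
GENEALOGY_DB = [
    {"surname": "Ashdown",
     "markers": {"DYS19": 14, "DYS390": 24, "DYS391": 11, "DYS392": 13, "DYS393": 13}},
    {"surname": "Brightwell",
     "markers": {"DYS19": 14, "DYS390": 24, "DYS391": 10, "DYS392": 13, "DYS393": 13}},
    {"surname": "Carrow",
     "markers": {"DYS19": 15, "DYS390": 23, "DYS391": 10, "DYS392": 13, "DYS393": 13}},
]

def marker_distance(a: dict, b: dict) -> int:
    """Count the markers present in both haplotypes whose repeat counts differ."""
    shared = a.keys() & b.keys()
    return sum(1 for name in shared if a[name] != b[name])

def closest_surnames(query: dict, db: list, max_distance: int = 1) -> list:
    """Return (distance, surname) pairs within `max_distance`, closest first."""
    hits = [(marker_distance(query, rec["markers"]), rec["surname"]) for rec in db]
    return sorted(hit for hit in hits if hit[0] <= max_distance)

# Haplotype read from an "anonymous" genome (also invented).
QUERY = {"DYS19": 14, "DYS390": 24, "DYS391": 11, "DYS392": 13, "DYS393": 13}

matches = closest_surnames(QUERY, GENEALOGY_DB)
print(matches)  # [(0, 'Ashdown'), (1, 'Brightwell')]
```

A real comparison would involve many more markers and a more careful notion of distance, but the principle is the same: a genome with no name attached can still point to a short list of likely surnames.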

While about 40,000 U.S. males share an average surname, the combination of a surname, birth year and state shrinks that number considerably.
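
The narrowing step amounts to stacking filters. The Python sketch below uses a handful of invented records to show how each added attribute cuts down the candidate list; the record type, field names and data are assumptions for illustration, not anything taken from the study.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PublicRecord:
    name: str
    surname: str
    birth_year: int
    state: str

# Invented records standing in for publicly searchable information.
RECORDS = [
    PublicRecord("A. Ashdown", "Ashdown", 1950, "UT"),
    PublicRecord("B. Ashdown", "Ashdown", 1950, "OH"),
    PublicRecord("C. Ashdown", "Ashdown", 1972, "UT"),
    PublicRecord("D. Brightwell", "Brightwell", 1950, "UT"),
    PublicRecord("E. Ashdown", "Ashdown", 1985, "TX"),
]

def narrow_candidates(records, surname, birth_year, state):
    """Keep only the records that match all three attributes."""
    return [
        r for r in records
        if r.surname == surname and r.birth_year == birth_year and r.state == state
    ]

same_surname = [r for r in RECORDS if r.surname == "Ashdown"]
candidates = narrow_candidates(RECORDS, "Ashdown", 1950, "UT")

print(len(same_surname))             # 4
print([r.name for r in candidates])  # ['A. Ashdown']
```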

From the honed-down list of about 12 males, the team was able to use Google and other free online services to track down the owner of the unknown genome. A similar technique has been used by individuals who were adopted or conceived from sperm donation to trace their biological families. As more genetic data reaches online databases, Erlich said, new threats to privacy are keeping pace.

So, he would like to explore the best ways to collect genetic data for scientific studies, while protecting the privacy of participants. And he thinks it is possible to have both.

Drawing accurate conclusions regarding inherited disorders requires analysis of millions of samples, Erlich said. One big concern is how to keep all of those samples private from insurance companies, marketers and anyone else who might discriminate against participants or draw conclusions about them based on this wide array of information.

Privacy becomes especially important in those cases, he said, since prospective participants in scientific studies have ranked privacy of sensitive information as one of their top concerns and a major determinant of whether they will participate in a study.

To protect privacy, Erlich and Princeton researcher Arvind Narayanan suggest a combination of access control, data anonymity and cryptography. As national policy continues to evolve on the subject of genetic privacy, private industry is gearing up to fill the gaps in a number of ways.

For example, in the future, it could be the norm for users to send their genetic data through a cloud service as an added precaution. Kristin Lauter, head of the cryptography research group at Microsoft Research, likens this method, called homomorphic encryption, to “not having to trust your jeweler,” since users would hand over their precious information, and allow a private service like hers to do calculations on it in an encrypted form.

“The cloud service never sees your private data,” she said. “Only you, who has the key, can un-encrypt it and analyze the result.”
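
The idea of computing on data that stays encrypted can be shown with a toy example. The Python sketch below uses the Paillier cryptosystem, which supports only addition on encrypted values; it is a stand-in chosen for brevity, not the scheme Lauter's group works on, and the article does not say which scheme that is. The primes are far too small to be secure, and the function names and the "risk score" values are invented for illustration.

```python
import math
import random

# Toy primes: far too small for real security, chosen to keep the numbers readable.
P, Q = 1789, 1861
N = P * Q
N_SQ = N * N
LAM = (P - 1) * (Q - 1) // math.gcd(P - 1, Q - 1)  # lcm(P - 1, Q - 1), the private key
MU = pow(LAM, -1, N)                               # modular inverse (Python 3.8+)

def encrypt(message: int) -> int:
    """Paillier encryption with generator g = N + 1; needs only the public value N."""
    while True:
        r = random.randrange(1, N)
        if math.gcd(r, N) == 1:
            break
    # (N + 1)^m mod N^2 simplifies to 1 + m*N.
    return (1 + message * N) * pow(r, N, N_SQ) % N_SQ

def decrypt(ciphertext: int) -> int:
    """Recover the message; requires the private values LAM and MU."""
    u = pow(ciphertext, LAM, N_SQ)
    return (u - 1) // N * MU % N

def add_encrypted(c1: int, c2: int) -> int:
    """What the 'cloud' does: combine two ciphertexts without decrypting either."""
    return c1 * c2 % N_SQ

# The user encrypts two made-up risk scores and hands over only the ciphertexts.
c_a, c_b = encrypt(17), encrypt(25)
c_sum = add_encrypted(c_a, c_b)

# Only the key holder can read the result.
total = decrypt(c_sum)
print(total)  # 42
```

The service that multiplies the two ciphertexts never learns 17, 25 or 42. Schemes that also allow multiplication on encrypted values are considerably more involved than this sketch.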

But, as with using a credit card, there is still a risk of being hacked. This is why another element of protecting genetic privacy might lie in improved informed consent processes, as well as follow-up analyses of each individual’s results.

John Wilbanks, chief commons officer for the Seattle-based Sage Bionetworks, which advocates open and collaborative science, said he agrees with Erlich's findings that re-identification risks are higher than people think.

“When these services guarantee anonymity, that’s a quite difficult promise to keep… I think right now they can tend to understate the re-identification risks, and overstate the risk of harm,” Wilbanks said.

Inside Science News Service is supported by the American Institute of Physics. Sarah Witman is a science writer based in Madison, Wisconsin.
