Scientist recovers coronavirus gene sequences secretly deleted last year in Wuhan

The SARS-CoV-2 virus invades human cells by attaching to ACE2 receptors on the surfaces of those cells. (Image credit: Shutterstock)

Finding the origin story for SARS-CoV-2, the coronavirus responsible for nearly 3.9 million deaths worldwide, has been largely hampered by a lack of access to information from China, where cases first popped up.

Now, a researcher in Seattle has dug up deleted files from Google Cloud that reveal 13 partial genetic sequences for some of the earliest cases of COVID-19 in Wuhan, Carl Zimmer reported for The New York Times.

The sequences don't tip the scales toward or away from one of the many theories about how SARS-CoV-2 came to be — they do not suggest the virus leaked from a high-security lab in Wuhan, nor do they suggest a natural spillover event. But they do firm up the idea that the novel coronavirus was circulating earlier than the first major outbreak at a seafood market.


In order to determine exactly how and where the virus originated, scientists need to find the so-called progenitor virus, the one from which all other strains descended. Until now, the earliest sequences have been primarily those sampled from cases at the Huanan Seafood Market in Wuhan, which was initially thought to be where the novel coronavirus first emerged at the end of December 2019. However, cases from early December and as far back as November 2019 had no ties to the market, indicating pretty early in the pandemic that the virus emerged from another spot.

There was one nagging issue with those first genetic sequences. Those from cases found at the market include three mutations that are missing in virus samples from cases that popped up weeks later outside of the market. The viruses missing those three mutations matched more closely with the coronaviruses found in horseshoe bats. Scientists are relatively certain that the novel coronavirus somehow emerged from bats, so it's logical to assume the progenitor would also be missing those mutations. 

And now, Jesse Bloom of the Howard Hughes Medical Institute in Seattle has found that the deleted sequences, likely some of the earliest samples, were also devoid of those mutations. (Bloom is the lead author of a letter published in May in the journal Science urging an unbiased investigation into the origins of the coronavirus, Live Science reported.)

"They're three steps more similar to the bat coronaviruses than the viruses from the Huanan fish market," Bloom told The New York Times. This new data hints that the virus was circulating in Wuhan well before it showed up at the seafood market, Bloom said.

"This fact suggests that the market sequences, which are the primary focus of the genomic epidemiology in the joint WHO-China report ... are not representative of the viruses that were circulating in Wuhan in late December of 2019 and early January of 2020," Bloom wrote in his paper uploaded June 22 to the preprint database bioRxiv.

According to Zimmer, about a year ago, 241 genetic sequences from coronavirus patients went missing from an online database called the Sequence Read Archive, which is maintained by the National Institutes of Health (NIH).

Bloom noticed the missing sequences when he came across a spreadsheet in a study published in May 2020 in the journal PeerJ in which the authors list 241 genetic sequences of SARS-CoV-2 through the end of March 2020; the sequences were part of a Wuhan University project called PRJNA612766 and were supposedly uploaded to the Sequence Read Archive. He searched the archive database for the sequences and got the message "No items found," Bloom wrote in the bioRxiv paper, which has not been peer-reviewed.


His sleuthing revealed that the deleted sequences had been collected by Aisu Fu and Renmin Hospital of Wuhan University, and a preprint of the research published from those sequences (referred to as Wang et al. 2020) suggested they came from nose swab samples from outpatients with suspected COVID-19 early in the epidemic.

Bloom couldn't find any explanation for why the sequences had been deleted, and his emails to both corresponding authors asking about the deletion received no response.

"There is no plausible scientific reason for the deletion: the sequences are perfectly concordant with the samples described in Wang et al. (2020a,b)," Bloom wrote in bioRxiv. "There are no corrections to the paper, the paper states human subjects approval was obtained, and the sequencing shows no evidence of plasmid or sample-to-sample contamination. It therefore seems likely the sequences were deleted to obscure their existence."

Bloom notes several limitations to his study, primarily that the sequences are only partial and include no information to give a clear date or place of collection — information crucial to tracing the virus back to its origin.

Regardless, Bloom thinks that looking more deeply at archived data from the NIH and other organizations — and piecing together the sequences — could help to paint a clearer picture of both the origin and early spread of SARS-CoV-2, all without needing on-the-ground studies in China. 

Read more about the deleted sequences at The New York Times.

Originally published on Live Science.

Jeanna Bryner
Live Science Editor-in-Chief
