Big Data and the Flu: How Wikipedia Can Track Influenza


By monitoring the number of times people look for flu information on Wikipedia, researchers may be better able to estimate the severity of a flu season, according to a new study.

Researchers created a new data-analysis system that looks at visits to Wikipedia articles, and found that the system was able to estimate flu levels in the United States up to two weeks before the Centers for Disease Control and Prevention (CDC) released its flu data.

Looking at data spanning six flu seasons between December 2007 and August 2013, the new system estimated the peak flu week better than Google Flu Trends, another data-based system. The Wikipedia-based system accurately estimated the peak flu week in three out of six seasons, while the Google-based system got only two right, the researchers found.
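One way to picture the peak-week comparison is the short sketch below. It assumes that "accurately estimated" means the estimated peak fell in exactly the same week as the official peak, which the article does not spell out, and the function names and numbers are made up for illustration.

```python
def peak_week(series):
    """Return the index of the week with the highest value in a season."""
    return max(range(len(series)), key=lambda i: series[i])


def seasons_with_correct_peak(estimates, official):
    """Count seasons where the estimated and official peak weeks coincide."""
    return sum(peak_week(e) == peak_week(o) for e, o in zip(estimates, official))


# Made-up weekly flu levels for two short seasons; not data from the study.
official = [[1.0, 2.5, 4.0, 2.0], [0.8, 3.1, 2.2, 1.0]]
estimated = [[1.2, 2.0, 3.8, 2.4], [0.9, 2.0, 2.6, 1.1]]
print(seasons_with_correct_peak(estimated, official))
```

Under this definition an estimate that is off by a single week counts as a miss, so a looser tolerance would give different tallies.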

"We were able to get really nice estimates of what the [flu] level is in the population," said study author David McIver, a postdoctoral fellow at Boston Children's Hospital.

The new system examined visits to Wikipedia articles that included terms related to flulike illnesses, whereas Google Flu Trends looks at searches typed into Google. The researchers analyzed Wikipedia data on how many times per hour each article was viewed, and combined those counts with flu data from the CDC in a model they created.
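The general idea of turning hourly page views into a flu estimate can be sketched as follows. The article does not describe the researchers' model, so the ordinary least-squares line here is only a stand-in, and the function names and numbers are hypothetical.

```python
from collections import defaultdict
from datetime import datetime


def weekly_views(hourly_counts):
    """Sum (timestamp, views) pairs into totals keyed by ISO (year, week)."""
    totals = defaultdict(int)
    for timestamp, views in hourly_counts:
        iso = timestamp.isocalendar()
        totals[(iso[0], iso[1])] += views
    return dict(totals)


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (a stand-in model)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Made-up hourly view counts; not data from the study.
hourly = [
    (datetime(2013, 1, 7, 0), 120),
    (datetime(2013, 1, 7, 1), 80),
    (datetime(2013, 1, 14, 0), 300),
]
print(weekly_views(hourly))

# Made-up weekly view totals paired with made-up official flu levels.
slope, intercept = fit_line([200, 300, 400], [2.0, 3.0, 4.0])
print(slope * 350 + intercept)  # estimated flu level for a week with 350 views
```

Because page-view counts are available almost immediately, a fitted relationship like this could produce an estimate for the current week before official figures appear.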

The research team wanted to use a database that is accessible to everyone and to create a system that could be more accurate than Google Flu Trends, which has flaws. For instance, during the swine flu pandemic in 2009 and during the 2012-2013 influenza season, Google Flu Trends got a bit "confused" and overestimated flu numbers because of increased media coverage of the two outbreaks, the researchers said.

When a pandemic strikes, people search for news stories related to the pandemic itself, but this doesn't mean that they have the flu. In general, the problem with Internet-based estimation systems is that it is practically impossible to tell whether people are looking for information about an illness because they are sick, the researchers said.

In the new system, the researchers tried to overcome this issue by including a number of Wikipedia articles "to act as markers for general background-level activity of normal usage of Wikipedia," the researchers wrote in the study. However, just like any other data-based system, the Wikipedia system is not immune to the issues related to figuring out the actual motivation of someone checking information related to the flu.
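A rough illustration of why background markers help is given below. The article does not say how the marker articles entered the researchers' model, so dividing flu-article views by background views is an assumption made here for clarity, and the function name and numbers are hypothetical.

```python
def flu_share(flu_views, background_views):
    """Return flu-article views as a fraction of background views, week by week."""
    shares = []
    for flu, background in zip(flu_views, background_views):
        if background <= 0:
            raise ValueError("background views must be positive")
        shares.append(flu / background)
    return shares


# Made-up numbers: flu-article views double, but so does overall Wikipedia use.
print(flu_share([1000, 2000], [50000, 100000]))
```

In this made-up case the share stays flat, which suggests the jump in flu-article views reflects heavier Wikipedia use overall rather than more interest in the flu. A surge driven by news coverage of the flu would still raise the share, which is the limitation the researchers acknowledge.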

Therefore, it's important to view systems such as Google Flu Trends and the Wikipedia system as complementary to data from official sources such as the CDC, McIver said.

"We are not trying to create something that will replace the CDC or anything like that," he said. Rather, the researchers' goal is "to get both things to work well together, to give us a more holistic view of what is going on," they said.

The study is published today (April 17) in the journal PLOS Computational Biology.

Follow Agata Blaszczak-Boxe on Twitter. Original article on Live Science.
