Can Wikipedia Predict the Future ... of Box Office Hits?

This weekend, will the tale of a murderous rampage told in "The Frozen Ground" starring Nicolas Cage beat out the romantic comedy "Drinking Buddies" with Anna Kendrick? Perhaps Wikipedia could tell us — even before these movies open.

New research suggests that user activity data from movies' Wikipedia pages can be used to predict which movies will become blockbusters.

The researchers analyzed the Wikipedia pages of 312 American films released in 2010, following each page from its creation to the movie's release date. Looking at several factors, such as page views and the number of theaters screening the movie, they identified which elements were correlated with a movie's commercial success over its opening weekend.

The researchers then built a mathematical model based on the identified factors, including the number of edits on the movie’s page, the number of editors contributing to the page and the diversity of online users. The model was tested several times to find the right balance between all factors in the equation.
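To make the idea of balancing factors in an equation concrete, here is a minimal sketch of one way such a model could be fit. It assumes ordinary least squares, which the article does not confirm as the study's method. Every number in it is invented for illustration: the three features (edits, editors, page views), their scaling and the revenue figures are hypothetical, as are the function names.

```python
# Illustrative only: the features and revenues are invented, and ordinary
# least squares is an assumed stand-in for the model described in the study.

def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augment the matrix so row operations apply to both sides at once.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        # Swap in the row with the largest pivot for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_least_squares(rows, targets):
    """Return weights (intercept first) that minimize squared error."""
    design = [[1.0] + list(r) for r in rows]
    k = len(design[0])
    # Normal equations: (X^T X) w = X^T y
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)]
           for i in range(k)]
    xty = [sum(d[i] * t for d, t in zip(design, targets)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def predict(weights, row):
    """Weighted sum of the factors plus the intercept."""
    return weights[0] + sum(w * v for w, v in zip(weights[1:], row))


# Hypothetical pre-release activity for six films:
# (hundreds of edits, tens of editors, millions of page views)
rows = [
    (1.2, 3.0, 0.4),
    (0.45, 1.2, 0.09),
    (3.0, 8.0, 1.5),
    (0.1, 0.4, 0.02),
    (2.0, 3.5, 0.7),
    (0.8, 2.5, 0.15),
]
# Hypothetical opening-weekend revenue, in millions of dollars.
revenues = [25.0, 6.0, 70.0, 1.5, 38.0, 12.0]

weights = fit_least_squares(rows, revenues)
for row, actual in zip(rows, revenues):
    print(f"predicted {predict(weights, row):6.2f}  actual {actual:6.2f}")
```

Because the sketch includes an intercept, the fitted model's errors should sum to roughly zero across the six films, which is a quick sanity check on the solver.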

When the model's predictions were compared with actual release weekend sales, the two showed a high degree of correlation, according to the study published yesterday (Aug. 21) in the journal PLOS ONE.
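One common way to put a number on how closely predictions track actual sales is Pearson's correlation coefficient, which runs from -1 to 1. The sketch below assumes that measure, which the article does not name, and uses invented predicted and actual revenue figures. The function and variable names are hypothetical as well.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance and variances, left unnormalized because the 1/n cancels.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Invented figures, in millions of dollars, for six hypothetical films.
predicted = [24.0, 7.5, 68.0, 3.0, 40.0, 10.0]
actual = [25.0, 6.0, 70.0, 1.5, 38.0, 12.0]

r = pearson_r(predicted, actual)
print(f"Pearson r = {r:.3f}")
```

A value near 1 means the predictions rise and fall with the actual figures. It does not by itself show that each individual prediction is close to the real number.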

The results "show how simple use of user-generated data in a social environment like Wikipedia can enhance our ability to predict the collective reaction of society to a cultural product," the researchers said.

Stories online data can tell

Digital traces of people's online activities are increasingly being explored to follow social events and find hidden patterns in population behavior and the collective mind. Previously, data from Twitter was used to instantly detect events ranging from earthquakes to traffic jams, and to predict box-office success the next morning. In another example, edits on Wikipedia pages were used to identify controversial topics among groups of people across the globe.

Scientists found that upcoming films with high Wikipedia activity tended to do well at the box office. [See full infographic: http://www.livescience.com/39063-wikipedia-data-predicts-movie-blockbusters-infographic.html]
Credit: by Karl Tate, Infographics Artist

Predicting society's reaction to a new product is another potential use of massive data gathered online, the researchers said, and choosing Wikipedia as a data source may offer advantages over other databases or social media.

"Editing Wikipedia has a higher cost in terms of effort needed compared to, for example, tweeting, and it reflects sort of active participation," said study co-author Taha Yasseri, researcher at the University of Oxford. "It reflects the popularity and the interest in the item more accurately than other social media."

The researchers compared the accuracy of their new approach with that of a previous model based on Twitter data. The results showed that the Wikipedia-based model made better predictions than the Twitter-based model, and made them at an earlier date.

"That's because people edit Wikipedia pages of movies and read them much earlier than the time they tweet about it. This latter happens usually very close to watching the movie and most of the time after that," Yasseri said.

A better model for better movies

The model was a more accurate predictor for movies with higher sales. Estimates of commercial sales for "Iron Man 2," "Alice in Wonderland," "Toy Story 3" and "Inception" were accurate, but the model failed to accurately predict the financial return on less successful movies, such as "Never Let Me Go," "Animal Kingdom," "The Girl on the Train," "The Killer Inside Me" and "The Lottery."

The model may work better for successful movies because they generate more online data than movies destined to fail, the researchers said. More user-generated data usually reduces irrelevant data (noise) and results in more accurate predictions, they said.

By Bahar Gholipour. Original article on LiveScience.