Computer Algorithm Seeks To Crack Code Of Fiction Bestsellers

Successful books have stylistic similarities, researchers found. (Image credit: Magi Edos via flickr |

(ISNS)--The English novelist W. Somerset Maugham once said that there are three rules for writing novels.

"Unfortunately," he added, "no one knows what they are."

Three computer scientists at Stony Brook University in New York think they have found some rules through a computer program that might predict which books will be successful. The algorithm had as much as 84 percent accuracy when applied to already published manuscripts.

If so, it comes much too late for the more than 20 book editors who turned down J.K. Rowling's first manuscript about a boy wizard named Harry Potter.

They said it is the first study to correlate a book's stylistic elements with its popularity and critical acclaim.

In a paper published by the Association for Computational Linguistics, Vikas Ganjigunte Ashok, Song Feng, and Yejin Choi said the writing style of a book was correlated with its success.

The researchers used a process called statistical stylometry, a statistical analysis of literary styles, on several genres of books and identified characteristic stylistic elements more common in successful tomes than in unsuccessful ones.

They began their research with Project Gutenberg, a database of 44,500 books in the public domain. A book was considered successful when it was critically acclaimed and had a high download count. The books chosen for analysis represented all genres of literature, from science fiction to poetry.

Then, they added some books not in the Gutenberg database, including Charles Dickens' "A Tale of Two Cities," and Ernest Hemingway's "The Old Man and the Sea." They also added Dan Brown's latest novel, "The Lost Symbol," and books that have won the Pulitzer Prize, the National Book Award, and other awards.

They took the first 1,000 sentences of 4,129 books of poetry and 1,117 short stories and then analyzed them for various factors. They looked at parts of speech, use of grammar rules, the use of phrases, and "distribution of sentiment" – a way of measuring the use of words.

They found that successful books made greater use of conjunctions to join sentences ("and" or "but") and of prepositions than less successful books did. They also found a high percentage of nouns and adjectives in the successful books; less successful books relied on more verbs and adverbs to describe what was happening.
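For readers curious how such a measurement might be taken, here is a minimal sketch in Python. It is not the researchers' code. The word lists are tiny, hypothetical stand-ins for a real part-of-speech tagger, and the function name `function_word_rates` is invented for illustration. The sketch simply reports what share of a text's words are conjunctions or prepositions.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny word lists (an assumption for illustration).
# A real stylometric study would use a part-of-speech tagger, since words
# such as "so" or "to" can play more than one grammatical role.
CONJUNCTIONS = {"and", "but", "or", "nor", "yet", "so"}
PREPOSITIONS = {"in", "on", "at", "by", "with", "from", "to", "of", "for", "over", "under"}


def function_word_rates(text):
    """Return the share of tokens that are conjunctions and prepositions."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return {"conjunction": 0.0, "preposition": 0.0}
    counts = Counter(tokens)
    conj = sum(counts[w] for w in CONJUNCTIONS)
    prep = sum(counts[w] for w in PREPOSITIONS)
    return {"conjunction": conj / len(tokens), "preposition": prep / len(tokens)}


sample = "He sat on the dock and looked at the sea, but the boat was gone."
rates = function_word_rates(sample)
print(rates)
```

Comparing these rates across two piles of books, one successful and one not, is the kind of contrast the study describes, though the actual work drew on many more features than this.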

More successful books relied on verbs describing thought processes rather than actions and emotions. The results varied by genre, but books that are less successful, the researchers reported, used words like "wanted," "took" or "promised." Successful authors employed "recognized" or "remembered."
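The verb contrast can be sketched the same way. The Python example below is an illustration only, using just the five words quoted above as two hypothetical lists. The names `THOUGHT_VERBS`, `ACTION_OR_EMOTION_VERBS`, and `verb_profile` are assumptions, not anything taken from the study.

```python
import re

# Word lists built only from the examples quoted in the article.
# Real lexicons would be far larger; these are assumptions for illustration.
THOUGHT_VERBS = {"recognized", "remembered"}
ACTION_OR_EMOTION_VERBS = {"wanted", "took", "promised"}


def verb_profile(text):
    """Count how many tokens fall in each of the two verb lists."""
    tokens = re.findall(r"[a-z']+", text.lower())
    thought = sum(1 for t in tokens if t in THOUGHT_VERBS)
    other = sum(1 for t in tokens if t in ACTION_OR_EMOTION_VERBS)
    return thought, other


profile = verb_profile(
    "She remembered the house and recognized the door. He took the key."
)
print(profile)
```

A count like this says nothing on its own about quality. It only becomes informative when compared across many books, which is where the statistical analysis comes in.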

"It has to do with showing versus telling," Choi said. "In order to really resonate with readers, instead of saying 'she was really really sad,' it might be better to describe her physical state, to give a literal description. You are speaking more like a journalist would."

Communications researchers believe journalists use more nouns, pronouns, and prepositions than other writers because those word forms give more information, Choi explained.

"Novelists who write more like journalists have literary success," she said.

This should come as no surprise since many great novelists--Dickens and Hemingway to name two--began their careers as journalists.

Choi emphasized that she was describing a correlation, not causation, but the results could be predictive.

The technique falls under the category of machine learning and has been used to successfully parse literature. For instance, Moshe Koppel, a computer scientist at Israel's Bar-Ilan University, developed a program that can tell whether the author of a book is male or female 80 percent of the time.

He said the Stony Brook study was well done but the sample size was too small. Some of the books had fewer than 100 downloads.

Nor is the technique practical in the real world, according to Michael Hamilburg, a literary agent at the Mitchell Hamilburg Agency in Los Angeles, whose job it is to find bestselling books among thousands of manuscripts.

"While it presents very interesting ideas, I don't yet see the real-world applications that would be beneficial to my day-to-day work or final choices," Hamilburg said. "It's very difficult to quantify decisions that are often made by intuition and relationships." 

At least one novelist agrees.

Ron Hansen, the author of several successful novels, including "The Assassination of Jesse James by the Coward Robert Ford," which was made into a movie starring Brad Pitt, said style is not the key.

"Most people buy and read books because they're captured by the topic," said Hansen, who teaches writing at Santa Clara University in California. "Of stylistic characteristics, the scientists are flying in the face of most teaching of creative writing when they emphasize nouns over verbs. Verbs are the engine of fiction and quality writing is often measured by their variety, precision, and force."

Or, as the sportswriter Red Smith once said, "Writing is easy. You just open a vein and bleed."

Inside Science News Service is supported by the American Institute of Physics. Joel Shurkin is a freelance writer based in Baltimore. He is the author of nine books on science and the history of science, and has taught science journalism at Stanford University, UC Santa Cruz and the University of Alaska Fairbanks. He tweets at @shurkin.
