Supercomputers allow the U.S. to virtually test nuclear weapons without plunging back into the Cold War, but undetected computing errors can corrupt or even crash such simulations involving 100,000 networked machines. The problem prompted researchers to build an automated system for catching computer glitches before they spiral out of control.
The solution involved eliminating a "central brain" server that could not keep up with streaming data from thousands of machines. Instead, researchers organized the supercomputing cluster into "classes" of machines based on whether they ran similar processes. That clustering tactic makes it possible to quickly detect any supercomputing glitches.
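To make the class-based idea concrete, here is a minimal Python sketch. It is not the researchers' system: the function name, the report format, the single numeric metric, and the thresholds are all assumptions chosen for illustration, and the sketch runs in one process rather than being distributed across a cluster. It groups machines by class and flags any machine whose metric sits far from the median of its own class.

```python
from collections import defaultdict
from statistics import median


def find_suspect_machines(reports, tolerance=3.0, min_spread=1.0):
    """Flag machines that behave unlike the other machines in their class.

    reports: iterable of (machine_id, process_class, metric) tuples.
    All names and thresholds here are hypothetical, for illustration only.
    """
    # Group the readings by class, so each machine is compared only
    # with peers that run similar processes.
    by_class = defaultdict(list)
    for machine_id, process_class, metric in reports:
        by_class[process_class].append((machine_id, metric))

    suspects = []
    for process_class, members in by_class.items():
        values = [metric for _, metric in members]
        center = median(values)
        # Median absolute deviation: a spread estimate that a single
        # faulty machine cannot distort much.
        mad = median([abs(v - center) for v in values])
        spread = max(mad, min_spread)
        for machine_id, metric in members:
            if abs(metric - center) / spread > tolerance:
                suspects.append((machine_id, process_class))
    return suspects


reports = [
    ("n0", "solver", 10.0), ("n1", "solver", 10.5),
    ("n2", "solver", 9.8), ("n3", "solver", 42.0),
    ("n4", "io", 100.0), ("n5", "io", 101.0), ("n6", "io", 99.5),
]
print(find_suspect_machines(reports))
```

In this toy data, a reading of 42 is abnormal for a "solver" machine, while readings near 100 are normal for an "io" machine. Comparing within classes is what lets the sketch tell the two apart, which a single cluster-wide threshold could not do.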
"You want the system to automatically pinpoint when and in what machine the error took place and also the part of the code that was involved," said Saurabh Bagchi, an associate professor of electrical and computer engineering at Purdue University. "Then, a developer can come in, look at it and fix the problem."
The Purdue researchers used generic computer code rather than actual classified nuclear weapons software code, but their breakthrough should carry over to supercomputer simulations of nuclear weapons testing.
Bagchi and his colleagues at the National Nuclear Security Administration's (NNSA) Lawrence Livermore National Laboratory have also begun fixing the separate problem of "checkpointing." That problem arises because the backup save system can't handle the supercomputing scale of 10,000 machines.
"The problem is that when you scale up to 10,000 machines, this parallel file system bogs down," Bagchi said. "It's about 10 times too much activity for the system to handle, and this mismatch will just become worse because we are continuing to create faster and faster computers."
A possible solution may "compress" the checkpoints similar to how ordinary computers compress image data. Eliminating the checkpointing bottleneck would help open up the possibility of making exascale supercomputers capable of running 1,000 quadrillion (10^18) operations per second.
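As a rough illustration of compressing a checkpoint before it is written out, the Python sketch below uses the standard library's zlib module. This is a stand-in only: the article does not say which compression scheme the researchers are exploring, and the function names and the sample state are hypothetical.

```python
import zlib


def save_checkpoint(state: bytes, level: int = 6) -> bytes:
    """Compress a snapshot of program state before writing it out.

    zlib is used here purely as an example of lossless compression;
    it is an assumption, not the method described in the article.
    """
    return zlib.compress(state, level)


def restore_checkpoint(blob: bytes) -> bytes:
    """Recover the original snapshot from a compressed checkpoint."""
    return zlib.decompress(blob)


# A hypothetical snapshot: highly repetitive data compresses well.
state = bytes(1_000_000)
blob = save_checkpoint(state)
print(len(state), "bytes before,", len(blob), "bytes after")
```

The fewer bytes each machine sends to the shared file system, the less that system has to absorb at once. How much real simulation state would shrink depends on the data, and this sketch makes no claim about that.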
"We're beginning to solve the checkpointing problem," Bagchi said. "It's not completely solved, but we are getting there."