High-performance computing expert Jason Stowe recently asked two of his engineers a simple question: Can you build a 10,000-core cluster in the cloud?
“It’s a really nice round number,” says Stowe, the CEO and founder of Cycle Computing, a vendor that helps customers gain fast and efficient access to the kind of supercomputing power usually reserved for universities and large research organizations.
Cycle Computing had already built a few clusters on Amazon’s Elastic Compute Cloud that scaled up to several thousand cores. But Stowe wanted to take it to the next level. Provisioning 10,000 cores on Amazon has probably been done numerous times, but Stowe says he’s not aware of anyone else achieving that number in an HPC cluster, meaning one that uses a batch scheduling technology and runs an HPC-optimized application.
“We haven’t found references to anything larger,” Stowe says. Had it been tested for speed, the Linux-based cluster Stowe ran on Amazon might have been big enough to make the Top 500 list of the world’s fastest supercomputers.
One of the first steps was finding a customer that would benefit from a cluster of that size. There’s no sense in spinning up such a large environment unless it’s devoted to real work.
The customer that opted for the 10,000-core cloud cluster was biotech company Genentech in San Francisco, where scientist Jacob Corn needed computing power to examine how proteins bind to each other, in research that might eventually lead to medical treatments. Compared to the 10,000-core cluster, “we’re a tenth the size internally,” Corn says.
Cycle Computing and Genentech spun up the cluster on March 1, a little after midnight, based on Amazon’s advice regarding the optimal time to request 10,000 cores. While Amazon offers virtual machine instances optimized for high-performance computing, Cycle and Genentech instead opted for a “standard vanilla CentOS” Linux cluster to save money, according to Stowe. CentOS is a free Linux distribution built from the source code of Red Hat Enterprise Linux.
The 10,000 cores came from 1,250 instances with eight cores each, and the cluster had a combined 8.75TB of RAM and 2PB of disk space. Scaling up a couple of thousand cores at a time, it took 45 minutes to provision the whole cluster. There were no problems. “When we requested the 10,000th core, we got it,” Stowe says.
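For readers who want to check the sizing, the figures above break down per instance as in the short sketch below. It uses only the numbers quoted in the article; the variable names are illustrative, and the conversion assumes decimal units (1TB = 1,000GB, 1PB = 1,000TB), which the article does not specify.

```python
# Cluster figures as quoted in the article.
instances = 1250
cores_per_instance = 8
total_ram_tb = 8.75
total_disk_pb = 2
provisioning_minutes = 45

total_cores = instances * cores_per_instance

# Assumption: decimal units (1 TB = 1000 GB, 1 PB = 1000 TB).
ram_per_instance_gb = total_ram_tb * 1000 / instances
ram_per_core_gb = ram_per_instance_gb / cores_per_instance
disk_per_instance_tb = total_disk_pb * 1000 / instances

# Average rate only; the article says cores were added a couple of thousand at a time.
avg_cores_per_minute = total_cores / provisioning_minutes

print(f"total cores:        {total_cores}")
print(f"RAM per instance:   {ram_per_instance_gb:.2f} GB")
print(f"RAM per core:       {ram_per_core_gb:.3f} GB")
print(f"disk per instance:  {disk_per_instance_tb:.2f} TB")
print(f"avg provisioning:   {avg_cores_per_minute:.0f} cores/minute")
```

Under those assumptions, each instance works out to roughly 7GB of RAM and 1.6TB of disk, and the cluster grew at an average of about 222 cores per minute.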
The cluster ran for eight hours at a cost of $8,500, including all the fees to Amazon and Cycle Computing.
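As a rough guide to what that bill means per unit of compute, the sketch below divides the quoted total by the core-hours consumed. It assumes all 10,000 cores ran for the full eight hours, which the article implies but does not state, and the resulting rates are all-in figures covering both Amazon's and Cycle Computing's fees, not Amazon list prices.

```python
# Figures as quoted in the article.
total_cost_usd = 8500
run_hours = 8
total_cores = 10_000
cores_per_instance = 8

# Assumption: every core was up for the whole eight-hour run.
core_hours = total_cores * run_hours
instance_hours = (total_cores // cores_per_instance) * run_hours

cost_per_core_hour = total_cost_usd / core_hours
cost_per_instance_hour = total_cost_usd / instance_hours

print(f"core-hours:             {core_hours}")
print(f"cost per core-hour:     ${cost_per_core_hour:.5f}")
print(f"cost per instance-hour: ${cost_per_instance_hour:.2f}")
```

That comes to 80,000 core-hours at a little under 11 cents per core-hour, or 85 cents per eight-core instance per hour.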
For Genentech, this was cheap and easy compared to the alternative of buying 10,000 cores for its own data center and having them sit idle for most of their lives, Corn says. Using Genentech’s existing resources to perform the simulations would take weeks or months instead of the eight hours it took on Amazon, he says. Genentech benefited from the high number of cores because its calculations were “embarrassingly parallel,” with no communication between nodes, so performance stats “scaled linearly with the number of cores,” Corn says.
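To illustrate what “embarrassingly parallel” means in practice, here is a minimal Python sketch, not Genentech's or Cycle Computing's actual code. The function `score_pair` is a hypothetical stand-in for one protein-binding calculation, and a local thread pool stands in for the batch scheduler that would hand tasks to separate cores on a real cluster. The point is that each task depends only on its own input, so the work can be split across any number of workers without changing the results. The helper `ideal_wall_hours` models the best-case linear scaling Corn describes and ignores queueing, startup, and other overheads.

```python
from concurrent.futures import ThreadPoolExecutor


def score_pair(pair):
    """Hypothetical stand-in for one independent simulation task.

    It reads only its own input and shares no state with other tasks.
    """
    a, b = pair
    return (a * 31 + b * 17) % 101


def run_all(pairs, workers):
    """Run every task on a pool of workers; map() keeps the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(score_pair, pairs))


def ideal_wall_hours(core_hours, cores):
    """Best-case run time if work scales linearly with the core count."""
    return core_hours / cores


pairs = [(a, b) for a in range(20) for b in range(20)]
serial = [score_pair(p) for p in pairs]
parallel = run_all(pairs, workers=8)

# Same answers regardless of how the tasks are distributed.
print(serial == parallel)

# Halving the cores doubles the idealized run time for the same workload.
print(ideal_wall_hours(80_000, 10_000), ideal_wall_hours(80_000, 5_000))
```

Because no task waits on another, adding workers shortens the run without any coordination cost in this idealized model, which is why such workloads suit large, loosely connected cloud clusters.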