America’s Frontier, the world’s most powerful supercomputer, still suffers hardware failures and can’t run for a full day

Tram Ho

America now owns a supercomputer in a class of its own, unlike any earlier machine in the US or anywhere else in the world: Frontier.

It is based on the Hewlett Packard Enterprise (HPE) Cray EX235a architecture, with 9,408 nodes, each equipped with a 64-core AMD Epyc 7A53 “Trento” CPU with 512 GB of DDR4 memory and four AMD Instinct MI250X GPUs, each with 128 GB of HBM2e memory. In total, the system has 602,112 CPU cores, 8,138,240 GPU cores, and 4.6 PB each of DDR4 and HBM2e memory.

The entire system is housed in 74 cabinets, each weighing more than 3.6 tons. Supporting it is a 700-petabyte storage system, with HPE’s Slingshot high-performance Ethernet interconnect handling data transfer.
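The totals quoted above follow from the per-node figures. The short Python sketch below reproduces that arithmetic. The constant and function names are illustrative, and it assumes the 4.6 PB figure uses binary units (1 PB = 1,024 × 1,024 GB) and applies to each memory type separately. The GPU core count is not derived here because the article gives no per-GPU core figure.

```python
# Per-node figures as quoted in the article.
NODES = 9_408
CPU_CORES_PER_NODE = 64
DDR4_GB_PER_NODE = 512
GPUS_PER_NODE = 4
HBM_GB_PER_GPU = 128

# Assumption: the article's "PB" is binary (1 PB = 1024 * 1024 GB).
GB_PER_PB = 1024 * 1024


def frontier_totals():
    """Return system-wide totals derived from the per-node figures."""
    cpu_cores = NODES * CPU_CORES_PER_NODE
    gpus = NODES * GPUS_PER_NODE
    ddr4_gb = NODES * DDR4_GB_PER_NODE
    hbm_gb = gpus * HBM_GB_PER_GPU
    return {
        "cpu_cores": cpu_cores,
        "gpus": gpus,
        "ddr4_pb": ddr4_gb / GB_PER_PB,
        "hbm_pb": hbm_gb / GB_PER_PB,
    }


totals = frontier_totals()
for name, value in totals.items():
    print(f"{name}: {value}")
```

Under these assumptions the CPU core count matches the quoted 602,112, and each memory pool rounds to 4.6 PB.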


Frontier is currently the most powerful supercomputer in the world.

In May of this year, Frontier entered the TOP500, the global ranking of supercomputers, as the first machine to break the “exascale barrier” after demonstrating 1.102 exaFLOPS of computing performance. Since then, Oak Ridge National Laboratory in Tennessee, which manages the supercomputer, has said it is preparing the system for scientific work, which is expected to begin in January next year.

However, the latest reports suggest that Frontier’s launch may be disrupted by hardware failures. In a recent interview with Inside HPC, Justin Whitt, program director at Oak Ridge, confirmed that Frontier is experiencing system failures on a daily basis but insisted this is inevitable with a system of this size.

“The mean time between failures on a system this size is hours, not days,” he said. “So you need to make sure you understand what those failures are and that there’s no pattern to those failures that you need to be concerned about.”
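A rough way to see why a machine this large fails within hours is the standard reliability approximation: if components fail independently at a constant rate, the system's mean time between failures (MTBF) is roughly the per-component MTBF divided by the number of components. The Python sketch below illustrates that model only. The five-year per-node MTBF is a hypothetical value chosen for illustration, not a figure from Oak Ridge, and the function names are made up for this example.

```python
import math

NODES = 9_408  # node count quoted in the article

# Hypothetical per-node MTBF (5 years); NOT a measured Frontier figure.
ASSUMED_NODE_MTBF_HOURS = 5 * 365 * 24


def system_mtbf_hours(node_mtbf_hours, node_count):
    """System MTBF assuming independent nodes with constant failure rates."""
    if node_mtbf_hours <= 0 or node_count <= 0:
        raise ValueError("MTBF and node count must be positive")
    return node_mtbf_hours / node_count


def prob_no_failure(duration_hours, mtbf_hours):
    """Probability of zero failures in a window (exponential model)."""
    return math.exp(-duration_hours / mtbf_hours)


mtbf = system_mtbf_hours(ASSUMED_NODE_MTBF_HOURS, NODES)
print(f"System MTBF under the assumed node MTBF: {mtbf:.2f} hours")
print(f"Chance of a failure-free 24 hours: {prob_no_failure(24, mtbf):.4f}")
```

Even with every node lasting years on average, this simplified model puts the whole system's time between failures at a few hours, which is consistent with Whitt's description. Real failures are not fully independent, so the numbers are indicative at best.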

Whitt added that running for more than a day without a failure would be “excellent.” According to him, the goal is to let users be productive in their scientific work, and the run time that requires varies from project to project.

“Our goal is still hours,” Whitt said, referring to the time between failures.


Justin Whitt next to “a small piece” of the Frontier supercomputer.

Some rumors blame the hardware problems on the new AMD Instinct MI250X, but Whitt has denied this. The MI250X is AMD’s most powerful GPU, and the company sells it only to select partners.

“The problems span a lot of different categories; the GPUs are just one of them,” Whitt said. “We’re dealing with a lot of things that come with a newly built system, as well as things that haven’t been seen on other systems we’ve deployed, so it’s all fairly commonplace.”

Whitt admits that Frontier’s unprecedented size has made things “a bit more difficult,” but he said the program is still sticking to the schedule laid out in 2018-2019, despite delays caused by the pandemic.

References: Techspot, Inside HPC


Source: Genk