Imagine for a moment that the millions of computer chips inside the servers that power the world’s largest data centers had rare, nearly undetectable flaws. And the only way to find those flaws was to throw the chips at giant computing problems that would have been unthinkable just a decade ago.
As the tiny switches in computer chips have shrunk to the width of a few atoms, the reliability of chips has become another worry for the people who run the world’s biggest networks. Companies like Amazon, Facebook and Twitter, along with many other large sites, have experienced surprising outages over the last year.
The outages have many causes, including programming mistakes and congestion on the networks. But there is growing concern that as cloud-computing networks have become larger and more complex, they still depend, at the most basic level, on computer chips that are now less reliable and, in some cases, less predictable.
Over the past year, researchers at both Facebook and Google have published studies describing computer hardware failures whose causes have not been easy to identify. They argued that the problem was not in the software; it was somewhere in the computer hardware made by various companies. Google declined to comment on its study, while Facebook did not respond to requests for comment on its study.
“They are seeing these silent errors, essentially coming from the underlying hardware,” said Subhasish Mitra, an electrical engineer at Stanford University who specializes in testing computer hardware. Increasingly, Dr. Mitra said, people believe that manufacturing defects are tied to these so-called silent errors, which are not easily caught.
Researchers worry that they are finding rare defects because they are trying to solve bigger and bigger computing problems, which stresses their systems in unexpected ways.
Companies operating large data centers began reporting systematic problems more than a decade ago. In 2015, in the engineering publication IEEE Spectrum, a group of computer scientists who study hardware reliability at the University of Toronto reported that each year as many as 4 percent of Google’s millions of computers had encountered errors that could not be detected and that caused them to shut down unexpectedly.
In a microprocessor with billions of transistors, or a computer memory board composed of trillions of tiny switches that can each store a 1 or 0, even the smallest error can disrupt a system that now routinely performs billions of calculations each second.
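To make the scale of the problem concrete, here is a toy Python illustration (my own, not from the researchers' papers) of how inverting a single one of the 64 bits that represent a number can throw a result off by hundreds of orders of magnitude:

```python
import struct

def flip_bit(x: float, bit: int) -> float:
    """Return x with one bit of its 64-bit IEEE-754 representation inverted."""
    (as_int,) = struct.unpack("<Q", struct.pack("<d", x))
    return struct.unpack("<d", struct.pack("<Q", as_int ^ (1 << bit)))[0]

total = 1000.0
corrupted = flip_bit(total, 62)  # flip one high exponent bit
# The corrupted value is no longer anywhere near 1000: it becomes a
# vanishingly small number, silently poisoning any later arithmetic.
print(total, corrupted)
```

A real silent error works the same way, except no software ever asked for the flip, and nothing announces that it happened.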
Early in the semiconductor era, engineers worried about the possibility of cosmic rays occasionally flipping a transistor and changing the outcome of a calculation. Now they are worried that the switches themselves are becoming less reliable. The Facebook researchers even argue that the switches are becoming more error-prone and that the lifespan of computer memories or processors may be shorter than previously believed.
Evidence is growing that the problem is getting worse with each new generation of chips. A report published in 2020 by chip maker Advanced Micro Devices found that the most advanced computer memory chips at the time were about 5.5 times less reliable than previous generations. AMD did not respond to requests for comment on the report.
David Ditzel, a veteran hardware engineer who is chairman and founder of Esperanto Technologies, a maker of a new type of processor designed for artificial intelligence applications in Mountain View, Calif., said tracking these errors is challenging. He said his company’s new chip, which is just reaching the market, has 1,000 processors made from 28 billion transistors.
He compared the chip to an apartment building that would span the surface of the United States. Using Mr. Ditzel’s metaphor, Dr. Mitra said that finding the new errors was a little like searching for a single running faucet in one apartment in that building, a defect that shows up only when a bedroom light is on and the apartment door is open.
So far, computer designers have tried to deal with hardware flaws by adding special circuits to chips that correct errors. These circuits automatically detect and correct bad data. Such errors were once considered exceedingly rare. But several years ago, Google production teams began reporting errors that were difficult to diagnose. The calculation errors would happen intermittently and were difficult to reproduce, according to their report.
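The idea behind those error-correcting circuits can be sketched in a few lines of Python. This is a minimal illustration of a classic Hamming(7,4) code, not the proprietary circuitry any of these companies use: four data bits are stored alongside three parity bits, and recomputing the parity checks spells out the position of any single flipped bit so it can be flipped back.

```python
def hamming_encode(d):
    """Encode 4 data bits into a 7-bit codeword (positions 1..7)."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4            # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4            # parity over positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4            # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming_correct(c):
    """Locate and fix a single flipped bit, then return the 4 data bits."""
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    pos = s1 * 1 + s2 * 2 + s3 * 4   # failed checks encode the bad position
    fixed = list(c)
    if pos:                          # pos == 0 means no error detected
        fixed[pos - 1] ^= 1
    return [fixed[2], fixed[4], fixed[5], fixed[6]]

word = [1, 0, 1, 1]
code = hamming_encode(word)
code[4] ^= 1                         # simulate one bit flipping in storage
assert hamming_correct(code) == word # the original data is recovered
```

The trouble the Google and Facebook teams describe is precisely the errors that slip past this kind of protection: miscalculations inside a core, rather than flipped bits in stored data.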
A team of researchers tried to track the problem, and last year they published their findings. They concluded that the company’s huge data centers, made up of computer systems based on millions of processor “cores”, were experiencing new errors that were probably a combination of several factors: small transistors that were close to physical limits and inadequate testing.
In their paper “Cores That Don’t Count,” the Google researchers noted that the problem was challenging enough that they had already devoted several decades of engineering time to solving it.
Modern processor chips are made up of dozens of processor cores, the calculating engines that make it possible to break up tasks and solve them in parallel. The researchers found that a tiny subset of the cores produced inaccurate results infrequently and only under certain conditions. They described the behavior as sporadic. In some cases, the cores would produce errors only when computing speed or temperature was altered.
According to Google, increasing complexity in processor design was one important cause of the failures. But the engineers also said that smaller transistors, three-dimensional chips and new designs that create errors only in certain cases all contributed to the problem.
In a similar paper published last year, a group of Facebook researchers noted that some processors would pass the manufacturer’s tests but began to show failures when they were in the field.
Intel executives said they are familiar with Google and Facebook research papers and are working with both companies to develop new methods for detecting and correcting hardware errors.
Brian Jorgensen, vice president of Intel’s Data Platform Group, said the researchers’ claims were correct and that the challenge they were posing to the industry was aimed at the right place.
He said Intel had recently started a project to help create standard, open-source software for data center operators. The software would make it possible for them to find and correct hardware errors that the circuits built into the chips were not detecting.
The challenge was underscored last year, when several of Intel’s customers quietly issued warnings about flaws discovered in their systems. Lenovo, the world’s largest maker of personal computers, informed its customers that design changes in several generations of Intel’s Xeon processors meant that the chips might generate a larger number of uncorrectable errors than earlier Intel microprocessors.
Intel has not spoken publicly about the issue, but Mr. Jorgensen acknowledged the problem and said it had now been rectified. The company has since changed its design.
Computer engineers are divided over how to respond to the challenge. One widespread response is demand for new kinds of software that proactively watch for hardware errors and make it possible for system operators to remove hardware when it begins to degrade. That has created an opportunity for new start-ups offering software that monitors the health of the chips inside data centers.
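One monitoring approach implied by the article can be sketched simply; the function names and the choice of a hash as the test workload here are my own illustration, not any vendor's actual product. The idea is to periodically run a deterministic computation several times and flag any disagreement, since a core that intermittently miscomputes will eventually produce a divergent answer:

```python
import hashlib

def checked_workload(data: bytes) -> str:
    """A deterministic computation whose output should never vary."""
    return hashlib.sha256(data).hexdigest()

def screen_for_silent_errors(data: bytes, runs: int = 3) -> bool:
    """Re-run the same computation and compare the results.

    Returns True when every run agrees (the hardware looks healthy)
    and False when any run diverges (a suspected silent error).
    """
    results = {checked_workload(data) for _ in range(runs)}
    return len(results) == 1

print(screen_for_silent_errors(b"fleet-health probe"))
```

In practice such probes would be pinned to specific cores and varied in speed and temperature, since, as the Google researchers found, some cores misbehave only under particular conditions.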
One such start-up is TidalScale, a company based in Los Gatos, Calif., that makes specialized software for companies trying to minimize hardware outages. Its chief executive, Gary Smerdon, suggested that TidalScale and others faced an imposing challenge.
“It would be a bit like changing an engine while a plane is still flying,” he said.