How Meta detects and mitigates ‘silent errors’


Silent errors, as they are called, are hardware defects that leave no trace in system logs. Their incidence can be increased by factors such as temperature and age. They are an industry-wide problem and a major challenge for datacenter infrastructure, as they can wreak havoc on applications for long periods while going undetected.

In a newly published paper, Meta details how it detects and mitigates these errors in its infrastructure. It uses a combined approach, testing both when machines are offline for maintenance and with small tests during production. Meta found that while the former achieves greater overall coverage, in-production testing achieves strong coverage in a much shorter time.

Silent errors

Silent errors, also known as silent data corruption (SDC), are the result of an internal hardware defect. More precisely, these errors occur in places where there is no detection logic, so they cannot be traced back to the defect. Their occurrence can be further influenced by factors such as temperature variation, datapath variation and age.

Defects cause incorrect circuit operation. This can manifest as a flipped bit in a data value at the application level, or it can lead the hardware to execute entirely wrong instructions. The effects can then spread to other services and systems.

For example, in one case study, a simple computation in a database repeatedly returned a wrong answer of 0, resulting in missing rows and lost data. At Meta's scale, the company reports hundreds of such SDCs. Meta has observed an SDC incidence rate of roughly one in a thousand silicon devices, which it says reflects fundamental silicon challenges rather than particle effects such as cosmic rays.
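To make the case study concrete, here is a minimal Python sketch of how one silently miscomputed value can cascade into missing rows. The specific input (1.1 raised to the 53rd power evaluating to 0 instead of roughly 156) follows the widely reported example from Meta's paper; `faulty_pow` and `shard_size` are hypothetical stand-ins, not Meta's code.

```python
# Hypothetical sketch: how one silently corrupted arithmetic result can
# cascade into data loss. faulty_pow models a defective core that returns
# 0 for one specific input pattern while all other inputs stay correct.

def faulty_pow(base: float, exp: int) -> float:
    """Model of a defective core: one input silently yields 0."""
    if (base, exp) == (1.1, 53):  # the corrupted input pattern
        return 0.0                # correct value is ~156
    return base ** exp

def shard_size(base: float, exp: int) -> int:
    """Derive a row count from the computation; 0 means 'no rows'."""
    return int(faulty_pow(base, exp))

# Downstream code writes a file with shard_size rows. A size of 0 means
# the rows are silently dropped: no crash, no log entry, just lost data.
rows_written = shard_size(1.1, 53)
print(rows_written)  # 0 -- the correct result would be 156
```

The key property is that nothing fails loudly: every layer behaves "correctly" given the corrupted value, which is exactly why such defects evade system logs.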

Meta has been running a detection and testing framework since 2019. Its strategies fall into two buckets: Fleetscanner for out-of-production testing and Ripple for in-production testing.

Silicon test funnel

Before a silicon device enters the Meta fleet, it passes through a silicon test funnel. During development, the chip goes through pre-silicon verification (simulation and emulation) and post-silicon verification on actual samples. Both of these can last several months. During manufacturing, the device undergoes further (automated) tests at the device and system level. Silicon vendors often exploit this level of testing for binning purposes, since performance varies between chips; underperforming chips lower the product yield.

Finally, when the device arrives at Meta, it undergoes infrastructure intake (burn-in) testing with many software configurations at the rack level. Traditionally, this would complete testing, and the device would be expected to work for the rest of its life cycle, relying on built-in RAS (reliability, availability, serviceability) features to monitor system health.

However, these methods cannot detect SDCs. Detecting them requires dedicated test patterns that are run periodically during production, which in turn requires orchestration and scheduling; in the most extreme case, the tests run alongside the production workloads themselves.

It is worth noting that the closer a test runs to the production workload, the shorter its duration can be, but the lower its ability to diagnose the root cause of silicon defects. In addition, moving up the stack increases the cost and complexity of testing, as well as the potential impact of a defect. For example, at the system level multiple types of devices have to work together seamlessly, while the infrastructure level adds complex applications and operating systems.

Fleetwide test observations

Silent errors are difficult because they can produce erroneous results that go undetected while affecting numerous applications. These errors continue to propagate until they produce a noticeable discrepancy at the application level.

Moreover, there are many factors that affect their occurrence. Meta found that these defects fall into four main categories:

  • Data randomization. Corruption depends on the input data, for example on specific bit patterns. This creates a huge state space for testing. For example, 3 times 5 might correctly evaluate to 15, while 3 times 4 incorrectly evaluates to 10.
  • Electrical variation. Changes in voltage, frequency and current can lead to more cases of data corruption. Under one set of these parameters a result may be accurate, while under another it may not. This further enlarges the test state space.
  • Environmental variation. Other factors such as temperature and humidity can also affect silent errors, since they directly affect the physics of the device. Even controlled environments like datacenters can have hotspots, which can lead to result variations across datacenters.
  • Life-cycle variation. Like regular device failures, the occurrence of SDCs can also change over the silicon life cycle.
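The data-randomization category above implies that a fixed known-answer test is not enough: corruption may trigger only on specific bit patterns, so a test must sweep many inputs against a trusted reference. A minimal Python sketch of such a pattern sweep, with illustrative function names (`device_mul` stands in for running the operation on the device under test; it is not Meta's API):

```python
# Hypothetical sketch of data-pattern testing: run a known computation
# over many bit patterns and compare each result against a software
# reference, since corruption may only trigger on specific inputs.
import random

def reference_mul(a: int, b: int) -> int:
    return a * b  # trusted result (in practice precomputed or cross-checked)

def device_mul(a: int, b: int) -> int:
    # Stand-in for executing the same operation on the device under test.
    return a * b

def sweep(trials: int = 10_000, seed: int = 0) -> list:
    """Return input pairs whose device result disagrees with the reference."""
    rng = random.Random(seed)
    mismatches = []
    for _ in range(trials):
        # Mix targeted bit patterns (all-ones, single set bits) with random data.
        a = rng.choice([0xFFFF_FFFF, 1 << rng.randrange(32), rng.getrandbits(32)])
        b = rng.getrandbits(32)
        if device_mul(a, b) != reference_mul(a, b):
            mismatches.append((a, b))
    return mismatches

print(len(sweep()))  # 0 on healthy hardware
```

Electrical and environmental variation multiply this state space further: in practice the same sweep would have to be repeated across voltage/frequency settings and over the device's lifetime.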

Infrastructure testing

Meta has implemented two kinds of fleetwide testing across millions of machines: out-of-production and in-production testing.

Workflow chart for out-of-production testing.

In out-of-production testing, machines are taken offline and subjected to known input patterns, and the output is compared with reference results. These tests consider all the variables discussed above, exploring the state space with search policies.

Machines are not typically taken offline just to test for silent errors; instead, they are tested opportunistically whenever a machine is offline for other reasons, such as firmware and kernel upgrades, provisioning, or routine server repair.

During such server maintenance, silent error detection is performed with a test tool called Fleetscanner. This mode of operation reduces overhead and therefore cost. When silent data corruption is detected, the machine is quarantined and subjected to further tests.
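The opportunistic scheduling described above can be sketched in a few lines: hook the SDC scan into maintenance events that already take the machine offline, and quarantine machines that fail. All names here are illustrative, not Meta's internal API.

```python
# Hypothetical sketch of opportunistic out-of-production testing: piggyback
# an SDC scan on maintenance events that already take the machine offline,
# and quarantine machines that fail the scan.
from enum import Enum, auto

class MaintenanceEvent(Enum):
    KERNEL_UPGRADE = auto()
    FIRMWARE_UPGRADE = auto()
    PROVISIONING = auto()
    REPAIR = auto()

def run_sdc_scan(host: str) -> bool:
    """Stand-in for the full offline test suite; True means healthy."""
    return True

def on_maintenance(host: str, event: MaintenanceEvent, quarantined: set) -> None:
    # The machine is already offline for `event`, so the scan adds little
    # extra overhead beyond the maintenance window itself.
    if not run_sdc_scan(host):
        quarantined.add(host)  # isolate the machine for deeper diagnosis

quarantined = set()
on_maintenance("host-001", MaintenanceEvent.KERNEL_UPGRADE, quarantined)
print(sorted(quarantined))  # [] -- host passed the scan and stays in service
```

The design choice is amortization: because the downtime already exists, the marginal cost of scanning is near zero, at the price of unpredictable scan frequency per machine.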

Workflow chart for in-production testing.

Since out-of-production testing is slow and therefore reacts slowly to newly identified fault signatures, Meta also tests in production with a tool called Ripple. It co-locates with workloads and executes test instructions at millisecond-level intervals. Meta reports that it can perform shadow testing, running A/B tests across the variables discussed above, and that the tool is always on. Meta specifically identifies Ripple testing as a major evolution of its silent data corruption tooling.
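The idea of co-locating millisecond-scale tests with real work can be sketched as follows. This is a simplified single-threaded illustration of interleaving, not Ripple's actual mechanism; `test_burst` and the interleaving interval are hypothetical.

```python
# Hypothetical sketch of Ripple-style in-production testing: interleave
# tiny test bursts with real work on the same machine, comparing each
# burst against a known expected value.

def test_burst() -> bool:
    """A tiny known computation; a mismatch hints at silent corruption."""
    expected = 328350  # precomputed sum of i*i for i in range(100)
    return sum(i * i for i in range(100)) == expected

def run_with_ripple(work_items, interval_items: int = 3) -> int:
    """Process workload items, slipping a test burst in every few items."""
    failures = 0
    for n, item in enumerate(work_items, 1):
        item()                       # real production work
        if n % interval_items == 0:  # co-located, always-on testing
            if not test_burst():
                failures += 1        # would flag the host for quarantine
    return failures

work = [lambda: None] * 10  # stand-in for production work items
print(run_with_ripple(work))  # 0 mismatches on healthy hardware
```

Because each burst is tiny, the overhead on the workload stays small, which is what makes an always-on scheme viable; the trade-off, as noted above, is weaker root-cause diagnosis than a dedicated offline scan.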

Findings and trade-offs

Based on three years of observations, Fleetscanner achieves 93% coverage of a certain defect family, including 23% unique coverage that Ripple cannot reach. The cost, however, is the non-trivial time (and therefore expense) spent on testing. Ripple, in turn, offers 7% unique coverage. Meta argues that this coverage would be impossible to achieve with Fleetscanner, because of Ripple's frequent transitions between workloads.

Comparing the time to reach 70% equivalent SDC coverage, Fleetscanner would take 6 months, compared with just 15 days for Ripple.

Left undetected, silent data corruption can expose applications for months. This can lead to severe effects such as data loss, which can take months to debug. As such, it poses a serious problem for datacenter infrastructure.

Meta has implemented a comprehensive testing methodology that combines out-of-production Fleetscanner tests, run during maintenance performed for other purposes, with rapid (millisecond-level) in-production Ripple testing.

