How ML can solve root cause mysteries of application failures for engineering and support teams

This article was contributed by Ajay Singh, Founder and CEO of Zebrium.

Software sometimes breaks, whether in the cloud, in hardware appliances, or in infrastructure such as networking and security. It is an inevitable fact of life, driven mainly by frequent code updates, growing complexity and the many ways software is used. Application problems are costly for companies and can lead to lost customers, abandoned shopping carts or even a damaged reputation.

A six-hour Facebook outage in October 2021 resulted in a loss of $164,000 per minute and reduced the company’s market cap by about $40 billion. The December 2021 AWS outage wreaked havoc across US banks, service companies and retailers, causing significant damage when mobile or web applications failed. Outages and problems are extremely expensive, so fixing them quickly is paramount. The pressure is on, and the clock is ticking. Unfortunately, finding the root cause of these failures is rarely straightforward and often involves significant detective work.

In the case of the Facebook outage, Downdetector reported: “With more than 10.6 million problem reports from around the world, this is the biggest outage we’ve ever seen on Downdetector.” The outage was eventually traced to a configuration change. Outages are becoming more severe and more costly, according to the Uptime Institute’s 2020 outage analysis report. At the same time, software built on microservices and cloud infrastructure is becoming more complex to troubleshoot as dependencies grow.

In an ideal world, engineers and support teams looking for a root cause would have a continuous stream of logs, unlimited time to analyze it and a clear understanding of the problem they are trying to solve, but this rarely happens. Most of the time, they receive a bundle of log files after the fact, with no other context or understanding of the problem. They are then asked to put their detective skills to work. Because these files are often just a snapshot of a few hours on the day of the incident, the difficult task of working out what went wrong can feel like an unsolvable mystery.

Thanks to some very clever machine learning (ML) techniques, however, even a static bundle of logs can yield quick answers. ML-powered root cause analysis can identify patterns and correlations that may not be obvious to a support engineer’s naked eye, and it can reveal the cause of an incident faster than manual analysis. This not only speeds up resolution, but also improves team productivity and efficiency.

In most cases, finding the root cause is made harder by the sheer size and number of logs, their messy and unstructured nature, and a lack of clarity about what exactly one is looking for. All of these factors favor ML, not because the task is impossible for trained personnel, but because ML works faster than the human eye and is not bound by the limits of available staff.

When troubleshooting by analyzing logs, skilled engineers usually start by scanning the entire log for rare and unexpected events and correlating them with errors. The larger the volume of logs and data, the harder this is for humans and the stronger the value proposition of using ML. The task becomes more difficult still when it extends to correlating anomalies and drawing meaningful insights from large data sets. With ML, each step can be completed autonomously and can be scaled to almost any volume of data.
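The manual starting point described above, spotting rare events and lining them up with errors, can be sketched in a few lines of Python. This is only a minimal frequency-based illustration, not the ML approach of any particular product: the sample log lines, the function names, the "collapse digits into a placeholder" templating rule and the five-second window are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical sample log: (seconds since start, message).
SAMPLE_LOG = [
    (0, "INFO request 101 served in 12 ms"),
    (1, "INFO request 102 served in 15 ms"),
    (2, "INFO request 103 served in 11 ms"),
    (3, "WARN config reloaded from /etc/app/v2.yaml"),
    (4, "INFO request 104 served in 14 ms"),
    (5, "ERROR upstream connection refused"),
    (6, "INFO request 105 served in 13 ms"),
]


def template(message):
    """Replace runs of digits with a placeholder so similar events share one template."""
    return re.sub(r"\d+", "<N>", message)


def rare_events(log, max_count=1):
    """Return events whose template occurs at most max_count times in the log."""
    counts = Counter(template(msg) for _, msg in log)
    return [(ts, msg) for ts, msg in log if counts[template(msg)] <= max_count]


def rare_events_near_errors(log, window=5, max_count=1):
    """Pair each error with rare non-error events seen up to `window` seconds before it."""
    errors = [(ts, msg) for ts, msg in log if msg.startswith("ERROR")]
    rare = [(ts, msg) for ts, msg in rare_events(log, max_count)
            if not msg.startswith("ERROR")]
    pairs = []
    for e_ts, e_msg in errors:
        for r_ts, r_msg in rare:
            if 0 <= e_ts - r_ts <= window:
                pairs.append((r_msg, e_msg))
    return pairs


if __name__ == "__main__":
    for rare_msg, error_msg in rare_events_near_errors(SAMPLE_LOG):
        print(f"rare event before error: {rare_msg!r} -> {error_msg!r}")
```

On the sample data, the only rare non-error event is the configuration reload, which appears two seconds before the error. A real system would need far more robust event templating and statistical scoring than this counting heuristic, which is where the ML techniques discussed here come in.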

ML is even better suited to determining the real root cause of a problem. Under time pressure and with limited team resources, engineers and support staff will often settle for a quick fix or workaround instead of identifying and addressing the true root cause. This means that the same problem will happen again and could affect many other customers as well. When ML is used to uncover the root cause, however, engineers can spend their limited time addressing the source of the problem directly and preventing it from recurring.

Of course, ML is not a cure-all for application support. Trained professionals still need to review ML findings and take appropriate action. Because most of the overall process can now be automated, team members are free to apply their skills to the most important task: the “last mile.” The result of using ML is to speed up the whole process, increase the efficiency of the team and give professionals more time to work on important tasks.

As application and environment complexity continues to grow and demands on support organizations increase, introducing ML for logs into the application support process is rapidly moving from luxury to necessity.

Ajay Singh is the founder and CEO of Zebrium.
