In an era with no shortage of established best practices and tools, engineering teams are finding it consistently harder to prevent, detect and resolve production issues. Why is this the case?
Our most recent DevOps Pulse Survey highlighted alarming trends to this end. Among the most troubling: Mean Time to Recovery (MTTR), the average time it takes to recover from a product or system failure, has worsened despite all the great tools and talent out there.
In 2021, roughly 64% of DevOps Pulse Survey respondents reported their MTTR during production incidents was over an hour. This was up pretty significantly from 47% in 2020.
At the same time, some 90% of respondents said they use multiple tools for observability, with 66% using 2-4 observability systems and roughly 24% employing anywhere from 5 to 10. In 2020, roughly 20% were using only one tool, and only 10% were using more than five.
Clearly, we have no shortage of observability tools at our disposal, nor of data we can glean from those tools. Yet the evidence suggests that things are growing more complicated, not easier.
Here are six reasons why I believe this to be the case.
Reason #1: Explosion of Noisy, Valueless Data
In the journey to become smarter about consolidating and analyzing relevant data, many organizations have shipped enormous amounts of useless telemetry into their monitoring systems, spanning the three major categories of observability data: logs, metrics and traces.
In an effort to become more insightful and informed, many engineering organizations collect ever-increasing volumes of data under the misguided assumption that “we need everything.”
We frequently hear from customers struggling to increase uptime that they have all the data, but upon further review, much of it is useless. The reality is that, given the sheer volume alone, they’re never going to look through the lion’s share of it. They’re never going to search it, and they’re never going to analyze it. But under this model, they are going to pay for it, and that’s a big deal as well.
The big-picture problem: most organizations really don’t have the maturity or talent to understand all of this data, or to put it to work and gain insights from it.
Reason #2: Organizations Aren’t Mature Enough to Understand the Data
When you have such voluminous amounts of data, what are you supposed to do with it? How would you even know what to do with all of it?
Most companies start by collecting a lot of logs on Day 1. As Kubernetes has become more common, for instance, most teams have realized there are significant challenges around gaining observability into it. Many have tried to solve this by analyzing K8s data alongside a lot of other telemetry data types. At present, most seem stuck somewhere in the journey of implementation, instrumentation, collection and parsing.
Here’s the issue: this journey is taking forever. The reality is that it takes most companies years to become mature enough to make data-driven decisions on their production environment. Most are still in this spot, in my view. They’re collecting a lot of data, but they still don’t have the maturity to use it effectively. Plus, most don’t have the in-house expertise to get there.
For example, how do you use distributed tracing effectively to isolate incidents and understand performance issues? What is the best way to correlate it with logs in your specific production environment, and get to the insights needed in real time to automate mitigation? How do you take quick action to find the root cause as fast as possible? These questions usually span lots of dispersed, disjointed systems, and it takes a lot of time for observability to really mature.
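To make the trace-log correlation question a bit more concrete, here is a minimal Python sketch of one common pattern: stamping every log line with the IDs of the active trace and span so a log search can pivot straight to the matching trace. It assumes the opentelemetry-sdk package; the “checkout” logger and span names are illustrative, not a prescription.

```python
# Minimal sketch: attach the active OpenTelemetry trace/span IDs to each log
# record so logs and traces can be correlated during an incident.
# Assumes the opentelemetry-sdk package; names here are illustrative.
import logging

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider

trace.set_tracer_provider(TracerProvider())  # no exporter configured; sketch only
tracer = trace.get_tracer(__name__)


class TraceContextFilter(logging.Filter):
    """Copy the current span's trace and span IDs onto each log record."""

    def filter(self, record):
        ctx = trace.get_current_span().get_span_context()
        record.trace_id = format(ctx.trace_id, "032x")
        record.span_id = format(ctx.span_id, "016x")
        return True


logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s trace_id=%(trace_id)s span_id=%(span_id)s %(message)s",
)
logger = logging.getLogger("checkout")
logger.addFilter(TraceContextFilter())

with tracer.start_as_current_span("charge-card"):
    # This log line now carries the span's trace ID, so an engineer searching
    # logs for a failed order can jump directly to the corresponding trace.
    logger.info("charging card for order %s", "order-123")
```

Getting something like this instrumented consistently across dozens of services, and wired into backends that can actually join the two signals, is exactly the multi-year maturity journey described above.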
Reason #3: Too Damn Expensive!
Because of reasons #1 and #2, observability is broken for most organizations, which cannot afford the exponential growth in cost that it drives. (Unless you’re Twitter or Facebook, who can afford it…actually, I guess that changes too.)
Especially in the 2022 economy, spending an unpredictable fortune on limited value is not going to fly. CFOs will ask difficult questions about the value of all this telemetry data, which in reality delivers very little.
How do you forecast costs? How do you manage new code deployments with ridiculous “debug” flags enabled?
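On the second question, here is one small, hedged example of the kind of guardrail that helps: driving log verbosity from deployment configuration rather than a hard-coded debug flag, so a forgotten flag can’t silently multiply log volume and cost in production. The LOG_LEVEL variable and the “payments” logger are illustrative choices, not a standard.

```python
# Minimal sketch: read the log level from deployment configuration so a stray
# "debug" flag can't flood the logging pipeline in production.
# LOG_LEVEL and the logger name are illustrative, not a standard.
import logging
import os

VALID_LEVELS = {"CRITICAL", "ERROR", "WARNING", "INFO", "DEBUG"}

# Fall back to INFO when the variable is unset or misspelled.
level = os.getenv("LOG_LEVEL", "INFO").upper()
if level not in VALID_LEVELS:
    level = "INFO"

logging.basicConfig(level=level, format="%(asctime)s %(levelname)s %(name)s %(message)s")
logger = logging.getLogger("payments")

# Dropped (never shipped, never billed) unless a deployment explicitly sets LOG_LEVEL=DEBUG.
logger.debug("full request payload: %s", {"order": "order-123"})
logger.info("payment accepted")
```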
Reason #4: Environments are Constantly Getting More Complex
We touched on teams dealing with an explosion of data. But related to that primary challenge is the sheer complexity of most production environments and of the observability required to monitor them effectively. Faced with spiraling complexity, most implementations simply fail to help the average engineer become more effective. Yes, there are a few experts in each organization who understand how it all connects, but more often than not, the majority of the team is left behind.
Kubernetes is great, for example, but it consistently creates additional complexity and abstraction layers. As a result, engineers are more and more disconnected from the infrastructure on which their code actually runs. This creates an observability nightmare when someone is trying to understand the root cause of an incident that spans different microservices and different K8s clusters in different regions.
Yet most of today’s observability solutions are focused on adding more and more capabilities instead of helping engineering teams cope with the ever-rising complexity of their environments.
Reason #5: Engineers Want Open Source, But Those Solutions Have Gaps
I personally believe that if given the choice of solutions, many engineers will choose open source every time. One need only look at the sheer number of successful projects and legions of users supporting those tools to underline this claim. For observability, huge numbers of teams are already using tools such as Prometheus, Grafana, Jaeger, OpenSearch and OpenTelemetry. Many times, this is where they actually start when building out their monitoring stacks.
Yet, while each of these individual tools serves foundational requirements for establishing visibility into logs, metrics and traces, there are so many gaps in the widely used open source solutions that it is too difficult to effectively achieve end-to-end observability and support a modern environment.
For that reason, many organizations walk away from their open source roots and enlist proprietary vendors that promise to unify their data and address all the related requirements. How is that working out? See above.
For those who decide to stick with open source and roll their own stacks, the approach most often leads to a “Frankenstein” of observability tools mish-mashed together. Many attempt to combine open source and proprietary tools.
These efforts most often drain the value of the entire implementation and amplify the aforementioned complexity that keeps organizations from reaching their observability goals.
Reason #6: A Deepening Engineering Talent Gap
I mentioned before that many organizations don’t have people in place who truly understand observability and how to analyze all the telemetry data involved. This is compounded by the other factors I’ve already described.
The truth is, people with the needed skills are extremely hard to come by and retain. A very small number of engineers actually understand how all of the relevant observability signals should come together, and how their microservices connect to the cloud, Kubernetes and everything else. It’s just not that easy.
Most teams do not have that level of sophistication. They’re struggling to use their current observability implementations to find and isolate root problems in their environments. Try putting yourself in the shoes of an inexperienced or understaffed engineering team facing all of the pervasive challenges we’ve already mentioned here.
Fixing Broken Observability
So what can we do to fix this long list of observability issues?
At Logz.io, we’ve listened to customer feedback to understand what teams need in an environment that is overwhelming, costly and demanding of hyper-specialized skills. They need unified capabilities, the ability to cut through the data and complexity, and the freedom to harness the power and flexibility of open source…with the teams they already have in place today.
We’re very excited to share more in the coming weeks about how we can help organizations fix their broken observability environments. Solutions to these issues are what we work to build every day. Stay tuned for more details soon!