Tom Wilkie is CTO at Grafana Labs, a member of the Prometheus team and one of the original authors of the Cortex and Loki projects.
In a landscape increasingly shaped by advances in artificial intelligence and machine learning (AI/ML) and the emergence of large language models (LLMs), one question looms large: How will these technologies fundamentally change how we approach observability?
Despite bold assertions from some suggesting AI/ML could completely replace engineers and SREs, the reality remains different. We’re still navigating the peak of inflated expectations for generative AI, where promises often outpace practical applications. Without question, AI will have a tremendous impact on observability; it already has and will continue to. According to Grafana Labs’ 2024 Observability Survey, users are excited about how AI/ML will help augment anomaly detection, provide predictive insights and automate dashboard generation. Only 11% believe AI/ML is all hype; the rest see how it will help reduce toil and streamline operations. At the same time, it won’t be a panacea for the multifaceted challenges of observability.
The path forward demands a delicate balance between embracing innovation and managing expectations. As we venture into uncharted territory, one thing remains certain: AI/ML holds promise but not without human intervention.
Historical Skepticism Of AI And ML In Observability
As AI/ML progresses, skepticism of its utility for observability isn’t unfounded. That’s because observability is a web of complexity that requires more than just an understanding of your raw data—it also demands a deep understanding of the broader context and relationships within your systems.
To truly understand the health and behavior of a complex system, you need to know how it works, how its dependencies interact and where the boundaries are. As the modern technology stack becomes increasingly complex with more layers of abstraction, this becomes a specialized skill set. With this added complexity, traditional analysis becomes less powerful, replaced with the need for more critical thinking—in other words, the ability to connect the dots and make inferences and predictions based on prior experience.
Consider a large e-commerce platform that processes millions of transactions daily. The observability system has access to CPU data, network latency, query times and more. Although these metrics provide valuable insights into the performance of individual components, they don’t tell the full story. To understand the health and behavior of the overall system, the observability team needs to consider the broader context and relationships, such as the impact of peak shopping seasons, the effect of promotional campaigns on inventory and logistics, and so on.
In this complex, ever-changing environment, the overall system architecture becomes “unknowable” in the traditional sense—any attempt to capture it is instantly out of date. The observability team can no longer rely on static models or pre-defined thresholds to identify and resolve issues. Instead, they must adopt a more dynamic, iterative approach, where they discover the system’s behavior through telemetry, timely action and feedback loops. This might involve techniques like distributed tracing to understand the end-to-end flow of transactions, real-time anomaly detection to identify emerging trends and patterns, and incident response to quickly diagnose and mitigate issues.
Although AI/ML will assist and speed up some of these aspects, human input, understanding and contextualization will still be necessary. This is especially true during these early days of generative AI/ML where we still see LLMs laced with bias, hallucinations and other bugs. Those issues will be resolved in due time, but it’s unclear whether AI/ML will ever develop the deep understanding and contextual awareness required for effective observability.
AI And ML Opportunities In Observability
So, where will AI/ML make a meaningful impact? Initially, the focus has been on minimizing toil and automating tasks like anomaly detection and root cause analysis, freeing engineers to tackle more complex problems. For example, AI/ML models can be trained on historical monitoring data to learn normal system behavior patterns, enabling them to accurately detect anomalies and alert engineers before issues escalate.
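To make the anomaly-detection idea concrete, here is a minimal sketch in Python. It flags readings with a high z-score against the historical series; this is a deliberately simple stand-in for the richer models (seasonality, trend decomposition) real observability platforms use, and the CPU values below are invented for illustration.

```python
from statistics import mean, stdev

def detect_anomalies(series, threshold=2.0):
    """Return indices of points whose z-score exceeds the threshold."""
    mu, sigma = mean(series), stdev(series)
    if sigma == 0:
        return []  # perfectly flat series: nothing stands out
    return [i for i, v in enumerate(series) if abs(v - mu) / sigma > threshold]

# Hourly CPU readings (%) for a web server, with one spike at index 5:
cpu = [41, 43, 42, 44, 42, 95, 43, 41]
print(detect_anomalies(cpu))  # -> [5]
```

A trained model would learn what "normal" looks like per metric and per time of day rather than using one fixed threshold, but the core mechanic, comparing new data against learned baselines, is the same.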
But the real efficiency gains will come from how LLMs augment developers’ experience. With AI/ML, junior site reliability engineers (SREs) have the knowledge of someone with decades of experience at their fingertips. And instead of having to learn PromQL or other database-specific languages, they can use a natural language interface to generate their queries and interact with observability systems. An SRE could just type, “Show me the CPU utilization of our web servers over the past week,” and the AI/ML would translate that into the appropriate query. More advanced use cases involve automatic self-optimization: much as a grizzled SysOps engineer might spend days working to reduce CPU usage, LLMs can analyze profiles and recommend changes to conserve CPU and memory, saving significant time and resources.
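As a hedged illustration of that natural-language workflow, the toy function below maps the example request to a PromQL string using keyword matching, where a real system would use an LLM. The metric name `node_cpu_utilization` and the `job="web"` label are hypothetical placeholders, not a standard exporter’s names.

```python
def translate(request: str) -> str:
    """Toy natural-language-to-PromQL translator (keyword matching, not an LLM)."""
    text = request.lower()
    if "cpu" in text and "week" in text:
        # Average CPU utilization for the (hypothetical) web-server job
        # over the last 7 days, using PromQL's avg_over_time range function.
        return 'avg_over_time(node_cpu_utilization{job="web"}[7d])'
    raise ValueError("request not understood")

print(translate("Show me the CPU utilization of our web servers over the past week"))
```

The point is the interface, not the lookup table: the user states intent in plain language, and the system handles the query syntax, label names and time ranges.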
This opens up the possibility for more people across the business to leverage observability data because you won’t need years of programming experience; you’ll just need to describe what you want in plain language. Non-technical roles like product managers, executives and business analysts could directly query and analyze system behavior, performance and reliability using conversational AI/ML assistants.
What’s Next?
The impact of AI/ML on observability is likely to be limited, at least in the near term, as current systems struggle to fully capture the nuance and context required for effective observability. However, AI/ML may have a complementary role to play when combined with human expertise and a holistic understanding of complex systems. The future of observability will require a balance of technological capabilities and human insight. That said, nothing would excite me more than to have my assumptions about AI/ML’s applicability to observability tested.