Open source monitoring and observability tools can be found in production all over the world – whether they’re being used by startups or entire enterprise development teams.
DevOps, ITOps, and other technical teams rely on tools like Prometheus, Grafana, OpenSearch, OpenTelemetry, Jaeger, Nagios, Zabbix, Graphite, InfluxDB, and others to monitor and troubleshoot their cloud environment. These are mission-critical tool sets – without them, technical teams would be blind to production issues directly impacting customer experience and revenue.
Of course, technical teams can also choose proprietary tools like Splunk, Datadog, or New Relic. To learn more about the popularity of open source vs proprietary observability stacks, we asked the DevOps community in Logz.io’s annual survey:
- 90% of respondents indicated they use open source capabilities in some capacity
- Over 75% reported that open source accounts for at least 25% of their systems
- Almost 40% responded that half or more of their tools are open source
- Roughly 20% cited that they rely on over 75% open source tooling
- 10% of respondents were 100% open source
This is no surprise. Like other arenas within the world of DevOps, open source software dominates the cloud monitoring landscape, though it's usually mixed with some proprietary technologies as well.
In this article, we’re going to break down the popularity of open source technologies versus their proprietary counterparts.
We’ll explore questions like: How does open source create tactical advantages for observability practitioners? And if those advantages are so compelling, why doesn’t everyone use open source monitoring tools all the time?
As we tackle these questions, we’ll identify a typical narrative of open source observability adoption and migration.
Open source adoption: Why open source for observability?
The intuitive explanation for the rise of open source monitoring adoption is the price tag: it’s free! However, according to the same DevOps Pulse survey, there are plenty of other contributing factors.
Let’s break these down.
Purpose-built to integrate with cloud-native environments
Open source tools like Prometheus, OpenTelemetry, and Fluentd are maintained by the same community that maintains Kubernetes – the Cloud-Native Computing Foundation (CNCF). For this reason, many Kubernetes users turn to the CNCF for their monitoring and observability needs.
As cloud-native technologies like Kubernetes add complexity to cloud environments, observability practitioners need tools that can break down and make sense of that complexity. And what better endorsement than the community that oversees the most influential cloud-native technologies?
This advantage is clearly reflected in the chart above – the most common reason observability practitioners choose open source is ‘ease of integration.’ Collecting telemetry data (logs, metrics, and traces) is the first – and sometimes most complex – step toward gaining observability. It’s no wonder ease of integration is so important to users.
Community-Driven Innovation
Open source code can be improved and reviewed by millions of developers. In addition to having the entire world as a source for innovation, many believe the transparency and accountability of open source yields higher quality code.
A bet on the open source community is oftentimes a good one – as the DevOps Pulse survey respondents indicate.
Lower Cost of Ownership
Since open source is free to download and use, it may be surprising that ‘Lower cost of ownership’ is not the number one reason developers prefer open source.
While open source has no upfront costs, it takes engineering resources to scale and maintain.
Maintenance requirements for open source observability deployments are usually tied to data volumes. Growing cloud workloads and data volumes can strain open source observability stacks, requiring more maintenance work such as implementing queuing, tuning for performance, upgrading software components, and other tasks we’ll get to later.
All that said, a solid 36% of open source users still say they adopt open source for reasons related to cost.
Avoid Vendor Lock-in
Observability vendor lock-in is uniquely difficult to break. This is largely due to significant onboarding investments.
Observability deployments can require hours (or days!) of configuration and installation to collect data, and additional hours to set up monitoring dashboards and alerts. After investing all that time, migrating to yet another observability stack – which often entails ripping and replacing everything you’ve built – can be a difficult pill to swallow.
Open source technologies eliminate this problem because they’re compatible with so many observability back-ends. Once you implement open source data collection (again, think Prometheus, OpenTelemetry, and Fluentd), you can use it to send data to all kinds of observability tools – unlike Datadog’s agent or Splunk’s agent, each of which works only with its own vendor’s platform.
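To make that concrete, here’s a minimal sketch using the OpenTelemetry Python SDK (the endpoint address and service names are placeholders, and it assumes the opentelemetry-sdk and opentelemetry-exporter-otlp packages are installed). The instrumentation code never changes; switching observability back-ends is just a matter of pointing the exporter at a different OTLP endpoint.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Any OTLP-compatible back-end (self-hosted Jaeger, a managed service, etc.)
# can receive this data; only the endpoint below changes when you switch.
exporter = OTLPSpanExporter(endpoint="http://my-otlp-backend:4317", insecure=True)  # placeholder address

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-service")  # hypothetical service name

with tracer.start_as_current_span("process-order"):
    pass  # application logic goes here; the span is exported via OTLP when it ends
```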
Existing familiarity
Sticking with what you know is often the path of least resistance. No new training, no new configurations, no new interfaces.
As open source popularity continues to soar, more and more will choose familiarity as a reason for sticking with open source monitoring tools.
The challenges of running your own open source observability stack
Due to the factors described above, open source deployments are hugely common among teams getting started with observability, especially for small cloud workloads, where open source is the cheap and easy option.
However, as cloud workloads grow, they begin to generate more telemetry data – which can strain open source observability deployments.
Increased load means more infrastructure and components (think multiple clusters, data queuing, data collection components), which can ultimately leave teams with large and burdensome data pipelines that require hours of engineering maintenance. In this case, open source maintenance translates to engineering costs.
Increased data can also drive up Mean Time to Resolution (MTTR) – as data volumes increase, it gets harder to find the specific data that can help troubleshoot an issue. Open source technologies also generally lag behind their proprietary counterparts when it comes to analytics that quickly surface the relevant information.
The relationship between time/effort/MTTR and data volumes can be summarized in this chart:
Let’s dig deeper into the open source maintenance tasks that grow more challenging as data volumes grow.
Infrastructure management and optimization
If you’re expecting more customers within the next five years, you can also expect larger cloud workloads and larger data volumes. Think big!
As your customers rely more heavily on your digital products and telemetry data volumes explode, you’ll need the infrastructure to support the load on your observability system. This requires continuous infrastructure provisioning and optimization to ensure a high performance data pipeline.
Scaling and data queuing
Telemetry data can be bursty – especially logs. During production incidents, log volumes can double or more, which can crash your open source observability stack precisely when you need your logs the most.
For this reason, you’ll need to implement a scalable architecture with data queuing technologies – like Kafka or RabbitMQ – to prevent data bursts from overwhelming your system. That’s more components to manage, upgrade, and troubleshoot.
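As a rough sketch of the pattern (assuming the kafka-python client and a Kafka broker at a placeholder address), log events are written to a buffer topic so downstream indexers, such as OpenSearch, can consume them at their own pace instead of being overwhelmed by a burst:

```python
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=["kafka-broker:9092"],  # placeholder broker address
    value_serializer=lambda event: json.dumps(event).encode("utf-8"),
    acks="all",  # wait for the full replica set before considering a record sent
)

def ship_log(event: dict) -> None:
    """Push a single structured log event onto the buffer topic."""
    producer.send("log-buffer", value=event)  # hypothetical topic name

ship_log({"level": "error", "service": "checkout", "message": "payment timeout"})
producer.flush()  # block until buffered records are actually delivered
```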
Software upgrading
As security vulnerabilities and bugs are found in your data pipeline components – and improvements are made to the software – you’ll need to make upgrades.
Sometimes, this is no problem. Upgrading a single Prometheus server is not time-consuming or difficult. However, upgrading a large OpenSearch cluster is both time-consuming and difficult – it usually entails setting up an entirely new cluster with upgraded OpenSearch pipeline components.
Like the other open source maintenance tasks, upgrading your software components becomes harder as your cloud workloads and data volumes grow.
Data parsing
Log parsing is a specialized, non-intuitive skill to learn. Yet it is essential for structuring your logs so they can be easily searched and visualized.
The most common parsing language is Grok, which you’ll need to implement in your log shippers (e.g., Filebeat or Fluentd) or processors (e.g., Logstash). As you ship more logs, you’ll need to implement parsing for all of that data to make it useful.
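For illustration only (the log format and field names here are hypothetical), the Python sketch below shows what a Grok pattern effectively does under the hood: it compiles down to a regular expression with named captures that turns an unstructured line into searchable fields.

```python
import re

# Equivalent in spirit to a Grok expression like:
# %{IP:client} %{WORD:method} %{URIPATH:path} %{NUMBER:status} %{NUMBER:duration}
ACCESS_LOG = re.compile(
    r"(?P<client>\d{1,3}(?:\.\d{1,3}){3}) "
    r"(?P<method>[A-Z]+) "
    r"(?P<path>\S+) "
    r"(?P<status>\d{3}) "
    r"(?P<duration>\d+(?:\.\d+)?)"
)

line = "203.0.113.7 GET /api/orders 500 0.342"  # made-up access log line
match = ACCESS_LOG.match(line)
if match:
    fields = match.groupdict()
    # {'client': '203.0.113.7', 'method': 'GET', 'path': '/api/orders',
    #  'status': '500', 'duration': '0.342'}
    print(fields)
```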
Monitoring the monitoring stack
When we’re running our own monitoring stack, we’re on the hook for making sure the deployment is available and performant. If we can’t quickly access our telemetry data, we’re blind to the health and performance issues impacting our customers – let alone able to diagnose them.
For this reason, open source observability users need to collect logs and metrics from the monitoring stack itself. This can be a significant task for large observability deployments – and what happens if the monitoring stack crashes? How will you diagnose the problem without the relevant data?
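In practice, a first step is often just polling the health endpoints the components already expose, ideally from somewhere outside the stack itself so the check survives a crash. A minimal sketch, assuming the requests library and placeholder hostnames:

```python
import requests

CHECKS = {
    # Prometheus exposes a simple liveness endpoint.
    "prometheus": "http://prometheus:9090/-/healthy",
    # OpenSearch reports overall cluster status (green/yellow/red).
    "opensearch": "http://opensearch:9200/_cluster/health",
}

def check_stack() -> dict:
    """Poll each component and report whether it responded healthily."""
    results = {}
    for name, url in CHECKS.items():
        try:
            resp = requests.get(url, timeout=5)
            results[name] = "up" if resp.ok else f"unhealthy ({resp.status_code})"
        except requests.RequestException as exc:
            results[name] = f"down ({exc.__class__.__name__})"
    return results

if __name__ == "__main__":
    print(check_stack())  # e.g. {'prometheus': 'up', 'opensearch': 'up'}
```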
Performance tuning
Nobody likes slow queries – especially when the cause is unclear. There can be hundreds of reasons why your open source observability stack is slow to return results – maybe the infrastructure needs to be optimized, maybe the observability front-end and back-end connection broke, or maybe there is a bug in the pipeline.
It’s up to the open source user to diagnose the issue (probably using their logs or metrics!) and solve the problem.
Security
Observability data can contain sensitive information, but open source technologies don’t ship with the security capabilities many companies need. Adding SSO, encrypting data, and patching components are all doable – the question is, do you have the time to do it?
The Open Source Observability Adoption and Migration Curve
The challenges of adopting open source observability tools are not insurmountable. Companies like Netflix, Google, and Uber run their own open source observability stacks. But these companies have vast engineering resources that can handle the load and maintenance caused by massive data volumes.
Open source observability adoption and scaling is a matter of opportunity cost. Do you have the engineering resources needed to run your own open source deployments? Or would you rather they be doing something else?
For some, the cost of dedicating engineering resources to managing and scaling open source deployments exceeds an acceptable threshold, which we can visualize below:
In this scenario, it follows that the team would migrate to an observability vendor that offloads the burden of managing the data pipeline.
Unfortunately, the decision to migrate off of your own open source observability stack can be painful if you’re moving to a proprietary vendor. In most cases, all of those hours spent implementing data collection, dashboards, and alerts go out the window – along with the data analysis skills your team built with the open source interfaces.
That’s why we built Logz.io. It unifies the most familiar open source observability technologies on a cloud-native SaaS platform, so teams can continue using the open source tools they already know without having to manage them. This means the migration is simple – your team keeps the data collection, dashboards, alerts, and skill sets they already have in place.
And it’s not just managed open source – check out this page to learn about all the other capabilities Logz.io builds on top of the leading open source observability tools to make observability easier, more cost-efficient, and faster.