Why Observability Matters in System Design

1. Introduction

In this post, I want to talk about observability and how to design a system with observability. Observability just means watching and understanding internal state of a system through external outputs. But what exactly are we watching when it comes to systems or microservices? We’ll dig into that. In one of my previous posts, I have talked about observability by collecting metrics.

Often, we build a system backward. We start with an idea and make a product for people to use. We hope everything works smoothly, but it rarely does. There are always problems, challenges, and the product might not work as users expect. Users might say your product isn’t working, get frustrated, and leave. In the end, it just becomes another not-so-great product.

When users report problems, an engineer has to figure out why it happened. Having logs that show what went wrong can be really helpful. But in these situations, we’re reacting—the system is already doing something, and we’re waiting for it to mess up. If someone else has to understand our system without knowing all the technical stuff, it’s tough for them to figure out what’s happening. They might have to use the product to really get what’s going on. Is there a better way? Can we design our system so anyone can understand how it’s doing without diving into all the complicated details?

How do we design a system that can allow us to observe it? Can we have a mechanism to alert an issue before a customer comes to us? How can observability help us here?

2. Components of Observability

There are 4 components to Observability:

  • Log Aggregation
  • Metrics Collection
  • Distributed Tracing
  • Alerting

Each of these 4 components of observability is a building block for reliable system design.

Anyhow, let’s look at them.

2.1 Log Aggregation

Whether you are using Microservices or Monoliths for your system, there is some communication between frontend and backend OR backend to backend. The data flows from one system to other following some business logic. Logs are the key to help us understand how the production system is working.

While designing any system, we should form a convention on how we want to log the information. Logging right data should be a prerequisite to building any application. Every system design comes with its own set of challenges, log aggregation for the system is the least of the challenge. Having log aggregation will help in long term. How you implement log aggregation can depend on variety of factors like the tech stack, cloud provider. With modern tools, it has become easy to query aggregated logs.

When you start building your system, adopt a convention on how and what to log from your application. One caution to take into account is to not log any personal identifying information.

2.2 Metrics Collection

Whenever your system will start to mature, you will notice the trends of resources it has been using. You might face issues where you might have allocated less space or memory. Then on emergency basic, you will need to increase memory or space. Even in that scenario, you might have to make a judgement call by how much to increase.

This is where metrics collection comes into picture. If we build a system that can provides us metrics like latency, Memory/CPU usage, disk usage, read/write ops, it can help us for capacity planning. Also, it will provide more accurate information on how we want to scale if we start to see heavy load.

Metrics can also allow us to detect anomalies in our system. With monitoring tools available, it has become easier to collect metrics. Datadog, Splunk, New Relic are few of the observability tools.

Other than system metrics, one can also build product metrics using such tools for metrics collection. Product metrics can allow to see how your users have been using or not using your application, what does your product lack? It is one of the ways to gather feedback of your own product and improve on it.

2.3 Distributed Tracing

Overall, the microservices for various systems are getting complex.  All microservices within a single system talk to each other and process some data. It would be nice if there was a way to trace each call that comes from the client to these services. That’s where distributed tracing comes into picture.

Distributed tracing is capturing activity within a local thread using span. Eventually, collecting all these spans at one central place. All these spans are linked to each other through a common correlation id OR trace id.

The advantage of distributed tracing is to be able to see any issue or anomaly of the microservice. This allows us to trace an end-to-end call.

2.4 Alerting

At the beginning of this post, I mentioned how users of our product can get frustrated and bring to our attention various issues. In many such cases, we are reactive. Once the user reports an issue, we investigate why it happened and how it can not happen again.

With distributed tracing and metrics, can we create alerts to inform us about any issues in our application? Yes, that’s where the alerting comes into picture. We can also collect metrics related to errors OR gather errors data through distributed tracing and logs.

We also want to be careful where we don’t want to alert too much that it becomes a noise. Having some threshold around errors helps. Anomaly alerts can be handy. At the end, it all depends on your application and all the observability data that you will gather.

3. System Design with Observability

We talked about the key components around observability. We are aware of designing a system with certain requirements and industry practices. How do we build this system with observability in mind? Setting certain service level agreements (SLA) and service level objectives (SLO) before building your application can direct engineers. Engineers can then build observability as they build the application instead of coming at it later. This will be more proactive approach. In many microservices, we can also integrate with opentelemetry. Opentelemetry is a vendor neutral observability framework.

4. Conclusion

In this post, I discussed monitoring, observability and how an engineer can think about these while building a system. To be more proactive with these practices, system design should take observability into account.