<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en"><generator uri="https://jekyllrb.com/" version="4.4.1">Jekyll</generator><link href="https://mkaz.me/feed.xml" rel="self" type="application/atom+xml"/><link href="https://mkaz.me/" rel="alternate" type="text/html" hreflang="en"/><updated>2026-01-14T12:40:00+00:00</updated><id>https://mkaz.me/feed.xml</id><title type="html">blank</title><subtitle>Software developer, passionate about observability of distributed and transactional systems </subtitle><entry><title type="html">What makes Don’t Make Me Think timeless?</title><link href="https://mkaz.me/blog/2026/what_makes_dont_make_me_think_timeless/" rel="alternate" type="text/html" title="What makes Don’t Make Me Think timeless?"/><published>2026-01-12T18:30:00+00:00</published><updated>2026-01-12T18:30:00+00:00</updated><id>https://mkaz.me/blog/2026/what_makes_dont_make_me_think_timeless</id><content type="html" xml:base="https://mkaz.me/blog/2026/what_makes_dont_make_me_think_timeless/"><![CDATA[<p><img src="/assets/img/2026-01-12-what-makes-dont-make-me-think-timeless/IMG_1819.jpeg" alt="copy of Don't Make Me Think" style="max-width: 100%"/></p> <p>Recently, I’ve been re-reading books that I enjoyed in the past. Often the second read is better: you notice patterns and principles you missed the first time.</p> <p>In doing so, I was struck by how much wisdom can be found in relatively old texts. Most books don’t age well, though. This is especially true in the tech field, where very few books stand the test of time.</p> <p><em>Don’t Make Me Think</em> by Steve Krug is an exception. 
Originally published in 2000 (!) and revised in 2014, it remains a great reference for anyone interested in building user interfaces.</p> <p>Below are five core usability principles from the book and why they remain relevant.</p> <p><br/></p> <h2 id="five-principles-of-web-usability">Five Principles of Web Usability</h2> <p><br/></p> <h3 id="1-users-dont-read-they-scan">1. Users don’t read; they scan.</h3> <p>Users are constantly in a hurry. As a result, they don’t take the time to read websites; they scan them until something catches their attention or a link looks like it might lead to what they are looking for. This isn’t an optimal strategy. That’s <a href="https://en.wikipedia.org/wiki/Satisficing" target="_blank">satisficing</a>: choosing a “good enough” option quickly rather than searching for an optimal one. This behaviour lies at the core of human decision making.</p> <p>Website creators who spend hours polishing their refined copy or organizing long lists of links on the home page are bound to be disappointed. It’s far more effective to focus on the aspects of a site that make it <strong>easy to skim</strong>. The book shows how!</p> <p><br/></p> <h3 id="2-eliminate-question-marks">2. Eliminate question marks.</h3> <p>If I were to pick one quote from the book, it would be:</p> <blockquote> <p>The most important thing you can do is to understand the basic principle of eliminating question marks. When you do, you’ll begin to notice all the things that make you think in the sites and apps you use. And eventually you’ll learn to recognize and avoid them in the things you’re building.</p> </blockquote> <p>Question marks are all the elements of an interface that <strong>make us think</strong>.</p> <blockquote> <p><em>Can I click it? Is this a link? Is this a form field? What’s the scope of this search? 
How do I start over?</em></p> </blockquote> <p>Any such question that appears in a user’s mind, even for a split second, causes confusion. And every confusion requires recovery work - very often trial and error. Accumulated confusion leads to frustration and a general lack of confidence in how to use the site. And, as a general rule, anything that requires a large cognitive investment is less likely to be used.</p> <p><br/></p> <h3 id="3-a-few-mindless-clicks-beat-one-hard-click">3. A few mindless clicks beat one hard click.</h3> <p>This might be counterintuitive at first. Why would clicking, say, three times to reach the desired page be better than a single click? Shouldn’t we optimize for as few clicks as possible? It turns out that more clicks are better if they are effortless and users are confident that they are on the right track. That beats a single click that requires more effort.</p> <p>It’s connected with the second principle. Very often, eliminating question marks boils down to removing unnecessary UI elements. This reduces the noise and accentuates the things that are essential on the page. Leaving users with fewer but more relevant choices makes the site <strong>easier to use</strong>.</p> <p>Of course, too many clicks might become frustrating. The author provides the following formula:</p> <blockquote> <p>(…) three mindless, unambiguous clicks equal one click that requires thought.</p> </blockquote> <p><br/></p> <h3 id="4-most-arguments-about-web-usability-are-a-waste-of-time">4. Most arguments about web usability are a waste of time.</h3> <p>If you ever find yourself discussing drop downs versus radio buttons, stop. It’s a dead end. Discussions about usability fall into the same category as discussions about religion or politics: the ultimate truth can’t be proven. They can be very draining and needlessly erode respect between the parties involved. 
Also, they seldom result in anyone changing their opinion (even when someone appears to concede!).</p> <p>A better use of time is <strong>usability testing</strong>. It replaces opinions with evidence from real users using the website and often shifts the conversation by exposing larger, previously unseen problems.</p> <p>The book provides a simplified framework for conducting usability testing. For a more in-depth reference, check out Steve Krug’s book dedicated to this topic: <em>Rocket Surgery Made Easy</em>.</p> <p><br/></p> <h3 id="5-watch-real-people-use-what-you-build">5. Watch real people use what you build.</h3> <p><img src="/assets/img/2026-01-12-what-makes-dont-make-me-think-timeless/usability_testing.jpg" alt="Comic illustrating usability testing" style="max-width: 100%"/></p> <p>Each time I observed someone using an interface I had built, their behaviour surprised me. A few times, my mental model of how a user would interact with the interface was extremely off. I suppose spending hours on something completely deprives you of the ability to look at it from an end user’s perspective.</p> <p>This is where delegation becomes essential. Hire someone and give them a concrete task to accomplish on your site. <strong>Watch closely</strong> and note every small confusion or misinterpretation. That direct observation is the fastest path to meaningful improvements. Soon, you might become addicted to it as you realize that every such session improves usability in ways you wouldn’t think of on your own.</p> <p><br/></p> <h2 id="so-what-makes-dont-make-me-think-timeless">So, what makes <em>Don’t Make Me Think</em> timeless?</h2> <p>The reason the book remains relevant is its focus on <strong>human behaviour</strong>, rather than tools or trends.</p> <p>The author examined the deeply wired habits that determine how we interact with computers. 
This allowed him to distill the essence of what improves or degrades usability.</p> <p><br/> Chapter 12, about accessibility, ends with:</p> <blockquote> <p>When I wrote this chapter seven years ago, it ended with this:</p> <p>“Hopefully in five years I’ll be able to just remove this chapter and use the space for something else because the developer tools, browsers, screen readers, and guidelines will all have matured and will be integrated to the point where people can build accessible sites without thinking about it.”</p> <p>Sigh.</p> <p>Hopefully we’ll have better luck this time.</p> </blockquote> <p>If Steve were to revisit it now, twelve years after the last edit, he still couldn’t free up that space. Despite the development of great HTML and CSS libraries, accessibility is still tough and not a solved problem. But for the argument made here, it only proves that the book stands the test of time remarkably well.</p>]]></content><author><name></name></author><category term="Recaps"/><category term="Books"/><category term="Web"/><category term="Usability"/><summary type="html"><![CDATA[Most books in the tech field age fast. However, when I recently re-read Don’t Make Me Think by Steve Krug, I was struck by how little it had aged. So, what makes it still relevant?]]></summary></entry><entry><title type="html">SLO formulas implementation in PromQL step by step</title><link href="https://mkaz.me/blog/2024/slo-formulas-implementation-in-promql-step-by-step/" rel="alternate" type="text/html" title="SLO formulas implementation in PromQL step by step"/><published>2024-03-25T08:30:00+00:00</published><updated>2024-03-25T08:30:00+00:00</updated><id>https://mkaz.me/blog/2024/slo-formulas-implementation-in-promql-step-by-step</id><content type="html" xml:base="https://mkaz.me/blog/2024/slo-formulas-implementation-in-promql-step-by-step/"><![CDATA[<h2 id="theory">Theory</h2> <p>In engineering, perfection isn’t optimal. 
Even though a service that is continuously up and always responds with the expected latency<sup><a href="#footnotes">[1]</a></sup> might sound desirable, it would require prohibitively expensive resources and overly conservative practices. To manage expectations and prioritize engineering resources, it’s common to define and publish <strong>Service Level Objectives (SLO)</strong>.</p> <p>An <strong>SLO defines the tolerable ratio of measurements meeting the target value to all measurements recorded within a specified time interval.</strong></p> <p>To break it down, I believe that a properly formulated <strong>SLO</strong> consists of four segments:</p> <ol> <li>Definition of the metric (what’s being measured).</li> <li>Target value.</li> <li>Anticipated ratio of good (meeting the target) to all measurements, typically expressed in percentages.</li> <li>Time interval over which the SLO is considered.</li> </ol> <p>A simple <strong>availability SLO</strong> could be:</p> <ul> <li><code class="language-plaintext highlighter-rouge">a service is up 99.99% of the time over a week period</code> <ul> <li>the metric is <em>the service availability</em> (implicit)</li> <li>the target value is that it’s <em>up</em></li> <li>the anticipated good to all ratio is <em>99.99%</em></li> <li>the considered time interval is <em>week</em></li> </ul> </li> </ul> <p>An example of a <strong>latency SLO</strong> could be:</p> <ul> <li><code class="language-plaintext highlighter-rouge">the latency of 99% of requests is less than or equal to 250ms over a week period</code> <ul> <li>the metric is the <em>request latency</em></li> <li>the target value is <em>less than or equal to 250ms</em></li> <li>the anticipated good to all ratio is <em>99%</em></li> <li>the considered time interval is <em>week</em></li> </ul> </li> </ul> <p>Having an SLO defined, a service operator needs to deliver a calculation that reflects reality. 
The remainder of this article is a step-by-step guide on how to implement availability and latency SLO formulas with the Prometheus monitoring system.</p> <hr/> <p><br/></p> <h2 id="practice---slo-formulas-with-promql-step-by-step">Practice - SLO formulas with PromQL step by step</h2> <p>There are plenty of resources out there about SLO calculation. Moreover, there are off-the-shelf solutions for getting SLOs from metrics<sup><a href="#footnotes">[2]</a></sup>. <br/> OK then, why add one more resource on the topic? I want to present real-life examples of both availability and latency SLOs, as they are more nuanced than they may initially appear. Also, I find it worthwhile to share a detailed guide as it showcases uncommon uses of PromQL and demonstrates the language’s versatility.</p> <p><br/></p> <h3 id="formula-for-availability-slo">Formula for availability SLO</h3> <p>Let’s look again at our SLO:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>A service is up 99.99% of the time over a week period.
</code></pre></div></div> <p>To reason about a service’s availability we need a metric that indicates whether the service is up or down. Additionally, we need to sample this metric at regular time intervals. This allows us to divide the count of samples in which the service reported up by the count of all time intervals in the observed period. This gives us our SLO.</p> <p>Prometheus delivers the <code class="language-plaintext highlighter-rouge">up</code> metric (quite accurate naming) for every scrape target. It’s a gauge which reports <code class="language-plaintext highlighter-rouge">1</code> when the scrape succeeded and <code class="language-plaintext highlighter-rouge">0</code> if the scrape failed. Given its simplicity, it’s a very good proxy for the service’s availability (however, it’s still prone to false negatives and false positives).</p> <p>Let’s consider the following plot of an <code class="language-plaintext highlighter-rouge">up</code> metric:</p> <p><a href="/assets/img/2024-03-25-a-promql-query-for-slo-calculation-explained/up_my_service.png"> <img src="/assets/img/2024-03-25-a-promql-query-for-slo-calculation-explained/up_my_service.png" alt="plot for the up metric" style="max-width: 100%"/> </a></p> <p>This service was sampled every 15s. The time range is from 10:30:00 to 12:00:00 (1h30m inclusive - 361 samples). We can observe two periods when the service was down: <code class="language-plaintext highlighter-rouge">from 11:03:45 to 11:04:00</code> and <code class="language-plaintext highlighter-rouge">from 11:09:45 to 11:21:45</code>. This gives us <code class="language-plaintext highlighter-rouge">2 + 49 = 51</code> samples missing the target and <code class="language-plaintext highlighter-rouge">361 - 51 = 310</code> samples meeting the target. <br/> Therefore the SLO is <code class="language-plaintext highlighter-rouge">(310 / 361) * 100 = 85.87%</code>. How do we get this number with PromQL? 
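As a sanity check of the arithmetic above, here is a short sketch (illustrative Python only; the sample series is hypothetical, reconstructed from the plot):

```python
# Reconstruct the hypothetical `up` series: 361 samples at 15s resolution,
# starting at 10:30:00, with the two outages reported as 0.
samples = [1] * 361
for i in range(135, 137):  # 11:03:45 to 11:04:00 -> 2 bad samples
    samples[i] = 0
for i in range(159, 208):  # 11:09:45 to 11:21:45 -> 49 bad samples
    samples[i] = 0

# The mean of a 0/1 series is exactly the good-to-all ratio.
slo = sum(samples) / len(samples) * 100
print(round(slo, 2))  # -> 85.87
```

This is also why averaging works as a ratio: for a gauge that only reports 0 or 1, the mean equals the fraction of good samples.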
Intuitively, we could count good samples and all samples (with <code class="language-plaintext highlighter-rouge">count_over_time</code>), and then calculate the ratio. However, given that the <code class="language-plaintext highlighter-rouge">up</code> metric is a gauge reporting either <code class="language-plaintext highlighter-rouge">0</code> or <code class="language-plaintext highlighter-rouge">1</code>, we can use a trick and apply the <code class="language-plaintext highlighter-rouge">avg_over_time</code> function:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>avg_over_time(
    up{job="my_service"}[90m:]
) * 100
=&gt; 85.87257617728531
</code></pre></div></div> <p>For production readiness, I recommend using an <a target="_blank" href="https://prometheus.io/docs/prometheus/latest/querying/api/#instant-queries">instant query</a> and adjusting the range. Also, see the footnotes if you suffer from periodic metric absence<sup><a href="#footnotes">[3]</a></sup>.</p> <p><br/></p> <h3 id="formula-for-latency-slo">Formula for latency SLO</h3> <p>Let’s recite our latency SLO:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>The latency of 99% of requests is less than or equal to 250ms over a week period.
</code></pre></div></div> <p>For an accurate calculation, we need the count of requests that took 250ms or less and the count of all requests. It turns out that this is a rare luxury, though. It effectively means that you either have:</p> <ul> <li>the service instrumented upfront with a counter that increments when a request takes 250ms or less, along with another counter that increments for every request;</li> <li>a histogram metric with one of the buckets set to <code class="language-plaintext highlighter-rouge">0.25</code>, which amounts to the same thing, as a histogram is essentially a set of counters.</li> </ul> <p>Assuming you have the histogram metric with the <code class="language-plaintext highlighter-rouge">0.25</code> bucket, the query is:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>sum(
  increase(http_request_latency_seconds_bucket{job="my_service",le="0.25"}[7d])
) * 100 /
sum(
  increase(http_request_latency_seconds_bucket{job="my_service",le="+Inf"}[7d])
)
=&gt; 99.958703283089
</code></pre></div></div> <p>OK, but what if such counters aren’t available?</p> <p><br/></p> <h3 id="formula-for-latency-slo-with-percentiles">Formula for latency SLO with percentiles</h3> <p>Another approach to latency SLOs is setting expectations regarding percentiles, which is common in practice. It gives you the flexibility to test different target values without re-instrumentation. However, I suggest refraining from relying solely on percentiles in SLOs for several reasons:</p> <ul> <li>it’s an aggregate view, so it may hide certain characteristics of the service; a ratio of two counters is far more straightforward to understand;</li> <li>combining percentiles from different services can be tricky; for example, averaging percentiles is valid only for services having the same latency distribution (almost impossible) or by calculating a weighted average (also almost impossible);</li> <li>the SLO gets more complicated, making it difficult to reason about — especially for non-SRE people who are also users of the SLO.</li> </ul> <p>Nonetheless, an example of such an SLO is:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>The 99th percentile of request latency is lower than 400ms 99% of the time over a week period.
</code></pre></div></div> <p>It can be implemented in PromQL in three steps:</p> <ol> <li>Calculate the quantile: <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>histogram_quantile(
  0.99,
  sum by (le) (rate(http_server_request_duration_seconds_bucket[1m]))
)
</code></pre></div> </div> <p><a href="/assets/img/2024-03-25-a-promql-query-for-slo-calculation-explained/99th_quantile.png"> <img src="/assets/img/2024-03-25-a-promql-query-for-slo-calculation-explained/99th_quantile.png" alt="plot of the 99th percentile of request latency" style="max-width: 100%"/> </a></p> </li> <li>Perform a binary quantization to get a vector of <code class="language-plaintext highlighter-rouge">1</code> and <code class="language-plaintext highlighter-rouge">0</code> values indicating whether the target value is met. It’s a perfect input for our <code class="language-plaintext highlighter-rouge">avg_over_time</code> function. This is achieved with the <code class="language-plaintext highlighter-rouge">bool</code> modifier (<code class="language-plaintext highlighter-rouge">&lt;bool 0.4</code>). <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>histogram_quantile(
  0.99,
  sum by (le) (rate(http_server_request_duration_seconds_bucket[1m]))
) &lt;bool 0.4
</code></pre></div> </div> <p><a href="/assets/img/2024-03-25-a-promql-query-for-slo-calculation-explained/99th_quantile_with_binary_quantization.png"> <img src="/assets/img/2024-03-25-a-promql-query-for-slo-calculation-explained/99th_quantile_with_binary_quantization.png" alt="plot of the binary-quantized 99th percentile of request latency" style="max-width: 100%"/> </a></p> </li> <li>Calculate the percentage: <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>avg_over_time(
  (histogram_quantile(
    0.99,
    sum by (le) (rate(http_server_request_duration_seconds_bucket[1m]))
  ) &lt;bool 0.4)[7d:]
) * 100
=&gt; 96.66666666666661
</code></pre></div> </div> </li> </ol> <hr/> <p>When you finish prototyping with any of the above formulas, it’s good practice to define a recording rule for the main metric for efficient evaluation.</p> <hr/> <h2 id="footnotes">Footnotes</h2> <ol> <li>In this article, ‘latency’ refers to the time it takes for the service to generate a response. I clarify this because some resources use the term for the time a request spends waiting to be handled.</li> <li>On this note, I highly recommend getting familiar with <a target="_blank" href="https://github.com/pyrra-dev/pyrra/">Pyrra</a>.</li> <li>It’s a common issue that the DB misses certain samples when the monitoring is hosted on the same server as the service having problems. While rethinking the architecture would be ideal, a quick workaround is to aggregate away all labels from the <code class="language-plaintext highlighter-rouge">up</code> metric and <em>fill the gap</em> with <code class="language-plaintext highlighter-rouge">vector(0)</code>. Beware, the consequence is that the absence of the metric is treated as the service being down. <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>avg_over_time(
  (sum(up{job="my_service"}) or vector(0))[90m:]
) * 100
</code></pre></div> </div> </li> </ol> ]]></content><author><name></name></author><category term="monitoring"/><category term="Prometheus"/><category term="PromQL"/><category term="SLO"/><category term="SLI"/><category term="availability"/><category term="latency"/><summary type="html"><![CDATA[The service level terminology provides a framework for quantifying the quality of a service's reliability. There are plenty of resources available on SLI, SLO, and SLA, most of them theoretical. This article proposes PromQL implementations for availability and latency SLOs.]]></summary></entry><entry><title type="html">Self-hosted observability for Ruby on Rails apps with Kamal and OpenTelemetry</title><link href="https://mkaz.me/blog/2024/self-hosted-overvability-for-ruby-on-rails-apps-with-kamal-and-opentelemetry/" rel="alternate" type="text/html" title="Self-hosted observability for Ruby on Rails apps with Kamal and OpenTelemetry"/><published>2024-01-24T18:18:00+00:00</published><updated>2024-01-24T18:18:00+00:00</updated><id>https://mkaz.me/blog/2024/self-hosted-overvability-for-ruby-on-rails-apps-with-kamal-and-opentelemetry</id><content type="html" xml:base="https://mkaz.me/blog/2024/self-hosted-overvability-for-ruby-on-rails-apps-with-kamal-and-opentelemetry/"><![CDATA[<blockquote> <p>This blog post presents the <strong><a href="https://github.com/michal-kazmierczak/opentelemetry-rails-example" target="_blank">opentelemetry-rails-example</a></strong> repository — a demo of a fully instrumented Rails application using a self-hosted observability stack deployed with Kamal or Docker Compose. <br/> The post is periodically updated to reflect changes in the telemetry ecosystem, with the latest update in May 2025. 
For a detailed changelog of the latest updates, <a href="#major-updates">scroll to the bottom</a>.</p> </blockquote> <p><a href="https://opentelemetry.io/" target="_blank">OpenTelemetry</a> has done remarkable work on specifying integrated telemetry data as well as on delivering the tooling - <a href="https://opentelemetry.io/ecosystem/registry" target="_blank">instrumentation SDKs</a> for various languages and the <a href="https://github.com/open-telemetry/opentelemetry-collector" target="_blank">OpenTelemetry Collector</a> for data ingestion. However, by design, it doesn’t specify how the data is persisted and accessed. This area is mainly owned by vendors - cloud observability providers who compete in making use of the gathered data by providing all sorts of visualizations and alerting.</p> <p>Yet, application owners may consider an entirely in-house approach. There are many reasons why one may want to self-host their observability stack. The motivation could be avoiding a <a href="https://news.ycombinator.com/item?id=35837330" target="_blank">$65M bill</a> from the observability tooling provider, wanting full control over how and what data is collected, or simply a policy that disallows sending the data to a third-party provider.</p> <p>Regardless of the motivation, I believe it’s worthwhile to know what can be achieved using generally available software that can be self-hosted, as opposed to leveraging observability cloud providers.</p> <hr/> <p>I was fortunate to work on implementing observability in numerous Ruby (and Rails) apps from scratch. It involves connecting quite a few moving parts - from application instrumentation, through data ingestion and storage, to building dashboards and alerts - and it can be a struggle. Reflecting on that experience, I distilled a minimal stack that gives very powerful insights into a system’s performance. 
The stack can be deployed with Kamal or as a Docker Compose cluster.</p> <p>For details, check out this repository: <strong class="text-nowrap">-&gt; <a href="https://github.com/michal-kazmierczak/opentelemetry-rails-example" target="_blank">opentelemetry-rails-example</a> &lt;-</strong></p> <p><a href="https://github.com/michal-kazmierczak/opentelemetry-rails-example/raw/main/docs/otel_rails.gif" target="_blank"> <img src="https://github.com/michal-kazmierczak/opentelemetry-rails-example/raw/main/docs/otel_rails.gif" alt="ruby on rails and opentelemetry" style="max-width: 100%"/> </a></p> <hr/> <p>The example stack covers three main layers of system observability:</p> <ul> <li><strong>Instrumentation</strong> of the app (using the OpenTelemetry SDK, but not only);</li> <li><strong>Collection, processing, and export</strong> of the observability data (handled by OpenTelemetry Collector and Vector);</li> <li><strong>Storage and access</strong> with Prometheus, Loki and Tempo for the storage and Grafana for seamless navigation between logs, metrics and traces.</li> </ul> <p><a href="https://raw.githubusercontent.com/michal-kazmierczak/opentelemetry-rails-example/main/docs/rails_observability.drawio.png" target="_blank"> <img src="https://raw.githubusercontent.com/michal-kazmierczak/opentelemetry-rails-example/main/docs/rails_observability.drawio.png" alt="ruby on rails and opentelemetry" style="max-width: 100%"/> </a></p> <hr/> <p>Of course, it all comes with costs. Production deployments may need tuning and sampling to limit the infrastructure burden. Business-wise, with an in-house stack, the luxury of logs, metrics, and traces doesn’t have to mean an enormous bill from a third-party observability provider.</p> <p>By adopting the in-house approach, development teams can benefit from integrated instrumentation while gaining a better understanding of their data and better control over the costs. 
I believe it’s worth the effort - debugging a system with traversable logs, metrics, and traces is mentally easier and more efficient, especially when time is critical.</p> <hr/> <h2 id="major-updates">Major updates</h2> <h3>May 2025</h3> <p>Roughly a year after the publication of this project, I’ve gathered feedback received via e-mail and CNCF Slack (thank you!) and refreshed the stack to better serve a broader audience. Most notable updates:</p> <ul> <li>Added a Kamal deployment</li> <li>Improved portability through wider adoption of the OpenTelemetry Collector for metrics and traces, and replacement of Promtail with Vector for log collection</li> <li>Replaced Sidekiq with SolidJob and, more importantly, moved instrumentation to ActiveJob to support any background processing backend</li> <li>Upgraded Rails to 8 and Ruby to 3.4</li> </ul>]]></content><author><name></name></author><category term="observability"/><category term="Ruby"/><category term="Ruby on Rails"/><category term="Kamal"/><category term="logs"/><category term="metrics"/><category term="traces"/><category term="OpenTelemetry"/><category term="OpenTelemetry Collector"/><category term="Vector"/><category term="Prometheus"/><category term="Loki"/><category term="Grafana"/><summary type="html"><![CDATA[Observability is becoming a standard. Cloud observability providers deliver high-end solutions for the storage and visualization of telemetry data. Yet, application owners may consider an entirely in-house approach. 
Here is how you can achieve it for a Ruby on Rails app.]]></summary></entry><entry><title type="html">Collecting Prometheus metrics from multi-process web servers, the Ruby case</title><link href="https://mkaz.me/blog/2023/collecting-metrics-from-multi-process-web-servers-the-ruby-case/" rel="alternate" type="text/html" title="Collecting Prometheus metrics from multi-process web servers, the Ruby case"/><published>2023-09-01T16:00:00+00:00</published><updated>2023-09-01T16:00:00+00:00</updated><id>https://mkaz.me/blog/2023/collecting-metrics-from-multi-process-web-servers-the-ruby-case</id><content type="html" xml:base="https://mkaz.me/blog/2023/collecting-metrics-from-multi-process-web-servers-the-ruby-case/"><![CDATA[<p>In Prometheus, metrics collection must follow concrete rules. For example, counters must either increase monotonically or reset to zero. Violating this rule results in collecting nonsensical data.</p> <p>This is a challenge with multi-process web servers (like Unicorn or Puma in Ruby, or Gunicorn in Python) where each scrape might reach a different instance of the app, each holding a local copy of the metric<sup><a href="#footnotes">[1]</a></sup>. These days, horizontal autoscaling and threaded web servers only increase the complexity of the problem. Typical solutions - implementing synchronization for scrapes or adding extra labels to initiate new time series for every instance of the app - can’t always be applied.</p> <p>In this article I describe <em>a rebellious</em> solution to the problem which <strong>combines <a href="https://github.com/statsd/statsd">StatsD</a> for metric collection and aggregation with <a href="https://prometheus.io/">Prometheus</a> for time series storage and data retrieval.</strong></p> <hr/> <h2 id="contents">Contents</h2> <ul> <li><a href="#1-the-problem">1. The problem</a></li> <li> <a href="#2-the-proposed-solution">2. The proposed solution</a> <ul> <li><a href="#21-in-theory">2.1. 
In theory</a></li> <li><a href="#22-in-practice">2.2. In practice</a></li> </ul> </li> <li><a href="#3-trade-offs">3. Trade-offs</a></li> <li><a href="#4-final-thoughts">4. Final thoughts</a></li> </ul> <hr/> <h2 id="1-the-problem">1. The problem</h2> <p>In an ideal world, the target that Prometheus scrapes is a long-living process that has the full picture of the instrumented app. Such a process may occasionally be restarted, causing the local registry of metrics to reset. In this world, everything works OK.</p> <p>However, it gets complicated when the target’s <code class="language-plaintext highlighter-rouge">/metrics</code> endpoint is served by a pool of processes (workers or pods). Then, in order to render a full picture of the app at a scrape, processes must have some means of synchronization or another way to gather the collected data from each other. In other words, client-side aggregation becomes a challenge.</p> <p><a href="/assets/img/2023-09-01-collecting-metrics-from-multi-process-web-servers-the-ruby-case/scraping_workers.png"> <img src="/assets/img/2023-09-01-collecting-metrics-from-multi-process-web-servers-the-ruby-case/scraping_workers.png" alt="Prometheus scraping a pool of worker processes" style="max-width: 100%"/> </a></p> <p>This is the reason why the <a href="https://github.com/prometheus/client_python" target="_blank">Python Prometheus client</a> uses <code class="language-plaintext highlighter-rouge">mmap</code> and the <a href="https://github.com/PromPHP/prometheus_client_php" target="_blank">PHP Prometheus client</a> recommends running Redis next to the app instance.<sup><a href="#footnotes">[2]</a></sup></p> <p><u>In Ruby space</u>, a lot has been said already. 
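To make the failure mode concrete, here is an illustrative sketch (plain Python, not real client code) of two worker processes keeping local counters while scrapes reach an arbitrary process:

```python
# Two workers, each holding its own local copy of a "requests_total" counter.
# (Illustration only: real workers are separate OS processes, not dict keys.)
worker_counters = {"worker_a": 0, "worker_b": 0}

def handle_request(worker):
    # Each process increments only its local copy.
    worker_counters[worker] += 1

def scrape(worker):
    # A /metrics scrape answered by a single process exposes only
    # that process's local value.
    return worker_counters[worker]

for _ in range(10):
    handle_request("worker_a")
for _ in range(3):
    handle_request("worker_b")

# Scrapes that alternate between processes see 10, 3, 10, ...
# The "counter" appears to decrease, which Prometheus can only
# interpret as a reset to zero - hence the nonsensical data.
print([scrape(w) for w in ("worker_a", "worker_b", "worker_a")])  # -> [10, 3, 10]
```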
The original GitHub Issue - <a href="https://github.com/prometheus/client_ruby/issues/9">Support pre-fork servers</a> - was opened on <code class="language-plaintext highlighter-rouge">Feb 8, 2015</code> and closed on <code class="language-plaintext highlighter-rouge">Jun 25, 2019</code>. The issue was resolved by the introduction of a new data store, <code class="language-plaintext highlighter-rouge">DirectFileStore</code> - a solution where each process gets a file into which it dumps its registry. Then, at a scrape, all files are read so that the data can be aggregated. <br/>I remember watching the thread closely. At that time I was fortunate to work on a Ruby web server running over 100 processes. Unfortunately, switching my (excessively) multi-process app to the new data store made scrapes very slow, eventually leading to timeouts. I wasn’t an outlier; to this day, half of the open issues in the <a href="https://github.com/prometheus/client_ruby">prometheus/client_ruby</a> gem relate to <code class="language-plaintext highlighter-rouge">DirectFileStore</code>. <br/>Also, the solution requires all the app instances to be able to access the same volume, which puts certain scenarios of horizontal autoscaling in question. <br/>I was forced to look for other solutions, even though I’m still watching issues related to <code class="language-plaintext highlighter-rouge">DirectFileStore</code> and I keep my fingers crossed.</p> <p>Another intuitive solution (which deserves a mention) is to <em>dynamically</em> add an extra label per metric in every app instance serving the <code class="language-plaintext highlighter-rouge">/metrics</code> endpoint. Such a label uniquely represents the metrics registry; it could be the process PID and/or the pod name. However, this is a shortsighted solution that soon leads to a blow-up of the total number of time series that Prometheus has to maintain. 
Adding a <em>volatile</em> label (effectively adding a new time series) is not the right tool for this problem, especially when you never intend to group by this label.</p> <h2 id="2-the-proposed-solution">2. The proposed solution</h2> <p>In a nutshell, the proposed solution is to delegate the metrics aggregation to StatsD, which then exports metrics in the Prometheus format. This is nothing novel - the bridge has already been implemented as <a href="https://github.com/prometheus/statsd_exporter">statsd_exporter</a>. It’s been serving me very well in various setups, hence the praise.</p> <h3 id="21-in-theory">2.1. In theory</h3> <p>The key difference between using the StatsD client and the Prometheus <u>client</u> is where the aggregation happens. The StatsD client sends UDP packets to the collector, which aggregates the received signals, while the Prometheus client aggregates metrics in the app’s runtime and then exposes them for scrapes (or sends them to the PushGateway).</p> <p><a href="/assets/img/2023-09-01-collecting-metrics-from-multi-process-web-servers-the-ruby-case/sending_statsd_signals.png"> <img src="/assets/img/2023-09-01-collecting-metrics-from-multi-process-web-servers-the-ruby-case/sending_statsd_signals.png" alt="app processes sending StatsD signals to the exporter" style="max-width: 100%"/> </a></p> <h3 id="22-in-practice">2.2. In practice</h3> <p>There are very few changes needed for this solution.</p> <ul> <li> <p>deploy the <code class="language-plaintext highlighter-rouge">statsd_exporter</code>; a simple <code class="language-plaintext highlighter-rouge">docker-compose</code> based deployment could include:</p> <div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">statsd-exporter</span><span class="pi">:</span>
  <span class="na">image</span><span class="pi">:</span> <span class="s">prom/statsd-exporter:v0.24.0</span> <span class="c1"># check for the latest</span>
  <span class="na">ports</span><span class="pi">:</span>
    <span class="pi">-</span> <span class="s">9102:9102</span>
    <span class="pi">-</span> <span class="s">9125:9125/udp</span>
</code></pre></div> </div> <p>In the future, you might want to add custom configurations, like specific quantiles or label mappings. For more details, check out the <a href="https://github.com/prometheus/statsd_exporter#metric-mapping-and-configuration">Metric Mapping and Configuration</a> section.</p> </li> <li>use the <a href="https://github.com/Shopify/statsd-instrument">statsd-instrument</a> gem in the app; <ul> <li>make sure that the required env vars are provided: <ul> <li><code class="language-plaintext highlighter-rouge">STATSD_ADDR</code> should be set to <code class="language-plaintext highlighter-rouge">statsd-exporter:9125</code></li> <li><code class="language-plaintext highlighter-rouge">STATSD_ENV</code> should be set to <code class="language-plaintext highlighter-rouge">production</code>; if not provided, then StatsD will fall back to <code class="language-plaintext highlighter-rouge">RAILS_ENV</code> or <code class="language-plaintext highlighter-rouge">ENV</code></li> </ul> </li> <li>now, adding a counter to the codebase is as easy as: <div class="language-ruby highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="no">StatsD</span><span class="p">.</span><span class="nf">increment</span><span class="p">(</span><span class="s1">'rack_server_requests_total'</span><span class="p">,</span> <span class="ss">tags: </span><span class="p">{</span> <span class="o">...</span> <span class="p">})</span>
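# A hedged extra, not from the original post: the same gem can also time
# a block with StatsD.measure; statsd_exporter turns StatsD timers into
# Prometheus summaries (or histograms, depending on your mapping config)
StatsD.measure('rack_server_request_duration', tags: { ... }) do
  # handle the request here
end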
</code></pre></div> </div> </li> <li>alternatively, for a quick start check out my gem for Rack instrumentation: <strong><a href="https://github.com/michal-kazmierczak/statsd-rack-instrument">statsd-rack-instrument</a></strong></li> </ul> </li> <li> <p>point Prometheus to scrape the StatsD exporter, add to your <code class="language-plaintext highlighter-rouge">prometheus.yml</code>:</p> <div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">scrape_configs</span><span class="pi">:</span>
  <span class="pi">-</span> <span class="na">job_name</span><span class="pi">:</span> <span class="s">statsd</span>
    <span class="na">static_configs</span><span class="pi">:</span>
      <span class="pi">-</span> <span class="na">targets</span><span class="pi">:</span>
        <span class="pi">-</span> <span class="s">statsd-exporter:9102</span>
</code></pre></div> </div> </li> </ul> <p>That’s it! You should be able to see all metrics delivered via StatsD with this PromQL query:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>sum by (__name__) ({job="statsd"})
</code></pre></div></div> <h2 id="3-trade-offs">3. Trade-offs</h2> <p>The proposed solution doesn’t come without costs. As a thought experiment, I came up with three areas on which it can be criticized:</p> <blockquote> <p>It requires an additional service to be added to your fleet of services</p> </blockquote> <p>I agree this is a disadvantage. As a matter of fact, if you have a service of <code class="language-plaintext highlighter-rouge">0.99</code> reliability and you add another service of the same high reliability, then the reliability of your system drops to <code class="language-plaintext highlighter-rouge">0.99 * 0.99 = 0.9801</code>. Yet, I have never encountered any issue with the StatsD server. It has proven to be very reliable.</p> <hr/> <blockquote> <p>Incrementing a counter requires sending a UDP packet, in contrast to a fast in-memory update</p> </blockquote> <p>On the same note, let me pull a quote from the <code class="language-plaintext highlighter-rouge">"Prometheus: Up &amp; Running"</code> book<sup><a href="#footnotes">[3]</a></sup>:</p> <blockquote> <p>Performance is vital for client libraries. This excludes designs where work processes send UDP packets or any other use of networks, due to the system call overhead it would involve. What is needed is something that is about as fast as normal instrumentation, which means something that is as fast as local process memory but can be accessed by other processes.</p> </blockquote> <p>I fully understand the motivation. It probably holds true for most languages. Yet, at least for Ruby, experience has shown that client-side aggregation isn’t always the optimal solution.</p> <hr/> <blockquote> <p>Soon, you might have many applications sending metrics to a single StatsD server, making it a bottleneck</p> </blockquote> <p>This is a reminder to add monitoring of the StatsD server itself. 
That said, I think very few setups ever reach the scale at which it becomes a bottleneck.</p> <h2 id="4-final-thoughts">4. Final thoughts</h2> <p>Despite the recommendation from the <a href="https://github.com/prometheus/statsd_exporter#overview"><code class="language-plaintext highlighter-rouge">statsd_exporter</code></a> repository:</p> <blockquote> <p>We recommend using the exporter only as an intermediate solution, and switching to native Prometheus instrumentation in the long term.</p> </blockquote> <p>I don’t see the need to migrate away from the described solution. If I ever find and migrate to a more suitable solution, then I’ll describe it and update this post.</p> <p>Currently, the <a href="https://opentelemetry.io/">OpenTelemetry</a> project catches a lot of my attention. It offers an integrated approach for gathering an app’s signals (logs, metrics, traces). However, the <a href="https://github.com/open-telemetry/opentelemetry-collector">OpenTelemetry Collector</a> doesn’t feature server-side metric aggregation yet - currently it’s an <a href="https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/4968">open proposal</a>. Assuming it’s accepted and implemented one day, it may take a long time until it’s available in the client SDKs.</p> <hr/> <h2 id="footnotes">Footnotes</h2> <ol> <li>I know the problem of collecting metrics from multi-process web servers from my own experience, therefore I focus on it. 
However, I think that collecting metrics from short-lived executions (such as GCP functions or AWS Lambdas) is a sibling problem that might be solved in the same manner.</li> <li>It is interesting how seemingly similar problems are approached differently in different languages.</li> <li><code class="language-plaintext highlighter-rouge">Prometheus: Up &amp; Running</code> by Brian Brazil, ISBN: 9781492034094, page 66.</li> </ol> <p>I have found <a href="https://github.com/zapier/prom-aggregation-gateway">prom-aggregation-gateway</a>, which is also a method for delegating aggregation while still using the Prometheus client in the app. To be honest, I haven’t tried it but I feel obliged to mention it in the context of this post.</p> ]]></content><author><name></name></author><category term="Ruby"/><category term="monitoring"/><category term="Prometheus"/><category term="StatsD"/><category term="metrics"/><category term="Rack"/><category term="Puma"/><summary type="html"><![CDATA[In Prometheus, metrics collection must follow concrete rules. It is a challenge with multi-process web servers where each scrape might reach a different instance of the app which holds a local copy of the metric. In this article I describe a rebellious solution which combines StatsD with Prometheus.]]></summary></entry><entry><title type="html">Simple Prometheus queries for metrics inspection</title><link href="https://mkaz.me/blog/2023/simple-prometheus-queries-for-metrics-inspection/" rel="alternate" type="text/html" title="Simple Prometheus queries for metrics inspection"/><published>2023-08-13T18:18:00+00:00</published><updated>2023-08-13T18:18:00+00:00</updated><id>https://mkaz.me/blog/2023/simple-prometheus-queries-for-metrics-inspection</id><content type="html" xml:base="https://mkaz.me/blog/2023/simple-prometheus-queries-for-metrics-inspection/"><![CDATA[<p><a href="https://www.robustperception.io/cardinality-is-key/">Cardinality is key</a>. 
And it’s easy to get it out of control, as it is with any instance of the <a href="https://en.wikipedia.org/wiki/Combinatorial_explosion#Computing">combinatorial explosion</a>.</p> <p>This, combined with the claim that 90% of metrics are never accessed<sup><a href="#footnotes">[1]</a></sup>, creates an area worth exploring.</p> <p>Observability cloud vendors already provide tools for inspecting unused data and eventually reducing the cost<sup><a href="#footnotes">[2]</a></sup>. But how do you get a sense of your metrics when you don’t have access to such tools (i.e. when you run your own Prometheus server)?</p> <p>This article proposes a set of simple queries that help detect heavy metrics. To illustrate the results, there’s also a proposed Grafana dashboard. A sneak peek of the dashboard is presented in the screenshot below.</p> <p><a href="/assets/img/2023-08-13-simple-prometheus-queries-for-metrics-inspection/simple_prometheus_queries_dashboard_20241013.png"> <img src="/assets/img/2023-08-13-simple-prometheus-queries-for-metrics-inspection/simple_prometheus_queries_dashboard_20241013.png" alt="simple prometheus queries dashboard" style="max-width: 100%"/> </a></p> <hr/> <h2 id="contents">Contents</h2> <ul> <li><a href="#1-a-short-theory-intro">1. A short theory intro</a></li> <li> <a href="#2-queries">2. Queries</a> <ul> <li><a href="#21-the-count-of-all-series">2.1. The count of all series</a></li> <li><a href="#22-the-count-of-all-metrics">2.2. The count of all metrics</a></li> <li><a href="#23-the-count-of-all-jobs">2.3. The count of all jobs</a></li> <li><a href="#24-the-count-of-series-per-metric">2.4. The count of series per metric</a></li> <li><a href="#25-the-count-of-series-per-job">2.5. The count of series per job</a></li> </ul> </li> <li><a href="#3-grafana-dashboard">3. Grafana dashboard</a></li> </ul> <hr/> <h2 id="1-a-short-theory-intro">1. 
A short theory intro</h2> <p>In short, the cardinality of a label is the number of distinct values that were observed. A metric’s cardinality is the number of all observed combinations of labels’ values. In the worst case, it is the product of all labels’ cardinalities.</p> <p>A fine example is a metric counting HTTP requests with <code class="language-plaintext highlighter-rouge">path</code>, <code class="language-plaintext highlighter-rouge">method</code> and <code class="language-plaintext highlighter-rouge">response_code</code> labels. Let’s consider a scenario in which five paths are observed with three methods and three response codes. Then, the cardinality is <code class="language-plaintext highlighter-rouge">5 * 3 * 3 = 45</code>.</p> <p>If we decide to make <em>a subtle</em> change and turn this counter into a histogram (with 12 buckets), in a short time our metric may grow to a cardinality of <code class="language-plaintext highlighter-rouge">45 * 12 = 540</code>.</p> <h2 id="2-queries">2. Queries</h2> <p>Beware: it’s important to run these as an <em>instant query</em>, not a <em>range query</em>, as for the purpose of this article the last recorded value is enough - we are not interested in the change over time. Running a <em>range query</em> may be very slow.</p> <p>Without further ado, let’s dive into the details.</p> <p><a href="/assets/img/2023-08-13-simple-prometheus-queries-for-metrics-inspection/simple_prometheus_queries_dashboard_with_marks_20241013.png"> <img src="/assets/img/2023-08-13-simple-prometheus-queries-for-metrics-inspection/simple_prometheus_queries_dashboard_with_marks_20241013.png" alt="simple prometheus queries dashboard with metrics marks" style="max-width: 100%"/> </a></p> <h3 id="21-the-count-of-all-series">2.1. The count of all series</h3> <p>For starters, let’s pull the total number of series. 
That is the count of all unique label combinations (including the <code class="language-plaintext highlighter-rouge">__name__</code> label).</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>count({__name__!=""})
</code></pre></div></div> <p>This query relies on one simple rule: the Prometheus Query Language (PromQL) requires providing either a metric name or at least one label matcher. <code class="language-plaintext highlighter-rouge">__name__</code> is an internal label added to every metric with the value of the metric name. As a metric name cannot be empty, the <code class="language-plaintext highlighter-rouge">!=""</code> expression selects all the metrics.</p> <p>This powerful concept will be reused in further queries.</p> <h3 id="22-the-count-of-all-metrics">2.2. The count of all metrics</h3> <p>Now, let’s check how many metrics our Prometheus instance maintains.</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>count(count({__name__!=""}) by (__name__))
</code></pre></div></div> <p>Nothing out of the ordinary - it just groups the previous query by <code class="language-plaintext highlighter-rouge">__name__</code> and counts the result.</p> <h3 id="23-the-count-of-all-jobs">2.3. The count of all jobs</h3> <p>Similarly, let’s check the number of jobs that produce metrics.</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>count(count({__name__!=""}) by (job))
</code></pre></div></div> <p>It’s the same query as the previous one, just from a different angle - grouped by the <code class="language-plaintext highlighter-rouge">job</code> label.</p> <h3 id="24-the-count-of-series-per-metric">2.4. The count of series per metric</h3> <p>Now let’s zoom in a bit and see more detailed data.</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>sort_desc(
    count({__name__!=""}) by (__name__)
)
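
# A hedged variant, not in the original post: cap the output at the ten
# heaviest metrics instead of sorting the full list
topk(10, count({__name__!=""}) by (__name__))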
</code></pre></div></div> <p>This query outputs the number of series <ins>per metric</ins>. It is useful to spot metrics with high cardinality. With the <code class="language-plaintext highlighter-rouge">sort_desc</code> we can see the most interesting results at the top.</p> <h3 id="25-the-count-of-series-per-job">2.5. The count of series per job</h3> <p>Again similarly, let’s change the grouping to <code class="language-plaintext highlighter-rouge">job</code>.</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>sort_desc(
    count({__name__!=""}) by (job)
)
</code></pre></div></div> <p>The output shows the number of series <ins>per job</ins>. It may reveal that a certain job (very often representing a single scraping target) is producing disproportionately many series.</p> <h2 id="3-grafana-dashboard">3. Grafana dashboard</h2> <p>Staring at dashboards is amusing. Therefore I couldn’t resist building a dashboard based on the above queries. Besides numbers described above, it has more accompanying calculations, like the percentage that a metric shares out of all series. Check it out at <a href="https://grafana.com/grafana/dashboards/19341-prometheus-metrics-management/">https://grafana.com/grafana/dashboards/19341-prometheus-metrics-management/</a>.<sup><a href="#footnotes">[3]</a></sup></p> <hr/> <h2 id="footnotes">Footnotes</h2> <ol> <li><em>“Lightstep has studied customers especially for metrics (…) - one in ten metrics is ever queried for any purpose.”</em> Ben Sigelman, Lighstep CEO on OpenObservability Talks, <a href="https://www.youtube.com/live/gJhzwP-mZ2k?feature=share&amp;t=1902">https://www.youtube.com/live/gJhzwP-mZ2k?feature=share&amp;t=1902</a></li> <li>If you are a happy user of Grafana Cloud, check out <a href="https://grafana.com/docs/grafana-cloud/account-management/billing-and-usage/control-prometheus-metrics-usage/cardinality-management/">cardinality management dashboards</a></li> <li>There’s also Gist with the dashboard as a backup <a href="https://gist.github.com/michal-kazmierczak/1538bd8df46e4a1fbf9c859bfa045126" target="_blank">https://gist.github.com/michal-kazmierczak/1538bd8df46e4a1fbf9c859bfa045126</a>.</li> </ol> ]]></content><author><name></name></author><category term="monitoring"/><category term="Prometheus"/><category term="PromQL"/><category term="metrics"/><category term="labels"/><category term="cardinality"/><category term="Grafana"/><summary type="html"><![CDATA[Cardinality is key. And it’s easy to get it out of control. 
Check out a proposal on how to inspect your metrics with simple Prometheus queries and tune your instrumentation.]]></summary></entry></feed>