Build an SLO

This tutorial guides you through creating a Service Level Objective (SLO) in slaOS.

SLOs set specific, measurable targets for your service's performance based on SLIs. Learn how to configure SLOs to establish clear reliability goals for your service.

Event vs Interval based SLOs

SLOs can be categorized into two main types: Event-based and Interval-based. Understanding the difference is crucial for effective service reliability management.

Event-based SLOs

  • Definition: Measure the success rate of discrete, countable occurrences.

  • Example: "99.9% of API requests will be served successfully in a rolling 7d window"

  • Characteristics:

    • Based on individual events (such as requests or transactions)

    • Often used for request/response systems

    • Typically measured as a ratio of successful events to total events
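
As a rough illustration of the ratio described above, the following Python sketch computes event-based compliance from a list of per-request outcomes. The function name and the sample data are hypothetical and are not part of slaOS.

```python
def event_compliance(outcomes):
    """Return good events / total events, or None when there are no events."""
    total = len(outcomes)
    if total == 0:
        return None  # nothing to measure in this window
    good = sum(1 for ok in outcomes if ok)
    return good / total


# Hypothetical window: 9,995 successful requests and 5 failures.
outcomes = [True] * 9995 + [False] * 5
ratio = event_compliance(outcomes)
meets_target = ratio >= 0.999  # target: 99.9% of requests succeed
```

Every request carries equal weight here, so a short burst of failures during peak traffic costs more than the same burst during a quiet period.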

Interval-based SLOs

  • Definition: Measure the proportion of time a service meets a specific criterion.

  • Example: "99% of all 10-minute intervals in a month will have latency less than 100ms"

  • Characteristics:

    • Based on continuous monitoring over time periods

    • Often used for availability or uptime metrics

    • Measured as a percentage of time the service meets the defined criteria
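
The interval-based calculation can be sketched in Python as follows. This is a minimal illustration, not how slaOS computes it: the function name and sample data are hypothetical, the mean is assumed as the per-interval aggregation, and intervals without any samples are simply left out.

```python
from collections import defaultdict
from statistics import mean


def interval_compliance(samples, interval_seconds, threshold_ms):
    """Return the fraction of non-empty intervals whose mean latency is below threshold_ms.

    samples is an iterable of (unix_timestamp_seconds, latency_ms) pairs.
    """
    buckets = defaultdict(list)
    for timestamp, latency_ms in samples:
        # Floor division assigns each sample to a fixed-size interval.
        buckets[int(timestamp // interval_seconds)].append(latency_ms)
    if not buckets:
        return None
    good = sum(1 for values in buckets.values() if mean(values) < threshold_ms)
    return good / len(buckets)


# Hypothetical samples spread over four 10-minute intervals.
samples = [(0, 50), (30, 70), (600, 150), (1200, 80), (1800, 90)]
compliance = interval_compliance(samples, interval_seconds=600, threshold_ms=100)
```

In this sample, three of the four intervals stay under 100ms, so compliance is 0.75 regardless of how many requests each interval contains.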

Select the SLO type that best fits your service and reliability goals. The choice between event-based and interval-based SLOs is both a technical and a business decision. We recommend that tech and business stakeholders evaluate both options together and test them in the slaOS UI before finalizing.

To learn more about event-based and interval-based SLOs, we highly recommend reading Google Cloud's Observability Handbook.

Best practice for raw data points (not pre-aggregated)

These are individual data points like status codes, latencies, or timestamps that haven't been aggregated before ingestion into slaOS.

For Value SLIs (e.g., latency):

  • Recommendation: Use event-based SLOs

  • Rationale: Each latency measurement is a discrete event. Event-based SLOs allow you to set objectives like "99% of requests should have a latency < 100ms."

For Percentage SLIs (e.g., status codes for uptime):

  • Recommendation: Use interval-based SLOs

  • Rationale: Uptime is typically measured over time periods. Interval-based SLOs allow objectives like "The service should be up 99.9% of the time over a month."

Best practice for pre-aggregated metrics

These are metrics that have been aggregated before ingestion, such as average latency or hourly uptime.

  1. For Value SLIs (e.g., average latency):

    • Both event-based and interval-based can work, depending on your business goals

    • Event-based example: "99% of hourly average latencies should be < 100ms"

    • Interval-based example: "In 99% of 10-minute intervals, the average latency should be < 100ms"

    • Choose based on whether you care more about overall performance (event-based) or consistent performance over time (interval-based)

  2. For Percentage SLIs (e.g., hourly uptime):

    • Again, both approaches can work

    • Event-based example: "The average of hourly uptime measurements over a calendar month should be ≥ 99.5%"

    • Interval-based example: "All daily (24hr intervals) average uptime measurements over a calendar month should be > 99.5%"

    • Choose based on whether you want to allow some fluctuation (event-based) or ensure consistent performance every day (interval-based)
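
To make the difference between the two percentage examples concrete, here is a small Python sketch with made-up hourly uptime values for two days. The data and variable names are hypothetical; the thresholds follow the examples above.

```python
from statistics import mean

# Hypothetical hourly uptime percentages: one perfect day, then a day with one bad hour.
hourly_uptime = [100.0] * 24 + [100.0] * 23 + [80.0]

# Event-based reading: the average of all hourly measurements must be >= 99.5%.
event_based_ok = mean(hourly_uptime) >= 99.5

# Interval-based reading: every daily (24h) average must be > 99.5%.
days = [hourly_uptime[i:i + 24] for i in range(0, len(hourly_uptime), 24)]
interval_based_ok = all(mean(day) > 99.5 for day in days)
```

With this data the overall average is about 99.58%, so the event-based objective is met, while the second day averages about 99.17% and fails the interval-based objective.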

Need help defining your SLOs? Contact us at hello@rated.co for consultation.

Create an Event based SLO

Follow these steps to create an Event Service Level Objective (SLO):

  1. Find and click on "Objectives" in the side navigation bar

  2. Click the "+ New SLO" button to open the Create SLO modal

  3. Click on the dropdown to select from your list of active SLIs

  4. Choose "Event" as the type of SLO you'll be building

  5. Depending on the type of SLI you’ve chosen:

    1. If your SLI is a value SLI, you will need to set a benchmark against which each “event” in the SLI will be compared. To set a benchmark, select an operator and the benchmark value

      1. Allowed operators: >, <, >=, <=, =

      2. Allowed benchmark types: numeric, boolean

    2. If your SLI is a percentage SLI, you will need to set an aggregator for your SLI, which will be applied to all events within a period.

      1. Allowed aggregators: AVG, MIN, MAX, COUNT, SUM, PERCENTILE

  6. Select the compliance period and target. Click “Continue”.

  7. A name and description will be auto filled for your SLO based on your configuration

  8. Click the "Save" button to create your new event SLO
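
The value-SLI path of the steps above (operator, benchmark, target) can be sketched in Python. This is only an illustration of the logic under stated assumptions: the function and dictionary names are hypothetical, and the aggregator path for percentage SLIs is not covered.

```python
import operator

# The five comparison operators listed in the steps above.
OPERATORS = {
    ">": operator.gt,
    "<": operator.lt,
    ">=": operator.ge,
    "<=": operator.le,
    "=": operator.eq,
}


def event_slo_met(values, op, benchmark, target):
    """Check whether the share of events passing `value op benchmark` reaches target."""
    if not values:
        return None  # no events in the compliance period
    compare = OPERATORS[op]
    good = sum(1 for value in values if compare(value, benchmark))
    return good / len(values) >= target


# Hypothetical latencies in seconds: 9,995 fast requests and 5 slow ones.
latencies = [0.2] * 9995 + [1.5] * 5
result = event_slo_met(latencies, "<", 1.0, target=0.999)
```

Here 99.95% of events pass the benchmark, so a 99.9% target is met.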

Examples

For the example implementation, we’ll create an Event SLO with a Latency SLI. This SLO specifies that, over a calendar month, 99.9% of successful requests must have a latency under 1 second.

  1. Find and click on "Objectives" in the side navigation bar

  2. Click the "+ New SLO" button to open the Create SLO modal

  3. Select Latency as the SLI and choose Event SLO

  4. Set the operator as < and benchmark as 1 second

  5. Set the time window type as Calendar, period as Monthly and Service target as ≥ 99.9%

  6. Via the autofill feature, the SLO will get the name "Latency above 99.9%" and the description "Event based calculation over a calendar period of 1 month"

  7. Click "Save"

Create an Interval based SLO

Follow these steps to create an Interval Service Level Objective (SLO):

  1. Find and click on "Objectives" in the side navigation bar

  2. Click the "+ New SLO" button to open the Create SLO modal

  3. Click on the dropdown to select from your list of active SLIs

  4. Choose "Interval" as the type of SLO you'll be building

  5. Configure your interval behavior. This includes:

    1. The size of each interval. You can choose a minimum size of 1min and a maximum size of 24h.

    2. How to treat intervals in which no events were received. This can happen when your service is undergoing planned maintenance or during off-peak hours. You can treat those intervals as GOOD (service was good), BAD (service was down or misbehaving), or EXCLUDE (the interval is left out of the calculation because there was nothing to evaluate).

    3. The aggregation function that you'll want to apply on the interval

      1. Allowed aggregators: AVG, MIN, MAX, COUNT, SUM, PERCENTILE

    4. The benchmark each interval's aggregated result will be compared against

      1. Allowed operators: >, <, >=, <=, =

      2. Allowed benchmark types: numeric, boolean

  6. Select the compliance period and target. Click “Continue”.

  7. A name and description will be auto filled for your SLO based on your configuration

  8. Click the "Save" button to create your new interval SLO
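
The interval configuration above (interval size, empty-interval behavior, aggregator, benchmark) can be sketched in Python as follows. All names are hypothetical and this is not the slaOS implementation; PERCENTILE is omitted from the aggregator map to keep the sketch short.

```python
from statistics import mean

AGGREGATORS = {"AVG": mean, "MIN": min, "MAX": max, "COUNT": len, "SUM": sum}


def interval_slo_ratio(samples, window_start, window_end, interval_seconds,
                       aggregator, is_good, empty_behavior="EXCLUDE"):
    """Return the fraction of good intervals in [window_start, window_end).

    samples: iterable of (unix_timestamp_seconds, value) pairs.
    is_good: callable applied to each interval's aggregated value.
    empty_behavior: "GOOD", "BAD" or "EXCLUDE" for intervals without samples.
    """
    n_intervals = int((window_end - window_start) // interval_seconds)
    buckets = [[] for _ in range(n_intervals)]
    for timestamp, value in samples:
        index = int((timestamp - window_start) // interval_seconds)
        if 0 <= index < n_intervals:
            buckets[index].append(value)

    good = counted = 0
    for values in buckets:
        if not values:
            if empty_behavior == "EXCLUDE":
                continue  # interval is left out of the calculation
            counted += 1
            good += empty_behavior == "GOOD"
            continue
        counted += 1
        good += bool(is_good(AGGREGATORS[aggregator](values)))
    return good / counted if counted else None


# Hypothetical 40-minute window with data in the first and third 10-minute intervals.
samples = [(0, 0.4), (60, 0.6), (1200, 1.8)]
ratios = {
    mode: interval_slo_ratio(samples, 0, 2400, 600, "AVG",
                             lambda v: v <= 1.0, empty_behavior=mode)
    for mode in ("GOOD", "BAD", "EXCLUDE")
}
```

The same data yields 0.75, 0.25 and 0.5 for GOOD, BAD and EXCLUDE respectively, which shows how much the empty-interval setting can move the result for low-traffic services.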

Examples

For the example implementation, we’ll create an Interval SLO with a Latency SLI. This SLO specifies that, over a rolling 28d window, 99.5% of all 10 min intervals will have p99 latency less than or equal to 1 second.

  • Find and click on "Objectives" in the side navigation bar

  • Click the "+ New SLO" button to open the Create SLO modal

  • Select Latency as the SLI and choose Interval SLO

  • Set the interval size as 10min and empty interval behavior as Exclude

  • Choose p99 as the aggregation function

  • Set the operator as <= and benchmark as 1 second

  • Set the time window type as Rolling, period as 28 days and Service target as ≥ 99.5%

  • Via the autofill feature, the SLO will get the name "Latency above 99.5%" and the description "Calculation every 10 min over a rolling period of 28 days"

  • Click "Save"
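
To get a feel for what this example SLO tolerates, the Python sketch below works out how many 10-minute intervals a 28-day window contains and how many of them may miss the benchmark. It also shows one way to compute a p99, using the nearest-rank method; which percentile method slaOS uses is not stated here, so treat that choice and all names as assumptions.

```python
import math


def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# 28 days of 10-minute intervals.
intervals_per_day = 24 * 60 // 10
total_intervals = 28 * intervals_per_day

# A 99.5% target leaves 0.5% of intervals as budget (integer math avoids rounding).
allowed_bad_intervals = total_intervals * 5 // 1000

# Hypothetical latencies (seconds) for one interval: 0.01, 0.02, ..., 1.00.
latencies = [i / 100 for i in range(1, 101)]
p99 = percentile_nearest_rank(latencies, 99)
interval_is_good = p99 <= 1.0
```

The window holds 4,032 intervals, so up to 20 of them can exceed the benchmark before the 99.5% target is missed. With empty intervals set to Exclude, the total (and therefore the budget) shrinks whenever intervals contain no data.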
