
Prometheus Integration Guide

This guide provides comprehensive instructions for integrating your Prometheus metrics with slaOS. This integration enables slaOS to collect and analyze metrics from your Prometheus instances, helping you establish and monitor Service Level Indicators (SLIs).

Prerequisites

Required Components

A running Prometheus instance
Your slaOS account credentials
(Optional) Access to configure authentication methods

Important Note: Currently, the integration requires an existing Prometheus server. If you only have applications exposing /metrics endpoints that need to be scraped, this is not yet supported but is coming soon!

Please contact our support team if this is your use case - we're happy to help find alternative solutions and work with you to ensure a smooth onboarding experience when this feature becomes available.

Integration Steps

Step 1: Authentication Setup

If your Prometheus instance requires authentication or runs with TLS enabled, you'll need to configure the appropriate authentication method. This step is crucial for securing access to your metrics while ensuring slaOS can reliably collect them.

Choose one of the following authentication methods based on your Prometheus setup:


Basic Authentication

prometheus:
  base_url: "http://prometheus:9090"
  auth:
    username: "admin"
    password: "secret"

Token Authentication

prometheus:
  base_url: "http://prometheus:9090"
  auth:
    token: "your-secret-token"

Certificate Authentication (mTLS)

prometheus:
  base_url: "https://prometheus:9090"
  auth:
    cert_path: "/path/to/client.crt"
    key_path: "/path/to/client.key"
    verify_ssl: true

Google Cloud

prometheus:
  base_url: "https://monitoring.googleapis.com/v1/projects/[PROJECT_ID]/location/global/prometheus"
  auth:
    gcloud_service_account_path: "/path/to/service-account.json"
    gcloud_target_principal: "prometheus-reader@[PROJECT_ID].iam.gserviceaccount.com"
    oauth_scopes:
      - "https://www.googleapis.com/auth/monitoring.read"
      - "https://www.googleapis.com/auth/cloud-platform"

Important setup steps:

1. Create a service account with these IAM roles:
- roles/monitoring.viewer
- roles/iam.serviceAccountTokenCreator
- roles/iam.serviceAccountUser
2. Generate and download the service account key file (JSON).
3. Replace:
- [PROJECT_ID] with your actual GCP project ID
- /path/to/service-account.json with the actual path to your downloaded key file
4. Ensure the service account has the required OAuth scopes enabled in your GCP project.

For more details, see the GitHub templates at prometheus/auth.md.
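Because only one authentication method applies at a time, a quick pre-flight check on the auth block can catch mixed-up settings before onboarding. The following is a minimal Python sketch: the key names are taken from the YAML examples in this step, while the grouping table, the function name detect_auth_method, and the error messages are illustrative assumptions and not slaOS's actual validation logic.

```python
# Key names come from the YAML examples in this guide. The grouping and the
# error handling are illustrative assumptions, not slaOS's own validation.
AUTH_METHODS = {
    "basic": {"username", "password"},
    "token": {"token"},
    "mtls": {"cert_path", "key_path"},
    "gcloud": {"gcloud_service_account_path"},
}


def detect_auth_method(auth: dict) -> str:
    """Return the single auth method present in `auth`, or "none" if empty."""
    configured = set(auth)
    present = [name for name, keys in AUTH_METHODS.items() if keys & configured]
    if len(present) > 1:
        raise ValueError(f"multiple auth methods configured: {present}")
    if not present:
        return "none"
    method = present[0]
    missing = AUTH_METHODS[method] - configured
    if missing:
        raise ValueError(f"{method} auth is missing keys: {sorted(missing)}")
    return method


print(detect_auth_method({"username": "admin", "password": "secret"}))
```

Extra keys such as verify_ssl or oauth_scopes are ignored by this check, since they modify a method rather than select one.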

Step 2: Set up your PromQL queries

Set up your PromQL queries for collecting metrics. slaOS validates query correctness during the onboarding process to ensure reliable data collection. For self-hosted deployments, invalid query formats will prevent the indexer from starting.

Query Configuration

You can use the full power of PromQL to build your queries. For a comprehensive guide on writing PromQL queries, refer to the official Prometheus documentation.

Querying basics | Prometheus (prometheus.io)

Here are some common query patterns for monitoring service health:

Monitor the rate of incoming requests:

queries:
  - query: 'sum by (customer_id) (rate(http_requests_total{job="api"}[5m]))'
    step: 
      value: 60
      unit: "s"
    slaos_metric_name: "request_rate"
    organization_identifier: "customer_id"

This query:

Calculates request rate over 5-minute windows
Groups results by customer_id
Returns data points every minute (step)
Maps to "request_rate" metric in slaOS
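Before adding a query like this to slaOS, it can help to run it directly against Prometheus and inspect the result. The sketch below only builds the URL for Prometheus's standard /api/v1/query_range HTTP endpoint. The helper name query_range_url and the timestamps are placeholders, and how slaOS itself issues requests is not described in this guide.

```python
from urllib.parse import urlencode


def query_range_url(base_url: str, query: str, start: int, end: int, step_s: int) -> str:
    """Build a Prometheus /api/v1/query_range URL for a PromQL expression."""
    params = urlencode(
        {"query": query, "start": start, "end": end, "step": f"{step_s}s"}
    )
    return f"{base_url.rstrip('/')}/api/v1/query_range?{params}"


url = query_range_url(
    "http://prometheus:9090",  # same placeholder base_url as in the examples
    'sum by (customer_id) (rate(http_requests_total{job="api"}[5m]))',
    start=1700000000,  # placeholder Unix timestamps, ten minutes apart
    end=1700000600,
    step_s=60,
)
print(url)
```

The printed URL can be opened with curl or a browser. If the response contains no customer_id label on the returned series, the query will not map cleanly to organizations in slaOS.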

Calculate the ratio of errors to total requests:

queries:
  - query: 'sum by (customer_id) (rate(http_errors_total{job="api"}[5m])) / sum by (customer_id) (rate(http_requests_total{job="api"}[5m]))'
    step:
      value: 60
      unit: "s"
    slaos_metric_name: "error_rate"
    organization_identifier: "customer_id"

This query:

Computes error rate as errors/total requests
Maintains customer-specific error rates
Yields the fraction of failed requests (a value between 0 and 1; multiply by 100 for a percentage)
Updates every minute

Calculate 95th percentile latency from histogram buckets:

queries:
  - query: 'histogram_quantile(0.95, sum by (le, customer_id) (rate(http_duration_seconds_bucket{job="api"}[5m])))'
    step:
      value: 60
      unit: "s"
    slaos_metric_name: "p95_latency"
    organization_identifier: "customer_id"

This query:

Uses histogram_quantile for p95 calculation
Maintains the 'le' (less than or equal) label required for histograms
Groups by customer_id for per-customer latency
Updates every minute
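To see why the le label must survive the aggregation, it helps to look at what histogram_quantile does with the buckets. The following Python sketch mirrors the linear interpolation that Prometheus documents for classic histograms. It is a simplified illustration, not Prometheus's implementation: it assumes cumulative buckets with at least one finite bound plus a final +Inf bucket, at least one observation, and 0 < q <= 1.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets."""
    buckets = sorted(buckets)
    total = buckets[-1][1]  # the +Inf bucket holds the total observation count
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for i, (bound, count) in enumerate(buckets):
        if count >= rank:
            if math.isinf(bound):
                # The quantile falls in the +Inf bucket: report the highest finite bound.
                return buckets[i - 1][0]
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Made-up per-customer buckets: (le in seconds, cumulative count)
buckets = [(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]
print(histogram_quantile(0.95, buckets))  # approximately 0.75
```

Without le in the by clause, the per-bucket counts collapse into a single number and there is nothing left to interpolate over.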

Query Validation

slaOS performs several validations on your queries:

Syntax correctness
Label presence (especially for organization_identifier)
Appropriate use of aggregation operators
Correct histogram usage
Valid time windows and steps

If validation fails:

In cloud slaOS: The onboarding interface will show specific error messages
In self-hosted slaOS: The indexer will log errors and fail to start
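Some of these checks can be approximated locally before onboarding. As one example, the Python sketch below looks for by (...) clauses that drop the organization label. It is a rough regex heuristic with a hypothetical function name, not the validator slaOS runs, and it ignores without (...) clauses and aggregations that have no grouping clause at all.

```python
import re

# Captures the label list of every `by (...)` clause in a PromQL expression.
BY_CLAUSE = re.compile(r"\bby\s*\(([^)]*)\)")


def by_clauses_missing_label(query: str, label: str) -> list[str]:
    """Return the `by (...)` label lists that do not include `label`."""
    missing = []
    for clause in BY_CLAUSE.findall(query):
        labels = [name.strip() for name in clause.split(",")]
        if label not in labels:
            missing.append(clause.strip())
    return missing


query = (
    'sum by (customer_id) (rate(http_errors_total{job="api"}[5m])) '
    '/ sum by (endpoint) (rate(http_requests_total{job="api"}[5m]))'
)
print(by_clauses_missing_label(query, "customer_id"))  # ['endpoint']
```

An empty list means every grouping clause keeps the label. It does not prove the label exists on the underlying series.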

Best Practices

Time Windows: Use appropriate time windows for rate calculations

rate(metric[5m])     # Good for high-traffic services
rate(metric[1m])     # May be noisy for low-traffic services
rate(metric[15m])    # Better for low-traffic services

Step Selection

When querying metrics, the step interval determines how frequently data points are sampled. Here are the key points about step configuration:

We poll Prometheus integrations every 60 seconds (1 minute)
Step sizes must be ≤ 60 seconds
Step intervals should evenly divide into 60 seconds to ensure consistent metric sampling

For example, valid step intervals include: 1s, 2s, 3s, 4s, 5s, 6s, 10s, 12s, 15s, 20s, 30s, and 60s.

step:  # Good for real-time monitoring
  value: 60
  unit: "s"

step:  # High resolution but more resource intensive
  value: 15
  unit: "s"
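The divisibility rule above can be checked mechanically. This Python sketch assumes whole-second steps and uses hypothetical helper names. The points-per-poll figure is an approximation derived from the 60-second polling interval, not a documented slaOS guarantee.

```python
POLL_INTERVAL_S = 60  # slaOS polling interval stated in this guide


def is_valid_step(step_s: int) -> bool:
    """True if the step is 1..60 seconds and divides the 60-second poll evenly."""
    return 1 <= step_s <= POLL_INTERVAL_S and POLL_INTERVAL_S % step_s == 0


def points_per_poll(step_s: int) -> int:
    """Approximate number of data points per series in each 60-second poll."""
    if not is_valid_step(step_s):
        raise ValueError(f"step of {step_s}s does not evenly divide {POLL_INTERVAL_S}s")
    return POLL_INTERVAL_S // step_s


valid_steps = [s for s in range(1, POLL_INTERVAL_S + 1) if is_valid_step(s)]
print(valid_steps)  # [1, 2, 3, 4, 5, 6, 10, 12, 15, 20, 30, 60]
print(points_per_poll(15))  # 4
```

A step of 45 seconds, for example, is below the 60-second limit but fails the check because it leaves a 15-second remainder in every polling window.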

Aggregation: Include necessary labels in aggregations

sum by (customer_id, endpoint) (...)    # Preserves endpoint information
sum by (customer_id) (...)              # More condensed view

For more complex queries or specific use cases, consult our support team or refer to the Prometheus querying documentation.


Step 3: Configuration Setup

Combine the outcomes from Step 1 (Authentication) and Step 2 (Queries) into your main configuration file. Here's an example:

inputs:
  - integration: prometheus
    slaos_key: prometheus_metrics
    type: metrics
    prometheus:
      base_url: "http://prometheus:9090"
      # Add your authentication configuration from Step 1 if needed (choose one method)
      auth:
        username: "admin"          # If using basic auth
        password: "secret"         # If using basic auth
        # Or, if using token auth:
        # token: "your-token"
        # Or, if using mTLS:
        # cert_path: "/path/to/cert"
        # key_path: "/path/to/key"
      # Add your queries from Step 2
      queries:
        - query: 'rate(http_request_duration_seconds_count{job="api"}[5m])'
          step: 
            value: 60
            unit: "s"
          slaos_metric_name: "http_request_rate"
          organization_identifier: "customer_id"
          fallback_org_id: "default_customer"
      # Connection settings
      timeout: 15.0
      pool_connections: 10
      pool_maxsize: 10
      max_parallel_queries: 5
      retry_backoff_factor: 0.1
      max_retries: 3

Tip: For the latest configuration examples and templates, check our GitHub repository. We regularly update these templates with best practices and new features.

Advanced settings for self-hosted

When running self-hosted slaOS, you have full control over connection settings. Here are the available parameters with recommended values:

prometheus:
  # Request handling
  timeout: 15.0                # Request timeout in seconds
  max_retries: 3              # Maximum retry attempts
  retry_backoff_factor: 0.1   # Delay between retries (exponential backoff)

  # Connection pooling
  pool_connections: 10        # Initial pool size
  pool_maxsize: 10           # Maximum concurrent connections
  max_parallel_queries: 5     # Maximum concurrent queries

Configuration Guidelines

Timeout Settings

prometheus:
  timeout: 15.0      # Default: good for most cases
  # timeout: 30.0    # For complex queries or slower networks
  # timeout: 5.0     # For simple queries, fast networks

Connection Pool Optimization

# High-traffic setup
prometheus:
  pool_connections: 20
  pool_maxsize: 20
  max_parallel_queries: 10

# Low-traffic setup
prometheus:
  pool_connections: 5
  pool_maxsize: 5
  max_parallel_queries: 3

Retry Strategy

# Aggressive retry
prometheus:
  max_retries: 5
  retry_backoff_factor: 0.2

# Conservative retry
prometheus:
  max_retries: 2
  retry_backoff_factor: 0.5
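To compare the two profiles, it helps to look at the delays they would produce. This guide does not state the exact backoff formula, so the Python sketch below ASSUMES the common exponential scheme in which the delay before retry n is retry_backoff_factor * 2^(n - 1). Treat the numbers as illustrative and verify them against your deployment's behavior.

```python
def retry_delays(max_retries: int, backoff_factor: float) -> list[float]:
    """Delay in seconds before each retry, ASSUMING factor * 2**(attempt - 1)."""
    return [backoff_factor * 2 ** (attempt - 1) for attempt in range(1, max_retries + 1)]


# The two profiles from the Retry Strategy section.
for name, retries, factor in [("aggressive", 5, 0.2), ("conservative", 2, 0.5)]:
    delays = retry_delays(retries, factor)
    print(name, delays, "total wait:", sum(delays))
```

Under this assumption, more retries with a small factor can still add up to a longer total wait than fewer retries with a larger factor, which matters when the total has to fit inside the 60-second polling interval.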

Frequently Asked Questions (FAQ)

Authentication

Q: Can I use multiple authentication methods simultaneously? A: No, authentication methods are mutually exclusive. Choose one that best fits your security requirements.

Q: How often should I rotate credentials? A: Best practice is to rotate credentials every 90 days or immediately if compromised.

Organization Identification

Q: What happens if the organization identifier is missing? A: The integration will:

Use the fallback_org_id if configured
Stop with an error if no fallback_org_id is provided
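The fallback behavior described in this answer can be pictured as the following Python sketch. The function name resolve_org_id and the error text are hypothetical. Only the two outcomes (use fallback_org_id, otherwise stop with an error) come from this guide.

```python
from typing import Optional


def resolve_org_id(
    labels: dict,
    organization_identifier: str,
    fallback_org_id: Optional[str] = None,
) -> str:
    """Pick the organization ID for one series, following the documented fallback."""
    value = labels.get(organization_identifier)
    if value:
        return value
    if fallback_org_id is not None:
        return fallback_org_id
    raise ValueError(
        f"label '{organization_identifier}' is missing and no fallback_org_id is set"
    )


print(resolve_org_id({"customer_id": "acme"}, "customer_id"))             # acme
print(resolve_org_id({"job": "api"}, "customer_id", "default_customer"))  # default_customer
```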

Q: Can I use different organization identifiers for different queries? A: Yes, each query can specify its own organization_identifier and fallback_org_id.

Metrics and Queries

Q: How often does slaOS collect metrics? A: After initial backfilling of historical data, slaOS queries the data source every 60 seconds. The frequency of data points within each 60-second window is determined by the step parameter in your query configuration.

Q: Can I query logs through Prometheus? A: No, the Prometheus integration only supports metric queries. For log analysis, please use another supported integration such as CloudWatch. We plan to integrate PromQL-compatible log systems soon.

For any additional questions or issues, please contact the slaOS support team on Slack.


Last updated 1 year ago