Prometheus Integration Guide

This guide provides comprehensive instructions for integrating your Prometheus metrics with slaOS. This integration enables slaOS to collect and analyze metrics from your Prometheus instances, helping you establish and monitor Service Level Indicators (SLIs).

Prerequisites

Required Components

  • A running Prometheus instance

  • Your slaOS account credentials

  • (Optional) Access to configure authentication methods

Important Note: Currently, the integration requires an existing Prometheus server. If you only have applications exposing /metrics endpoints that need to be scraped, this is not yet supported but is coming soon!

Please contact our support team if this is your use case. We're happy to help you find alternative solutions and to ensure a smooth onboarding experience when this feature becomes available.

Integration Steps

Step 1: Authentication Setup

If your Prometheus instance requires authentication or runs with TLS enabled, you'll need to configure the appropriate authentication method. This step is crucial for securing access to your metrics while ensuring slaOS can reliably collect them.

Choose one of the following authentication methods based on your Prometheus setup:

Basic Authentication

prometheus:
  base_url: "http://prometheus:9090"
  auth:
    username: "admin"
    password: "secret"

Token Authentication

prometheus:
  base_url: "http://prometheus:9090"
  auth:
    token: "your-secret-token"

Certificate Authentication (mTLS)

prometheus:
  base_url: "https://prometheus:9090"
  auth:
    cert_path: "/path/to/client.crt"
    key_path: "/path/to/client.key"
    verify_ssl: true

Google Cloud

prometheus:
  base_url: "https://monitoring.googleapis.com/v1/projects/[PROJECT_ID]/location/global/prometheus"
  auth:
    gcloud_service_account_path: "/path/to/service-account.json"
    gcloud_target_principal: "prometheus-reader@[PROJECT_ID].iam.gserviceaccount.com"
    oauth_scopes:
      - "https://www.googleapis.com/auth/monitoring.read"
      - "https://www.googleapis.com/auth/cloud-platform"

Important setup steps:

  1. Create a service account with these IAM roles:

    • roles/monitoring.viewer

    • roles/iam.serviceAccountTokenCreator

    • roles/iam.serviceAccountUser

  2. Generate and download the service account key file (JSON)

  3. Replace:

    • [PROJECT_ID] with your actual GCP project ID

    • /path/to/service-account.json with the actual path to your downloaded key file

  4. Ensure the service account has the required OAuth scopes enabled in your GCP project.

For more details, see the GitHub templates at prometheus/auth.md.
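Because only one authentication method should be configured at a time, it can help to check the auth block before deploying. The following is a minimal sketch of such a pre-flight check, not part of slaOS itself: the function name `detect_auth_method` and the grouping of keys per method are assumptions based on the examples above.

```python
from typing import Optional

# Keys that identify each method, taken from the examples in this guide.
# The grouping itself is an assumption made for this sketch.
AUTH_METHODS = {
    "basic": {"username", "password"},
    "token": {"token"},
    "mtls": {"cert_path", "key_path"},
    "gcloud": {"gcloud_service_account_path"},
}


def detect_auth_method(auth: dict) -> Optional[str]:
    """Return the one configured auth method, or None if `auth` is empty.

    Raises ValueError if keys from several methods are mixed, or if the
    chosen method is missing one of its keys.
    """
    configured = set(auth)
    present = [name for name, keys in AUTH_METHODS.items() if keys & configured]
    if len(present) > 1:
        raise ValueError(f"multiple auth methods configured: {present}")
    if not present:
        return None
    method = present[0]
    missing = AUTH_METHODS[method] - configured
    if missing:
        raise ValueError(f"{method} auth is missing keys: {sorted(missing)}")
    return method


print(detect_auth_method({"username": "admin", "password": "secret"}))  # basic
```

Running a check like this in CI would catch a config that accidentally combines, say, `token` with `username` and `password`.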

Step 2: Set up your PromQL queries

Set up your PromQL queries for collecting metrics. slaOS validates query correctness during the onboarding process to ensure reliable data collection. For self-hosted deployments, invalid query formats will prevent the indexer from starting.

Query Configuration

You can use the full power of PromQL to build your queries. For a comprehensive guide on writing PromQL queries, refer to the official Prometheus documentation.

Here are some common query patterns for monitoring service health:

Monitor the rate of incoming requests:

queries:
  - query: 'sum by (customer_id) (rate(http_requests_total{job="api"}[5m]))'
    step: 
      value: 60
      unit: "s"
    slaos_metric_name: "request_rate"
    organization_identifier: "customer_id"

This query:

  • Calculates request rate over 5-minute windows

  • Groups results by customer_id

  • Returns data points every minute (step)

  • Maps to "request_rate" metric in slaOS
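To test a query before adding it to the configuration, you can run it against Prometheus's standard range-query endpoint (`/api/v1/query_range`). The sketch below only builds the request URL from the same fields used in the config. How slaOS issues its requests internally is not described in this guide, so treat the mapping and the helper name `build_query_range_url` as illustrative assumptions.

```python
from urllib.parse import urlencode


def build_query_range_url(base_url, query, start, end, step_seconds):
    """Build a Prometheus range-query URL for a PromQL expression.

    `start` and `end` are Unix timestamps; `step_seconds` is the
    resolution step, matching `step.value` when `step.unit` is "s".
    """
    params = urlencode(
        {
            "query": query,
            "start": start,
            "end": end,
            "step": f"{step_seconds}s",
        }
    )
    return f"{base_url.rstrip('/')}/api/v1/query_range?{params}"


url = build_query_range_url(
    "http://prometheus:9090",
    'sum by (customer_id) (rate(http_requests_total{job="api"}[5m]))',
    start=1700000000,
    end=1700000060,
    step_seconds=60,
)
print(url)
```

Opening the resulting URL with curl or a browser shows whether each returned series carries the `customer_id` label that `organization_identifier` refers to.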

Query Validation

slaOS performs several validations on your queries:

  • Syntax correctness

  • Label presence (especially for organization_identifier)

  • Appropriate use of aggregation operators

  • Correct histogram usage

  • Valid time windows and steps

If validation fails:

  • In cloud slaOS: The onboarding interface will show specific error messages

  • In self-hosted slaOS: The indexer will log errors and fail to start

Best Practices

  1. Time Windows: Use appropriate time windows for rate calculations

rate(metric[5m])     # Good for high-traffic services
rate(metric[1m])     # May be noisy for low-traffic services
rate(metric[15m])    # Better for low-traffic services

  2. Step Selection

When querying metrics, the step interval determines how frequently data points are sampled. Here are the key points about step configuration:

  • We poll Prometheus integrations every 60 seconds (1 minute)

  • Step sizes must be ≤ 60 seconds

  • Step intervals should divide 60 seconds evenly to ensure consistent metric sampling

For example, valid step intervals include: 1s, 2s, 3s, 4s, 5s, 6s, 10s, 12s, 15s, 20s, 30s, and 60s.

# Good for real-time monitoring
step:
  value: 60
  unit: "s"

# High resolution but more resource intensive
step:
  value: 15
  unit: "s"
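The step rules above (at most 60 seconds, and an even divisor of the 60-second polling interval) can be expressed as a small check. This is a sketch for validating your own configuration, assuming whole-second steps; the helper `is_valid_step` is hypothetical and is not slaOS's validator.

```python
POLL_INTERVAL_SECONDS = 60  # slaOS polls Prometheus every 60 seconds


def is_valid_step(step_seconds: int) -> bool:
    """True if the step is positive, <= 60 s, and divides 60 s evenly."""
    return (
        0 < step_seconds <= POLL_INTERVAL_SECONDS
        and POLL_INTERVAL_SECONDS % step_seconds == 0
    )


valid_steps = [s for s in range(1, POLL_INTERVAL_SECONDS + 1) if is_valid_step(s)]
print(valid_steps)  # [1, 2, 3, 4, 5, 6, 10, 12, 15, 20, 30, 60]
```

A step such as 45 seconds fails the check because 60 is not a multiple of 45, and 90 seconds fails because it exceeds the polling interval.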

  3. Aggregation: Include necessary labels in aggregations

sum by (customer_id, endpoint) (...)    # Preserves endpoint information
sum by (customer_id) (...)              # More condensed view

For more complex queries or specific use cases, consult our support team or refer to the Prometheus querying documentation.

Step 3: Configuration Setup

Combine the outcomes from Step 1 (Authentication) and Step 2 (Queries) into your main configuration file. Here's an example:

inputs:
  - integration: prometheus
    slaos_key: prometheus_metrics
    type: metrics
    prometheus:
      base_url: "http://prometheus:9090"
      # Add your authentication configuration from Step 1 if needed.
      # Methods are mutually exclusive: keep exactly one and delete the rest.
      auth:
        username: "admin"            # Basic auth
        password: "secret"           # Basic auth
        # token: "your-token"        # Or: token auth
        # cert_path: "/path/to/cert" # Or: mTLS
        # key_path: "/path/to/key"   # Or: mTLS
      # Add your queries from Step 2
      queries:
        - query: 'rate(http_request_duration_seconds_count{job="api"}[5m])'
          step: 
            value: 60
            unit: "s"
          slaos_metric_name: "http_request_rate"
          organization_identifier: "customer_id"
          fallback_org_id: "default_customer"
      # Connection settings
      timeout: 15.0
      pool_connections: 10
      pool_maxsize: 10
      max_parallel_queries: 5
      retry_backoff_factor: 0.1
      max_retries: 3

Tip: For the latest configuration examples and templates, check our GitHub repository. We regularly update these templates with best practices and new features.

Advanced settings for self-hosted

When running self-hosted slaOS, you have full control over connection settings. Here are the available parameters with recommended values:

prometheus:
  # Request handling
  timeout: 15.0                # Request timeout in seconds
  max_retries: 3              # Maximum retry attempts
  retry_backoff_factor: 0.1   # Delay between retries (exponential backoff)

  # Connection pooling
  pool_connections: 10        # Initial pool size
  pool_maxsize: 10           # Maximum concurrent connections
  max_parallel_queries: 5     # Maximum concurrent queries

Configuration Guidelines

  1. Timeout Settings

prometheus:
  timeout: 15.0      # Default: good for most cases
  # timeout: 30.0    # For complex queries or slower networks
  # timeout: 5.0     # For simple queries on fast networks

  2. Connection Pool Optimization

# High-traffic setup
prometheus:
  pool_connections: 20
  pool_maxsize: 20
  max_parallel_queries: 10

# Low-traffic setup
prometheus:
  pool_connections: 5
  pool_maxsize: 5
  max_parallel_queries: 3

  3. Retry Strategy

# Aggressive retry
prometheus:
  max_retries: 5
  retry_backoff_factor: 0.2

# Conservative retry
prometheus:
  max_retries: 2
  retry_backoff_factor: 0.5
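To compare retry settings, it helps to look at the delays they produce. This guide does not state the exact backoff formula slaOS uses, so the sketch below assumes a common one, where the delay before retry n is `retry_backoff_factor * 2 ** (n - 1)`. Treat the numbers as illustrative only.

```python
def backoff_delays(max_retries: int, backoff_factor: float) -> list:
    """Delays in seconds before each retry, under the assumed formula
    backoff_factor * 2 ** (n - 1) for retry n = 1..max_retries."""
    return [backoff_factor * 2 ** attempt for attempt in range(max_retries)]


aggressive = backoff_delays(max_retries=5, backoff_factor=0.2)
conservative = backoff_delays(max_retries=2, backoff_factor=0.5)

print(aggressive)    # [0.2, 0.4, 0.8, 1.6, 3.2]
print(conservative)  # [0.5, 1.0]
```

Under this assumption the five-retry profile spends about 6.2 seconds waiting in the worst case, compared with 1.5 seconds for the two-retry profile, so check the worst-case wait against the 60-second polling interval and your `timeout` value.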

Frequently Asked Questions (FAQ)

Authentication

Q: Can I use multiple authentication methods simultaneously? A: No, authentication methods are mutually exclusive. Choose one that best fits your security requirements.

Q: How often should I rotate credentials? A: Best practice is to rotate credentials every 90 days or immediately if compromised.

Organization Identification

Q: What happens if the organization identifier is missing? A: The integration will:

  1. Use the fallback_org_id if configured

  2. Stop with an error if no fallback_org_id is provided
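The fallback behavior described above can be sketched as follows. This is an illustration of the documented rule rather than slaOS's actual code: the function name `resolve_org_id` and the choice of `LookupError` are assumptions.

```python
from typing import Mapping, Optional


def resolve_org_id(
    labels: Mapping[str, str],
    organization_identifier: str,
    fallback_org_id: Optional[str] = None,
) -> str:
    """Pick the organization for one series returned by a query.

    Uses the label named by `organization_identifier` when present,
    otherwise `fallback_org_id`, otherwise raises an error.
    """
    value = labels.get(organization_identifier)
    if value:
        return value
    if fallback_org_id is not None:
        return fallback_org_id
    raise LookupError(
        f"label {organization_identifier!r} is missing and no fallback_org_id is set"
    )


print(resolve_org_id({"customer_id": "acme"}, "customer_id"))               # acme
print(resolve_org_id({"job": "api"}, "customer_id", "default_customer"))    # default_customer
```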

Q: Can I use different organization identifiers for different queries? A: Yes, each query can specify its own organization_identifier and fallback_org_id.

Metrics and Queries

Q: How often does slaOS collect metrics? A: After initial backfilling of historical data, slaOS queries the data source every 60 seconds. The frequency of data points within each 60-second window is determined by the step parameter in your query configuration.

Q: Can I query logs through Prometheus? A: No, the Prometheus integration only supports metric queries. For log analysis, please use other supported integrations such as CloudWatch. We plan to integrate PromQL-compatible log systems soon.

For any additional questions or issues, please contact the slaOS support team on Slack.
