Handling organization_id

Multi-tenant API service example

Let's explore common scenarios using a real-world example of a multi-tenant API service. In this example, you're running an API that serves multiple organizations, and you want to monitor various metrics. Some metrics are naturally split by organization (like latency and request volume), while others are collected globally (like error rates).

The organization_id in these examples could represent different identifiers depending on your use case:

  • customer_id: When serving multiple end customers (e.g., SaaS platform)

    • Email service monitoring delivery rates per business account

    • DEX monitoring liquidity provider positions

    • NFT marketplace tracking collection trading volume

  • vendor_id: When aggregating metrics across different suppliers or partners

    • Marketplace measuring seller performance metrics

    • RPC node provider tracking request volumes

    • Oracle service monitoring price feed updates

  • service_id: When monitoring multiple internal services or microservices

    • E-commerce tracking checkout service reliability

    • Bridge monitoring cross-chain transfers

    • Smart contract monitoring function calls

  • integration_id: When tracking metrics for different third-party integrations

    • Payment platform monitoring gateway success rates

    • Multi-chain wallet tracking transaction status

    • DEX aggregator monitoring swap routes

Scenario 1: Organization-Specific Metrics (Per-Organization Latency)

Context: Your API tracks request latency per organization, which is essential for:

  • Monitoring individual organization experience

  • Meeting specific SLAs per organization

  • Identifying organization-specific performance issues

# Prometheus metrics
# Each request is tagged with organization_id
api_request_latency_seconds{organization_id="org123", endpoint="/api/v1/users"} 0.45
api_request_latency_seconds{organization_id="org456", endpoint="/api/v1/users"} 0.32

# slaOS configuration
queries:
  - query: 'histogram_quantile(0.95, sum by (le, organization_id) (rate(api_request_latency_seconds_bucket[5m])))'
    step: 
      value: 60
      unit: "s"
    slaos_metric_name: "p95_latency"
    organization_identifier: "organization_id"  # Each organization gets its own latency metrics

Use Case Examples:

  • SaaS Platform: Track response times for each customer's API usage

  • Marketplace: Monitor transaction processing times for different vendors

  • Microservices: Measure inter-service communication latency

  • Integration Platform: Track external API call latencies per integration
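The per-organization grouping that `sum by (le, organization_id)` performs can be sketched in plain Python. This is only an illustration of the aggregation idea, not slaOS internals; the `p95_by_org` helper and the sample data are hypothetical, and a simple nearest-rank percentile stands in for `histogram_quantile`:

```python
from collections import defaultdict

def p95_by_org(samples):
    """Nearest-rank 95th-percentile latency per organization_id.

    samples: iterable of (organization_id, latency_seconds) pairs,
    mirroring how each latency observation carries an org label."""
    by_org = defaultdict(list)
    for org_id, latency in samples:
        by_org[org_id].append(latency)
    result = {}
    for org, lats in by_org.items():
        lats.sort()
        result[org] = lats[min(len(lats) - 1, int(0.95 * len(lats)))]
    return result

# Hypothetical scrape: org123 has three observations, org456 has one
samples = [("org123", 0.45), ("org123", 0.32), ("org123", 0.50),
           ("org456", 0.28)]
```

Because the grouping key is the `organization_id` label, each organization ends up with its own p95 series, which is exactly what `organization_identifier: "organization_id"` tells slaOS to key results on.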

Scenario 2: Service-Wide Metrics (Global Error Rates)

Context: Your API tracks error counts globally due to:

  • Infrastructure limitations

  • Metric collection setup

  • No business need to track errors per organization

# Prometheus metrics
# Error counts are only tagged with status code
http_errors_total{status="500"} 10
http_errors_total{status="400"} 25
http_errors_total{status="200"} 1000

# slaOS configuration
queries:
  - query: 'sum(rate(http_errors_total{status=~"5.."}[5m])) / sum(rate(http_errors_total[5m]))'
    step: 
      value: 60
      unit: "s"
    slaos_metric_name: "error_rate"
    fallback_org_id: "global_service"  # All error metrics go to a default organization

In this case:

  • Error metrics don't have organization identification

  • Using fallback_org_id assigns all error rates to a default organization

  • Useful for service-wide SLAs or general monitoring

  • All organizations reference the same error rate metrics
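The arithmetic behind the error-rate query, and how the fallback is attached, can be shown in a rough pure-Python sketch (the `error_rate` helper and `record` dict are hypothetical names; the real exporter evaluates the PromQL query instead):

```python
def error_rate(counts_by_status):
    """Mirror of the PromQL query: 5xx counts divided by all counts.

    counts_by_status maps an HTTP status string (e.g. "500")
    to its counter value."""
    total = sum(counts_by_status.values())
    errors = sum(v for status, v in counts_by_status.items()
                 if status.startswith("5"))
    return errors / total if total else 0.0

# The counter values from the example metrics above
counts = {"500": 10, "400": 25, "200": 1000}

# Since the query result carries no organization_id label,
# slaOS attributes it to the configured fallback_org_id
record = {
    "organization_id": "global_service",  # from fallback_org_id
    "slaos_metric_name": "error_rate",
    "value": error_rate(counts),
}
```

Every organization looking at this SLA sees the same `global_service` series, which is the intended behavior for service-wide error budgets.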

Scenario 3: Mixed Metrics (Combined Approach)

# Organization-specific requests
api_requests_total{organization_id="org123", endpoint="/api/v1/users"} 150
api_requests_total{organization_id="org456", endpoint="/api/v1/orders"} 75

# Public endpoint requests (no organization_id)
api_requests_total{endpoint="/public/status"} 50
api_requests_total{endpoint="/health"} 25

# slaOS configuration
queries:
  - query: 'sum by (organization_id) (rate(api_requests_total[5m]))'
    step: 
      value: 60
      unit: "s"
    slaos_metric_name: "request_rate"
    organization_identifier: "organization_id"
    fallback_org_id: "public_endpoints"  # For requests without organization_id
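The resolution rule for this mixed configuration is: use the `organization_identifier` label when a sample has it, otherwise fall back to `fallback_org_id`. A minimal sketch of that rule (the `resolve_org` helper is hypothetical, not part of slaOS):

```python
def resolve_org(labels, identifier="organization_id", fallback=None):
    """Pick the organization for one metric sample: the configured
    identifier label wins; otherwise use the fallback, if any."""
    org = labels.get(identifier)
    if org:
        return org
    if fallback is not None:
        return fallback
    raise ValueError(f"no {identifier!r} label and no fallback_org_id set")

# Label sets from the example metrics above
samples = [
    {"organization_id": "org123", "endpoint": "/api/v1/users"},
    {"organization_id": "org456", "endpoint": "/api/v1/orders"},
    {"endpoint": "/public/status"},  # no organization_id
    {"endpoint": "/health"},         # no organization_id
]
orgs = [resolve_org(s, fallback="public_endpoints") for s in samples]
```

Requests from known organizations keep their own series, while the two public endpoints are pooled under `public_endpoints`.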

Best Practices for Mixed Environments

Consistent Labeling:

# Good - consistent organization identification
api_latency_seconds{organization_id="org123", ...}
api_requests_total{organization_id="org123", ...}

# Avoid - inconsistent labeling
api_latency_seconds{organization_id="org123", ...}
api_requests_total{client="org123", ...}  # Different label name
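If you cannot fix inconsistent label names at the source, one option is to canonicalize them before export. A hypothetical normalizer (the `ALIASES` tuple and `canonical_labels` function are illustrative, not a slaOS feature):

```python
# Label names that various services use for the same concept,
# in priority order; all get folded into organization_id
ALIASES = ("organization_id", "client", "customer_id", "tenant")

def canonical_labels(labels):
    """Return a copy of labels with any known org alias renamed
    to the canonical organization_id label."""
    out = {k: v for k, v in labels.items() if k not in ALIASES}
    for alias in ALIASES:
        if alias in labels:
            out["organization_id"] = labels[alias]
            break
    return out
```

With this in place, `api_requests_total{client="org123"}` and `api_latency_seconds{organization_id="org123"}` resolve to the same organization.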

Clear Separation:

queries:
  # Organization-specific latency
  - query: 'histogram_quantile(0.95, sum by (le, organization_id) (rate(api_latency_seconds_bucket[5m])))'
    organization_identifier: "organization_id"
    slaos_metric_name: "org_latency"

  # Global error rates
  - query: 'sum(rate(http_errors_total{status=~"5.."}[5m])) / sum(rate(http_errors_total[5m]))'
    fallback_org_id: "global_service"
    slaos_metric_name: "global_error_rate"

Meaningful Fallback IDs:

# Descriptive fallback IDs
fallback_org_id: "public_api_endpoints"    # Clear purpose
fallback_org_id: "unauthenticated_users"   # Clear purpose

# Avoid generic fallbacks
fallback_org_id: "default"                 # Too generic
fallback_org_id: "other"                   # Not descriptive

Remember:

  • Choose the appropriate organization identifier based on your use case

  • Not all metrics need to be split by organization

  • Use fallback IDs thoughtfully and consistently

  • Document your choices for future reference

  • Consider future changes in metric collection

  • Balance granularity with system complexity
