Elevated service errors
Symptom
A CustomerElevatedServiceErrors alert is firing. It indicates that a service is returning a high rate of errors (5xx responses). Specifically, the alert fires when:
- the service is running on a production cluster AND
- the service is receiving more than 6 requests / minute AND
- more than 10% of the responses from the service have been 5xx for the last 10 minutes
These alerts apply to any service used by an ingress (Magnolia service, frontend service, redirect service, or other service) running on a production cluster.
CustomerElevatedServiceErrors alerts are sent to subscribers via email.
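The conditions above roughly correspond to a Prometheus alert expression along the lines of the following sketch. This is an illustration only: the metric name (http_requests_total with a code label) and the label set are assumptions, not the actual rule behind this alert.

```
# Illustrative sketch only - metric and label names are assumed,
# not the real expression used by CustomerElevatedServiceErrors.
(
  sum by (service) (rate(http_requests_total{code=~"5.."}[10m]))
    /
  sum by (service) (rate(http_requests_total[10m]))
) > 0.10
and
sum by (service) (rate(http_requests_total[10m])) * 60 > 6
```

The first clause captures the "more than 10% 5xx over the last 10 minutes" condition; the second captures the "more than 6 requests / minute" threshold.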
Observations
Here are the details of the alert:

Alert: CustomerElevatedServiceErrors
- Expression:
- Delay: 10 minutes
- Labels: team: customer
- Annotations: summary, description, tenant, cluster_id, cluster_name, namespace, service
Solutions
This section provides solutions that should help resolve the issue in most cases.
Investigate cause for elevated service errors
Unfortunately, there is no single resolution for elevated service error alerts; there are many possible causes.
Possible causes for the elevated service errors:
- The ingress is misconfigured: the service selector does not find the expected service (check the ingress configuration)
- No pods for the service are running: pods are restarting or crashlooping (check the pod status with kubectl or Rancher)
- No pods for the service are running: the deployment / daemonset / statefulset is scaled down to 0 pods (check the deployment / daemonset / statefulset with kubectl or Rancher)
- Pods are running but are returning errors (check the pod logs as a last resort! Logs are often verbose and difficult to interpret)
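The checks above can be run from the command line. A minimal triage sequence, assuming kubectl access to the affected cluster; the placeholders <namespace>, <ingress>, <service> and <pod> come from the alert annotations, and the label selector is an assumption about how the pods are labelled:

```shell
# Triage sketch - <namespace>, <ingress>, <service> and <pod> are placeholders.

# 1. Does the ingress point at the expected service and port?
kubectl -n <namespace> describe ingress <ingress>

# 2. Does the service have healthy endpoints, and are its pods running?
kubectl -n <namespace> get endpoints <service>
kubectl -n <namespace> get pods -l app=<service>   # label selector is an assumption

# 3. Is the workload scaled down to 0 pods?
kubectl -n <namespace> get deploy,daemonset,statefulset

# 4. Last resort: inspect the logs of a failing pod
kubectl -n <namespace> logs <pod> --previous
```

Rancher shows the same information in its workload and ingress views if you prefer a UI over kubectl.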