Magnolia is "crash looping"
Symptom
A CustomerMagnoliaCrashLooping alert is firing. Kubernetes has restarted a Magnolia pod at least three times within 15 minutes.
CustomerMagnoliaCrashLooping alerts are sent to subscribers via email.
Kubernetes will restart a pod if it exceeds its memory limit. The Magnolia JVM heap cannot grow beyond its configured maximum (the JVM max heap setting), but the JVM also consumes a small amount of non-heap memory (usually about 200 MB) that can vary over time. Other containers running in the Magnolia pod also consume memory, though usually only tens of MB. Memory-backed temporary filesystems may use memory as well.
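To confirm whether the restarts are caused by the memory limit, check the termination reason of the previous container. A minimal sketch, substituting the namespace and pod from the alert:
kubectl -n <namespace from alert> get pod <Magnolia pod from alert> \
  -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}'
A reason of OOMKilled points to the memory limit; other reasons (or none at all) usually point to failing probes or an application error instead.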
Observations
Here are the details on the alert:
Alert: CustomerMagnoliaCrashLooping
Expression:
Delay:
Labels:
Annotations:
Check readiness and liveness probe config for Magnolia pod
The alert will note the affected Magnolia pod.
You can view the probe configuration for the Magnolia pod in Rancher or with kubectl:
kubectl -n <namespace from alert> describe pod <Magnolia pod from alert>
Look for the "Liveness" and "Readiness" sections in the output:
Liveness: http-get http://:liveness-port/livez delay=240s timeout=10s period=10s #success=1 #failure=4
Readiness: http-get http://:liveness-port/readyz delay=2s timeout=1s period=2s #success=1 #failure=3
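The pod's recent events can also show why Kubernetes restarted the container. As a sketch (again substituting the namespace and pod from the alert), list the events for the pod and look for messages such as "Liveness probe failed" or "Back-off restarting failed container":
kubectl -n <namespace from alert> get events \
  --field-selector involvedObject.name=<Magnolia pod from alert> \
  --sort-by=.lastTimestamp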
Check bootstrapper container log output for Magnolia pod
The liveness and readiness probes actually check the bootstrapper instead of Magnolia. The bootstrapper then checks Magnolia and returns a result depending on Magnolia’s state.
The bootstrapper’s log shows the results of liveness and readiness checks. You can view the bootstrapper log in the customer’s cockpit or in Grafana.
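If you have kubectl access, you can also read the log directly from the pod. A sketch, assuming the container is named bootstrapper (check the container names in the describe output above if it differs):
kubectl -n <namespace from alert> logs <Magnolia pod from alert> -c bootstrapper --tail=200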
Be careful when adjusting the readiness and liveness probes for the Magnolia pod: don’t set very long delays or failure thresholds until you have verified that Magnolia really needs more time to start up.
Solutions
This section provides solutions that should help resolve the issue in most cases.
Stop failing readiness check
Magnolia may take longer to start up and pass its readiness probe for a variety of reasons (Lucene indexing, large JCR repository, lots of module startup tasks).
You can allow Magnolia more time to start up by (see the sketch after this list):
- increasing the failureThreshold Helm chart value for readiness, to increase the number of failed readiness checks tolerated
- increasing the initialDelaySeconds Helm chart value for readiness, to increase the time before readiness is first checked
- increasing the periodSeconds Helm chart value for readiness, to increase the time between readiness checks
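A minimal sketch of raising these readiness values with Helm. The release name, chart reference, and value paths (magnoliaPublic.readinessProbe.*) are assumptions; check the chart’s values.yaml for the exact keys your chart version exposes, and pick numbers that match how long Magnolia actually needs:
helm -n <namespace from alert> upgrade <release name> <Magnolia chart> \
  --reuse-values \
  --set magnoliaPublic.readinessProbe.failureThreshold=10 \
  --set magnoliaPublic.readinessProbe.initialDelaySeconds=60 \
  --set magnoliaPublic.readinessProbe.periodSeconds=10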
Stop failing liveness check
Magnolia may take longer to start up and pass its liveness probe for a variety of reasons (Lucene indexing, large JCR repository, lots of module startup tasks).
You can allow Magnolia more time to start up by (see the sketch after this list):
- increasing the failureThreshold Helm chart value for liveness, to increase the number of failed liveness checks tolerated
- increasing the initialDelaySeconds Helm chart value for liveness, to increase the time before liveness is first checked
- increasing the periodSeconds Helm chart value for liveness, to increase the time between liveness checks
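The liveness values can be raised the same way. Again, the value paths (magnoliaPublic.livenessProbe.*) are assumptions; verify them against the chart’s values.yaml before applying:
helm -n <namespace from alert> upgrade <release name> <Magnolia chart> \
  --reuse-values \
  --set magnoliaPublic.livenessProbe.failureThreshold=6 \
  --set magnoliaPublic.livenessProbe.initialDelaySeconds=300 \
  --set magnoliaPublic.livenessProbe.periodSeconds=15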