Magnolia is "crash looping"

Symptom

A CustomerMagnoliaCrashLooping alert is firing. Kubernetes has restarted a Magnolia pod more than three times within 15 minutes.

CustomerMagnoliaCrashLooping alerts are sent to subscribers via email.

Kubernetes will restart a pod's container if it exceeds its memory limit. The Magnolia JVM's heap is bounded by its max heap setting, so the heap alone typically stays within the limit, but the JVM also consumes a small amount of non-heap memory (usually about 200 MB) that can vary over time. Other containers running in the Magnolia pod may also consume memory, but they usually use very small amounts (tens of MB). Temporary filesystems may use memory as well.
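If a pod is being killed for exceeding its memory limit, Kubernetes records the termination reason as OOMKilled. A quick way to check this is to look at the container's last terminated state. The command below is a sketch; the container name magnolia-helm is taken from the alert expression, and the namespace and pod placeholders come from the alert.

    # Show why the magnolia-helm container last terminated;
    # "OOMKilled" means the memory limit was exceeded
    kubectl -n <namespace from alert> get pod <Magnolia pod from alert> \
      -o jsonpath='{.status.containerStatuses[?(@.name=="magnolia-helm")].lastState.terminated.reason}'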

Observations

Here are the details on the alert:

Alert: CustomerMagnoliaCrashLooping

Expression

increase(kube_pod_container_status_restarts_total{container="magnolia-helm"}[15m]) > 3

Delay: 2 minutes

Labels: team: customer

Annotations

  • source

  • summary

  • description

  • tenant

  • cluster_id

  • cluster_name

  • pod

  • instance

Check readiness and liveness probe config for Magnolia pod

The alert will note the affected Magnolia pod.

You can view the probe configuration for the Magnolia pod in Rancher or with kubectl.

kubectl -n <namespace from alert> describe pod <Magnolia pod from alert>

Look for the "Liveness" and "Readiness" sections in the output:

    Liveness:       http-get http://:liveness-port/livez delay=240s timeout=10s period=10s #success=1 #failure=4
    Readiness:      http-get http://:liveness-port/readyz delay=2s timeout=1s period=2s #success=1 #failure=3
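The Events section at the end of the describe output also lists recent probe failures and restarts. If you prefer to query events directly, a command along these lines works too (a sketch using the same placeholders as above):

    # Recent events for the pod, including liveness/readiness probe failures
    kubectl -n <namespace from alert> get events \
      --field-selector involvedObject.name=<Magnolia pod from alert> \
      --sort-by=.lastTimestamp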

Check bootstrapper container log output for Magnolia pod

The liveness and readiness probes actually check the bootstrapper instead of Magnolia. The bootstrapper then checks Magnolia and returns a result depending on Magnolia’s state.

The bootstrapper’s log shows the results of liveness and readiness checks. You can view the bootstrapper log in the customer’s cockpit or in Grafana.
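You can also tail the bootstrapper container's log with kubectl. This is a sketch: the container name bootstrapper is an assumption and may differ in your chart, so confirm it against the pod's container list first.

    # Confirm the bootstrapper container's name in the pod
    kubectl -n <namespace from alert> get pod <Magnolia pod from alert> \
      -o jsonpath='{.spec.containers[*].name}'

    # Follow the bootstrapper log and watch the liveness/readiness check results
    kubectl -n <namespace from alert> logs <Magnolia pod from alert> -c bootstrapper --tail=200 -f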

Be careful when adjusting the readiness and liveness probes for the Magnolia pod: don't set very long delays or failure thresholds until you have verified that Magnolia really needs more time to start up.

Solutions

This section provides solutions that should help resolve the issue in most cases.

Stop failing readiness check

Magnolia may take longer to start up and pass its readiness probe for a variety of reasons (Lucene indexing, large JCR repository, lots of module startup tasks).

You can allow Magnolia more time to start up by (see the sketch after this list):

  • increasing the failureThreshold Helm chart value for readiness to increase the number of failed readiness checks tolerated

  • increasing the initialDelaySeconds Helm chart value for readiness to increase the time before readiness is checked

  • increasing the periodSeconds Helm chart value for readiness to increase the time between readiness checks
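For example, a values override along these lines would tolerate up to 10 minutes of failing readiness checks after an initial 60-second delay before the pod is marked unready. This is only a sketch: the parameter names come from this page, but the exact key path (shown here as magnolia.readinessProbe) depends on your Helm chart version, so verify it against the chart's values.yaml.

    # Sketch of a readiness probe override (verify the key path against your chart)
    magnolia:
      readinessProbe:
        initialDelaySeconds: 60   # wait 60s after container start before the first check
        periodSeconds: 10         # check every 10s
        failureThreshold: 60      # tolerate 60 failed checks (600s) before marking the pod unready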

Stop failing liveness check

Magnolia may take longer to start up and pass its liveness probe for a variety of reasons (Lucene indexing, large JCR repository, lots of module startup tasks).

You can allow Magnolia more time to start up by (see the sketch after this list):

  • increasing the failureThreshold Helm chart value for liveness to increase the number of failed liveness checks tolerated

  • increasing the initialDelaySeconds Helm chart value for liveness to increase the time before liveness is checked

  • increasing the periodSeconds Helm chart value for liveness to increase the time between liveness checks
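A comparable override for the liveness probe might look like the sketch below; the key path is again chart-dependent. Keep the earlier caution in mind: failed liveness checks restart the container, so only raise these values once you have verified that Magnolia genuinely needs more startup time, and apply the override through your usual values file or deployment process rather than ad hoc.

    # Sketch of a liveness probe override (verify the key path against your chart)
    magnolia:
      livenessProbe:
        initialDelaySeconds: 300  # wait 5 minutes after container start before the first check
        periodSeconds: 10         # check every 10s
        failureThreshold: 6       # tolerate 6 failed checks (60s) before the container is restarted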
