/site/ping --> Readiness probe

Hi team,

We are running Bloomreach Experience Manager (brXM 14.5.0) on AWS. In our deployment file we have added a readiness probe as shown below:

readinessProbe:
  failureThreshold: 5
  httpGet:
    path: /site/ping/
    port: 8080
    scheme: HTTP
  initialDelaySeconds: 120
  periodSeconds: 10
  successThreshold: 1
  timeoutSeconds: 15
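
As a rough sanity check on these settings, here is my own back-of-the-envelope arithmetic for how long /site/ping/ has to keep failing before the pod is marked unready. It is only a sketch: the variable names are mine, and the second estimate assumes that a new probe only starts once the previous one has finished, which I have not verified.

```shell
#!/bin/sh
# Values copied from the readinessProbe above.
period_seconds=10
failure_threshold=5
timeout_seconds=15

# Estimate when every probe fails quickly (for example, an immediate 5xx).
unready_after_fast=$((failure_threshold * period_seconds))

# Estimate when every probe runs into the timeout.
# Assumption: the effective interval is the larger of period and timeout.
if [ "$timeout_seconds" -gt "$period_seconds" ]; then
    slow_interval=$timeout_seconds
else
    slow_interval=$period_seconds
fi
unready_after_timeouts=$((failure_threshold * slow_interval))

echo "fast failures: ~${unready_after_fast}s, timeouts: ~${unready_after_timeouts}s"
```

Note that timeoutSeconds (15) is larger than periodSeconds (10) in our configuration, so a slow /site/ping/ response can take longer than the interval between probes.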

Now I am intermittently getting 502 Bad Gateway on the CMS UI. When we checked the logs we did not find anything. We also have Grafana. I am sharing the details below:

setenv.sh

#!/bin/sh

#
# ${CATALINA_BASE}/bin/setenv.sh
#
# This script is executed automatically by ${catalina.base}/catalina.sh.
#

# Repository configurations
REP_OPTS="-Drepo.path=${REPO_PATH} -Drepo.bootstrap=false -Drepo.config=${REPO_CONFIG} -Drepo.autoexport.allowed=false -Dhippo.search.index.rebuild=false"

# JAVA_OPTS="-verbose:class"
#Remove unless debugging: -XX:NativeMemoryTracking=summary
JAVA_OPTS="-Dproject.basedir=/brxm/project -Dcargo.jvm.args='-DCMS_BASEURL=${CMS_BASEURL}' -Delastic.apm.service_name=${APM_SERVICE_NAME} -Delastic.apm.application_packages=com.* -Delastic.apm.server_url=${APM_SERVER_URL} -Delastic.apm.secret_token=${APM_SECRET}"

# Logging configurations
L4J_OPTS="-DLOG_LEVEL=${LOG_LEVEL} -DLOG_LEVEL_CMS=${LOG_LEVEL_CMS} -DLOG_LEVEL_CUSTOM_CODE=${LOG_LEVEL_CUSTOM_CODE} -Dlog4j.configurationFile=file://${CATALINA_HOME}/conf/log4j2.xml -DLog4jContextSelector=org.apache.logging.log4j.core.selector.BasicContextSelector"

if [ "${APM_ENABLED}" = true ]
then
	# JVM heap size options
	JVM_OPTS="-XX:NewRatio=1 -javaagent:/usr/local/tomcat/apm-agent/elastic-apm-agent-1.26.0.jar -server -Xms12g -Xmx12g -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -XX:InitiatingHeapOccupancyPercent=45 -XX:+ParallelRefProcEnabled -XX:+DisableExplicitGC -XX:+ExitOnOutOfMemoryError -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=${CATALINA_HOME}/temp -XX:MaxMetaspaceSize=1g -XX:+AlwaysPreTouch -Djava.util.Arrays.useLegacyMergeSort=true -Dorg.apache.catalina.connector.URI_ENCODING=UTF-8 -Dserver.tomcat.max-threads=200 -Dserver.tomcat.max-connections=10000 -Dserver.tomcat.accept-count=100"
else
	# JVM heap size options
	JVM_OPTS="-server -Xms12g -Xmx12g -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -XX:InitiatingHeapOccupancyPercent=45 -XX:+ParallelRefProcEnabled -XX:+DisableExplicitGC -XX:+ExitOnOutOfMemoryError -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=${CATALINA_HOME}/temp -XX:MaxMetaspaceSize=1g -XX:+AlwaysPreTouch -Djava.util.Arrays.useLegacyMergeSort=true -Dorg.apache.catalina.connector.URI_ENCODING=UTF-8 -Dserver.tomcat.max-threads=200 -Dserver.tomcat.max-connections=10000 -Dserver.tomcat.accept-count=100"
fi

#Visual VM's
# VISUAL_VM_OPTS="-Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.local.only=false -Dcom.sun.management.jmxremote.port=1099 -Dcom.sun.management.jmxremote.rmi.port=1099 -Djava.rmi.server.hostname=127.0.0.1"

# JVM garbage Collector options
VGC_OPTS="-verbosegc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:${CATALINA_HOME}/logs/gc.log -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=5 -XX:GCLogFileSize=2048k"

# JVM heapdump options
# DMP_OPTS="-XX:MaxGCPauseMillis=200 -XX:+ParallelRefProcEnabled -XX:+ExitOnOutOfMemoryError -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=${CATALINA_HOME}/temp -XX:MaxMetaspaceSize=1g -XX:+AlwaysPreTouch"

#CATALINA_OPTS=" ${JAVA_OPTS} ${JVM_OPTS} ${VGC_OPTS} ${REP_OPTS} ${DMP_OPTS} ${L4J_OPTS}"
# CATALINA_OPTS="${VISUAL_VM_OPTS} ${JAVA_OPTS} ${JVM_OPTS} ${VGC_OPTS} ${REP_OPTS} ${L4J_OPTS}"
CATALINA_OPTS="${JAVA_OPTS} ${JVM_OPTS} ${VGC_OPTS} ${REP_OPTS} ${L4J_OPTS}"

I have tuned the JVM options; please suggest any corrections.

The output of kubectl describe pod {pod-name} is below:

Events:
  Type     Reason     Age                  From               Message
  ----     ------     ----                 ----               -------
  Normal   Scheduled  15m                  default-scheduler  Successfully assigned default/bloomreach-cms to ip-172-17-9-191.eu-west-1.compute.internal
  Normal   Pulling    15m                  kubelet            Pulling image "84512547.amazonaws.com/default/bloomreach-cms:20"
  Normal   Pulled     15m                  kubelet            Successfully pulled image "84512547.amazonaws.com/default/bloomreach-cms:20" in 5.157s (5.157s including waiting). Image size: 394696117 bytes.
  Normal   Created    15m                  kubelet            Container created
  Normal   Started    15m                  kubelet            Container started
  Warning  Unhealthy  3m9s (x41 over 13m)  kubelet            Readiness probe failed: Get "http://172.17.8.70:8080/site/ping/": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
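
To see whether the timeout in this event can be reproduced by hand, a check along these lines could be run from inside the pod. This is only a sketch: the function names probe_once and probe_ok are my own, and it assumes curl is available in the image. The 200-399 range mirrors what the kubelet counts as a successful HTTP probe.

```shell
#!/bin/sh
# probe_once URL TIMEOUT_SECONDS
# Prints "<http_status> <total_seconds>" for a single GET request.
# curl reports status 000 when the request times out or cannot connect.
probe_once() {
    curl -s -o /dev/null -w '%{http_code} %{time_total}' --max-time "$2" "$1"
}

# probe_ok HTTP_STATUS
# Succeeds when the status is in the 200-399 range.
probe_ok() {
    [ "$1" -ge 200 ] && [ "$1" -lt 400 ]
}
```

For example, `probe_once http://localhost:8080/site/ping/ 15` uses the same path, port and timeout as the readiness probe. Repeating it every 10 seconds should show whether the response time approaches the 15 second limit.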

I keep getting 502 errors. When I searched, I found that /site/ping/ is a heavy request path that can cause instability. The suggestion was to use a custom endpoint for the health check instead, i.e. /cmsinternal/ping, which is said to serve Kubernetes health checks correctly.

I need help with this 502 issue. If needed, can we add /cmsinternal/ping, and how? If I stop using /site/ping/, will that affect the health check that is actually needed?