Hi team,
We are running Bloomreach Experience Manager (brXM 14.5.0) on AWS, and in our deployment file we have added a readiness probe as shown below:
readinessProbe:
  failureThreshold: 5
  httpGet:
    path: /site/ping/
    port: 8080
    scheme: HTTP
  initialDelaySeconds: 120
  periodSeconds: 10
  successThreshold: 1
  timeoutSeconds: 15
Now I am intermittently getting a 502 Bad Gateway on the CMS UI. We checked the logs and did not find anything relevant. We also have Grafana; I am sharing the details below:
setenv.sh
#!/bin/sh
#
# ${CATALINA_BASE}/bin/setenv.sh
#
# This script is executed automatically by ${catalina.base}/catalina.sh.
#
# Repository configurations
REP_OPTS="-Drepo.path=${REPO_PATH} -Drepo.bootstrap=false -Drepo.config=${REPO_CONFIG} -Drepo.autoexport.allowed=false -Dhippo.search.index.rebuild=false"
# JAVA_OPTS="-verbose:class"
#Remove unless debugging: -XX:NativeMemoryTracking=summary
JAVA_OPTS="-Dproject.basedir=/brxm/project -Dcargo.jvm.args='-DCMS_BASEURL=${CMS_BASEURL}' -Delastic.apm.service_name=${APM_SERVICE_NAME} -Delastic.apm.application_packages=com.* -Delastic.apm.server_url=${APM_SERVER_URL} -Delastic.apm.secret_token=${APM_SECRET}"
# Logging configurations
L4J_OPTS="-DLOG_LEVEL=${LOG_LEVEL} -DLOG_LEVEL_CMS=${LOG_LEVEL_CMS} -DLOG_LEVEL_CUSTOM_CODE=${LOG_LEVEL_CUSTOM_CODE} -Dlog4j.configurationFile=file://${CATALINA_HOME}/conf/log4j2.xml -DLog4jContextSelector=org.apache.logging.log4j.core.selector.BasicContextSelector"
if [ "${APM_ENABLED}" = "true" ]
then
# JVM heap size options
JVM_OPTS="-XX:NewRatio=1 -javaagent:/usr/local/tomcat/apm-agent/elastic-apm-agent-1.26.0.jar -server -Xms12g -Xmx12g -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -XX:InitiatingHeapOccupancyPercent=45 -XX:+ParallelRefProcEnabled -XX:+DisableExplicitGC -XX:+ExitOnOutOfMemoryError -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=${CATALINA_HOME}/temp -XX:MaxMetaspaceSize=1g -XX:+AlwaysPreTouch -Djava.util.Arrays.useLegacyMergeSort=true -Dorg.apache.catalina.connector.URI_ENCODING=UTF-8 -Dserver.tomcat.max-threads=200 -Dserver.tomcat.max-connections=10000 -Dserver.tomcat.accept-count=100"
else
# JVM heap size options
JVM_OPTS="-server -Xms12g -Xmx12g -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -XX:InitiatingHeapOccupancyPercent=45 -XX:+ParallelRefProcEnabled -XX:+DisableExplicitGC -XX:+ExitOnOutOfMemoryError -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=${CATALINA_HOME}/temp -XX:MaxMetaspaceSize=1g -XX:+AlwaysPreTouch -Djava.util.Arrays.useLegacyMergeSort=true -Dorg.apache.catalina.connector.URI_ENCODING=UTF-8 -Dserver.tomcat.max-threads=200 -Dserver.tomcat.max-connections=10000 -Dserver.tomcat.accept-count=100"
fi
#Visual VM's
# VISUAL_VM_OPTS="-Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.local.only=false -Dcom.sun.management.jmxremote.port=1099 -Dcom.sun.management.jmxremote.rmi.port=1099 -Djava.rmi.server.hostname=127.0.0.1"
# JVM garbage Collector options
VGC_OPTS="-verbosegc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:${CATALINA_HOME}/logs/gc.log -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=5 -XX:GCLogFileSize=2048k"
# JVM heapdump options
# DMP_OPTS="-XX:MaxGCPauseMillis=200 -XX:+ParallelRefProcEnabled -XX:+ExitOnOutOfMemoryError -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=${CATALINA_HOME}/temp -XX:MaxMetaspaceSize=1g -XX:+AlwaysPreTouch"
#CATALINA_OPTS=" ${JAVA_OPTS} ${JVM_OPTS} ${VGC_OPTS} ${REP_OPTS} ${DMP_OPTS} ${L4J_OPTS}"
# CATALINA_OPTS="${VISUAL_VM_OPTS} ${JAVA_OPTS} ${JVM_OPTS} ${VGC_OPTS} ${REP_OPTS} ${L4J_OPTS}"
CATALINA_OPTS="${JAVA_OPTS} ${JVM_OPTS} ${VGC_OPTS} ${REP_OPTS} ${L4J_OPTS}"
I have tuned the JVM options; please suggest any corrections.
The output of kubectl describe pod {pod-name} is below:
Events:
  Type     Reason     Age                  From               Message
  ----     ------     ----                 ----               -------
  Normal   Scheduled  15m                  default-scheduler  Successfully assigned default/bloomreach-cms to ip-172-17-9-191.eu-west-1.compute.internal
  Normal   Pulling    15m                  kubelet            Pulling image "84512547.amazonaws.com/default/bloomreach-cms:20"
  Normal   Pulled     15m                  kubelet            Successfully pulled image "84512547.amazonaws.com/default/bloomreach-cms:20" in 5.157s (5.157s including waiting). Image size: 394696117 bytes.
  Normal   Created    15m                  kubelet            Container created
  Normal   Started    15m                  kubelet            Container started
  Warning  Unhealthy  3m9s (x41 over 13m)  kubelet            Readiness probe failed: Get "http://172.17.8.70:8080/site/ping/": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
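To see whether /site/ping/ itself is slow, and not only the kubelet's view of it, the request could be repeated by hand with the same 15-second timeout as the probe. The following is only a minimal sketch, assuming curl is available on a pod or node that can reach the pod IP; probe_once, probe_loop and PROBE_INTERVAL are hypothetical names introduced here for illustration.

```shell
# Hypothetical helper: request a URL once and print "<http_code> <total_seconds>".
probe_once() {
  url="$1"
  timeout="${2:-15}"   # default mirrors timeoutSeconds: 15 in the probe
  # -s silent, -o /dev/null discards the body, -m caps the whole request,
  # -w prints the status code and total time once the transfer ends
  curl -s -o /dev/null -m "$timeout" -w '%{http_code} %{time_total}\n' "$url"
}

# Hypothetical helper: repeat the request, spaced like periodSeconds: 10.
probe_loop() {
  url="$1"
  count="${2:-5}"
  i=0
  while [ "$i" -lt "$count" ]; do
    # curl exits non-zero on a timeout or connection failure
    probe_once "$url" 15 || echo "request failed or timed out"
    i=$((i + 1))
    sleep "${PROBE_INTERVAL:-10}"
  done
}
```

For example, probe_loop http://172.17.8.70:8080/site/ping/ 5 would make five requests against the pod IP from the event above. Total times close to 15 seconds would suggest that the endpoint is slow to answer, which would match the "context deadline exceeded" message.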
The 502 keeps occurring. When I searched, I found that /site/ping/ is reportedly a heavy request path that can cause instability, and the suggestion was to use a custom health-check endpoint, i.e. /cmsinternal/ping, which is said to serve Kubernetes health checks correctly.
I would like help with this 502 issue. If needed, can we add /cmsinternal/ping, and how? If I stop using /site/ping/, will that affect the health checking that is actually needed?

