Working with OpenShift on a daily basis, I run into several situations where a pod crashes. Given my background in Java, I will talk about Java here.
Let me list a few situations and the next steps:
| Situation | Next steps |
| --- | --- |
| Pod crashes with OOME | The Java process uses more heap than it was supposed to. It would generate a heap dump on the OOME (with -XX:+HeapDumpOnOutOfMemoryError), but it might exit right away if -XX:+ExitOnOutOfMemoryError is set. |
| Pod crashes with the OOM killer | Check dmesg on the OCP node and look for OOM killer messages. |
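To make the first row concrete, here is a minimal sketch (the class name is hypothetical, not from the post) that exhausts the heap. Run with the HotSpot flags mentioned above, the JVM writes a heap dump and exits on the first OOME, which terminates the container.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch: allocate until the heap is exhausted.
// Example run:
//   java -Xmx128m -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp \
//        -XX:+ExitOnOutOfMemoryError HeapExhaustion
// On the first OutOfMemoryError the JVM dumps the heap to /tmp and exits,
// so the container dies and the orchestrator restarts the pod.
public class HeapExhaustion {
    public static void main(String[] args) {
        List<byte[]> leak = new ArrayList<>();
        while (true) {
            leak.add(new byte[1024 * 1024]); // keep 1 MiB chunks reachable forever
        }
    }
}
```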
How to know why my pod is crashing?
Let's pretend you don't know why the Java pod is crashing (a Java pod here meaning a pod with a single container running a Java process). The first step is to find out whether the pod is hitting an OOME inside the JVM or being terminated by the OOM killer.
An OOME is handled by the JVM itself; however, because containers usually run with ExitOnOutOfMemoryError, the JVM exits on the first OOME, the container terminates, and the orchestrator respawns a new pod after a back-off period. A quick way to tell the two cases apart is the container's last exit code (visible in `oc describe pod`): an OOM kill shows up as exit code 137 with reason OOMKilled, while a JVM-initiated exit reports a different code.
The OOM killer, on the other hand, is an external agent (the kubelet on the OCP node, or the kernel enforcing cgroup limits) acting on the container and terminating it when a certain condition is met. One example is lack of node resources: if the kubelet needs to schedule a pod but the node has no resources left, it may evict BestEffort QoS pods rather than fail to place Guaranteed ones.
Another cause is native (off-heap) allocation breaching the cgroup memory limit, in which case the kernel kills the process and the container exits.
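As an illustration of that native-allocation case, here is a small sketch (again a hypothetical class name, not from the post) that grows off-heap memory through direct ByteBuffers. The heap itself stays nearly empty, so the heap-related flags above never fire; but if the container's memory limit sits close to -Xmx (which is roughly the default cap for direct memory), total cgroup usage can cross the limit and the process gets OOM-killed before any Java-level error appears. Setting -XX:MaxDirectMemorySize explicitly is one common mitigation.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// Minimal sketch: grow native (off-heap) memory with direct buffers.
// Example run inside a container with a 512Mi memory limit:
//   java -Xmx384m NativeMemoryGrowth
// Each direct buffer is allocated outside the Java heap but is still
// counted by the cgroup. If total usage crosses the container limit
// before the JVM's own direct-memory cap is reached, the kernel
// OOM-kills the process and the container exits with code 137
// (no heap dump, no Java stack trace).
public class NativeMemoryGrowth {
    public static void main(String[] args) throws InterruptedException {
        List<ByteBuffer> buffers = new ArrayList<>();
        while (true) {
            buffers.add(ByteBuffer.allocateDirect(16 * 1024 * 1024)); // 16 MiB of native memory
            Thread.sleep(100); // slow the growth so it is easy to observe
        }
    }
}
```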