I have had an ongoing problem running a Tomcat Java web application in a Docker container (which I refer to as a 'task' in this post) hosted on AWS ECS (Elastic Container Service).
We notice that the task climbs to 97% CPU usage (according to the AWS metrics), and while it sometimes climbs back down to a lower CPU usage on its own, the task generally just shuts down.
Luckily, ECS spawns a new Docker task and starts the application up again (although it takes 5-10 minutes for everything to come back online, which is a huge amount of time during our production day!).
We don't have any upper CPU limit configured on the ECS task (perhaps we should?). In a previous project we increased the CPU on the ECS host from 8 vCPU to 32 vCPU, and sure enough this particular Docker task climbed to 97% of the ECS host CPU and stayed there persistently throughout the project.
This week we increased the CPU from 8 vCPU to 16 vCPU (with 64 GB of memory), and we are seeing the same thing. I also increased the soft memory limit of the task to 4 GB (it was originally set to 2 GB); I can see the memory usage climb, but it definitely does not go above about 6 GB.
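For reference, here is a minimal diagnostic sketch (my own throwaway class, not part of the application) that I can run inside the container to see what the JVM itself thinks its CPU and memory limits are. I mention it because, as far as I understand, the heap ceiling the JVM chooses by default is not necessarily tied to the ECS soft memory limit:

```java
// Throwaway diagnostic: print what the JVM sees as its CPU and heap limits
// inside the container, to compare against the ECS task's soft/hard limits.
public class JvmLimits {
    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        System.out.printf("Available processors: %d%n", rt.availableProcessors());
        System.out.printf("Max heap (bytes):     %d%n", rt.maxMemory());
        System.out.printf("Total heap (bytes):   %d%n", rt.totalMemory());
        System.out.printf("Free heap (bytes):    %d%n", rt.freeMemory());
    }
}
```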
Going by the stack trace (which is too long to post), there are no OutOfMemoryError entries logged by the Tomcat/Java application.
It usually starts with a JDBC error (maximum connections reached / connection pool exhausted), then components being deregistered, the logging system shutting down, and so on.
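To clarify what I mean by the pool error, below is a rough sketch of the kind of settings involved. I'm assuming a programmatically configured Tomcat JDBC pool here; the class name, URL, credentials, and numbers are placeholders, not our real configuration:

```java
// Sketch only: illustrative Tomcat JDBC pool settings, not our actual config.
import org.apache.tomcat.jdbc.pool.DataSource;
import org.apache.tomcat.jdbc.pool.PoolProperties;

public class PoolConfig {
    public static DataSource buildPool() {
        PoolProperties p = new PoolProperties();
        p.setUrl("jdbc:postgresql://db.example.internal:5432/app"); // placeholder URL
        p.setDriverClassName("org.postgresql.Driver");              // placeholder driver
        p.setUsername("app");                                       // placeholder credentials
        p.setPassword("secret");
        p.setMaxActive(100);  // "maximum connections": pool is exhausted once all are checked out
        p.setMaxIdle(10);
        p.setMaxWait(30000);  // ms to wait for a free connection before the error we see is thrown
        DataSource ds = new DataSource();
        ds.setPoolProperties(p);
        return ds;
    }
}
```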
Is the ECS host shutting down the task, or is the task shutting itself down after hitting CPU/memory constraints (i.e. Java/Tomcat shutting itself down)?

Furthermore, in our ECS agent log I can see a statement about 'Exit 143' -- is this ECS terminating the task, or the container itself exiting?

Would it be best to set an upper CPU limit on the task (and, as far as JVM memory goes, let it use whatever is available to it)?