I upgraded an old (about 5 or 7 years) Fedora Server to 32 and now have a problem: a process is being stopped by the OS. The only "application code" that is new is the Java version (open-jdk), and that doesn't matter to the Java code. The hang is "restartable" if the process was started from the command line (and maybe otherwise? If so, I don't know how yet), as it goes into a "Stopped" state (reported by ps as Tl) that can be undone by foregrounding (more on that below).
So, something changed about the OS itself.
There are multiple ways to start the code that's being stopped, but I've been starting it via the command line as a matter of convenience, backgrounding it via &. However it's started, it runs a Java-based daemon that looks for work to do and when it finds something, it launches a child process - also Java - which does some processing on its own, asynchronously.
If there's no child launch, there's no stop.
If there's a child launch, the child runs for a bit and then the whole process tree, from the command-line-started daemon on down, is put in the "Stopped" state. I probably can't easily give it enough load to start multiple children before one of them stops the dispatching daemon, because the system is so fast that it gets to the stop point nearly instantly.
The child's functions are broken into "Prelude", "Main", "Epilogue", and "Cleanup", and it reports what it's doing, so we know where it is. It has always reached Main by the time it gets stopped, so I know for sure it's running various sorts of setup before it gets to the stop - it's not merely stopping when the child Java is launched. Both processes are put in the Tl state.
From the command line, jobs shows the job as "Stopped". You can then fg back to the daemon itself, and it then prompts for a password (I've NEVER seen this behavior before?!). Once foregrounded, all stopped threads / processes are resumed and it runs to completion like nothing ever happened. I usually ^Z and then bg to return to the previously backgrounded state...
On a lark, I tried NOT backgrounding, and when it gets to the problem spot, I just get a prompt for a password! ... I haven't ever seen anything like this behavior before on ANY system. This MUST be a strong clue.
Investigating the Tl state has so far yielded absolutely nothing, though the T state is reasonably documented. Other than the fg trick, I don't yet know how to restart such a stopped job. (A method to restart such a stopped process tree without having to enter a password would be a good temporary workaround!) Nor have I yet had time to learn everything that can put something into the T state. But that's my next focus, after I finish pursuing something I share below.
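For reference while digging into the T state: per the ps(1) man page, the first STAT letter T means "stopped by job control signal" and a trailing l means the process is multi-threaded. The sketch below is one hedged way to inspect that from /proc on Linux. The function names (parse_stat_line, describe) and the STATE_NAMES table are my own, not part of any tool. It also compares the process group with the terminal's foreground process group (tpgid), which tells you whether the process is currently a background job on its terminal.

```python
from pathlib import Path

# Meanings of the first STAT letter, as listed in the ps(1) man page.
STATE_NAMES = {
    "R": "running or runnable",
    "S": "interruptible sleep",
    "D": "uninterruptible sleep",
    "T": "stopped by job control signal",
    "t": "stopped by debugger during tracing",
    "Z": "zombie",
    "I": "idle kernel thread",
}


def parse_stat_line(line: str) -> dict:
    """Parse the leading fields of one /proc/<pid>/stat line."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or parentheses, so split on the LAST ')' in the line.
    head, _, tail = line.rpartition(")")
    pid_str, _, comm = head.partition("(")
    fields = tail.split()
    return {
        "pid": int(pid_str),
        "comm": comm,
        "state": fields[0],
        "ppid": int(fields[1]),
        "pgrp": int(fields[2]),     # process group ID
        "session": int(fields[3]),
        "tty_nr": int(fields[4]),   # controlling terminal (0 if none)
        "tpgid": int(fields[5]),    # foreground process group of that terminal
    }


def describe(pid: int) -> str:
    """Return a one-line, human-readable summary for a live process."""
    info = parse_stat_line(Path(f"/proc/{pid}/stat").read_text())
    meaning = STATE_NAMES.get(info["state"], "unknown state")
    placement = "foreground" if info["pgrp"] == info["tpgid"] else "background"
    return (
        f"{info['pid']} ({info['comm']}): {info['state']} = {meaning}, "
        f"pgrp {info['pgrp']}, {placement} on its terminal"
    )
```

Calling describe() on the daemon's PID and on each child's PID while they are stuck would show whether they share one process group and whether that group is in the background at the moment of the stop.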
This is completely unacceptable behavior and I need to fix it ASAP.
I thought maybe this was modified behavior of the OOM killer, but it isn't killing jobs?! Just in case, I observed memory: there's not a lot of actual memory free, but there's always plenty of virtual memory (swap space) available.
The only other quirk that might be playing a role is that while I upgraded the server to Fedora Server 32, if I'm not supervising the boot at the console, it boots into 31 instead - VERY frustrating, and I've already tried fruitlessly to fix this. It appears to be an old bug in grub that shows up because the system was "upgraded". So it's presently running as 31 - it's a server and restarting in person at the console is painful! - could this be the cause?! I'm guessing not, but I don't know.
I turned SELinux off, just to be sure that wasn't the cause, and it's not (presuming the NSA stuff can actually be turned off).
Desperate for a solution ASAP, I've done what research I can, and a web search suggested a Fedora 30 issue that was closed regarding EnableMultipleStreamsException. But this appears to NOT be the problem. ... Still researching!
WORKAROUND Attempts
I tried using kill -CONT <pid> but it only freed the daemon, and didn't work on the child processes, even when I applied the same command directly to them. My wild-assed guess is that they are waiting for the password to be supplied from somewhere. NOTABLY, the state changed from Tl to Sl! What does this mean? IDK.
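As a variation on the kill attempt above, here is a minimal sketch that sends SIGCONT to the whole process group in one go rather than PID by PID. It assumes the daemon and its children share one process group, which is the usual case for children launched from Java but which I have not verified on this system. resume_group is a made-up helper name. One caveat, hedged: if a resumed process then tries to read from the terminal while its group is still in the background, the kernel will simply stop it again with SIGTTIN, which could look exactly like "the signal didn't work" on the children.

```python
import os
import signal


def resume_group(pid: int) -> int:
    """Send SIGCONT to every process in pid's process group.

    Returns the process group ID that was signalled.
    """
    pgid = os.getpgid(pid)            # look up the group the process belongs to
    os.killpg(pgid, signal.SIGCONT)   # one signal delivered to the whole group
    return pgid
```

Usage would be resume_group(<daemon pid>) from a Python prompt running as the same user. The shell equivalent is kill -CONT with a negative process group ID.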
SOMETHING IN THE OS STOPPED THAT TREE OF PROCESSES!
And I have to turn that something off or go to another OS, however painful that is.
Has anybody else seen this before, or know what to do?