We had a server effectively go down this morning. SSH access cut out, and at least temporarily network access went down as well. We were able to log in using out-of-band access and were presented with a screen full of "Init: cannot fork, retry.." messages.
When we tried to log in with a userid and a bad password, we got the normal "invalid user/pass" error. However, if we typed in a correct userid and password, we were simply presented with the MOTD and then the login screen again. It looks like the system was no longer able to launch any new processes (logging in successfully should launch a shell; if it can't, I guess it drops you back at login?).
I found a description of the issue at Red Hat's knowledgebase (https://access.redhat.com/site/solutions/39497), but there is very little supplementary information on the error, just a suggested solution.
What exactly does nproc do? Is it a hard limit on the number of processes the system can have running at any point in time? When nproc is exceeded does it cause impacts like we saw? Is there any way to set it to unlimited? If not, how can we know what a safe or unsafe range is?
Any help or guidance would be very much appreciated, since it caused production issues and is now on the plate of several layer-8 folks :(
Edit: Also in /var/log/messages:
May 31 15:26:00 servername udevd[1637]: udev_event_run: fork of child failed: Resource temporarily unavailable
May 31 15:26:00 servername last message repeated 3 times
May 31 15:26:00 servername udevd-event[2461]: run_program: fork of '/lib/udev/udev_run_hotplugd' failed: Resource temporarily unavailable
May 31 15:26:00 servername udevd-event[2461]: run_program: fork of '/lib/udev/udev_run_devd' failed: Resource temporarily unavailable
May 31 15:26:00 servername udevd[1637]: udev_event_run: fork of child failed: Resource temporarily unavailable
The error message means that the server hit a limit on the number of processes. There are two limits, hard and soft. When you fork(), you create a new process from the existing process. Here, some condition is preventing fork() from succeeding.
You have a problem forking udev child processes. I guess this is happening at boot time. Note this path in your log:
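To see the soft and hard limits mentioned above on a given box, the per-user process limit (nproc, exposed to programs as RLIMIT_NPROC) can be read from a running process. This is only a minimal sketch, assuming Linux and Python 3; the helper name nproc_limits is made up for illustration.

```python
import resource


def nproc_limits():
    """Return (soft, hard) RLIMIT_NPROC values for the current process.

    Each value is an int, or the string "unlimited" when no limit is set.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)

    def fmt(value):
        # RLIM_INFINITY is the sentinel the kernel uses for "no limit".
        return "unlimited" if value == resource.RLIM_INFINITY else value

    return fmt(soft), fmt(hard)


if __name__ == "__main__":
    soft, hard = nproc_limits()
    print(f"nproc soft limit: {soft}, hard limit: {hard}")
```

The soft limit is the one actually enforced; an unprivileged process can raise it only as far as the hard limit. Note that this limit is counted per real user ID, not across the whole system.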
/lib/udev/udev_run_hotplugd
So there is some hot-pluggable device there. Otherwise, I don't see a reason for that helper program to be run.
Two suggestions for now -
1) If you can reproduce it, strace it if possible. Get the syscall where it is failing. Much easier that way. I don't exactly remember which syscall it is.
2) Run udev in debug mode. In the udev config (normally /etc/udev/udev.conf), change
udev_log=info
to debug, BUT test it first. It produces a HUGE amount of logs, and without a good ring buffer size or an enormously wide monitor it is fairly common to miss messages shown on the terminal. I have seen this issue a lot, though. Also, why not ask the Red Hat folks if you have a subscription?
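On what to look for in the strace output from suggestion 1: "Resource temporarily unavailable" is the standard text for errno EAGAIN, which fork(2) documents for hitting a process/thread limit, while ENOMEM is what it documents for running out of memory. On Linux the failing call will usually show up as clone rather than fork, since glibc typically implements fork() that way. A small Python sketch of the distinction; the helper name is_fork_limit_error is hypothetical.

```python
import errno
import os


def is_fork_limit_error(exc):
    """True if exc is an OSError carrying EAGAIN.

    fork(2) documents EAGAIN for limit-related failures (for example
    RLIMIT_NPROC), as opposed to ENOMEM for memory exhaustion.
    """
    return isinstance(exc, OSError) and exc.errno == errno.EAGAIN


if __name__ == "__main__":
    # Print the C library's text for EAGAIN to compare with the syslog lines.
    print(os.strerror(errno.EAGAIN))
    try:
        pid = os.fork()
        if pid == 0:
            os._exit(0)  # child: exit immediately
        os.waitpid(pid, 0)  # parent: reap the child
    except OSError as exc:
        print("limit hit" if is_fork_limit_error(exc) else f"other error: {exc}")
```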
Sounds like either (1) you ran out of memory plus swap space, or (2) an errant process flooded your process table, preventing new processes from being created.
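To check for case (2) while the box is still reachable, one option is to count tasks per user by walking /proc and see whether a single user is far ahead of the rest. This is a rough sketch, assuming Linux with /proc mounted and Python 3; the function name tasks_per_uid is invented for this example. It counts threads as well as processes, since both count toward the nproc limit.

```python
import os
from collections import Counter


def tasks_per_uid(proc_root="/proc"):
    """Return a Counter mapping real UID -> number of tasks (threads) seen."""
    counts = Counter()
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # only numeric directories are processes
        uid = None
        threads = 1
        try:
            with open(os.path.join(proc_root, entry, "status")) as fh:
                for line in fh:
                    if line.startswith("Uid:"):
                        uid = int(line.split()[1])  # first field is the real UID
                    elif line.startswith("Threads:"):
                        threads = int(line.split()[1])
        except OSError:
            continue  # process exited or is unreadable; skip it
        if uid is not None:
            counts[uid] += threads
    return counts


if __name__ == "__main__":
    # Show the five users with the most tasks.
    for uid, count in tasks_per_uid().most_common(5):
        print(f"uid {uid}: {count} tasks")
```

If one UID's count is close to its nproc limit, that user's processes are the ones to look at first.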