At one of my customer sites, the local guy shut down their local Solaris 10 x86 server, pulled the power inputs, moved it, and now it won’t start properly. It boots and then presents a prompt which lets you log in. This appears to be single user milestone (or equivalent).
Digging into it, I think that SMF isn’t permitting the system to go multi-user. SMF was generating a ton of errors on autofs; after some fooling with it, I got it to generate errors on inetd and nfs/client instead. This all tells me that the problem is in some SMF state file or database that needs to be fixed, deleted or recreated, but I don’t know what the actual issue is.
By “generate errors”, I mean that every second I get a message on the console saying “Method or service exit timed out. Killing contract <#>.” This makes interacting with the computer difficult.
Running svcs -xv shows the service as “enabled”, in state “disabled”, reason “Start method is running”. Fooling with svcadm on the service does nothing, except confirm that the service is not in a Maintenance state.
Logs in /lib/svc/log/$SERVICE just tell you that this loop has been happening once per second. Logs in /etc/svc/volatile/$SERVICE confirm that at boot SMF attempts to start the service and immediately stops it, with no further entries. Note that system-log isn’t starting because system-log depends on autofs, so I have no syslog or dmesg.
Googling all these terms just tells me how to debug or fix autofs, nfs/client, inetd or rpc/gss individually. (rpc/gss was the dependency that SMF was using as an excuse to prevent nfs/client from “starting”: it claimed that rpc/gss was “undefined”, which is incorrect since this all used to work. I re-enabled it with inetadm, but inetd still won’t start properly.) But I think that the problem is SMF in general, not the individual services.
Doing a restore_repository to the “manifest_import” does nothing to improve, or even detectably change, the situation. I didn’t use a boot backup because the last boot(s) were not useful.
I have told the customer that since the valuable data directories are on a separate file system (which fsck’s as clean, so it is intact) we could just re-install Solaris 10 on the / partition. But that seems like an awfully Windows-like solution to inflict on this problem.
So. Any ideas what piece is broken and how I might fix it?
Update 1: I should probably mention that this system has two file systems, / and /export. Both fsck clean and mount properly.
A common root cause of this kind of problem is a failure to mount file systems because of file system corruption. This is becoming quite rare, especially for local file systems, but your customer didn't stack the odds in his favor: UFS logging was disabled (logging avoids most file system corruption caused by an abrupt power-off), and ZFS, which by design always stays consistent on disk, wasn't used.
You can enable verbose SMF startup by editing /boot/grub/menu.lst. The precise way depends on your Solaris version and update, but usually this is done by replacing
console=graphics
by
console=text -v -m verbose
in the line loading the kernel. Should you want to start in single-user mode, use
console=text -v -m verbose,milestone=single-user
To enable SMF debug mode, use
console=text -v -m debug
Note that you can use GRUB edit mode to temporarily set these options.
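As a rough illustration of the kernel-line edit described above, here is what a menu.lst entry might look like before and after. The title and kernel paths are assumptions for the sake of the example, not taken from your system. On many Solaris 10 updates the kernel line points at /platform/i86pc/multiboot and has no console=graphics at all, in which case the options are simply appended to that line.

```
# Before (hypothetical entry; your title and kernel path may differ)
title Solaris (example)
kernel$ /platform/i86pc/kernel/$ISADIR/unix -B console=graphics
module$ /platform/i86pc/$ISADIR/boot_archive

# After: text console, verbose kernel messages, verbose SMF startup
title Solaris (example)
kernel$ /platform/i86pc/kernel/$ISADIR/unix -B console=text -v -m verbose
module$ /platform/i86pc/$ISADIR/boot_archive

# Possible Solaris 10 multiboot form (assumption): append the options
# kernel /platform/i86pc/multiboot -v -m verbose
```

To try this without touching the file, press e on the entry in the GRUB menu, edit the kernel line the same way, and press b to boot. The change only lasts for that one boot.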