I was handed a box today that wasn't coming back from a reboot. After a fair bit of work with live/rescue disks I've hit the problem I'm now stuck on: various low-level tools (ls, grep, etc.) are segfaulting. Reinstalling the affected package fixes them, but the fix keeps reverting.
One of the various segfaulting programs is grep. A random example:
$ grep eth0 /etc/sysconfig/network-scripts/*
Segmentation fault
However a reinstall of the grep package resolves the issue:
$ yum reinstall grep
Loaded plugins: fastestmirror
Setting up Reinstall Process
Loading mirror speeds from cached hostfile
[...]
Installed:
grep.i386 0:2.5.1-55.el5
Complete!
$ grep eth0 /etc/sysconfig/network-scripts/*
/etc/sysconfig/network-scripts/ifcfg-eth0:DEVICE=eth0
[...]
But when the box reboots, everything is broken again! I can even replicate this by simply switching run levels.
$ init 4
$ grep eth0 /etc/sysconfig/network-scripts/*
Segmentation fault
I can repeat my reinstall fix, but when I switch back to runlevel 5 it happens again.
I've included a copy of an strace for the grep command below, but as I say it affects "ls" too, which I also fixed with a reinstall of coreutils.
execve("//bin/grep", ["grep", "eth0", "/etc/sysconfig/network-scripts/i", "/etc/sysconfig/network-scripts/i", "/etc/sysconfig/network-scripts/i", "/etc/sysconfig/network-scripts/i", "/etc/sysconfig/network-scripts/i", "/etc/sysconfig/network-scripts/i", "/etc/sysconfig/network-scripts/i", "/etc/sysconfig/network-scripts/i", "/etc/sysconfig/network-scripts/i", "/etc/sysconfig/network-scripts/i", "/etc/sysconfig/network-scripts/i", "/etc/sysconfig/network-scripts/i", "/etc/sysconfig/network-scripts/i", "/etc/sysconfig/network-scripts/i", ...], [/* 24 vars */]) = 0
brk(0) = 0x9bd0000
access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY) = 3
fstat64(3, {st_mode=S_IFREG|0644, st_size=29251, ...}) = 0
mmap2(NULL, 29251, PROT_READ, MAP_PRIVATE, 3, 0) = 0xb7fe2000
close(3) = 0
open("/lib/libpcre.so.0", O_RDONLY) = 3
read(3, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0\20\17\0\0004\0\0\0"..., 512) = 512
fstat64(3, {st_mode=S_IFREG|0755, st_size=117448, ...}) = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7fe1000
mmap2(NULL, 116176, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x1d3000
mmap2(0x1ef000, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1c) = 0x1ef000
close(3) = 0
open("/lib/libc.so.6", O_RDONLY) = 3
read(3, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0\340_\1\0004\0\0\0"..., 512) = 512
fstat64(3, {st_mode=S_IFREG|0755, st_size=1686224, ...}) = 0
mmap2(NULL, 1410500, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x8aa000
mmap2(0x9fd000, 12288, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x152) = 0x9fd000
mmap2(0xa00000, 9668, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0xa00000
close(3) = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7fe0000
set_thread_area({entry_number:-1 -> 6, base_addr:0xb7fe06c0, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}) = 0
mprotect(0x9fd000, 8192, PROT_READ) = 0
mprotect(0x818000, 4096, PROT_READ) = 0
munmap(0xb7fe2000, 29251) = 0
--- SIGSEGV (Segmentation fault) @ 0 (0) ---
+++ killed by SIGSEGV +++
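In the trace above the process dies right after the dynamic loader has mapped /lib/libpcre.so.0 and /lib/libc.so.6, before grep opens any of its input files, so those libraries are the obvious things to check next. As a rough sketch (the function name is made up for illustration, and it only matches the open(...) = fd form seen in this trace), the successfully opened libraries can be pulled out of a saved strace log like this:

```shell
# List shared libraries that a traced process opened successfully.
# Reads strace output on stdin. Run this from a rescue disk or another
# trusted machine, since sed and sort on the affected box may crash too.
list_opened_libs() {
    # Keep only open() calls that returned a file descriptor and whose
    # path ends in .so or .so.N (this skips /etc/ld.so.cache).
    sed -n 's/^open("\([^"]*\.so[.0-9]*\)", .* = [0-9][0-9]*$/\1/p' | LC_ALL=C sort -u
}

# Example (grep.strace is a hypothetical file holding the trace):
#   list_opened_libs < grep.strace | xargs rpm -qf
```

Feeding the resulting paths to rpm -qf shows which packages own them, and rpm -Vf verifies those packages' files against the RPM database, with the caveat that the database itself cannot be trusted if the box has been tampered with.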
Anyone got some clever ideas about what's going on? I don't intend to trust this box (hardware or software), but I do want to get to the bottom of this.
As you said in your comment, if your server has been compromised, you certainly have a rootkit installed. If it comes back after a reboot, it is a nasty one (with multiple strategies to reinstall itself in different places, custom libraries wrapping the real ones, and a kernel module intercepting system calls in order to hide itself).
In this case the segfaults are caused by the rootkit's custom libraries, which are not ABI-compatible with the libraries of your distribution.
The only real fix is to reinstall from scratch and carefully restore your data.
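If you want to see which files are being replaced before wiping the box, one option is to checksum the library and binary directories from a rescue environment, boot the installed system once so the breakage recurs, then checksum again from the rescue environment and compare. The sketch below assumes sha256sum, find, comm and sed are available on the rescue media; the function names and the mount point in the example are illustrative only.

```shell
# Record a checksum for every regular file under the given directories.
# Usage: snapshot_hashes OUTFILE DIR...
snapshot_hashes() {
    out=$1
    shift
    # Whole-line sort in the C locale so comm can compare snapshots later.
    find "$@" -xdev -type f -exec sha256sum {} + 2>/dev/null | LC_ALL=C sort > "$out"
}

# Print paths whose checksum differs between two snapshots, plus paths
# that exist in only one of them.
changed_files() {
    # comm -3 drops lines common to both files; the sed strips the
    # leading tab (if any), the hash and the separator, leaving the path.
    LC_ALL=C comm -3 "$1" "$2" | sed 's/^[[:space:]]*[0-9a-f]* [ *]//' | LC_ALL=C sort -u
}

# Example, with the installed system mounted at /mnt/sysimage (assumed):
#   snapshot_hashes /tmp/before /mnt/sysimage/lib /mnt/sysimage/bin
#   ... boot the installed system, then return to the rescue disk ...
#   snapshot_hashes /tmp/after /mnt/sysimage/lib /mnt/sysimage/bin
#   changed_files /tmp/before /tmp/after
```

Taking both snapshots from rescue media matters: if a kernel module really is hiding things, checksums computed on the running system cannot be relied on. Changed files alone do not prove a rootkit, since disk or memory corruption would also alter them, but the list tells you exactly what to inspect.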
You either have substantial disk corruption or bad memory in this system, and my money's on the latter. Run the appropriate hardware diagnostics for both, and start testing with one DIMM at a time removed.
I suspect the problem is file system corruption caused by a bad disk or RAID controller. First, I would check the SMART output for the health of the drive(s). Second, I would run memtest to rule out any problems with the RAM. Third, I would stress test the disks.
I highly doubt it's a rootkit.
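For the SMART step, a small wrapper around the smartctl -H verdict is enough to flag a drive for closer inspection. This is only a sketch: the function name is made up, the device name in the example is an assumption, and it relies on the verdict lines smartctl prints for ATA and SCSI drives.

```shell
# Return 0 if a `smartctl -H` report on stdin shows a healthy verdict.
# ATA drives print "... self-assessment test result: PASSED";
# SCSI/SAS drives print "SMART Health Status: OK".
smart_report_ok() {
    grep -Eq 'self-assessment test result: PASSED|SMART Health Status: OK'
}

# Example (run from rescue media, /dev/sda is an assumed device name):
#   smartctl -H /dev/sda | smart_report_ok || echo "inspect /dev/sda"
```

A PASSED verdict does not clear the drive; the full smartctl -a output, in particular the reallocated and pending sector counts, is still worth reading. Drives behind a RAID controller may need controller-specific options before smartctl can reach them.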