I'm trying to setup a simple ppp tunnel over ssh. It works on several machines just fine. But on one machine, pppd gets "stuck":
> pgrep pppd | xargs ps up
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 4178 0.0 0.1 3020 1088 pts/1 Ds+ 05:28 0:00 /usr/sbin/pppd
Any attempt to kill it (even sudo kill -9 4178
) has no effect that I can see. strace -p 4178
also hangs similarly. After it has been started for a while, I start getting messages in dmesg
like shown below.
It is started like so from another machine:
ssh -t root@server /usr/sbin/pppd passive noauth
When I do this to one of the machines that work, the remote end's pppd
spits out garbage/binary data to the console (as expected). When I do it to the one that fails, I get no output from pppd, but the ssh session eventually times out. If I instead ssh to the machine, and then run /usr/sbin/pppd passive noauth
in a separate step I also get the expected binary output.
I now have a couple of questions:
- What could be up with the one machine where pppd fails? I don't even know where to start looking...
- What could be the difference between
ssh -t root@server /usr/sbin/pppd passive noauth
in a single step andssh root@server
and/usr/sbin/pppd passive noauth
in two steps? - How can it be that I can't kill the process even with
sudo kill -9
? The only way I know is to reboot.
(I've tried searching for something similar but didn't get anywhere so I'm sorry I don't have any more leads)
Any ideas?
The problem machine runs in debian on VMware "hardware" (as do the ones that work) and it exhibits the problem when cloned and on both debian lenny (original) and squeeze (after upgrade)
dmesg
entries:
[ 1198.727248] INFO: task pppd:4178 blocked for more than 120 seconds.
[ 1198.727507] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1198.727904] pppd D ece2dc9c 0 4178 4174 0x00000004
[ 1198.727908] 00000098 00000082 f2503520 ece2dc9c 0000b1e7 00000000 c148d1c0 c148d1c0
[ 1198.727913] f2a06100 f6e071c0 00000000 ece2dc18 f5cd07e0 00000000 ece2d400 ece2dc9d
[ 1198.727918] 00c52300 ece2dcbc f67bfef8 ec98e480 f291cec0 00000000 c10cf5b0 c10dfd21
[ 1198.727923] Call Trace:
[ 1198.727926] [<c10cf5b0>] ? nameidata_to_filp+0x37/0x41
[ 1198.727929] [<c10dfd21>] ? dput+0x21/0xb7
[ 1198.727932] [<c11cfecc>] ? tty_ldisc_ref_wait+0x5f/0x76
[ 1198.727935] [<c104de7a>] ? wake_up_bit+0x5c/0x5c
[ 1198.727938] [<c11cb91b>] ? tty_ioctl+0x85f/0x8ba
[ 1198.727941] [<c10fec18>] ? do_lock_file_wait+0x3d/0xd9
[ 1198.727944] [<c1162c97>] ? _copy_from_user+0x2b/0x102
[ 1198.727946] [<c11cb0bc>] ? tty_check_change+0xb9/0xb9
[ 1198.727949] [<c10dbeb7>] ? do_vfs_ioctl+0x485/0x4c7
[ 1198.727952] [<c10db59a>] ? do_fcntl+0x24f/0x3a2
[ 1198.727954] [<c10dbf3a>] ? sys_ioctl+0x41/0x58
[ 1198.727957] [<c12c6a1f>] ? sysenter_do_call+0x12/0x28
[ 1318.457225] INFO: task sshd:4174 blocked for more than 120 seconds.
[ 1318.457500] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1318.457896] sshd D f25024cc 0 4174 2393 0x00000000
[ 1318.457901] 00000098 00000086 f2a06940 f25024cc 0000b246 00000000 c148d1c0 c148d1c0
[ 1318.457906] f2503520 f6e071c0 00000000 3f056585 0000000f ece2d4bc 3f056585 f2503520
[ 1318.457911] ec98bb38 ec98bbdc 00000000 00000000 00000000 c12c09b5 f2503520 c10327cb
[ 1318.457916] Call Trace:
[ 1318.457926] [<c12c09b5>] ? schedule_hrtimeout_range_clock+0x3c/0xd9
[ 1318.457931] [<c10327cb>] ? try_to_wake_up+0x13f/0x13f
[ 1318.457935] [<c11cfecc>] ? tty_ldisc_ref_wait+0x5f/0x76
[ 1318.457940] [<c104de7a>] ? wake_up_bit+0x5c/0x5c
[ 1318.457943] [<c11c9ad3>] ? tty_poll+0x32/0x5e
[ 1318.457947] [<c10dd4d5>] ? do_select+0x2a1/0x42e
[ 1318.457950] [<c10dcb83>] ? poll_freewait+0x69/0x69
[ 1318.457953] [<c10dcc25>] ? __pollwait+0xa2/0xa2
[ 1318.457955] [<c10dcc25>] ? __pollwait+0xa2/0xa2
[ 1318.457958] [<c10dcc25>] ? __pollwait+0xa2/0xa2
[ 1318.457960] [<c10dcc25>] ? __pollwait+0xa2/0xa2
[ 1318.457963] [<c10dcc25>] ? __pollwait+0xa2/0xa2
[ 1318.457965] [<c10dcc25>] ? __pollwait+0xa2/0xa2
[ 1318.457968] [<c10dcc25>] ? __pollwait+0xa2/0xa2
[ 1318.457971] [<c10429c2>] ? lock_timer_base+0x19/0x35
[ 1318.457974] [<c1042eb5>] ? __mod_timer+0x10c/0x116
[ 1318.457977] [<c1042f89>] ? mod_timer+0x69/0x6e
[ 1318.457981] [<c121325d>] ? sk_reset_timer+0xc/0x16
[ 1318.457984] [<c1252f57>] ? tcp_event_new_data_sent+0x66/0x6b
[ 1318.457987] [<c1255b85>] ? tcp_write_xmit+0x7a7/0x86a
[ 1318.457990] [<c121760d>] ? __alloc_skb+0x50/0xfd
[ 1318.457994] [<c12c12bc>] ? _raw_spin_lock_bh+0x8/0x1e
[ 1318.457996] [<c1212e98>] ? release_sock+0x10/0xc4
[ 1318.457999] [<c124b543>] ? tcp_sendmsg+0x6dd/0x7b7
[ 1318.458003] [<c1162c97>] ? _copy_from_user+0x2b/0x102
[ 1318.458006] [<c10dd7a0>] ? core_sys_select+0x13e/0x1c3
[ 1318.458009] [<c12102a3>] ? sock_aio_write+0xc0/0xd4
[ 1318.458012] [<c10d0655>] ? do_sync_write+0xa0/0xe4
[ 1318.458016] [<c10b141c>] ? handle_mm_fault+0x222/0x238
[ 1318.458019] [<c10f6096>] ? fsnotify+0x1de/0x1f9
[ 1318.458022] [<c10dd9e8>] ? sys_select+0x6e/0x8f
[ 1318.458024] [<c10d105e>] ? sys_write+0x3c/0x63
[ 1318.458028] [<c12c6a1f>] ? sysenter_do_call+0x12/0x28
I since remembered that that machine is running a new 2.6.39 based custom kernel I compiled myself. When I reverted to the stock Ubuntu kernel, it worked fine. As far as I remember, I didn't apply any patches to the kernel, I just took a debian unstable
.config
and took it from there.If anybody else runs into this, I've created a gist with the two kernel configs [working, broken] for future reference. But for now, I'll just stick with the Ubuntu stock kernel for
pppd
.However, regardless of what config options that are active in the 2.6.39 kernel, a pppd process that can't be killed is still odd...