I have replaced a dead node in a DRBD dual-primary setup with OCFS2 on top. All the steps work; DRBD comes up Connected, Primary/Primary and UpToDate:
/proc/drbd
version: 8.3.13 (api:88/proto:86-96)
GIT-hash: 83ca112086600faacab2f157bc5a9324f7bd7f77 build by [email protected], 2012-05-07 11:56:36
1: cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate C r-----
ns:81 nr:407832 dw:106657970 dr:266340 al:179 bm:6551 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
until I try to mount the volume:
mount -t ocfs2 /dev/drbd1 /data/webroot/
mount.ocfs2: Transport endpoint is not connected while mounting /dev/drbd1 on /data/webroot/. Check 'dmesg' for more information on this error.
/var/log/kern.log
kernel: (o2net,11427,1):o2net_connect_expired:1664 ERROR: no connection established with node 0 after 30.0 seconds, giving up and returning errors.
kernel: (mount.ocfs2,12037,1):dlm_request_join:1036 ERROR: status = -107
kernel: (mount.ocfs2,12037,1):dlm_try_to_join_domain:1210 ERROR: status = -107
kernel: (mount.ocfs2,12037,1):dlm_join_domain:1488 ERROR: status = -107
kernel: (mount.ocfs2,12037,1):dlm_register_domain:1754 ERROR: status = -107
kernel: (mount.ocfs2,12037,1):ocfs2_dlm_init:2808 ERROR: status = -107
kernel: (mount.ocfs2,12037,1):ocfs2_mount_volume:1447 ERROR: status = -107
kernel: ocfs2: Unmounting device (147,1) on (node 1)
and below is the kernel log on node 0 (192.168.3.145):
kernel: : (swapper,0,7):o2net_listen_data_ready:1894 bytes: 0
kernel: : (o2net,4024,3):o2net_accept_one:1800 attempt to connect from unknown node at 192.168.2.93:43868
kernel: : (o2net,4024,3):o2net_connect_expired:1664 ERROR: no connection established with node 1 after 30.0 seconds, giving up and returning errors.
kernel: : (o2net,4024,3):o2net_set_nn_state:478 node 1 sc: 0000000000000000 -> 0000000000000000, valid 0 -> 0, err 0 -> -107
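The "attempt to connect from unknown node" message suggests that node 0's in-kernel node map does not contain 192.168.2.93. A quick way to check which nodes the kernel actually knows about, assuming configfs is mounted at /sys/kernel/config:

# for n in /sys/kernel/config/cluster/cpc/node/*; do echo "$n: $(cat $n/ipv4_address)"; done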
I'm sure /etc/ocfs2/cluster.conf on both nodes is identical:
/etc/ocfs2/cluster.conf

node:
    ip_port = 7777
    ip_address = 192.168.3.145
    number = 0
    name = SVR233NTC-3145.localdomain
    cluster = cpc

node:
    ip_port = 7777
    ip_address = 192.168.2.93
    number = 1
    name = SVR022-293.localdomain
    cluster = cpc

cluster:
    node_count = 2
    name = cpc
and the nodes can reach each other fine:
# nc -z 192.168.3.145 7777
Connection to 192.168.3.145 7777 port [tcp/cbt] succeeded!
but the O2CB heartbeat is not active on the new node (192.168.2.93):
/etc/init.d/o2cb status
Driver for "configfs": Loaded
Filesystem "configfs": Mounted
Driver for "ocfs2_dlmfs": Loaded
Filesystem "ocfs2_dlmfs": Mounted
Checking O2CB cluster cpc: Online
Heartbeat dead threshold = 31
Network idle timeout: 30000
Network keepalive delay: 2000
Network reconnect delay: 2000
Checking O2CB heartbeat: Not active
Here are the results of running tcpdump on node 0 while starting ocfs2 on node 1:
1 0.000000 192.168.2.93 -> 192.168.3.145 TCP 70 55274 > cbt [SYN] Seq=0 Win=5840 Len=0 MSS=1460 TSval=690432180 TSecr=0
2 0.000008 192.168.3.145 -> 192.168.2.93 TCP 70 cbt > 55274 [SYN, ACK] Seq=0 Ack=1 Win=5792 Len=0 MSS=1460 TSval=707657223 TSecr=690432180
3 0.000223 192.168.2.93 -> 192.168.3.145 TCP 66 55274 > cbt [ACK] Seq=1 Ack=1 Win=5840 Len=0 TSval=690432181 TSecr=707657223
4 0.000286 192.168.2.93 -> 192.168.3.145 TCP 98 55274 > cbt [PSH, ACK] Seq=1 Ack=1 Win=5840 Len=32 TSval=690432181 TSecr=707657223
5 0.000292 192.168.3.145 -> 192.168.2.93 TCP 66 cbt > 55274 [ACK] Seq=1 Ack=33 Win=5792 Len=0 TSval=707657223 TSecr=690432181
6 0.000324 192.168.3.145 -> 192.168.2.93 TCP 66 cbt > 55274 [RST, ACK] Seq=1 Ack=33 Win=5792 Len=0 TSval=707657223 TSecr=690432181
The RST flag is sent at the end of every 6-packet exchange: node 0 accepts the TCP connection, but as soon as node 1 sends its 32-byte payload (which looks like the o2net handshake) node 0 resets the connection. That matches the "attempt to connect from unknown node" error above.
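For reference, a capture like the one above can be reproduced with something along these lines (the interface name eth0 is an assumption):

# tcpdump -nn -i eth0 tcp port 7777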
What else can I do to debug this?
PS:
OCFS2 versions on the node 0:
- ocfs2-tools-1.4.4-1.el5
- ocfs2-2.6.18-274.12.1.el5-1.4.7-1.el5
OCFS2 versions on the node 1:
- ocfs2-tools-1.4.4-1.el5
- ocfs2-2.6.18-308.el5-1.4.7-1.el5
UPDATE 1 - Sun Dec 23 18:15:07 ICT 2012
Are both nodes on the same LAN segment? No routers etc.?
No, they are two VMware servers on different subnets.
Oh, while I remember - hostnames/DNS all set up and working correctly?
Sure, I added both the hostname and IP address of each node to /etc/hosts:
192.168.2.93 SVR022-293.localdomain
192.168.3.145 SVR233NTC-3145.localdomain
and they can connect to each other via hostname:
# nc -z SVR022-293.localdomain 7777
Connection to SVR022-293.localdomain 7777 port [tcp/cbt] succeeded!
# nc -z SVR233NTC-3145.localdomain 7777
Connection to SVR233NTC-3145.localdomain 7777 port [tcp/cbt] succeeded!
UPDATE 2 - Mon Dec 24 18:32:15 ICT 2012
Found the clue: my co-worker manually edited the /etc/ocfs2/cluster.conf file while the cluster was running, so the kernel still keeps the dead node's information in /sys/kernel/config/cluster/:
# ls -l /sys/kernel/config/cluster/cpc/node/
total 0
drwxr-xr-x 2 root root 0 Dec 24 18:21 SVR150-4107.localdomain
drwxr-xr-x 2 root root 0 Dec 24 18:21 SVR233NTC-3145.localdomain
(SVR150-4107.localdomain is the dead node in this case.)
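In principle, a stale node entry can be removed online by deleting its configfs directory; a sketch, which only works if nothing still holds a reference to the node:

# rmdir /sys/kernel/config/cluster/cpc/node/SVR150-4107.localdomain

Otherwise the whole cluster has to be stopped.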
I tried to stop the cluster in order to remove the dead node, but got the following error:
# /etc/init.d/o2cb stop
Stopping O2CB cluster cpc: Failed
Unable to stop cluster as heartbeat region still active
I'm sure the ocfs2 service is already stopped:
# mounted.ocfs2 -f
Device FS Nodes
/dev/sdb ocfs2 Not mounted
/dev/drbd1 ocfs2 Not mounted
There are no references anymore:
# ocfs2_hb_ctl -I -u 12963EAF4E16484DB81ECB0251177C26
12963EAF4E16484DB81ECB0251177C26: 0 refs
I also unloaded the ocfs2 kernel module to make sure:
# ps -ef | grep [o]cfs2
root 12513 43 0 18:25 ? 00:00:00 [ocfs2_wq]
# modprobe -r ocfs2
# ps -ef | grep [o]cfs2
# lsof | grep ocfs2
but nothing changed:
# /etc/init.d/o2cb offline
Stopping O2CB cluster cpc: Failed
Unable to stop cluster as heartbeat region still active
So the final question is: how to delete the dead node information without rebooting?
UPDATE 3 - Mon Dec 24 22:41:51 ICT 2012
Here are all of the running heartbeat regions:
# ls -l /sys/kernel/config/cluster/cpc/heartbeat/ | grep '^d'
drwxr-xr-x 2 root root 0 Dec 24 22:18 72EF09EA3D0D4F51BDC00B47432B1EB2
Reference counts for this heartbeat region:
# ocfs2_hb_ctl -I -u 72EF09EA3D0D4F51BDC00B47432B1EB2
72EF09EA3D0D4F51BDC00B47432B1EB2: 7 refs
Trying to kill it:
# ocfs2_hb_ctl -K -u 72EF09EA3D0D4F51BDC00B47432B1EB2
ocfs2_hb_ctl: File not found by ocfs2_lookup while stopping heartbeat
Any ideas?
Oh yeah! Problem solved.
Pay attention to the UUIDs: the running heartbeat region is 72EF09EA3D0D4F51BDC00B47432B1EB2, but the volume on disk now carries a different UUID (12963EAF4E16484DB81ECB0251177C26, the one I queried above with 0 refs). This mismatch could happen because I "accidentally" force re-formatted the OCFS2 volume, which generated a new UUID while the old heartbeat region stayed active. The problem I'm facing is similar to this one on the Ocfs2-users mailing list.
This is also the reason for the error above:

ocfs2_hb_ctl: File not found by ocfs2_lookup while stopping heartbeat

because ocfs2_hb_ctl cannot find any device with UUID 72EF09EA3D0D4F51BDC00B47432B1EB2 in /proc/partitions.

One idea came to my mind: can I change the UUID of an OCFS2 volume?
Looking through the tunefs.ocfs2 man page, I saw that the volume UUID can be reset, so I ran a command to set the volume's UUID back to the one the stale heartbeat region references:
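A sketch of that command, assuming this ocfs2-tools version supports --uuid-reset with an explicit UUID:

# tunefs.ocfs2 --uuid-reset=72EF09EA3D0D4F51BDC00B47432B1EB2 /dev/drbd1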
Verify that the volume now reports the old UUID:
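The detect mode of mounted.ocfs2 prints each device's UUID:

# mounted.ocfs2 -d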
Then I tried to kill the heartbeat region again to see what happens:
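These are the same commands as above; the lookup now succeeds and each kill drops one reference:

# ocfs2_hb_ctl -K -u 72EF09EA3D0D4F51BDC00B47432B1EB2
# ocfs2_hb_ctl -I -u 72EF09EA3D0D4F51BDC00B47432B1EB2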
I kept killing it until I saw 0 refs, then took the cluster offline and stopped it:
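That is, the same commands that failed before:

# /etc/init.d/o2cb offline
# /etc/init.d/o2cb stop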
Restarted the cluster to check that the node list was updated:
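That is, start o2cb again and list the node directory as before:

# /etc/init.d/o2cb start
# ls -l /sys/kernel/config/cluster/cpc/node/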
OK, on the peer node (192.168.2.93), I tried to start OCFS2 again:
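Presumably via the init script, followed by the mount that failed at the top:

# /etc/init.d/ocfs2 start
# mount -t ocfs2 /dev/drbd1 /data/webroot/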
Thanks to Sunil Mushran; this thread helped me solve the problem.
The lessons are:
- never manually edit /etc/ocfs2/cluster.conf while the cluster is running; node changes must be made so the in-kernel (configfs) view stays in sync with the file
- never force re-format an OCFS2 volume while its heartbeat region is still active: the stale region's UUID will no longer match any device, and o2cb cannot be stopped cleanly