I have a functional XenServer 6.5 pool with two nodes, backed by an iSCSI share on a Dell MD3600i SAN. This works fine; it was set up before my time.
We've added three more nodes to the pool. However, these three new nodes will not connect to the storage.
Here's one of the original nodes, working fine:
[root@node1 ~]# iscsiadm -m session
tcp: [2] 10.19.3.11:3260,1 iqn.1984-05.com.dell:powervault.md3600i.6f01faf000eaf7f900000000531ae9bb (non-flash)
tcp: [3] 10.19.3.14:3260,2 iqn.1984-05.com.dell:powervault.md3600i.6f01faf000eaf7f900000000531ae9bb (non-flash)
tcp: [4] 10.19.3.12:3260,1 iqn.1984-05.com.dell:powervault.md3600i.6f01faf000eaf7f900000000531ae9bb (non-flash)
tcp: [5] 10.19.3.13:3260,2 iqn.1984-05.com.dell:powervault.md3600i.6f01faf000eaf7f900000000531ae9bb (non-flash)
Here's one of the new nodes. Notice the corruption in the address?
[root@vnode3 ~]# iscsiadm -m session
tcp: [1] []:-1,2 ▒A<g▒▒▒-05.com.dell:powervault.md3600i.6f01faf000eaf7f900000000531ae9bb (non-flash)
tcp: [2] 10.19.3.12:3260,1 iqn.1984-05.com.dell:powervault.md3600i.6f01faf000eaf7f900000000531ae9bb (non-flash)
tcp: [3] 10.19.3.11:3260,1 iqn.1984-05.com.dell:powervault.md3600i.6f01faf000eaf7f900000000531ae9bb (non-flash)
tcp: [4] 10.19.3.14:3260,2 iqn.1984-05.com.dell:powervault.md3600i.6f01faf000eaf7f900000000531ae9bb (non-flash)
The missing IP address is .13, but another node is missing .12.
Comments:
I have live production VMs running on the existing nodes and nowhere to move them, so rebooting the SAN is not an option.
Multipathing is disabled on the original nodes, despite the SAN having four interfaces. This seems suboptimal, so I've turned on multipathing on the new nodes.
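(For reference, enabling multipathing on a XenServer 6.5 host from the CLI is roughly the sequence below; the UUID is a placeholder, and the host should be in maintenance mode with its storage unplugged before the setting is changed:)
xe host-disable uuid=<host-uuid>
xe host-param-set uuid=<host-uuid> other-config:multipathing=true
xe host-param-set uuid=<host-uuid> other-config:multipathhandle=dmp
xe host-enable uuid=<host-uuid>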
The three new nodes have awfully high system loads. The original boxes have a load average of 0.5 to 1, while the three new nodes are sitting at about 11.1 with no VMs running. top shows no high-CPU processes, so it's presumably something kernel-related? There are no processes locked in state D (uninterruptible sleep).
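(Side note for anyone chasing the same symptom: the Linux load average counts tasks in uninterruptible sleep as well as runnable ones, so wedged kernel iSCSI work can inflate it with no visible CPU use. The usual quick checks in dom0 are something like:)
ps -eo state,pid,comm | awk '$1 == "D"'          # any uninterruptible tasks, including kernel threads
dmesg | grep -iE 'iscsi|timed out' | tail -20    # recent iSCSI or SCSI errors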
If I tell XenCenter to "repair" those Storage Repositories, it sits spinning its wheels for hours till I hit cancel. The message is "Plugging PBD for node5".
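(When XenCenter hangs like this, the same plug can be attempted from the pool master's CLI, which usually surfaces the underlying storage error instead of just spinning; the UUIDs are placeholders:)
xe pbd-list sr-uuid=<sr-uuid> host-uuid=<new-host-uuid>   # find the PBD joining the SR to the problem host
xe pbd-plug uuid=<pbd-uuid>                               # try the plug by hand and watch for the real error
tail -f /var/log/SMlog                                    # the storage manager logs the attempt here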
Question: How do I get my new XenServer pool members to see the pool storage and work as expected?
EDIT: Further information
- None of the new nodes will do a clean reboot either: they get wedged at "stopping iSCSI" on shutdown, and I have to use the DRAC to remotely repower them (see the sketch after this list).
- XenCenter is adamant that the nodes are in maintenance mode and that they haven't finished booting.
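(The reboot wedge is usually the iSCSI shutdown script waiting on the dead session. A workaround sketch, to be used with care, is to drop the sessions by hand before rebooting; the SID comes from iscsiadm -m session:)
iscsiadm -m node --logoutall=all     # log out of every session (may hang on the broken one)
iscsiadm -m session -r <sid> -u      # or log out of a single session by its SID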
Good pool node:
[root@node1 ~]# multipath -ll
36f01faf000eaf7f90000076255c4a0f3 dm-36 DELL,MD36xxi
size=3.3T features='3 queue_if_no_path pg_init_retries 50' hwhandler='1 rdac' wp=rw
|-+- policy='round-robin 0' prio=12 status=enabled
| |- 14:0:0:6 sdg 8:96 active ready running
| `- 15:0:0:6 sdi 8:128 active ready running
`-+- policy='round-robin 0' prio=11 status=enabled
|- 12:0:0:6 sdc 8:32 active ready running
`- 13:0:0:6 sdh 8:112 active ready running
36f01faf000eaf6fd0000098155ad077f dm-35 DELL,MD36xxi
size=917G features='3 queue_if_no_path pg_init_retries 50' hwhandler='1 rdac' wp=rw
|-+- policy='round-robin 0' prio=14 status=enabled
| |- 12:0:0:5 sdb 8:16 active ready running
| `- 13:0:0:5 sdd 8:48 active ready running
`-+- policy='round-robin 0' prio=9 status=enabled
|- 14:0:0:5 sde 8:64 active ready running
`- 15:0:0:5 sdf 8:80 active ready running
Bad node:
[root@vnode3 ~]# multipath
Dec 24 02:56:44 | 3614187703d4a1c001e0582691d5d6902: ignoring map
[root@vnode3 ~]# multipath -ll
[root@vnode3 ~]# (i.e. no output at all; the exit code was 0)
Bad node:
[root@vnode3 ~]# iscsiadm -m session
tcp: [1] []:-1,2 ▒A<g▒▒▒-05.com.dell:powervault.md3600i.6f01faf000eaf7f900000000531ae9bb (non-flash)
tcp: [2] 10.19.3.12:3260,1 iqn.1984-05.com.dell:powervault.md3600i.6f01faf000eaf7f900000000531ae9bb (non-flash)
tcp: [3] 10.19.3.11:3260,1 iqn.1984-05.com.dell:powervault.md3600i.6f01faf000eaf7f900000000531ae9bb (non-flash)
tcp: [4] 10.19.3.14:3260,2 iqn.1984-05.com.dell:powervault.md3600i.6f01faf000eaf7f900000000531ae9bb (non-flash)
[root@vnode3 ~]# iscsiadm -m node --loginall=all
Logging in to [iface: default, target: iqn.1984-05.com.dell:powervault.md3600i.6f01faf000eaf7f900000000531ae9bb, portal: 10.19.3.13,3260] (multiple)
^C iscsiadm: caught SIGINT, exiting...
So it tries to log into an IP on the SAN, but spins its wheels for hours till I hit ^C.
If the iSCSI discovery doesn't work, it's probably a matter of the IQN on the XenServer host, the MD3600i, or both not recognizing each other. Make sure the MD3600i allows access from the IQNs of all your XenServer hosts using Dell's MDSM utility, and then try to redo the iSCSI discovery:
iscsiadm -m discovery -t st -p (MD3600i-primary-controller-IP-address)
iscsiadm -m node --loginall=all
iscsiadm -m session
You should at least be able to ping the primary IP address of the MD3600i from your XenServers if you have network access.
Also note that you'll first need to set up separate iSCSI interfaces on the NICs of each new XenServer host and assign them unique static IP addresses on the same subnets as your other hosts' iSCSI connections.
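For example, something along these lines on each new host (the NIC device, host name, and final address are placeholders; only the 10.19.3.0/24 subnet comes from the question above):
xe pif-list host-name-label=<new-host> device=eth2
xe pif-reconfigure-ip uuid=<pif-uuid> mode=static IP=10.19.3.x netmask=255.255.255.0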
I hope that helps, --Tobias
For closure, there were multiple things wrong; the root cause turned out to be a mismatched MTU on the storage network.
Multipath seemed to have no bearing on the problem at all.
Deleting and fiddling around with files in /var/lib/iscsi/* on the XenServer nodes had no impact on the problem.
I had to use other means to reboot these newer boxes too; they would wedge up trying to stop the iSCSI service.
And finally, the corruption in the IQN name visible in
iscsiadm -m session
has vanished completely. This was possibly related to the MTU mismatch. For future internet searchers: good luck!
Edit: In September 2021, I had exactly the same issue with a Dell MD3800 SAN and some XCP-ng servers. Again, it was caused by a mismatched MTU, and Google just happened to serve up this question, which I had completely forgotten. Just goes to show how important it is to provide closure for future readers... that reader might be you.
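If you suspect the same cause, a quick way to confirm an MTU mismatch is a DF-flagged ping at jumbo-frame size from the host to each SAN portal (8972 bytes of payload assumes a 9000-byte MTU minus 28 bytes of IP/ICMP headers; the portal address here is just one from this question):
ping -M do -s 8972 -c 3 10.19.3.13   # times out or reports an error if anything in the path has a smaller MTU
ip link show | grep -i mtu           # compare the host NIC MTUs with what the SAN and switch ports expect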