I've noticed that adding a new disk to an Ubuntu 14.04 VM running on Azure caused an unconditional reboot of the Azure VM.
When I added a second disk, the VM wasn't rebooted, but the first disk was taken away and I had to reboot the VM in order to regain access to it.
This doesn't seem like particularly helpful behaviour.
Is this behaviour documented anywhere?
Updated: Amended with additional details requested in comments.
The procedure I used to attach a new disk was cribbed, with some differences, from here:
https://azure.microsoft.com/nl-nl/documentation/articles/virtual-machines-linux-how-to-attach-disk/
The procedure below describes what I did to add a second extra disk (sdd). Last week I did something similar to add the first extra disk (sdc); in that case, the system rebooted by itself. This time the system did not reboot itself, but I did lose access to /dev/sdc as soon as I added the new disk.
Note that I didn't use the azure cli to allocate or attach the new disk. Rather, I used the web portal.
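For reference, the CLI route I skipped looks roughly like this. It is only a sketch from memory of the classic xplat azure CLI (ASM mode); myvm and the 128 GB size are placeholders, so check azure vm disk --help before relying on it.
# Attach a new, empty 128 GB data disk to the classic VM "myvm" (placeholder name)
azure vm disk attach-new myvm 128
# List the data disks currently attached to the VM
azure vm disk list myvm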
Once the disk was attached via the portal, I used:
dmesg | grep scsi
to discover that the new device was called /dev/sdd
Dec 16 14:00:58 azure kernel: [ 2.696851] sd 5:0:0:1: [sdd] Attached SCSI disk
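As an aside, the newly attached device can also be spotted without grepping dmesg; a quick sketch (the lsblk column list is just my preference, and /proc/scsi/scsi depends on the kernel's SCSI proc support, which stock Ubuntu 14.04 has):
# List block devices with their size, type and current mount point
lsblk -o NAME,SIZE,TYPE,MOUNTPOINT
# Cross-check against the host:channel:target:lun addresses seen in dmesg
cat /proc/scsi/scsi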
I then partitioned and formatted the disk (fdisk, then mkfs), per the instructions in the article above.
I then mounted the disk and proceeded to rsync the filesystem tree I was moving onto it.
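Roughly, those steps looked like this (a sketch only; /dev/sdd, ext4, the /data mount point and the /old/tree source are placeholders for what I actually used):
# Create a single partition spanning the disk (n, p, 1, defaults, w inside fdisk)
sudo fdisk /dev/sdd
# Put a filesystem on the new partition and mount it
sudo mkfs.ext4 /dev/sdd1
sudo mkdir -p /data
sudo mount /dev/sdd1 /data
# Copy the tree being moved, preserving permissions, ownership, hard links and xattrs
sudo rsync -aHAX /old/tree/ /data/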
It was then that I noticed that the previous disk (sdc) that I had attached last week (the one that caused a reboot) was no longer accessible. (It wasn't visible in blkid, either).
I have since discovered these messages in kern.log, which indicate that sdc became inaccessible almost as soon as I had completed the attach-disk action in the portal (i.e. before I had discovered the new disk by scanning dmesg):
Dec 16 13:23:34 azure kernel: [538544.870108] scsi scan: INQUIRY result too short (5), using 36
Dec 16 13:23:34 azure kernel: [538544.870267] scsi scan: INQUIRY result too short (5), using 36
Dec 16 13:23:35 azure kernel: [538545.824751] hv_storvsc vmbus_0_16: cmd 0x2a scsi status 0x0 srb status 0x20
Dec 16 13:23:35 azure kernel: [538545.824846] end_request: I/O error, dev sdc, sector 130548832
Dec 16 13:23:35 azure kernel: [538545.828189] Aborting journal on device sdc1-8.
Dec 16 13:23:35 azure kernel: [538545.830301] JBD2: Error -5 detected when updating journal superblock for sdc1-8.
Dec 16 13:23:35 azure kernel: [538545.836606] end_request: I/O error, dev sdc, sector 0
Dec 16 13:23:35 azure kernel: [538546.308389] sd 5:0:0:0: [sdc] Synchronizing SCSI cache
Dec 16 13:23:35 azure kernel: [538546.308528] hv_storvsc vmbus_0_16: cmd 0x35 scsi status 0x0 srb status 0x20
Dec 16 13:23:35 azure kernel: [538546.309513] scsi scan: INQUIRY result too short (5), using 36
Dec 16 13:23:35 azure kernel: [538546.309776] scsi 5:0:0:1: Direct-Access Msft Virtual Disk 1.0 PQ: 0 ANSI: 4
Dec 16 13:23:35 azure kernel: [538546.310764] sd 5:0:0:1: Attached scsi generic sg2 type 0
Dec 16 13:23:35 azure kernel: [538546.311251] sd 5:0:0:1: [sdd] 268435456 512-byte logical blocks: (137 GB/128 GiB)
Dec 16 13:23:35 azure kernel: [538546.311254] sd 5:0:0:1: [sdd] 4096-byte physical blocks
Dec 16 13:23:35 azure kernel: [538546.311265] scsi scan: INQUIRY result too short (5), using 36
Dec 16 13:23:35 azure kernel: [538546.312429] scsi scan: INQUIRY result too short (5), using 36
Dec 16 13:23:35 azure kernel: [538546.312737] scsi scan: INQUIRY result too short (5), using 36
Dec 16 13:23:35 azure kernel: [538546.313289] sd 5:0:0:1: [sdd] Write Protect is off
Dec 16 13:23:35 azure kernel: [538546.313293] sd 5:0:0:1: [sdd] Mode Sense: 0f 00 10 00
Dec 16 13:23:35 azure kernel: [538546.313750] sd 5:0:0:1: [sdd] Write cache: enabled, read cache: enabled, supports DPO and FUA
Dec 16 13:23:35 azure kernel: [538546.326335] sdd: unknown partition table
Dec 16 13:23:35 azure kernel: [538546.328228] sd 5:0:0:1: [sdd] Attached SCSI disk
Dec 16 13:23:35 azure kernel: [538546.446399] hv_storvsc vmbus_0_16: cmd 0x85 scsi status 0x2 srb status 0x86
Dec 16 13:23:35 azure kernel: [538546.446404] hv_storvsc vmbus_0_16: stor pkt ffff880159d31180 autosense data valid - len 18
Dec 16 13:23:35 azure kernel: [538546.446406] storvsc: Sense Key : Illegal Request [current]
Dec 16 13:23:35 azure kernel: [538546.446409] storvsc: Add. Sense: Invalid command operation code
Dec 16 13:23:35 azure kernel: [538546.446495] hv_storvsc vmbus_0_16: cmd 0x85 scsi status 0x2 srb status 0x86
Dec 16 13:23:35 azure kernel: [538546.446498] hv_storvsc vmbus_0_16: stor pkt ffff880159d31180 autosense data valid - len 18
Dec 16 13:23:35 azure kernel: [538546.446499] storvsc: Sense Key : Illegal Request [current]
Dec 16 13:23:35 azure kernel: [538546.446501] storvsc: Add. Sense: Invalid command operation code
My reading of these messages is that the SCSI subsystem in Ubuntu 14.04 doesn't like whatever Azure does to announce new disks and, in fact, loses track of disks that are already attached.
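In theory the stale sdc device could be dropped and rescanned without a reboot using the generic sysfs mechanism below; I haven't verified that this actually brings the disk back on Azure, so treat it purely as a sketch:
# Tell the kernel to forget the dead sdc device node
echo 1 | sudo tee /sys/block/sdc/device/delete
# Rescan all SCSI hosts so the disk can be rediscovered
for h in /sys/class/scsi_host/host*/scan; do echo "- - -" | sudo tee "$h"; done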
Other notes:
- the previously added disk sdc did not appear in blkid until the next reboot
- upon reboot, /dev/sdc was usable again (although, of course, it had to be fsck'd, and the databases on it survived about as well as databases survive when the disk is ripped out with no warning)
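- one mitigation I'd consider (my own assumption, not something from the Azure article): mount data disks by UUID and with nofail, so that a disk which disappears like this doesn't also hang the next boot; e.g. an /etc/fstab line like the following, with the placeholder UUID replaced by the value from blkid
# /etc/fstab entry for the data disk (UUID below is a placeholder)
UUID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx  /data  ext4  defaults,nofail  0  2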
+1 on VM reboots when adding disks to running hosts. I added 22 disks to 22 hosts; about half of them rebooted or entered a stuck state and had to be forcibly rebooted.
uname -a
3.10.0-123.13.2.el7.x86_64 #1 SMP Thu Dec 18 14:09:13 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
waagent --version
WALinuxAgent-2.0.14
Running on OpenLogic CentOS 7.1.
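For what it's worth, the Hyper-V/LIS driver versions actually in use can be checked like this (a generic sketch; on this CentOS image the hv_* drivers are built as modules, so modinfo should work):
# Show the version string of the Hyper-V storage driver for the running kernel
modinfo hv_storvsc | grep -i version
# List the loaded Hyper-V paravirtual modules
lsmod | grep '^hv_'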
I wanted to add an update: Microsoft recommended upgrading the kernel and walinuxagent to resolve the issue. They upgrade the driver base automatically, which causes instability with previous kernel releases.
Message from the Microsoft support rep: the latest kernel upgrade available for 7.1 is kernel.x86_64 3.10.0-327.4.4.el7. This will update the LIS drivers. The Linux Integration Services (LIS) drivers for Hyper-V and Azure are kernel modules that Microsoft contributes directly to the upstream Linux kernel. We haven't had any issues with the latest version so far. Therefore, I recommend (after testing it for your application) having the kernel updated along with the LIS drivers.
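For reference, the actual update on CentOS 7 amounts to something like the following (a sketch; WALinuxAgent is the package name I'd expect in the OpenLogic/EPEL repos, so adjust if yours differs, and test before rolling it out):
# Pull in the newer kernel (which carries the updated LIS drivers) and the Azure agent
sudo yum update -y kernel WALinuxAgent
# Reboot into the new kernel so the updated drivers take effect
sudo reboot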
I just want to add that I have experienced this too. It appears the mere act of adding a disk to an Azure VM can cause it to reboot with no warning. In fact, it can cause multiple VMs to reboot at the same time.
In my case, I was using the Azure web portal to add a single disk to a particular VM and with no warning, the VM rebooted. This reboot caused a site outage.
I've tried to search for any documentation on why this is so, but could not find any. And it is not possible to get Azure support to help, because apparently you have to pay them extra money (purchase tech support) to get help with technical problems - even those caused by Azure itself.
Try to avoid using Azure for your critical online business infrastructure if you can.