I have just installed SCOM and it is monitoring a subset of the Windows servers that we have installed.
One rule is generating alerts that are giving me a little trouble: NTFS - Delayed Write Lost
This is being caused by VMware and our backup solution. All of the monitored machines are VMware virtual machines running on either ESXi 5.5 or 6. They are backed up by Commvault, which generates a quiesced snapshot and backs up the snapshot. While the snapshot is being generated, Windows logs these events, which are then picked up by SCOM. This appears to be a known issue and VMware are doing nothing about it. See here: VMware Site
Since I can't do anything about the warning being generated, I would like to suppress the rule while the snapshot is being taken. Unfortunately, overrides in SCOM appear to be binary; the rule is either enabled or it isn't. I don't want to disable a rule like that outright, because under any other circumstances a failed delayed write could be a serious matter.
Looking at the event logs on the server, the event never seems to appear more than 10-15 times in the space of a few seconds. Can SCOM be set to not notify on that rule unless the event appears more than X times over Y amount of time? Failing that, can it be set to suppress the rule during the backup window?
Would appreciate any advice :)
I think that using "Maintenance Mode" is the right choice for you. Any server in SCOM can be switched to maintenance mode for a specific time period, which suspends all monitoring workflows on that server for that time. Here is a link on how to manage it in SCOM: https://technet.microsoft.com/en-us/library/hh212870.aspx
If your backup is part of an automated workflow, you may want to put those machines into maintenance mode with a PowerShell command. Here are more details about that cmdlet: https://technet.microsoft.com/en-us/library/hh918505(v=sc.30).aspx
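As a rough sketch of what such a pre-backup step could look like, assuming the OperationsManager PowerShell module is available and a management group connection can be made. The management server name, the monitored server name and the 30-minute window are placeholders, not values from this thread:

```powershell
# Sketch only: put one monitored server into maintenance mode before the backup.
# "scom-ms01.example.local" and "Server01.example.local" are hypothetical names.
Import-Module OperationsManager
New-SCOMManagementGroupConnection -ComputerName "scom-ms01.example.local"

# Look up the monitored object(s) by name.
$instance = Get-SCOMClassInstance -Name "Server01.example.local"

# Assumed backup window of 30 minutes; adjust to the real snapshot duration.
$endTime = (Get-Date).AddMinutes(30)

Start-SCOMMaintenanceMode -Instance $instance -EndTime $endTime `
    -Reason "PlannedOther" -Comment "Commvault quiesced snapshot"
```

Maintenance mode ends on its own at the given end time, so the backup job does not strictly need a matching post-step.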
I hope it helps.
Thank you, Roman.
You could disable the original rule and build a new rule or monitor.
Version 1: Instead of targeting the Windows class, use the Logical Disk class as the target. While the backup runs, put the logical disks into Maintenance Mode, so that only the workflows targeting the logical disks are stopped. (You can use any other class for this.)
When many instances need to be put into Maintenance Mode at once, you should group them and set Maintenance Mode for the group. Iterating through the instances and setting Maintenance Mode for each of them with PowerShell is really slow in my experience.
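A minimal sketch of the group approach, assuming a group with the hypothetical display name "Backup Window Logical Disks" already exists and contains the logical disk instances. It also assumes that maintenance mode started on the group object extends to the objects it contains, which is worth verifying in a test environment first:

```powershell
# Sketch only: one maintenance mode call for a whole group instead of a loop.
# The group display name below is a placeholder.
Import-Module OperationsManager

$group = Get-SCOMGroup -DisplayName "Backup Window Logical Disks"
$endTime = (Get-Date).AddMinutes(30)   # assumed backup window

Start-SCOMMaintenanceMode -Instance $group -EndTime $endTime `
    -Reason "PlannedOther" -Comment "Nightly snapshot backup"
```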
Version 2: Use an event-correlated rule or monitor. For example: alert only when the same event does not appear again within 5 minutes of the first event.
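The rule or monitor itself would be configured in the SCOM console or a management pack, not in a script. Purely to illustrate the decision logic described in Version 2, here is a small standalone sketch; the function name `Test-ShouldAlert` and its parameters are made up for this example:

```powershell
# Illustration only: returns $true when the first event is NOT followed by
# another occurrence within the window (default 5 minutes).
function Test-ShouldAlert {
    param(
        [datetime[]]$EventTimes,
        [timespan]$Window = [timespan]::FromMinutes(5)
    )
    if ($null -eq $EventTimes -or $EventTimes.Count -eq 0) { return $false }

    $sorted = @($EventTimes | Sort-Object)
    $first  = $sorted[0]

    # Any later occurrence inside the window counts as a repeat.
    $repeats = @($sorted | Select-Object -Skip 1 |
        Where-Object { ($_ - $first) -le $Window })

    return ($repeats.Count -eq 0)
}

$t0 = Get-Date '2016-01-01 02:00:00'
Test-ShouldAlert -EventTimes $t0                      # single isolated event
Test-ShouldAlert -EventTimes $t0, $t0.AddSeconds(2)   # burst, as seen during a snapshot
```

With this logic, the burst of events seen during a snapshot would not raise an alert, while a single isolated delayed write failure would.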