A global outage was caused when CrowdStrike pushed a bad content file. That's well-covered in many other places, so I won't elaborate on it here.
What I'm interested in is whether company policies to delay automatic updates would have contained the outage.
CrowdStrike lets you set policies that limit the Falcon sensor version to n-1, n-2, or even a specific build. So you can assign a test group to get the most recent sensor version and your production groups to get a slightly earlier one.
This is a common practice used for other types of patching, such as for Windows updates. I understand that the pace is necessarily faster for information security products, but perhaps some form of vetting might have been possible.
The file that caused the problem is classified as a "content file", and so it's possible that it wouldn't have been prevented by sensor update policies.
On the other hand, Dave Plummer's YouTube video suggested that CrowdStrike was using content updates to patch sensor code without having to go through Microsoft's driver approval process every time, and the sensor version numbers do appear to increase fairly rapidly. So it's also possible that the policies control content updates too.
So, can we say whether, if a CrowdStrike customer had set up a procedure to test machines against sensor updates before approving them for general release within their company, the outage could have been contained to the test group(s)?
I spoke with their support: Falcon sensor version policies do not delay content updates, so customers pinned to n-1 were still impacted.
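To make the distinction concrete, here's a small conceptual model (plain Python, not the CrowdStrike API — the version numbers and the `Host` class are made up for illustration) of why an n-1 sensor policy didn't help: the policy pins which sensor *binary* a host runs, but channel/content files are pushed to every host regardless.

```python
from dataclasses import dataclass, field

# Hypothetical sensor release history, oldest to newest.
SENSOR_RELEASES = ["7.09", "7.10", "7.11"]

def pinned_sensor_version(releases, offset):
    """Version an 'n minus offset' policy deploys (0 = latest, 1 = n-1, ...)."""
    return releases[-1 - offset]

@dataclass
class Host:
    policy_offset: int                      # 0 = latest, 1 = n-1, 2 = n-2
    sensor_version: str = ""
    content_files: list = field(default_factory=list)

    def apply_sensor_update(self, releases):
        # The update policy only constrains the sensor binary.
        self.sensor_version = pinned_sensor_version(releases, self.policy_offset)

    def receive_content_update(self, channel_file):
        # Content updates bypass the sensor version policy entirely
        # (per CrowdStrike support, as noted above).
        self.content_files.append(channel_file)

test_host = Host(policy_offset=0)   # test group: latest sensor
prod_host = Host(policy_offset=1)   # production: pinned to n-1

for host in (test_host, prod_host):
    host.apply_sensor_update(SENSOR_RELEASES)
    host.receive_content_update("C-00000291.sys")  # the faulty channel file

# The two groups run different sensor binaries...
print(test_host.sensor_version, prod_host.sensor_version)
# ...but both received the bad content file.
print(prod_host.content_files)
```

Under this model, staging sensor versions across groups gives you no protection at all against a bad content push, which matches what support said.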
Probably not, given who the application owners typically are and the update philosophy of CrowdStrike and security products in general.
These products usually self-update frequently, with almost no formal communication. This is (subjectively) different from a major or minor binary update of a generic application.
The vendors, and their counterparts at customer organizations, aren't disciplined ITIL practitioners and never will be ...
About seven years ago I was on a similar incident with Trend Micro. They released an update that cratered every Windows cluster we had because it interfered with lookups for the Cluster Name Object. We were on a conference call with 70 people.
When Microsoft called in, they told us right away that several other customers with the same product had the same problem...
Fortunately not as many organizations had that product.
This is not a traditional product where the customer plans and prepares for updates by mitigating risk and identifying contingencies. This product is co-managed by the vendor, and they are taking actions that the customer is unaware of.
In a sense, this is more of a service than a product. Concerns about implementations and management should be reinforced by contract, or by selecting a different solution or service.