I have an Azure web app (S2 App Service plan) that is set to autoscale with a minimum instance count of 1. As it mostly idles, the actual instance count is also 1 almost all of the time.
Last week "something" happened and the site was no longer available. Every request was answered with HTTP status 500. This went on for about 10 hours, and then all of a sudden the site was available again. I had not changed anything in the days before the error condition, nor did I do anything to make the site come back.
I opened a support request for this and a support engineer has been looking into it. According to him, the reason for the problem was:
the root cause is pointing at the Windows Process Activation Service, which was unable to run the process related to your application and the platform was unable to recover it in the specified time Frame
Given that I have no way to configure WPA, I assume this is a problem with the platform. The support engineer confirmed this.
I think this means that Azure should detect a state like this and do whatever is necessary to bring the app back up. As it took 10 hours for the service to come back online, I assume this happened by chance and Azure did not do anything here. Should I file a bug report concerning this incident? (The support engineer isn't really helpful here...)
Also, the support engineer insists that having more than one instance would have solved the availability problem, because
instance so I can confirm that the redundancy failover option in this scenario would be for you to scale out the site to a minimum of two instances. This way, if one of the instances is unavailable, the second one would take over.
I think this simply cannot be correct, because the web app was reported as "healthy" by Azure and did respond to requests, albeit with status 500.
Would Azure, in this case, really send traffic only to the instance that was not returning status 500? And also, given that I do not know what caused the WPA problem in the first place, is it not possible that the exact same problem would have turned up on the second instance as well?
When you scale to multiple instances of a web app, they will sit behind a load balancer (you won't see this, but that is what happens behind the scenes). The load balancer probe should detect the 500 errors coming from your first instance and not direct any traffic to it.
Your web app instances will be running on different VMs under the hood, so if a WPA issue occurs on the first, it should not impact the second. That said, it is possible that the same WPA issue could occur on the second host as well, especially if something in your app is triggering it.
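To make the probe idea concrete, here is a toy model in Python. It is only a sketch and not Azure's actual load balancer: the names Instance, probe and route are made up for illustration, and the rule "any 2xx response counts as healthy" is an assumption. It shows why a second instance helps only if the probe looks at the HTTP status rather than just checking that the process answers.

```python
from dataclasses import dataclass
from itertools import cycle
from typing import Callable, List


@dataclass
class Instance:
    """Hypothetical stand-in for one web app instance."""
    name: str
    handler: Callable[[], int]  # returns the HTTP status code of a probe request


def probe(instance: Instance) -> bool:
    # Assumption: only 2xx responses count as healthy.
    return 200 <= instance.handler() < 300


def route(instances: List[Instance], requests: int) -> List[str]:
    """Round-robin the given number of requests over instances that pass the probe."""
    healthy = [i for i in instances if probe(i)]
    if not healthy:
        raise RuntimeError("no healthy instances to route to")
    rotation = cycle(healthy)
    return [next(rotation).name for _ in range(requests)]


# instance-1 is stuck returning 500, instance-2 is fine.
instances = [
    Instance("instance-1", lambda: 500),
    Instance("instance-2", lambda: 200),
]
served_by = route(instances, 4)
print(served_by)  # ['instance-2', 'instance-2', 'instance-2', 'instance-2']
```

With only the failing instance in the list, route raises, which corresponds to the single-instance case from the question: there is nothing to fail over to. If both instances hit the same problem at once, a second instance does not help either.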