This week I've run into a problem with the MSSEARCH wait type, and I haven't been able to fully diagnose the problem.
The server had been running with no problem for several weeks until the other day, when it suddenly started taking too long for replying to users' requests.
My team and I quickly found the problem to be located at the Full-text Search component, but we had no idea what was causing it. (FTS is a feature heavily used in our workload and we've had no issues with it up until now.)
We tried restarting the MSFTE service, but it wouldn't respond.
As the screenshot above shows, the server was just below 400 Waiting Tasks (normal workload of under 10) and rising.
I didn't have too much time to try and diagnose it before restarting our server because we it was running in production, so I was left only with SQL Server's logs and a couple of MSFTE memory dumps after the full server restart.
I was expecting to be able to have a better understanding of the problem with those, but I wasn't able to acquire much information out of those, so I would be really glad if anyone would offer a pointer or shed some light into this.
All we were able to deduce was that the Full-text Search service had stopped working, but I found no evidence of such a bug on the web, and although it seems to be running ok now, I'd like to really understand what happened and prevent it from happening again.
Thank you.
First of all, that's a screenshot from SSMS 2008 ;)
FTE has very different resources requirements from normal database storage; you should set up remote windows perfcounter capture of at least the following counters:
And a few relevant MSSQL counters, too, although you need a running MSSQL instance to be able to set up capture of those remotely. If you don't have one, you need to create the set on the SQL server and export/import the counters.
Capture this data every minute or so, and any trends will be easy to spot.
We haven't been able to diagnose the problem fully, but we did take measures to avoid this from happening again, and I'd like to document it here.
First, we've setup full-text indexing and full-text catalogs population during periods of low system activity, during database maintenance windows and we no longer leave it to the system to automatically take care of it.
Second, we now keep a closer eye on full-text search services, how they're performing and how much resource FTS is taking. We've documented its usage and we monitor, among others, its files sizes and I/O.
Third, I've setup a couple of alerts so that the operators (DBAs) get notified when something goes wrong (and wrong here is relative. In our case, when FTS starts to use more resources than its supposed to, plus a reasonable threshold.)
So far it hasn't happened again (it's been almost a month since it happened the first time), but in case it does happen we're ready to act, preferably before our users get affected.