I have a client that provides some server systems to a hospital, and a support ticket was raised that the desktop application was hanging while waiting for the server. We did some extensive testing, and it's pretty clear that the server is responsive, the network is fine, and the problem is on the client end (no requests are received during the hang, etc.).
We took a look at the desktop machines and they should be fine, so we raised tickets with the software vendor, who says it must be the hardware; the hardware company says it is the software, and so on.
Anyway, talking to the nurses, they say that these machines often "hang" for 30 seconds at a time, sometimes during important moments when they need to get data, such as charts and status, for a patient who is unwell.
So I want to stick a client on these machines that would be able to detect arbitrary "unresponsiveness" of the keyboard/mouse and log that for analysis later.
Obviously I am wary of suggesting an application that takes resources and makes the problem even worse, so I would be interested to see any tools that detect these scenarios (is it correct to say that the keyboard interrupts are being discarded?) by looking for the OS discarding the interrupts, or whatever is appropriate here.
So go on then, Server Fault, here is your chance to save a life... ;-)
Edit: I am starting to think that some of the tools associated with real time systems might be appropriate, at least as a diagnostic.
Think of it like the space shuttle. Once the thing is launched, that is it: it's launched and you are stuck with what is installed. So there is no remote management of the machines that I have access to, and I can't sit and look at logs. All the cases would have to be worked out beforehand. (My thinking is that if I could "detect" unresponsiveness, I could trigger a VBScript to copy the relevant log files and performance metrics into a file, and have a local tech pass those files on.)
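One low-overhead way to approximate "detecting unresponsiveness" is a heartbeat loop: sleep for a fixed interval and log whenever the wake-up arrives much later than expected. The sketch below is only an illustration under assumptions: the threshold, the log file name, and the `on_stall` hook are made up, and it only catches stalls that delay every process on the machine (it would not notice a hang confined to the application or the input stack).

```python
import time
from datetime import datetime


def lateness(previous, current, interval):
    """Seconds by which a wake-up overshot the expected sleep interval."""
    return (current - previous) - interval


def watch(interval=1.0, threshold=5.0, log_path="stalls.log", on_stall=None):
    """Sleep in short intervals and log whenever a wake-up arrives late.

    interval, threshold and log_path are illustrative defaults.
    on_stall is a hypothetical hook, e.g. a function that copies log
    files and performance data somewhere a local tech can collect them.
    """
    last = time.monotonic()
    while True:
        time.sleep(interval)
        now = time.monotonic()
        late = lateness(last, now, interval)
        if late > threshold:
            stamp = datetime.now().isoformat()
            with open(log_path, "a") as log:
                log.write(f"{stamp} wake-up was {late:.1f}s late\n")
            if on_stall is not None:
                on_stall(late)
        last = now


if __name__ == "__main__":
    watch()
```

The loop does almost no work between sleeps, so it should add very little load, but that is worth confirming on one of the affected machines before rolling it out.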
It would require modifying the client application, but you could add instrumentation that records each call to the server and times how long the response takes. This would let you establish baselines, identify machines that show a pattern of problems, and determine whether a machine or the application is, in fact, unresponsive.
Graphite is especially good for this.
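As a rough sketch of what timing the server calls could look like, the snippet below wraps a call in a timer and formats the result in Graphite's plaintext protocol (`path value timestamp`, normally sent to a carbon listener on port 2003). The metric path, the host name, and the `timed` and `send_to_graphite` helpers are all hypothetical; the real client application would need its own hook around each request.

```python
import socket
import time
from contextlib import contextmanager


def graphite_line(path, value, timestamp):
    """Format one metric as a Graphite plaintext line: path, value, timestamp."""
    return f"{path} {value:.3f} {int(timestamp)}\n"


@contextmanager
def timed(path, sink):
    """Time the wrapped block and hand a formatted metric line to `sink`."""
    start = time.monotonic()
    try:
        yield
    finally:
        elapsed = time.monotonic() - start
        sink(graphite_line(path, elapsed, time.time()))


def send_to_graphite(line, host="graphite.example.internal", port=2003):
    """Send one line to a carbon listener. The host name is a placeholder."""
    with socket.create_connection((host, port), timeout=2) as conn:
        conn.sendall(line.encode("ascii"))


if __name__ == "__main__":
    collected = []
    # The metric path is made up; collected.append stands in for send_to_graphite.
    with timed("clients.ward3-pc07.chart_request.seconds", collected.append):
        time.sleep(0.2)  # stands in for the real request to the server
    print(collected[0], end="")
```

Passing the sink in as a function keeps the timing separate from the transport, so the same wrapper could write to a local file on machines that cannot reach a Graphite server.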
On the other hand, if it's the desktop itself that is the problem, I know of no better way of detecting unresponsiveness than a combination of a user and your direct phone number.
(By definition, the system won't know that it's slow or unresponsive.)
This is a never-ending battle. The hardware company blames the software company... who blames the IT staff... who blame... <YEAH OUTSOURCING!>
Unfortunately, "hanging" can be caused by so many different things for so many different reasons. There is no one magic tool that can monitor every possible cause of "wait time". What you can do is use the "perfmon" tool built into Windows and add whichever performance counters you are interested in, which can be almost anything. (Yes, you can monitor remote machines.) Start with the basics, like CPU usage, physical disk queue lengths, network utilization, etc.
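If opening the perfmon GUI on each machine is not practical, the same counters can be logged to a CSV file with the `typeperf` command-line tool that ships with Windows. Here is a minimal sketch that builds such a command; the counter selection, sampling interval, sample count, and output file name are assumptions to adjust, not recommendations.

```python
import subprocess
import sys

# Basic counters matching "CPU, disk queue, network"; adjust as needed.
COUNTERS = [
    r"\Processor(_Total)\% Processor Time",
    r"\PhysicalDisk(_Total)\Avg. Disk Queue Length",
    r"\Network Interface(*)\Bytes Total/sec",
]


def typeperf_command(counters, interval_s=5, samples=720, out_file="perf.csv"):
    """Build a typeperf command line that logs the counters to a CSV file."""
    return [
        "typeperf", *counters,
        "-si", str(interval_s),   # seconds between samples
        "-sc", str(samples),      # number of samples to collect
        "-f", "CSV",              # output format
        "-o", out_file,           # output file (name is a placeholder)
        "-y",                     # answer yes to prompts, e.g. overwrite
    ]


if __name__ == "__main__":
    cmd = typeperf_command(COUNTERS)
    if sys.platform == "win32":
        subprocess.run(cmd, check=True)
    else:
        print("typeperf is Windows-only; the command would be:", cmd)
```

With the defaults above this would sample every 5 seconds for an hour; a shorter interval gives finer detail around a 30-second hang at the cost of a larger file.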
If you see a high amount of CPU usage, it's time to figure out what the application is doing and why it is consuming so much CPU.
If you see a significant number of requests waiting in the disk queue, perhaps you should optimize your disk (defrag? replace with faster disk drives? check for errors? etc.). If you're still out of luck, perhaps the application is not very well optimized. Bad developers frequently make mistakes where the application reads 100 MB of data when it only needed the last 5 lines of a log.
If you're seeing a large amount of network traffic, it's time to figure out why. Perhaps there are a lot of retransmissions due to faulty cabling or hardware; perhaps the network has a loop and the switches don't support spanning tree; maybe there's lots of excessive junk on the network, like AppleTalk/IPX-enabled printers... the list goes on.
You may even need to go one step further and use something like Wireshark to monitor the packet exchange between the client and server. Perhaps the application sends a packet to the server and waits (blocks) for a response before continuing execution. Perhaps the server itself is overtaxed and cannot keep up with the number of client connections.
...and this is just scratching the surface. Troubleshooting "hanging" applications when you do not have access to the source code, or to a developer who knows what they're doing, is a HUGE undertaking.