We have a Citrix PS4.0 farm made up of 2 physical and 2 virtual Citrix servers. Any one of them at some point or another will eventually degrade in performance due to hitting 100% CPU usage. I can see the CPU usage spike in the Virtual Infrastructure Client when this happens on either of the VMware servers.
This is not a load issue related to the number of users as it can happen at any time with any number of users.
Users are running shared desktops, not applications. Installed applications in the desktop are standard office applications (Word, Excel, Outlook), with limited Internet Explorer access through a Bluecoat Proxy, and a couple of industry-specific applications.
What tools can be used to troubleshoot and diagnose the source of the problem? Once the server hits 100% CPU, it is impossible to log on and see which process is consuming all the resources. The only recourse is to hard reset the machine. All servers restart at 4am each morning on a schedule.
NOTE: I already have ThreadMaster installed on all Citrix servers using the default configuration options and logging activities. The logs do not reveal the source of the problem.
EDIT
- Citrix Presentation Server 4.0, Enterprise Edition
- Hotfix PSE400W2K3R03
- Windows 2003 Server Standard Edition Service Pack 1
- Runs Symantec Client Security 10.0.0.359 configured per the recommendations from Citrix for file exclusions, etc.
Windows 2003 SP1 went out of support in April, so your OS does not get any security patches anymore. You need to upgrade to SP2 ASAP.
SP2 also has lots of random bug fixes in it - your issue could go away.
If your OS has that old a patch level, there is a good chance some drivers - specifically print drivers - could be out of date on the box too. As drivers are a big source of system instability in general, I would try checking they are all signed and up to date. Having a dodgy print driver would explain why it affects both virtual and physical boxes, and appears to occur randomly regardless of load.
Oh, and FYI: Citrix 4 goes EOM (End of Maintenance, no more bug fixes) at the end of this month (June 09), and EOL (End of Life, no more security patches or any other patches) at the end of Dec 09. Enjoy your upgrade cycle!
You can try scheduling a script to run every minute or so that appends the process list to a file:
Something like this might at least give you a clue as to what's going on.
(pslist comes with the Sysinternals Suite)
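For illustration, here is a minimal sketch of that kind of logger in Python. It assumes a Python interpreter is installed on the server and that `pslist` is on the PATH; the log path and function names are made up for the example, so adjust them to suit.

```python
import datetime
import subprocess

# Assumption: pslist.exe (Sysinternals) is on PATH. The built-in
# ["tasklist", "/v"] could be swapped in if it is not.
SNAPSHOT_COMMAND = ["pslist"]
LOG_PATH = r"C:\perflogs\pslist.log"  # hypothetical location


def take_snapshot(command):
    """Run the command and return its output under a timestamp header."""
    process = subprocess.Popen(
        command,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    output, _ = process.communicate()
    stamp = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    return "===== {} =====\n{}\n".format(stamp, output.rstrip())


def append_snapshot(log_path, command):
    """Append one timestamped snapshot to the log file."""
    with open(log_path, "a") as log:
        log.write(take_snapshot(command))
```

Ending the script with `append_snapshot(LOG_PATH, SNAPSHOT_COMMAND)` and running it from a scheduled task every minute should leave the last snapshot before the hang as the final entry in the file. Appending rather than overwriting matters here, since the box is hard reset once it locks up.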
The built-in Performance Logs and Alerts tool would be a great way to get some data about what's going on. You're going to have to use some disk space to generate these logs, but if you stay on top of deleting old log files until the problem occurs, you shouldn't have a problem with running out of disk space.
I'd start up a counter log on each server computer, logging the Process and Processor objects to disk (I'd probably also grab the Memory object, too).
Start / Run / PERFMON
Expand the Performance Logs and Alerts node and highlight the Counter Logs node.
Click Action and New Log Settings. Name the log however you'd like.
Click the Add Objects... button in the log properties window and add the objects to log.
Set an interval. I'd probably choose a 60 second or longer interval. High resolution probably isn't necessary since this is a gradual degradation.
On the Log Files tab, use the Configure button to choose a location for the log file and a base filename. I'd choose a Maximum log size of, say, 5MB - 10MB. This is going to generate a lot of small files, but you will be able to monitor the path where you're storing the files and delete older files that are piling up prior to the problem occurring.
You can start the log by right-clicking the new log instance in the results pane and choosing "Start". The log will run, by default, until you stop it or until you reboot the computer. (See this question for information about starting a log on boot: How to Setup Perfmon to Automaticaly Start an "Alert" At System Startup? The question talks about starting an alert, but you can use the same command to start a log.)
You can analyze these logs by hand after the issue occurs. You might want to try Microsoft's Performance Analysis of Logs (PAL) tool (http://www.codeplex.com/PAL). I've been happy with the reports that tool has generated, and it's fairly easy to use.
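If you'd rather script the "by hand" part, here is a rough Python sketch that pulls the peak `% Processor Time` per process out of a counter log. It assumes the log was saved as CSV, or converted from binary with something like `relog mylog.blg -f csv -o mylog.csv`, and that the column headers follow the usual `\\SERVER\Process(name)\% Processor Time` form. The function names are mine, not part of any tool.

```python
import csv
import io
import re

# Matches counter paths such as \\CTX01\Process(winword)\% Processor Time
PROCESS_CPU = re.compile(r"\\Process\((?P<name>[^)]+)\)\\% Processor Time$")


def peak_cpu_by_process(csv_file):
    """Return {process instance: highest % Processor Time seen in the log}."""
    reader = csv.reader(csv_file)
    header = next(reader)
    columns = {}
    for index, title in enumerate(header):
        match = PROCESS_CPU.search(title)
        # _Total and Idle are aggregates, not real offenders.
        if match and match.group("name") not in ("_Total", "Idle"):
            columns[index] = match.group("name")
    peaks = {}
    for row in reader:
        for index, name in columns.items():
            try:
                value = float(row[index])
            except (IndexError, ValueError):
                continue  # blank or missing sample
            peaks[name] = max(peaks.get(name, 0.0), value)
    return peaks


def top_offenders(peaks, count=5):
    """Return the processes with the highest peaks, largest first."""
    ranked = sorted(peaks.items(), key=lambda item: item[1], reverse=True)
    return ranked[:count]
```

Something like `top_offenders(peak_cpu_by_process(open("mylog.csv")))` on the last log file before the hang should show which process was climbing. Note that the Process counter can exceed 100 on multi-CPU machines, since it is summed across processors.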
Try adding an extra virtual CPU to the virtual servers IF they only have one vCPU. If it's a single-threaded application eating up all the CPU, you'll at least be able to get in and kill it instead of resetting the server.
What edition are you running and do you have an SA agreement?
Are you running antivirus on the server?
Also, what hotfix(s)/rollup are you running for PS4 and what SP are you on for Windows?
How many CPUs/cores per machine? Hitting 100% across several cores would point to a multithreaded application eating all the resources.
Do you see a pattern (a peak every X hours, or every day around 2 o'clock)?
Anything in the event log (like huge print jobs)?
Do you have SCOM?
We had a similar problem with our Internet monitoring software, and it turned out that the XTE (session reliability) process had corrupted the WinSock library and/or the TCP/IP stack. To repair the TCP/IP stack, run the command "netsh winsock reset" on the Citrix server and reboot.
You are also 3 rollups behind on PS4. You may want to upgrade your servers to Rollup 6.
Have you considered upgrading to WS2003 Enterprise Edition and taking advantage of Windows System Resource Manager to contain application resources?
About the only high-CPU problems we've had on our Citrix boxes were due to bad printer drivers causing the spooler service to go absolutely nuts. Specifically, it was down to HP LaserJet printer drivers, which were notoriously bad until around December last year, when they redid the underlying DLLs and fixed a whole bunch of crashes. The change log in their release notes made for interesting reading.
Anyhoo, you could perhaps try a 'sc \\servername stop spooler' from your workstation and see if it can connect and kill the print spooler on the errant server. That might help rule printer drivers in or out as the issue.
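If you end up doing this against several servers, a small wrapper saves some typing. This is only a sketch in Python: the function names are hypothetical, it assumes `sc.exe` is on the PATH of your workstation, and it assumes your account has rights to control services on the target.

```python
import subprocess


def sc_command(server, action, service="spooler"):
    """Build an sc.exe command line that targets a remote server."""
    if action not in ("query", "stop", "start"):
        raise ValueError("unsupported action: {}".format(action))
    # sc.exe expects the server name in UNC form, i.e. \\name.
    return ["sc", "\\\\" + server.lstrip("\\"), action, service]


def run_sc(server, action, service="spooler"):
    """Run the command and return (exit code, combined output)."""
    process = subprocess.Popen(
        sc_command(server, action, service),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    output, _ = process.communicate()
    return process.returncode, output
```

Running `run_sc("servername", "query")` before and after the stop shows whether the service actually changed state. If CPU drops once the spooler is stopped, the printer drivers are the likely culprit.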