I want to learn more about how to perform root-cause analysis. More often than not, our department tells the user to try rebooting (their Windows XP system), which actually "fixes" a good number of problems. When I am in a hurry (and sometimes getting paid hourly contributes to this) I might try to find a workaround in order to get the problem solved quickly instead of actually performing root-cause analysis.
Most of the time I am looking in log files or the event viewer for this information. Sometimes I will use the Sysinternals tools or occasionally run a packet sniffer. I probably don't use the Sysinternals programs as much as I should. Some specific insight on how you use each of these tools, and when and why, would also be helpful.
I know this is a wide open question but could you please briefly explain your methodology, tools, etc. that you use? It looks like a lot of admins on SF use a more in-depth process which I would like to learn more about. If this helps narrow down the question any, I would be most interested in tools, tips, tricks, etc. relevant to Windows servers & clients within an AD environment.
Figuring out the root cause of a problem depends on the problem, but your initial instinct to look at log files, the Sysinternals tools, and packet sniffers is generally correct.
I would add running the MS Malicious Software Removal Tool and a good AV program on Windows systems (and ensuring that they don't have something like CyberDefender or other AV-trojan-malware installed).
The folks at Stack Exchange are proponents of the "5 Whys" method (http://en.wikipedia.org/wiki/5_Whys, also this nice short PDF that shows it in action). It is a pretty valuable tool for doing root cause analysis.
Beyond that I'll paint two broad categories and some of the questions I usually ask/things I check:
Mysterious behavior not related to the network
e.g. "Word keeps crashing on me"
Basic questions to ask:
(Don't take "nothing" for an answer -- it is the first lie. New software, patches, etc. all count.)
(Try to extract as much detail as possible here -- in my example above: "I hit the hotkey for insert initials and the program crashed.")
(If so, start looking at stuff from (1) above)
(If so that's a good sign: A tech support call to the vendor may help. If not you'll need to look at the user's system for the rest of these questions.)
My company once had a mysterious system lock-up that was related to clicking the mouse at a specific frequency (we still don't know why, but we had to watch a user doing it and practice for a day in order to be able to reproduce it reliably).
Problems related to the network
A lot of this is similar, but with some more specific guidance.
(Yeah, you always start there)
How about by IP? How far does the traceroute get?
(TCP settings, etc. - Usually not the problem, but sometimes.)
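The "by name, then by IP" order above can be scripted as a rough first pass. This is only a sketch, not a replacement for ping/traceroute or a packet capture: the function name `diagnose` and its `resolve`/`connect` parameters are my own invention for illustration, and it only tests a TCP connection to one port, so an ICMP-only or UDP problem will not show up here.

```python
import socket

def diagnose(host, port, resolve=socket.gethostbyname,
             connect=socket.create_connection, timeout=3.0):
    """Report the first layer that fails: name resolution, then TCP connect.

    `resolve` and `connect` are injectable (a hypothetical convenience)
    so the logic can be exercised without touching the network.
    """
    # Step 1: can we reach it by name? A failure here points at DNS/hosts.
    try:
        ip = resolve(host)
    except OSError as exc:
        return f"name resolution failed for {host}: {exc}"

    # Step 2: how about by IP? A failure here points at routing,
    # a firewall, or the service itself rather than at DNS.
    try:
        conn = connect((ip, port), timeout=timeout)
    except OSError as exc:
        return f"{host} resolves to {ip} but port {port} is unreachable: {exc}"

    conn.close()
    return f"{host} ({ip}) accepts connections on port {port}"
```

Calling something like `diagnose("fileserver01", 445)` (a made-up host name) from the affected client and again from a known-good machine tells you quickly whether the difference is in name resolution or in the path to the server.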
In addition to the excellent responses so far, I would add:
Identify the date/time of issue onset. This may seem obvious, but I have seen far too many issues where this was not documented and later on incorrect assumptions were made. This correlates well with the "what changed" step.
Is the issue reproducible or intermittent? This is critical, as reproducible symptoms are far easier and quicker to resolve than those that are intermittent. If it is reproducible, ensure the steps are documented.
Identify the symptom(s). Note that we distinguish between "symptom", which is a manifestation of the root cause, and the actual problem/root cause.
Localize the issue to a likely faulty functional component. If there is an error in a web application, is it in the application code, the web server, the operating system hosting the web server, the network, or the remote end? This is a best guess at this point, made so that resources are focused on the likely cause, so ensure that others know it is theory/conjecture.
Question your assumptions, and try to gather empirical data to support your assumptions and conclusions. It's a pretty bad feeling to tell someone that there isn't a problem with x and have it discovered later that there actually is. Usually when there is an incorrect solution, there could have been data to support a correct solution.
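Once the onset time is written down, tying it back to "what changed" is mostly a matter of filtering whatever events you have collected to the window just before it. A minimal sketch, assuming you have already pulled events into `(timestamp, message)` pairs from the event log, patch history, or change records; the function name `events_before_onset`, the 24-hour default window, and the sample data are all made up for illustration.

```python
from datetime import datetime, timedelta

def events_before_onset(events, onset, window=timedelta(hours=24)):
    """Return events timestamped within `window` before `onset`, newest first."""
    start = onset - window
    hits = [(ts, msg) for ts, msg in events if start <= ts <= onset]
    return sorted(hits, key=lambda e: e[0], reverse=True)

# Hypothetical, hand-written events; real ones would come from logs
# or change records.
sample = [
    (datetime(2010, 3, 1, 9, 0), "patch installed"),
    (datetime(2010, 3, 3, 14, 30), "driver updated"),
    (datetime(2010, 3, 3, 16, 0), "service restarted"),
    (datetime(2010, 3, 4, 8, 0), "user logged in"),
]
onset = datetime(2010, 3, 3, 17, 0)

# List the candidate changes, most recent first.
for ts, msg in events_before_onset(sample, onset):
    print(ts.isoformat(), msg)
```

The list it produces is only a set of candidates. Something that changed shortly before onset is worth checking first, but it still needs the same empirical confirmation as any other theory.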
It sounds like you are asking for general troubleshooting help, such as "Your troubleshooting rules, approach to troubleshooting?", rather than how to do a particular kind of RCA (http://en.wikipedia.org/wiki/Root_cause_analysis).