The small business that I do sysadmin work for on the side uses a mid-2011 Mac Mini Server (running OS X 10.7 Lion) as a file server and FileMaker database host. Its two 750 GB HDDs are mirrored in a RAID 1 set, and it runs Time Machine backups over USB to a RAID 1 array of two 1 TB disks.
I set it up about a year and a half ago and had no problems with it until a few months ago, when I opened Disk Utility and found that the RAID had degraded and was running on only one disk. I went out and bought another 750 GB HDD, installed it, and rebuilt the array.
Everything was fine for a week - then the array degraded again. I rebuilt it, and it was fine until last week, when the array degraded once more. It always degrades on the same device: disk1 has always been fine, but disk2 keeps dropping out, regardless of which physical hard drive is in that slot. Since it happens across different drives, I don't think it's a hardware issue.
What should I do? I would reinstall OS X, but I've never restored a backup from Time Machine before and I'm not sure what to expect. If things go sideways, I would have to reconfigure a lot, including about 10 user accounts, the network shares, and the FileMaker setup. This is just a side job for me, and I really don't want to burn a nonstop Friday-night-to-Monday-morning weekend on this because something went wrong and I lost everything.
Have you read any log files that might give you a hint about what the issue is? I would definitely not rule out a hardware issue - it's not only the disks that might be damaged; cables and even connectors on the main board can be the culprit if they are not up to spec for whatever reason. These can be hard to get repaired, though, especially if the errors are only sporadic - many companies, including Apple (in my experience), will disregard errors they can't reproduce after a few seconds of testing.
You will want to be very systematic about isolating the failure: save the system logs, watch them for filesystem errors, and challenge your assumptions.
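For example, here's a quick way to pull disk- and RAID-related kernel messages out of the logs on 10.7 - the grep pattern is my guess at what a failing mirror member would log, so widen or narrow it once you've seen a real failure entry (and adjust the second line if your rotated logs are compressed differently):

    # Current system log, then the bzip2-rotated ones.
    grep -iE 'disk[0-9]|I/O error|AppleRAID' /var/log/system.log
    bzcat /var/log/system.log.*.bz2 | grep -iE 'disk[0-9]|I/O error|AppleRAID'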
Why rule out disk1? If there is an error writing data to the two drives, the system has to pick one member to drop, and perhaps there isn't a good reason behind which one it picks - the algorithm may be based on something arbitrary, like whether the day/week/second when the error is detected is even or odd, and you have too few documented failures to spot such a pattern.
From the phrasing of the question, you are mixing two problems: the lack of a tested restore strategy, and how to isolate a RAID issue. Be frank with yourself and your employer about the risks, and let them make a business decision about which problem to attack and with what budget.
As to the main question here - you could also just script a simple check like
    diskutil list
and have it send you an alert / page / capture the logs when the next RAID problem is detected. I would also disable the RAID software's AutoRebuild, if you have it enabled, just in case the problem is physical (someone jiggling the server) and the system picks the wrong spindle to re-mirror from when the cables reconnect.
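Here is a rough sketch of such a check, meant to be run from cron every few minutes. Two assumptions to verify by hand before trusting it: that `diskutil appleRAID list` on 10.7 prints a status containing "Degraded" or "Failed" when a mirror member drops, and that the box can actually send mail. The log directory and address are placeholders.

    #!/bin/sh
    # RAID health check sketch - verify the assumptions above first.
    STATUS=$(diskutil appleRAID list)
    if echo "$STATUS" | grep -qE 'Degraded|Failed'; then
        # Preserve the RAID state and recent kernel messages at failure
        # time, so evidence survives even if someone rebuilds the array.
        LOGDIR=/var/log/raid-failures        # hypothetical path
        TS=$(date +%Y%m%d-%H%M%S)
        mkdir -p "$LOGDIR"
        { echo "$STATUS"; tail -n 500 /var/log/system.log; } > "$LOGDIR/raid-$TS.txt"
        echo "RAID degraded on $(hostname), details in $LOGDIR/raid-$TS.txt" \
            | mail -s "RAID ALERT: $(hostname)" admin@example.com
    fi

As for AutoRebuild, something like `diskutil appleRAID update AutoRebuild 0 <raid volume>` should turn it off - check `man diskutil` on the machine for the exact syntax, since I'm going from memory here.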