I have an HP ProLiant DL380 G7 server running as a NexentaStor storage unit. The server has 36GB RAM, 2 LSI 9211-8i SAS controllers (no SAS expanders), 2 SAS system drives, 12 SAS data drives, a hot-spare disk, an Intel X25-M L2ARC cache and a DDRdrive PCI ZIL accelerator. This system serves NFS to multiple VMware hosts. I also have about 90-100GB of deduplicated data on the array.
I've had two incidents where performance suddenly tanked, leaving the VM guests and the Nexenta SSH/Web consoles inaccessible and requiring a full reboot of the array to restore functionality. In both cases, it was the Intel X25-M L2ARC SSD that had failed or been "offlined". NexentaStor failed to alert me to the cache failure; however, the general ZFS FMA alert was visible on the (unresponsive) console screen.
The zpool status output showed:
  pool: vol1
 state: ONLINE
  scan: scrub repaired 0 in 0h57m with 0 errors on Sat May 21 05:57:27 2011
config:

        NAME                        STATE     READ WRITE CKSUM
        vol1                        ONLINE       0     0     0
          mirror-0                  ONLINE       0     0     0
            c8t5000C50031B94409d0   ONLINE       0     0     0
            c9t5000C50031BBFE25d0   ONLINE       0     0     0
          mirror-1                  ONLINE       0     0     0
            c10t5000C50031D158FDd0  ONLINE       0     0     0
            c11t5000C5002C823045d0  ONLINE       0     0     0
          mirror-2                  ONLINE       0     0     0
            c12t5000C50031D91AD1d0  ONLINE       0     0     0
            c2t5000C50031D911B9d0   ONLINE       0     0     0
          mirror-3                  ONLINE       0     0     0
            c13t5000C50031BC293Dd0  ONLINE       0     0     0
            c14t5000C50031BD208Dd0  ONLINE       0     0     0
          mirror-4                  ONLINE       0     0     0
            c15t5000C50031BBF6F5d0  ONLINE       0     0     0
            c16t5000C50031D8CFADd0  ONLINE       0     0     0
          mirror-5                  ONLINE       0     0     0
            c17t5000C50031BC0E01d0  ONLINE       0     0     0
            c18t5000C5002C7CCE41d0  ONLINE       0     0     0
        logs
          c19t0d0                   ONLINE       0     0     0
        cache
          c6t5001517959467B45d0     FAULTED      2   542     0  too many errors
        spares
          c7t5000C50031CB43D9d0     AVAIL

errors: No known data errors
This did not trigger any alerts from within Nexenta.
I was under the impression that an L2ARC failure would not impact the system. But in this case, it surely was the culprit. I've never seen any recommendations to RAID L2ARC. Removing the bad SSD entirely from the server got me back running, but I'm concerned about the impact of the device failure (and maybe the lack of notification from NexentaStor as well).
Edit - What's the current best-choice SSD for L2ARC cache applications these days?
ZFS does not do disk I/O; the device drivers below ZFS do. If the device does not respond in a timely manner or, as in this case, disrupts all other devices on the expander, then ZFS does not see it as a failure. All ZFS sees is a slow I/O.
There is a bug in the Intel X25-M firmware that affects the drive's behaviour under heavy load and can cause reset storms. This problem affects all OSes and cannot be solved at the OS layer. Please contact your hardware supplier for fixes or remediation.
If a read is expected to be satisfied by the L2ARC, then the read will be attempted there. ZFS then relies on the lower-layer drivers to report an error. In this case, the drive continues to reset and retry for as long as 5 minutes before the I/O is declared failed, depending on the driver, device, and default timeout settings. Only after the lower-layer drivers declare the I/O failed will ZFS retry on the pool.
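To make the timing concrete, here is a toy Python model of that sequence. It is only a sketch of the behaviour described above, not ZFS or driver code: the `Device` class, the `read_block` function, and the latency and timeout figures are all made up for illustration.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Device:
    """Hypothetical stand-in for a block device as seen from above the driver."""
    name: str
    healthy: bool
    latency_s: float         # time for a successful read
    driver_timeout_s: float  # how long the driver retries before reporting failure


def read_block(cache: Device, pool: Device) -> Tuple[str, float]:
    """Return (name of the device that served the read, seconds the caller waited)."""
    if cache.healthy:
        # Normal case: the read expected to be in the cache is served from it.
        return cache.name, cache.latency_s
    # The cache device is misbehaving, but the upper layer only finds out
    # once the driver gives up and reports the I/O as failed.
    waited = cache.driver_timeout_s
    # Only then is the read retried against the pool.
    waited += pool.latency_s
    return pool.name, waited


# Illustrative numbers only: a 5 minute driver timeout on the cache device.
good_ssd = Device("l2arc", healthy=True, latency_s=0.0002, driver_timeout_s=300.0)
bad_ssd = Device("l2arc", healthy=False, latency_s=0.0002, driver_timeout_s=300.0)
pool = Device("pool", healthy=True, latency_s=0.5, driver_timeout_s=300.0)

print(read_block(good_ssd, pool))
print(read_block(bad_ssd, pool))
```

With the failing device, a single read that should have taken well under a millisecond instead waits out the whole driver timeout before the pool is even tried, which is why the clients see a hang rather than a clean failover.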
NexentaStor's volume-check and disk-check runners look for additional error messages and alert you via email and fault logging. The disk-check runner has been improved in the 3.1 release to alert you specifically to the conditions exhibited by broken firmware in SSDs.
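If you want an independent check in the meantime, something along these lines would have flagged the FAULTED cache device in the output quoted in the question. This is a rough sketch and not how the NexentaStor runners work: `find_unhealthy` is a hypothetical helper, it operates on `zpool status` text you pass in, and the list of state keywords is my assumption about what is worth treating as a device state.

```python
from typing import List, Tuple

# States treated as "fine"; anything else in DEVICE_STATES is reported.
HEALTHY_STATES = {"ONLINE", "AVAIL"}
# Assumed set of state keywords that can appear in the STATE column.
DEVICE_STATES = HEALTHY_STATES | {
    "DEGRADED", "FAULTED", "OFFLINE", "UNAVAIL", "REMOVED", "INUSE",
}


def find_unhealthy(status_text: str) -> List[Tuple[str, str, str]]:
    """Return (device, state, note) for every device row that is not healthy."""
    problems = []
    for line in status_text.splitlines():
        tokens = line.split()
        if len(tokens) < 2 or tokens[0].endswith(":"):
            continue  # blank lines, section names, "pool:" / "state:" headers
        name, state = tokens[0], tokens[1]
        if state in DEVICE_STATES and state not in HEALTHY_STATES:
            # Anything after the READ/WRITE/CKSUM counters is a free-text note.
            note = " ".join(tokens[5:])
            problems.append((name, state, note))
    return problems


# A few rows taken from the output in the question.
SAMPLE = """
 state: ONLINE
logs
c19t0d0 ONLINE 0 0 0
cache
c6t5001517959467B45d0 FAULTED 2 542 0 too many errors
spares
c7t5000C50031CB43D9d0 AVAIL
"""

for device, state, note in find_unhealthy(SAMPLE):
    print(f"{device}: {state} ({note})")
```

Note that the pool itself still reports ONLINE when only a cache device has faulted, so a check that looks at the pool state alone will miss this case; the per-device rows have to be inspected.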
Bottom line: your hardware is faulty and will need to be fixed or replaced.
Are you connecting the X25-M SSD to the backplane? There's a known issue with Nexenta and accessing the L2ARC over a backplane. Your best bet is to connect the SSD directly into a SATA port on the motherboard. Make sure it's configured to use AHCI as well.
If you're running anything mission critical on this server I would switch to an SLC SSD (like the X25-E or a STEC SSD). That being said, you'll probably be OK with the X25-M if it's not.
Ed, there are several that you can use, ranging from relatively reasonable in price to pretty darn expensive. I prefer to deploy SAS SSDs in all cases and have done very well with both STEC and Pliant. Both now offer an MLC drive that will work famously as an L2ARC device. Not yet tested but coming soon is the SSD offering from Seagate that is SLC SAS 2.0 and rumored to be "not expensive". Stay tuned....
-PB