Background:
We need ready access to 30TB of audio data. Only a small fraction of it is ever requested for playback, but that playback needs to happen immediately, even for data that is several years old. The data resides in a SAN of multiple arrays, and a nightly backup is performed on new data. Some data is also removed every night. Since both are write events, call it 20GB a night. The overall trend is that more new data is written than old data is removed.
Weekly Patrol Reads (PR) and Consistency Checking (CC) account for most of the disk activity on the arrays, other than the disks just spinning until they fail.
Question:
I'm trying to figure out whether the disk-based SAN should be replaced with one using NVMe, what RAID level to consider, and whether it makes sense to reduce the frequency of PR or CC activity for V-NAND technology.
It is my understanding that what kills V-NAND is writes, and we would be writing far less than the rated daily write limit of most drives, even considering the consistency checking.
I have been able to find almost no testing of RAID 5/6 on NVMe, or even on SSDs in general. I'm primarily after long-term availability.
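As a rough back-of-envelope check of that write-endurance assumption, a minimal sketch follows. The 30TB capacity and ~20GB/night of writes come from the workload described above; the write-amplification factor and the 0.3 DWPD drive class are assumptions for illustration, not measured values from this SAN.

```python
# Back-of-envelope endurance check. Capacity and nightly writes are from
# the question; write amplification and DWPD rating are assumptions.

CAPACITY_TB = 30            # usable capacity of the array
NIGHTLY_WRITES_GB = 20      # backup + removal churn per night
RATED_DWPD = 0.3            # typical read-optimized enterprise NVMe rating
WRITE_AMPLIFICATION = 3.0   # pessimistic guess for RAID parity + NAND WA

daily_writes_tb = NIGHTLY_WRITES_GB / 1000 * WRITE_AMPLIFICATION
effective_dwpd = daily_writes_tb / CAPACITY_TB

print(f"Effective workload: {effective_dwpd:.5f} DWPD")
print(f"Headroom vs. a {RATED_DWPD} DWPD rating: {RATED_DWPD / effective_dwpd:.0f}x")
```

Even with a pessimistic write-amplification guess, the workload lands orders of magnitude below the rating of the cheapest read-optimized drives.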
Research:
Most of the other questions on this topic predate NVMe technology and are 6-7 years old. This one is an exception but doesn't really cover this scenario either.
Understanding NVMe storage and hardware requirements
Related:
Long term storage of business critical data
Long term archival of video & Audio files
One Year Raid 0 setup
By using SSDs over HDDs you will get some power benefit and likely a reliability benefit (enterprise-grade SSDs are far more reliable than enterprise-grade HDDs). NAND endurance is not an issue, especially not at your level of write activity, and even at much higher levels it is rarely a real concern. You can most likely go for the relatively cheaper read-optimized drives (with 0.3 DWPD) and have no worries with regard to endurance.
The only question in such a use case is if the cost of the drives warrants the power and reliability advantages.
As for reliability/availability, all the enterprise-grade SSDs I've seen advertise an MTBF of 2 million hours, and those I've worked with have exceeded that mark. On the other side, enterprise-grade HDDs typically claim 1.2 million hours MTBF and none I've seen got even halfway there, so you will see a big jump in reliability with the move. Again, whether it's really worth the cost is your calculation to make.
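To put those MTBF figures in perspective, here is a minimal sketch converting them into expected drive failures per year for an array, assuming a constant failure rate. The 24-drive array size is a hypothetical example, not something stated in the question.

```python
# Convert advertised MTBF into expected drive failures per year for an
# array, assuming a constant failure rate. Drive count is hypothetical.

HOURS_PER_YEAR = 8766
DRIVES = 24

def expected_failures_per_year(mtbf_hours: float, drives: int) -> float:
    """Expected drive failures per year across the whole array."""
    annualized_failure_rate = HOURS_PER_YEAR / mtbf_hours
    return annualized_failure_rate * drives

for label, mtbf in [("SSD, 2.0M h MTBF", 2_000_000),
                    ("HDD, 1.2M h MTBF", 1_200_000)]:
    print(f"{label}: {expected_failures_per_year(mtbf, DRIVES):.2f} failures/year")
```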
My qualification here: I worked on enterprise storage systems involving HDDs and SSDs, handled the hardware/software integration, and was deeply involved in the reliability of the combined systems. The data sets I relied on are private, so there is no open research I can point to.
Fading electrical charge also kills NAND. It happens very slowly on a good solid-state drive, but becomes noticeable over time. That is quite different from magnetic spindles, which hold data for 10 years or more. If they spin up again, that is.
Look up reliability data as a function of bytes written, hours spinning, and other metrics: vendor specs as well as any public data sets. Replace drives whenever they show wear, especially near the end of their warranty, at maybe 3 years old.
Use different media for your backups than online data. If the primary storage is solid state, use tape or magnetic spindles for the protection storage.
Reevaluate archive media at least every 10 years. Transfer old backups you care about to whatever the current protection media is.
Being a good archivist is not specific to the media type or redundancy scheme; storage evolves over time. There is not one answer here, even for similar performance, availability, and cost requirements.
Flash storage is still too new for there to be any good at-scale studies of long-term longevity to rely on. So far, the indications for SLC and MLC flash look good, suggesting longevity as good as or better than spinning rust. TLC, and especially QLC, flash is far too new to make any qualified predictions about, but it could reasonably be expected to provide worse longevity than SLC and MLC.

Personally, I wouldn't move from spinners to flash for longevity reasons, but possibly for other reasons such as performance. Instead, I'd look into the integrity features of the storage management system and make sure it can properly deal with partially lost or corrupted data. ZFS is possibly the leader in this respect.
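Related to the RAID 5/6 part of the question and to why scrubbing and integrity checking matter at this capacity: a minimal sketch of the chance of hitting at least one unrecoverable read error (URE) while reading the full array, e.g. during a rebuild or scrub. The URE rates used are typical datasheet classes, not measurements from this hardware.

```python
import math

# Probability of at least one unrecoverable read error (URE) while
# reading the entire array, as during a RAID 5 rebuild or a scrub.
# URE rates below are typical datasheet values, not measured figures.

ARRAY_TB = 30
ARRAY_BITS = ARRAY_TB * 1e12 * 8

def p_at_least_one_ure(ure_rate_bits: float) -> float:
    """Approximates 1 - (1 - 1/rate)^bits as 1 - exp(-bits/rate);
    expm1 avoids floating-point precision loss for tiny rates."""
    return -math.expm1(-ARRAY_BITS / ure_rate_bits)

print(f"HDD class (1 per 1e15 bits): {p_at_least_one_ure(1e15):.1%}")
print(f"SSD class (1 per 1e17 bits): {p_at_least_one_ure(1e17):.2%}")
```

The gap between the two classes is one reason double parity (RAID 6 or RAIDZ2) is the usual recommendation for large spinning arrays, and why the argument is somewhat weaker, though still commonly made, for enterprise SSDs.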