The Mean Time Between Failures, or MTBF, for this SSD is listed as 1,500,000 hours.
That is a lot of hours. 1,500,000 hours is roughly 170 years. Since this particular SSD was obviously invented well after the Civil War, how do they know what the MTBF is?
A couple of options that make sense to me:
- Newegg just has a typo
- The definition of mean time between failures is not what I think it is
- They are using some type of statistical extrapolation to estimate what the MTBF would be
Question:
How is the Mean Time Between Failures (MTBF) obtained for SSDs/HDDs?
Drive manufacturers specify the reliability of their products in terms of two related metrics: the annualized failure rate (AFR), which is the percentage of disk drives in a population that fail in a test scaled to a per year estimation; and the mean time to failure (MTTF).
The AFR of a new product is typically estimated based on accelerated life and stress tests or based on field data from earlier products. The MTTF is estimated as the number of power on hours per year divided by the AFR. A common assumption for drives in servers is that they are powered on 100% of the time.
http://www.cs.cmu.edu/~bianca/fast/
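To see how those two metrics hang together, here is a minimal Python sketch, assuming the drive is powered on 24/7 (8,760 hours per year) as the quote suggests; the AFR value is just the one implied by the 1,500,000-hour figure, not a published spec:

# Relate AFR and MTTF as defined above, assuming 24/7 power-on.
POWER_ON_HOURS_PER_YEAR = 24 * 365  # 8,760 hours

def mttf_from_afr(afr):
    # MTTF in hours, given an annualized failure rate (0.0058 means 0.58% per year)
    return POWER_ON_HOURS_PER_YEAR / afr

def afr_from_mttf(mttf_hours):
    # Annualized failure rate implied by an MTTF given in hours
    return POWER_ON_HOURS_PER_YEAR / mttf_hours

print(afr_from_mttf(1_500_000))  # ~0.0058, i.e. roughly 0.6% of drives fail per year
print(mttf_from_afr(0.0058))     # ~1.5 million hours

In other words, a 1.5-million-hour MTTF is a statement about roughly 0.6% of a drive population failing per year, not about any single drive surviving 170 years.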
MTTF of 1.5 million hours sounds somewhat plausible.
But I expect that is simply the constant "random" mechanical/electronic failure rate.
Assuming that failure rates follow the bathtub curve, as mentioned in the comments, the manufacturer's marketing team can massage the reliability numbers a bit, for instance by not including DOAs (dead on arrival: units that passed quality control but fail when the end user installs them) and by stretching the DOA definition to also exclude units in the early failure spike. And because testing isn't performed for long enough, you won't see age effects either.
I think the warranty period is a better indication of how long a manufacturer really expects an SSD to last!
That definitely won't be measured in decades or centuries...
Alongside the MTBF, reliability is also bounded by the finite number of write cycles NAND cells can support. A common metric for this is the total write capacity, usually in TB. In addition to other performance requirements, that is one big limiter.
To allow a more convenient comparison between different makes and differently sized drives, the write endurance is often converted to a daily write capacity expressed as a fraction of the disk capacity.
The higher that number, the more suited the disk is for write-intensive IO.
At the moment (end of 2014), value server-line SSDs are rated at 0.3-0.8 drive writes per day, mid-range ratings are increasing steadily from 1 to 5, and the high end seems to skyrocket, with write endurance levels of up to 25 times the drive capacity per day for 3-5 years.
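As an illustration of that conversion, here is a small Python sketch; the 480 GB capacity, 300 TBW rating, and 5-year period are made-up example values, not a specific product:

# Convert a TBW endurance rating into drive writes per day (DWPD).
# The capacity, TBW, and warranty period are hypothetical example numbers.
def dwpd(tbw_tb, capacity_tb, warranty_years):
    days = warranty_years * 365
    return tbw_tb / (capacity_tb * days)

print(dwpd(tbw_tb=300, capacity_tb=0.48, warranty_years=5))  # ~0.34 DWPD

A result of roughly 0.34 DWPD would put such a drive in the value range mentioned above.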
Some real-world tests show that the vendor claims can sometimes be massively exceeded, but driving equipment way past the vendor limits isn't always an enterprise consideration... Instead, buy correctly spec'd drives for your purposes.
Unfortunately the MTBF isn't what most people think...
It is not how long an individual drive will last.
Manufacturers expect their drives to last as long as the warranty; after that it really isn't their problem. Older electromechanical platter hard drives will seize up after 10 or so years. Integrated circuits last an extremely long time, but other components (notably capacitors) wear out after a somewhat predictable number of cycles.
It is how many of these drives you would need to expect 1 drive to fail every hour.
As others have pointed out, manufacturers do various testing over a reasonable period of time and determine a failure rate. There's a fair amount of variance in these sorts of tests, and marketing often has "input" as to what the final number should be. Regardless, they make a best-effort guess as to how many drives would be needed to average one failure per hour.
For situations with fewer drives you can infer a statistical probability of failure based on the MTBF, but keep in mind that failures in well-designed products should follow a "bathtub" curve - that is, higher failure rates when devices are initially put into service and after their warranty period has expired, with lower failure rates in between.
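As a sketch of what such an inference looks like, assuming a constant failure rate (the flat middle of the bathtub curve, which is the usual simplification behind MTBF figures):

import math

MTBF_HOURS = 1_500_000
HOURS_PER_YEAR = 24 * 365  # 24/7 operation

# Probability that a single drive fails within one year (exponential model).
p_fail_one_year = 1 - math.exp(-HOURS_PER_YEAR / MTBF_HOURS)
print(p_fail_one_year)  # ~0.0058, about 0.6% per drive per year

# Expected failures per hour across a large fleet.
fleet_size = 1_500_000
print(fleet_size / MTBF_HOURS)  # ~1 failure per hour, matching the interpretation above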
They come from a statistical evaluation based on a small sample size and a short amount of time. There's really no universally agreed-upon method or process, so it's just silly 'marketing'.
This article may explain it a bit more. And Wikipedia has some formulas which might be what you're looking for?
Essentially, for nearly everything (including general household machines such as a dishwasher), several units are run for X amount of time. The number of failures that occur during this period is used to calculate the MTBF.
It's of course not feasible to run products through an entire lifecycle, e.g. SSDs, which will last a long time. They are mostly limited by the amount of writes rather than by the kind of random failure that MTBF describes.
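A minimal sketch of that calculation; the population size, test duration, and failure count below are invented purely for illustration:

# MTBF estimated from a limited test: total device-hours divided by failures.
units_tested = 1_000      # hypothetical test population
test_hours = 3_000        # hypothetical test duration per unit
failures_observed = 2     # hypothetical failure count

mtbf_hours = (units_tested * test_hours) / failures_observed
print(mtbf_hours)  # 1,500,000 hours, even though no unit ran longer than 3,000 hours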
The bad news about MTBF is that the usual evaluation methods assume an evenly distributed write load across all NAND cells. But cells are grouped into clusters, and when a single cell fails the whole cluster is marked as dead and replaced with a new one from the reserve. The reserve is usually about 20% of the SSD's capacity; when it is exhausted, the whole SSD is marked as dead.
In real life an SSD holds persistent data as well as volatile data. Imagine that 90% of the SSD is filled with static data and the remaining 10% is under heavy write load. The SSD controller spreads the load among the available free clusters, so that 10% exhausts its lifespan ten times faster than you estimated, and those clusters get replaced from the reserve again and again until it runs out.
In a really bad case where the persistent-to-volatile ratio is 30:1 or greater - for example, a pile of photos plus a relatively small database for a popular website - your SSD can die within a year.
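A back-of-the-envelope sketch of the model this answer describes, assuming writes land only on the clusters not occupied by static data (the 90/10 split and 30:1 ratio are the figures used above):

# Writes are concentrated on the clusters not occupied by static data.
static_fraction = 0.9    # 90% of the drive holds data that never changes

wear_multiplier = 1 / (1 - static_fraction)
print(wear_multiplier)   # ~10: the writable clusters wear ~10x faster than
                         # an evenly-distributed-write estimate assumes

# With a 30:1 persistent-to-volatile ratio the effect is even stronger.
print(1 / (1 - 30 / 31))  # ~31x faster wear on the writable clusters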
One of my customers was very impressed with SSD specifications and insisted on equipping his DBMS server with a pair of them. Over the next 12 months we replaced both of them twice.
But according to the marketing materials the lifespan of an SSD is 170 years. Sure.
MTBF is not a relevant measure of SSD endurance, because an SSD is not sensitive to running time itself the way an ordinary spinning HDD is, but to the number of re-writes its NAND cells endure. A more relevant measure for an SSD is Drive Writes Per Day (DWPD). For example, some 3.2 TB enterprise-class SSDs are rated for an endurance of 3 DWPD for 5 years.
Sometimes SSD vendors provide endurance in terms of (Total) Terabytes Written (TBW) or "write cycles" instead, which can easily be translated to DWPD and vice versa, knowing the rated lifetime and the capacity of the given SSD drive.
For the given example with the 3.2 TB SSD drive:
TBW = DriveSize * DWPD * Days;
TBW = 3.2 TB * 3 * (5 * 365) = 17,520 TB over 5 years
The number of full write cycles over the rated lifetime is
WriteCycles = DWPD * Days;
WriteCycles = 3 * (365 * 5) = 5,475 total write cycles for the given disk
What is important to notice is that this is the worst case, assuming 100% write utilization of the drive's sustainable throughput (say 80 MB/s for this drive), which is very likely not achievable.
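A quick Python sketch of that worst case, reusing the 17,520 TB endurance figure and the 80 MB/s sustained throughput from above (decimal units assumed throughout):

# How long would it take to write 17,520 TB at a sustained 80 MB/s,
# i.e. 100% write utilization around the clock?
tbw_tb = 17_520
throughput_mb_per_s = 80

seconds = (tbw_tb * 1_000_000) / throughput_mb_per_s  # TB -> MB, decimal units
years = seconds / (3600 * 24 * 365)
print(years)  # ~6.9 years of non-stop writing to exhaust the rated endurance

At 80 MB/s you cannot write fast enough to burn through the rated endurance within the 5-year period, which illustrates why the 100% utilization case is more of a theoretical bound than a practical risk.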