Minimizing Hard Disk Drive Failure and Data Loss/Drive Life-Cycle Management
Some brands of drives are more reliable than others. While reliability data for particular models can be hard to come by, various factors can be used to estimate a drive's reliability. These factors include user ratings, the drive's class and error recovery behavior, and physical attributes such as the number of heads.
Drives with relatively higher user ratings and good reviews should be preferred. Drives with relatively lower ratings should not be purchased unless they will be used in a RAID environment. The reliability of a model with fewer than five user ratings, as is the case with brand-new models, is harder to estimate.
Enterprise-class drives are advertised as having slightly higher reliability than standard desktop-class drives, but they cost more.
Error recovery mechanism
Hard drives can come with a built-in recovery mechanism that attempts a repair when an error occurs. This recovery cycle attempts to recover data from the problematic area, and then reallocates a dedicated area to replace it. The process can take up to a few minutes depending on the severity of the issue.
Drives meant to be used in a RAID environment must have a feature that prevents them from entering a long recovery cycle; otherwise the RAID controller can drop the drive from the array. This feature is known as Time-Limited Error Recovery (TLER) by Western Digital, Error Recovery Control (ERC) by Seagate, and Command Completion Time Limit (CCTL) by Samsung and Hitachi.
Desktop drives that can enter a long recovery cycle should therefore not be used in RAID environments, although drives with TLER / ERC / CCTL can be used in non-RAID environments.
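The timing conflict described above can be sketched in a few lines of Python. This is a minimal illustration, not a vendor tool: the 120 second and 7 second figures come from the Western Digital FAQ quoted in the references, the function names are hypothetical, and `/dev/sdX` is a placeholder device path. On drives that expose the setting, smartmontools can read or change the limit with `smartctl -l scterc`, which takes values in units of 100 ms; many drives do not support it, and the setting may not survive a power cycle.

```python
# Figures taken from the Western Digital FAQ quoted in the references.
DESKTOP_WORST_CASE_S = 120.0  # "up to 2 minutes" of deep recovery
CONTROLLER_TIMEOUT_S = 7.0    # low end of the 7 to 15 second range


def risks_being_dropped(recovery_s, controller_timeout_s=CONTROLLER_TIMEOUT_S):
    """Return True if a recovery cycle can outlast the controller's timeout."""
    return recovery_s > controller_timeout_s


def scterc_command(device, read_s=7.0, write_s=7.0):
    """Build a smartctl call that sets the read/write error recovery limits.

    smartctl expects the limits in units of 100 ms, so 7 seconds becomes 70.
    The command is only built here, not executed.
    """
    read_ds = round(read_s * 10)
    write_ds = round(write_s * 10)
    return ["smartctl", "-l", f"scterc,{read_ds},{write_ds}", device]


if __name__ == "__main__":
    # A desktop drive in deep recovery versus a strict RAID controller.
    print(risks_being_dropped(DESKTOP_WORST_CASE_S))
    # Placeholder device path; substitute the real one before running.
    print(" ".join(scterc_command("/dev/sdX")))
```

Building the command as a list rather than running it keeps the sketch safe to try on any machine; passing the list to `subprocess.run` would apply it, typically with administrator rights.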
Number of heads
There is a strong positive correlation between the number of heads in a drive and its failure rate. When choosing between two drives of equal capacity and speed, the one with fewer heads is therefore preferred. In practice this criterion may rarely apply, because drives with similar features tend to have the same number of heads.
A drive has a higher than usual chance of failing in its first few months of use. This increased rate is due to assembly, configuration, or component-level problems. If a drive is susceptible to such a problem, it is beneficial to detect the problem before the drive is put into use.
To help with this, new drives can first be put through a short burn-in process using special software. This process performs read and write stress tests on the drive, aiming to catch problems that may lead to its early failure. One commercial application for both Windows and Linux that performs this and other burn-in tests is PassMark BurnInTest. Care must be taken to ensure that a drive does not overheat during a burn-in.
S.M.A.R.T. reliability data can be queried before and after the burn-in. If a new error is found after the burn-in, it can indicate that the drive is susceptible to early failure.
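The before-and-after comparison can be sketched as follows. This is a rough illustration that assumes the attribute table was saved as text from `smartctl -A` (smartmontools) before and after the burn-in. The sample tables are made up for the example, the function names are hypothetical, and the three watched counters are one reasonable choice rather than a definitive list.

```python
# Illustrative samples in the layout printed by `smartctl -A`; values are invented.
BEFORE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
"""

AFTER = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       8
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
"""

# Counters assumed here to signal trouble when they grow during a burn-in.
WATCHED = ("Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable")


def parse_smart_attributes(text):
    """Map attribute name to integer raw value for each attribute row."""
    attributes = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        raw = fields[9]
        if raw.isdigit():
            attributes[fields[1]] = int(raw)
    return attributes


def new_errors(before, after, watched=WATCHED):
    """Return {name: (before, after)} for watched counters that increased."""
    return {
        name: (before.get(name, 0), after[name])
        for name in watched
        if name in after and after[name] > before.get(name, 0)
    }


if __name__ == "__main__":
    grown = new_errors(parse_smart_attributes(BEFORE), parse_smart_attributes(AFTER))
    print(grown or "no new errors in the watched counters")
```

A counter that grows during the burn-in, such as the reallocated sector count in the sample, is the kind of new error that would justify returning the drive before it holds any data.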
Routine drive upgrades
Planned obsolescence is usually something imposed by the company selling a product, but in this case it is initiated by the consumer. Older, smaller drives can routinely be replaced by newer, larger ones. Besides providing more storage capacity, this reduces the risk of losing the data on the older drive, because that drive is replaced well before the end of its life. This is particularly applicable to consumers whose storage needs keep growing, as they benefit most from the added capacity.
Drives can be replaced based on their features, age, or their fitness as determined by S.M.A.R.T. parameters.
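The three criteria can be combined into a simple replacement check. The sketch below is only one possible policy: the `DriveStatus` fields, the function name, and every threshold (5 years, 1 TB, zero reallocated sectors) are assumptions chosen for illustration, not recommendations from the sources cited here.

```python
from dataclasses import dataclass


@dataclass
class DriveStatus:
    """Hypothetical summary of one drive; all fields are illustrative."""
    age_years: float
    capacity_tb: float
    reallocated_sectors: int  # e.g. the S.M.A.R.T. Reallocated_Sector_Ct raw value


def replacement_reasons(drive, max_age_years=5.0, min_capacity_tb=1.0,
                        max_reallocated=0):
    """List the reasons to replace a drive; an empty list means keep it.

    The default thresholds are assumptions and should be tuned per site.
    """
    reasons = []
    if drive.capacity_tb < min_capacity_tb:
        reasons.append("features: capacity below current needs")
    if drive.age_years >= max_age_years:
        reasons.append("age: past the planned service life")
    if drive.reallocated_sectors > max_reallocated:
        reasons.append("fitness: S.M.A.R.T. shows reallocated sectors")
    return reasons


if __name__ == "__main__":
    old_small = DriveStatus(age_years=6.0, capacity_tb=0.5, reallocated_sectors=0)
    for reason in replacement_reasons(old_small):
        print(reason)
```

Returning the reasons rather than a bare yes or no makes it clear which criterion triggered the replacement, which helps when the thresholds are later adjusted.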
"What is the difference between Desktop edition and RAID (Enterprise) edition hard drives?". Western Digital: Knowledge Base: Frequently Asked Questions. Western Digital. http://wdc.custhelp.com/cgi-bin/wdc.cfg/php/enduser/std_adp.php?p_faqid=1397. Retrieved 2008-12-11. "If you install and use a desktop edition hard drive connected to a RAID controller, the drive may not work correctly unless jointly qualified by an enterprise OEM. This is caused by the normal error recovery procedure that a desktop edition hard drive uses.
When an error is found on a desktop edition hard drive, the drive will enter into a deep recovery cycle to attempt to repair the error, recover the data from the problematic area, and then reallocate a dedicated area to replace the problematic area. This process can take up to 2 minutes depending on the severity of the issue. Most RAID controllers allow a very short amount of time for a hard drive to recover from an error. If a hard drive takes too long to complete this process, the drive will be dropped from the RAID array. Most RAID controllers allow from 7 to 15 seconds for error recovery before dropping a hard drive from an array. Western Digital does not recommend installing desktop edition hard drives in an enterprise environment (on a RAID controller).
Western Digital RAID edition hard drives have a feature called TLER (Time Limited Error Recovery) which stops the hard drive from entering into a deep recovery cycle. The hard drive will only spend 7 seconds to attempt to recover. This means that the hard drive will not be dropped from a RAID array. Though TLER is designed for RAID environments, it is fully compatible and will not be detrimental when used in non-RAID environments."
Jon G. Elerath and Sandeep Shah (January 2003). "Disk drive reliability case study: Dependence upon fly-height and quantity of heads". Proceedings of the Annual Symposium on Reliability and Maintainability. Annual Symposium on Reliability and Maintainability. pp. 608–612. http://rams.org/.