Minimizing Hard Disk Drive Failure and Data Loss

From Wikibooks, the open-content textbooks collection

While every hard disk drive can eventually be expected to fail, some preparation and practical information can be used to delay this failure and to avoid data loss when failure is likely.

Introduction

The strategies for minimizing hard disk drive failure and the consequent data loss are:

  • Delaying drive failure by identifying its causes and acting to prevent some of those causes
  • Detecting an impending drive failure by monitoring drive health and using early warning signs
  • Maintaining data redundancy using RAID, backups, parchives, and sharing
  • Reducing one's data by means of routine cleanups and data compression software
  • Managing drive life cycle by appropriate drive selection, performing a burn-in, and doing routine replacements

Data loss does not have to be caused by a drive failure. It can also be caused by a virus or by user error.

A drive may appear to stop working without having actually failed. The cause is often a loose or damaged data or power cable, or a problem with an expansion card or enclosure. The process of elimination can be used to aid in troubleshooting. The relevant cables can be swapped and checked to ensure that they are correctly plugged in. If an expansion card is used, the drive can be tested with a different port on the card. If an enclosure is used, the drive can be tested in a different enclosure, or a different drive can be tested in the enclosure.

Information for data recovery once data loss has occurred is not included. Additionally, relevant information is not duplicated en masse from Wikibooks or Wikipedia. This is in an attempt to reduce information duplication. Links to such information are included instead.

While most of the stated strategies apply exclusively to hard disk drives, some also apply to solid-state drives. In particular, these include electricity control, redundancy, and antivirus protection.

Delaying drive failure

A drive can fail for several reasons. Some of these causes can be acted upon and prevented.

Environmental control

Temperature control

Overheating is purported to be a common cause of drive failure. Overheating can cause the platters to expand. If the disk's read-and-write head comes in contact with the disk's surface, a catastrophic head crash can result.

Each drive has a specified operating temperature range with a lower and an upper bound. In addition, drives that constantly run relatively hot, i.e. near the upper bound of the operating temperature, are thought to have a reduced lifetime.

Inadequate ventilation, especially during the summer months, can cause a drive's temperature to exceed safe levels. In desktops, this can be handled by ensuring that a computer fan is installed near each drive to move hot air outside. Other types of computer cooling can also be used as an alternative or in addition to basic air cooling. Air conditioning can be used if the room or the area in which the computer is present becomes too hot.

Laptops can be given additional cooling with a laptop cooler, with an active cooler preferred over a passive one. This can be especially important if the drive already runs hot.

External hard disk drives should preferably be housed in a disk enclosure that has a fan, rather than one without a fan. The absence of a fan in the enclosure can be partly compensated for by using an ordinary table fan to improve airflow around the enclosure. Stacking multiple external drives together, especially if they do not have fans, is strongly discouraged as it impedes heat transfer.

Temperature monitoring

Many drives include a temperature sensor and a thermal monitoring feature. The sensor can be queried using software, so the drive's current temperature can be monitored continuously. Two free Windows software applications that do this are HD Tune and SpeedFan. Several other programs are available as well. If the temperature exceeds a preset threshold, perhaps 50 °C, the monitoring application can be configured to log the event, warn the user, and shut down the drive or computer. If the drive includes a thermal monitoring feature, the drive shuts itself down when its temperature reaches a critical level, perhaps 65 °C.

A common misconception is that a colder hard drive will always last longer than a hotter one. A study by Google found otherwise.[1] Hard drives with average temperatures below 27 °C had a higher failure rate than hard drives with the highest reported average temperature of 50 °C, and a failure rate at least twice that of drives in the optimum temperature range of 37 °C to 46 °C.[1]

Figure: Average temperature versus annual failure rate for HDDs

It is recommended that the operating temperature of a drive not steadily exceed 47 °C, as this may disproportionately reduce its life. This, however, may not be feasible in laptops.
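
The thresholds discussed in this section can be combined into a simple check. The following is a minimal Python sketch rather than a monitoring tool. It assumes that temperature readings in °C have already been obtained by some other means. The function name `assess_temperatures` is hypothetical, and the default thresholds (50 °C warning, 65 °C critical, 47 °C sustained limit) are the illustrative values from the text, not manufacturer specifications.

```python
from statistics import mean


def assess_temperatures(readings_c, warn_c=50.0, critical_c=65.0, sustained_c=47.0):
    """Return a list of findings for a sequence of drive temperatures in °C.

    All thresholds are illustrative defaults, not manufacturer limits.
    """
    if not readings_c:
        raise ValueError("at least one reading is required")
    findings = []
    peak = max(readings_c)
    if peak >= critical_c:
        findings.append("critical: shut down the drive or computer")
    elif peak >= warn_c:
        findings.append("warning: log the event and alert the user")
    # A steadily high average matters even when no single reading is alarming.
    if mean(readings_c) > sustained_c:
        findings.append("sustained: average exceeds the recommended limit")
    return findings
```

An empty list means that none of the three conditions was met. For a drive whose sensor is external, the thresholds passed in would presumably be lowered by a few degrees.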

As in most engineering fields, it is highly recommended that the Celsius temperature scale be used for managing computer temperatures, even in the U.S.

Unreadable sensor data

At times, a drive may include a temperature sensor, but the temperature data may not be readable. This is possible under at least three conditions:

  1. The drive is part of a RAID. Especially in case of hardware-based RAID, the drive itself will not be seen by the operating system; only the logical RAID drive will be seen.
  2. The drive is connected to a controller card, irrespective of whether or not the card implements RAID.
  3. The drive is external to the computer.

In such situations, it might be possible to glue or tape an external temperature sensor to the drive's surface. Alternatively, if the drive is in a storage backplane, the backplane may have a built-in temperature sensor with a configurable threshold. With such external sensors, the temperature threshold can be set a few degrees, perhaps 5 °C, lower than that for an internal sensor.

Condensation control

If a computer is moved from a cold place, such as outdoors, to a relatively warm place, such as indoors, it can result in condensation inside the drive and on other system components. Damage can ensue if the condensation is not given sufficient time to evaporate before the device is powered on. Depending upon the change in the device's temperature, the time needed to acclimatize the device can be up to several hours. Rapid and extreme temperature changes should be avoided for this reason.

Air quality control

Tobacco smoke and other particulates in the air near a computer may adversely affect the drive. Smoking in the presence of a computer is therefore discouraged. Particulate reduction, if necessary, may be achieved by the use of an effective air purifier.

Vibration control

Powerful vibrations near the computer, such as those caused by a subwoofer, may increase the risk of a head crash. Such vibrations should therefore be limited. One way to isolate low-frequency vibrations is by supporting the speakers or the drive enclosure on spikes.

Motion control

Sudden acceleration of the computer, especially when it is powered on, may damage the drive. Laptops are especially prone to such damage. Such movements should therefore be avoided.

An external drive, if placed upright without a stand, is at risk of tipping and falling when in use, thus possibly causing damage. Laying it flat on a level surface avoids this risk.

Shipment damage control

During shipment, a drive is at risk of being damaged due to shock and vibrations. Adequate cushioning can be used to reduce the risk of damage.

Magnetic field control

Data is stored on a drive using magnetism. An external device with a strong magnetic field can cause data loss if it is brought close to the computer. Such devices often come with a warning notice which states the minimum distance at which they are to be kept from other electronic devices such as computers.

Electricity control

Power protection

A voltage spike can burn out the circuitry of a drive and destroy the data on it, in addition to damaging other system components. To guard against this and other power problems, a surge protector is essential.

A voltage spike can also enter the computer via other cables, such as a telephone, coaxial, or network cable. Surge protectors designed for these sources are available. One such item offering comprehensive surge protection for use in the U.S. is the APC PF11VNT3. The risk of a spike from these sources depends upon the locality and is generally low, but a surge that does occur can damage a drive.

In locations susceptible to brownouts or other power quality issues, a power conditioner or voltage regulator can be used. Some PSUs may be designed to work with variable or reduced voltages.

A power outage can also damage a drive. In locations that are particularly susceptible to power outages, a UPS can be used, at least for desktops. This is because desktops, unlike laptops, do not contain a battery to power them.

An external drive that obtains its power from the computer, such as via USB or FireWire, should not be unplugged until it has been prepared for removal and has stopped spinning.

Electrostatic protection

The circuitry on a drive is sensitive to electrostatic discharge. It is therefore essential to ground oneself before touching a drive. In addition, antistatic devices such as an antistatic agent and an antistatic wrist strap can be used. A drive should be enclosed in an antistatic bag or antistatic bubble wrap if it is to be stored. The risk of an electrostatic discharge is higher when the humidity is low. In practice, however, electrostatic discharge is generally not a common cause of drive failure.

Cabling

PATA drives use a Molex connector for supplying power to the drive. This connector is keyed, so it cannot normally be inserted the wrong way round. A badly designed connector on a power cable, however, may not be keyed properly, which creates an opportunity for it to be inserted incorrectly. Incorrect insertion swaps the +12 V (yellow) and +5 V (red) connections. This causes an overvoltage in the drive which can damage it. Cables with such connectors, and PSUs that have such cables, should therefore never be used. Because SATA drives use a different type of power connector, they do not carry this risk.

An incorrect power cord or adapter plugged into an external drive can supply the wrong voltage or current, possibly damaging the drive. This is a risk when multiple power cords and adapters for different devices are present in the vicinity of the external drive. Power cords can be labeled if necessary to ensure that only the correct cord is used.

Stress control

Fragmentation control

File system fragmentation increases disk head movement, thus possibly decreasing the life of a drive. It is therefore possible that defragmentation increases the lifespan of the drive by minimizing its head movement and simplifying data access operations.

Additionally, file systems such as NTFS and most Unix and Linux file systems are designed to decrease the likelihood of fragmentation. This is an added reason for preferring NTFS over FAT32 under Windows. In Windows 2000 and later, a FAT32 file system can be converted to NTFS using the convert.exe tool.

Routine defragmentation of solid-state drives is not recommended, as it may reduce the drive's lifespan.[2]

Power cycling control

Shutting down and rebooting a computer or resuming it from hibernation cycles the power to the drives in the computer. The spin-up operation performed by a drive after a power cycle is believed to place more stress on the drive than running the drive continuously for a long period of time.

Based on the professional experience of system administrators, it is believed that there is a direct relationship between the number of power cycles of a computer and the probability of failure of its drives. In other words, a computer with a high uptime may have a lower probability of drive failure than one that has its power cycled routinely.

Detecting impending drive failure

Diagnosis and repair

Operating system tools such as chkdsk on Windows and fsck on Unix can be used routinely, perhaps once every three months, to check the integrity of the file system used on the drive and to repair errors where possible. Third-party scanning tools are available as well. In addition to routine scans, a scan should also be run immediately if problems are experienced when working with files on the drive. Typical examples of such problems are a hang or a CRC error when moving files.

A diagnostic check can also include a bad sector scan. While running a bad sector scan for a large drive takes several hours, it is recommended. The presence of several or an increased number of bad sectors on a drive can be indicative of poor drive health. Such a drive can be replaced to avoid risking further loss of data.

S.M.A.R.T.

S.M.A.R.T. reliability data can be queried from drives using various S.M.A.R.T. tools. This data can be used as an estimate of drive health. Based on the data, if the software reports the drive health as being unacceptably low, the drive can be preemptively replaced.

Software applications exist to automatically monitor S.M.A.R.T. data based on a schedule. The application can then alert the user if a minimum reliability threshold is crossed. Such an application may be preferred over one that only manually queries the S.M.A.R.T. data. One such free software application for Windows is PassMark DiskCheckup.

Software also exists to interpret S.M.A.R.T. data and assign a numerical percentage value to a drive's health. One such application is SpeedFan when used in conjunction with its online analysis feature.

As with temperature data, it is possible that the S.M.A.R.T. data provided by a drive is not readable for various reasons. In particular, S.M.A.R.T. data is not readable from the majority of drives connected externally via USB or FireWire. This appears to be because the protocol bridge between the external interface and the ATA protocol does not pass S.M.A.R.T. data through.

Relevant parameters

While S.M.A.R.T. has several parameters, a subset of these parameters has a large impact on failure probability. These parameters are scan errors, reallocation counts, offline reallocation counts, and probational counts.[1] The critical threshold for each of these four parameters is one.[1]

Parameter                     Increase in likelihood of failure within 60 days
                              of reaching the critical threshold of one[1]
Scan errors                   39 times*
Reallocation counts           14 times
Offline reallocation counts   21 times
Probational counts            16 times

*A scan error in a young drive increases its probability of failure more dramatically than it does for an older drive. While drives with just one scan error are more likely to fail than those with none, drives with multiple scan errors fail even more quickly.[1]

Unfortunately, it is unlikely that S.M.A.R.T. data by itself can be used to develop an effective predictive model of individual drive failures. This is because a significant percentage of drives that fail have no S.M.A.R.T. errors whatsoever.[1]
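
As an illustration, the check against the four parameters can be sketched in Python. This is a hedged example only: it assumes the counts have already been read from the drive by some S.M.A.R.T. tool, and the dictionary keys and function names are hypothetical names chosen for this sketch, not S.M.A.R.T. attribute identifiers. The multipliers are those reported in [1].

```python
# Relative increase in 60-day failure likelihood reported in [1] once a
# parameter reaches its critical threshold of one. The keys are hypothetical
# names chosen for this sketch, not S.M.A.R.T. attribute IDs.
RISK_MULTIPLIER = {
    "scan_errors": 39,
    "reallocation_count": 14,
    "offline_reallocation_count": 21,
    "probational_count": 16,
}


def flag_risky_parameters(counts):
    """Return {parameter: multiplier} for tracked parameters with a count >= 1."""
    return {
        name: factor
        for name, factor in RISK_MULTIPLIER.items()
        if counts.get(name, 0) >= 1
    }


def should_consider_replacement(counts):
    """True if any of the four parameters has reached its threshold."""
    return bool(flag_risky_parameters(counts))
```

An empty result should not be read as a clean bill of health, since many drives fail without reporting any S.M.A.R.T. errors.[1]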

System event logs

Figure: A sample disk warning event as shown by Event Viewer

The operating system logs system events. Of particular interest are system events triggered by a disk or a disk controller. Only events logged as errors or warnings are of concern, and not those that are logged solely for informational purposes. Under Windows, events can be viewed using the built-in Event Viewer application. Under other operating systems, other applications may be available for viewing event logs.

The system event log can be monitored for the presence of disk related errors and warnings. If any such events are logged, they can be checked to see which drive or device they pertain to. If similar events are suddenly logged by multiple drives in a short span of time, it is more likely that the problem is with a common controller card or motherboard component than with the individual drives.
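
The heuristic of suspecting a shared component when several drives log events close together can be sketched as follows. This Python example assumes the events have already been extracted from the log as (timestamp, drive name) pairs. The function name, the ten-minute window, and the two-drive minimum are arbitrary assumptions made for illustration.

```python
from datetime import datetime, timedelta


def likely_shared_fault(events, window=timedelta(minutes=10), min_drives=2):
    """Return True if at least `min_drives` distinct drives logged an event
    within any single time window.

    `events` is an iterable of (timestamp, drive_name) pairs.
    """
    ordered = sorted(events)
    for i, (start, _) in enumerate(ordered):
        # Collect the drives seen from this event up to `window` later.
        drives = {drive for ts, drive in ordered[i:] if ts - start <= window}
        if len(drives) >= min_drives:
            return True
    return False
```

A result of True only suggests where to look first, such as a common controller card or cable. It does not rule out a fault in the individual drives.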

Depending upon the event and its frequency, if the problem is with a drive, diagnostic software can be run. If the event continues to occur, it can be a sign of an impending drive failure. The relevant device can be replaced if the error persists.

Data redundancy

RAID

While RAID can be used to reduce the risk of data loss due to drive failure, it costs more to have the same amount of storage capacity available this way, and requires some amount of technical planning and expertise.

RAID 0 should not be used for the operating system because it does not provide any data redundancy and it has a greatly increased probability of failure resulting in data loss. RAID 6 is recommended over RAID 5 for increased redundancy.

Backups

Having a backup is an obvious way to reduce the risk of data loss due to drive failure. For many users, however, having a backup of all their data can be entirely impractical when they have large and increasing amounts of data. Nonetheless, it is critical to back up at least the data that is most important, such as the home directory.

A backup of the files on a hard disk should not be kept on the same disk, but on a different disk or in another location.

Parchives

A parchive can be created and stored for sets of important files. This will allow those important files to be recovered if they later become corrupted. A parchive applies particularly to files that do not get modified, e.g. digital media. If any file in the set of source files is modified, the parchive will have to be regenerated.

A parchive can also be integrated with a backup to ensure a robust backup.

Alternatively, file verification using a checksum can be used for verifying the integrity of files. SFV is an adequate file format for this purpose. In contrast with a parchive, file verification only allows file corruption to be detected; it does not allow a corrupted file to be repaired.
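
To make the checksum approach concrete, the following Python sketch computes CRC-32 values and checks files against an SFV file. It assumes the common SFV layout of one entry per line (a file name followed by eight hexadecimal CRC-32 digits, with comment lines starting with a semicolon). The function names are hypothetical, and malformed lines are not handled.

```python
import zlib
from pathlib import Path


def crc32_of_file(path, chunk_size=1 << 20):
    """Compute the CRC-32 of a file, reading it in chunks."""
    crc = 0
    with open(path, "rb") as handle:
        while chunk := handle.read(chunk_size):
            crc = zlib.crc32(chunk, crc)
    return crc & 0xFFFFFFFF


def sfv_line(path):
    """Format one SFV entry: file name, a space, then 8 hex digits."""
    return f"{Path(path).name} {crc32_of_file(path):08X}"


def verify_sfv(sfv_path):
    """Return {file name: True/False} for each entry in an SFV file.

    Files are looked up relative to the SFV file's directory; a missing
    file counts as a failed check.
    """
    base = Path(sfv_path).parent
    results = {}
    for line in Path(sfv_path).read_text().splitlines():
        line = line.strip()
        if not line or line.startswith(";"):  # ';' starts a comment line
            continue
        name, expected = line.rsplit(None, 1)
        target = base / name
        results[name] = target.is_file() and crc32_of_file(target) == int(expected, 16)
    return results
```

A False entry indicates only that the file is missing or no longer matches its recorded checksum. Recovering the original contents would still require a backup or a parchive.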

Sharing

Only wimps use tape backup: _real_ men just upload their important stuff on ftp, and let the rest of the world mirror it ;) — Linus Torvalds

Sharing is the ultimate means of preventing data loss. Everything that may be of common interest to others can be shared. This includes the burgeoning data individual users acquire from Usenet and file sharing networks. The philosophy behind sharing is that the shared content will be available for download from others when it is needed back in the event of data loss.

On Usenet, sharing implies posting files and also filling requests when possible. In file sharing networks it means uploading as much as possible of what is downloaded. On university campuses, much can be shared on fast networks such as Direct Connect. To encourage users to download files in such a network, it helps to have files named correctly and categorized in a suitable directory structure. Large amounts of data can also be shared with trusted friends using a designated external hard disk drive.

Continuous sharing puts a continuous strain on the drives containing the shared data. While there is a chance that this can reduce the life of the drive, a risk-benefit analysis is not clear-cut.

Data reduction

Reducing the amount of data one has can reduce the number of hard disk drives needed to store that data and its backups. With fewer drives in use, the probability that at least one of them fails in a given period of time is lower. Having less data also makes it easier to safeguard it using redundancy methods.
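
The effect of the number of drives can be made concrete with a short calculation. The Python sketch below assumes that drives fail independently and that each has the same probability of failing in the period, which is a simplification. The function name and the 5% figure used afterwards are hypothetical.

```python
def probability_any_failure(per_drive_probability, drive_count):
    """P(at least one of n drives fails) = 1 - (1 - p)**n, assuming independence."""
    if not 0.0 <= per_drive_probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    if drive_count < 0:
        raise ValueError("drive count cannot be negative")
    return 1.0 - (1.0 - per_drive_probability) ** drive_count
```

With a hypothetical 5% chance of failure per drive per year, four drives give a probability of about 18.5% that at least one fails within the year, whereas two drives give about 9.8%.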

Routine cleanup

Routinely going through one's files and deleting those which are no longer useful is a basic way of reducing one's data. Having a reduced number of files can also make it easier to find those for which there may be a need. The same applies to uninstalling applications which are no longer required. This can perhaps be done once every three months.

Data compression

Data compression software can be used to compress files or sets of files so they occupy less space. One such utility is 7-Zip. This applies particularly to files that do not get modified and are likely to benefit from compression, e.g. archived sets of documents and spreadsheets. In contrast, depending upon the specific format, digital media files are often highly compressed as it is, and are not likely to be compressed much further.
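
The difference between compressible and already-compressed data can be demonstrated with a small experiment. This Python sketch uses the standard zlib module (DEFLATE) rather than 7-Zip, and it uses random bytes as a stand-in for already-compressed media. Both are assumptions made to keep the example self-contained.

```python
import os
import zlib


def compression_ratio(data):
    """Compressed size divided by original size (lower is better)."""
    if not data:
        raise ValueError("data must not be empty")
    return len(zlib.compress(data, 9)) / len(data)


text_like = b"quarter,region,revenue\n" * 5000  # repetitive, document-like data
media_like = os.urandom(len(text_like))         # stand-in for already-compressed media
```

Calling `compression_ratio(text_like)` gives a ratio close to zero, while `compression_ratio(media_like)` gives a ratio of roughly one, meaning that compression saves essentially nothing.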

Besides the space savings, an additional benefit of keeping files compressed is that they will back up faster than if they were uncompressed. This does not apply to NTFS compression, which actually slows backups.

Drive life cycle management

Drive selection

Some brands of drives are more reliable than others. While reliability data for particular models can be hard to come by, various factors can be used to estimate a drive's reliability. These factors include product ratings, and some suitability and physical attributes.

Product rating

Drives with relatively higher user ratings and good reviews should be preferred. Drives with relatively lower ratings should not be purchased unless they are to be used in a RAID environment. The reliability of a model with fewer than five user ratings, as is the case with brand-new models, is harder to estimate.

Newegg is one of the websites that provides user ratings and reviews for many drives. Google Product Search provides an aggregation of user ratings and reviews from various websites.

Suitability attributes

Drive class

Enterprise class drives are advertised as having slightly higher reliability than standard desktop class drives, but of course they cost more.

Error recovery mechanism

Hard drives can come with a built-in recovery mechanism which attempts a repair if an error occurs. This recovery cycle attempts to recover data from the problematic area, and then reallocates a dedicated area to replace the problematic area.[3] This process can take up to two minutes depending on the severity of the issue.[3]

Drives meant to be used in a RAID environment must have a feature which prevents them from entering a long recovery cycle, failing which the RAID controller can drop the drive from the array. This feature is known as Time-Limited Error Recovery (TLER)[3] by Western Digital, Error Recovery Control (ERC) by Seagate, and Command Completion Time Limit (CCTL) by Samsung and Hitachi.

Desktop drives that can enter a long recovery cycle should therefore not be used in RAID environments, although drives with TLER[3] / ERC / CCTL can be used in non-RAID environments.

Physical attributes

Number of heads

There exists a strong positive correlation between the number of heads in a drive and its failure rate.[4] When choosing between two drives of equal capacity and speed, the one with fewer heads is therefore preferred. This point, however, may not be useful because drives with similar features may tend to have the same number of heads.

Burn-in

A drive has a higher chance than usual of failure in its first few months of use. This increased rate is due to assembly, configuration, or component-level problems. If a drive is susceptible to failing due to such a problem, it is beneficial to detect the problem before the drive is put into use.

To aid with this, new drives can first be put through a short burn-in process using special software. This process performs read and write stress tests on the drive. It thus aims to catch problems in the drive that may lead to its early failure. Care must be taken to ensure that a drive does not overheat during a burn-in. One commercial software application for both Windows and Linux that performs this and other burn-in tests is PassMark BurnInTest.

S.M.A.R.T. reliability data can be queried before and after the burn-in. If a new error is found after the burn-in, it can be indicative of the drive being susceptible to an early failure.
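
The before-and-after comparison can be sketched in Python as follows. The example assumes the error counters have already been read by some S.M.A.R.T. tool into dictionaries. The function name and the attribute names used in the test data are hypothetical. Only error counters should be compared in this way, since values such as power-on hours increase during any burn-in.

```python
def new_errors_after_burn_in(before, after):
    """Return {attribute: (before, after)} for every counter that increased.

    `before` and `after` map attribute names to raw error counts;
    attributes missing from `before` are treated as 0.
    """
    return {
        name: (before.get(name, 0), value)
        for name, value in after.items()
        if value > before.get(name, 0)
    }
```

An empty result means that no tracked counter increased during the burn-in.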

Routine drive upgrades

Planned obsolescence is usually something imposed by a company selling a product; in this case, it is initiated by the consumer. Older, smaller drives can routinely be replaced by newer, larger ones. In addition to the increased storage capacity that becomes available, because the older drive is replaced well before its life runs out, the risk of loss of the data contained in that drive is reduced. This is particularly applicable to consumers who require increasing amounts of storage, as they benefit most from the increased storage capacity.

Drives can be replaced based on their features, age, or their fitness as determined by S.M.A.R.T. parameters.

Additional measures

Antivirus protection

A particularly malicious or buggy virus can cause data loss, typically without any resultant physical damage to the underlying drive. Additionally, in theory, a virus can stress a particular portion of a drive to the point that the drive has an error. Antivirus software and other measures exist to combat these and other threats posed by viruses.

Revision control

Revision control is an essential technique for avoiding data loss. In particular, it guards against data loss caused by a user.

It is useful only for files that may be modified, such as documents, spreadsheets, logos, and source code. As such, revision control contrasts with parchives, which apply to files that do not get modified.

Having a backup does not obviate revision control. This is because revision control by and large ensures that a file can be reverted to any of its older versions, whereas a backup does not offer this guarantee.

Appendixes

Managing backups and revisions

If multiple generations of backups or revisions are stored, as they usually are, older generations must eventually be deleted due to storage capacity limitations.

FIFO approach

A simple approach to deleting past generations would be to always delete the oldest generations until there is sufficient capacity for the upcoming generations. This is analogous to FIFO. This approach, however, is naive and can lead to data loss. To understand why, consider a file in which an error is introduced. Several generations of backups and revisions have since occurred. The error is then detected. At this time, it would be pointless to have all of the most recent generations because all of them have the error. It would instead be beneficial to have at least one of the older generations, as it would not have the error.
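
The failure mode can be illustrated with a small Python sketch. Generations are represented simply by the day number on which they were made. The function names and the figures used afterwards are hypothetical.

```python
def fifo_retained(generation_days, capacity):
    """Keep only the `capacity` most recent generations (oldest deleted first)."""
    return sorted(generation_days)[-capacity:] if capacity > 0 else []


def has_clean_generation(retained_days, error_day):
    """True if at least one retained generation predates the error."""
    return any(day < error_day for day in retained_days)
```

With daily generations on days 1 to 30 and room for seven of them, only days 24 to 30 are retained. If the error was introduced on day 20 and detected on day 30, `has_clean_generation` returns False, as every retained generation contains the error.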

Weighted random approach

A better approach is to keep generations distributed across all points in time. One way to achieve this goal is to delete past generations (except the first and the last generation) when necessary in a weighted-random fashion. For each desired deletion, the weight assigned to each of the past generations signifies the probability of it being deleted. One acceptable weight is a constant exponent (possibly the square) of the multiplicative inverse of the duration (possibly expressed in the number of days) between the date of the generation and the generation available before it.

Using a larger exponent leads to a more uniform distribution of generations, whereas a smaller exponent leads to a distribution with more recent and fewer older generations. While a proof is not offered for this assertion, empirical results suggest it to be true. This technique thus ensures that past generations are always distributed across all points in time as desired.
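
One possible reading of this scheme is sketched below in Python. Generations are represented by day numbers, and the weight of each interior generation is the inverse of its gap to the preceding generation, raised to the chosen exponent. The function names are hypothetical, and recomputing the gaps after every deletion is an assumption of this sketch rather than something the description specifies.

```python
import random


def deletion_weights(days, exponent=2.0):
    """Weight for each interior generation: (1 / gap to previous generation) ** exponent.

    `days` is a sorted list of distinct day numbers, one per generation.
    The first and last generations are never candidates, so they get no weight.
    """
    return [(1.0 / (days[i] - days[i - 1])) ** exponent for i in range(1, len(days) - 1)]


def thin_generations(days, keep, exponent=2.0, rng=None):
    """Delete interior generations at random, by weight, until `keep` remain."""
    rng = rng or random.Random()
    days = sorted(set(days))
    keep = max(keep, 2)  # the first and last generations are always kept
    while len(days) > keep:
        weights = deletion_weights(days, exponent)
        # choices() picks an index with probability proportional to its weight.
        victim = rng.choices(range(1, len(days) - 1), weights=weights, k=1)[0]
        del days[victim]  # gaps are recomputed on the next pass
    return days
```

For example, `thin_generations(range(1, 31), 10)` returns ten day numbers that always include day 1 and day 30. Generations that closely follow their predecessor are the most likely to be removed, which is what spreads the survivors over time.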

References

  1. Eduardo Pinheiro, Wolf-Dietrich Weber and Luiz André Barroso (February 2007). "Failure Trends in a Large Disk Drive Population" in 5th USENIX Conference on File and Storage Technologies (FAST 2007). Retrieved on 2008-09-15.
  2. OCZ Vertex Series SATA II 2.5" SSD. OCZ Technology. Retrieved on 2009-02-24. “Solid State Drives do not require defragmentation. It may decrease the lifespan of the drive.”
  3. What is the difference between Desktop edition and RAID (Enterprise) edition hard drives?. Western Digital: Knowledge Base: Frequently Asked Questions. Western Digital. Retrieved on 2008-12-11. “If you install and use a desktop edition hard drive connected to a RAID controller, the drive may not work correctly unless jointly qualified by an enterprise OEM. This is caused by the normal error recovery procedure that a desktop edition hard drive uses.

    When an error is found on a desktop edition hard drive, the drive will enter into a deep recovery cycle to attempt to repair the error, recover the data from the problematic area, and then reallocate a dedicated area to replace the problematic area. This process can take up to 2 minutes depending on the severity of the issue. Most RAID controllers allow a very short amount of time for a hard drive to recover from an error. If a hard drive takes too long to complete this process, the drive will be dropped from the RAID array. Most RAID controllers allow from 7 to 15 seconds for error recovery before dropping a hard drive from an array. Western Digital does not recommend installing desktop edition hard drives in an enterprise environment (on a RAID controller).

    Western Digital RAID edition hard drives have a feature called TLER (Time Limited Error Recovery) which stops the hard drive from entering into a deep recovery cycle. The hard drive will only spend 7 seconds to attempt to recover. This means that the hard drive will not be dropped from a RAID array. Though TLER is designed for RAID environments, it is fully compatible and will not be detrimental when used in non-RAID environments.”

  4. Jon G. Elerath and Sandeep Shah (January 2003). "Disk drive reliability case study: Dependence upon fly-height and quantity of heads" in Annual Symposium on Reliability and Maintainability. Proceedings of the Annual Symposium on Reliability and Maintainability: 608–612. 
