Monday, March 15, 2010

Western Digital "green" drives' Load_Cycle_Count on Linux

Western Digital claim their "green" hard disk drives are very power efficient. One of the methods they use to reduce power consumption is to aggressively park, or "unload", the disk heads. Unloading the heads reduces drag over the platters and, with it, the power draw.

About 2 years ago I bought 4 of these drives for a media server. Using Linux software RAID5, I set them up as a 3TB RAID array running XFS:

root@orbit:~# for dev in /dev/sd[a-d]; do smartctl -a $dev | grep Device\ Model; done;
Device Model: WDC WD10EACS-00C7B0
Device Model: WDC WD10EACS-00C7B0
Device Model: WDC WD10EACS-00C7B0
Device Model: WDC WD10EACS-00C7B0
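
For reference, an array like this would typically have been created along the following lines. This is a sketch only, not the original setup commands; the partition names are illustrative, since these drives also carry the swap partitions shown later:

# Illustrative only: build a 4-disk software RAID5 array and format it with XFS.
mdadm --create /dev/md0 --level=5 --raid-devices=4 /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1
mkfs.xfs /dev/md0
mount /dev/md0 /srv/media    # mount point is an assumption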

Several months later a forum post drew my attention to the fact that some WD hard disks load/unload their heads very frequently. A quick check of the SMART data using smartctl (sudo apt-get install smartmontools) revealed the issue:

root@orbit:/# for dev in /dev/sd[a-d]; do smartctl -a $dev | grep Load_Cycle_Count; done;
193 Load_Cycle_Count 0x0032 120 120 000 Old_age Always - 240530
193 Load_Cycle_Count 0x0032 122 122 000 Old_age Always - 236124
193 Load_Cycle_Count 0x0032 119 119 000 Old_age Always - 244179
193 Load_Cycle_Count 0x0032 119 119 000 Old_age Always - 244882

root@orbit:/# for dev in /dev/sd[a-d]; do smartctl -a $dev | grep Power_On_Hours; done;
9 Power_On_Hours 0x0032 080 080 000 Old_age Always - 7306
9 Power_On_Hours 0x0032 080 080 000 Old_age Always - 7306
9 Power_On_Hours 0x0032 080 080 000 Old_age Always - 7306
9 Power_On_Hours 0x0032 080 080 000 Old_age Always - 7306

These WD10EACS drives are rated for only 300,000 load cycles in their lifetime. After only 10 months of use they had averaged 33 load cycles per hour... and used up 80% of their rated lifetime maximum!
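
Those figures come straight from dividing the SMART counters above. A throwaway shell calculation along these lines reproduces them (the 300,000 figure is WD's published load/unload rating for these drives; this snippet is not from the original post):

# Rough load cycles per hour and percentage of the 300,000-cycle rating for one drive.
dev=/dev/sda
lcc=$(smartctl -a $dev | awk '/Load_Cycle_Count/ {print $10}')
poh=$(smartctl -a $dev | awk '/Power_On_Hours/ {print $10}')
echo "$dev: $((lcc / poh)) cycles/hour, $((100 * lcc / 300000))% of rating"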

The drives can be switched into a low-noise mode using hdparm's "Automatic Acoustic Management". This slows down the head movements, making the drives quieter and potentially longer-lived, at the cost of slightly slower seeks:

root@orbit:~# for dev in /dev/sd[a-d]; do hdparm -M128 $dev; done;

In this mode the only noise I can hear from the WD10EACS drives is the actual clicking of the load cycles.
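
The -M setting does not survive a reboot on its own. One way to re-apply it at boot, sketched below, assumes the machine still runs the traditional /etc/rc.local hook; Debian/Ubuntu's /etc/hdparm.conf can achieve the same thing:

# /etc/rc.local -- re-apply Automatic Acoustic Management before the final "exit 0"
for dev in /dev/sd[a-d]; do
    hdparm -M128 $dev
done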

Over several weeks of testing and experimenting I was unable to halt the rapidly increasing load cycle count. I was slightly concerned; RAID5 arrays can only survive a single disk failure and are not very safe if multiple drives are likely to fail at any moment. In an attempt to allow the drives to remain in their low-power state for longer, I bought a 64GB SSD and migrated the Ubuntu OS to it.

I checked the drives about 5 months later and was alarmed to find that the load cycle count had accelerated even further and was now at nearly double the manufacturer's design limit:

root@orbit:~# for dev in /dev/sd[a-d]; do smartctl -a $dev | grep Load_Cycle_Count; done;
193 Load_Cycle_Count 0x0032 001 001 000 Old_age Always - 611032
193 Load_Cycle_Count 0x0032 002 002 000 Old_age Always - 595279
193 Load_Cycle_Count 0x0032 001 001 000 Old_age Always - 604626
193 Load_Cycle_Count 0x0032 003 003 000 Old_age Always - 593278

The drives were now averaging 84 load cycles per hour. At this point I did more tests and more googling. One such test confirmed that no data was being read from or written to the drives /dev/sd[a-d], or to the RAID arrays /dev/md0 and /dev/md1 built on them, over a 60 second period:

root@orbit:~# iostat 60
Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn
sda 0.00 0.00 0.00 0 0
sdb 0.00 0.00 0.00 0 0
sdc 0.00 0.00 0.00 0 0
sdd 0.00 0.00 0.00 0 0
sde 6.90 0.93 96.00 56 5760
md0 0.00 0.00 0.00 0 0
md1 0.00 0.00 0.00 0 0

Here only the SSD /dev/sde was being accessed; the zeros (no data read or written) should have meant the drives stayed in low-power mode. Yet over a few minutes the drives still load cycled 13 times.

In the middle of this forum post I found the hint that identified the problem: querying the WD hard disks' SMART data brings the drives out of low-power mode. My proactive hard drive monitoring was in fact wearing the drives out faster! Since the initial load cycle issues were discovered I had started using hddtemp to monitor the hard drive temperatures via their SMART data and collectd to log the readings into RRD databases.
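
If you still want occasional SMART readings without waking sleeping drives, smartmontools has a --nocheck option that skips the query when a drive reports it is in standby, and hdparm can report the power state without spinning anything up. A quick sketch (assuming a smartmontools version new enough to support -n; not part of the original monitoring setup):

# Report the drive's current power state; CHECK POWER MODE does not wake the drive:
hdparm -C /dev/sda

# Only read the SMART attributes if the drive is not already in standby:
smartctl -n standby -a /dev/sda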

To confirm that hddtemp was causing the excessive load cycles, I stopped the daemon:

root@orbit:~# /etc/init.d/hddtemp stop
* Stopping disk temperature monitoring daemon hddtemp [ OK ]

root@orbit:~# tail /var/log/syslog
Nov 24 22:43:31 orbit collectd[3032]: hddtemp plugin: connect (127.0.0.1, 7634) failed: Connection refused
Nov 24 22:43:31 orbit collectd[3032]: hddtemp plugin: Could not connect to daemon.
Nov 24 22:43:31 orbit collectd[3032]: read-function of plugin `hddtemp' failed. Will suspend it for 10 seconds.

Monitoring the load cycles confirmed that this was part of the problem, and the drives were now staying in low-power mode for longer. The only viable solution at this point was to disable the hard disk temperature monitoring.
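
Making that permanent is straightforward; a sketch, assuming the stock Ubuntu init scripts and collectd configuration paths:

# Stop hddtemp from starting at boot (on older sysv-rc, "update-rc.d -f hddtemp remove" instead):
update-rc.d hddtemp disable

# And stop collectd complaining about the missing daemon by commenting out
# the "LoadPlugin hddtemp" line in /etc/collectd/collectd.conf, then:
/etc/init.d/collectd restart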

Another 4 months later and the load cycling had definitely slowed down:

root@orbit:~/# for dev in /dev/sd[a-d]; do smartctl -a $dev | grep Load_Cycle_Count; done;
193 Load_Cycle_Count 0x0032 001 001 000 Old_age Always - 713744
193 Load_Cycle_Count 0x0032 001 001 000 Old_age Always - 682175
193 Load_Cycle_Count 0x0032 001 001 000 Old_age Always - 691961
193 Load_Cycle_Count 0x0032 001 001 000 Old_age Always - 675499

Calculating (713744-611032)/(14807-11688), where the hour figures are the drives' Power_On_Hours at the two readings, gives 33 load cycles per hour. This is back at the rate the counters were advancing when the drives were first installed. It's better, but still a concern. All 4 drives are now at about 2.5 times the manufacturer's rating and appear to be working fine. Higher values have been reported in various forums, so I'm not panicking just yet.

Update June 2010:

To monitor the load cycles periodically, the counters can be output in CSV format:

root@orbit:~/# A=`date +\%s` && for dev in /dev/sd[a-d]; do A=$A,`smartctl -a $dev | grep Load_Cycle_Count | awk '{print $10}'`; done && echo $A
1277268248,816341,776081,786689,769881
root@orbit:~/# A=`date +\%s` && for dev in /dev/sd[a-d]; do A=$A,`smartctl -a $dev | grep Load_Cycle_Count | awk '{print $10}'`; done && echo $A
1277268315,816342,776082,786690,769882

The first number is the current Unix timestamp (change the '+%s' for a more human-readable format, e.g. '+%F %R' gives '2010-06-23 16:10'). The next 4 numbers are the load cycle counts for sd[a-d]. To record the values once every hour, on the hour, I added the following line to the end of the /etc/crontab file:

00 * * * * root A=`date +\%s` && for dev in /dev/sd[a-d]; do A=$A,`smartctl -a $dev | grep Load_Cycle_Count | awk '{print $10}'`; done && echo $A >> /var/log/hdd.csv

Note the backslash-escaping of the '%' in the date format: cron treats an unescaped '%' in a command as a newline. The CSV readings are appended to the file /var/log/hdd.csv every hour. Note that running this will itself take the drives out of low-power mode and increment the cycle counter... I can hear all 4 drives clicking when this job runs.
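
The hourly log also makes it easy to spot-check the current rate. For example, a rough cycles-per-hour figure for sda from the newest two samples (a quick awk sketch, not part of the original cron job):

# Cycles per hour for sda, from the last two lines of the CSV log:
tail -n 2 /var/log/hdd.csv | \
    awk -F, 'NR==1 {t=$1; c=$2} NR==2 {printf "%.1f cycles/hour\n", ($2-c)/(($1-t)/3600)}'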

Update July 2010:

Further investigation revealed that the swap partitions were causing the rest of the excessive load cycling:

root@orbit:~# cat /proc/swaps
Filename Type Size Used Priority
/dev/sda2 partition 979960 229844 0
/dev/sdb2 partition 979960 230644 0
/dev/sdc2 partition 979960 234002 0
/dev/sdd2 partition 979960 230882 0
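
The swap activity itself can be watched directly: vmstat's si/so columns show pages swapped in and out per second, so non-zero values while the array is otherwise idle point straight at swap. (This command is an illustration; it was not part of the original tests.)

# Watch swap-in (si) and swap-out (so) activity over 60-second intervals:
vmstat 60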

The pictures below say it all. The server was only being used lightly during this period:

The excessive load cycles were stopped by flushing the swap partitions:

swapoff -a && swapon -a
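
Flushing swap like this is only a temporary fix, since the kernel will gradually push pages back out. Two longer-term options are to make the kernel less eager to swap, or to move swap off these drives entirely (which is what eventually happened here: the comments below mention swap living on the SSD). A sketch of the first option, not something the original post itself tuned:

# Reduce the kernel's tendency to swap (the default is 60); takes effect immediately:
sysctl -w vm.swappiness=10

# Make it persistent across reboots:
echo "vm.swappiness=10" >> /etc/sysctl.conf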


4 comments:

  1. Great article, I was having the same problem. I ended up using WDIdle3 on another computer to disable head parking.

  2. I agree with Brian. I just learned about Load_Cycle_Count this morning and checked it on a drive I had been running for almost two years. The drive is a WD 2.5" that I use in my Linux router machine (it runs 24 hours a day) and it has always made the bzzz-tick noise. I RMA'd the first drive to Newegg right after I bought it because of the noise; when I got another drive that did the same thing I figured it was normal (but very annoying). This morning, using WDIdle3, I was able to disable the bzzz-tick, which is, apparently, the noise the drive makes when it parks or unparks the heads (a load cycle). I used smartctl to check my Load_Cycle_Count and it is almost 3 million! My drive was set up for load-cycle at 800ms, and it performed an average of 185 load cycles an hour over the last ~2 years! :( It appears to still be working and, hopefully, it will continue to work.

  3. I would love to know where you stand at the moment. I'm in the same place you are and have the drives at about 700k.

    This also might help http://forums.storagereview.com/index.php/topic/29253-newer-western-digital-hdd-head-parking-and-you/

  4. Sure, 2.5 years later and these 4 WD drives are still running fine. For the last 2 years I have had the main Ubuntu OS running from a separate SSD, including swap.

    Since I assumed these rarely used drives had been idling in low-power mode for the last 2 years, I was very surprised to see:

    # for dev in /dev/sd[b-e]; do smartctl -a $dev | grep Load_Cyc | awk '{print $10}'; done;
    2627999
    2604655
    2619279
    2670247

    About 2.6 million cycles each, or 8.6 times their rating! I now have the RAID5 array backed up to an external 3TB drive.
