Last Sunday I got hit by an unexpected hard drive failure (are hard drive failures ever expected?). The good thing is, I had most of the data backed up. The worst part is, it was the system drive, so the PC was effectively out of action.
The drive is a Seagate Barracuda 7200.11, also known as ST3500320AS. I had been using the PC normally the day before, and the disk was working as it had for the past 4 years: smooth and silent. Unfortunately, the next day my computer greeted me with
“NON SYSTEM DISK OR DISK ERROR”
Not good. I rebooted, this time closely watching the BIOS messages:
SATA Port P0: Port reset error!
Okay, so the BIOS can’t access the disk. It must be the cabling! I opened the case and carefully pushed in all the plugs on all the hard disks and the DVD-ROM drive.
No go. Still the same error.
Okay, let’s switch the cables. I plugged the affected disk into a different port on the motherboard. The error message changed a little bit:
SATA Port P3: Port reset error!
Right, the disk has failed. Damn! Let’s see what uncle Google has to say about this. I entered the disk model number… and… surprise! It’s a known firmware bug.
The disk I have is running firmware version SD15. As it turns out, Seagate had a “black series” of 7200.11 Barracudas, which had a firmware bug. The bug usually surfaced much earlier for other owners of the disk (one month up to a few months max). Mine worked for four years, but it eventually got hit by the bug as well.
It is worth noting that these Seagate drives store most of their internal calibration and configuration data on the platters, rather than in NVRAM, so replacing the PCB (which was my immediate idea) wouldn’t work. It seems that the bug is somehow related to this service data: when the drive is powered, it conducts some tests and then attempts to read the configuration information. And it hangs. Hence the bug is also known as “stuck in BSY” or simply BUSY bug.
But can it be fixed?
The answer is yes, it can be fixed. You will need a special serial console cable (a Nokia CA-42 cable can be adapted for this purpose), which lets you run diagnostic commands on the drive itself. An external USB-to-SATA interface with its own power supply will be handy as well. You also need a Torx T-6 screwdriver, because you will have to separate the PCB from the drive for a while.
With the drive connected to the CA-42 cable (see photo above for pin order), the result is as expected: the drive hangs with an error message shortly after it spins up. Of course, the console is inaccessible.
The detailed instructions are here: Fixing a Seagate 7200.11 drive. I copied them for reference in case the site above goes down:
You will need a terminal program on the computer to which you connect the RS232-to-TTL adapter. You can use HyperTerminal, which comes with Windows XP and earlier. I suggest using PuTTY, but any terminal program will do.
Configure your terminal program to use the serial port with the following settings:
Baud: 38400
Data Bits: 8
Stop Bits: 1
Parity: none
Flow Control: none
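If you would rather script the connection than click through a terminal program, the settings above map directly onto the constructor arguments of the third-party pyserial package. This is only a sketch and not part of the original guide: the helper name open_drive_console is made up for illustration, and the device names in the comments are placeholders for whatever your OS calls the adapter.

```python
# Serial settings from the guide, expressed as pyserial keyword arguments.
# pyserial is a third-party package (pip install pyserial); it is imported
# lazily so the settings can be inspected without it installed.
SERIAL_SETTINGS = {
    "baudrate": 38400,   # Baud 38400
    "bytesize": 8,       # Data Bits 8
    "stopbits": 1,       # Stop Bits 1
    "parity": "N",       # Parity none
    "xonxoff": False,    # Flow Control none (software)
    "rtscts": False,     # Flow Control none (hardware)
}


def open_drive_console(device, timeout=2.0):
    """Open the adapter's serial port (hypothetical helper).

    `device` is the name your OS gives the adapter, for example
    "COM3" on Windows or "/dev/ttyUSB0" on Linux (both placeholders).
    """
    import serial  # pyserial

    return serial.Serial(device, timeout=timeout, **SERIAL_SETTINGS)
```

A terminal program such as PuTTY does exactly the same job; the script form is only useful if you want to log the session or repeat it.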
Prepare the drive
Unscrew all the screws holding the green PCB on the drive. Place some cardstock between the PCB and the contacts for the drive head. Leave the drive motor contacts in place. Tighten the three screws closest to the motor contacts. Leave the other three screws loose.
Connect power to the drive and wait for it to spin up, then press CTRL+Z. You should see a prompt:
F3 T>
If you don’t see a prompt, you may have the TX & RX wires swapped. Swap them and try again.
Go to console access Level 2 (type /2):
F3 T>/2 (enter)
F3 2>
Wait about 20 seconds, then issue a command spinning down the motor:
F3 2>Z (enter)
Spin Down Complete
Elapsed Time 0.147 msecs
F3 2>
If you instead see a message similar to this:
LED: 000000CE FAddr: 00280D4D
then you entered the commands too quickly after supplying power to the drive. Cycle power, wait 20 seconds, then try again.
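If you capture the console output to a file or a script, the two outcomes of the spin-down command are easy to tell apart mechanically. The sketch below assumes the drive prints text shaped like the two samples in the guide; the function name and the return labels are hypothetical.

```python
import re

# Matches the firmware error line shown in the guide, e.g.
# "LED: 000000CE FAddr: 00280D4D".
LED_ERROR = re.compile(r"LED:\s*([0-9A-Fa-f]+)\s+FAddr:\s*([0-9A-Fa-f]+)")


def classify_spin_down(console_text):
    """Classify what the drive printed after the Z command.

    Returns "too_early" for the LED error (cycle power, wait, retry),
    "ok" when the spin-down confirmation is present, "unknown" otherwise.
    """
    if LED_ERROR.search(console_text):
        return "too_early"
    if "Spin Down Complete" in console_text:
        return "ok"
    return "unknown"
```

The LED check comes first on purpose: if both strings ever appear in the same capture, the error is the one that needs attention.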
Very carefully, remove the cardstock that you placed between the PCB and the drive head contacts. Carefully replace and tighten the 3 remaining screws.
Then start the motor:
F3 2>U (enter)
Spin Up Complete
Elapsed Time 7.093 secs
F3 2>
Next go to Level 1 (type /1):
F3 2>/1 (enter)
And do a S.M.A.R.T. erase (create S.M.A.R.T. sector):
F3 1>N1 (enter)
When the prompt comes back up, turn off power to the hard drive, wait a few seconds, then turn it back on. Wait about 20 seconds, then finally do partition regeneration:
F3 T>m0,2,2,0,0,0,0,22 (enter)
After 15-30 seconds, you should see something like:
Max Wr Retries = 00, Max Rd Retries = 00, Max ECC T-Level = 14, Max Certify Rewrite Retries = 00C8
User Partition Format 10% complete, Zone 00, Pass 00, LBA 00004339, ErrCode 00000080, Elapsed Time 0 mins 05 secs
User Partition Format Successful - Elapsed Time 0 mins 05 secs
Do not turn off the drive until you see the final “User Partition Format Successful” message.
Once it appears, the drive can be turned off.
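Because cutting power too early is the one mistake to avoid here, it can help to check the captured output for the success line rather than eyeballing it. A minimal sketch, assuming the message has exactly the wording shown in the guide; the function name is hypothetical.

```python
import re

# The final line from the guide's sample output.
SUCCESS = re.compile(
    r"User Partition Format Successful - Elapsed Time (\d+) mins (\d+) secs"
)


def format_elapsed_seconds(console_text):
    """Return the elapsed time in seconds if the success line is present.

    Returns None while only progress lines (or nothing) have been printed,
    which means the drive must stay powered.
    """
    match = SUCCESS.search(console_text)
    if match is None:
        return None
    minutes, seconds = (int(group) for group in match.groups())
    return minutes * 60 + seconds
```

A progress line such as "User Partition Format 10% complete, ..." does not match, so the function keeps returning None until the drive has really finished.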
Power down everything, place the drive back into your computer, and confirm that it’s working.
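The whole procedure interleaves typed commands with manual steps, and the order matters. The sketch below just restates the guide as a printable checklist; it deliberately does not talk to the drive. The STEPS structure and function names are made up for illustration, and the only fact added is that Ctrl+Z is the single byte 0x1A.

```python
# Each entry is (kind, text). "type" entries are typed at the console and
# followed by Enter, except Ctrl+Z, which is the single byte 0x1A.
STEPS = [
    ("manual", "Insulate the head contacts with cardstock, power the drive"),
    ("type", "\x1a"),
    ("type", "/2"),
    ("manual", "Wait about 20 seconds"),
    ("type", "Z"),
    ("manual", "Remove the cardstock and tighten the remaining screws"),
    ("type", "U"),
    ("type", "/1"),
    ("type", "N1"),
    ("manual", "Power-cycle the drive and wait about 20 seconds"),
    ("type", "m0,2,2,0,0,0,0,22"),
    ("manual", "Wait for 'User Partition Format Successful'"),
]


def typed_commands(steps=STEPS):
    """Return only the console commands, in the order the guide gives them."""
    return [text for kind, text in steps if kind == "type"]


def print_checklist(steps=STEPS):
    """Print a numbered checklist mixing typed commands and manual steps."""
    for number, (kind, text) in enumerate(steps, start=1):
        shown = "Ctrl+Z" if text == "\x1a" else text
        label = "TYPE" if kind == "type" else "DO  "
        print(f"{number:2d}. {label} {shown}")
```

Calling print_checklist() before you start gives you something to tick off while your hands are busy with the screwdriver.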
The disk has been put back into the PC and is working just like before, with all the data intact.
I immediately ran the Seagate firmware update for ST3500320AS to prevent the bug from appearing again. (Fun fact: it actually installs a tiny Linux environment first and runs the update from there. Worry not, your original OS will be restored once the update is complete.)
Do you still trust Seagate?
Yes, I still do. I have many reasons for that. First of all, I have had many hard drives from different vendors, and they all break down at roughly the same rate. Each of those vendors had some “black series” of drives which broke down more often than others. I also realize that every hard drive will give up eventually. That’s why you can expect hard drives in server disk arrays to fail. That’s why they are so easy to replace. That’s what RAID disk arrays are for. Hard disks are a sort of long-life consumable.
But the most important reason is a nearly 20-year-old hard drive from my first PC. Guess what? It is still in working order. Although its capacity is orders of magnitude smaller than that of current hard drives, and it is extremely slow and rather noisy, it still works. I need no further proof that Seagate makes decent hard drives 😉