About six months ago I had my first drive crash on my file server. I use Western Digital Green 1.5 TB disks, and they’re not the best disks for RAID, but for me it’s a matter of cost. I like cheap disks and so does ZFS. Anyway, yesterday it happened again. Or to be precise, the night before yesterday at about 04:00. When I woke up, op5 had sent me both e-mails and SMS about it, so I just had to shut down the file server (no hotswap, it’s cheaper) and replace the disk. Since I always assume the worst, I had a spare disk waiting in case of a crash. The RAIDZ started rebuilding at 09:00 and was done at around 05:00 today, 20 hours later. I know, it’s a looong rebuild time… but at least my data is intact.
Mar 6 04:32:58 titan ahci: [ID 693748 kern.warning] WARNING: ahci0: ahci port 3 task_file_status = 0x4041
Mar 6 04:32:58 titan ahci: [ID 657156 kern.warning] WARNING: ahci0: error recovery for port 3 succeed
Mar 6 04:33:01 titan ahci: [ID 296163 kern.warning] WARNING: ahci0: ahci port 3 has task file error
Mar 6 04:33:01 titan ahci: [ID 687168 kern.warning] WARNING: ahci0: ahci port 3 is trying to do error recovery
Mar 6 04:33:01 titan ahci: [ID 693748 kern.warning] WARNING: ahci0: ahci port 3 task_file_status = 0x4041
Mar 6 04:33:01 titan ahci: [ID 657156 kern.warning] WARNING: ahci0: error recovery for port 3 succeed
Mar 6 04:33:01 titan ahci: [ID 811322 kern.info] NOTICE: ahci0: ahci_tran_reset_dport port 3 reset device
Mar 6 04:33:04 titan ahci: [ID 296163 kern.warning] WARNING: ahci0: ahci port 3 has task file error
Mar 6 04:33:04 titan ahci: [ID 687168 kern.warning] WARNING: ahci0: ahci port 3 is trying to do error recovery
I got these messages over and over again until the disk finally crashed. Now it says this instead:
ZFS calls the rebuilding process resilvering, but it’s the same thing. The nice thing about ZFS, since it’s both the volume manager and the file system, is that it knows which data is live and doesn’t have to rebuild the entire disk. In this case I had 527 GB of data on each disk, which is about 1/3 of the disk. If this had been a hardware RAID, it would have had to rebuild the entire disk, which would have taken roughly three times as long. Talk about waste, rebuilding blocks that don’t really contain anything.
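For the curious, the "three times" figure can be checked with a few lines of Python. This is only a back-of-the-envelope sketch using the numbers from this post. It assumes decimal units (1.5 TB = 1500 GB) and, more importantly, that a full-disk rebuild would run at the same average rate as my resilver. That second assumption is generous to ZFS, since a hardware controller copying a disk sequentially may well move data faster per gigabyte. The function names are just made up for the example.

```python
# Figures from the post; decimal units (1 TB = 1000 GB) are an assumption.
DISK_SIZE_GB = 1500.0    # WD Green 1.5 TB
LIVE_DATA_GB = 527.0     # live data ZFS had to resilver onto the new disk
RESILVER_HOURS = 20.0    # 09:00 until 05:00 the next day


def resilver_rate_gb_per_hour(data_gb: float, hours: float) -> float:
    """Average amount of data rebuilt per hour."""
    return data_gb / hours


def full_disk_rebuild_hours(disk_gb: float, rate_gb_per_hour: float) -> float:
    """Hypothetical time to rebuild every block at the same average rate."""
    return disk_gb / rate_gb_per_hour


rate = resilver_rate_gb_per_hour(LIVE_DATA_GB, RESILVER_HOURS)
full_hours = full_disk_rebuild_hours(DISK_SIZE_GB, rate)

print(f"used fraction of disk: {LIVE_DATA_GB / DISK_SIZE_GB:.2f}")
print(f"average resilver rate: {rate:.2f} GB/h")
print(f"full-disk rebuild at that rate: {full_hours:.1f} h")
print(f"slowdown factor: {full_hours / RESILVER_HOURS:.2f}x")
```

Under those assumptions the factor comes out a little under three, so "three times" is a fair rounding of 1/3 of the disk being in use.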
