Re: Failing disk advice




On 2017-03-05 at 16:02, Gregory Seidman wrote:

> I have a disk that is reporting SMART errors. It is an active disk in
> a (kernel, not hardware) RAID1 configuration. I also have a hot spare
> in the RAID1, and md hasn't decided it should fail the disk and
> switch to the hot spare. Should I proactively tell md to fail the
> disk (and let the hot spare take over), or should I just wait until
> md notices a problem?

So, you're saying you have a two-disk RAID-1 array with a third disk as
hot spare?

Under those circumstances, I would be inclined to leave it alone until
either md fails the one disk out or I start noticing visible symptoms,
but I'm not an expert and I'm not sure what the best practice is.
Certainly the paranoid, better-be-safe-than-sorry approach would be to
fail it out, let the hot spare take over, then swap in a cold spare as
the new hot spare.
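
For reference, the "fail it out, then add a new spare" sequence can be
sketched as below. This is only an illustration, not a tested procedure:
the device names (/dev/md0, /dev/sdb1, /dev/sdd1) and the helper name
md_fail_commands are hypothetical, and the helper only builds the mdadm
command lines as strings rather than running anything. It relies on
mdadm's manage-mode --fail, --remove and --add options.

```python
import shlex


def md_fail_commands(array, failing, new_spare=None):
    """Build mdadm command lines to fail and remove one array member.

    All arguments are device paths supplied by the caller; nothing is
    executed here, the commands are only returned as strings.
    """
    cmds = [
        # Mark the suspect member faulty; md then rebuilds onto the hot spare.
        ["mdadm", "--manage", array, "--fail", failing],
        # Detach the faulty member so the drive can be pulled.
        ["mdadm", "--manage", array, "--remove", failing],
    ]
    if new_spare is not None:
        # A device added to an array with no missing members becomes a spare.
        cmds.append(["mdadm", "--manage", array, "--add", new_spare])
    return [shlex.join(c) for c in cmds]


if __name__ == "__main__":
    # Hypothetical device names, for illustration only.
    for line in md_fail_commands("/dev/md0", "/dev/sdb1", "/dev/sdd1"):
        print(line)
```

Printing the commands instead of running them leaves room to check the
device names against the actual array before anything is changed.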


For my own main array (RAID-5 with no hot spares), I don't necessarily
replace the disk as soon as I start noticing SMART errors - but I do
start monitoring the situation more closely, and as soon as I start to
see other indications (most prominently read- or write-related notices
from dmesg), I arrange for a replacement, fail out the disk, and swap in
the new one. (The only reason I have no hot spare is that there are no
unused SATA ports in the system.)

I initially expected that I would never need to fail a disk out manually.
However, the last time I saw drive errors, md did not automatically fail
the disk out of the array, even though its read problems were so severe
that the entire UI hung for 15 to 60 seconds on any read attempt against
the bad portions of the disk.
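
One way to check whether md has actually marked a member as failed is to
look at the per-array line in /proc/mdstat, where faulty members carry an
(F) suffix and spares an (S) suffix. The following is a minimal sketch
under that assumption; the function name parse_mdstat_members and the
sample line are made up for illustration and are not output from my
system.

```python
import re

# Matches member entries such as "sda1[0]", "sdb1[1](F)" or "sdc1[2](S)".
MEMBER_RE = re.compile(r"([\w-]+)\[(\d+)\](?:\((\w)\))?")


def parse_mdstat_members(line):
    """Map each member device on one md status line to its state.

    Returns a dict of device name -> 'active', 'faulty', 'spare', or
    'other' (for any flag this sketch does not interpret).
    """
    states = {"": "active", "F": "faulty", "S": "spare"}
    members = {}
    for name, _slot, flag in MEMBER_RE.findall(line):
        # findall yields '' when the optional flag group did not match.
        members[name] = states.get(flag, "other")
    return members


if __name__ == "__main__":
    # Illustrative line only; real input would come from /proc/mdstat.
    sample = "md0 : active raid1 sdc1[2](S) sdb1[1](F) sda1[0]"
    for dev, state in parse_mdstat_members(sample).items():
        print(dev, state)
```

If a disk is visibly misbehaving but still shows up as active here, md
has not failed it, and failing it manually is the remaining option.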

-- 
   The Wanderer

The reasonable man adapts himself to the world; the unreasonable one
persists in trying to adapt the world to himself. Therefore all
progress depends on the unreasonable man.         -- George Bernard Shaw
