Re: Failing disk advice

On Mon, Mar 06, 2017 at 12:17:03PM +0100, Mirko Parthey wrote:
> On Sun, Mar 05, 2017 at 08:38:27PM -0800, David Christensen wrote:
> > On 03/05/2017 01:02 PM, Gregory Seidman wrote:
> > >I have a disk that is reporting SMART errors. It is an active disk in
> > >a (kernel, not hardware) RAID1 configuration. I also have a hot spare
> > >in the RAID1, and md hasn't decided it should fail the disk and switch
> > >to the hot spare. Should I proactively tell md to fail the disk (and
> > >let the hot spare take over), or should I just wait until md notices a
> > >problem?
> > 
> > I'm confused by "I also have a hot spare in the RAID1".  Do you have a
> > two-member RAID1 with a hot spare, or a three-member RAID1?  I would
> > prefer the latter:
> > 
> > https://manpages.debian.org/jessie/mdadm/md.4.en.html
> 
> Refining this advice a bit, I would convert the spare to a full RAID
> member now, without explicitly failing the disk that reports SMART
> errors first.
> Assuming you have a two-member RAID1 with a hot spare, the command
> should be similar to this (untested):
>   mdadm -G /dev/mdX -n 3 
> This ensures you keep redundancy during further maintenance actions.

I was unaware that this was possible. I've run it and mdadm -D reports that
it is now in the "clean, degraded, rebuilding" state. Thank you! I wish I
had room in my system to add the fourth (which I've ordered) without
removing the failing disk, but I do not.
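
For anyone following along, here is a rough sketch of the sequence discussed above: promote the hot spare to an active member first, and only swap out the failing disk after the rebuild has finished. The device names in the usage note (/dev/md0, /dev/sdc1, /dev/sdd1) are placeholders and not taken from this thread, and the helper functions and the DRY_RUN switch are made up for illustration. By default the sketch only prints the mdadm commands rather than running them.

```shell
# Sketch only, untested against a real array. With DRY_RUN=1 (the default
# here) each command is printed instead of executed.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Turn a two-member RAID1 with a hot spare into a three-member RAID1.
# $1 = md device (placeholder, e.g. /dev/md0)
promote_spare() {
    run mdadm --grow "$1" --raid-devices=3
}

# After the rebuild is complete (check /proc/mdstat or mdadm -D), mark the
# failing member as failed, remove it, and add the replacement disk.
# $1 = md device, $2 = failing member, $3 = replacement (all placeholders)
replace_member() {
    run mdadm "$1" --fail "$2" --remove "$2"
    run mdadm "$1" --add "$3"
}
```

Calling `promote_spare /dev/md0` and later `replace_member /dev/md0 /dev/sdc1 /dev/sdd1` prints the commands that would run; setting DRY_RUN=0 executes them, which needs root and should only be done after checking the device names against mdadm -D.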

> Which SMART errors do you get, and who reports them?

I get emails sent to root:

	This message was generated by the smartd daemon running on:

	   host name:  XXXXXX
	   DNS domain: YYYYYY

	The following warning/error was logged by the smartd daemon:

	Device: /dev/sdc [SAT], 8 Currently unreadable (pending) sectors

	Device info:
	ST31500341AS, S/N:9VS43CV9, WWN:5-000c50-0208aa9a3, FW:CC1H, 1.50 TB

	For details see host's SYSLOG.

	You can also use the smartctl utility for further investigation.
	The original message about this issue was sent at Wed Dec 14 00:51:36 2016 EST
	Another message will be sent in 24 hours if the problem persists.

...and...

	This message was generated by the smartd daemon running on:

	   host name:  XXXXXX
	   DNS domain: YYYYYY

	The following warning/error was logged by the smartd daemon:

	Device: /dev/sdc [SAT], 8 Offline uncorrectable sectors

	Device info:
	ST31500341AS, S/N:9VS43CV9, WWN:5-000c50-0208aa9a3, FW:CC1H, 1.50 TB

	For details see host's SYSLOG.

	You can also use the smartctl utility for further investigation.
	The original message about this issue was sent at Wed Dec 14 00:51:37 2016 EST
	Another message will be sent in 24 hours if the problem persists.

(Yes, I know, I've been letting it do this since mid-December, which is not
great.)

> What is the output of the following command for the failing drive?
>   smartctl -A /dev/sdY

	# smartctl -A /dev/sdc  
	smartctl 6.4 2014-10-07 r4002 [i686-linux-3.16.0-4-686-pae] (local build)
	Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

	=== START OF READ SMART DATA SECTION ===
	SMART Attributes Data Structure revision number: 10
	Vendor Specific SMART Attributes with Thresholds:
	ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
	  1 Raw_Read_Error_Rate     0x000f   119   099   006    Pre-fail  Always       -       205161943
	  3 Spin_Up_Time            0x0003   100   091   000    Pre-fail  Always       -       0
	  4 Start_Stop_Count        0x0032   099   099   020    Old_age   Always       -       1055
	  5 Reallocated_Sector_Ct   0x0033   099   099   036    Pre-fail  Always       -       41
	  7 Seek_Error_Rate         0x000f   092   060   030    Pre-fail  Always       -       1743842168
	  9 Power_On_Hours          0x0032   039   039   000    Old_age   Always       -       53898
	 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
	 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       85
	184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
	187 Reported_Uncorrect      0x0032   097   097   000    Old_age   Always       -       3
	188 Command_Timeout         0x0032   100   098   000    Old_age   Always       -       133146017827
	189 High_Fly_Writes         0x003a   007   007   000    Old_age   Always       -       93
	190 Airflow_Temperature_Cel 0x0022   060   040   045    Old_age   Always   In_the_past 40 (Min/Max 26/45 #502)
	194 Temperature_Celsius     0x0022   040   060   000    Old_age   Always       -       40 (0 18 0 0 0)
	195 Hardware_ECC_Recovered  0x001a   038   023   000    Old_age   Always       -       205161943
	197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       8
	198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       8
	199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       1
	240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       53897 (15 186 0)
	241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       917595486
	242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       1262569510
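
To pick the sector-related counters out of a table like the one above, a small filter can help. This is a minimal sketch, assuming the ten-column layout that smartctl -A printed here (raw value in the last column); the function name flag_smart_attrs and the choice of attribute IDs (5, 187, 197, 198) are my own assumptions, not something smartctl provides.

```shell
# Read "smartctl -A" output on stdin and print ID, name and raw value for
# the reallocated / uncorrectable / pending sector attributes whose raw
# value is greater than zero.
flag_smart_attrs() {
    awk '$1 ~ /^(5|187|197|198)$/ && $10 + 0 > 0 { print $1, $2, $10 }'
}
```

Typical use would be `smartctl -A /dev/sdc | flag_smart_attrs`. A non-zero raw value in any of these is worth watching, though how vendors encode raw values varies, so treat the result as a hint rather than a verdict.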

> Regards,
> Mirko

Thanks for the help so far,
--Greg