Re: Talking about RAID - disks with same id




On 11/09/17 13:04, deloptes wrote:
David Christensen wrote:
What RAID technology are you using?

Linux software raid - kernel is 4.12.10

Most people call it 'mdadm', after the command-line tool. I am running the same, but on Debian "stable":

2017-11-09 14:00:32 root@dipsy ~
# dpkg-query --show mdadm
mdadm	3.4-4+b1

2017-11-09 14:00:40 root@dipsy ~
# cat /etc/debian_version
9.2

2017-11-09 14:01:00 root@dipsy ~
# uname -a
Linux dipsy 4.9.0-4-amd64 #1 SMP Debian 4.9.51-1 (2017-09-28) x86_64 GNU/Linux



Take a look at:

# smartctl --xall /dev/sdg

This is nothing spectacular - see attachment.

I'll comment on the information I think I understand...


> Device Model:     ST3500630AS

I run eight ST31500341AS drives, which I believe are of the same vintage. They all seem good.


> SMART overall-health self-assessment test result: PASSED

That is good.


> ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
>   1 Raw_Read_Error_Rate     POSR--   105   095   006    -    0
>  10 Spin_Retry_Count        PO--C-   100   100   097    -    0
> 187 Reported_Uncorrect      -O--CK   100   100   000    -    0
> 198 Offline_Uncorrectable   ----C-   100   100   000    -    0

A RAW_VALUE of 0 for these attributes is good.


> 199 UDMA_CRC_Error_Count    -OSRCK   200   200   000    -    7

7 is low, but the two in my file server are both 0.


Check your cable connections -- they should be fully engaged and not loose. If they are already seated properly, swap the cable. (I wrote a serial number on all of my SATA cables with Sharpie and track which cable is where.)


>   9 Power_On_Hours          -O--CK   034   034   000    -    58404

If 58404 is a count of hours, that is ~6.7 years (and I think it is), which is a lot of time. But I would not worry based on just this value.
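
The arithmetic behind that estimate can be checked quickly. This is only a sketch: it assumes the raw value really is a count of hours (not confirmed for this drive) and uses a 365-day year. The function name `hours_to_years` is made up for illustration.

```shell
# Convert a SMART Power_On_Hours raw value to years.
# Assumption: the raw value is in hours; a year is taken as 365 days.
hours_to_years() {
    awk -v h="$1" 'BEGIN { printf "%.2f\n", h / (24 * 365) }'
}

hours_to_years 58404    # prints 6.67
```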


>   7 Seek_Error_Rate         POSR--   088   060   030    -    747385748
> 195 Hardware_ECC_Recovered  -O-RC-   064   056   000    -    179548239

I don't know how to interpret these raw values. STFW suggests I am not alone.


> SMART Extended Comprehensive Error Log Version: 1 (5 sectors)
> No Errors Logged

That is good.


In fact I think it does not
write to this disk at all, as the partition in the raid setup points to a
disk with the same ID.

I think the problem is that blkid reports the same ID for both and that
somehow RAID is using this information, rather than using some of the other
mechanisms - UUID or UDEV Maker/Model/Serial - which can be found
under /dev/disk

As I understand it, when mdadm creates an array, mdadm puts a metadata header into each device that includes identification of the array and identification of each member.


When the system boots, mdadm reads /etc/mdadm/mdadm.conf for array specifications, scans all devices for mdadm metadata, and then assembles the specified arrays using the devices it finds (as best it can).
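
One way to see which devices actually ended up in each assembled array is to read /proc/mdstat. The following is a rough sketch, not a definitive parser: `md_members` is a hypothetical helper name, and it assumes the usual layout of an "mdN : ..." line followed by a "blocks" line ending in a status field such as [UU] or [U_].

```shell
# Summarise /proc/mdstat-style text from stdin: one line per array,
# listing the member devices and the [UU]/[U_] status field if present.
md_members() {
    awk '
        /^md[0-9]+ :/ {
            name = $1
            members = ""
            # Member devices look like sda1[0]; skip state/level words.
            for (i = 3; i <= NF; i++)
                if ($i ~ /\[[0-9]+\]/) members = members " " $i
        }
        /blocks/ && name != "" {
            status = ($NF ~ /^\[[U_]+\]$/) ? " " $NF : ""
            print name ":" members status
            name = ""
        }
    '
}
```

Run it as "md_members < /proc/mdstat". A two-disk RAID1 that shows only one member and a status of [U_] or [_U] is running degraded, which would match the symptom of no activity on the second disk.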


It looks like you partitioned your drives with one large partition on each drive, and then created the array on the partitions.


The matching PTUUID values for both drives, and matching UUID and PARTUUID values for both partitions, indicate that one drive was cloned onto the other at some point after creating the array. I agree that this is likely a mistake, and is likely to confuse mdadm.
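
A small filter can flag this kind of duplication in blkid(8) output. This is a sketch under assumptions: `dup_partuuids` is a hypothetical name, and the input is expected in blkid's default one-line-per-device format. It checks only PARTUUID, because for partitions of TYPE "linux_raid_member" blkid reports the array UUID as UUID, so members of one array share that value by design (the per-device value is UUID_SUB).

```shell
# Print every PARTUUID value that appears on more than one line of
# blkid-style output read from stdin. No output means no duplicates.
dup_partuuids() {
    sed -n 's/.* PARTUUID="\([^"]*\)".*/\1/p' | sort | uniq -d
}
```

Run it as root with "blkid | dup_partuuids". Any value it prints belongs to two or more partitions, which should not happen on independently partitioned drives.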


If you learn smartctl well enough, capture reports on a schedule
(weekly?), and look for trends, you might be able to predict failure.
STFW for information on this approach.
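
As a starting point for trend tracking, a saved report can be reduced to the raw value of one attribute. This is a minimal sketch: `smart_raw` is a hypothetical helper, and it assumes the attribute table layout shown above (ID, name, flags, value, worst, threshold, fail, raw value), taking only the first token of the raw value.

```shell
# Print "NAME RAW_VALUE" for one attribute ID, given smartctl --xall
# output on stdin. Assumes the 8-column attribute table layout.
smart_raw() {
    awk -v id="$1" '$1 == id && NF >= 8 && $2 ~ /^[A-Za-z]/ { print $2, $8 }'
}
```

For example, save a report each week with something like "smartctl --xall /dev/sdg > smart-sdg-$(date +%F).txt", then run "smart_raw 199 < smart-sdg-2017-11-09.txt" on each file (the file name here is made up) and compare the numbers over time.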


Download the bootable CD image of Seagate Seatools and run it:

https://www.seagate.com/support/downloads/seatools/


might do that,

You want that CD as part of your tool kit -- it makes running the SMART tests easy, lets you know if everything passed, and helps you understand anything that is questionable.


but I think the problem is in raid itself as it does not
indicate activity on the second disk and blkid reports the same id for two
disks - I really might need to look into the raid code if blkid is used in
any way.

An alternative to crawling through the code would be to build another array on a pair of USB flash drives using the same process as you used for your 500 GB drives, and then see what blkid(8) says about the USB drives.


Do you have the console session from when you built the array?


Be sure to keep a console session of any and all mdadm commands you issue from now on.


[the drives] are in server that virtually runs 24/7 and indeed I have replaced
many over the years. In fact most of the old disks are gone. The Seagate is
the oldest there ... the only left, so I think I'll just replace it so that
I may sleep well ... the problem is I don't know which disk is really
writing, might be the Seagate and the WD is not operational ... I think it
is best to be on the safe side :)

If the array is working, leave it alone. Back up/archive, build a replacement array, rsync the data over, validate, migrate services to the new array, validate services, and back up again (to validate your backup process). Once the new array has been up and running for a while, tear down the old array and pull its drives.


David