Software RAID blocks




Hi!

TL;DR:
My /home (dm-crypt on top of software RAID5) blocks at irregular intervals,
usually without any error messages.

I can get it going again by running "fdisk -l /dev/sdx" on one of the member disks.

Do you have any ideas how I can debug this issue further? Is it a dm-crypt,
an md (software RAID) or a hardware issue?

---------------------------------------------------------------

Long version:
My /home "partition" is dm-crypt on top of a software RAID5 with 5 SATA disks.
See the system info further down in this mail.
Once in a while user programs freeze because dm-crypt, or something
further down the chain, blocks, apparently during a write to /home.
If I am lucky and already have a root shell open, I can run
  $ fdisk -l /dev/sdx
against one of the hard disks in the RAID, and the hang clears instantly.

I checked whether it could be a spindown / power management problem, but all
disks that have a PM feature have it disabled, so I don't think this is
the cause.

Last night I got a "blocked for more than 300 seconds" message in syslog;
see <https://paste.debian.net/1060134/> (link valid for 90 days).

Log summary:
Jan 13 02:34:44 osprey kernel: [969696.242745] INFO: task md127_raid5:238 blocked for more than 300 seconds.
Jan 13 02:34:44 osprey kernel: [969696.242772] Call Trace:
Jan 13 02:34:44 osprey kernel: [969696.242789]  ? __schedule+0x2a2/0x870
Jan 13 02:34:44 osprey kernel: [969696.242995] INFO: task dmcrypt_write:904 blocked for more than 300 seconds.
Jan 13 02:34:44 osprey kernel: [969696.243223] INFO: task jbd2/dm-2-8:917 blocked for more than 300 seconds.
Jan 13 02:34:44 osprey kernel: [969696.243525] INFO: task mpc:6622 blocked for more than 300 seconds.
Jan 13 02:34:44 osprey kernel: [969696.243997] INFO: task kworker/u8:0:6625 blocked for more than 300 seconds.

In this case I did a
  $ fdisk -l /dev/sdf
and everything worked again.
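
Since running fdisk clears the hang, the md state has to be recorded before
doing that. The following is only a minimal sketch of what could be captured
next time. The function name md_snapshot and the choice of attributes are
assumptions for illustration; the attributes themselves are standard md sysfs
files, and any that are missing on a given kernel are reported as
"unavailable".

```shell
# Print a few md sysfs attributes for one array as name=value lines.
# $1: array name (md127 on this system)
# $2: sysfs root; defaults to /sys/block, overridable for a dry run
md_snapshot() {
    md_dir="${2:-/sys/block}/$1/md"
    for attr in array_state sync_action degraded stripe_cache_size stripe_cache_active; do
        if [ -r "$md_dir/$attr" ]; then
            printf '%s=%s\n' "$attr" "$(cat "$md_dir/$attr")"
        else
            printf '%s=unavailable\n' "$attr"
        fi
    done
}
```

It would be run from the open root shell, with the output going to the root
filesystem rather than to the blocked /home, for example:
  { date; cat /proc/mdstat; md_snapshot md127; } >> /root/md127-hang.log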

As I understand the log, mpc (a user program) started and probably accessed
its config file on /home. ext4 then tried to record the new access time, which
went down the chain jbd2 -> dm-crypt and finally blocked in md127_raid5.
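
The order of blocked tasks can be pulled out of the full kernel log with
something like the sketch below. It assumes the hung-task lines have the same
format as in the log summary above; the function name blocked_tasks is
made up for illustration.

```shell
# Read kernel log text on stdin and print "name pid seconds" for each task
# reported by the hung-task detector, in order of first appearance.
blocked_tasks() {
    sed -n 's/.*INFO: task \(.*\):\([0-9][0-9]*\) blocked for more than \([0-9][0-9]*\) seconds.*/\1 \2 \3/p' |
        awk '!seen[$0]++'   # drop repeated reports of the same task
}
```

Usage would be "dmesg | blocked_tasks" or "journalctl -k | blocked_tasks".
Note that the order only shows when the detector reported each task, not
necessarily which task got stuck first.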

So the problem is most likely in the software RAID or the hard disks,
isn't it? SMART is enabled on all disks and does not report any
errors.

How can I debug this further to solve the problem? Thanks in advance for
your suggestions.

Tom

---------------------------------------------------------------
System info:
============
Debian testing

$ uname -a
Linux osprey 4.19.0-1-amd64 #1 SMP Debian 4.19.12-1 (2018-12-22) x86_64 GNU/Linux

$ lsblk -i
NAME              MAJ:MIN RM  SIZE RO TYPE  MOUNTPOINT
sda                 8:0    0 74.5G  0 disk
|-sda1              8:1    0    4G  0 part
| `-cswap1        253:1    0    4G  0 crypt [SWAP]
`-sda2              8:2    0 70.5G  0 part
  `-osprey_root   253:0    0 70.5G  0 crypt /
sdb                 8:16   0  2.7T  0 disk
`-sdb1              8:17   0  2.7T  0 part
  `-md127           9:127  0 10.9T  0 raid5
    `-osprey_home 253:2    0 10.9T  0 crypt /home
sdc                 8:32   0  2.7T  0 disk
`-sdc1              8:33   0  2.7T  0 part
  `-md127           9:127  0 10.9T  0 raid5
    `-osprey_home 253:2    0 10.9T  0 crypt /home
sdd                 8:48   0  2.7T  0 disk
`-sdd1              8:49   0  2.7T  0 part
  `-md127           9:127  0 10.9T  0 raid5
    `-osprey_home 253:2    0 10.9T  0 crypt /home
sde                 8:64   0  2.7T  0 disk
`-sde1              8:65   0  2.7T  0 part
  `-md127           9:127  0 10.9T  0 raid5
    `-osprey_home 253:2    0 10.9T  0 crypt /home
sdf                 8:80   0  2.7T  0 disk
`-sdf1              8:81   0  2.7T  0 part
  `-md127           9:127  0 10.9T  0 raid5
    `-osprey_home 253:2    0 10.9T  0 crypt /home

$ sdparm --get STANDBY /dev/sd[bcdef]
    /dev/sdb: ATA       ST3000VN000-1H41  SC43
STANDBY not found in Power condition [po] mode page
    /dev/sdc: ATA       WDC WD30EURX-63T  0A80
STANDBY not found in Power condition [po] mode page
    /dev/sdd: ATA       TOSHIBA DT01ACA3  ABB0
STANDBY not found in Power condition [po] mode page
    /dev/sde: ATA       ST3000DM001-1CH1  CC27
STANDBY not found in Power condition [po] mode page
    /dev/sdf: ATA       WDC WD30EFRX-68E  0A80
STANDBY not found in Power condition [po] mode page

$ hdparm -B /dev/sd[bcdef]
/dev/sdb:
APM_level      = 254
/dev/sdc:
APM_level      = not supported
/dev/sdd:
APM_level      = off
/dev/sde:
APM_level      = 254
/dev/sdf:
APM_level      = not supported

$ cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4] [linear] [multipath] [raid0] [raid1] [raid10]
md127 : active raid5 sdc1[1] sdd1[2] sdb1[0] sdf1[5] sde1[3]
      11719766016 blocks super 1.2 level 5, 512k chunk, algorithm 2 [5/5] [UUUUU]
      bitmap: 1/22 pages [4KB], 65536KB chunk
unused devices: <none>

$ for i in {b..f}; do echo "DISK: ${i}"; smartctl -a "/dev/sd${i}" |grep "SMART overall-health self-assessment test result"; done
DISK: b
SMART overall-health self-assessment test result: PASSED
DISK: c
SMART overall-health self-assessment test result: PASSED
DISK: d
SMART overall-health self-assessment test result: PASSED
DISK: e
SMART overall-health self-assessment test result: PASSED
DISK: f
SMART overall-health self-assessment test result: PASSED