
filesystem slowdown with backports kernel




Hi,

we have a NAS system that stores our servers' backups
(via rsync with --link-dest). On that NAS we switched from the stable
kernel (4.9) to the one provided by backports (4.18) because of an
unrelated problem. With the backports kernel, the whole backup process
slows down, from the rsync transfer itself to the deletion of old backup
directories. The slowdown seems to be connected to the number of
files/directories, as backups of systems with fewer files seem less
affected than those with many files.


So we started benchmarking, and the following script seems to reproduce
our problem by creating about 100k directories and files (10 dirs, each
containing 10000 directories with one file each, for easier deleting
between tries):

#!/bin/bash
time (
    for i in {0..9}; do
        for j in {0000..9999}; do
            mkdir -p "$i/$j"
            touch "$i/$j/1"
        done
    done
)
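
Since deleting old backup directories is slow for us as well, a parameterized variant with a separate cleanup step may be handy for timing creation and deletion independently. This is only a sketch: make_tree and remove_tree are hypothetical helper names, and the directory names are not zero-padded as in the script above.

```shell
#!/bin/bash
# Sketch only: make_tree and remove_tree are hypothetical helper names.

# Create <dirs> top-level directories under <root>, each holding <per_dir>
# subdirectories that contain one empty file named "1".
# The function body runs in a subshell, so its variables stay local.
make_tree() (
    root=$1 dirs=$2 per_dir=$3
    i=0
    while [ "$i" -lt "$dirs" ]; do
        j=0
        while [ "$j" -lt "$per_dir" ]; do
            mkdir -p "$root/$i/$j"
            touch "$root/$i/$j/1"
            j=$((j + 1))
        done
        i=$((i + 1))
    done
)

# Remove the whole tree again so the next run starts from scratch.
remove_tree() {
    rm -rf -- "$1"
}
```

To match the sizes used above, this could be run as "time make_tree ./bench 10 10000" followed by "time remove_tree ./bench" (./bench is just an example path).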


We get the following results (varying by a few seconds between runs):

4.9 ext4:
real	2m13.303s
user	0m4.976s
sys	0m20.424s

4.9 xfs:
real	2m7.416s
user	0m5.076s
sys	0m20.960s

4.18 ext4:
real	4m3.276s
user	2m46.401s
sys	1m12.546s

4.18 xfs:
real	3m53.430s
user	2m46.841s
sys	1m12.716s

Elapsed time goes up by a factor of about 1.8 (so throughput drops by
roughly 45%), and user and sys time increase far more than that.
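
For anyone who wants to compare the numbers directly, the ratios can be computed from the `time` output with a small helper. This is a sketch; to_seconds and slowdown are hypothetical names, and the input is assumed to be in the XmY.ZZZs format shown above.

```shell
#!/bin/bash
# Sketch only: to_seconds and slowdown are hypothetical helper names.

# Convert a value in the format printed by `time` (e.g. 2m13.303s) to seconds.
to_seconds() {
    echo "$1" | awk -F'[ms]' '{ printf "%.3f\n", $1 * 60 + $2 }'
}

# Print new/old as a ratio with two decimals.
slowdown() {
    awk -v old="$(to_seconds "$1")" -v new="$(to_seconds "$2")" \
        'BEGIN { printf "%.2f\n", new / old }'
}

slowdown 2m13.303s 4m3.276s    # ext4 real: prints 1.82
slowdown 0m4.976s 2m46.401s    # ext4 user: prints 33.44
slowdown 0m20.424s 1m12.546s   # ext4 sys:  prints 3.55
```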


To rule out something like the Spectre/Meltdown mitigations, we tried the
oldest kernel package newer than the one in stable that we could find on
http://snapshot.debian.org, which is from July 2017.

4.11 ext4:
real	3m28.443s
user	2m29.551s
sys	1m0.924s

4.11 xfs:
real	3m32.438s
user	2m31.349s
sys	1m3.333s

It's a little faster than 4.18 but the problem still persists.


The NAS uses a software RAID 6 via MD, so to rule out the RAID as the
source of the problem we ran the same script on a desktop system and
see the same thing:

4.9 ext4 desktop:
real	2m22.525s
user	0m6.176s
sys	0m20.872s

4.18 ext4 desktop:
real	4m16.412s
user	3m2.282s
sys	1m19.308s
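
The benchmark loop starts two external processes per iteration (mkdir and touch), so about 200,000 in total. One possible control, sketched here and not part of the measurements above, would be to time the same number of process starts without touching the filesystem at all. spawn_only is a hypothetical helper name.

```shell
#!/bin/bash
# Sketch only: spawn_only is a hypothetical helper name.

# Run the given command <count> times, one new process per iteration.
# The function body runs in a subshell, so its variables stay local.
spawn_only() (
    count=$1
    shift
    n=0
    while [ "$n" -lt "$count" ]; do
        "$@"
        n=$((n + 1))
    done
)
```

It could be run as "time spawn_only 200000 /bin/true" under both kernels. If that shows a similar gap, the slowdown would more likely be related to process creation than to ext4 or xfs; if not, the filesystem path remains the suspect.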


So to us it looks like something is seriously wrong somewhere, but we
have no clue where exactly to look anymore. Is the test flawed, did we
miss something in the news about an expected slowdown, or is it really a
bug, and if so, where can we look to locate it more precisely?

Thanks in advance,
Jens Holzkämper