Web lists-archives.com

Re: The Internet locks up Buster


On Wed, Jun 06, 2018 at 05:51:34PM -0400, Borden Rhodes wrote:
> On 6 June 2018 at 15:26,  <debian-user-digest-request@xxxxxxxxxxxxxxxx> wrote:
> >         Hi.
> >
> > On Tue, Jun 05, 2018 at 09:25:25PM -0400, Borden Rhodes wrote:
> >> During the freeze, hard drive activity goes through the roof. Yes, my
> >> hard drive is fine and works normally when doing anything else. I also
> >> don't get these lockups when browsing in Windows.
> >
> > That looks like an OOM-killer that's desperately trying to free some RAM.
> > Or good old 12309 bug.
> >
> > Please share the contents of /var/log/kern.log that correspond to the
> > time of the problem. Any single one will do.
> >
> > Reco
> Thank you very much for following up, Reco. I had no difficulty
> triggering a hard drive thrash (thank you, Linotype!), and the problem
> is definitely memory-related. I managed to stabilise my system by
> increasing the swap partition. Unfortunately, the hard drive didn't
> stop thrashing and after an hour I gave up and hard-powered off. I
> did, however, save this kernel log from 20 minutes into the OOM
> https://paste.debian.net/1028268/ (link's good until 9 June)

Thanks. But it's not OOM-killer. It's this kernel thread that's blocking
your userspace:

Jun  6 16:47:06 Debian kernel: [15588.272422] INFO: task btrfs-transacti:1301 blocked for more than 120 seconds.
Jun  6 16:47:06 Debian kernel: [15588.272430]       Not tainted 4.16.0-2-amd64 #1 Debian 4.16.12-1
Jun  6 16:47:06 Debian kernel: [15588.272433] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun  6 16:47:06 Debian kernel: [15588.272437] btrfs-transacti D    0 1301      2 0x80000000

I.e. 12309 bug is back. It's obscure and presumably fixed (at least four
times fixed) bug that happens with relatively slow filesystem (be it
SSD/HDD/NFS or whatever) and a large amount of free RAM. I first
encountered the thing back in 2.6.18 days, where it was presumably
implemented (as in - nobody complained before ;).

The idea behind that bug is simple - first, the kernel accumulates a
certain amount of 'dirty' (i.e. changed) filesystem blocks. Since the
amount of free RAM is large, the amount of such blocks is huge too.
Next, kernel realizes that it's time for a 'barrier write' - everything
that was happening before the barrier must be written onto persistent
storage. And since it's 'barrier write time', everyone at userspace are
blocked from making new changes for the existing filesystems, i.e.
everyone are blocked on I/O.
Since the amount of dirty blocks is huge, and the filesystem is slow -
the kernel takes its time and writes dirty blocks. But - it writes them
slowly, and new I/O requests are accumulating faster than it's possible
for the kernel to write them. Hence the lookup.

> So, as I think you suggested, it seems that OOM-killer isn't getting
> in quickly enough to kill a program and/or not working correctly.
> Suggestions?

Limit the size of dirty blocks cache. Kernel defaults are insanely large.
What I'm using here is:

$ cat /etc/sysctl.d/12309.conf

You may need to set lower, as these kernel tunables are percentage of
available RAM.
If you're using a laptop, you may also need to watch for userspace that
tries to modify those tunables (laptop-mode-tools come to mind).

Setting these tunable *will* lower your I/O throughput, but I'll take
slowness over the hangs every day.