Need help analyzing (kernel?) memory usage and reclaiming RAM (Debian Stretch)




Hello,

(please let me know if this is more appropriate somewhere else, e.g. on
debian-kernel)

I need help debugging/solving a weird memory problem. The symptoms are
the usual ones for high memory usage: free/available memory is getting
low, systems start swapping, disk I/O increases, performance drops.

However, from what I can see, the memory is not used up by user-space
processes but by the kernel (NOT caches/buffers); see the command output
at the end.
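
To make the arithmetic behind that claim explicit, here is a rough
sketch in Python. The choice of fields to subtract is my own assumption,
not an official formula, and some categories may overlap slightly, so
the result is only an estimate. The sample values are copied from the
/proc/meminfo output captured while the problem persists (further down).

```python
# Rough accounting of /proc/meminfo: how much RAM is not attributed to any
# of the usual categories? All values are in kB, as printed by the kernel.

# Assumption: these fields together approximate the "explained" memory.
ACCOUNTED_FIELDS = (
    "MemFree", "Buffers", "Cached", "SwapCached",
    "AnonPages", "Slab", "PageTables", "KernelStack",
)


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of field name -> int."""
    info = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep or not rest.split():
            continue  # skip lines that are not "Name: value [unit]"
        info[name.strip()] = int(rest.split()[0])
    return info


def unaccounted_kb(info):
    """MemTotal minus the fields above (missing fields count as 0)."""
    return info["MemTotal"] - sum(info.get(f, 0) for f in ACCOUNTED_FIELDS)


# Excerpt of the values recorded while the problem persisted.
SAMPLE = """\
MemTotal:        1010976 kB
MemFree:           61068 kB
Buffers:            1020 kB
Cached:             5748 kB
SwapCached:         3988 kB
AnonPages:          6608 kB
Slab:              79680 kB
PageTables:         5804 kB
KernelStack:        2992 kB
"""

if __name__ == "__main__":
    gap = unaccounted_kb(parse_meminfo(SAMPLE))
    print(f"unaccounted: {gap} kB ({gap / 1024:.0f} MiB)")
    # prints: unaccounted: 844068 kB (824 MiB)
```

On a live system the same functions can be fed the contents of
/proc/meminfo instead of SAMPLE. With the values above, roughly 824 MiB
of the 987 MiB total do not show up in any of these categories.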

I'm still puzzled about what exactly eats all the RAM and how to reclaim
it (without rebooting the machine, of course!). Any help would be highly
appreciated!

Some findings so far:

- same problem on many systems, all Debian 9 Stretch, all running stock
  4.9 kernel from the official package, all amd64 virtual machines on
  several (different) VMware ESXi hosts.
- not all Stretch systems seem to be affected, but we haven't yet found
  the common ground.
- problem can occur after some days or some weeks, not at the same time
  on all affected machines, and not at the same time for all VMs on the
  same host
- problem only occurs on Stretch systems, not Jessie, even running on
  the same host.
- we haven't yet seen the problem on real hardware machines, only VMs
  (but since the vast majority of our systems are VMs, this may not be
  relevant)
- problem does not seem directly related to the machine's load; it
  occurs on machines that are mostly idle as well as on more
  heavily-loaded systems
- problem occurs the same on single-core VMs as well as on multi-core
  VMs
- problem occurs the same on VMs running on single-socket hosts as well
  as on multi-socket hosts
- problem occurs the same on VMs running on hosts with different
  hypervisor releases, both VMware ESXi 5.5 and 6.5, both standalone and
  in a vSphere cluster.

Here's the output of some commands that I hope is helpful:

The machine in this example is a RADIUS server that has not even gone
into production yet ... no incoming client requests so far.  (But the
problem is not related to the RADIUS server software - OSC Radiator -
since the same symptoms show on different machines: not only RADIUS
servers but also nameservers, shell servers or jumphosts, etc.)

[values while the problem persists:]
------------------------------------------------------------------------
root@rad-m2m-srv02:~# free -thwl
              total        used        free      shared     buffers       cache   available
Mem:           987M        910M         59M          0B        704K         16M         13M
Low:           987M        927M         59M
High:            0B          0B          0B
Swap:          2,0G        345M        1,7G
Total:         3,0G        1,2G        1,7G
root@rad-m2m-srv02:~# smem -twk
Area                           Used      Cache   Noncache 
firmware/hardware                 0          0          0 
kernel image                      0          0          0 
kernel dynamic memory        914.9M      11.1M     903.8M 
userspace memory              13.0M       5.5M       7.4M 
free memory                   59.4M      59.4M          0 
----------------------------------------------------------
                             987.3M      76.1M     911.2M 
root@rad-m2m-srv02:~# smem -uktr
User     Count     Swap      USS      PSS      RSS 
root        39   332.8M    10.4M    12.4M    44.7M 
msch         6     7.0M        0   607.0K     8.3M 
_chrony      1   360.0K     4.0K    20.0K   572.0K 
messagebus     1   580.0K     4.0K    17.0K   480.0K 
postfix      2     1.6M        0    13.0K   568.0K 
daemon       1   208.0K     4.0K     6.0K    72.0K 
---------------------------------------------------
            50   342.5M    10.4M    13.0M    54.7M 
root@rad-m2m-srv02:~# sort -k2,2nr /proc/meminfo
VmallocTotal:   34359738367 kB
CommitLimit:     2602636 kB
SwapTotal:       2097148 kB
SwapFree:        1741028 kB
MemTotal:        1010976 kB
DirectMap4k:     1007488 kB
Committed_AS:     465128 kB
Slab:              79680 kB
SUnreclaim:        69268 kB
MemFree:           61068 kB
DirectMap2M:       40960 kB
SReclaimable:      10412 kB
Active:             6944 kB
Inactive:           6660 kB
AnonPages:          6608 kB
PageTables:         5804 kB
Cached:             5748 kB
Mapped:             4660 kB
SwapCached:         3988 kB
Active(file):       3920 kB
Inactive(anon):     3828 kB
Active(anon):       3024 kB
KernelStack:        2992 kB
Inactive(file):     2832 kB
Hugepagesize:       2048 kB
Buffers:            1020 kB
Dirty:                 8 kB
AnonHugePages:         0 kB
Bounce:                0 kB
HardwareCorrupted:     0 kB
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
HugePages_Total:       0
MemAvailable:          0 kB
Mlocked:               0 kB
NFS_Unstable:          0 kB
Shmem:                 0 kB
ShmemHugePages:        0 kB
ShmemPmdMapped:        0 kB
Unevictable:           0 kB
VmallocChunk:          0 kB
VmallocUsed:           0 kB
Writeback:             0 kB
WritebackTmp:          0 kB
root@rad-m2m-srv02:~# ps aux --sort=-rss | head -15
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root     34718 12.0  0.5  29596  5672 ?        D    09:01   0:00 /usr/bin/python3 -Es /usr/bin/lsb_release --short --description
root     26491  3.1  0.2  79328  2860 ?        D    08:04   1:50 apt-get update -qq
root     32551  6.8  0.2 119036  2800 ?        D    08:51   0:43 /usr/bin/python3 /usr/bin/unattended-upgrade
root     34719  0.0  0.2  41164  2232 pts/1    R+   09:02   0:00 ps aux --sort=-rss
msch     33960  0.1  0.1  23720  1844 pts/0    Ss   08:58   0:00 -bash
root     34492  0.2  0.1  23816  1812 pts/1    S    09:00   0:00 -bash
msch     33996  0.0  0.1  23576  1768 pts/1    Ss   08:58   0:00 bash -i
root     12792  2.2  0.1 159720  1748 ?        D    06:06   3:54 /usr/bin/perl -w /usr/bin/apt-show-versions -i
root     34521  0.7  0.1  95180  1712 ?        Ss   09:01   0:00 sshd: root@notty
root     15502  2.4  0.1 167660  1608 ?        D    06:25   3:51 /usr/bin/perl -w /usr/bin/apt-show-versions -i
root     34527  1.7  0.1  14096  1596 ?        Ss   09:01   0:00 /bin/bash /usr/bin/check_mk_agent
root     33947  0.0  0.1  95180  1564 ?        Ss   08:58   0:00 sshd: msch [priv]
root     26486  0.0  0.1   9600  1436 ?        S    08:04   0:00 /bin/bash 3600/mk_apt
root     26483  0.0  0.1   9588  1424 ?        S    08:04   0:00 /bin/bash
root@rad-m2m-srv02:~# lsof | wc -l
1943
root@rad-m2m-srv02:~# df -Th -t tmpfs
Filesystem     Type   Size  Used Avail Use% Mounted on
tmpfs          tmpfs   99M   12M   87M  12% /run
tmpfs          tmpfs  494M     0  494M   0% /dev/shm
tmpfs          tmpfs  5,0M     0  5,0M   0% /run/lock
tmpfs          tmpfs  494M     0  494M   0% /sys/fs/cgroup
tmpfs          tmpfs  1,0G     0  1,0G   0% /tmp
tmpfs          tmpfs   99M     0   99M   0% /run/user/0
tmpfs          tmpfs   99M     0   99M   0% /run/user/2029
root@rad-m2m-srv02:~# vmware-toolbox-cmd stat balloon
0 MB
root@rad-m2m-srv02:~# cat /sys/kernel/debug/vmmemctl
balloon capabilities:   0x1e
used capabilities:      0x1e
is resetting:           n
target:                    0 pages
current:                   0 pages
rateSleepAlloc:         2048 pages/sec

timer:               3968363
doorbell:                  0
start:                     7 (   0 failed)
guestType:                 7 (   0 failed)
2m-lock:                   0 (   0 failed)
lock:                      0 (   0 failed)
2m-unlock:                 0 (   0 failed)
unlock:                    0 (   0 failed)
target:              3968363 (   6 failed)
prim2mAlloc:               0 (   0 failed)
primNoSleepAlloc:          0 (   0 failed)
primCanSleepAlloc:         0 (   0 failed)
prim2mFree:                0
primFree:                  0
err2mAlloc:                0
errAlloc:                  0
err2mFree:                 0
errFree:                   0
doorbellSet:               6
doorbellUnset:             7
root@rad-m2m-srv02:~# nice vmstat -w 1 10
procs -----------------------memory---------------------- ---swap-- -----io---- -system-- --------cpu--------
 r  b         swpd         free         buff        cache   si   so    bi    bo   in   cs  us  sy  id  wa  st
 0  5       356620        60868         1140        16280   37   19   704    31    3    2   1   2  97   1   0
 1  4       356180        60372          320        16224 3008  624  6180  1236 1109 1915   2  18   0  80   0
 2  5       356632        61476          320        15568 2776 1452  3128  2012 1146 1802   1  14   0  85   0
 1  3       356592        62228          324        15244 2848  952  3784  1564 1029 1780   0  11   0  89   0
 2  4       356732        61492          612        15544 2864 1144  3932  1720 1164 1839   2   9   0  89   0
 1  4       357252        62836          556        15248 4000 1800  4432  3048 1398 2359   1  15   0  84   0
 0  4       356700        61744          448        15248 3368  668  3368  1276 1093 2039   0   9   0  91   0
 2  4       356708        61372          456        16272 1940  868  4744   888  876 1377   0  12   0  88   0
 0  4       356704        61744         1156        14700 2740  660  4828  1940 1123 1768   0  14   0  86   0
 0  4       357556        62240          680        15568 2908 1476  5436  2064 1062 1804   1  15   0  84   0
root@rad-m2m-srv02:~# lsb_release -a
No LSB modules are available.
Distributor ID:	Debian
Description:	Debian GNU/Linux 9.8 (stretch)
Release:	9.8
Codename:	stretch
root@rad-m2m-srv02:~# uname -a
Linux rad-m2m-srv02 4.9.0-8-amd64 #1 SMP Debian 4.9.144-3 (2019-02-02) x86_64 GNU/Linux
root@rad-m2m-srv02:~# w
 09:02:30 up 45 days, 22:20,  1 user,  load average: 5,13, 5,03, 6,58
USER     TTY      FROM             LOGIN@   IDLE   JCPU   PCPU WHAT
msch     pts/0    10.208.105.87    08:58    4.00s  0.26s  0.03s script memdebug
root@rad-m2m-srv02:~# 

[values directly after rebooting:]
------------------------------------------------------------------------
root@rad-m2m-srv02:~# w
 09:23:02 up 4 min,  1 user,  load average: 0,01, 0,08, 0,04
USER     TTY      FROM             LOGIN@   IDLE   JCPU   PCPU WHAT
msch     pts/0    10.208.105.87    09:21    6.00s  0.26s  0.02s sshd: msch [priv]   
root@rad-m2m-srv02:~# free -thwl
              total        used        free      shared     buffers       cache   available
Mem:           987M        112M        610M        4,3M         16M        247M        735M
Low:           987M        377M        610M
High:            0B          0B          0B
Swap:          2,0G          0B        2,0G
Total:         3,0G        112M        2,6G
root@rad-m2m-srv02:~# smem -twk
Area                           Used      Cache   Noncache 
firmware/hardware                 0          0          0 
kernel image                      0          0          0 
kernel dynamic memory        287.1M     226.6M      60.5M 
userspace memory              93.8M      37.8M      56.0M 
free memory                  606.4M     606.4M          0 
----------------------------------------------------------
                             987.3M     870.8M     116.5M 
root@rad-m2m-srv02:~# smem -uktr
User     Count     Swap      USS      PSS      RSS 
root        19        0    62.9M    72.8M   128.7M 
postfix      6        0     7.9M    12.1M    42.9M 
msch         4        0     3.7M     7.3M    19.4M 
messagebus     1        0     1.2M     1.5M     3.8M 
_chrony      1        0   896.0K  1020.0K     2.8M 
daemon       1        0   228.0K   309.0K     2.1M 
---------------------------------------------------
            32        0    76.9M    95.0M   199.7M 
root@rad-m2m-srv02:~# sort -k2,2nr /proc/meminfo
VmallocTotal:   34359738367 kB
CommitLimit:     2602636 kB
SwapFree:        2097148 kB
SwapTotal:       2097148 kB
MemTotal:        1010976 kB
DirectMap2M:      983040 kB
MemAvailable:     753520 kB
MemFree:          624520 kB
Cached:           234508 kB
Active:           161672 kB
Inactive:         142964 kB
Inactive(file):   138936 kB
Committed_AS:     124808 kB
Active(file):     108028 kB
DirectMap4k:       65408 kB
Active(anon):      53644 kB
AnonPages:         53300 kB
Slab:              36968 kB
Mapped:            36760 kB
SReclaimable:      19424 kB
SUnreclaim:        17544 kB
Buffers:           16836 kB
Shmem:              4392 kB
Inactive(anon):     4028 kB
PageTables:         3836 kB
KernelStack:        2748 kB
Hugepagesize:       2048 kB
Dirty:                60 kB
AnonHugePages:         0 kB
Bounce:                0 kB
HardwareCorrupted:     0 kB
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
HugePages_Total:       0
Mlocked:               0 kB
NFS_Unstable:          0 kB
ShmemHugePages:        0 kB
ShmemPmdMapped:        0 kB
SwapCached:            0 kB
Unevictable:           0 kB
VmallocChunk:          0 kB
VmallocUsed:           0 kB
Writeback:             0 kB
WritebackTmp:          0 kB
root@rad-m2m-srv02:~# ps aux --sort=-rss | head -15
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root       651  0.1  2.6  78748 26992 ?        S    09:18   0:00 /usr/bin/perl /opt/radiator/bin/radiusd -daemon -pid_file /var/run/radiator.pid -config_file /opt/radiator/etc/radiator.cfg -I /opt/radiator/share/perl/5.24.1/
root       411  0.0  1.7 153488 18144 ?        Ss   09:18   0:00 /usr/bin/VGAuthService
root       221  0.1  1.0 136488 10464 ?        Ss   09:18   0:00 /usr/bin/vmtoolsd
postfix   2033  0.1  0.8  88652  8968 ?        S    09:22   0:00 smtp -t unix -u
postfix   2034  0.0  0.8  87480  8132 ?        S    09:22   0:00 tlsmgr -l -t unix -u
root         1  0.3  0.6  57052  6736 ?        Ss   09:18   0:00 /sbin/init
root      1462  0.0  0.6  95180  6736 ?        Ss   09:21   0:00 sshd: msch [priv]
postfix   2031  0.0  0.6  83352  6700 ?        S    09:22   0:00 cleanup -z -t unix -u
postfix    649  0.0  0.6  83296  6600 ?        S    09:18   0:00 qmgr -l -t unix -u
postfix   2032  0.0  0.6  83260  6600 ?        S    09:22   0:00 trivial-rewrite -n rewrite -t unix -u
postfix    648  0.0  0.6  83248  6284 ?        S    09:18   0:00 pickup -l -t unix -u
root       527  0.0  0.6  69952  6168 ?        Ss   09:18   0:00 /usr/sbin/sshd -D
msch      1464  0.0  0.6  64832  6144 ?        Ss   09:21   0:00 /lib/systemd/systemd --user
root       251  0.0  0.5  47844  5872 ?        Ss   09:18   0:00 /lib/systemd/systemd-udevd
root@rad-m2m-srv02:~# lsof | wc -l
1605
root@rad-m2m-srv02:~# df -Th -t tmpfs
Filesystem     Type   Size  Used Avail Use% Mounted on
tmpfs          tmpfs   99M  4,3M   95M   5% /run
tmpfs          tmpfs  494M     0  494M   0% /dev/shm
tmpfs          tmpfs  5,0M     0  5,0M   0% /run/lock
tmpfs          tmpfs  494M     0  494M   0% /sys/fs/cgroup
tmpfs          tmpfs  1,0G     0  1,0G   0% /tmp
tmpfs          tmpfs   99M     0   99M   0% /run/user/2029
root@rad-m2m-srv02:~# vmware-toolbox-cmd stat balloon
0 MB
root@rad-m2m-srv02:~# cat /sys/kernel/debug/vmmemctl
balloon capabilities:   0x1e
used capabilities:      0x1e
is resetting:           n
target:                    0 pages
current:                   0 pages
rateSleepAlloc:         2048 pages/sec

timer:                   292
doorbell:                  0
start:                     1 (   0 failed)
guestType:                 1 (   0 failed)
2m-lock:                   0 (   0 failed)
lock:                      0 (   0 failed)
2m-unlock:                 0 (   0 failed)
unlock:                    0 (   0 failed)
target:                  292 (   0 failed)
prim2mAlloc:               0 (   0 failed)
primNoSleepAlloc:          0 (   0 failed)
primCanSleepAlloc:         0 (   0 failed)
prim2mFree:                0
primFree:                  0
err2mAlloc:                0
errAlloc:                  0
err2mFree:                 0
errFree:                   0
doorbellSet:               1
doorbellUnset:             1
root@rad-m2m-srv02:~# nice vmstat -w 1 10
procs -----------------------memory---------------------- ---swap-- -----io---- -system-- --------cpu--------
 r  b         swpd         free         buff        cache   si   so    bi    bo   in   cs  us  sy  id  wa  st
 0  0            0       622948        16868       254624    0    0   728   254  104  231   4   2  88   5   0
 0  0            0       622948        16868       254624    0    0     0     0   53   98   0   0 100   0   0
 0  0            0       622948        16876       254600    0    0     0    20   50   96   0   0 100   0   0
 0  0            0       622948        16876       254600    0    0     0     0   50   91   0   0 100   0   0
 0  0            0       622948        16876       254600    0    0     0     0   43   84   0   0 100   0   0
 0  0            0       622948        16876       254604    0    0     0     0   57  105   1   0  99   0   0
 0  0            0       622948        16876       254600    0    0     0     0   53  106   0   1  99   0   0
 0  0            0       622948        16876       254600    0    0     0     0   50   91   1   0  99   0   0
 1  0            0       622948        16876       254600    0    0     0     0   49   96   0   0 100   0   0
 0  0            0       622948        16876       254600    0    0     0    12   50   94   0   1  99   0   0
root@rad-m2m-srv02:~# 

------------------------------------------------------------------------
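
For comparing the two /proc/meminfo dumps above field by field, a small
sketch like the following may help. It is only an illustration: the
function names are made up, and the two excerpts contain just a handful
of the fields shown above. Deltas are "problem state minus fresh boot".

```python
# Compare two /proc/meminfo snapshots and list the largest changes (kB).


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of field name -> int."""
    info = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep or not rest.split():
            continue  # skip lines that are not "Name: value [unit]"
        info[name.strip()] = int(rest.split()[0])
    return info


def meminfo_delta(fresh, degraded):
    """Return (field, degraded - fresh) pairs, largest absolute change first."""
    common = fresh.keys() & degraded.keys()
    deltas = [(name, degraded[name] - fresh[name]) for name in common]
    return sorted(deltas, key=lambda item: (-abs(item[1]), item[0]))


# Excerpt of the values directly after rebooting.
FRESH = """\
MemFree:          624520 kB
Cached:           234508 kB
AnonPages:         53300 kB
Slab:              36968 kB
SReclaimable:      19424 kB
SUnreclaim:        17544 kB
PageTables:         3836 kB
KernelStack:        2748 kB
"""

# Excerpt of the values while the problem persisted.
DEGRADED = """\
MemFree:           61068 kB
Cached:             5748 kB
AnonPages:          6608 kB
Slab:              79680 kB
SReclaimable:      10412 kB
SUnreclaim:        69268 kB
PageTables:         5804 kB
KernelStack:        2992 kB
"""

if __name__ == "__main__":
    deltas = meminfo_delta(parse_meminfo(FRESH), parse_meminfo(DEGRADED))
    for name, delta in deltas:
        print(f"{name:<14}{delta:>+10} kB")
```

With these excerpts, MemFree drops by 563452 kB and Cached by 228760 kB,
while Slab grows by only 42712 kB (SUnreclaim by 51724 kB). So slab
growth alone does not seem to explain where the memory went.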

Anything else I could check to help pinpoint the memory hog?


Thanks in advance!
Martin

-- 
Martin Schwarz * Karlsruhe, Germany * http://kuroi.de/