Web lists-archives.com

[Samba] Samba 4.9.3 and the "10 hour problem"




We have a setup with a bunch of Dell servers acting as fileserver for the university here, running FreeBSD with Samba (and NFS & SFTP) to provide service for staff & students. 

This works in general really (very) well. We have around 4000-5000 SMB connections spread over 6 servers during “rush hours” (8am-5pm). 

However, due to some strange (I bet it’s due to the service ticket timeout) reason exactly 10 hours after we restart the Samba daemons (and every 10:th hour after that) we are seeing problems with clients connecting to the Samba servers (connections with valid passwords and/or valid Kerberos ticket are getting a “timeout” - users with invalid passwords/invalid Kerberos ticket gets (correctly) denied service though!

At Samba 4.7.6 we just see a short “hickup” at the 10 hour mark but with all the later versions service grinds to a halt. I’ve tried to read the diffs to see if I can find the change that causes this but so far no luck.

So we decided to try Samba 4.9.3 a little bit more and it has been running well on our test systems (with no or very few concurrent users). Today we tried to use it one one of the production servers and … it failed at 17:00 again (we restart the Samba Damons at 07:00 every day in order to make sure the problem doesn’t occur during prime time (we originally restarted them at 04:00 but that caused the trouble to occur at 2pm - not good…)

This production server had around 400 concurrent users when at the time it failed this time.

In order to get it to start servicing users again I tried to kill the “Winbindd” daemons using “pkill winbindd” and all but one of the daemons died:
 
> root@filur05# ps auxww | egrep winbind
> root      1550    0.0  0.0  271956  127236  -  S    07:12       0:23.66 winbindd: domain child [AD] (winbindd)
> root     29659    0.0  0.0  237240  122108  -  I    10:04       0:00.02 winbindd: domain child [FILUR05] (winbindd)
> root     99774    0.0  0.1  267384  143028  -  Ss   07:00       5:23.92 /liu/sbin/winbindd --daemon --configfile=/liu/etc/samba/smb.conf
> root     99775    0.0  1.8 4913320 4748952  -  S    07:00      68:08.91 winbindd: domain child [AD] (winbindd)
> root     99778    0.0  0.0  197228   89496  -  S    07:00       0:01.07 winbindd: domain child [FILUR05] (winbindd)
> root     99779    0.0  0.2  679988  533092  -  S    07:00      14:16.53 winbindd: domain child [AD] (winbindd)
> root     99780    0.0  0.0   43336   30064  -  I    07:00       0:04.61 winbindd: idmap child (winbindd)
> root     99781    0.0  0.0  155968   36980  -  I    07:00       0:00.03 winbindd: domain child [BUILTIN] (winbindd)
> root       678    0.0  0.0    6740    2716  1  S+   17:01       0:00.00 egrep winbind
> root@filur05# pkill winbindd
> root@filur05# ps auxww | egrep winbind
> root     99780    0.0  0.0  43336  30064  -  S    07:00       0:04.61 winbindd: idmap child (winbindd)
> root       730    0.0  0.0   6740   2716  1  S+   17:02       0:00.00 egrep winbind
> ...
> root@filur05# !ps
> ps auxww | egrep winbind
> root     99780    0.0  0.0  43336  30064  -  S    07:00       0:04.61 winbindd: idmap child (winbindd)
> root       738    0.0  0.0   6740   2716  1  S+   17:02       0:00.00 egrep winbind
> root@filur05# kill -9 99780
> root@filur05# /liu/sbin/winbindd --daemon
> root@filur05# ps auxww | egrep winbind
> root       745    1.9  0.0 220752 113184  -  Ss   17:02       0:05.14 /liu/sbin/winbindd --daemon
> root       746    0.0  0.1 255160 175056  -  S    17:02       0:02.41 winbindd: domain child [AD] (winbindd)
> root       748    0.0  0.0 161024  61276  -  S    17:02       0:00.17 winbindd: idmap child (winbindd)
> root       750    0.0  0.0 198032 101088  -  S    17:02       0:00.01 winbindd: domain child [FILUR05] (winbindd)
> root       813    0.0  0.0 200912  75004  -  S    17:03       0:00.18 winbindd: domain child [AD] (winbindd)
> root      1530    0.0  0.0   6740   2716  1  S+   17:07       0:00.00 egrep winbind


And then it started servicing users again.

On one of our test servers I tried (at the same time) to run 55 concurrent smbclient connections to see if that would cause the same failure, but it worked just fine.
A “pkill winbindd” on that server killed all the Winbindd processes - no dangling “idmap winbindd"

So right now my main suspect is that something goes wrong in the “winbindd: idmap child (winbindd)” subprocess… Unfortunately I didn’t get a detailed stack trace of what it was doing at the time but I’ll try to get one tomorrow.

Any ideas?


Files open at the time of non-responsiveness:

> 17:01/fstat.log:root     winbindd   99775 text /liu      38725 -rwxr-xr-x  4711000  r
> 17:01/fstat.log:root     winbindd   99775   wd /             4 drwxr-xr-x      29  r
> 17:01/fstat.log:root     winbindd   99775 root /             4 drwxr-xr-x      29  r
> 17:01/fstat.log:root     winbindd   99775    0 /dev         26 crw-rw-rw-    null rw
> 17:01/fstat.log:root     winbindd   99775    1 /dev         26 crw-rw-rw-    null rw
> 17:01/fstat.log:root     winbindd   99775    2 /liu/var  31071 -rw-r--r--  3583449  w
> 17:01/fstat.log:root     winbindd   99775    3 /dev          8 crw-r--r--  random  r
> 17:01/fstat.log:root     winbindd   99775    4 /liu/var   7929 -rw-r--r--      21 rw
> 17:01/fstat.log:root     winbindd   99775    5* pipe fffff813a2cf4be0 <-> fffff813a2cf4d48      0 rw
> 17:01/fstat.log:root     winbindd   99775    6 /dev          8 crw-r--r--  random  r
> 17:01/fstat.log:root     winbindd   99775    7* local dgram fffff801e9e884b0
> 17:01/fstat.log:root     winbindd   99775    8 /liu/var     18 -rw-r-----   12288 rw
> 17:01/fstat.log:root     winbindd   99775    9 /liu       1117 -rw-------  561152 rw
> 17:01/fstat.log:root     winbindd   99775   10 /liu       1122 -rw-------   28672 rw
> 17:01/fstat.log:root     winbindd   99775   11 /dev         26 crw-rw-rw-    null rw
> 17:01/fstat.log:root     winbindd   99775   12 /dev         26 crw-rw-rw-    null rw
> 17:01/fstat.log:root     winbindd   99775   13 /liu/var   7924 -rw-r--r--       6  w
> 17:01/fstat.log:root     winbindd   99775   14* pipe fffff801cc3d6000 <-> fffff801cc3d6168      0 rw
> 17:01/fstat.log:root     winbindd   99775   15 /liu/var  31071 -rw-r--r--  3583449  w
> 17:01/fstat.log:root     winbindd   99775   16* pipe fffff813a2cf4d48 <-> fffff813a2cf4be0      0 rw
> 17:01/fstat.log:root     winbindd   99775   17 /liu/var   7930 -rw-r--r--  101470208 rw
> 17:01/fstat.log:root     winbindd   99775   18 /liu/var     39 -rw-------  4227072 rw
> 17:01/fstat.log:root     winbindd   99775   19* local dgram fffff801e9e5da50 <-> fffff801e9e65b40
> 17:01/fstat.log:root     winbindd   99775   20 /liu/var   7926 -rw-------  69033984 rw
> 17:01/fstat.log:root     winbindd   99775   21* local stream fffff801e9e4a0f0
> 17:01/fstat.log:root     winbindd   99775   22* local stream fffff8020f01fb40
> 17:01/fstat.log:root     winbindd   99775   23* local stream fffff811b72d5870 <-> fffff801e9e53870
> 17:01/fstat.log:root     winbindd   99775   24 /liu/var     20 -rw-r--r--  16228352 rw
> 17:01/fstat.log:root     winbindd   99775   25* internet stream tcp fffff812c8690820
> 17:01/fstat.log:root     winbindd   99775   26* internet stream tcp fffff8020f525410
> 17:01/fstat.log:root     winbindd   99775   27 /liu/var     34 -rw-------   20480 rw
> 17:01/fstat.log:root     winbindd   99775   28* internet stream tcp fffff81250627410
> 17:01/fstat.log:root     winbindd   99775   29* internet stream tcp fffff814f4815000

Network sockets:
> 17:01/netstat.log:fffff801e9e884b0 dgram       0      0 fffff812fd6fe3b0                0                0                0 /liu/etc/samba/private/msg.sock/99775

Process kernel stacks for the process:
> 17:01/procstat-kk-a.log:99775 102090 winbindd            -                   pmap_remove_ptes+0x101 pmap_remove+0x35b vm_map_delete+0x14e kern_munmap+0x8d amd64_syscall+0xa38 fast_syscall_common+0x101 

Ps auxwww output:
> 17:01/ps-auxwww.log:root     99775  100.0  1.8 4888744 4702144  -  R    07:00      67:53.99 winbindd: domain child [AD] (winbindd)


(At 17:02 I “kill -9” killed it)


Setup:
FreeBSD 11.2-RELEASE-p5
Samba 4.9.3 / 4.7.6 (with a local patch to allow larger socket listening queues)
(ZFS storage)

AD-joined with Microsoft Windows AD servers. Some 100k users and groups...

smb.conf:

> [global]
> private directory = /liu/etc/samba/private
> lock directory    = /liu/var/samba/locks
> cache directory   = /liu/var/samba/cache
> state directory   = /liu/var/samba/state
> ncalrpc dir       = /liu/var/samba/ncalrpc
> 
> ;; Network interfaces
> bind interfaces only = true
> interfaces = lagg0
> 
> ;; Server names
> server string = Filur05 File Server
> netbios name = FILUR05
> 
> ;; Security type
> security = ADS
> realm = AD.LIU.SE
> workgroup = AD
> 
> ;; ID Mappings
> idmap config * : backend = tdb
> idmap config * : range = 2000000001-2100000000
> idmap config AD : backend = ad
> idmap config AD : range = 1-2000000000
> idmap config AD : schema_mode = rfc2307
> idmap config AD : unix_primary_group = yes
> 
> 
> ;; Enable shared robust mutexes
> dbwrap_tdb_mutexes:* = yes
> 
> ;; VFS objects to enable
> vfs objects = shadow_copy2 zfsacl full_audit
> full_audit:prefix = %u|%I
> full_audit:success = none
> full_audit:failure = none
> full_audit:facility = LOCAL7
> full_audit:priority = notice
> 
> ;; Snapshots/Previous Versions
> shadow:snapdir = .zfs/snapshot
> shadow:format = auto-%Y-%m-%d.%H:%M:%S
> shadow:sort = desc
> shadow:localtime = yes
> shadow:snapdirseverywhere = yes
> 
> ;; Handle ZFS ACLs
> zfsacl:acesort = dontcare
> 
> nfs4:mode = special
> nfs4:acedup = merge
> 
> kerberos method = system keytab
> kernel change notify = false
> socket options = TCP_NODELAY SO_RCVBUF=65536 SO_SNDBUF=65536
> client ldap sasl wrapping = seal
> socket listen backlog = 1024
> invalid users = root
> min protocol = SMB2
> smb encrypt = auto
> time server = true
> hostname lookups = false
> follow symlinks = true
> wide links = false
> unix extensions = false
> winbind nested groups = false
> winbind enum users = false
> winbind enum groups = false
> winbind use default domain = yes
> winbind normalize names = yes
> winbind max clients = 1000
> winbind max domain connections = 10
> winbind nss info = template
> template homedir = /home/%U
> template shell = /bin/sh
> nt acl support = yes
> inherit acls = yes
> inherit permissions = yes
> store dos attributes = yes
> ea support = false
> map hidden = false
> map system = false
> map archive = false
> map read only = permissions
> case sensitive = auto
> logging = syslog:1 file:2
> log level = 1 auth:2 winbind:2 acls:0
> load printers = false
> printing = bsd
> printcap name = /dev/null
> disable spoolss = true
> veto files = /.zfs/
> hide files = /~$*/
> delete veto files = yes

> [homes]
> browseable = false
> printable = false
> public = false
> writeable = true

> [students]
> copy = homes
> comment = Student Home Directories
> path = /export/students
> inherit owner = yes
> 
> [staff]
> copy = homes
> comment = Staff Home Directories
> path = /export/staff
> inherit owner = yes


- Peter



-- 
To unsubscribe from this list go to the following URL and read the
instructions:  https://lists.samba.org/mailman/options/samba