
Re: [Samba] samba getting stuck, highwatermark replication issue?




On 10/12/2017 3:17 AM, mj wrote:
Hi all, James,

After following James' suggestions and fixing several dbcheck errors, and having observed things for a few days, I'd like to update this issue, and hope for some new input again. :-)

Summary: three DCs, all three running Version 4.5.10-SerNet-Debian-16.wheezy, samba-tool dbcheck --cross-ncs reports no errors, except for two (supposedly innocent) dangling forward links that I'm ignoring for now. Time is synced. Very basic smb.conf, posted earlier, can post again if needed.

samba-tool ldapcmp dcX dcY --filter=whenChanged shows that they are in sync, and also samba-tool drs showrepl shows that replication seems to be stable.

The "getting stuck" from the subject line has not occurred for a few days; perhaps the dbcheck fixes have solved that, or perhaps we've just been lucky.

All in all this appears pretty healthy, but there is a remaining problem:

At ANY given time, ONE RANDOM single DC shows high cpu usage on one samba process. And on that DC (can be any of the three DCs) the logs fill up with this:

[2017/10/12 08:38:57.956586,  3] ../source4/smbd/service_stream.c:66(stream_terminate_connection)
  Terminating connection - 'ldapsrv_accept_tls_loop: tstream_tls_accept_recv() - 104:Connection reset by peer'
[2017/10/12 08:38:57.956638,  3] ../source4/smbd/process_single.c:114(single_terminate)
  single_terminate: reason[ldapsrv_accept_tls_loop: tstream_tls_accept_recv() - 104:Connection reset by peer]
[2017/10/12 08:38:57.956823,  3] ../source4/smbd/service_stream.c:66(stream_terminate_connection)
  Terminating connection - 'ldapsrv_accept_tls_loop: tstream_tls_accept_recv() - 104:Connection reset by peer'
[2017/10/12 08:38:57.956869,  3] ../source4/smbd/process_single.c:114(single_terminate)
  single_terminate: reason[ldapsrv_accept_tls_loop: tstream_tls_accept_recv() - 104:Connection reset by peer]
[2017/10/12 08:38:57.956990,  3] ../source4/auth/ntlm/auth.c:271(auth_check_password_send)
  auth_check_password_send: Checking password for unmapped user []\[]@[(null)]
  auth_check_password_send: mapped user is: []\[]@[(null)]
[2017/10/12 08:38:57.958675,  3] ../source4/smbd/service_stream.c:66(stream_terminate_connection)
  Terminating connection - 'ldapsrv_call_loop: tstream_read_pdu_blob_recv() - NT_STATUS_CONNECTION_RESET'
[2017/10/12 08:38:57.958728,  3] ../source4/smbd/process_single.c:114(single_terminate)
  single_terminate: reason[ldapsrv_call_loop: tstream_read_pdu_blob_recv() - NT_STATUS_CONNECTION_RESET]
[2017/10/12 08:38:57.958948,  3] ../source4/smbd/service_stream.c:66(stream_terminate_connection)
  Terminating connection - 'ldapsrv_call_loop: tstream_read_pdu_blob_recv() - NT_STATUS_CONNECTION_RESET'
[2017/10/12 08:38:57.958994,  3] ../source4/smbd/process_single.c:114(single_terminate)
  single_terminate: reason[ldapsrv_call_loop: tstream_read_pdu_blob_recv() - NT_STATUS_CONNECTION_RESET]
[2017/10/12 08:38:57.969111,  0] ../source4/rpc_server/drsuapi/getncchanges.c:1961(dcesrv_drsuapi_DsGetNCChanges)
  ../source4/rpc_server/drsuapi/getncchanges.c:1961: DsGetNCChanges 2nd replication on DN DC=samba,DC=company,DC=com older highwatermark (last_dn CN=Schema Admins,CN=Users,DC=samba,DC=company,DC=com)
[2017/10/12 08:38:57.969762,  2] ../source4/rpc_server/drsuapi/getncchanges.c:1483(getncchanges_collect_objects)
  ../source4/rpc_server/drsuapi/getncchanges.c:1483: getncchanges on DC=samba,DC=company,DC=com using filter (uSNChanged>=1)
[2017/10/12 08:38:58.378265,  0] ../source4/rpc_server/drsuapi/getncchanges.c:1961(dcesrv_drsuapi_DsGetNCChanges)
  ../source4/rpc_server/drsuapi/getncchanges.c:1961: DsGetNCChanges 2nd replication on DN DC=samba,DC=company,DC=com older highwatermark (last_dn CN=Schema Admins,CN=Users,DC=samba,DC=company,DC=com)
[2017/10/12 08:38:58.379160,  2] ../source4/rpc_server/drsuapi/getncchanges.c:1483(getncchanges_collect_objects)
  ../source4/rpc_server/drsuapi/getncchanges.c:1483: getncchanges on DC=samba,DC=company,DC=com using filter (uSNChanged>=1)
[2017/10/12 08:38:58.810202,  0] ../source4/rpc_server/drsuapi/getncchanges.c:1961(dcesrv_drsuapi_DsGetNCChanges)
  ../source4/rpc_server/drsuapi/getncchanges.c:1961: DsGetNCChanges 2nd replication on DN DC=samba,DC=company,DC=com older highwatermark (last_dn CN=Schema Admins,CN=Users,DC=samba,DC=company,DC=com)
[2017/10/12 08:38:58.810868,  2] ../source4/rpc_server/drsuapi/getncchanges.c:1483(getncchanges_collect_objects)
  ../source4/rpc_server/drsuapi/getncchanges.c:1483: getncchanges on DC=samba,DC=company,DC=com using filter (uSNChanged>=1)
[2017/10/12 08:38:59.251863,  0] ../source4/rpc_server/drsuapi/getncchanges.c:1961(dcesrv_drsuapi_DsGetNCChanges)
  ../source4/rpc_server/drsuapi/getncchanges.c:1961: DsGetNCChanges 2nd replication on DN DC=samba,DC=company,DC=com older highwatermark (last_dn CN=Schema Admins,CN=Users,DC=samba,DC=company,DC=com)
[2017/10/12 08:38:59.252418,  2] ../source4/rpc_server/drsuapi/getncchanges.c:1483(getncchanges_collect_objects)
  ../source4/rpc_server/drsuapi/getncchanges.c:1483: getncchanges on DC=samba,DC=company,DC=com using filter (uSNChanged>=1)
[2017/10/12 08:38:59.692247,  0] ../source4/rpc_server/drsuapi/getncchanges.c:1961(dcesrv_drsuapi_DsGetNCChanges)
  ../source4/rpc_server/drsuapi/getncchanges.c:1961: DsGetNCChanges 2nd replication on DN DC=samba,DC=company,DC=com older highwatermark (last_dn CN=Schema Admins,CN=Users,DC=samba,DC=company,DC=com)

I've seen "last_dn" be various things: system groups like the one above, but also regular users, computers, and groups that we created. We have even had (very few) cases where it was:

./log.samba.3.gz: ../source4/rpc_server/drsuapi/getncchanges.c:1961: DsGetNCChanges 2nd replication on DN DC=samba,DC=company,DC=com older highwatermark (last_dn DC=samba,DC=company,DC=com)
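
To see which "last_dn" values turn up and how often, the log could be tallied with a small script. This is only a rough sketch: the regular expression is an assumption derived from the message text in the excerpts above (not from the Samba source), and tally_last_dn is a made-up helper name.

```python
import re
from collections import Counter

# Assumption: the message wording matches the log excerpts quoted above.
# A DN that itself contains ")" would be cut short by this pattern.
HWM_RE = re.compile(
    r"DsGetNCChanges 2nd replication on DN (?P<nc>.+?) "
    r"older highwatermark \(last_dn (?P<last_dn>.+?)\)"
)


def tally_last_dn(lines):
    """Count how often each last_dn appears in 'older highwatermark' messages.

    `lines` is any iterable of log lines (an open file, a list of strings).
    finditer() is used so that several messages on one line are all counted.
    """
    counts = Counter()
    for line in lines:
        for match in HWM_RE.finditer(line):
            counts[match.group("last_dn")] += 1
    return counts
```

Passing an open log file (for example open("log.samba")) returns a Counter, and its most_common() method lists the DNs that appear most often first.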

Can anyone explain what is happening here, or help me understand this?

I have read that highwatermark errors are not necessarily bad, but they cause continuous high CPU usage on a DC (80 to 90%) until the point where this behaviour "transfers" to the next DC. That makes me feel that in this case it is not normal, and that it indicates some kind of problem.

Thanks for input!

MJ

MJ,

    A dev or someone else may be able to assist, but your DCs aren't replicating correctly with each other.  Those dangling links should have been purged by now if they refer to a DC removed several years ago.

Did you do a full replication from a known-good DC to the other two? This doesn't always fix the issue, but it is a good start. You didn't by chance restore a DC from backup recently, or have one that was offline and recently powered on?

The highwatermark value tells the source DC which objects the destination DC is requesting updates for. The high CPU usage seems to be due to the DC doing a full partition replication. The fact that this issue can happen on all three DCs makes it even tougher to help. I would normally advise just demoting the affected DC and joining it again.
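
To illustrate why an old highwatermark is expensive, here is a toy model. It is a sketch only, not how Samba implements DsGetNCChanges: the highwatermark is simplified to a single integer USN, and the Obj class, the changes_since function and the sample USNs are all invented for the example.

```python
from dataclasses import dataclass


@dataclass
class Obj:
    dn: str
    usn_changed: int  # stands in for the object's uSNChanged attribute


def changes_since(objects, highwatermark):
    """Return the objects a source DC would send in this simplified model:
    everything changed after the USN the destination says it already has."""
    return [o for o in objects if o.usn_changed > highwatermark]


# Hypothetical partition contents; the USN values are made up.
partition = [
    Obj("CN=Schema Admins,CN=Users,DC=samba,DC=company,DC=com", 5),
    Obj("CN=someuser,CN=Users,DC=samba,DC=company,DC=com", 120),
    Obj("CN=somepc,CN=Computers,DC=samba,DC=company,DC=com", 3400),
]

# Destination is nearly up to date: only the newest change is sent.
recent = changes_since(partition, highwatermark=3000)

# Destination presents a highwatermark of 0: every object qualifies, which
# corresponds to the "using filter (uSNChanged>=1)" lines in the log.
full = changes_since(partition, highwatermark=0)
```

In this model, a destination that keeps coming back with a highwatermark at or near zero makes the source walk the whole partition on every request, which would be consistent with the sustained CPU load described above.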


--
James

