Re: [Samba] gencache.tdb size and cache flush




Hi!

Technical description below, but the exec summary is: Yes, we have a
performance problem with gencache.

On Wed, Aug 29, 2018 at 10:28:05AM +0200, Francesco Malvezzi via samba wrote:
> Hi all,
> 
> I have a midsize AD domain with some 50k users but only 100 workstations
> joined.
> 
> Sometimes I find server CPU throttling at 100%. In order to let it drop

Can you find out where *exactly* that 100% is spent? gstack on the
spinning process with debug symbols would be very helpful here.

> and have smooth performance I delete cache:
> 
> systemctl stop samba
> net cache flush
> systemctl start samba
> 
> First of all, is it needed a samba stop to flush the cache?

No.

> Even if cache flush does the job to restore performance, I am clueless
> about the root cause of the problem. Before flushing cache the
> gencache.tdb had 15k entries. Is it large? Do you think is it worth time
> to investigate why it grows so much or is it just normal?

15k entries is not really silly large. I've seen much larger ones.
What kind of OS do you have? The question is -- does it have the
ability to use robust mutexes? (FreeBSD 11 and recent Linux).

The other thing is -- we don't have code to do cache pruning at this
point. gencache used to be simple for just a few types of entries.
It is a very important performance improvement for many workloads, but
as it grew more and more types of entries (which IMHO is a good
thing), we need to trim the expired entries.

However, traversing the whole gencache periodically is expensive too.
We don't have good, low-cost, background-style traversal
routines, which tdb files need.

This digresses into a technical discussion: I believe we need to
expose something like a "quickly traverse all records for one hash
chain", holding the hash chain lock just over that full traverse. This
would allow the cheap, background style gencache pruning.
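
To make the idea concrete, here is a minimal sketch of per-chain
pruning. It is an assumption-laden toy, not the tdb or gencache API:
ToyCache, prune_chain and prune_in_background are hypothetical names,
the store lives in memory, and a threading.Lock stands in for the
per-chain lock. The point is only that the pruner holds one chain lock
at a time, so other users never wait for a full traverse.

```python
import threading
import time
import zlib


class ToyCache:
    """In-memory stand-in for a hashed key/value store; NOT the tdb API."""

    def __init__(self, num_chains=8):
        # One bucket ("hash chain") and one lock per chain.
        self.chains = [dict() for _ in range(num_chains)]
        self.locks = [threading.Lock() for _ in range(num_chains)]

    def _chain_of(self, key):
        # Deterministic hash so a key always lands in the same chain.
        return zlib.crc32(key.encode()) % len(self.chains)

    def set(self, key, value, expires_at):
        i = self._chain_of(key)
        with self.locks[i]:
            self.chains[i][key] = (expires_at, value)

    def get(self, key, now):
        i = self._chain_of(key)
        with self.locks[i]:
            entry = self.chains[i].get(key)
        if entry is None or entry[0] <= now:
            return None  # missing or expired
        return entry[1]

    def prune_chain(self, i, now):
        """Drop expired records from chain i, holding only that chain's lock."""
        with self.locks[i]:
            chain = self.chains[i]
            expired = [k for k, (exp, _) in chain.items() if exp <= now]
            for k in expired:
                del chain[k]
        return len(expired)


def prune_in_background(cache, now, pause=0.0):
    """Visit the chains one by one, releasing each lock before the next."""
    removed = 0
    for i in range(len(cache.chains)):
        removed += cache.prune_chain(i, now)
        time.sleep(pause)  # give other lock users a chance between chains
    return removed
```

A real implementation would have to do the same walk over the on-disk
hash chains inside tdb; the sketch only shows the locking granularity.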

Also, gencache has another problem: It's the stabilize calls that
happen frequently. gencache holds a few entries that *need* to survive
a reboot of a box. Mainly the saf_join cache entries come to mind. We
need to separate those out into a persistent gencache.

For the rest -- it's not vital for the system to keep them around,
however performance would significantly suffer if we for example lost
all idmap cache entries. This means we need a middle ground between
CLEAR_IF_FIRST and persistent tdb files. What we cannot do is a full
tdb_check upon every daemon startup. We had that, and this sent
winbind into minutes of just reading a corrupt 4GB tdb file. We need
to make live tdb 100% robust against accidental (or even malicious)
corruption.

3 aspects here:

* The recent hardening patches in tdb make me confident we improved here
  a lot.

* We need bullet-proof circular chain detection. We need to put the same
  logic that tdb_check has into the normal flow of tdb_find().

* We need record CRC checks. The easiest place would be in gencache
  "user space", but we could benefit at the tdb level too.
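
As a rough illustration of the last two points, the sketch below shows
a generic way to detect a circular chain and a generic way to checksum
a record value. Both are assumptions for illustration only: the cycle
check is the textbook tortoise-and-hare walk, not necessarily the logic
tdb_check uses, and chain_has_cycle, pack_value and unpack_value are
hypothetical helpers, not tdb or gencache functions. The chain is
modelled as a dict mapping a record offset to the next offset, with 0
ending the chain.

```python
import struct
import zlib


def chain_has_cycle(next_offset, head):
    """Return True if following next pointers from head never reaches 0.

    Uses two cursors: 'fast' moves two records per step, 'slow' one.
    If the chain loops, fast eventually lands on slow again.
    """
    slow = fast = head
    while fast != 0:
        fast = next_offset.get(fast, 0)
        if fast == 0:
            return False
        fast = next_offset.get(fast, 0)
        slow = next_offset.get(slow, 0)
        if fast != 0 and fast == slow:
            return True
    return False


def pack_value(payload: bytes) -> bytes:
    """Prefix the payload with its CRC32 (little-endian, 4 bytes)."""
    return struct.pack("<I", zlib.crc32(payload)) + payload


def unpack_value(blob: bytes) -> bytes:
    """Verify the CRC32 prefix and return the payload, or raise ValueError."""
    if len(blob) < 4:
        raise ValueError("record too short")
    (stored,) = struct.unpack("<I", blob[:4])
    payload = blob[4:]
    if zlib.crc32(payload) != stored:
        raise ValueError("record checksum mismatch")
    return payload
```

A cache can treat a failed unpack_value() as a plain cache miss, which
is why doing the check in gencache "user space" is the easy variant:
a corrupt entry just gets refetched.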

Volker

-- 
SerNet GmbH, Bahnhofsallee 1b, 37081 Göttingen
phone: +49-551-370000-0, fax: +49-551-370000-9
AG Göttingen, HRB 2816, GF: Dr. Johannes Loxen
http://www.sernet.de, mailto:kontakt@xxxxxxxxx

Meet us at Storage Developer Conference (SDC)
Santa Clara, CA USA, September 24th-27th 2018
