Web lists-archives.com

Re: [PATCH v2 3/5] gc --auto: exclude base pack if not enough mem to "repack -ad"

On Tue, Mar 06 2018, Nguyễn Thái Ngọc Duy jotted:

> pack-objects could be a big memory hog especially on large repos,
> everybody knows that. The suggestion to stick a .keep file on the
> giant base pack to avoid this problem is also known for a long time.
> Let's do the suggestion automatically instead of waiting for people to
> come to Git mailing list and get the advice. When a certain condition
> is met, "gc --auto" tells "git repack" to keep the base pack around.
> The end result would be two packs instead of one.
> On linux-2.6.git, valgrind massif reports 1.6GB heap in "pack all"
> case, and 535MB [1] in "pack all except the base pack" case. We save
> roughly 1GB memory by excluding the base pack.
> gc --auto decides to do this based on an estimation of pack-objects
> memory usage, which is quite accurate at least for the heap part, and
> whether that fits in half of system memory (the assumption here is for
> desktop environment where there are many other applications running).
> Since the estimation may be inaccurate and that 1/2 threshold is
> really arbitrary, give the user a finer control over this mechanism:
> if the largest pack is larger than gc.bigBasePackThreshold, it's kept.
> PS. A big chunk of the remaining 535MB is the result of pack-objects
> running rev-list internally. This will be dealt with when we could run
> rev-list externally. Right now we can't because pack-objects internal
> rev-list does more regarding unreachable objects, which cannot be done
> by "git rev-list".
> Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@xxxxxxxxx>
> ---
>  Documentation/config.txt |   7 ++
>  Documentation/git-gc.txt |  13 ++++
>  builtin/gc.c             | 153 +++++++++++++++++++++++++++++++++++++--
>  builtin/pack-objects.c   |   2 +-
>  config.mak.uname         |   1 +
>  git-compat-util.h        |   4 +
>  pack-objects.h           |   2 +
>  t/t6500-gc.sh            |  29 ++++++++
>  8 files changed, 204 insertions(+), 7 deletions(-)
> diff --git a/Documentation/config.txt b/Documentation/config.txt
> index f57e9cf10c..120cf6bac9 100644
> --- a/Documentation/config.txt
> +++ b/Documentation/config.txt
> @@ -1549,6 +1549,13 @@ gc.autoDetach::
>  	Make `git gc --auto` return immediately and run in background
>  	if the system supports it. Default is true.
> +gc.bigBasePackThreshold::
> +	Make `git gc --auto` only enable `--keep-base-pack` when the
> +	base pack's size is larger than this limit (in bytes).
> +	Defaults to zero, which disables this check and lets
> +	`git gc --auto` determine when to enable `--keep-base-pack`
> +	based on memory usage.
> +

I'm really keen to use this (and would be happy to apply a patch on
top), but want to get your thoughts first, see also my just-sent

The thing I'd like to change is that the underlying --keep-pack= takes a
list of paths (good!), but then I think this patch needlessly
complicates things by talking about "base packs" and having the
implementation limitation that we only ever pass one --keep-pack down to
pack-objects (bad!).

Why don't we instead just have a gc.* variable that you can set to some
size of pack that we'll always implicitly *.keep? That way I can
e.g. clone a 5GB pack and set the limit to 2GB, then keep adding new
content per the rules of gc.autoPackLimit, until I finally create a
larger than 2GB pack, at that point I'll have 5GB, 2GB, and some smaller
packs and loose objects.

We already have pack.packSizeLimit, perhaps we could call this
e.g. gc.keepPacksSize=2GB?

Or is there a use-case for still having the concept of a "base" pack? Is
it magic in some way? Maybe I'm missing something but I don't see why,
we can just stop thinking about whether some one pack is larger than X,
and consider all packs larger than X specially.

But if we do maybe an extra gc.keepBasePack=true?

Finally I wonder if there should be something equivalent to
gc.autoPackLimit for this. I.e. with my proposed semantics above it's
possible that we end up growing forever, i.e. I could have 1000 2GB
packs and then 50 very small packs per gc.autoPackLimit.

Maybe we need a gc.keepPackLimit=100 to deal with that, then e.g. if
gc.keepPacksSize=2GB is set and we have 101 >= 2GB packs, we'd pick the
two smallest one and not issue a --keep-pack for those, although then
maybe our memory use would spike past the limit.

I don't know, maybe we can leave that for later, but I'm quite keen to
turn the top-level config variable into something that just considers
size instead of "base" if possible, and it seems we're >95% of the way
to that already with this patch.

Finally, I don't like the way the current implementation conflates a
"size" variable with auto detecting the size from memory, leaving no way
to fallback to the auto-detection if you set it manually.

I think we should split out the auto-memory behavior into another
variable, and also make the currently hardcoded 50% of memory

That way you could e.g. say you'd always like to keep 2GB packs, but if
you happen to have ended up with a 1GB pack and it's time to repack, and
you only have 500MB free memory on that system, it would keep the 1GB
one until such time as we have more memory.

Actually maybe that should be a "if we're that low on memory, forget
about GC for now" config, but urgh, there's a lot of potential
complexity to be handled here...