Web lists-archives.com

Re: [PATCH v2 3/5] gc --auto: exclude base pack if not enough mem to "repack -ad"

On Wed, Mar 7, 2018 at 2:19 AM, Junio C Hamano <gitster@xxxxxxxxx> wrote:
>> +--keep-base-pack::
>> +     All packs except the base pack are consolidated into a single
>> +     pack. The largest pack is considered the base pack.
> This makes it sound as if packs with .keep are also repacked unless
> they meet the threshold for "base pack".  Is that what you actually
> implemented?

Copy/paste problem. That is, I copied this from --auto description,
but I missed the "(except those marked with a `.keep` file)" part. No,
I don't think ignoring .keep files is a good idea, at least no by

> In order to do so, [2/5] needs to allow the "--keep-pack" option
> override the on-disk .keep files.  In an earlier message, I wondered
> if such an arrangement is useful in some use cases; I think it is,
> and because those who do want to see the on-disk .keep files honored
> can collect and include them in the set of packs to be kept via
> "--keep-pack" (after all this is an option for low-level scripting
> anyway).

At gc level I don't think we need to allow this. But yeah git-repack
has --pack-kept-objects to ignore .keep. If they specify this, then
repack should ignore .keep (but still follow whatever --keep-pack is
specified). There's some interesting interaction between .keep and
pack bitmap feature in pack-objects though. I'm not so sure what
happens down there yet.

>> +Set environment variable `GIT_TRACE` in order to see the memory usage
>> +estimation in `git gc --auto` that determines whether the base pack is
>> +kept.
> This is somewhat a puzzling use of trace.  It sounds more like a way
> to find out "how" memory usage estimation is done and arriving at a
> wrong value for those who want to debug the feature.

Yeah. I'm not sure if this estimation could be really problematic that
people need to debug this way. A cleaner way (if we think people will
need this often) is just add a new option in "git gc" to report this
estimation breakdown and do nothing else.

>> +static struct packed_git *find_the_base_pack(void)
>> +{
>> +     struct packed_git *p, *base = NULL;
>> +
>> +     prepare_packed_git();
>> +
>> +     for (p = packed_git; p; p = p->next) {
>> +             if (p->pack_local &&
>> +                 (!base || base->pack_size < p->pack_size))
>> +                     base = p;
>> +     }
>> +
>> +     return base;
>> +}
> This is finding the largest pack.

The discussion on .keep files raises one question for me, what if the
largest pack already has a .keep file, do we still consider it the
base pack, or should we find the next largest non-kept pack?

I'm guessing we keep things simple here and ignore .keep files.

>> +#ifdef HAVE_SYSINFO
>> +     struct sysinfo si;
>> +
>> +     if (!sysinfo(&si))
>> +             return si.totalram;
>> +#elif defined(HAVE_BSD_SYSCTL) && defined(HW_MEMSIZE)
>> +     int64_t physical_memory;
>> +     int mib[2];
>> +     size_t length;
>> +
>> +     mib[0] = CTL_HW;
>> +     mib[1] = HW_MEMSIZE;
>> +     length = sizeof(int64_t);
>> +     if (!sysctl(mib, 2, &physical_memory, &length, NULL, 0))
>> +             return physical_memory;
>> +#elif defined(GIT_WINDOWS_NATIVE)
>> +     MEMORYSTATUSEX memInfo;
>> +
>> +     memInfo.dwLength = sizeof(MEMORYSTATUSEX);
>> +     if (GlobalMemoryStatusEx(&memInfo))
>> +             return memInfo;ullTotalPhys;
> Is this legal C in Microsoft land?

That's the problem with writing code without a way to test it. At
least travis helped catch a compiler bug on mac.

I'm torn between just #error here and let Windows/Mac guys implement
it (which they may scream "too much work, I don't wanna") but if I
help write something first, yes things are potentially broken and need
verification from those guys. I guess I'll just fix this up and hope
non-linux guys do the rest.

>> +#else
>> +     fprintf(stderr, _("unrecognized platform, assuming %lu GB RAM\n"),
>> +             default_ram);
>> +#endif
>> +     return default_ram * 1024 * 1024 * 1024;
>> +}
> I wonder if the above should go somewhere under compat/ without
> ifdef but split into separate files for individual archs (I do not
> know the answer to this question).

My first choice too. I chose this way after seeing online_cpus()
thread-utils.c. Not sure which way is best either.

>> @@ -427,8 +557,19 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
>>                        */
>>                       daemonized = !daemonize();
>>               }
>> -     } else
>> -             add_repack_all_option();
>> +     } else {
>> +             struct packed_git *base_pack = find_the_base_pack();
>> +             struct packed_git *exclude = NULL;
>> +
>> +             if (keep_base_pack != -1) {
>> +                     if (keep_base_pack)
>> +                             exclude = base_pack;
> OK, --keep-base-pack option always wins if given...
>> +             } else if (base_pack && big_base_pack_threshold &&
>> +                        base_pack->pack_size >= big_base_pack_threshold)
>> +                     exclude = base_pack;
> ...and then if the largest one is larger than the threshold, it (and
> it alone) is kept, but otherwise nothing is kept automatically.
> But to those who say "packs larger than this value is too big" via
> configuration, keeping only the largest of these above-threshold
> packs would look counter-intuitive, wouldn't it, I wonder?

I think I'll just clarify this in the document. There may be a use
case for keeping multiple large packs, but I don't see it (*). We can
deal with it when it comes.

(*) Well I see one. In submodule setting, we if merge object stores of
all submodules back to the supermodule, we have multiple base packs
and probably want to keep them that way. It's still a long way to get
there (not even sure if submodule people want to get there)

>> diff --git a/t/t6500-gc.sh b/t/t6500-gc.sh
>> index 41b0be575d..863fdbb0fd 100755
>> --- a/t/t6500-gc.sh
>> +++ b/t/t6500-gc.sh
>> @@ -5,6 +5,13 @@ test_description='basic git gc tests
>>  . ./test-lib.sh
>> +test_expect_success 'setup' '
>> +     # do not let the amount of physical memory affects gc
>> +     # behavior, make sure the pack_objects_uses_too_much_memory()
>> +     # always returns false
>> +     git config gc.bigBasePackThreshold 2g
> Hmph, that is because the configuration wins and we know the trash
> repository will never have a pack that large.  OK.

Credit goes to travis linux32 job. I wouldn't notice this otherwise.

A thought came across me when I wrote this though. Should we support
special value "infinite" (or just "max")- in our config code?

The use of super large gc.bigBasePackThreshold to disable this keeping
base pack is intended. But I can't go too high here it may break
limits on 32 bit platforms. And 2g sounds really arbitrary.