Re: [PATCH v2] repack: Add option to preserve and prune old pack files

On Sunday, March 12, 2017 11:03:44 AM Junio C Hamano wrote:
> Jeff King <peff@xxxxxxxx> writes:
> > I can think of one downside of a time-based solution,
> > though: if you run multiple gc's during the time
> > period, you may end up using a lot of disk space (one
> > repo's worth per gc). But that's a fundamental tension
> > in the problem space; the whole point is to waste disk
> > to keep helping old processes.
> 
> Yes.  If you want to help a process that mmap's a packfile
> and wants to keep using it for N seconds, no matter how
> many times somebody else ran "git repack" while you are
> doing your work within that timeframe, you somehow need
> to make sure the NFS server knows the file is still in
> use for that N seconds.
> 
> > But you may want a knob that lets you slide between
> > those two things. For instance, if you kept a sliding
> > window of N sets of preserved packs, and ejected from
> > one end of the window (regardless of time), while
> > inserting into the other end. James' existing patch is
> > that same strategy with a hardcoded window of "1".
> 
> Again, yes.  But then the user does not get any guarantee
> of how long-living a process the user can have without
> getting broken by the NFS server losing track of a
> packfile that is still in use.  My suggestion for the
> "expiry" based approach is essentially that I do not see
> a useful tradeoff afforded by having such a knob.
> 
> > The other variable you can manipulate is whether to gc
> > in the first place. E.g., don't gc if there are N
> > preserved sets (or sets consuming more than N bytes, or
> > whatever). You could do that check outside of git
> > entirely (or in an auto-gc hook, if you're using it).
> 
> Yes, "don't gc/repack more than once within N seconds" may
> also be an alternative and may generally be more useful
> by covering general source of wastage coming from doing
> gc too frequently, not necessarily limited to preserved
> pack accumulation.

As someone who helps manage a Gerrit server for several
thousand repos, all on the same NFS disks, I find a time
based expiry impractical, and not something that I am very
interested in having.  I favor the simpler (single for now)
repacking cycle approach, and it is what we have been using
successfully for almost 6 months now, without suffering any
more stale file handle exceptions.

While time is indeed the factor that is going to determine 
whether someone is going to see the stale file handles or 
not, on a server (which is what this feature is aimed at), 
this is secondary in my mind to predictability about space 
utilization.  I have no specific minimum time that I can 
reason about, i.e. I cannot reasonably say "I want all 
operations that last less than 1 hour, 1 min, or 1 second... 
to succeed".  I don't really want ANY failures, and I am 
willing to sacrifice some disk space to prevent as many as 
possible.  So the question to me is "How much disk space am 
I willing to sacrifice?", not "How long do I want operations 
to be able to last?".  The only way that time enters my 
equation is to compare it to how long repacking takes, i.e. 
I want the preserved files cleaned up on the next repack.   
So effectively I am choosing a repacking cycle based 
approach, so that I can reasonably predict the extra disk 
space that I need to reserve for my collection of repos.  
With a single cycle, I am effectively doubling the "static" 
size of repos.  

Achieving this predictability with a time based approach
requires coordination between the expiry setting and the
repacking time cycle.  This coordination is extra effort for
me, with no apparent gain.  It is also an additional risk 
that I don't want to have.  If I decide to bump up how often 
I run repacking, and I forget to reduce the expiry time, my 
disk utilization will grow and potentially cause serious 
issues for all my repositories (since they share the same 
volume).  This problem is even more difficult if I decide to 
use a usage (instead of time) based algorithm to determine 
when I repack.
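
As a rough illustration of the coordination problem described above, here is a minimal sketch of the worst-case space arithmetic under each policy. It assumes that every repack preserves roughly one full copy of the repo ("one repo's worth per gc", as Jeff put it). The function names and the numbers in the usage note are hypothetical and are not taken from the patch.

```python
import math


def cycle_based_peak(static_size, cycles=1):
    """Worst-case disk use when preserved packs survive a fixed number of repack cycles.

    Assumes each preserved set is about as large as the repo itself.
    """
    return static_size * (1 + cycles)


def time_based_peak(static_size, expiry_seconds, repack_interval_seconds):
    """Worst-case disk use when preserved packs survive until an expiry time.

    The number of preserved sets alive at once depends on how many
    repacks fit inside one expiry window.
    """
    preserved_sets = math.ceil(expiry_seconds / repack_interval_seconds)
    return static_size * (1 + preserved_sets)
```

With a 100 GB repo set, cycle_based_peak(100) stays at 200 however often repacking runs.  With a one day expiry, time_based_peak(100, 86400, 86400) is also 200, but moving to a six hour repacking interval without touching the expiry gives time_based_peak(100, 86400, 21600), which is 500.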

Admittedly, a repacking cycle based approach happens to be 
very easy and practical when it is a "single" cycle.  If I
eventually determine empirically that a single cycle is not
long enough for my server, I don't know what I will do.
Perhaps I would then want a switch that preserves the repos
for another cycle?  Maybe it could work the way that log
rotation works, adding a number to the end of each file name
for each preserved cycle?  This option seems preferable to
me over a time based approach because it makes the impact
on disk utilization more obvious.  However, so
far in practice, this does not seem necessary.
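
The log rotation idea above could look something like the following sketch. Everything here is an assumption made for illustration: the "<file name>.<generation>" naming, the preserved directory layout, and both function names are hypothetical and do not describe what the patch or git actually does.

```python
from pathlib import Path


def rotate_preserved(preserved_dir: Path, keep_cycles: int) -> None:
    """Age every preserved file by one cycle and drop those past keep_cycles.

    Hypothetical naming: "<original name>.<generation>", where
    generation 1 is the most recently preserved set.
    """
    entries = []
    for path in preserved_dir.iterdir():
        stem, sep, gen = path.name.rpartition(".")
        if sep and gen.isdigit():
            entries.append((int(gen), stem, path))
    # Handle the oldest generation first so that a rename never lands
    # on a name that is still in use.
    for gen, stem, path in sorted(entries, key=lambda e: e[0], reverse=True):
        if gen >= keep_cycles:
            path.unlink()
        else:
            path.rename(preserved_dir / f"{stem}.{gen + 1}")


def preserve(pack_file: Path, preserved_dir: Path) -> Path:
    """Move a file that repack would otherwise delete into generation 1."""
    preserved_dir.mkdir(parents=True, exist_ok=True)
    target = preserved_dir / f"{pack_file.name}.1"
    pack_file.rename(target)
    return target
```

Each repack would call rotate_preserved() once and then preserve() for every file it is about to remove.  With keep_cycles set to 1 this reduces to the single cycle behavior, and the extra disk space stays bounded at roughly keep_cycles times the "static" size of the repo.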

I don't really see a good use case for a time based expiry 
(other than "this is how it was done for other things in 
git").  Of course, that doesn't mean such a use case doesn't 
exist, but I don't support adding a feature unless I really 
understand why and how someone would want to use it.  I 
think that a time based expiry should only be added if 
someone has a specific use case they expect to achieve with 
it, and they actually plan to use it that way, not just for 
uniformity. 

One might even eventually decide that some of the other 
current use cases for time based expiries should be 
converted to cycle based expiries; I suspect server admins 
will have fewer surprises that way.

-Martin

-- 
The Qualcomm Innovation Center, Inc. is a member of Code 
Aurora Forum, hosted by The Linux Foundation