Re: Bug? Error during commit
- Date: Sat, 10 Feb 2018 07:16:07 -0500
- From: Jeff King <peff@xxxxxxxx>
- Subject: Re: Bug? Error during commit
On Sat, Feb 10, 2018 at 05:21:05PM +0700, Duy Nguyen wrote:
> > But it doesn't seem to work:
> > $ yes | head -c $((1024*1024*1024 - 10*1024*1024)) >file
> > $ git add file
> > $ git commit -m one
> > $ yes | head -c $((1024*1024*1024)) >file
> > $ git commit -am two
> > fatal: unable to generate diffstat for file
> > What's weird is that if I run "git show --stat" on the same commit, it
> > works! So there's something about how commit invokes the diff that
> > doesn't let the big-file check kick in.
> I have a flashback about checking big_file_threshold in this code. I
> vaguely remember doing something in this code but not sure if it was
> I finally found 6bf3b81348 (diff --stat: mark any file larger than
> core.bigfilethreshold binary - 2014-08-16) so it's merged after all,
> but this commit is about --stat apparently ;-)
The confounding factor here is actually the break-detection that
"commit" does. After running the "commit" above (which does succeed
despite the "fatal", since that happens only as part of the diff summary
we print), you can replicate the problem with "git show -B --stat".
The break-detection code unconditionally loads the file data. Then when
the stat code later checks whether it's binary, it just uses the data
as-is. So that leads me to a few questions / lines of thought:
1. When we're checking binary-ness, we shouldn't assume if the data
is loaded that the size is sane. We should check it against
2. Should break detection (and probably inexact rename detection) skip
files that are over big_file_threshold?
I think our code is capable of actually performing these operations
on large files (at least on systems with 64-bit ulong; I'd be
willing to bet you get funny results due to integer overflow on
32-bit systems or on 64-bit Windows). But I'm not sure it's doing
anybody any good. And it's way faster not to. For example, here's
"git show" before and after the patch below (running on the repo
from my example):
$ time git show --oneline -B --stat | cat
fatal: unable to generate diffstat for file
$ time git.compile show --oneline -B --stat | cat
file | Bin 1063256064 -> 1073741824 bytes
1 file changed, 0 insertions(+), 0 deletions(-)
We can produce the useful answer in a fraction of the time, since
we don't even need to load the blob content. The downside is that
in theory we could break-detect these, and then find renames. So I
guess it comes down to the philosophy of core.bigfilethreshold: how
much effort are we willing to put into such files on the off chance
that they produce a useful output?
If we go this route, then in theory the fix in (1) becomes
redundant, though I'd probably do both just as a
3. To what degree should we override the user's config to treat these
files as binary. If I set core.bigfilethreshold to "10G", or if I
use gitattributes to mark the file as non-binary, then we're still
going to feed it to xdiff (which is going to choke and die).
Should we override these when we approach MAX_DIFF_SIZE, and just
treat the file as binary? Should we barf and tell the user to fix
I guess I argued for overriding attributes earlier, and that
probably ought to be independent of core.bigfilethreshold. Perhaps
we should issue a warning in that case, to let the user know their
config is nonsense.
> > I think we probably ought to consider anything over big_file_threshold
> > to be binary, no matter what. Possibly even if the user gave us a
> > .gitattribute that says "no really, this is text". Because that 1GB
> > limit is a hard limit that the code can't cope with; our options are
> > either to generate a binary diff or to die.
> Agreed. While at there perhaps we need to improve this "unable to
> generate diffstat" message a bit too.
One reason the message is so vague is that we don't actually have any
kind of error code. Though really the only reason for xdiff to fail is
due to file size. So we could perhaps offer some advice there. But if we
do all the things I suggested above, then we'd automatically handle all
the cases we know about.
so my hope was that we would make it impossible to trigger this message.
In which case it maybe ought to be a BUG(). :)
Here's the patch to make "show -B --stat" work. I'll give some more
thought to whether this is a good idea and prepare a series.
One downside is that in the common case it causes us to look up each
object twice (once to get its size, and then again to load the content).
I wonder if we should have a function for "read this object, unless it's
over N bytes, in which case just tell me the size". That's weirdly
specific, but I think pretty much every user of core.bigfilethreshold
would want it.
diff --git a/diffcore-break.c b/diffcore-break.c
index c64359f489..35f5b50bcc 100644
@@ -61,6 +61,13 @@ static int should_break(struct diff_filespec *src,
return 0; /* they are the same */
+ if (diff_populate_filespec(src, CHECK_SIZE_ONLY) ||
+ diff_populate_filespec(dst, CHECK_SIZE_ONLY))
+ return 0; /* error but caught downstream */
+ if (src->size > big_file_threshold || dst->size > big_file_threshold)
+ return 0; /* too big to be worth computation */
if (diff_populate_filespec(src, 0) || diff_populate_filespec(dst, 0))
return 0; /* error but caught downstream */