Re: Git blame performance on files with a lot of history
- Date: Mon, 17 Dec 2018 12:30:56 -0800
- From: Clement Moyroud <clement.moyroud@xxxxxxxxx>
- Subject: Re: Git blame performance on files with a lot of history
On Fri, Dec 14, 2018 at 2:48 PM Ævar Arnfjörð Bjarmason wrote:
> On Fri, Dec 14 2018, Clement Moyroud wrote:
> > My group at work is migrating a CVS repo to Git. The biggest issue we
> > face so far is the performance of git blame, especially compared to
> > CVS on the same file. One file especially causes us trouble: it's a
> > 30k-line file with 25 years of history in 3k+ commits. The complete
> > repo has 200k+ commits over that same period of time.
> There's a real-world repo with a shape & size very similar to this that
> has good performance, gcc.git: https://github.com/gcc-mirror/gcc
> $ wc -l ChangeLog
> 20240 ChangeLog
> $ git log --oneline -- ChangeLog | wc -l
> $ git log --oneline | wc -l
> $ time git blame ChangeLog >/dev/null
> real 0m1.977s
> user 0m1.909s
> sys 0m0.069s
> Its history began in 1997, and changes to the ChangeLog file are, by
> their nature, fairly evenly spread through that period.
> So check out that repo to see if you have similar or worse
> performance. Does your work repo show the same problem with a history
> produced with 'git fast-export --anonymize', and if so is that something
> you'd be OK with sharing?
I see around 3s here on the GCC repo, but I'm on a VM and the repo is
cloned on an NFS disk, so I'd say it matches :) That's around 45x faster
than blame on my repo, on the same NFS share and VM. So there's
definitely something to improve on my end (see my reply to Bryan re:
repack in a separate e-mail).
The anonymized export won't work in that case: all file contents are
replaced with 'anonymous blob <n>', so there's no per-line history for
blame to follow. Let me see if I can post-process a non-anonymized
version to keep the relevant data available.
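
One way such post-processing might look, as a rough sketch only: filter the
output of a plain 'git fast-export' and replace each line of every blob with
a short digest of that line. Identical lines stay identical and line counts
are preserved, so blame still has per-line history to follow, while the
actual text is hidden. The names scrub_payload and scrub_blobs are
hypothetical, and the sketch assumes the stream uses the byte-count
'data <n>' form. It does not touch paths, author names, or commit messages,
and diff heuristics may behave slightly differently on digests than on the
real text.

```python
import hashlib


def scrub_payload(payload: bytes) -> bytes:
    """Replace every line of a blob with a 12-character digest of that line."""
    lines = payload.split(b"\n")
    tail = lines.pop()  # text after the last newline; b"" if payload ends with one

    def digest(line: bytes) -> bytes:
        return hashlib.sha1(line).hexdigest()[:12].encode()

    out = [digest(line) + b"\n" for line in lines]
    if tail:
        out.append(digest(tail))
    return b"".join(out)


def scrub_blobs(src, dst) -> None:
    """Copy a fast-export stream from src to dst, scrubbing blob contents only.

    src and dst are binary file objects. Commit and tag messages also arrive
    in 'data' blocks, so only the data block that follows a 'blob' command
    is rewritten; everything else is copied through unchanged.
    """
    in_blob = False
    while True:
        line = src.readline()
        if not line:
            break
        if line == b"blob\n":
            in_blob = True
            dst.write(line)
        elif line.startswith(b"data "):
            size = int(line[5:])
            payload = src.read(size)
            if in_blob:
                payload = scrub_payload(payload)
                in_blob = False
            # The byte count has to be recomputed for the rewritten payload.
            dst.write(b"data %d\n" % len(payload))
            dst.write(payload)
        else:
            dst.write(line)
```

To use it, one could call scrub_blobs(sys.stdin.buffer, sys.stdout.buffer)
from a small script, pipe 'git fast-export --all' through it, and feed the
result to 'git fast-import' in a fresh repository.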