Re: New command/tool: git filter-repo

On Thu, 31 Jan 2019 at 22:37, Elijah Newren <newren@xxxxxxxxx> wrote:
> On Thu, Jan 31, 2019 at 8:09 PM Junio C Hamano <gitster@xxxxxxxxx> wrote:
> > Elijah Newren <newren@xxxxxxxxx> writes:
> >
> > > git-filter-repo[1], a filter-branch-like tool for rewriting repository
> > > history, is ready for more widespread testing and feedback.  The rough
> > > edges I previously mentioned have been fixed, and it has several useful
> > > features already, though more development work is ongoing (docs are a
> > > bit sparse right now, though -h provides some help).
> > >
> > > Why filter-repo vs. filter-branch?

I like the name! I think a lot of users are interested in filtering
their entire repo, rather than rewriting a single branch.

> > How does it compare with bfg-repo-cleaner?  Somehow I was led to
> > believe that all serious users of filter-branch like functionality
> > are using bfg-repo-cleaner instead.
> No, bfg-repo-cleaner only covers an important subset of the usecases.

That's true - the focus with BFG Repo-Cleaner is on removing unwanted
data - completely eradicating it from a repo's history. There are some
mistakes in history that repo owners just really *do not* want to
share (e.g. large files, private data/credentials), and they can be a
critical blocker to sharing or working with a Git repo. In terms of
rewriting history, my internal criterion for which features I really
want to be in the BFG is: is this unwanted data completely stopping
many users from sharing their code or doing their work?

I understand that when it comes to rewriting history, there are loads
of other operations that people sometimes want to perform, beyond
removing unwanted data - merging/splitting of history,
anonymization/renaming of committers, etc. Some of those might be nice
to add to the BFG - but like many OSS maintainers, I have limited
time, and a life to balance outside of software...!

> bfg-repo-cleaner does a really good job if your goal is to remove a
> few big files and/or to remove some sensitive text (matched via
> regexes) from all blobs.  It was designed for that specific role and
> has more options in this area than filter-repo currently has.  But
> even within this design space it was optimized for, it is missing two
> things that I really want:
>   * pruning of commits which become empty due to filtering

There certainly have been several users asking for this feature on the
BFG, and even a kindly contributed PR for the functionality which I've
yet to merge. As the lack of it doesn't actually stop users from doing
work - so far as I can see - it's something that I've done a poor job
of following up on.

>   * providing a way for the user to know what needs to be cleaned up.
> It has options like --strip-blobs-bigger-than <size> or
> --strip-biggest-blobs <NUM>, but no way for the user to figure out
> what <size> or <NUM> should be.

For users of GitHub, it's normally 100MB (GitHub's per-file size limit)
with --strip-blobs-bigger-than <size> :-)
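
For repos hosted elsewhere, one way to work out a sensible <size> is to
list the largest blobs in history first. The following is only a rough
sketch, not part of the BFG or of git-filter-repo: the helper names
parse_batch_check and biggest_blobs are made up for illustration, and it
assumes the standard plumbing commands `git rev-list --objects --all`
and `git cat-file --batch-check` are available on the PATH.

```python
import subprocess


def parse_batch_check(lines):
    """Return (size, name) for each blob line, largest first.

    Each line is expected to look like '<type> <size> <sha> <path>',
    where <path> may be empty. (Hypothetical helper for illustration.)
    """
    blobs = []
    for line in lines:
        parts = line.rstrip("\n").split(" ", 3)
        if len(parts) < 3 or parts[0] != "blob":
            continue  # skip commits, trees, tags and malformed lines
        size = int(parts[1])
        # Fall back to the object id when no path was recorded.
        name = parts[3] if len(parts) == 4 and parts[3] else parts[2]
        blobs.append((size, name))
    return sorted(blobs, reverse=True)


def biggest_blobs(repo=".", top=10):
    """List the `top` largest blobs reachable from any ref in `repo`."""
    objects = subprocess.run(
        ["git", "-C", repo, "rev-list", "--objects", "--all"],
        check=True, capture_output=True, text=True,
    ).stdout
    checked = subprocess.run(
        ["git", "-C", repo, "cat-file",
         "--batch-check=%(objecttype) %(objectsize) %(objectname) %(rest)"],
        input=objects, check=True, capture_output=True, text=True,
    ).stdout
    return parse_batch_check(checked.splitlines())[:top]


# Example use, run from inside a clone:
#   for size, name in biggest_blobs("."):
#       print(size, name)
```

The sizes reported are uncompressed object sizes in bytes, so they are
only a guide for choosing a threshold just below the blobs you want
gone.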

> Also, since it just focuses on really
> big blobs, it misses cases like someone checking in directories with a
> huge number of small-to-moderately sized files (e.g. bower_components/
> or node_modules/, though these could also contain a few big blobs

For those use-cases, it might be that BFG's --delete-folders flag is
useful, especially given the protected-head-commit feature of the BFG.

It's getting late for me, must be even later in Brussels - I wish I
could have made it there to join in! Merry Git Merge to you all, and
good luck to you Elijah with git-filter-repo.