Re: Design of multiple hash support

On Mon, Nov 05, 2018 at 10:03:21AM -0800, Stefan Beller wrote:
> On Sun, Nov 4, 2018 at 6:36 PM Junio C Hamano <gitster@xxxxxxxxx> wrote:
> >
> > "brian m. carlson" <sandals@xxxxxxxxxxxxxxxxxxxx> writes:
> >
> > > I'm currently working on getting Git to support multiple hash algorithms
> > > in the same binary (SHA-1 and SHA-256).  In order to have a fully
> > > functional binary, we'll need to have some way of indicating to certain
> > > commands (such as init and show-index) that they should assume a certain
> > > hash algorithm.
> > >
> > > There are basically two approaches I can take.  The first is to provide
> > > each command that needs to learn about this with its own --hash
> > > argument.  So we'd have:
> > >
> > >   git init --hash=sha256
> > >   git show-index --hash=sha256 <some-file
> > >
> > > The other alternative is that we provide a global option to git, which
> > > is parsed by all programs, like so:
> > >
> > >   git --hash=sha256 init
> > >   git --hash=sha256 show-index <some-file
> >
> > I am assuming that "show-index" above is a typo for something like
> > "hash-object"?

> Actually both seem plausible, as both do not require
> RUN_SETUP, which means they cannot rely on the
> extensions.objectFormat setting.

Correct.  In general, I assume that commands that want a repository will
use the repository for that information.  There are a small number of
programs, such as init, that need to either set up a repository (without
reference to another repository) or need to inspect files without
necessarily being in a repository.

For example, we will want to have a way of indicating which hash we
would like to use in a fresh repository.  I am for the moment assuming
that we're in a stage 4 configuration: that is, that we're all SHA-1 or
all SHA-256.  A clone will provide this for us, but a git init will not.

Also, our pack index v3 format knows about which hash algorithm is in
use, but packs are not labeled with the algorithm they use.  This isn't
really a problem in normal use, since we always know from context which
algorithm is in use, but we'll need to indicate to index-pack (which
technically need not run in a repository) which algorithm it should use.
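
(Illustration only, not code from the series.)  One way to see why an
unlabeled pack needs the algorithm supplied from outside: even locating
the trailing checksum depends on the digest size.  The sketch below
assumes the usual raw digest sizes (20 bytes for SHA-1, 32 for SHA-256)
and a 12-byte pack header; the names hash_raw_size and
pack_trailer_offset are hypothetical.

```c
#include <stddef.h>

/* Hypothetical identifiers for this sketch; not Git's real definitions. */
enum hash_algo { HASH_SHA1 = 1, HASH_SHA256 };

/* Raw (binary) digest size in bytes: 20 for SHA-1, 32 for SHA-256. */
static size_t hash_raw_size(enum hash_algo algo)
{
	return algo == HASH_SHA256 ? 32 : 20;
}

/*
 * Offset at which the trailing checksum of a pack begins, given the
 * pack's total size in bytes.  Returns 0 if the pack is too small to
 * hold both the fixed header and a checksum of the given algorithm.
 */
static size_t pack_trailer_offset(size_t pack_size, enum hash_algo algo)
{
	size_t rawsz = hash_raw_size(algo);

	/* 12 bytes: signature, version, and object count. */
	if (pack_size < 12 + rawsz)
		return 0;
	return pack_size - rawsz;
}
```

The same 1000-byte file yields a different trailer offset depending on
which algorithm the caller names, which is why the caller has to say.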

show-index will eventually learn to parse the index itself to learn
which algorithms are in use, so an explicit hash option is technically
not required for it.

> When having a global setting, would that override the configured
> object format extension in a repository, or do we error out?
> So maybe
>   git -c extensions.objectFormat=sha256 init
> is the way to go, for now? (Are repository format extensions parsed
> just like normal config, such that non-RUN_SETUP commands
> can rely on the (non-)existence to determine whether to use
> the default or the given hash function?)

The extensions callbacks are only handled in check_repo_format, so they
necessarily require a repository.  This is not new with my code.

Furthermore, one would have to specify "-c
core.repositoryformatversion=1" as well, as extensions require that
version in order to have any effect.
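
(Illustration only, not the actual check_repo_format code.)  The rule
can be sketched as a small function; repo_hash_algo and its two
parameters are hypothetical names standing in for the values of
core.repositoryformatversion and extensions.objectFormat.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical identifiers for this sketch; not Git's real definitions. */
enum hash_algo { HASH_UNKNOWN = 0, HASH_SHA1, HASH_SHA256 };

/*
 * Decide which algorithm a repository uses from two config values:
 *   format_version - value of core.repositoryformatversion
 *   object_format  - value of extensions.objectFormat, or NULL if unset
 */
static enum hash_algo repo_hash_algo(int format_version,
				     const char *object_format)
{
	/* Below version 1, extensions are ignored: fall back to SHA-1. */
	if (format_version < 1 || !object_format)
		return HASH_SHA1;
	if (!strcmp(object_format, "sha1"))
		return HASH_SHA1;
	if (!strcmp(object_format, "sha256"))
		return HASH_SHA256;
	/* Unrecognized value: let the caller report an error. */
	return HASH_UNKNOWN;
}
```

With this shape, repo_hash_algo(0, "sha256") still selects SHA-1, which
is the reason the format version would have to be passed as well.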

My current approach for the testsuite is to have git init honor a new
GIT_DEFAULT_HASH environment variable so we need not modify every place
in the testsuite that calls git init (of which there are many).  That
may or may not be greeted with joy by reviewers, but it seemed to be the
minimum viable approach.
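
(Illustration only, not the code in the series.)  A minimal sketch of
what honoring the variable could look like, assuming SHA-1 remains the
default when it is unset; hash_algo_by_name and default_hash_for_init
are hypothetical helper names.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical identifiers for this sketch; not Git's real definitions. */
enum hash_algo { HASH_UNKNOWN = 0, HASH_SHA1, HASH_SHA256 };

/* Map a user-supplied name to an algorithm, or HASH_UNKNOWN. */
static enum hash_algo hash_algo_by_name(const char *name)
{
	if (!name)
		return HASH_UNKNOWN;
	if (!strcmp(name, "sha1"))
		return HASH_SHA1;
	if (!strcmp(name, "sha256"))
		return HASH_SHA256;
	return HASH_UNKNOWN;
}

/*
 * Pick the algorithm for a fresh repository: use GIT_DEFAULT_HASH if it
 * is set and non-empty, otherwise fall back to SHA-1.  Returns
 * HASH_UNKNOWN if the variable names an algorithm we do not know, so
 * the caller can report an error instead of guessing.
 */
static enum hash_algo default_hash_for_init(void)
{
	const char *env = getenv("GIT_DEFAULT_HASH");

	if (!env || !*env)
		return HASH_SHA1;
	return hash_algo_by_name(env);
}
```

The testsuite could then export GIT_DEFAULT_HASH=sha256 once and leave
the existing git init call sites alone.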

> There is a section "Object names on the command line"
> in Documentation/technical/hash-function-transition.txt
> and I assume that this is before the "dark launch"
> phase, so I would expect the latter to work (no error
> but conversion/translation on the fly) eventually as a goal.
> But the former might be in scope of one series.

Currently, I'm not implementing stages 1-3 of the transition.  I'm
merely going from the point where we have a binary that does only
SHA-256 and cannot perform SHA-1 operations at all to a stage 4
implementation, where the binary can do either, but a repository is
wholly one or the other.

> > It can work this way:
> >
> >  - read HEAD, discover that I am on 'master' branch, read refs/heads/master
> >    to learn the object name in 40-hex, realize that it cannot be
> >    sha256 and report "corrupt ref".
> >
> > Or it can work this way:
> >
> >  - read repository format, realize it is a good old sha1 repository.
> >
> >  - do the usual thing to get to read_object() to read the commit
> >    object data for the commit at HEAD, doing all of it in sha1.
> >
> >  - in the commit object data, locate references to other objects
> >    that use sha1 name.
> >
> >  - replace these sha1 references with their sha256 counterparts and
> >    show the result.
> >
> > I am guessing that you are doing the former as a good first step, in
> > which case, as an option that changes/affects the behaviour of git
> > globally, I think "git --hash=sha256" would make sense, like other
> > global options like --literal-pathspecs and --no-replace-objects.

Right now, we always read the repository configuration when possible,
and honor that.  I'm not planning, even when we have a full
implementation, to let the configuration of input and output format be
modified by command-line options.  That's a configuration of the
repository in the current transition plan, and I have no intention of
changing that (apart from possibly honoring "git -c").

-- 
brian m. carlson: Houston, Texas, US
OpenPGP: https://keybase.io/bk2204
