Re: git archive generates tar with malformed pax extended attribute




On Sat, May 25 2019, René Scharfe wrote:

> Am 24.05.19 um 10:13 schrieb Jeff King:
>> On Fri, May 24, 2019 at 09:35:51AM +0200, Keegan Carruthers-Smith wrote:
>>
>>>> I can't reproduce on Linux, using GNU tar (1.30) nor with bsdtar 3.3.3
>>>> (from Debian's bsdtar package). What does your "tar --version" say?
>>>
>>> bsdtar 2.8.3 - libarchive 2.8.3
>>
>> Interesting. I wonder if there was a libarchive bug that was fixed
>> between 2.8.3 and 3.3.3.
>>
>>>> Git does write a pax header with the commit id in it as a comment.
>>>> Presumably that's what it's complaining about (but it is not malformed
>>>> according to any tar I've tried). If you feed git-archive a tree rather
>>>> than a commit, that is omitted. What does:
>>>>
>>>>   git archive --format tar c21b98da2^{tree} | tar tf - >/dev/null
>>>>
>>>> say? If it doesn't complain, then we know it's indeed the pax comment
>>>> field.
>>>
>>> It also complains
>>>
>>>   $ git archive --format tar c21b98da2^{tree} | tar tf - >/dev/null
>>>   tar: Ignoring malformed pax extended attribute
>>>   tar: Error exit delayed from previous errors.
>>
>> Ah, OK. So it's not the comment field at all, but some other entry.
>>
>>> Some more context: I work at Sourcegraph.com. We mirror a lot of repos
>>> from github.com. We usually interact with a working copy by running
>>> git archive on it in our infrastructure. This is the first repository
>>> that I have noticed which produces this error. An interesting thing to
>>> note is the commit metadata contains a lot of non-ascii text which was
>>> my guess at what may be tripping up the tar creation.
>>
>> Yeah, though the only thing that makes it into the tarfile is the actual
>> tree entries. I'd imagine the file content is not likely to be a source
>> of problems, as it's common to see binary gunk there. Most of the
>> filenames are pretty mundane, but this symlink destination is a little
>> funny:
>>
>>   $ git archive ... | tar tvf - | grep nicovideo4as.swc
>>   lrwxrwxrwx root/root         0 2019-05-24 03:05 libs/nicovideo4as.swc -> PK\003\004\024
>>
>> That's not the full story, though. It is indeed a symlink in the
>> tree:
>>
>>   $ git ls-tree -r HEAD libs/nicovideo4as.swc
>>   120000 blob ec3137b5fcaeae25cf67927068af116517683806	libs/nicovideo4as.swc
>>
>> But the contents of that blob, which should be the destination filename,
>> are definitely not:
>>
>>   $ git cat-file blob ec3137b5f | wc -c
>>   57804
>>   $ git cat-file blob ec3137b5f | xxd | head -1
>>   00000000: 504b 0304 1400 0800 0800 5069 694e 0000  PK........PiiN..
>>
>> There's quite a bit more data there. And what tar showed us goes up to
>> the first NUL, which does not seem surprising.
>
> That (the symlink target) is a ZIP file with the following contents:
>
>  Length   Method    Size  Cmpr    Date    Time   CRC-32   Name
> --------  ------  ------- ---- ---------- ----- --------  ----
>    39733  Defl:N     3403  91% 2019-03-09 13:10 489e1be1  catalog.xml
>    54131  Defl:N    54151   0% 2019-03-09 13:10 32f57322  library.swf
> --------          -------  ---                            -------
>    93864            57554  39%                            2 files
>
> And link targets longer than 100 characters are encoded in an extended
> Pax header.
>
> (Usually symlink targets are paths, not file contents.)
>
>> It's possible Git is doing the wrong thing on the writing side, but
>> given that newer versions of bsdtar handle it fine, I'd guess that the
>> old one simply had problems consuming poorly formed symlink filenames.
>
> Git preserves symlink targets with embedded NULs in the repository and
> in generated tar files.  Not sure if GNU tar and bsdtar truncating them
> at the first NUL is a bug.  I'm also not sure if there is a platform
> that would allow creating such a symlink in the file system, or how one
> is supposed to use it.
>
> We could truncate symlink targets at the first NUL as well in git
> archive -- but that would be a bit sad, as the archive formats allow
> storing the "real" target from the repo, with NUL and all.  We could
> make git fsck report such symlinks.
>
> Can Unicode symlink targets contain NULs?  We wouldn't want to damage
> them even if we decide to truncate.

I don't see a practical use for this case, and maybe we should even have
fsck check whether the blob representing a symlink target contains a \0,
as suggested upthread.
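
To illustrate the kind of check meant here, a minimal sketch in Python
(not Git's C, and not something git fsck does today as far as this thread
establishes). The function names are hypothetical; the sample bytes are
the first 16 bytes of blob ec3137b5f, copied from the xxd output quoted
above.

```python
def symlink_target_has_nul(blob: bytes) -> bool:
    """Return True if the blob of a symlink entry contains a NUL byte."""
    return b"\0" in blob


def truncate_at_nul(blob: bytes) -> bytes:
    """Return the bytes before the first NUL (the whole blob if none)."""
    return blob.split(b"\0", 1)[0]


# First 16 bytes of the blob, taken from the xxd line quoted in the thread.
sample = bytes.fromhex("504b03041400080008005069694e0000")

print(symlink_target_has_nul(sample))  # True
print(truncate_at_nul(sample))         # b'PK\x03\x04\x14'
```

The truncated value matches the "PK\003\004\024" that tar printed as the
link destination, since octal 024 is 0x14.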

But that being said, the assumption that data in a tar archive will get
written to a filesystem of some sort doesn't always hold. There are
plenty of consumers of the format that read it in memory and stream its
contents out to something else entirely, e.g. taking "git archive
--remote" output, parsing it with something like [1], and throwing some
or all of the content into a database.
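
A rough sketch of that kind of consumer, using Python's standard tarfile
module instead of the Perl module in [1]. Everything here is an
assumption for illustration: the archive is built in memory as a
stand-in for real "git archive" output, the member names are made up,
and a plain dict stands in for the database.

```python
import io
import tarfile


def build_sample_archive() -> bytes:
    """Build a small pax-format tar in memory (stand-in for git archive)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.PAX_FORMAT) as tar:
        data = b"hello\n"
        info = tarfile.TarInfo("README")
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))

        link = tarfile.TarInfo("libs/link")
        link.type = tarfile.SYMTYPE
        link.linkname = "../README"
        tar.addfile(link)
    return buf.getvalue()


def load_into_store(archive: bytes) -> dict:
    """Stream tar members into a dict without touching the filesystem."""
    store = {}
    # "r|" reads the archive as a forward-only stream, as from a pipe.
    with tarfile.open(fileobj=io.BytesIO(archive), mode="r|") as tar:
        for member in tar:
            if member.isfile():
                store[member.name] = ("file", tar.extractfile(member).read())
            elif member.issym():
                # The link target is just stored data; nothing resolves it.
                store[member.name] = ("symlink", member.linkname)
    return store


store = load_into_store(build_sample_archive())
```

Nothing in this path ever creates a symlink, so whether a filesystem
would accept the target never comes up for such a consumer.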

1. https://metacpan.org/pod/Archive::Tar