Web lists-archives.com

Re: [PATCH] specify encoding for sed command




I did a little more digging into this issue today.

> On Apr 11, 2018, at 4:42 PM, Matt Coleman <cat.moleman@xxxxxxxxx> wrote:
> 
> I found another (possibly better) way to fix this:
> 
>> On Apr 10, 2018, at 3:18 AM, Matt Coleman <cat.moleman@xxxxxxxxx> wrote:
>> 
>>> 1) What platform OS / version / sed version is this on?
>> I'm experiencing this on macOS Sierra (10.12.6). The issue occurs with the OS's native sed, which is FreeBSD sed so the version number is kind of ambiguous.
>> 
>> The error goes away if I set LANG=C or LC_ALL=C or change it to use GNU sed (installed via homebrew as gsed).
> 
> If I change it to use awk instead of sed, it works with mawk, gawk, and macOS awk:
> unset $(set | awk -F '=' '/^__gitcomp_builtin_[a-zA-Z0-9_][a-zA-Z0-9_]*=/ {print $1}') 2>/dev/null
> 
> I compared sed vs. awk on Linux and Mac and they all take about the same amount of time to run (within 0.002ms).

The issue isn't actually caused by `sed`: it's caused by the way that the `set` builtin outputs characters in the Unicode Private Use Area (PUA) in the build of Bash 3.2.57 that macOS Sierra ships with.

Powerline uses several PUA code points to make some of its pretty text UI elements:

Code point (hex value): description
U+E0A0 (0xEE82A0): Version control branch
U+E0A1 (0xEE82A1): LN (line) symbol
U+E0A2 (0xEE82A2): Closed padlock
U+E0B0 (0xEE82B0): Rightwards black arrowhead
U+E0B1 (0xEE82B1): Rightwards arrowhead
U+E0B2 (0xEE82B2): Leftwards black arrowhead
U+E0B3 (0xEE82B3): Leftwards arrowhead

macOS Bash 3.2.57's `set` builtin has garbled output where Powerline's special symbols should be in the PS1 variable, but Bash 4.4.19 (installed on macOS via homebrew) and Bash 4.3.38 (Ubuntu 16.04) both display it correctly in the output of `set`. `echo $PS1` does display the symbols correctly on these versions of Bash.

So...

> 3) There's other invocations of "sed" in the file, aren't those affected as well?
The short answer: no. Slightly longer: not by the same thing that's affecting the line in the patch, at least.

Long: As described above, the problem isn't actually `sed`: it's the `set` builtin in macOS's build of Bash. The other invocations of `sed` should be safe, because `sed` properly handles the PUA code points on its own:

$ echo $'\xee\x82\xb0' | sed 's/./@/'
@

The way that `set` is displaying the PS1 variable seems to be sending them to `sed` individually or somehow split up:

$ for character in $'\xee' $'\x82' $'\xb0' $'\xee\x82' $'\x82\xb0'; do echo $character | sed 's/./@/'; done
sed: RE error: illegal byte sequence
sed: RE error: illegal byte sequence
sed: RE error: illegal byte sequence
sed: RE error: illegal byte sequence
sed: RE error: illegal byte sequence

Interestingly, Bash 3.2.25's `set` builtin on CentOS 5 correctly displays the octal values for the symbols (prompt edited to keep this email ASCII):

$ PS1=$'\xee\x82\xb0 '
>  set | grep PS1
PS1=$'\356\202\260 '

I haven't started digging through Bash's codebase, but it could either be a bug that was introduced between 3.2.25 and 3.2.57 or Apple used some silly flags when compiling Bash. Ideally, this should be fixed in Bash, but since Apple's using such an old version of Bash for license reasons, I think it's unlikely that they'll fix the issue.

I think the best way to move forward with this is a new patch that uses `awk` instead of `sed`: I tested several `awk` variants and the command was portable without requiring any changes to LANG or LC_ALL.

Does that sound like a good plan?