
Re: gawk 4.1.4: CR separate char for CRLF files




Achim Gratz wrote:
Vermessung AVT - Wolfgang Rieger writes:
> Another solution which we have been using for many years now, though 
> it might not be feasible for you:

Cygwin is, like it or not, a rolling distribution.

> We very rarely update Cygwin. We have been using Cygwin for some 15+ 
> years now. We use tools like gawk (hundreds of scripts), head, tail, 
> sort, etc. in shell scripts running under cmd.exe 
> (no Unix shells involved). I soon realized that upgrades of Cygwin may 
> cause trouble with existing scripts, so we only update if we really 
> need to (e.g.: new functionality that would be important, 32 to 64 bit 
> shift, possibly new Windows versions, bugs we needed fixed).

Hopefully the machine(s) running those scripts are isolated.

In your particular case you might be better off using MSys2 or GNUwin32 tools, although you'd still need a better way to deal with updates.
Also, audit your scripts for non-portable constructs, since those are the parts that are most likely to break.  CMD scripting is a tough nut to crack if it's of any complexity, and there are lots of things that are poorly or not officially documented.  I don't quite understand why you use POSIX tools, but specifically shun POSIX scripting.

> I have followed the discussions about the CR/LF behaviour changes in 
> the past attentively and decided not to update in the near future, because 
> that would lead to a massive problem with many hundreds of scripts - 
> hoping that some time there will be a change in gawk again.

You'd better replace that hope with a feature request at gawk upstream.

> What is Unix-like or OS-like or Posix-like behaviour in that context?
> You could argue that gawk interprets line endings like the underlying 
> OS does (i.e., gawk reads LF in Unix and CR/LF in Win), or it 
> interprets line endings in Unix style regardless of the underlying OS 
> used. That's a developer's decision in my opinion.

Cygwin uses LF line endings (yes there are still text mounts, but you'd be better off pretending they don't exist).  When you're trying to use it for CRLF files, you need to wrap those invocations to do an explicit conversion.

https://cygwin.com/cygwin-ug-net/using-textbinary.html
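
A minimal sketch of such a wrapper, assuming a POSIX shell with `tr` and `awk` on the PATH. The function name `awk_crlf` and the `AWK` variable are made up for illustration; they are not part of Cygwin or gawk. The idea is to strip the CRs before awk sees the data and to put them back on the way out, so the awk program itself only ever deals with LF line endings.

```shell
# Hypothetical wrapper: run an awk program on CRLF input from standard
# input and emit CRLF output.  AWK defaults to plain "awk"; set AWK=gawk
# to force gawk.
awk_crlf() {
    # 1. tr removes every CR (including any in the middle of a line).
    # 2. The user's awk program runs on LF-only records.
    # 3. A second awk appends CR LF to each output line.
    tr -d '\r' |
        "${AWK:-awk}" "$@" |
        "${AWK:-awk}" '{ printf "%s\r\n", $0 }'
}

# Example use: print the second field of each CRLF-terminated line.
# printf 'a b\r\nc d\r\n' | awk_crlf '{ print $2 }'
```

The wrapper only filters standard input; file names passed as arguments would reach awk unconverted, so a real version would need to handle those separately.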

> But since with pipes or output redirection gawk used to write no CRs 
> even in previous versions, we already had the problem that gawk had to 
> accept *both* inputs, LF with or without CR. That worked largely fine 
> so far, since most Windows and other application SW we use accept both 
> record formats, fortunately (we had issues with SW upgrades of other 
> vendors no longer accepting pure LF, but that only concerned a very 
> small number of scripts). With the new approach in Cygwin that seems 
> to be broken, so we did not upgrade Cygwin since then (we currently 
> use gawk 4.1.3).

Again, your attempt to freeze your system at some arbitrary point in time is misguided.  It'll never quite work out, and chances are that when it breaks it will do so in ways that create more work and force you to do it in emergency mode, which is never a good thing.

> Of course the reason for that really annoying CR/LF thing is the 
> arrogance and ignorance of MS, which has cost innumerable useless 
> developer hours when I think of the endless discussions and changes 
> in Cygwin; but MS is the one who defines the standards because of its 
> sheer market power, so we have to deal with it, whether we like it or not.

You really can't blame them for CRLF: they weren't and aren't the only ones using it, and it was in use long before Microsoft entered the scene.

> I'd definitely prefer to use Unix for its powerful tools, but most of 
> the SW we use is simply not available for Unix, and MS does not 
> provide gawk etc. So we have to deal with that CR/LF issue in a 
> pragmatic rather than in a more, say, philosophical approach: We need 
> to run our scripts with as little changes as possible. So that's why 
> we upgrade Cygwin as seldom as possible. It is a "living system", yes, 
> which is great on the one side - but can be annoying in everyday 
> practice.

Again, you'd better figure out how to transform your input (and possibly output) so it'll conform to the conventions of the tool(s) you use, perhaps by providing a handful of wrapper scripts.  Alternatively, only use tools that adhere to the same set of conventions.

> In my opinion there should be at least an option for gawk to accept 
> both LF and CR/LF line endings equally, preferably with a system 
> variable so that there is no need to change the command line call of 
> gawk at all. That's what I vote for.

Yes, but please cast that vote with the upstream developers.  I reckon it'd be a generally useful function, so there's no point in providing it only on Cygwin.
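
Independently of any such option, a script can accept both endings on its own. A minimal sketch, assuming only standard awk behaviour: put a rule in front of the program that strips a trailing CR from each record, so the rest of the program sees identical records for LF and CRLF input. The helper name `awk_anyeol` and the `AWK` variable are hypothetical.

```shell
# Hypothetical helper: prepend a rule that drops a trailing CR, then run
# the user's program.  Changing $0 with sub() makes awk re-split the
# fields, so $1, $2, ... no longer carry a stray CR.
awk_anyeol() {
    prog=$1
    shift
    "${AWK:-awk}" '{ sub(/\r$/, "") } '"$prog" "$@"
}

# Example use: mixed CRLF and LF input, same result for both lines.
# printf 'x 1\r\ny 2\n' | awk_anyeol '{ print $2 "!" }'
```

Output from this sketch is LF-only; whether CRs need to be added back depends on what consumes it.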


Regards,
Achim.
-- 
+<[Q+ Matrix-12 WAVE#46+305 Neuron microQkb Andromeda XTk Blofeld]>+

SD adaptations for KORG EX-800 and Poly-800MkII V0.9:
http://Synth.Stromeko.net/Downloads.html#KorgSDada




Dear Achim,

I fully agree with most of what you say. But:

1) Just as Cygwin is a rolling distribution, my work is "rolling work". That is why I deal with it in what I call a pragmatic way: I need a working system with minimal maintenance effort. SW is as it is provided, and I have to adapt, since I mostly can't write my own.

2) When I started using Cygwin some 2 decades ago I was coming from Unix. C programming and awk were what I was used to. In fact, awk was my favourite tool, even for developing small C programs. When I was forced to switch to Windows I needed a feasible way to do text and data processing and to port several awk scripts to Windows within a short time - awk is a nearly perfect text processing tool to this day, though not widely known. I don't know anything comparable in terms of ease of syntax (once you know C), compactness of code, and flexibility, and, most important for me, I am very familiar with it. Somebody then recommended Cygwin, which I set up, learned to work with, and came to like. Decisions had to be made about scripting then, and for various reasons we ended up with cmd.exe and a couple of additional tools along with some major software. It was not at all ideal, but it was very easy and very flexible. Had we known how the system would one day grow, maybe we would have decided differently. Maybe. But I am not sure we would have come so far. We are a service provider with the need to automate tools; we are not a software vendor.

3) Years later somebody recommended GnuWin as a native port, which I immediately tried. However, we ran into serious problems with quoting, as single quote syntax did not work there (Unix and Cygwin: gawk '{print}' would have to be written as gawk "{print}"), which broke a lot of scripts, and there were other problems with passing special characters, quoting, etc. which I could not manage to solve. So we did not switch (and, besides, at times I was not sure if GnuWin was still an active system - Cygwin has great user groups and is very active).

4) We have learned a lot about how to incorporate Cygwin into cmd.exe, even with constructs like
for /f "usebackq ..." %%A in (`someprog ... ^| gawk '{...}' ^| something`) do ...
and a lot of other and even more complicated things. That may sound strange, but it works and has worked for many, many years now. A lot is possible!

5) You can always find a better way to do things, of course; I won't argue about that. Sometimes we thought about switching to Java or PHP or Python or whatever. Maybe we should. But we have a lot of running scripts, massive batch and parallel processing, and cmd.exe with minimum Cygwin (no X subsystem, no pile of tools, just a tiny installation) has worked great for many years - so why not use it? Just because it is not intended to be used that way?

6) We have a grown and growing system. To completely change the system would certainly mean months of development work. But we have no developer team. We work on projects which have to go on. We do programming where it is necessary in order to automate processes. Our clients don't care if we change software; they want results.

7) Yes, many things could be done much better. I'd like to have the perfect system. But there is no perfect system. Cygwin under cmd.exe works really fine once you have learned its specifics. In fact, Cygwin has done a really great job in our environment for nearly 2 decades so far, even if we mostly don't use it as intended!

There is one point where I disagree. You said,
> Again, you'd better figure out how to transform your input (and possibly
> output) so it'll conform to the conventions of the tool(s) you use, perhaps
> by providing a handful of wrapper scripts.  Alternatively, only use tools that
> adhere to the same set of conventions.
That is exactly what we do and have done so far, as I explained above. The problem comes up when developers decide, for whatever reason, to change the behaviour - which happened with the CRLF handling. You can argue that the previous CRLF handling of gawk was not POSIX conforming. To be honest, I never looked up the POSIX specifications. I use the SW by trying how it works and adapt to it. A SW vendor may be forced to check for compatibility considerations before writing one single line of code (I doubt many of them do so). But I am not a SW vendor. I may take the gawk manual, write code, and test it. I realise there is that CRLF thing and adapt my scripts accordingly. For many years that worked; the developers did not change the behaviour. Our "input and output perfectly conformed to the conventions" (which means for me, what the SW accepted). Some day they changed the conventions. The reasons are understandable, of course, yet it causes a great deal of trouble. That is where we are now. So I "adhere to the same set of conventions" by simply not updating now.

Maybe in 10 years' time another developer group will decide to change it another way for some other good reason. Every change in syntax will cause problems. If a SW tool allows several ways of doing something, it can be assumed that all of them will be used by different people. If that behaviour is changed it *will* cause problems for some.


Anyway, thanks for the suggestion to contact the upstream developers. I was not aware of that. Can you give me a hint where to go?

Kind regards,
Wolfgang


--
Problem reports:       http://cygwin.com/problems.html
FAQ:                   http://cygwin.com/faq/
Documentation:         http://cygwin.com/docs.html
Unsubscribe info:      http://cygwin.com/ml/#unsubscribe-simple