Re: Problem with zombie processes
- Date: Mon, 20 Feb 2017 14:54:45 -0800
- From: Mark Geisert <mark@xxxxxxxxxx>
- Subject: Re: Problem with zombie processes
Erik Bray wrote:
On Mon, Feb 20, 2017 at 11:54 AM, Mark Geisert wrote:
So my guess was that Cygwin might try to hold on to a handle to a
child process at least until it's been explicitly wait()ed. But that
does not seem to be the case after all.
You might have missed a subtlety in what I said above. The Python
interpreter itself is calling wait4() to reap your child process. Cygwin
has told Python one of its children has died. You won't get the chance to
wait() for it yourself. Cygwin *does* have a handle to the process, but it
gets closed as part of Python calling wait4().
To be clear, wait4() is not called from Python until the script
explicitly calls p.wait().
In other words, when I run this step by step (e.g. in gdb) I don't see a
wait4() call until the point where the script explicitly calls wait(). I
don't see any reason Python would do this behind the scenes.
You're right. I missed the wait in your script and ASSumed too much of the
Python interpreter :-( .
Anyways, I think it would be nicer if /proc returned at least partial
information on zombie processes, rather than an error. I have a patch
to this effect for /proc/<pid>/stat, and will add a few more as well.
To me /proc/<pid>/stat was the most important because that's the
easiest way to check the process's state in the first place! Now I
also have to catch EINVAL as well and assume that means a zombie
The file /proc/<pid>/stat is there until Cygwin finishes cleanup of the
child due to Python having wait()ed for it. When you run your test script,
pay attention to the process state character in those cases where you
successfully read the stat file. It's often S (sleeping) or R
(running) but I also see Z (zombie) sometimes. Your script is in a race
with Cygwin, and you cannot guarantee you'll see a killed process's state
before Cygwin cleans it up.
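
A rough sketch of how a script might read the state character while
tolerating that race. It assumes the Linux-style stat layout
"pid (comm) state ..." and treats a missing file, EINVAL, or ESRCH as
"state not available". The helper names parse_state and read_proc_state
are made up for illustration; they are not part of Cygwin or Python.

```python
import errno
import os


def parse_state(stat_text):
    """Return the one-character state field from /proc/<pid>/stat text."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so split on the last ')' instead of on whitespace.
    after_comm = stat_text.rsplit(")", 1)[1]
    return after_comm.split()[0]


def read_proc_state(pid):
    """Return the state character for pid, or None if it cannot be read."""
    try:
        with open("/proc/%d/stat" % pid) as f:
            return parse_state(f.read())
    except FileNotFoundError:
        # The entry is already gone: the process was cleaned up first.
        return None
    except OSError as e:
        # Assumption: EINVAL (as reported in this thread) or ESRCH means
        # the process is a zombie or is being torn down.
        if e.errno in (errno.EINVAL, errno.ESRCH):
            return None
        raise


if __name__ == "__main__":
    print(read_proc_state(os.getpid()))
```

A None result here only says the state could not be read at that moment;
it cannot distinguish a zombie from a process that is already gone.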
One way around this *might* be to install a SIGCHLD handler in your Python
script. If that's possible, that should tell you when your child exits.
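
Something along these lines might work as a starting point. It is only a
sketch using the standard signal and subprocess modules; the function
name run_and_notice_exit and the polling loop are illustrative choices,
and it has to run in the main thread because that is where Python
delivers signals.

```python
import signal
import subprocess
import sys
import time

child_exited = False


def on_sigchld(signum, frame):
    """Record that some child changed state; do no real work in the handler."""
    global child_exited
    child_exited = True


def run_and_notice_exit(argv, timeout=5.0):
    """Start argv, wait for SIGCHLD, then reap the child.

    Returns (noticed, returncode), where noticed says whether the handler
    ran before the timeout expired.
    """
    global child_exited
    child_exited = False
    # Install the handler before starting the child so the signal
    # cannot arrive while no handler is in place.
    previous = signal.signal(signal.SIGCHLD, on_sigchld)
    try:
        p = subprocess.Popen(argv)
        deadline = time.monotonic() + timeout
        while not child_exited and time.monotonic() < deadline:
            time.sleep(0.01)
        noticed = child_exited
        # The child has not been reaped yet at this point; this is where
        # a script could look at /proc/<pid>/stat before calling wait().
        return noticed, p.wait()
    finally:
        # signal.signal() can return None for a handler not set from Python.
        signal.signal(signal.SIGCHLD,
                      previous if previous is not None else signal.SIG_DFL)


if __name__ == "__main__":
    print(run_and_notice_exit([sys.executable, "-c", "pass"]))
```

The handler only sets a flag, so the script still decides when to call
wait(); whether the /proc entry is readable in that window on Cygwin is
exactly the open question in this thread.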
Perhaps the Python script is a red herring. I just wrote it to
demonstrate the problem. It is strange that the behavior changes with
where I send stdout, but you're likely right that it just comes down to
subtle timing differences. Here's a C program that demonstrates the
same issue more reliably. Interestingly, it works when I run it in
strace (probably just because of the strace overhead) but not when I
run it normally.
My point in all this is I'm confused why Cygwin would give up its
handles to the Windows process before wait() has been called.
(In fact, it's pretty confusing to have fopen returning EINVAL which
according to  it should only be doing if the mode string were
O.K., you may be on to something amiss in the Cygwin DLL. Thanks for the STC
(simple test case) in C; that'll help somebody looking further at this. I'm
out of ideas. It might be possible to reduce strace overhead somewhat by
selecting a smaller set of trace options than the default.