Re: Cygwin hanging in pselect

On Fri, Jan 6, 2017 at 12:40 PM, Erik Bray <erik.m.bray@xxxxxxxxx> wrote:
> Hello, and happy new-ish year,
> I've been working on and off over the past few months on bringing
> Python's compatibility with Cygwin up to snuff, including having all
> pertinent tests passing.  I've noticed that there are several tests
> (which I currently skip) that cause the process to hang indefinitely,
> and not respond to any signals from Cygwin (it can only be killed from
> Windows).  This is Cygwin 64-bit--I have not tested 32-bit.
> I finally looked into this problem and found the lockup to be in
> pselect() somewhere.  Attached I've provided the most minimal example
> I've been able to come up with so far that reproduces the problem,
> which I'll describe in a bit more detail next. I would attach a
> cygcheck output if requested, but I was also able to reproduce this on
> a recent build from source.
> So far as I've been able to tell, the problem only occurs with AF_UNIX
> sockets.  In the example I have a 'server' socket and a 'client'
> socket both set to non-blocking.  The client connects to the socket,
> returning errno EINPROGRESS as expected.  Then I do a pselect on the
> client socket to wait until it is ready to be read from.  The hang
> only happens when I pselect on the client socket, and not on the
> server socket.  It doesn't seem to make a difference what the timeout
> is.  One thing I have not tried is running the client and server in
> different processes, but the example from the Python tests that this
> reproduces has them both in the same process.
> Below is (I think) the most relevant output from strace on the test
> case.  It seems to hang somewhere in socket_cleanup, but I haven't
> investigated any further than that.
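
The original attachment is not reproduced here.  A minimal sketch of
the scenario described above might look like the following; the
function names and the socket path are made up for illustration, and
this is not the attached test case itself.

```c
/* Sketch only: not the original attachment.  The helper names and the
   socket path are assumptions made for illustration. */
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <time.h>
#include <unistd.h>

/* Put a descriptor into non-blocking mode; returns 0 on success. */
static int
set_nonblocking (int fd)
{
  int flags = fcntl (fd, F_GETFL, 0);
  if (flags == -1)
    return -1;
  return fcntl (fd, F_SETFL, flags | O_NONBLOCK);
}

/* Connect a non-blocking AF_UNIX client to a non-blocking server in
   the same process, then pselect() on the client for readability.
   Returns the pselect() result, or -1 on a setup error. */
int
pselect_on_unix_client (long timeout_ms)
{
  struct sockaddr_un addr;
  struct timespec ts;
  fd_set rfds;
  int server, client, res = -1;

  memset (&addr, 0, sizeof addr);
  addr.sun_family = AF_UNIX;
  snprintf (addr.sun_path, sizeof addr.sun_path,
            "/tmp/pselect_sketch.%ld", (long) getpid ());
  unlink (addr.sun_path);

  server = socket (AF_UNIX, SOCK_STREAM, 0);
  client = socket (AF_UNIX, SOCK_STREAM, 0);
  if (server == -1 || client == -1)
    goto out;
  if (bind (server, (struct sockaddr *) &addr, sizeof addr) == -1
      || listen (server, 1) == -1
      || set_nonblocking (server) == -1
      || set_nonblocking (client) == -1)
    goto out;

  /* Depending on the platform this either succeeds at once or fails
     with EINPROGRESS; both are acceptable here. */
  if (connect (client, (struct sockaddr *) &addr, sizeof addr) == -1
      && errno != EINPROGRESS)
    goto out;

  /* Wait for the client socket to become readable. */
  FD_ZERO (&rfds);
  FD_SET (client, &rfds);
  ts.tv_sec = timeout_ms / 1000;
  ts.tv_nsec = (timeout_ms % 1000) * 1000000L;
  res = pselect (client + 1, &rfds, NULL, NULL, &ts, NULL);

out:
  if (client != -1)
    close (client);
  if (server != -1)
    close (server);
  unlink (addr.sun_path);
  return res;
}
```

Calling pselect_on_unix_client (200) from main() is enough to exercise
it.  Nothing is ever written to the client, so the call should come
back once the timeout expires; the report above is that under Cygwin
it never comes back at all.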

I made a little bit of progress debugging this, but now I'm stumped.
It seems the problem is this:

For each socket whose fd is passed to select() a thread_socket is
started which calls peek_socket until there are bits ready on the
socket, or until the timeout is reached.  This in turn calls

The reason it locks up only on my "client" socket, the one on which
connect() is called, is that evaluate_events notes that the socket is
waiting to connect, and this passes control to
fhandler_socket::af_local_connect().  af_local_connect() temporarily
sets the socket to blocking, then sends a magic string to the socket
(you can see in my strace log that this succeeds).  What's strange,
and what I don't understand, is that there are no FD_READ or FD_OOB
events recorded for the WSASendTo call from af_local_send_secret().
Then, after af_local_send_secret() it calls af_local_recv_secret().
This calls recv_internal() which in turn calls recursively into
fhandler_socket::evaluate_events where it waits for an FD_READ or
FD_OOB event that never arrives.  And since it set the socket to
blocking it just sits in an infinite loop.
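
To make the difference between the two modes concrete, here is a toy
model of that wait.  It is NOT the Cygwin source: every name in it is
hypothetical, and the max_polls bound only exists so that the model
terminates where the real code would spin forever.

```c
/* Simplified model, NOT Cygwin source: all names are hypothetical. */
#include <errno.h>

/* Event bits for the model (stand-ins for FD_READ and FD_OOB). */
enum { EV_READ = 0x01, EV_OOB = 0x04 };

enum wait_result
{
  WAIT_READY = 0,        /* a wanted event was recorded */
  WAIT_WOULDBLOCK = -1,  /* non-blocking socket, nothing recorded */
  WAIT_STUCK = -2        /* blocking socket, event never arrived */
};

/* pending:     events currently recorded for the socket.
   wanted:      mask of events the caller is waiting for.
   nonblocking: nonzero if the socket is in non-blocking mode.
   max_polls:   bound standing in for "forever". */
enum wait_result
model_wait_for_events (int pending, int wanted, int nonblocking,
                       int max_polls)
{
  for (int i = 0; i < max_polls; i++)
    {
      if (pending & wanted)
        return WAIT_READY;
      if (nonblocking)
        {
          errno = EWOULDBLOCK;
          return WAIT_WOULDBLOCK;
        }
      /* Blocking socket: poll again.  Nothing in this model ever
         updates `pending', mirroring an event that never arrives. */
    }
  return WAIT_STUCK;
}
```

With no events recorded, the non-blocking case returns straight away
with EWOULDBLOCK, while the blocking case runs out its poll budget,
which is the model's stand-in for the hang.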

Meanwhile the timer for the select() call expires and select() tries
to shut down the thread_socket, but it can't because that thread never
completes.

What I don't understand is why there is not an event recorded for the
WSASendTo in send_internal.  I even wrapped it with the following
debug code to wait for an FD_READ event immediately following the

      else if (get_socket_type () == SOCK_STREAM)
        {
          WSAEventSelect (get_socket (), wsock_evt, EVENT_MASK);
          res = WSASendTo (get_socket (), out_buf, out_idx, &ret, flags,
                           wsamsg->name, wsamsg->namelen, NULL, NULL);
          debug_printf ("WSASendTo sent %d bytes; ret: %d", ret, res);
          while (!(res = wait_for_events (FD_READ | FD_OOB, 0)))
            {
              debug_printf ("Waiting for socket to be readable");
            }
        }

But the strace at this point just outputs:
   62  108286 [socksel] poll_test 24152 fhandler_socket::af_local_connect: af_local_connect called,
  156  108442 [socksel] poll_test 24152 fhandler_socket::send_internal: WSASendTo sent 16 bytes; ret: 0

It never returns from send_internal.  I don't have deep knowledge of
WinSock, but from what I've read ISTM WSASendTo should have triggered
an FD_READ event on the socket, and it doesn't for some reason.
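
As a point of comparison only: with ordinary POSIX sockets, sending on
one end of a connection makes the other end readable, not the sending
end.  Whether the same reasoning explains the missing FD_READ here is
an assumption; it has not been checked against the Cygwin or WinSock
behavior discussed above.  A small sketch with made-up function names:

```c
/* Comparison sketch using plain POSIX sockets; the function names are
   hypothetical and nothing here is Cygwin or WinSock code. */
#define _POSIX_C_SOURCE 200809L
#include <sys/select.h>
#include <sys/socket.h>
#include <time.h>
#include <unistd.h>

/* Returns 1 if fd becomes readable within timeout_ms, 0 on timeout,
   -1 on error. */
static int
readable_within (int fd, long timeout_ms)
{
  struct timespec ts;
  fd_set rfds;

  FD_ZERO (&rfds);
  FD_SET (fd, &rfds);
  ts.tv_sec = timeout_ms / 1000;
  ts.tv_nsec = (timeout_ms % 1000) * 1000000L;
  return pselect (fd + 1, &rfds, NULL, NULL, &ts, NULL);
}

/* Sends one byte on one end of a connected AF_UNIX pair, then stores
   the readability of the sending end in *sender and of the receiving
   end in *peer.  Returns 0 on success, -1 on a setup error. */
int
readability_after_send (int *sender, int *peer)
{
  int sv[2];

  if (socketpair (AF_UNIX, SOCK_STREAM, 0, sv) == -1)
    return -1;
  if (send (sv[0], "x", 1, 0) != 1)
    {
      close (sv[0]);
      close (sv[1]);
      return -1;
    }
  *sender = readable_within (sv[0], 100);
  *peer = readable_within (sv[1], 100);
  close (sv[0]);
  close (sv[1]);
  return 0;
}
```

If something similar holds for the AF_UNIX handshake, the FD_READ that
af_local_recv_secret() waits for would have to come from data sent by
the other side, not from the local WSASendTo.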

Problem reports:       http://cygwin.com/problems.html
FAQ:                   http://cygwin.com/faq/
Documentation:         http://cygwin.com/docs.html
Unsubscribe info:      http://cygwin.com/ml/#unsubscribe-simple