Re: Hangs on connect to UNIX socket being listened on in the same process (was: Cygwin hanging in pselect)




On Mon, Jan 9, 2017 at 12:01 PM, Erik Bray <erik.m.bray@xxxxxxxxx> wrote:
> On Fri, Jan 6, 2017 at 12:40 PM, Erik Bray <erik.m.bray@xxxxxxxxx> wrote:
>> Hello, and happy new-ish year,
>>
>> I've been working on and off over the past few months on bringing
>> Python's compatibility with Cygwin up to snuff, including having all
>> pertinent tests passing.  I've noticed that there are several tests
>> (which I currently skip) that cause the process to hang indefinitely,
>> and not respond to any signals from Cygwin (it can only be killed from
>> Windows).  This is Cygwin 64-bit--I have not tested 32-bit.
>>
>> I finally looked into this problem and found the lockup to be in
>> pselect() somewhere.  Attached I've provided the most minimal example
>> I've been able to come up with so far that reproduces the problem,
>> which I'll describe in a bit more detail next. I would attach a
>> cygcheck output if requested, but I was also able to reproduce this on
>> a recent build from source.
>>
>> So far as I've been able to tell, the problem only occurs with AF_UNIX
>> sockets.  In the example I have a 'server' socket and a 'client'
>> socket both set to non-blocking.  The client connects to the socket,
>> returning errno EINPROGRESS as expected.  Then I do a pselect on the
>> client socket to wait until it is ready to be read from.  The hang
>> only happens when I pselect on the client socket, and not on the
>> server socket.  It doesn't seem to make a difference what the timeout
>> is.  One thing I have not tried is putting the client and server in
>> different processes, but the Python tests that this example reproduces
>> run both in the same process.
>>
>> Below is (I think) the most relevant output from strace on the test
>> case.  It seems to hang somewhere in socket_cleanup, but I haven't
>> investigated any further than that.
>
> I made a little bit of progress debugging this, but now I'm stumped.
> It seems the problem is this:
>
> For each socket whose fd is passed to select() a thread_socket is
> started which calls peek_socket until there are bits ready on the
> socket, or until the timeout is reached.  This in turn calls
> fhandler_socket::evaluate_events.
>
> The reason it's only locking up on my "client thread" on which
> connect() is called, is that evaluate_events notes that the socket is
> waiting to connect, and this passes control to
> fhandler_socket::af_local_connect().  af_local_connect() temporarily
> sets the socket to blocking, then sends a magic string to the socket
> (you can see in my strace log that this succeeds).  What's strange,
> and what I don't understand, is that there are no FD_READ or FD_OOB
> events recorded for the WSASendTo call from af_local_send_secret().
> Then, after af_local_send_secret() it calls af_local_recv_secret().
> This calls recv_internal() which in turn calls recursively into
> fhandler_socket::evaluate_events where it waits for an FD_READ or
> FD_OOB event that never arrives.  And since it set the socket to
> blocking it just sits in an infinite loop.
>
> Meanwhile the timer for the select() call expires and tries to shut
> down the thread_socket but it can't because it never completes.
>
> What I don't understand is why there is not an event recorded for the
> WSASendTo in send_internal.  I even wrapped it with the following
> debug code to wait for an FD_READ event immediately following the
> WSASendTo:
>
>       else if (get_socket_type () == SOCK_STREAM)
>         {
>           WSAEventSelect (get_socket (), wsock_evt, EVENT_MASK);
>           res = WSASendTo (get_socket (), out_buf, out_idx, &ret, flags,
>                            wsamsg->name, wsamsg->namelen, NULL, NULL);
>           debug_printf ("WSASendTo sent %d bytes; ret: %d", ret, res);
>           while (!(res = wait_for_events (FD_READ | FD_OOB, 0)))
>             {
>               debug_printf ("Waiting for socket to be readable");
>             }
>         }
>
>
>
> But the strace at this point just outputs:
>    62  108286 [socksel] poll_test 24152 fhandler_socket::af_local_connect: af_local_connect called, no_getpeereid=0
>   156  108442 [socksel] poll_test 24152 fhandler_socket::send_internal: WSASendTo sent 16 bytes; ret: 0
>
> It never returns from send_internal.  I don't have deep knowledge of
> WinSock, but from what I've read ISTM WSASendTo should have triggered
> an FD_READ event on the socket, and it doesn't for some reason.

After playing around with this a bit more I came up with a much
simpler example.  It turns out the hang does not directly involve
select() at all.
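
For reference, the pselect() scenario from the first message can be
sketched roughly as follows.  This is a reconstruction from the
description in that message, not the attached test case: the helper
names set_nonblocking() and same_process_pselect(), the path and
timeout parameters, and the unlink() cleanup are all assumptions made
for illustration.

```c
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <time.h>
#include <unistd.h>

/* Put fd into non-blocking mode; returns 0 on success, -1 on error. */
static int set_nonblocking(int fd) {
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

/*
 * Hypothetical helper: listen and connect on `path` within one process,
 * then pselect() on the client socket for readability.
 * Returns pselect()'s result (0 means it timed out), or -1 if the
 * setup or pselect() itself failed.
 */
int same_process_pselect(const char *path, long timeout_sec) {
    struct sockaddr_un addr;
    struct timespec timeout = { timeout_sec, 0 };
    fd_set rfds;
    int server, client, ret = -1;

    memset(&addr, 0, sizeof(addr));
    addr.sun_family = AF_UNIX;
    strncpy(addr.sun_path, path, sizeof(addr.sun_path) - 1);
    unlink(path);  /* remove a stale socket file from an earlier run */

    server = socket(AF_UNIX, SOCK_STREAM, 0);
    client = socket(AF_UNIX, SOCK_STREAM, 0);
    if (server < 0 || client < 0)
        goto out;
    if (set_nonblocking(server) || set_nonblocking(client))
        goto out;
    if (bind(server, (struct sockaddr *)&addr, sizeof(addr)) ||
        listen(server, 5))
        goto out;

    /* A non-blocking connect may succeed at once or report EINPROGRESS. */
    if (connect(client, (struct sockaddr *)&addr, sizeof(addr)) &&
        errno != EINPROGRESS)
        goto out;

    /* Wait until the client is readable or the timeout expires. */
    FD_ZERO(&rfds);
    FD_SET(client, &rfds);
    ret = pselect(client + 1, &rfds, NULL, NULL, &timeout, NULL);

out:
    if (server >= 0)
        close(server);
    if (client >= 0)
        close(client);
    unlink(path);
    return ret;
}
```

Calling same_process_pselect("test.sock", 5) from main() should, on
Linux, simply time out, since nothing is ever written to the client.
On Cygwin this is the call sequence that never returned for me.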

The simplified example is just:

#include <arpa/inet.h>
#include <sys/socket.h>
#include <string.h>
#include <stdio.h>
#include <sys/un.h>
#include <errno.h>

int main(void) {
    int sock_server, sock_client;
    int retval;
    struct sockaddr_un addr;

    memset(&addr, 0, sizeof(addr));
    addr.sun_family = AF_UNIX;
    strcpy(addr.sun_path, "@test.sock");

    sock_server = socket(AF_UNIX, SOCK_STREAM, 0);
    if (bind(sock_server, (struct sockaddr*)&addr, sizeof(addr))) {
        printf("binding server socket failed\n");
        return 1;
    }

    retval = listen(sock_server, 5);
    printf("Ret from listen: %d\n", retval);

    sock_client = socket(AF_UNIX, SOCK_STREAM, 0);
    retval = connect(sock_client, (struct sockaddr*)&addr, sizeof(addr));
    printf("Ret from client connect: %d; errno: %d\n", retval, errno);

    return 0;
}


On Linux this example works as I expect, and the connect() call
returns immediately.  However, on Cygwin the connect() call hangs
after af_local_send_secret(), as described in my first message.

When I split this example up into separate client and server
processes, though, it works as expected: the connect() is properly
negotiated and returns immediately.
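
A rough sketch of that two-process arrangement is below.  It is not
the code I actually ran.  It assumes fork() is an acceptable stand-in
for two separate programs, and that the server side calls accept().
The helper name connect_from_child(), the 5-second poll() timeout and
the unlink() cleanup are likewise assumptions made for illustration.

```c
#include <poll.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/un.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: the parent binds and listens on `path`, and a
 * forked child connects to it.  Returns 0 if the child's connect()
 * succeeded, 1 if it failed, and -1 on a setup or wait failure.
 */
int connect_from_child(const char *path) {
    struct sockaddr_un addr;
    struct pollfd pfd;
    int server, status, result;
    pid_t pid;

    memset(&addr, 0, sizeof(addr));
    addr.sun_family = AF_UNIX;
    strncpy(addr.sun_path, path, sizeof(addr.sun_path) - 1);
    unlink(path);  /* remove a stale socket file from an earlier run */

    server = socket(AF_UNIX, SOCK_STREAM, 0);
    if (server < 0)
        return -1;
    if (bind(server, (struct sockaddr *)&addr, sizeof(addr)) ||
        listen(server, 5)) {
        close(server);
        return -1;
    }

    pid = fork();
    if (pid < 0) {
        close(server);
        unlink(path);
        return -1;
    }
    if (pid == 0) {
        /* Child: plays the client and reports the result via exit code. */
        int client = socket(AF_UNIX, SOCK_STREAM, 0);
        int rc = -1;
        close(server);
        if (client >= 0)
            rc = connect(client, (struct sockaddr *)&addr, sizeof(addr));
        _exit(rc == 0 ? 0 : 1);
    }

    /* Parent: plays the server; give the client up to 5 seconds. */
    pfd.fd = server;
    pfd.events = POLLIN;
    if (poll(&pfd, 1, 5000) > 0) {
        int conn = accept(server, NULL, NULL);
        if (conn >= 0)
            close(conn);
    }

    if (waitpid(pid, &status, 0) < 0)
        result = -1;
    else
        result = WIFEXITED(status) ? WEXITSTATUS(status) : -1;

    close(server);
    unlink(path);
    return result;
}
```

A main() that calls connect_from_child("test.sock") and prints the
result is enough to compare this against the single-process example.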

--
Problem reports:       http://cygwin.com/problems.html
FAQ:                   http://cygwin.com/faq/
Documentation:         http://cygwin.com/docs.html
Unsubscribe info:      http://cygwin.com/ml/#unsubscribe-simple