reliable TCP reconnect made easy

When I came to work on Syslog one of the most disturbing texts I came across was Rainer’s observation “On the (un)reliability of plain tcp syslog…“. The problem is that a sendmsg() system call is nearly always successful — it only indicates local errors (like a full send queue), but no network errors. So even after the other side initiated a connection shutdown one can happily write into the local buffer and only get an error on the second write.

Most applications and protocols use a request-response or a session model and do not have problems with this, because they simply reset their status on connection loss and start again. Syslog is different because it does not use a backchannel (for acks or other server responses) and completely relies on the one-direction channel for server and client to synchronize their states over long periods of time and across connection losses. After reading Rainer’s text it took me some time to believe the problem described, and even then I saw the words and accepted the reasoning, but it still felt wrong to believe in them.

This week I began my own Syslog programming with a small client/server program to test different IPC protocols; and the plain TCP variant nicely confirmed the observation above. Only after trying SCTP which did not have that problem, I realized that we asked the wrong question. (I did not dig into it but I guess the socket is marked broken really fast on connection shutdown, so the first write to it returns an error immediately). — sendmsg() is not our only interface to the TCP-layer and if sendmsg() does not tell us whether a connection is still established then we just have to use another interface.

My first experiment lead to a bad solution: I found that Linux and FreeBSD have a struct tcp_info with TCP state information that can be accessed with getsockopt(). This allows us to always check if the connection is still established:

getsockopt(sock, IPPROTO_TCP, TCP_INFO, &tinfo, &tinfo_size)
if (tinfo.tcpi_state != TCPS_ESTABLISHED) {
  if (sock) { close(sock); }
  /* reconnect */
}
/* send() */

After that I finally found the good and really simple solution: Just use recv() to check on the connection status. On connection shutdown recv() immediately indicates that the socket is unavailable and the application can react upon it by reconnecting:

rc = recv(sock, msgbuf, BUFSIZE, MSG_DONTWAIT | MSG_PEEK);
if (!rc) {
    if (errno == EAGAIN) {
        /* server closed connection */
        connected = 0;
    } else {
        perror("Error in recv()");
    }
}
if (!connected) {
    /* (re)connect */
    connected = 1;
}
/* send() */

Source code is online in my Trac. Note that the tcp_info version only works on FreeBSD; on Linux I did not find the necessary header files yet and NetBSD does not support tcp_info (bad solution anyway).

Update: Just a note on the intention and the kind of problem that is solved here. TCP has no atomic and synchronous send. Period and no need to argue.

With “real” network problems TCP does not show errors immediately because it was practically designed not to. (I have already worked on failur tolerance in a cluster computing environment and there are similar problems.)

The main problem I want to improve on is the simple LAN without connection or even packet loss. Where it is really unnecessary (and annoying) for a TCP-based syslog to derministically loose messages on a server reload, just because the client does not react upon a nice and clean connection shutdown.

Later Update (22.6.): One obvious problem is the additional system call per socket operation. So I am quite happy I do not have to use this in the BSD syslogd implementation. I found it much easier to register a BSD kernel event on the socket. That way the kernel immediately notifies the application when there is either data to read from the socket or when it is closed.

13 Responses to “reliable TCP reconnect made easy”

  1. Rainer Gerhards says:

    Hi Martin,

    I think this is a nice but partial solution. Note: I did not yet prove what I am talking about ;), but.. As of my understanding of TCP, this does not (actually MUST not) work if the connection really breaks (that is remote machine or network connection dies). In this case, you do not have a real shutdown.

    But the shutdown case should work. I’ll give it a try. More information on the netbsd maling list and my blog.

    Thanks for the good idea!
    Rainer

  2. Rainer Gerhards says:

    oops… one thing: EGAIN should not indicate a shutdown condition. It may happen when the call is interrupted or no data is present – and there never should be any data present on this socket (as the server never sends it). So I conclude that the code currently always does a reconnect… I had anticipated that you check the other error codes (the else case). I’ll continue that route..

    Rainer

  3. Martin says:

    Yes, the given code was and is incomplete and does not handle all cases. I also completely forgot to check if the return value was positive or negative :-/
    (rc = -1 and errno = EAGAIN is the ‘normal case’, if rc >= 0 then errno is meaningless because it is not reset.)

    Some more tests with regular connection closing as well as dropping all IP packets for some time led me to distinguish more possible cases: https://barney.cs.uni-potsdam.de/trac/syslogd/browser/trunk/testing/tcp/client_recv.c?rev=7

    Of course all of this is only tested, not proven correct ;)

  4. Rainer Gerhards says:

    I am sorry, but it doesn’t work out. I have now also been able to define why it is impossible to do reliable delivery over plain tcp without app-level acks: http://blog.gerhards.net/2008/05/why-you-cant-build-reliable-tcp.html

    Rainer

  5. Rainer Gerhards says:

    Martin, I saw you post from noon today only after posting my comment. You are now using two error states:

    74 } else if (rc == -1 && errno == ECONNRESET) {
    75 /* server closed connection */
    76 printf(“got ECONNRESET, reconnectn”);
    77 connected = 0;
    78 } else if (rc == -1 && errno == ENOTCONN)

    However, you do get ECONNRESET and ENOTCONN also from the send() call, but ONLY after the TCP stack has detected that the connection is broken. I propose this test: make the client connect to the server. Pull the network cable from the server. Make the client continue to send some data to the server. The client will be unable to detect the server failure for “n” records. Plug in the network cable to the server. Server will now receive the “lost” messages. So the client was right in assuming the server was available.

    Another test: connect to server as before. Do not pull the network cable but switch the machine off (do NOT shut it down). Client will not see any problems. Problem will be seen only after server comes back online (TCP connections can be reset at this point).

    Rainer

  6. Rainer Gerhards says:

    OK, I have updated my last blog post with references to RFC 793, which provides clear evidence that there is no way around this problem (at least for a compliant implementation). Rainer

  7. Rainer Gerhards says:

    Martin, I have read your update. This use case isn’t fixed either, at least not reliably. It is racy. You get the same results be just doing the send() without the recv() – just give it a try. ;) — Rainer

  8. Rainer Gerhards says:

    OK, I am not just ignorant ;) I’ve given it another try, while racy it looks like it indeed solves the problem in the majority of simple cases. I just wonder what happens under the hood. I’ll have a look :) — Rainer

  9. Rainer Gerhards says:

    Interesting. I’ve now done a wireshark trace. As it looks, the IP stack does NOT process the incoming connection request until after it actually sends the message. If you do the recv(), this forces this evaluation. I have not verified this with the tcp sources and I don’t think it makes much sense to do so. It’s probably a bit platform-specific, depending on what the IP stack does. Anyhow, it seems to greatly reduce the problem in non-failure cases and does not cause much performance pain. So it is worth adding it. I’ll now apply it once again to the real rsyslog and see how it performs there. I keep you posted. Rainer

  10. Rainer Gerhards says:

    OK, I have done some further stress testing. Unfortunately, the solution does not work as soon as the traffic gets higher. In client.c, you do a sleep. This sleep makes sure that the remote end has time to do its close and notify the client before the next message is sent. I modified you examples so that the sleep() is no longer used and 1000 messages are sent by the client. Then, the disconnect happens 1 byte into message 423 (stream was partly received). The next message I see is 701. The client also notices the problem at 701. So I have lost around 280 messages without anybody noticing it. Of course, on a non-busy server that does not happen, but it proves the point that even in a nice clean close scenario without packet loss you can lose data on a server restart. There is no reliable cure. I’ll now evaluate if it makes sense to add your suggested recv() logic for those cases where it helps. I am just unsure if these are a large enough sample to justify this code change. Anyhow… Please let me know any further findings you may have (did I overlook something)?

    Rainer

  11. Rainer Gerhards says:

    Hi Martin,

    I just wanted to let you know that I have written a final “analysis”: http://blog.gerhards.net/2008/06/getting-bit-more-reliability-from-plain.html . In short, I’ll see that I give it a try in rsyslog, too. It’s not a perfect solution (what on earth is that…), but it looks like it offers sufficient benefit to be worth implementing ;)

    Thanks for bringing it up.

    Rainer

  12. Rainer Gerhards says:

    Hi Martin,

    some more quick feedback: I have now implemented your approach in rsyslog, it will come out as part of 3.19.7. Initial testing confirms the findings – it works well in local, low-traffic environments, but still fails in others. However, even in the fail cases, the message loss is somewhat less than without the recv(). So I think it is useful to have this work-around in the code. Thanks for pointing it out :)

    Rainer

  13. jdmurray says:

    For syslog TCP reliability, would it not be easier to just prevent TCP from buffering the outgoing syslog messages? This would allow the syslog client to immediately know if the syslog message was sent successfully or not by the send exception thrown. The app itself would then keep track of the “first unsent” message.

    I have implemented a syslog TCP client using C# and .NET 2.0 that sends event log messages to a syslog TCP server. If the TCP connection is broken, the next syslog message sent will cause the TcpClient object to return a Winsock error (10061 or 10054) and my app can then keep track of which event messages were not sent until a new connection can be established. It works very well.