[Developers] Revised Ethernet Library Client code to address connection problems and incomplete disconnects

Bruce Luckcuck bruce at etracer.net
Thu Apr 2 20:55:30 EDT 2009


There is a related thread in the forums at:
http://www.arduino.cc/cgi-bin/yabb2/YaBB.pl?num=1238640832

Revised Client.cpp available from here (based on Arduino-0015):
http://www.etracer.net/arduino/ethernet/Client.cpp

Sorry, this is going to be a long email...

I've completed a significant revision to the Client portion of the Ethernet
library. The original motivation was to determine what was causing the
apparently random inability to open sequential connections. The other
symptom I observed was occasional connections that remained ESTABLISHED even
after a client.stop() was called. It turned out the two problems were
related.

One of the fundamental challenges with the Ethernet library is that only 4
sockets are available. This doesn't leave much wiggle room if a socket
becomes unavailable so we need to make sure that the sockets close cleanly
and release quickly so we have resources available for the next connect.

This was the fundamental problem with the problem relating to sequential
connections. The symptom was that relatively rapid connection requests (even
after a previous stop()) would randomly fail. What I found was that
sometimes after a client.stop() the socket would remain in a SOCK_FIN_WAIT
state for approximately 30 seconds until the state timed out. During this
time the socket was not available for connections. So if a sketch made 4
connect()/stop() cycles in less than 30 seconds, all of the sockets would be
"in use" and the 5 connection would fail. This was made more difficult to
diagnose because the client.stop() didn't always leave the socket in the
SOCK_FIN_WAIT state.

Since the SOCK_FIN_WAIT is a normal transition in an orderly TCP disconnect,
initially I assumed that there was some underlying problem in the Wiznet
TCP/IP stack that was improperly handling the orderly close. There didn't
appear to be much under our control since the close and disconnect simply
send commands to the chip. Since our side was in a SOCK_FIN_WAIT state, it
implies we sent the FIN but never received the ACK from the remote end.

So I set about adjusting the client.connect() method to make it consider a
socket in the SOCK_FIN_WAIT state to be equivalent to SOCK_CLOSED and
therefore eligible for use. The rationale is that the connection is dead
anyway and it's just going to time out. Plus we only have 4 sockets so we
have to be a little aggressive. The changes to connect() worked-around the
incomplete disconnects and mostly resolved the problem.

The next investigation was to determine why the incomplete disconnects were
occurring. The client.stop() code was so simple there didn't seem to be
anything to correct. The original code:

void Client::stop() {
  close(_sock);
  disconnect(_sock);
  EthernetClass::_server_port[_sock] = 0;
}

But the problem was right there. According to the Wiznet porting guide:
http://www.wizwiki.net/tc/regina/entry/W5100-Hardware-Design-Guide-2-MCU-Por
ting

The close() and disconnect() are mutually exclusive. The close() terminates
the connection without sending a FIN (what we don't want to do), and the
disconnect() initiates a graceful disconnect with FIN/ACK sequence (what
we'd prefer to do). So by calling close() to immediately sever the
connection, and then disconnect(), we were sending a FIN to a remote end
that was no longer there and would never respond with an ACK (thus the
timeout). So probably all that was needed is to remove the close(). I
enhanced it beyond that to attempt a graceful shutdown and give that 1
second before forcing an abrupt disconnect with a close(). Once again in the
mode of aggressively trying to ensure that the sockets are available. So
we're trying to let the connection close nicely, but aren't going to wait
around for it.

In both the client.connect() and stop() methods you'll note that I have a
delay(1) in the respective loops looking for a socket status change. I did
this to reduce the impact on the SPI bus with constant calls to the Wiznet
chip. The downside is that this required me to #include WProgram.h (but I
needed for the millis() function anyway). Also there tends to be an
unavoidable minimum 1ms delay on connects and disconnects because of this. I
tried to minimize it, but the polled state change never happens immediately
(you'll understand when you look at the code). I was trying to keep it
simple and still make it robust. I didn't think a 1ms overhead was a really
big deal anyway but look forward to comments.

There were also a number of potentially serious problems if some of the
methods were called out of order. As an example, look at the original
Client::stop() I listed above. If a previous connect failure occurred, _sock
would be set to 255 (by the connect() method). If the stop() method was
called (incorrectly) next, the value of _sock would cause an array
out-of-bounds in the statement:

EthernetClass::_server_port[_sock] = 0;

The would probably corrupt the heap and lead to all kinds of bad things. So
basically I added code to ensure that methods that relied on a Client having
an associated socket actually did. I also add code to initialize _sock to
255 in the constructor and reset it to 255 if a connect failure occurs or
after a stop(). This way we have a reliable indicator whether the Client
object is associated with a socket.

There was also another hole in the connect() method that was related to
_sock. The method was not checking to see if the client was already
associated with a socket (regardless of its connection state) before finding
the next available socket to use. So if a sketch erroneously made a second
connect() call before closing the previous connection, the initial socket
would become unrecoverable. It was unavailable and there was no way to
release it. This problem is also resolve by checking if _sock == 255 before
attempting the connect. If the client is already associated with a sockect,
then the connect now fails.

Along the connect() failure lines... Right now the library simply returns a
0 for failure and 1 or success. The problem is that there are multiple
causes for failures and no way to convey that information back to the
caller. This was one of the challenges in debugging the problems. The simple
answer might be to simply add more result codes. But the problem is that the
example code has set the standard that the result from connect() can be
treated as boolean since 0 = false and 1 = true. So adding additional result
codes for failures would erroneously be considered successful connects. I
don't have a really good answer for this since we're bounded by a lot of
existing code that assumes the boolean result. Maybe store the last connect
result code as a public variable that the user can interrogate if desired?
Maybe overload the connect() to return the result code as a by-reference
parameter? I'm looking for suggestions.

Well that's enough for now. I would appreciate any additional testing. I've
done quite a bit but you know how that goes. In the forum post I referenced
at the top of this email is a sample sketch that exercises the Client code
with rapid connects. I tried to be very descriptive with comments in the
code, but if anything is confusing please ask.

- Bruce





More information about the Developers mailing list