W5500 gets stuck on connect(), but only sometimes (infinite loop stuck on SOCK_SYNSENT)

I have an issue with my board based on the W5500. I have successfully implemented a TCP client that sends an HTTP message and then waits for a response. The behavior, however, is quite puzzling. About half the time everything works perfectly; the other half, the TCP socket opens, connects without error, and then hangs on send().

There is no return from that function, and it seems like the program is stuck in some sort of infinite loop.

When checking wizchip_gettimeout, I have a value of 2000 for timeout and 8 for retries. If I’m not mistaken 2000 means 0.2 seconds?

I’m not sure why it hangs, and why only half the time, and not sure where to start debugging.

Any help would be greatly appreciated.

First check the circuit. Ensure that you used 49.9 Ohm resistors and not, for example, 49.9 kOhm resistors. My suspicion is that the device is having difficulty finding a timeslot on the TX wire to send the packet, and hangs in the loop that waits for CR to clear to 0. Try connecting the W5500 to another network device (e.g. replace the hub/switch, or change the port on the hub/switch). But this must be proven: compile the code with checkpoints in the send() function (outputting some characters to the diagnostic console or anywhere else you can see them) to find out where the code gets stuck.

The circuit looks OK, but I will check it of course, and I will also add checkpoints. But it seems weird to me that DHCP and DNS are 100% reliable, with not a single failure. I would imagine a hardware error would affect those too, no?

I can lease an IP and resolve DNS 100% of the time. It’s send() for TCP that gets stuck.

Yeah, the circuit looks fine. 49.9 Ohm resistors are the proper value.

It’s TCP that has a 50% failure rate… I’m gonna try to put checkpoints in send().

Am I wrong to assume that a hardware issue would result in failures with DHCP leasing and DNS resolution as well as TCP? DHCP/DNS work absolutely flawlessly.

Did you use the ioLibrary code from WIZnet (https://github.com/Wiznet/ioLibrary_Driver)?

Yup, I did use the ioLibrary code from that GitHub repo! As I said, the behavior is super odd to me, because DHCP works 100%, DNS works 100%, but TCP hangs 50% of the time. I’m still debugging it, but was wondering if someone knows where to point me.

In my understanding if there was something wrong with the physical TX line, DNS & DHCP would hang just as much, am I wrong?

You’re right.
Very odd… Could you give the exact line where it gets stuck?
(At the free size check, before CMD_SEND, or after CMD_SEND?)

So, after some debugging, it turns out it gets stuck in socket connect() and not send(). Specifically, it gets stuck inside this loop:

while(getSn_SR(sn) != SOCK_ESTABLISHED)
{
    if (getSn_IR(sn) & Sn_IR_TIMEOUT)
    {
        setSn_IR(sn, Sn_IR_TIMEOUT);
        return SOCKERR_TIMEOUT;
    }

    if (getSn_SR(sn) == SOCK_CLOSED)
    {
        return SOCKERR_SOCKCLOSED;
    }
}

That’s the spot where it ends up stuck about 50% of the time (the rest of the time everything works great).

getSn_SR seems to be stuck on SOCK_SYNSENT, and never changes to anything else. Which is weird, because I have a value of 2000 for timeout and 8 for retries. So at the very least it should time out?

This is the relevant code in my application:

// create a TCP socket
if ((sck_status = socket(HTTP_DATA_SOCKET, Sn_MR_TCP, 0, 0)) == HTTP_DATA_SOCKET) {
    // connect to the server
    if ((con_status = connect(HTTP_DATA_SOCKET, (uint8_t*) dest_ip, (uint16_t) dest_port)) == SOCK_OK) {
        if ((send_status = send(HTTP_DATA_SOCKET, (uint8_t*) msg, strlen(msg))) != strlen(msg)) {
            printf_P(PSTR("[error] unable to send: %" PRIi32 "\r\n"), send_status);
            return;
        }

        close(HTTP_DATA_SOCKET);
    } else {
        printf_P(PSTR("[error] connection not available: %i\r\n"), con_status);
    }
} else {
    printf_P(PSTR("[error] unable to create socket: %" PRIi8 "\r\n"), sck_status);
}

When it works, it works flawlessly, but when it doesn’t, it reliably gets stuck in connect(), with the socket stuck on SOCK_SYNSENT.

Do you have a timeout in your SPI read/write driver?

How about checking the firewall settings, port number, and server application status on your test server (PC)?
I think no “SYN+ACK” reply packet is coming back from your test server (PC).

For these values the TCP timeout will be 31.8 seconds; see the RCR section of the datasheet, which has this value calculated. You just need to wait a little more :)
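The 31.8 second figure can be reproduced with a few lines of arithmetic. The sketch below assumes the scheme described in the RCR section of the W5500 datasheet: RTR counts units of 100 µs, the retransmission interval doubles after each attempt for as long as it still fits in 16 bits, and it then stays at that maximum for the remaining retries. The function names are made up for illustration and are not part of ioLibrary.

```c
#include <assert.h>
#include <stdint.h>

/* Approximate total TCP timeout in milliseconds for the given RTR and RCR
 * register values. RTR is in units of 100 us. The interval doubles after
 * each attempt until doubling again would exceed 65535, then stays capped.
 * The result is truncated to whole milliseconds. */
uint32_t w5500_tcp_timeout_ms(uint16_t rtr, uint8_t rcr)
{
    assert(rtr > 0);
    uint32_t total_units = 0;   /* accumulated time in 100 us units */
    uint32_t interval = rtr;    /* current retransmission interval */
    for (unsigned attempt = 0; attempt <= rcr; attempt++) {
        total_units += interval;
        if (interval * 2u <= 65535u) {
            interval *= 2u;
        }
    }
    return total_units / 10u;
}

/* Approximate ARP timeout in milliseconds: the interval is not doubled,
 * so it is simply RTR x (RCR + 1) in 100 us units. */
uint32_t w5500_arp_timeout_ms(uint16_t rtr, uint8_t rcr)
{
    return ((uint32_t)rtr * ((uint32_t)rcr + 1u)) / 10u;
}
```

With the values from this thread (timeout 2000, retries 8), the first function returns 31800 ms and the second 1800 ms, so a connect() that looks hung for several seconds may still be inside the chip's own retry window.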

As @Bong said, most probably some device located on the packet path filters out or loses the returning SYN/ACK packet.

I saw a similar issue while developing code for the W5500 and used Wireshark to debug it. I found the problem was that the W5500 wasn’t seeing ARP responses from the wireless device (an old Linksys router running DD-WRT as a bridge) I was trying to connect to. I saw the ARP request from the W5500 repeated over and over, but there was no response. I never figured out whether the problem was the Linksys router or the ASUS-WRT Merlin-based router it was connected to. Sometimes it worked just fine, other times it didn’t. I’ve never seen this with any other device.

I don’t know what happened after this, but @dannypovolotski, was your problem solved?
I was facing the same problem. I don’t know what fixed it, but it is working now.
Most probably the hang occurs when the sizes of the arrays or buffers assigned to the sockets do not match the initialization. I have faced this twice.
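One reading of the buffer-size remark is that the per-socket sizes passed at initialization have to form a valid allocation. The sketch below checks an 8-entry size array against the constraints in the W5500 datasheet: each socket buffer is 0, 1, 2, 4, 8 or 16 KB, and the sizes in one direction sum to at most 16 KB. The function name is hypothetical, and this may not be the exact mismatch the poster ran into.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define W5500_SOCKET_COUNT 8
#define W5500_BUF_TOTAL_KB 16

/* Returns true if the per-socket buffer sizes (in KB) for one direction
 * (TX or RX) use only allowed values and fit in the 16 KB available. */
bool w5500_buf_sizes_valid(const uint8_t sizes_kb[W5500_SOCKET_COUNT])
{
    unsigned total = 0;
    for (size_t i = 0; i < W5500_SOCKET_COUNT; i++) {
        uint8_t s = sizes_kb[i];
        if (s != 0 && s != 1 && s != 2 && s != 4 && s != 8 && s != 16) {
            return false;   /* not a size the chip supports */
        }
        total += s;
    }
    return total <= W5500_BUF_TOTAL_KB;
}
```

Running the TX array and the RX array through a check like this before handing them to the chip initialization would catch an oversized or odd-valued entry early.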

@Eugeny The timeout you are talking about is already set by default, and I never needed to change it. Have you tested this earlier?

I’m still having intermittent issues, and they are very weird. I’m opening two sockets at a time, with the same destination IP and everything. As before, DNS and DHCP are 100% stable, but sometimes it gets stuck on connect(). When it does get stuck, I don’t understand why the retry doesn’t help. Sometimes, after timing out, my code tries to open the connection again, and then it works. Most times the first socket connects without an issue but the second one hangs. However, that’s not every time.

The behavior is very weird and not very predictable.

I’m not entirely sure how DHCP and DNS can be rock solid while TCP connections are so fickle. Also, why would one socket to the same IP (I made sure the ports are different, etc.) open with more stability than two?

It also seems like when it DOES get stuck, the loop is infinite. It never actually times out :( … It gets stuck in an infinite loop somewhere in connect().
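To tell a slow chip-level timeout apart from a truly endless loop, the status polling can be bounded by an application-level deadline. The sketch below is not ioLibrary code. It takes two hypothetical callbacks, one that would wrap getSn_SR(sn) and one that would read a free-running millisecond timer, and gives up once the deadline passes. The fake_socket helpers only simulate a socket whose status never changes.

```c
#include <stdint.h>

/* Result codes for this sketch only; they are not ioLibrary return values. */
enum wait_result { WAIT_OK = 0, WAIT_CLOSED = -1, WAIT_DEADLINE = -2 };

/* Hypothetical callbacks: on real hardware these would wrap getSn_SR(sn)
 * and a free-running millisecond timer. */
typedef uint8_t  (*status_fn)(void *ctx);
typedef uint32_t (*millis_fn)(void *ctx);

/* Polls the status until it equals `wanted` or `closed`, or until
 * `deadline_ms` milliseconds have elapsed since the call started. */
enum wait_result wait_for_status(status_fn read_status, millis_fn now_ms,
                                 void *ctx, uint8_t wanted, uint8_t closed,
                                 uint32_t deadline_ms)
{
    uint32_t start = now_ms(ctx);
    for (;;) {
        uint8_t status = read_status(ctx);
        if (status == wanted) return WAIT_OK;
        if (status == closed) return WAIT_CLOSED;
        /* Unsigned subtraction stays correct across counter wraparound. */
        if ((uint32_t)(now_ms(ctx) - start) >= deadline_ms) return WAIT_DEADLINE;
    }
}

/* Simulated socket for demonstration: the status never changes and the
 * clock advances by one millisecond each time it is read. */
struct fake_socket { uint8_t status; uint32_t clock_ms; };

uint8_t fake_status(void *ctx) { return ((struct fake_socket *)ctx)->status; }
uint32_t fake_millis(void *ctx) { return ((struct fake_socket *)ctx)->clock_ms++; }
```

If a guard like this also never returns on the real board, the hang is probably below the loop, for example inside an SPI read that blocks forever, and not in the TCP state itself.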