Why has my device started failing to connect 90% of the time despite nothing having changed?

I am working with a device that uses a W5500. I have had very little to do with its development; it was produced years ago by someone no longer with the company. Until recently I had no problems with it, but (despite no changes being made to FW or HW) it now usually fails to connect. I have tried:

  • reverting to a previous version just in case something has changed;
  • testing on multiple devices to check nothing has been damaged or anything;
  • three different routers in different locations to confirm it’s not simply a bad connection;
  • debugging it myself as best I can.

It seems to use this library:

for socket.c and wizchip_conf.c and w5500.c

The code where it should connect:

int ethNI_tcpConnect(fbNI_sN_t sN, uint32_t timeout) {
  _ethNI_tcpSock_t * sock = (_ethNI_tcpSock_t *) sN;
  int retval = 0, count = 0;
  uint8_t status;
  uint32_t startTime = HAL_GetTick();
  retval = connect(sock->sN, sock->ip, sock->port);
  if(retval == SOCK_BUSY) {
    do {
      getsockopt(sock->sN, SO_STATUS, &status);
      if((HAL_GetTick() - startTime) > timeout) {
        retval = -40;
        break;
      }
      count++;
    } while(status != SOCK_ESTABLISHED);
  }


  if(retval < 0) {
    #ifdef DEBUG_PRINT
    debug_printf("TCP connect failed: %i, count: %i\n", retval, count);
    #endif
  }

  return retval;
}

Most of the time it doesn’t connect, so status != SOCK_ESTABLISHED until the timeout occurs (whatever the length of timeout). The program then restarts, but if it fails the first time it will never succeed. Occasionally after resetting it will connect (always first time), so there’s nothing fundamentally wrong with the whole thing.

The arguments (sock->sN etc) in connect() and getsockopt() are the same every time, whether it connects or fails.

Any ideas what might be causing this? If there’s any more information needed to diagnose the problem I’ll do my best to provide it.

I believe you should have started by looking at the network traffic rather than digging into the device. Set up machine as a bridge or within the same LAN (but be sure it gets all the packets going to and from W5500) and see what is going on with the communication. Capture device must be as close to W5500 as possible. Share here the log if it is not confidential.

Network settings outside of the W5500 may have been changed, either somewhere on the packet path or on the target device (location) the W5500 tries to connect to. The logs will show what is going on (or what is NOT going on).

Thanks for the answer. I have a few questions though, sorry if they’re pretty basic.

Set up machine as a bridge or within the same LAN
Capture device must be as close to W5500 as possible
Share here the log if it is not confidential.

What is the “machine” and “capture device” here? What logs should I share? Thanks again.

The machine is any PC (or anything else) that has Wireshark installed.
The capture device is that same machine: it captures the traffic into a buffer, and you then save this buffer into a log file (the logs).

You put this machine (PC) in between the W5500 and the W5500’s outer world, bridging the PC’s two connections (so that traffic goes through the PC in both directions and it can intercept the traffic). Install Wireshark on this machine (PC) and run it to capture the traffic under consideration. Then save the captured traffic into a .pcapng file for analysis. If you have the necessary knowledge, look into these logs; maybe you will immediately see something wrong and be able to fix it.

I’ve not used wireshark before, so I’ve tried a few tutorials but I’m having some trouble with a lot of the terminology. They all seem to assume a lot of knowledge I don’t have. I’m not sure how to set things up so the traffic goes through the PC. Please could you explain in more detail, or perhaps point me towards a tutorial that explains how to use wireshark for this specific setup? Is there anything I need other than the faulty device, a PC with wireshark, and an ethernet cable? In the meantime I’ll continue to try to understand on my own.

Thank you again, I’m extremely grateful for your help.

The first two in the search list:
Help to set up a “pass through bridge” sniffer
Capture network traffic with Wireshark

Install Wireshark first, play with it, see how data is captured and look into the packet details (even if you do not fully understand what is in there).
And once you are more or less experienced at making Wireshark captures, experiment with bridging or network sharing.

I understand the basics of how to use wireshark in general, which seems to be what the second link explains. What I’m struggling with is how to set everything else up so that I can use wireshark to capture the relevant traffic. I don’t know what the physical setup should be. Maybe the first link explains this, but to me it seems to have skipped this step as the person who asked the question already understood it. In any case, I got as far as

  • in that window, left-click one of the adaptors you want to enbridge, and then hold the Ctrl button and left-click the other one
  • now right-click any of them and choose “enbridge” (the 4th item from the top), a new “network card” will be created.

but I only have one ethernet port. I do have a VAR11N-300 repeater/bridge, can that be used in place of a usb ethernet dongle? Or can I just use wifi? What should the PC be connected to via the ethernet cable(s) - one to the router and one to the board with the W5500 that I am debugging? Or does the board need to be connected to the router?

I tried reading this for example: Ethernet · Wiki · Wireshark Foundation / wireshark · GitLab
but found it impossible to follow.

You are right, the PC must have at least 2 LAN adapters. As for the type: the one facing the W5500 clearly has to be wired RJ-45; the other end does not matter (wired, WiFi or whatever). The PC must be on the path of the packets: one of its network interfaces is connected to the W5500, and the other to the network the W5500 was usually connected to. If it makes things slightly clearer, the PC is a “man in the middle” which will monitor and log everything going through it.

Managed to get it working, thanks. What should I be looking for? Looks like the relevant packets are TCP/HTTP? I can see that when it fails to connect I get a few “Bad TCP” packets, and when it succeeds I get a lot of TCP/HTTP packets.

When it fails I get a (not Bad) TCP packet with info:
49155 → 80 [SYN] Seq=0 Win=2048 Len=0 MSS=1460
followed by several Bad TCP packets with info either:
[TCP Retransmission] [TCP Port numbers reused] 49155 → 80 [SYN] Seq=0 Win=2048 Len=0 MSS=1460
or
[TCP Port numbers reused] 49155 → 80 [SYN] Seq=0 Win=2048 Len=0 MSS=1460

When it works I get several which look similar to the Bad TCP packets:
80 → 49155 [SYN, ACK] Seq=0 Ack=1 Win=26883 Len=0 MSS=1452
49155 → 80 [ACK] Seq=1 Ack=1 Win=2048 Len=0
49155 → 80 [PSH, ACK] Seq=1 Ack=1 Win=2048 Len=470 [TCP segment of a reassembled PDU]

Is that at all useful or do I need to be looking in the more detailed logs for each packet? I’d rather avoid sending the entire log, just to be safe.

Your problem is that you are reusing the same local port number every time. You must use a different local port for every connection (e.g. +1 for each new connection). I do not know why it was working before, but in some cases it will not work because networking equipment will think that the current packet is a continuation of the previous TCP session with the same port number.

There are 8 sockets right? Should I just loop through all 8 every time until I get a connection?

No. When you open the next TCP connection, you must use a different local port number. Use some global value between 1000-60000 and perform +1 on this global variable after you use it to open a new socket.

If you always use the same port number (in your case 49155) for TCP, networking nodes may have difficulty identifying which socket (data stream) a specific packet relates to. I am sure some of them can, but not all nodes are that intelligent. Reusing the same port number for subsequent connections is also a bad idea because the next packet may be thought to belong to a connection which was already closed. Therefore, the rule is to use a unique port number for every open socket, and to postpone reuse of a just-used port number for as long as possible (so that caches expire and networking equipment has ‘forgotten’ the previous connection with the same port number).
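The “global value, +1 after each use” idea could be sketched roughly as below. This is only a minimal illustration: the names `LOCAL_PORT_MIN`, `LOCAL_PORT_MAX`, `next_local_port` and `local_port_take()` are made up for this example, and the range is simply the 1000-60000 mentioned above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical range, taken from the "1000-60000" suggestion. */
#define LOCAL_PORT_MIN 1000u
#define LOCAL_PORT_MAX 60000u

/* Next source port to hand out. Being ordinary static data, it keeps
 * its value only until the next MCU reset. */
static uint16_t next_local_port = LOCAL_PORT_MIN;

/* Return a source port for a new TCP connection and advance the counter,
 * wrapping back to LOCAL_PORT_MIN after LOCAL_PORT_MAX has been used. */
uint16_t local_port_take(void)
{
    uint16_t port = next_local_port;

    /* Sanity check: the counter must never leave the chosen range. */
    assert(port >= LOCAL_PORT_MIN && port <= LOCAL_PORT_MAX);

    next_local_port = (port == LOCAL_PORT_MAX)
                          ? (uint16_t)LOCAL_PORT_MIN
                          : (uint16_t)(port + 1u);
    return port;
}
```

The returned value would be passed as the port argument whenever a socket is opened for a new connection, so that two consecutive connections never share a source port.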

Do you use W5500 in server mode or client mode? You ‘get’ connection when you are in server mode, you ‘make’ connection when you are in client mode.
In client mode you may say ‘get connected’ (which is applicable for server mode too).

I see, is that just something you should know in general (regardless of hardware/libraries used) before you start?

In the code in my original post, sock->port in connect() is 80, not 49155. Am I looking in the wrong place? I haven’t yet found anywhere else in the code where 49155 could be used as a port. The port that is used is determined by:
params->port = (params->useHTTPS)?443:80;
which from a quick google seems to make sense.

When does a new port number need to be used - or to put it another way: when is a new TCP connection opened? Does it happen constantly, or only on startup, or any time connection is lost and has to be remade? Or none of those?

Sorry for the confusion, I used “get” in the colloquial/non-technical sense. I was just being imprecise with my language, I should have used “make”.

This was the clear indication from Wireshark. Port 80 is the destination port number; I am talking about the socket’s source port number. I have no idea how it appears in the socket structure/W5500 registers, but this Wireshark warning says the source port number does not change between subsequent connections. You must look closely into your W5500 software.

I believe it is not in the code you provided. That code performs the connect; the socket is opened earlier.
The library routine is called socket(). I suspect that uint16_t port is what we are looking for.

Okay, I had thought that might be the case but wasn’t able to find another obvious place for the port to be chosen.

The relevant call to socket() uses port == 0. From what I understand, that means a suitable port should be chosen automatically. In socket(), if port == 0 then it tries to find a port starting with 49152, the first dynamic/private port available. For some reason, every time, it repeatedly calls socket() until port 49155 is reached, and then uses that. So I just need to fix that.
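To convince myself why the same number keeps coming out, I found it helpful to think of it as a toy model like the one below. This is not the library’s code, only a simplified stand-in for “start at 49152 and count up”; every name in it (`auto_port`, `model_auto_port()`, `model_reset()`, `port_of_nth_call()`) is made up for the illustration.

```c
#include <assert.h>
#include <stdint.h>

#define FIRST_DYNAMIC_PORT 49152u /* start of the dynamic/private range */

/* Hypothetical stand-in for an internal "next automatic port" counter. */
static uint16_t auto_port = FIRST_DYNAMIC_PORT;

/* Model of socket() with port == 0: hand out the counter, then increment. */
uint16_t model_auto_port(void)
{
    return auto_port++;
}

/* Model of a microcontroller reset: static data gets its initial value. */
void model_reset(void)
{
    auto_port = FIRST_DYNAMIC_PORT;
}

/* Source port chosen by the n-th automatic allocation after a fresh boot. */
uint16_t port_of_nth_call(unsigned n)
{
    uint16_t port = 0;

    assert(n > 0); /* there is no "0th" call */
    model_reset();
    while (n--) {
        port = model_auto_port();
    }
    return port;
}
```

In this model the fourth allocation after a boot is always 49155, no matter how many times the device has been restarted before, which matches what the capture shows.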

I think I understand what’s going on at this point. Unless something I’ve just written is wrong, I’m guessing there’s not much more you can do. Thank you so much for all the patience and help! Don’t know if I’d ever have got this far without it.

I can. Look here, it is a place where source port number is set. Put debug output here (so that you see output when this call/action is performed).

Another possibility (I am not sure how plausible this guess is) is that your code does not close the socket at all and does not call socket() again, but instead reuses the same socket settings, set once, for all subsequent connect() calls. This would explain the use of the same source port number. But for sure, you need to dig into the code. This should not be hard if you know how things must work; you just need to find the flaw in the algorithm or design of the program.

For more detail see W5100 datasheet “5.2.1.2 CLIENT mode”. Before a socket can be operated it must be in the closed state, and then be opened with socket() (and this is when the new source port is set). And after use/termination the socket must be back in the closed state (being close()'d or auto-closed due to the socket life cycle).

Yeah, I worked out where the port was set. It turned out that if it failed to connect, the whole microcontroller would be reset, so of course sock_any_port in that link you sent would be re-initialised with its initial value for the next attempt, and the port would always be the same.
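For anyone hitting the same thing, one possible way around it (a sketch only, not necessarily what I ended up doing) is to start the port counter from a value that differs between resets instead of from a constant. All names here (`DYN_PORT_MIN`, `DYN_PORT_MAX`, `local_port_seed()`, `local_port_next()`) are hypothetical, and where the `entropy` value comes from is an assumption left to the platform, e.g. a hardware random number generator or a boot counter kept in memory that survives a reset.

```c
#include <assert.h>
#include <stdint.h>

#define DYN_PORT_MIN 49152u /* first dynamic/private port */
#define DYN_PORT_MAX 65535u /* last valid port number */

/* Next source port to use; reset to DYN_PORT_MIN on every boot unless
 * local_port_seed() is called during start-up. */
static uint16_t next_port = DYN_PORT_MIN;

/* Call once at boot with a value that differs from one reset to the next
 * (assumption: the platform can supply such a value). */
void local_port_seed(uint32_t entropy)
{
    uint32_t span = DYN_PORT_MAX - DYN_PORT_MIN + 1u; /* 16384 ports */

    next_port = (uint16_t)(DYN_PORT_MIN + (entropy % span));
}

/* Return the next source port, wrapping around inside the dynamic range. */
uint16_t local_port_next(void)
{
    uint16_t port = next_port;

    assert(port >= DYN_PORT_MIN); /* counter must stay in the range */

    next_port = (port == DYN_PORT_MAX) ? (uint16_t)DYN_PORT_MIN
                                       : (uint16_t)(port + 1u);
    return port;
}
```

With this, a reset after a failed connection attempt no longer leads straight back to the same source port, as long as the seed really does change between boots.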

I’ve fixed that, although the code I’m working with is a bit of a mess so who knows, perhaps I’ll have created a new problem.

Thanks again.
