W5500 appears to lock up under heavy traffic

We have an AWS IoT based device using the W5500 chip. The device works flawlessly when it connects directly to a router or a switch. However, it seems to be locking up if it is plugged into a wall jack in a large commercial building. Here are some of our observations:

  1. My company has several buildings and each has its own network segment. If this device is plugged into a wall jack in a relatively small building, typical IP, we do see disconnect events at the AWS IoT application layer. However, the device can recover/reconnect without a problem.

  2. If the device is plugged into a wall jack in the large multi-story office building, typical IP, it appears to lock up from time to time. When this problem happens:

    • There is no traffic coming out of W5500 when we put a network sniffer (TAP) on.
    • The ICMP ping gets no response.
    • We can still read/write all internal registers.
    • We monitor the SIPR (source IP address) register in case there is a reset on the W5500 side, but we haven’t seen one.
    • Again, if we install a LAN switch between the device and the wall jack, the problem seems to go away.
  3. The W5500 runs at 25 MHz and the circuit design follows the recommendations provided by WIZnet.

  4. The application firmware doesn’t generate much network traffic. Most of the time it sends an MQTT Keep Alive message once every 30 seconds.

  5. Not sure if this matters: in the firmware we select 10BT half-duplex with auto-negotiation disabled, even though the PMODE[2:0] pins are all pulled high.

Any thoughts? Thanks.

A router or switch filters the traffic, forwarding to the W5500 only the frames addressed to it, plus broadcasts.

I have used the W5100 (which has more or less the same engine) and found empirically that the W5x00 does not perform well under high load. In particular, the issue I caught is that when TCP communication gets out of order, the W5100 cannot recover properly from it.

But your case sounds different, as you say nothing is sent from the W5500. In my case it sends some packets when it times out (which can take up to 2 minutes).

You already have a workaround for the problem: buy the cheapest router you can find and put it in front of the W5500. Ugly, but it should work :(

You have done the initial troubleshooting, but more steps are needed to find the cause.

I usually use Wireshark, but I assume your network sniffer is placed correctly (connected directly to the W5500?) and can capture everything on the wire. How long did you wait before deciding that nothing was being sent? 30 seconds? 5 minutes? Remember that the W5500 works according to its timeout algorithm, which can make it look hung for minutes.

You need to monitor for longer to make sure there is really nothing on the wire for several minutes after the ping request is sent.

Dump all registers (the common block as well as the socket registers of every socket) and look for anomalies.
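When going through such a dump, PHYCFGR is worth decoding first, since its low bits report what the PHY actually ended up with. A minimal sketch, assuming the bit layout as I read it in the W5500 datasheet (bit 0 LNK, bit 1 SPD, bit 2 DPX); the type and function names are hypothetical, and the raw byte is assumed to have been read by your own driver:

```c
#include <stdbool.h>
#include <stdint.h>

/* PHYCFGR status bits, per my reading of the W5500 datasheet. */
#define PHYCFGR_LNK 0x01u /* 1 = link up */
#define PHYCFGR_SPD 0x02u /* 1 = 100 Mbps, 0 = 10 Mbps */
#define PHYCFGR_DPX 0x04u /* 1 = full duplex, 0 = half duplex */

/* Hypothetical container for the decoded status. */
typedef struct {
    bool link_up;
    bool speed_100;
    bool full_duplex;
} phy_status_t;

/* Decode a raw PHYCFGR byte that the driver has already read over SPI. */
phy_status_t phy_status_decode(uint8_t phycfgr)
{
    phy_status_t s;
    s.link_up     = (phycfgr & PHYCFGR_LNK) != 0;
    s.speed_100   = (phycfgr & PHYCFGR_SPD) != 0;
    s.full_duplex = (phycfgr & PHYCFGR_DPX) != 0;
    return s;
}
```

Logging this once while the device works and once while it is locked up would show whether the link itself drops or changes mode.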

That is a good way to detect reset.

Then logically, the W5500 receives something on your network that it considers wrong and cannot digest. Until we find out what it is, it is not possible to say anything certain. For this you must watch the traffic arriving at the W5500 (e.g. connect a bridge PC with two ports, one to the W5500 and the other to the wall jack, and run Wireshark on it).

Good that you mentioned it. I think you can easily change this software setting, at least for testing purposes, and see if W5500 behaves the same way.
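For the test, the software setting comes down to the value written to PHYCFGR. A rough sketch, assuming the layout as I understand it from the datasheet: bit 7 RST (1 = normal operation), bit 6 OPMD (1 = take the mode from the OPMDC bits instead of the PMODE pins), bits 5:3 OPMDC. The enum and function names are made up for illustration:

```c
#include <stdint.h>

#define PHYCFGR_RST  0x80u /* 1 = normal operation, 0 = PHY held in reset */
#define PHYCFGR_OPMD 0x40u /* 1 = mode from OPMDC bits, 0 = mode from PMODE pins */

/* OPMDC codes (bits 5:3), per my reading of the datasheet table. */
typedef enum {
    PHY_MODE_10BT_HALF   = 0x0, /* 000: auto-negotiation disabled (current firmware setting) */
    PHY_MODE_10BT_FULL   = 0x1, /* 001: auto-negotiation disabled */
    PHY_MODE_100BT_HALF  = 0x2, /* 010: auto-negotiation disabled */
    PHY_MODE_100BT_FULL  = 0x3, /* 011: auto-negotiation disabled */
    PHY_MODE_ALL_AUTONEG = 0x7  /* 111: all capable, auto-negotiation enabled */
} phy_mode_t;

/* Build the PHYCFGR byte that selects a mode in software. */
uint8_t phycfgr_for_mode(phy_mode_t mode)
{
    return (uint8_t)(PHYCFGR_RST | PHYCFGR_OPMD | (((unsigned)mode & 0x7u) << 3));
}
```

With this layout the current 10BT half-duplex setting corresponds to 0xC0 and "all capable, auto-negotiation enabled" to 0xF8. As far as I know the new mode only takes effect after a PHY reset (RST written to 0 and then back to 1), so treat the value alone as half of the procedure.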

Thanks for the reply.

Remember W5500 works according to its timeout algorithm,…

Can you tell me where to find this timeout algorithm you were referring to? I assume we are using the default settings at the moment since there is no code on our side that sets up anything related to timeout.

Perhaps saying that nothing was coming out of the W5500 was a bit premature, because I don’t think we knew what to look for at the time of the lock-up, or had waited long enough. The Wireshark capture, taken through the network sniffer, was showing tons of seemingly unrelated traffic.

FYI - This is the sniffer we are using:

We also suspect that there might be electrical/wiring issues in this large building, because when we put a network sniffer between the W5500 and the wall jack, we can’t seem to reproduce the problem. This makes the troubleshooting really difficult. We have to leave the sniffer out at first to make the problem happen, then insert it after the problem occurs. Of course, this causes the PHY layer to disconnect and then reconnect.

Pages 39-40 of the datasheet.
TCPto is the total timeout. It is progressive: each retry waits twice as long as the previous one, until doubling again would push the time value (in units of 100 µs) to 65536 or more. With the default RTR of 2000 (200 ms), the time between retransmissions at the later stages is therefore 6.4 seconds.
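To put a number on it, here is a small sketch of that progression. It is my interpretation of the datasheet formula rather than WIZnet code: the interval starts at RTR, doubles after every attempt while the doubled value still fits in 16 bits, and there are RCR + 1 attempts in total. The function name is hypothetical:

```c
#include <stdint.h>

/*
 * Total TCP timeout in units of 100 us, for a given RTR (retry time)
 * and RCR (retry count). The interval doubles after each attempt
 * until doubling again would exceed the 16-bit range.
 */
uint32_t w5500_tcp_timeout_units(uint16_t rtr, uint8_t rcr)
{
    uint32_t total = 0;
    uint32_t interval = rtr;

    for (unsigned attempt = 0; attempt <= rcr; attempt++) {
        total += interval;
        if (interval * 2u <= 0xFFFFu) {
            interval *= 2u; /* still fits in 16 bits: keep doubling */
        }
    }
    return total;
}
```

With the defaults (RTR = 2000, RCR = 8) this gives 2000 + 4000 + 8000 + 16000 + 32000 + 4 × 64000 = 318000 units, i.e. 31.8 seconds before the socket reports a timeout, so a capture should run at least that long before concluding that the chip is silent.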

Is any of it related to the W5500 (with the W5500 as destination or source)?

In my experience there can be such issues. I have a simple ASUS hub, and sometimes when I power the W5100 up it goes crazy, showing continuous incoming traffic (the RX LED is on and the activity LED blinks constantly) when there is no traffic at all. Only a hard reset (sometimes several) helps. But it seems to be related to this particular hub, and I suspect it has more to do with the electronics than with the driving algorithm or a logical problem. I recall there were also reports that the W5x00 does not work very well with Cisco switches, and that the solution was to preset the switch port to the desired specs (instead of relying on its auto-negotiation capabilities).

Do you have only one W5500-based device under consideration?

And that does not fix the situation? Usually, if the PHY is the cause, resetting it unlocks the chip and network access to and from it is restored.

Can you build a PHY reset procedure (PHYCFGR bit 7) into your W5500 driver, so that you can execute it when the chip locks up and see if it helps?
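Something along these lines, as a sketch only. It assumes PHYCFGR sits at 0x002E in the common register block and that RST (bit 7) is active low, i.e. writing 0 resets the PHY and it has to be set back to 1 afterwards. The `w5500_io_t` hooks stand in for whatever SPI read/write and delay routines your driver already has, and the 1 ms hold time is my guess, not a datasheet figure:

```c
#include <stdint.h>

#define W5500_PHYCFGR_ADDR 0x002Eu /* common register block */
#define PHYCFGR_RST        0x80u   /* bit 7: 0 = PHY in reset, 1 = normal */

/* Hypothetical hooks into the existing driver. */
typedef struct {
    uint8_t (*read_common)(uint16_t addr);
    void    (*write_common)(uint16_t addr, uint8_t value);
    void    (*delay_ms)(uint32_t ms);
} w5500_io_t;

/* PHYCFGR value with the reset asserted (RST = 0), other bits kept. */
uint8_t phycfgr_reset_asserted(uint8_t cfg)
{
    return (uint8_t)(cfg & (uint8_t)~PHYCFGR_RST);
}

/* PHYCFGR value with the reset released (RST = 1), other bits kept. */
uint8_t phycfgr_reset_released(uint8_t cfg)
{
    return (uint8_t)(cfg | PHYCFGR_RST);
}

/* Pulse the PHY reset while preserving the configured operation mode. */
void w5500_phy_reset(const w5500_io_t *io)
{
    uint8_t cfg = io->read_common(W5500_PHYCFGR_ADDR);

    io->write_common(W5500_PHYCFGR_ADDR, phycfgr_reset_asserted(cfg));
    io->delay_ms(1); /* assumed hold time */
    io->write_common(W5500_PHYCFGR_ADDR, phycfgr_reset_released(cfg));
}
```

Calling `w5500_phy_reset()` from the lock-up detection path and then polling the LNK bit would show whether the PHY alone is stuck or whether the whole chip needs a reset.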

There is another product line being developed at the moment. However, the software on the cloud side isn’t as ready as the product I am currently testing.

I’ll see if I can do this. Good idea. Thanks.