My company uses the W5200 in a product that has been deployed for years. A recent prospective customer, while evaluating our product, discovered that the TCP handler of the W5200 could be made to stall (without recovery) by inundating it with nonsensical messages.
The test scenario was simply sending packets with a payload of a single character (the letter “A” in their case) as fast as possible. As these packets were being received, I could observe through Wireshark that the TCP window was shrinking. It soon became zero, and never recovered.
Repeating this test with as little as 1ms between packets prevented this issue.
Is this a known problem?
Is there any available workaround?
Would one of the newer WizNet devices resolve this problem?
Please look at what is going on at the microcontroller interface.
This can happen only if the MCU does not “extract” data from the RX buffers of the W5200. That’s why it is important to see what the MCU is doing when it receives these ‘A’ messages. It may be a failure of the W5200 driver, and not of the W5200 TCP/IP stack.
Information encapsulated in a TCP packet will have some format anyway, one that the application running in the device understands. What is the device expected to do when it receives ‘A’? What does it actually do?
If you increase the delay between packets, does the window size start increasing? Run a test with a variable delay.
Your explanation is clear: the W5200 always reports a window size of 0, and this is what we will see in the log. So the log is not of much value until you figure out how the device treats ‘A’ and how fast it flushes data from its RX buffers.
The MCU is a legacy 8-bit processor (AVR ATMega1284P). Its resources, both compute and memory, are quite limited. The original programmer, in an effort to save space, designed the application to leave data in the WizNet buffer until a complete message has been received, or a timer elapses.
The incoming messages are encrypted, so this adds to the difficulty of knowing when a complete message has been received.
The 1ms delay results in the TCP window remaining within ~20 bytes of its 2048 capacity. I don’t think there’s much point to testing with further delays.
Chapter 5 of the W5100 datasheet has full information on what should be done with the RX and TX buffers and their pointers. You copy the data, and then update the pointer(s). If you do not need the data, you can just update the pointer to the proper value; this is the fastest way to “clear” the buffer, but data usually matters.
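The copy-then-update sequence described above could look roughly like the sketch below for socket 0 of a W5200. This is an illustration under assumptions, not the product's driver: the register addresses, the 2 KB RX map at 0xC000 and the RECV value 0x40 are as I read them in the W5200 datasheet and should be checked against your copy, and `bus_read8`/`bus_write8` are hypothetical stand-ins (here backed by a RAM array) for the real SPI access routines.

```c
#include <stdint.h>

/* Values as I read them in the W5200 datasheet; verify before use. */
#define S0_CR      0x4001u  /* socket 0 command register          */
#define S0_RX_RSR  0x4026u  /* socket 0 received size (16 bit)    */
#define S0_RX_RD   0x4028u  /* socket 0 RX read pointer (16 bit)  */
#define S0_RX_BASE 0xC000u  /* socket 0 RX memory, default 2 KB   */
#define S0_RX_MASK 0x07FFu  /* 2 KB - 1                           */
#define CR_RECV    0x40u

/* Hypothetical bus layer: a RAM array standing in for the SPI driver. */
static uint8_t  sim_mem[0x10000];
static unsigned recv_issued;            /* counts RECV commands (simulation only) */

static uint8_t bus_read8(uint16_t a) { return sim_mem[a]; }

static void bus_write8(uint16_t a, uint8_t v)
{
    if (a == S0_CR) {                   /* command register reads back 0 once accepted */
        if (v == CR_RECV) recv_issued++;
        return;
    }
    sim_mem[a] = v;
}

static uint16_t bus_read16(uint16_t a)
{
    return (uint16_t)((bus_read8(a) << 8) | bus_read8((uint16_t)(a + 1)));
}

static void bus_write16(uint16_t a, uint16_t v)
{
    bus_write8(a, (uint8_t)(v >> 8));
    bus_write8((uint16_t)(a + 1), (uint8_t)v);
}

/* Copy up to max bytes out of socket 0's RX buffer, then release them. */
uint16_t s0_recv(uint8_t *dst, uint16_t max)
{
    uint16_t len = bus_read16(S0_RX_RSR);
    if (len > max) len = max;
    if (len == 0) return 0;

    uint16_t rd = bus_read16(S0_RX_RD);
    for (uint16_t i = 0; i < len; i++) {
        /* The pointer is free-running; masking gives the offset, which wraps at 2 KB. */
        dst[i] = bus_read8((uint16_t)(S0_RX_BASE + ((rd + i) & S0_RX_MASK)));
    }

    bus_write16(S0_RX_RD, (uint16_t)(rd + len)); /* advance the read pointer    */
    bus_write8(S0_CR, CR_RECV);                  /* tell the chip about it      */
    while (bus_read8(S0_CR) != 0) { }            /* wait until command accepted */
    return len;
}
```

On a real part it is also common to read Sn_RX_RSR more than once until two reads agree, since the chip may be updating it during the read; the sketch skips that for brevity.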
Presumably about the same as the W5200’s datasheet? (I’m using the W5200.)
So, it appears that my MCU isn’t fast enough to properly service the incoming data, and I should be prepared for the eventuality of a ZeroWindow problem. Can I diagnose this from the MCU and reset the socket? (This would be a big help if possible.)
I think so. Some datasheets miss valuable information (e.g. the W5500 datasheet has only a tiny piece of what is explained in chapter 5 of the W5100 datasheet).
It depends on the application of your design. If incoming data is not expected to look like a DDoS, you can live with it easily (provided nothing on the network starts behaving this way).
Not sure I understand. You want to discard data? Then you just update the pointer to the current position, thereby “freeing” the buffer. No need to reset the socket or even the whole device.
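As a rough illustration of discarding by pointer update, the sketch below advances the socket 0 read pointer past everything pending and then issues RECV. It rests on assumptions: the register addresses and the RECV value 0x40 are as I read them in the W5200 datasheet, `bus_read8`/`bus_write8` are hypothetical stand-ins for the SPI driver, and the simulated RECV handling (recomputing the received size as write pointer minus read pointer) is only a model of the chip so the example can run on a PC.

```c
#include <stdint.h>

/* Values as I read them in the W5200 datasheet; verify before use. */
#define S0_CR      0x4001u  /* socket 0 command register                */
#define S0_RX_RSR  0x4026u  /* socket 0 received size (16 bit)          */
#define S0_RX_RD   0x4028u  /* socket 0 RX read pointer (16 bit)        */
#define S0_RX_WR   0x402Au  /* RX write pointer, used by the model only */
#define CR_RECV    0x40u

/* Hypothetical bus layer: a RAM array standing in for the SPI driver. */
static uint8_t sim_mem[0x10000];

static uint8_t bus_read8(uint16_t a) { return sim_mem[a]; }

static uint16_t bus_read16(uint16_t a)
{
    return (uint16_t)((bus_read8(a) << 8) | bus_read8((uint16_t)(a + 1)));
}

static void bus_write8(uint16_t a, uint8_t v)
{
    if (a == S0_CR) {
        if (v == CR_RECV) {
            /* Model (assumption): on RECV the chip recomputes RSR = WR - RD. */
            uint16_t left = (uint16_t)(bus_read16(S0_RX_WR) - bus_read16(S0_RX_RD));
            sim_mem[S0_RX_RSR]     = (uint8_t)(left >> 8);
            sim_mem[S0_RX_RSR + 1] = (uint8_t)left;
        }
        return;                         /* command register reads back 0 */
    }
    sim_mem[a] = v;
}

static void bus_write16(uint16_t a, uint16_t v)
{
    bus_write8(a, (uint8_t)(v >> 8));
    bus_write8((uint16_t)(a + 1), (uint8_t)v);
}

/* Throw away everything pending in socket 0's RX buffer.
 * Returns the number of bytes discarded. */
uint16_t s0_rx_discard(void)
{
    uint16_t pending = bus_read16(S0_RX_RSR);
    if (pending == 0) return 0;

    uint16_t rd = bus_read16(S0_RX_RD);
    bus_write16(S0_RX_RD, (uint16_t)(rd + pending)); /* skip the data, no copy  */
    bus_write8(S0_CR, CR_RECV);                      /* make the chip notice it */
    while (bus_read8(S0_CR) != 0) { }                /* wait until accepted     */
    return pending;
}
```

The sketch only ever writes the read pointer and the command register; the socket stays open and no reset is involved.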
You can consider alternative solutions: depending on the MCU model, there may be pin-compatible (or even code-compatible) replacements with higher processing speed. I know Microchip is good at making such product lines of slower, faster, and fastest devices, which can be put into the same socket and require only minor code changes when porting the application.
It also depends on how you service incoming data: by polling the WIZnet chip registers or by using interrupts.
Thanks for the good information, Eugeny. What I meant in my earlier post is, I think that – as our product is currently built – we need to address the issue of the TCP Window filling up, because the MCU can’t remove data from the W5200 as quickly as the W5200 can receive it from the network.
When the buffer is full, the W5200 issues a ZeroWindow message. This will go on forever until the system is rebooted. My question is, is there an intermediate way to clear the buffer? The routine I posted above doesn’t do it.
You must update the RX buffer pointers, and not forget the RECV command.
The picture you posted shows that the MCU does not “remove” any data: it does not touch the RX pointers at all and/or does not perform the RECV command afterwards. It is not possible to advise further, as I do not know what your application is doing and how well it can tolerate this “data loss” (when data is just skipped).
There are a number of approaches to consider, e.g. waiting for a specific data size in the buffer and only then reading it, but their fit to your environment is very application-dependent.
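One way to picture the "wait for a specific size, otherwise give up" idea is as a small decision function called from the polling loop. This is only a sketch under assumptions: the names `rx_decide`, `rx_action_t` and `RX_BUF_SIZE` are made up for illustration, the 2048-byte buffer size is the figure mentioned earlier in the thread, and the rule that a full buffer or an expired timer leads to a discard is one possible policy, not a recommendation for every application.

```c
#include <stdint.h>

#define RX_BUF_SIZE 2048u   /* per-socket RX buffer size quoted in the thread */

typedef enum { RX_WAIT, RX_READ, RX_DISCARD } rx_action_t;

/* Decide what to do with the socket's RX buffer.
 *   pending    - bytes currently reported in Sn_RX_RSR
 *   needed     - size of a complete message, or 0 if not known yet
 *   idle_ms    - time since the first unread byte arrived
 *   timeout_ms - how long the application is willing to wait
 */
rx_action_t rx_decide(uint16_t pending, uint16_t needed,
                      uint16_t idle_ms, uint16_t timeout_ms)
{
    if (needed != 0 && pending >= needed)
        return RX_READ;        /* a whole message is available */

    if (pending >= RX_BUF_SIZE)
        return RX_DISCARD;     /* buffer full: the window is 0 and nothing more can arrive */

    if (pending != 0 && idle_ms >= timeout_ms)
        return RX_DISCARD;     /* incomplete message and the timer elapsed */

    return RX_WAIT;
}
```

The caller would map RX_READ to its normal copy-and-RECV routine and RX_DISCARD to a pointer update followed by RECV, so the buffer is always released instead of being left full when the timer elapses.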
Yes, you’re correct that the application doesn’t remove any data upon timeout. This may be a bug in our app, and would explain the current behavior.
Right now, I’m prepared to live with losing data, as long as I can prevent this TCP stall. I’ve read section 220.127.116.11 in the data sheet, and I believe what I need to do is reset the pointers Sn_RX_RD and Sn_RX_WR for the socket that is giving me trouble. Can I clear these pointers directly, or do I need to execute a RECV command?