TCP stalls when socket is spammed


#1

Hi all -

My company uses the W5200 in a product that has been deployed for years. A recent prospective customer, while evaluating our product, discovered that the TCP handler of the W5200 could be made to stall (without recovery) by inundating it with nonsensical messages.

The test scenario was simply sending packets with a payload of a single character (the letter “A” in their case) as fast as possible. As these packets were being received, I could observe through Wireshark that the TCP window was shrinking. It soon became zero, and never recovered.

Repeating this test with as little as 1ms between packets prevented this issue.

Is this a known problem?
Is there any available workaround?
Would one of the newer WizNet devices resolve this problem?

I can provide a Wireshark trace file if desired.

Thank you.


#2

Please look at what is going on at the microcontroller interface.

This can happen only if the MCU does not “extract” data from the RX buffers of the W5200. That’s why it is important to see what the MCU is doing when it receives these ‘A’ messages. It may be a failure of the W5200 driver, and not of the W5200 TCP/IP stack.

Information encapsulated in a TCP packet will have some format anyway, one the application running in the device understands. What is the device expected to do when it receives ‘A’? What does it actually do?

If you increase the delay between packets, does the window size start increasing? Run a test with a variable delay.

Your explanation is clear - the W5200 always reports window size 0, and this is what we will see in the log. So the log is not of much value until you figure out how the device treats ‘A’ and how fast it flushes data from its RX buffers.
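The mechanism described in this reply can be pictured with a toy model. This is only a sketch, not W5200 firmware: it assumes the advertised TCP window simply equals the free space in the socket's RX buffer, and every name in it (`rx_model_t`, `on_segment`, `on_mcu_release`) is made up for illustration.

```c
#include <stdint.h>

#define RX_BUF_SIZE 2048u  /* per-socket RX buffer size mentioned in the thread */

/* Hypothetical model: bytes received but not yet released by the MCU. */
typedef struct {
    uint16_t unread;
} rx_model_t;

/* Window the chip could advertise: whatever RX space is still free. */
static uint16_t advertised_window(const rx_model_t *m)
{
    return (uint16_t)(RX_BUF_SIZE - m->unread);
}

/* A segment arrives; accept only as many bytes as fit in the free space. */
static uint16_t on_segment(rx_model_t *m, uint16_t len)
{
    uint16_t room = advertised_window(m);
    uint16_t taken = (len < room) ? len : room;
    m->unread = (uint16_t)(m->unread + taken);
    return taken;
}

/* The MCU releases n bytes (on the real chip: advance RX_RD, then RECV). */
static void on_mcu_release(rx_model_t *m, uint16_t n)
{
    if (n > m->unread) {
        n = m->unread;
    }
    m->unread = (uint16_t)(m->unread - n);
}
```

In this model, one-byte segments that are never released drive the window to zero after 2048 bytes, and it reopens only by the amount the MCU releases.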


#3

The MCU is a legacy 8-bit processor (AVR ATmega1284P). Its resources, both compute and memory, are quite limited. The original programmer, in an effort to save space, designed the application to leave data in the WizNet buffer until a complete message has been received, or a timer elapses.

The incoming messages are encrypted, so this adds to the difficulty of knowing when a complete message has been received.

The 1ms delay results in the TCP window remaining within ~20 bytes of its 2048 capacity. I don’t think there’s much point to testing with further delays.


#4

So it is clearly a design flaw, right? It is just not designed for the conditions you are testing under.
The application/MCU must free the RX buffer in time for the W5200 to be able to receive data.


#5

How do I free the buffer? It’s not clear from the datasheet. The current code does this:
void HOSRxFlush(void)
{
    HosRxPut = HosRxGet = 0; // Rx buffer pointers
}

Which seems to be updating pointers, but not actually freeing any memory.


#6

Chapter 5 of the W5100’s datasheet has full information on what should be done with the RX and TX buffers and their pointers. You copy the data, and then update the pointer(s). If you do not need the data you can just update the pointer to the proper value - this is the fastest way to “clear” the buffer, but data usually matters :slight_smile:
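As a rough illustration of "copy data, then update pointer", here is a sketch against a simulated chip. The array `chip_rx_mem`, the variable `chip_rx_rd` and the function `rx_copy_out` are stand-ins invented for this example, and the 2048-byte ring is taken from the buffer size quoted earlier in the thread. It assumes the read pointer is a free-running 16-bit counter whose low bits give the offset inside the ring, and that `len` never exceeds the ring size.

```c
#include <stdint.h>
#include <string.h>

#define RX_SIZE 2048u
#define RX_MASK (RX_SIZE - 1u)

/* Stand-ins for the chip: its RX memory and the 16-bit read pointer. */
static uint8_t  chip_rx_mem[RX_SIZE];
static uint16_t chip_rx_rd;

/* Copy len bytes (len <= RX_SIZE) out of the ring, then advance the pointer. */
static void rx_copy_out(uint8_t *dst, uint16_t len)
{
    uint16_t off   = (uint16_t)(chip_rx_rd & RX_MASK); /* offset inside the ring */
    uint16_t first = (uint16_t)(RX_SIZE - off);        /* bytes before the wrap  */

    if (first > len) {
        first = len;
    }
    memcpy(dst, &chip_rx_mem[off], first);
    memcpy(dst + first, &chip_rx_mem[0], (size_t)(len - first));

    /* The pointer keeps counting; only the masked offset wraps. */
    chip_rx_rd = (uint16_t)(chip_rx_rd + len);
    /* On real hardware the RECV command would be issued at this point. */
}
```

Discarding data is the same operation without the two `memcpy` calls.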


#7

Presumably about the same as the W5200’s datasheet? (I’m using the W5200.)

So, it appears that my MCU isn’t fast enough to properly service the incoming data, and I should be prepared for the eventuality of a ZeroWindow problem. Can I diagnose this from the MCU and reset the socket? (This would be a big help if possible.)


#8

I think so. Some datasheets are missing valuable information (e.g. the W5500’s has only a tiny piece of what is explained in the W5100’s chapter 5).

It depends on the application of your design. If incoming data is not expected to look like a DDoS, you can live with it easily (provided nothing on the network starts behaving this way).

Not sure I understand. You want to discard data? Then you just update the pointer to the current position, this way “freeing” the buffer. No need to reset the socket or even the whole device.

You can consider alternative solutions - depending on the MCU model, there could be pin-compatible (or even code-compatible) replacements with higher processing speed. I know Microchip is good at making such product lines of slower - faster - fastest devices which can be put into the same socket and require only minor changes to the code when porting the application.

It also depends on how you service incoming data - by polling WIZnet chip registers or by using interrupts.


#9

Thanks for the good information, Eugeny. What I meant in my earlier post is, I think that – as our product is currently built – we need to address the issue of the TCP Window filling up, because the MCU can’t remove data from the W5200 as quickly as the W5200 can receive it from the network.

When the buffer is full, the W5200 issues a ZeroWindow message. This will go on forever until the system is rebooted. My question is, is there an intermediate way to clear the buffer? The routine I posted above doesn’t do it.

We as a company are willing to look at replacing our MCU (and upgrading from the W5200 to the W5500 if need be), but I want to know what we can do with our current configuration.


#10

You must update the RX buffer pointers, and not forget about the RECV command.

The picture you posted shows that the MCU does not “remove” any data: it does not touch the RX pointers at all and/or does not perform the RECV command afterwards. I cannot advise further, as I do not know what your application is doing and how it can tolerate this “data loss” (when data is just skipped).

There are a number of ways to consider, e.g. waiting for a specific data size in the buffer and only then reading it, but their fit to your environment is very application-dependent.
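One of those options, waiting for a given amount of data and otherwise discarding a stale partial message, could be sketched as a small decision function. This is a hypothetical policy, not something taken from the product's code: it assumes fixed-size messages (which the thread does not confirm, since the messages are encrypted), and the names `rx_policy`, `msg_len`, `idle_ms` and `timeout_ms` are invented here.

```c
#include <stdint.h>

typedef enum { RX_WAIT, RX_READ_MESSAGE, RX_DISCARD } rx_action_t;

/* Decide what to do with pending RX data.
 *   pending    bytes currently reported by Sn_RX_RSR
 *   msg_len    size of one complete message (assumed fixed-size framing)
 *   idle_ms    time since the pending count last changed
 *   timeout_ms how long a partial message may sit in the buffer
 */
static rx_action_t rx_policy(uint16_t pending, uint16_t msg_len,
                             uint32_t idle_ms, uint32_t timeout_ms)
{
    if (pending >= msg_len) {
        return RX_READ_MESSAGE;   /* a whole message is available */
    }
    if (pending > 0 && idle_ms >= timeout_ms) {
        return RX_DISCARD;        /* stale fragment: skip it and free the buffer */
    }
    return RX_WAIT;               /* nothing to do yet */
}
```

The point of the `RX_DISCARD` branch is that a timeout always ends with the buffer being released, so the window cannot stay at zero indefinitely.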


#11

Hi Eugeny -

Yes, you’re correct that the application doesn’t remove any data upon timeout. This may be a bug in our app, and would explain the current behavior.

Right now, I’m prepared to live with losing data, as long as I can prevent this TCP stall. I’ve read section 5.2.1.1 in the data sheet, and I believe what I need to do is reset the pointers Sn_RX_RD and Sn_RX_WR for the socket that is giving me trouble. Can I clear these pointers directly, or do I need to execute a RECV command?

Thanks for all the assistance.


#12

You must issue the RECV command after changing the pointer.


#13

So, write to the socket registers in this order:

W5200RegWrite(0x4328, 0x0);
W5200RegWrite(0x4329, 0x0);
W5200RegWrite(0x432a, 0x0);
W5200RegWrite(0x432b, 0x0);
W5200RegWrite(0x4301, 0x40); // the RECV command

Is that correct? Do I need to do something else to commit the command?

Thanks.


#14

I do not think so. Why do you write zeros? Which registers are you writing to? What must you write in these registers?


#15

0x4328 is Sn_RX_RD0
0x4329 is Sn_RX_RD1
0x432a is Sn_RX_WR0
0x432b is Sn_RX_WR1

0x4301 is Sn_CR; the 0x40 is the RECV command.

If I don’t change the pointers to 0, what should I change them to, in order to flush the buffer?


#16

The datasheet is pretty clear about it on page 50:

Sn_RX_RD += len;

where

len = Sn_RX_RSR;

What this does: you shift the RD pointer past the received data, whose size is given by RSR. Then perform the RECV command, which will mark the skipped buffer area as free for new incoming data.

If you assign 0 to the pointers, it will break the TCP sequence and most probably prevent further communication from working correctly, due to improper window indication and calculation.
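The `Sn_RX_RD += len` step can be sketched against a simulated register file. `sock_regs_t`, `get16`, `put16` and `rx_discard_all` are names made up for this example, not part of any WIZnet driver. The sketch also assumes that register 0 of each pair holds the high byte, which is how WIZnet's own driver code treats these registers as far as I know; check the datasheet before relying on it.

```c
#include <stdint.h>

#define CMD_RECV 0x40u  /* RECV command value quoted in the thread */

/* Simulated socket registers (stand-ins for SPI register access). */
typedef struct {
    uint8_t rx_rsr[2];  /* [0] = high byte, [1] = low byte (assumed order) */
    uint8_t rx_rd[2];   /* same byte order */
    uint8_t cr;
} sock_regs_t;

static uint16_t get16(const uint8_t r[2])
{
    return (uint16_t)((r[0] << 8) | r[1]);
}

static void put16(uint8_t r[2], uint16_t v)
{
    r[0] = (uint8_t)(v >> 8);
    r[1] = (uint8_t)(v & 0xFFu);
}

/* Skip everything currently received: RD += RSR, then RECV.
 * Returns the number of bytes discarded. */
static uint16_t rx_discard_all(sock_regs_t *s)
{
    uint16_t len = get16(s->rx_rsr);
    uint16_t rd  = (uint16_t)(get16(s->rx_rd) + len); /* wraps modulo 65536 */

    put16(s->rx_rd, rd);
    s->cr = CMD_RECV;   /* tells the chip the skipped area is free again */
    return len;
}
```

Note that the pointer is advanced relative to its current value rather than set to zero, and that the addition is done on the full 16-bit value so a carry out of the low byte is not lost.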


#17

Oh…so:

u8 uc0;
u8 uc1;

W5200RegRead(Sn_RX_RSR0, &uc0);
W5200RegRead(Sn_RX_RSR1, &uc1);

W5200RegWrite(Sn_RX_RD0, uc0);
W5200RegWrite(Sn_RX_RD1, uc1);

W5200RegWrite(Sn_CR, RECV); 

Is that better?

Thanks, Eugeny.

EDIT: oops, no that’s not right…hang on a minute, please…I’ll correct it.


#18

Here’s another go at it:

u8 uc0;
u8 uc1;
u16 rsr;
u16 rd;

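W5200RegRead(Sn_RX_RSR0, &uc0);   // register 0 holds the high byte
W5200RegRead(Sn_RX_RSR1, &uc1);   // register 1 holds the low byte
rsr = ((u16)uc0 << 8) | uc1;

W5200RegRead(Sn_RX_RD0, &uc0);
W5200RegRead(Sn_RX_RD1, &uc1);
rd = ((u16)uc0 << 8) | uc1;

rd += rsr;   // 16-bit wraparound is intended

uc0 = rd >> 8;     // high byte goes back to register 0
uc1 = rd & 0xff;   // low byte goes back to register 1

W5200RegWrite(Sn_RX_RD0, uc0);
W5200RegWrite(Sn_RX_RD1, uc1);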

W5200RegWrite(Sn_CR, RECV); 
rc = recv(SockTCPL, HosRxDat.bad, &len, &(HosRxIp.ul), &port);	

Does this look correct now?


#19

Read the octet pairs into 16-bit unsigned variables (RSR and RD), then sum them up, and write the resulting 16-bit value octet by octet into RD.

So the second try seems to be the correct one :slight_smile: I am sure you will see the result immediately when running it.

After you perform RECV, wait until the CR register clears to 0 before going further.

rc = recv(SockTCPL, HosRxDat.bad, &len, &(HosRxIp.ul), &port);

What does this one do? Is it that you instruct it to read data from the socket? Then there’s no sense in the previous code. Or I would say this is a redundant call, as you have just cleared the buffer.
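The advice above to wait until CR clears after RECV might look like the following. It is a sketch against a simulated register, not real SPI access: `sim_cr`, `sim_busy_polls`, `cr_write`, `cr_read` and `sock_command` are all hypothetical names, and the poll limit is an added safeguard so that a wedged chip cannot hang the MCU forever.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for the chip: Sn_CR reads back nonzero for a few polls, then 0. */
static uint8_t  sim_cr;
static unsigned sim_busy_polls;

static void cr_write(uint8_t cmd)
{
    sim_cr = cmd;
}

static uint8_t cr_read(void)
{
    if (sim_busy_polls > 0) {
        sim_busy_polls--;
        return sim_cr;      /* command still pending */
    }
    sim_cr = 0;             /* command has been accepted */
    return 0;
}

/* Issue a command and poll until Sn_CR reads 0; give up after max_polls reads. */
static bool sock_command(uint8_t cmd, unsigned max_polls)
{
    cr_write(cmd);
    while (max_polls-- > 0) {
        if (cr_read() == 0) {
            return true;
        }
    }
    return false;           /* caller decides how to recover */
}
```

On the real device, `cr_write` and `cr_read` would be replaced by the project's own register access routines for Sn_CR.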


#20

Well, this looks hopeful. I may have solved the stall problem. I am now seeing, however, tons of TCP Dup ACK messages from my device:

Any idea what could be causing this?