W5300 SENDOK and Reserved Status Byte

I’m having a recurring issue sending data with the W5300. I have 5 test units, and each unit is using a W5300 and sending data at a very high speed, about 80Mbits/second total, mostly evenly divided among 4 sockets.

They ran for hours, and then eventually they all failed at different times. The failure mode is that I am waiting for the SENDOK bit to be set in Sn_IR, and it never gets set for one of the sockets. The Sn_SSR remains good, 0x17 socket established.

I do notice that the Reserved byte of the status register (for example 0x288 for socket 2) is different. For the sockets that are working, I see 0x01 in this, but for the frozen socket, I have seen either 0x89 or 0x8B. Is that reserved register giving me some useful status information? Can I use it to know that something has happened so I shouldn’t expect the SENDOK bit to get set? Can I recover from this state?

In general, what would cause the SENDOK bit to never get set if the connection is still established? The client is not receiving any data, but reports no errors.

Thank you.

Update: I tried to see if the status goes through any temporary conditions during a normal write sequence, and found the following:
After writing data to the TX FIFO: Sn_SSR (including the reserved byte) was one of 0x07A0, 0x07C0, 0x07E0, 0x0800
After writing Sn_TX_WRSR: Sn_SSR is back to normal 0x0117
After writing SEND command: Sn_SSR remains normal 0x0117

I have not been able to duplicate seeing 0x8917 or 0x8B17 in the reserved+SSR bytes in any normal operation; this only seems to occur when the freeze up happens.

By the way, I have the keepalive set for 5 seconds, but I haven’t checked during the freeze up if the keepalive messages are going through. I’ll try to duplicate it and check with Wireshark.

Update 2: I was able to duplicate the failure with 3 units running overnight. For the “frozen” units, I used Wireshark to analyze any traffic and found the following:

The W5300 is trying to send a re-transmission over and over with a sequence number near the 32-bit rollover, and the ACK response has a non-matching sequence number. For example, the retransmission is 1460 bytes with sequence 0xFFFFFA70, and the ACK responds with sequence 0x000000DC, which is the original sequence offset by 1644 instead of 1460.

The other two units failed in similar fashion:

  • one retransmitting 1460 bytes with sequence #0xFFFFFAD8, and the ACK coming back as 0x000001C4 (off by 1772 instead of 1460)
  • one retransmitting 1460 bytes with sequence #0x0000088A, and the ACK coming back as 0x00000FF6 (off by 1900 instead of 1460)
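The offsets quoted above follow from TCP sequence arithmetic, which is modulo 2^32. A minimal C sketch that reproduces them (the helper name `seq_distance` is made up for illustration and is not part of any W5300 driver):

```c
#include <stdint.h>

/* Distance from a segment's sequence number to the ACK number that came back.
 * TCP sequence numbers wrap modulo 2^32, and unsigned 32-bit subtraction in C
 * wraps the same way, so the result stays correct across the rollover.
 * seq_distance is a hypothetical helper name used only for this example. */
uint32_t seq_distance(uint32_t seq, uint32_t ack)
{
    return ack - seq;
}
```

For the three captures, `seq_distance(0xFFFFFA70, 0x000000DC)` is 1644, `seq_distance(0xFFFFFAD8, 0x000001C4)` is 1772, and `seq_distance(0x0000088A, 0x00000FF6)` is 1900, where an ACK for exactly one full segment would give 1460.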

I looked at some of the normal data stream before the freeze up, and it looks like the packets were most often 1460 bytes, with some occasional smaller packets of sizes: 340, 416, 1028, 1112, 1368, 1452. It also seems common for the W5300 to send multiple packets at high speed and only get a single ACK.

I was wondering if it’s possible that the W5300 is sending multiple packets around the time the sequence rolls over, and a single ACK is not being processed correctly. Any thoughts or workaround ideas?

Hi, abes!!
I’m so sorry for the late answer.

Your problem is an extremely rare case.
For it to occur, your application must be retransmitting packets very often.

I think that, in your network environment, your peer system is not sending the ACK packet that the W5300 expects for the data it sent.

Let me know how you send large data.
If you send as much data at once as the socket’s free buffer size, I recommend splitting the data into 1460-byte units, and moving the SENDOK check to just after the SEND command is issued, as follows.

uint32   send(SOCKET s, uint8 * buf, uint32 len)
{
   uint8 status=0;
   uint32 ret=0;
   uint32 freesize=0;
   #ifdef __DEF_IINCHIP_DBG__
      uint32 loopcnt = 0;

      printf("%d : send()\r\n",s);
   #endif
   
   ret = len;
   if (len > getIINCHIP_TxMAX(s)) ret = getIINCHIP_TxMAX(s); // check size not to exceed MAX size.
   
   

   /*
    * \note if you want to use non blocking function, <b>"do{}while(freesize < ret)"</b> code block 
    * can be replaced with the below code. \n
    * \code 
    *       while((freesize = getSn_TX_FSR(s))==0); 
    *       ret = freesize;
    * \endcode
    */
   // -----------------------
   // NOTE : CODE BLOCK START
   do                                   
   {
      freesize = getSn_TX_FSR(s);
      status = getSn_SSR(s);
      #ifdef __DEF_IINCHIP_DBG__
         printf("%d : freesize=%ld\r\n",s,freesize);
         if(loopcnt++ > 0x0010000)
         { 
            printf("%d : freesize=%ld,status=%04x\r\n",s,freesize,status);
            printf("%d:Send Size=%08lx(%d)\r\n",s,ret,ret);
            printf("MR=%04x\r\n",*((vuint16*)MR));
            loopcnt = 0;
         }
      #endif
      if ((status != SOCK_ESTABLISHED) && (status != SOCK_CLOSE_WAIT)) return 0;
   } while (freesize < ret);
   // NOTE : CODE BLOCK END
   // ---------------------
   
   wiz_write_buf(s,buf,ret);                 // copy data

   #ifdef __DEF_IINCHIP_DBG__
      loopcnt=0;
   #endif   

  ////////////////////////////////////////////////////
  // Removed: the old SENDOK check that ran before the SEND command
  /*
   if(!check_sendok_flag[s])                 // if first send, skip.
   {
      while (!(getSn_IR(s) & Sn_IR_SENDOK))  // wait previous SEND command completion.
      {
      #ifdef __DEF_IINCHIP_DBG__

         if(loopcnt++ > 0x010000)
         {
            printf("%d:Sn_SSR(%04x)\r\n",s,status);
            printf("%d:Send Size=%08lx(%d)\r\n",s,ret,ret);
            printf("MR=%04x\r\n",*((vuint16*)MR));
            loopcnt = 0;
         }
      #endif
         if (getSn_SSR(s) == SOCK_CLOSED)    // check timeout or abnormal closed.
         {
            #ifdef __DEF_IINCHIP_DBG__
               printf("%d : Send Fail. SOCK_CLOSED.\r\n",s);
            #endif
            return 0;
         }
      }
      setSn_IR(s, Sn_IR_SENDOK);             // clear Sn_IR_SENDOK	
   }
   else check_sendok_flag[s] = 0;
   */
   // send
   setSn_TX_WRSR(s,ret);   
   setSn_CR(s,Sn_CR_SEND);

  /////////////////
  // Added: check SENDOK right after issuing the SEND command
      while (!(getSn_IR(s) & Sn_IR_SENDOK))  // wait for this SEND command to complete.
      {
      #ifdef __DEF_IINCHIP_DBG__

         if(loopcnt++ > 0x010000)
         {
            printf("%d:Sn_SSR(%04x)\r\n",s,status);
            printf("%d:Send Size=%08lx(%d)\r\n",s,ret,ret);
            printf("MR=%04x\r\n",*((vuint16*)MR));
            loopcnt = 0;
         }
      #endif
         if (getSn_SSR(s) == SOCK_CLOSED)    // check timeout or abnormal closed.
         {
            #ifdef __DEF_IINCHIP_DBG__
               printf("%d : Send Fail. SOCK_CLOSED.\r\n",s);
            #endif
            return 0;
         }
      }
      setSn_IR(s, Sn_IR_SENDOK);             // clear Sn_IR_SENDOK	
   
   return ret;
}

If you can, please send packet capture files.
Thank you.

I don’t know how often re-transmits occur, but the data flows very well most of the time.

The peer system is sending ACKs back. Are you saying that the ACK is not what the W5300 expects, so there is a failure to recognize the ACK?

I have attached a Wireshark capture file that shows IP addresses 192.168.1.52 and 192.168.1.55 continually retransmitting. The peer is 192.168.1.79, and it keeps sending the same ACK. This is the state when I am waiting for the SEND_OK bit to be set. This is two separate units that froze up in this state.

I already do check for the SEND_OK flag after every send command, and I also clear it every time it gets set.

For sending a lot of data, I have large TX buffers of 27kBytes for each socket, and I fill them up whenever data is available. There may be less than 1460 bytes, and there may be more than 1460 bytes. If the W5300 already breaks up a large amount of data into MTU sized packets, how will it help for me to do that manually?

Even if this is rare, we will be sending large amounts of data in bursts, and this has happened many times. Every time I found it looks like the TCP sequence number is near rollover. If you can confirm that it has something to do with the TCP sequence number rolling over, maybe we can come up with a workaround.
capture with 52_1 and 55_3 frozen.zip (44.9 KB)

Hi,
In the attached capture file,
the peer seems to send back an ACK number greater than the ACK number the W5300 expects.
So the W5300 retransmits continuously until it receives the expected ACK number.
I can’t analyse why the W5300 does that, because the packets before the retransmission were not captured.

Anyway, I think your problem can be solved if you do the following.

Assume that the data size is 27 Kbytes.

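uint32 data_size = 27*1024;
uint32 send_size = 0;
uint8  buf[27*1024] = "...."; // any data
uint8 *p = buf;               // next byte to send

while(data_size > 0)
{
   if(data_size > 1460) send_size = 1460;
   else send_size = data_size;
   if( send(s, p, send_size) != send_size) goto error_proc; // error_proc: your own error-handling label
   p += send_size;            // advance past the bytes just sent
   data_size -= send_size;
}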

Thank you.

Yes, I noticed that the ACK numbers didn’t match up. There were many occasions when ACKs were sent out of order, or for multiple packets, so I assumed that is normal. But I have only seen the glitch when the sequence number is near 0xFFFFFFFF.

Unfortunately, there is too much data to collect it all. Maybe I’ll find a way to have Wireshark filter or just save header information and then I can catch the packets right before the failure.

I’ve made the recommended change to only send 1460 bytes at a time, and I’ll see how it goes.

Thank you.

I made the change to only send 1460 bytes at a time, and it did not fix the problem. I have 4 systems that ran overnight, and each one has 4 sockets open. So far, one of the sockets on one of the systems froze up. Again, the TCP sequence number is right near the 32-bit rollover:

W5300 transmitted sequence number: FFFFFC6F
peer ack number: 00000287

Also, I closed the peer program, and the frozen socket has status SSR of 1D, SOCK_LAST_ACK for many minutes. I have not changed the default values of RTR and RCR, so I would expect it to time out and close after some short time.

Hi, Abes!

I’m so sorry that your problem is not solved.

To analyze your problem further, I need more detailed information.

  1. Source code.
  2. Captured packets from before and after this situation. (I want to see why the ACK packet is lost.)
  3. How each socket’s buffer size is configured.
  4. What is the peer? (A Windows or Linux based system, or some other system?)

Most of all, number 2 is needed.

If you can test in another network environment, please try it (for example, a local network test, a different switch, etc.).

Thank you.

I figured out how to capture just the sequence numbers near the rollover point with Wireshark, and I was able to capture one of the failure modes.

The attached file shows the problem starting around packet number 5870.

Event 5870: The PC sends an ACK with the ack sequence number in the middle of a packet.
Event 5871: The W5300 sends another packet of data.
Event 5872: The W5300 sends a FIN flag.

After that, the PC sends a FIN and tries to reconnect, but the W5300 is still trying to do something and it gets messy.

The peer is a PC running Windows 7 and Labview. I don’t know how the buffers are configured, but I notice that sometimes the window size shrinks by more than the amount of data sent on a particular socket, so I think Labview or Windows is sharing some buffer space somehow, and the window size is not really available due to other traffic. We will need to see if we can allocate separate buffers for each port. Could this be the source of the problem? Should the W5300 be able to handle an ACK that only accepted part of a packet? Is that even acceptable for TCP?

I have not seen another occurrence where the PC ack sequence is only accepting part of a message, but it may be happening. I have only seen the freezing behavior associated with the time the sequence number is rolling over, so I’m guessing there is some correlation.
w5300_fin1.zip (903 KB)

Hi,

I’d like to understand how you are testing your devices in your network environment.
From your capture file, it looks like you have 5 devices. Is that right?
All of the devices misbehave because of unknown error events.
Each device seems to use 5 sockets, with port numbers 7980 to 7984, but the sockets on a device are not used at the same time; they appear to be used sequentially.
Also, the 5 devices do not communicate with the server at the same time; they also appear to be used sequentially, like the sockets.

So, I need more detailed information about your test environment.

In the captured packets, filtered by device:

  • Device 192.168.1.52 :
    Packets 1~143 are normal.
    Packets 144~161 are connection requests to each device. (Why does it try to connect to each device?)
    After a long time (3742 seconds, during which device 192.168.1.55 communicated with the server),
    the sequence number of packet 3937 is broken. (I don’t know why.)

  • Device 192.168.1.55 :
    Packet 156 is a SYN packet (connection request) to this device.
    After a long time (about 860 s, during which nothing happens except that the server tries to connect to other devices. Why?),
    packet 162 is a SYN packet to 192.168.1.55.
    Packet 163 is a data packet to the server (without the device’s SYN packet). In the normal TCP connection process, the server sends a SYN request, the client answers with SYN/ACK, the server sends an ACK, and only then can data communication start. This is called the 3-way handshake. But this capture does not follow that rule, and I don’t know why. (I guess it is caused by the selective ACK option in the server’s SYN packet. The W5300 does not support the selective ACK option, so it is ignored.)
    Anyway, packets 164~310 are normal.
    After a long time (about 1800 s, during which the device communicated with the server on another port number, again without a 3-way handshake),
    packet 1937 has a wrong ACK number. The relative ACK number should be 129741+1460=131201, not 885.
    After packet 1937, the W5300 started retransmitting. It looks like normal operation, but this retransmission is abnormal: the retransmitted packets are all wrong (probably invalid data, because each socket has 2048 bytes of memory and the data for the broken ACK number no longer exists in the socket memory; that data was processed normally and then released by the normal ACK number).

  • Other devices : probably the same as the devices above.

Most of all:
Why does the server send the broken ack number to the devices?
and
Why does the W5300 not follow the 3-way handshake connection process?
Why does the server use the selective ACK option? (This is a minor issue.)

Anyway, the W5300 has operated abnormally. I don’t know exactly why.

If you can run the test with just one device and one socket, please check whether this situation still occurs.
And please send me the source code by email.

Thank you.

Regarding the 3-way handshake, I think it does that fine, but my Wireshark capture had a filter on it, so it didn’t show all of the packets. I don’t think it’s related to the selective ACK. When I capture everything without a filter, I see the correct 3-way handshake.

I believe I may have found an issue that would cause the “broken ack number”.

During the initial connection, the PC sends SYN with an option of window scaling = 2 (multiply by 4)

The W5300 SYN only sends Maximum Segment Size, it does not send window scaling option.

When the error occurred, the PC reported window size of 552, but it looks like the W5300 interpreted it as 552*4=2208, and sent a full packet of 1460 bytes.

The PC then only acknowledged 552 bytes because that’s all it had room for. At this point, the W5300 seems to not be able to handle the incorrect ACK number and everything falls apart.

According to someone who wrote this article: en.wikipedia.org/wiki/Transmiss … ow_scaling
“Both sides must send the option in their SYN segments to enable window scaling in either direction”

Since the W5300 did not send the window scaling option, I’m thinking it should ignore the window scale sent by the PC. Is it possible that the W5300 is using the scaling factor sent by the PC when it shouldn’t be?
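The rule quoted above can be written down as a small calculation. Below is a sketch in C of how a sender would derive the usable window under that rule. The function name and parameters are illustrative only, and this says nothing about what the W5300 actually does internally:

```c
#include <stdbool.h>
#include <stdint.h>

/* Usable send window under the window scaling rule: the peer's shift count
 * applies only if BOTH SYN segments carried the window scale option.
 * effective_window and its parameters are hypothetical names for this sketch. */
uint32_t effective_window(uint16_t advertised, uint8_t peer_shift,
                          bool we_sent_wscale, bool peer_sent_wscale)
{
    if (we_sent_wscale && peer_sent_wscale) {
        if (peer_shift > 14)          /* the shift count is capped at 14 */
            peer_shift = 14;
        return (uint32_t)advertised << peer_shift;
    }
    return advertised;                /* option not negotiated: no scaling */
}
```

With the values from this capture, an advertised window of 552 and a peer shift of 2, the result is 552 when our side did not send the option, and 2208 only if both sides had sent it.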

I will try to see if I can modify the client to use a scale factor of 0, and see if that might resolve the issue.

Edit…I added an attachment of a second capture of a failure. You can see at event 69, the PC acks with window size of 2112, and then the W5300 sends 2 more packets of 1460 bytes each. Shortly after that, it all falls apart.

I also added a capture of the normal startup showing the 3-way handshake where the PC sends the window scaling option of 2, and the W5300 doesn’t send the window scaling option.

Also, I’m not ignoring your request to see the code, but it is written in Verilog and it is not very easy to compare to a standard C program. So for now, I’m focusing on the strange behavior that has actually been captured.

Edit 2: I’m thinking that a math error on the W5300 is more likely than an error in the window scaling, because this has only happened right where the sequence number rolls from FFFFFFFF to 00000000.
capture normal SYN startup.zip (6.24 KB)
w5300_fin2_port7893_only.zip (15 KB)

Hi, abes!

Your big problem is why the server sends an unexpected ACK number to the W5300. When the W5300 receives an unexpected ACK number, it may stay in an unknown logic state until it receives the expected ACK number. But, as shown in your capture, while the W5300 is in the frozen state, the server does not send the expected ACK number.

Anyway, you can detect the reception of an unexpected ACK number just by reading the upper (reserved) byte of Sn_SSR. If its most significant bit is set, the W5300 has received an unexpected ACK number from the peer. In this case, the W5300 enters the unknown logic state and you can’t expect normal operation from it.

The bits of the reserved byte of Sn_SSR are as follows:
7th : Received unexpected ACK number
3rd : No more transmission until the ACK number is received. But the W5300 can retransmit the packet whose ACK has not been received.
2nd : Retransmitting
1st : Rx buffer empty
0th : Tx buffer full

In your case, the 7th, 3rd, and 2nd bits may be set. To go back to the normal state, you must either close the socket or receive the expected ACK number.
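The bit list above can be turned into masks for a 16-bit read of Sn_SSR, where the upper byte is the reserved byte and the lower byte is the socket state. This is only a sketch: the constant and function names are invented here, and the bit meanings are taken from this post rather than from the datasheet.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bits of the reserved (upper) byte of Sn_SSR, as described in this thread.
 * All names below are hypothetical, chosen for this example only. */
enum {
    SSR_RSV_TX_FULL        = 1u << 0, /* Tx buffer full */
    SSR_RSV_RX_EMPTY       = 1u << 1, /* Rx buffer empty */
    SSR_RSV_RETRANSMITTING = 1u << 2, /* retransmitting */
    SSR_RSV_TX_BLOCKED     = 1u << 3, /* no new transmission until ACK arrives */
    SSR_RSV_BAD_ACK        = 1u << 7  /* unexpected ACK number received */
};

/* Upper byte of the 16-bit register read: the reserved status byte. */
uint8_t ssr_reserved(uint16_t ssr16) { return (uint8_t)(ssr16 >> 8); }

/* Lower byte of the 16-bit register read: the socket state, e.g. 0x17. */
uint8_t ssr_state(uint16_t ssr16) { return (uint8_t)(ssr16 & 0xFFu); }

/* True when the "unexpected ACK number" bit is set in the reserved byte. */
bool ssr_bad_ack(uint16_t ssr16)
{
    return (ssr_reserved(ssr16) & SSR_RSV_BAD_ACK) != 0;
}
```

With this numbering, the frozen readings 0x8917 and 0x8B17 both have bit 7 and bit 3 set, while the normal reading 0x0117 has neither.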

The W5300 can send multiple packets with just one SEND command, but in your case we don’t recommend that.
As in my previous answer, I recommend sending data in units of 1460 bytes.

Thank you.

Thank you for the detailed information about the Sn_SSR MSB.

I already added code to try to disconnect when the Sn_SSR MSB came up with the 7th bit set, and that is why we see the FIN packet in the data captures, but I still don’t seem to get a clean disconnect. It seems to be in an unusual state, and unable to recover.

I understand that the W5300 doesn’t handle the un-expected ACK number. However, the problem starts when the W5300 sends more data than the window size, so I think there is an error that has something to do with the window size and the sequence number rollover that causes the W5300 to send too much data. This only happens right at the rollover and only if the receive window is very small due to a buffer getting close to full, so my best guess is some math operation that doesn’t handle the 32-bit overflow properly.
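To illustrate the kind of math error being guessed at, here is a C sketch comparing a check that ignores the 32-bit wrap with one that handles it. Nothing in this thread confirms how the W5300 implements its ACK check; both functions are hypothetical and only show how a rollover can break a plain comparison.

```c
#include <stdbool.h>
#include <stdint.h>

/* Naive check: treats sequence numbers as plain integers that never wrap.
 * Near the rollover, seq + len exceeds 2^32 while the ACK has already
 * wrapped to a small value, so a valid ACK looks "too small". */
bool ack_covers_naive(uint32_t seq, uint32_t len, uint32_t ack)
{
    uint64_t end = (uint64_t)seq + len;   /* computed without wrapping */
    return (uint64_t)ack >= end;
}

/* Wrap-safe check: compares modulo 2^32 by looking at the sign of the
 * 32-bit difference. The cast to int32_t relies on two's complement
 * conversion, which holds on common compilers. */
bool ack_covers_wrapsafe(uint32_t seq, uint32_t len, uint32_t ack)
{
    uint32_t end = seq + len;             /* wraps modulo 2^32 */
    return (int32_t)(ack - end) >= 0;
}
```

For the failure reported earlier (sequence 0xFFFFFC6F, 1460 bytes, ACK 0x00000287), the wrap-safe check accepts the ACK and the naive one rejects it. Away from the rollover, for example sequence 1000 with ACK 2460, both agree.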

It would be nice to see some more in depth research and information about this behavior if possible. In the meantime, I will try to find a workaround that ensures that there will always be buffer space so the window will never go down below 1460. I will probably have to throttle the data somehow, or ensure the PC/Labview program has high enough priority to read all the data without much buffering.

Thanks for all your feedback on this so far.

Hi, Abes!!

How is each socket’s TX/RX memory size configured?
And are you sure you check the TX free size before sending data?

In your capture file, your TX memory appears to be configured as 2 Kbytes,
and it looks like you don’t check the TX free size.

After 1st packet is sent normally,
the next 1460 data packet can’t be sent without ack packet of 1st data packet.
But that is not what happens in your case.

Please check again!
Thank you.

For the sockets that are failing:

TMSR is set to 27kBytes per socket
RMSR is set to 2kBytes per socket; I think this is what you are seeing in the capture file where Win=2046, but these sockets don’t receive data, they just send data

Yes, I check TX_FSR before sending. But this only verifies that there is buffer space in the W5300 TX Buffer (27kBytes to start). It does not indicate the peer Window size. I think there is no way for me to know the peer window size. The W5300 uses it internally but doesn’t give me access to it.

Your statement that “the next 1460 data packet can’t be sent without ack packet of 1st data packet” is not consistent with the answer in this forum topic:

After the first transmission, I wait for the SENDOK flag to be set, then I clear it, then I check TX_FSR to verify that there is space in the TX buffer, and if there is space, then I send another packet. According to that post, when the ACK is received, the TX_FSR increases by the amount of data that was acknowledged by the peer, but I should not have to wait for an ACK after every packet. If I did that, I would wait for TX_FSR to show 27kBytes before I send each packet, and my data rate will go down. I’ve tried that.
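The wait-for-SENDOK step in that sequence can be combined with the reserved-byte bit described earlier in this thread, so the wait has a way out when the socket freezes. The sketch below only classifies one snapshot of the registers; the register reads themselves are left to the caller. All names are invented for this example, and the values 0x10 for the SENDOK bit and 0x1C for SOCK_CLOSE_WAIT are assumptions that should be checked against your driver headers.

```c
#include <stdint.h>

#define EX_SN_IR_SENDOK     0x10u  /* assumed SENDOK bit position in Sn_IR */
#define EX_SOCK_ESTABLISHED 0x17u  /* state value reported in this thread */
#define EX_SOCK_CLOSE_WAIT  0x1Cu  /* assumed value; check the driver header */
#define EX_SSR_BAD_ACK      0x80u  /* bit 7 of the reserved byte of Sn_SSR */

typedef enum {
    SEND_DONE,         /* SENDOK is set: clear it and continue */
    SEND_PENDING,      /* still waiting: poll again */
    SEND_BAD_ACK,      /* unexpected-ACK bit set: SENDOK may never arrive */
    SEND_SOCKET_GONE   /* socket left ESTABLISHED/CLOSE_WAIT */
} send_state;

/* Classify one snapshot of Sn_IR and the 16-bit Sn_SSR read
 * (reserved byte in the upper 8 bits, socket state in the lower 8). */
send_state classify_send(uint8_t sn_ir, uint16_t ssr16)
{
    uint8_t reserved = (uint8_t)(ssr16 >> 8);
    uint8_t state    = (uint8_t)(ssr16 & 0xFFu);

    if (sn_ir & EX_SN_IR_SENDOK)
        return SEND_DONE;
    if (reserved & EX_SSR_BAD_ACK)
        return SEND_BAD_ACK;
    if (state != EX_SOCK_ESTABLISHED && state != EX_SOCK_CLOSE_WAIT)
        return SEND_SOCKET_GONE;
    return SEND_PENDING;
}
```

A polling loop would call this with fresh register reads on each pass and give up after a bounded number of `SEND_PENDING` results instead of spinning forever. On `SEND_BAD_ACK` it would go straight to whatever recovery is available, such as closing the socket.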

Hi, abes!

That post is right.
I was mistaken about the TX memory size being 2 KB.
I’m sorry that I cannot help resolve the problem.

For now, to work around your problem,
monitor Sn_MR, and force the socket to close when Sn_MR is in an unknown state.

Thank you.