I have encountered a strange problem, please see the image. 192.168.111.93 is the W5500 client and 192.168.1.100 is the Linux server. Normally the packet flow looks like this: the W5500 sends [PUSH,ACK] (a real packet from the application) and the server replies with [ACK] from its TCP/IP stack; then the server sends [PUSH,ACK] (a real packet from its application) and the W5500 replies with [ACK] from its TCP/IP stack. But after running for days or months, the packet flow changes to what is marked with the red circle: all packets become [PUSH,ACK]… From that moment the W5500 no longer works properly, and I need to power it off and on to restore it. Does anyone know the root cause of this problem? Thanks a lot!
To identify why the W5500 started PUSHing its packets, we need to review the whole packet history and see what preceded the first such packet. It may be that the Linux host started pushing first, and the W5500 then decided that all data must be delivered immediately and started pushing everything to the Linux application.
Next, the log you see may not be caused by the W5500 or by its hardwired TCP/IP stack at all. You can see that starting at #192896 the messages are doubled; however, these are different messages, and this may be the cause of the changed data flow pattern.
#192896: the server acknowledges the previous packet, #192895;
#192897: the server decides to push (the same as the previous?) ACK to the W5500. Question: why, and what is in the 20 bytes of data in this packet? I cannot see it in this dump;
#193160: the W5500 ACKs the data sent in #192897 (those 20 bytes?) with no data of its own (just an ACK);
#210957: the W5500 sends a packet with 45 bytes of data; 31 seconds have passed since the previous exchange with the server, which is relatively long;
#210958: the server replies with an ACK, again carrying 20 bytes of data. What are these 20 bytes, and what does the server want to tell the W5500? I cannot see it in this log;
#211304 (you marked it red): the W5500 ACKs the server’s data (those 20 bytes?) with a packet of data size 0;
And so on.
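To make the distinction in the list above concrete, here is a minimal C sketch that separates a bare ACK from an ACK piggybacked on data. The `segment_t` type and the helper names are hypothetical and exist only for illustration; the flag bit values are the standard TCP ones.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Standard TCP header flag bits. */
#define TCP_FLAG_PSH 0x08u
#define TCP_FLAG_ACK 0x10u

/* Hypothetical summary of one captured segment (not a real capture API). */
typedef struct {
    uint8_t  flags;        /* raw TCP flags byte */
    uint16_t payload_len;  /* TCP payload length in bytes ("Len=" in Wireshark) */
} segment_t;

/* True for a bare acknowledgement: ACK set and no payload. */
static bool is_pure_ack(const segment_t *s)
{
    assert(s != NULL);
    return (s->flags & TCP_FLAG_ACK) != 0 && s->payload_len == 0;
}

/* True when the acknowledgement travels together with application data. */
static bool is_ack_with_data(const segment_t *s)
{
    assert(s != NULL);
    return (s->flags & TCP_FLAG_ACK) != 0 && s->payload_len > 0;
}
```

In these terms, a server reply with PSH and ACK set and 20 bytes of payload counts as an ACK with data, while the W5500’s answer with data size 0 counts as a pure ACK.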
Here’s my guesswork based on what I see: the data pattern changed. The server started sending 20 bytes in its ACK messages; its previous ACKs carried only dummy payload and no data. It may be that your application running on the W5500 cannot cope with the incoming data while the server expects some response to it. My conclusion: you must identify what the server says in these 20 bytes, and what those 20 bytes require the W5500-based device (meaning the W5500 hardware plus your application driving it) to do.
I am not sure, but it may be that data accumulates until the W5500 socket RX buffer becomes full and the whole communication gets stuck.
Thanks a lot for your reply. I found the breaking point with Wireshark. It looks like something related to the W5500 socket RX buffer. Before the crash the Win is 848, which is quite small given that the buffer should be 2048. What concerns me is that the RX buffer seems to recover in the following packets, but in the end the W5500 sends out a totally wrong ARP broadcast with a wrong IP address… Can you give any suggestions? Should I use a different source port every time? Also, this device has already been running for about one year with no problems, and the problem appeared only in the last few days… Very strange…
As far as I know, the window size Wireshark displays may be a calculated (scaled) value rather than the raw field from the TCP header, so the actual situation may differ from what Wireshark shows. What are those 20 bytes in the server response? Note that the same message from the server appears again at #508731, just before it “crashes” (as you call it). I understand this exercise is not easy, but can you count the server responses carrying 20 bytes of data, calculate how much data the server actually sent, and see how the resulting value relates to the 2048-byte RX window of the W5500?
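A rough way to do this calculation, sketched in C. It assumes (and this is only an assumption) that the advertised window equals the free space left in the 2048-byte socket RX buffer and that everything still unread consists of the 20-byte confirmations. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes that must still be sitting unread in the RX buffer, given the
 * buffer size and the free space the receiver advertises as its window. */
static uint32_t unread_bytes(uint32_t rx_buffer_size, uint32_t advertised_window)
{
    assert(advertised_window <= rx_buffer_size);
    return rx_buffer_size - advertised_window;
}

/* How many whole messages of msg_len bytes fit into byte_count bytes. */
static uint32_t whole_messages(uint32_t byte_count, uint32_t msg_len)
{
    assert(msg_len > 0);
    return byte_count / msg_len;
}
```

With the numbers from this thread, `unread_bytes(2048, 848)` gives 1200 bytes, which `whole_messages(1200, 20)` turns into 60 unread confirmations. Under the same assumptions, `whole_messages(2048, 20)` gives 102, so roughly a hundred unread confirmations would fill the buffer completely.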
No, it was not something totally wrong. It seems the chip started connecting to the same server with the same local and destination port numbers. This action can only be triggered by the software driving the W5500, so you must look into the code to see what could have caused it. You must also look for RST messages before and after this new connect.
When the W5500 gets a CONNECT command, it sends an ARP request to learn the MAC address of the remote host (#509204) and then sends a SYN message, and Wireshark correctly states that the W5500 reuses the same networking configuration (remote IP address + remote port + local port). This is not the right thing to do during, or just after closing, a previous connection with the same properties, because network devices may still cache the connection properties if it was not closed properly.
Thus there are two things to investigate:
- Why was the W5500 asked to start connecting to the server with the same configuration? Was it the same socket (then where are the RST/FIN packets?), or another socket of the chip?
- What happened to the socket’s RX buffer before the issue occurred?
Thanks a lot for your quick reply. The 20 bytes of data are the confirmation packet the server sends to confirm it received the packets from the W5500. Yes, I will use a different source port when reconnecting to the server. But my concern is why the ARP message (#516973) goes wrong after retrying several times. I also reset the W5500 in the source code, but it seems the ARP message cannot be restored.
I really appreciate your help!
Where did you find it? There are two identical ARP messages logged; your network probably has several paths and the same message arrived twice. But there was only one response, and it was the correct one.
In general the source port does not matter much, unless you configure the remote host to respond to a specific source port only. When reconnecting, just increment the source port number. Also, do not forget to properly disconnect and close the previous connection, if possible.
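A minimal sketch of the “increment the source port” advice, in C. The helper name and the choice of the 49152-65535 dynamic port range are assumptions made for illustration; any range the server accepts would do. The returned value would then be used as the local port in whatever call your driver uses to open the socket.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed range for local ports: the dynamic/private port range. */
#define SRC_PORT_MIN 49152u
#define SRC_PORT_MAX 65535u

static_assert(SRC_PORT_MIN < SRC_PORT_MAX, "port range must not be empty");

/* Return the source port to use for the next connection attempt.
 * Values outside the range, or the last port in it, wrap to SRC_PORT_MIN. */
static uint16_t next_source_port(uint16_t current)
{
    if (current < SRC_PORT_MIN || current >= SRC_PORT_MAX) {
        return (uint16_t)SRC_PORT_MIN;
    }
    return (uint16_t)(current + 1u);
}
```

Keeping the last used port in a variable that survives reconnects (and ideally a chip reset) means each new attempt gets a fresh port, so it does not collide with a half-closed connection that still uses the old one.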
Please see the bottom of the image and the red rectangle.
#516973: the MAC address 48:91:48:5a:8f:f8 is the W5500’s MAC address, but the server IP should be 192.168.1.100 and the local IP should be 192.168.111.93.
OK, now I see it. Again, my guess is that after the chip tries to reconnect (messages 509204-515254), the connect operation fails (e.g. because the server denies the connect request, although that is not visible in the packets; or, most probably given the period between retransmissions, because of a timeout). Your application driving the W5500 then tries to close the socket (command value 0x10) and mistakenly fills the destination IP address and the first octet of GAR with 0x10. So look into this routine to find out what goes wrong there when the connect to the server fails.
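One way to test this guess is to read the address registers back after a failed connect and compare them with the values the application configured. The C sketch below only shows the comparison; the function name is hypothetical, and reading the registers is left to whatever driver you use.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Socket close command value, as mentioned in the post above. */
#define SOCK_CMD_CLOSE 0x10u

/* Count bytes of a read-back register field that differ from the
 * configured value and equal the close command value, which would hint
 * that the command byte was written to the wrong place. */
static size_t count_suspect_bytes(const uint8_t *expected,
                                  const uint8_t *readback,
                                  size_t len)
{
    size_t suspects = 0;

    assert(expected != NULL && readback != NULL);
    for (size_t i = 0; i < len; ++i) {
        if (readback[i] != expected[i] && readback[i] == SOCK_CMD_CLOSE) {
            ++suspects;
        }
    }
    return suspects;
}
```

For example, with the expected destination IP `{192, 168, 1, 100}`, a read-back of `{0x10, 0x10, 0x10, 0x10}` gives 4 suspect bytes, and a field where only the first octet reads 0x10 gives 1. A non-zero result right after the close routine runs would point at that routine.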