W5100 tcp socket closes after some minutes

I have written a driver for the Wiznet W5100 ethernet chip (currently sits on the WIZ812MJ breakout board). It is connected via SPI1.

Everything works fine - ping, tcp client mode, receiving and sending data - but, after a random time (from 1 minute up to 10 minutes) the connection is closed suddenly! I tried to debug all the steps that happen, but I can’t find anything that helps. Has anybody ever experienced similar problems?

As the driver is written for my company, I can’t put all the code here, but I’ll paste what are (to me) the most important parts. Most of this is pseudo-code, but you can assume that SPI communication works and all the registers are set to the given values!


W5100_Init - Pass SPI pin configuration and do W5100 initialization

Pass SPI configuration
Write reset-cmd to MR register
Write retry time-value to RTR register (0x07D0 = 2000 units of 100 µs = 200 ms)
Write retry count to RCR register (0x05)
Write socket memory information to TMSR/RMSR register (0x55)
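
The init steps above could be sketched in C roughly as follows. This is only a sketch: `sim_regs`, `w5100_write`, `w5100_read` and `w5100_init` are hypothetical names, and the register file is simulated in RAM in place of the real SPI transfer. The register addresses are the ones from the W5100 datasheet, which specifies RTR in units of 100 µs.

```c
#include <assert.h>
#include <stdint.h>

/* Common register addresses and the reset bit, per the W5100 datasheet. */
#define W5100_ADDR_MR   0x0000u
#define W5100_ADDR_RTR0 0x0017u
#define W5100_ADDR_RTR1 0x0018u
#define W5100_ADDR_RCR  0x0019u
#define W5100_ADDR_RMSR 0x001Au
#define W5100_ADDR_TMSR 0x001Bu
#define W5100_MR_RST    0x80u

/* Hypothetical stand-in for the SPI layer: a RAM image of the register map. */
static uint8_t sim_regs[0x0020];

static void w5100_write(uint16_t addr, uint8_t value)
{
    assert(addr < sizeof sim_regs);
    sim_regs[addr] = value;
}

static uint8_t w5100_read(uint16_t addr)
{
    assert(addr < sizeof sim_regs);
    return sim_regs[addr];
}

/* rtr_ticks is given in units of 100 us, so 2000 means 200 ms. */
static void w5100_init(uint16_t rtr_ticks, uint8_t retry_count)
{
    w5100_write(W5100_ADDR_MR, W5100_MR_RST);      /* software reset */
    /* On real hardware, wait here until the RST bit reads back as 0. */
    w5100_write(W5100_ADDR_RTR0, (uint8_t)(rtr_ticks >> 8));
    w5100_write(W5100_ADDR_RTR1, (uint8_t)(rtr_ticks & 0xFFu));
    w5100_write(W5100_ADDR_RCR, retry_count);
    w5100_write(W5100_ADDR_RMSR, 0x55u);           /* 2 KB RX per socket */
    w5100_write(W5100_ADDR_TMSR, 0x55u);           /* 2 KB TX per socket */
}
```

Calling `w5100_init(2000, 5)` reproduces the values listed above.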

W5100_NetworkInit - Write GW, MAC, SubnetMask and IP Addr

Write Gateway to GAR register
Write MAC to SHAR register
Write Subnet to SUBR register
Write IP to SIPR register
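
As a rough illustration of the network setup, the four writes might look like this. The struct `w5100_net_config` and the helper `w5100_write_block` are made up for this sketch, and the register file is again simulated in RAM; only the base addresses (GAR, SUBR, SHAR, SIPR) come from the datasheet.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Base addresses of the common network registers, per the W5100 datasheet. */
#define W5100_ADDR_GAR0  0x0001u
#define W5100_ADDR_SUBR0 0x0005u
#define W5100_ADDR_SHAR0 0x0009u
#define W5100_ADDR_SIPR0 0x000Fu

/* Hypothetical stand-in for the SPI layer: a RAM image of the register map. */
static uint8_t sim_regs[0x0020];

/* Write len consecutive bytes starting at addr. */
static void w5100_write_block(uint16_t addr, const uint8_t *src, size_t len)
{
    assert((size_t)addr + len <= sizeof sim_regs);
    for (size_t i = 0; i < len; i++) {
        sim_regs[addr + i] = src[i];
    }
}

/* Hypothetical container for the four settings. */
typedef struct {
    uint8_t gateway[4];
    uint8_t mac[6];
    uint8_t subnet[4];
    uint8_t ip[4];
} w5100_net_config;

static void w5100_network_init(const w5100_net_config *cfg)
{
    w5100_write_block(W5100_ADDR_GAR0,  cfg->gateway, sizeof cfg->gateway);
    w5100_write_block(W5100_ADDR_SHAR0, cfg->mac,     sizeof cfg->mac);
    w5100_write_block(W5100_ADDR_SUBR0, cfg->subnet,  sizeof cfg->subnet);
    w5100_write_block(W5100_ADDR_SIPR0, cfg->ip,      sizeof cfg->ip);
}
```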

W5100_SocketInit - Map important socket registers to given socket and initialize the socket

/* This maps all important/required registers to the socketConfig struct */
if (socketX == SOCKET0) {
    socketConfig.PORT0 = W5100_ADDR_S0_PORT0;
    socketConfig.PORT1 = W5100_ADDR_S0_PORT1;
    socketConfig.MR = W5100_ADDR_S0_MR;
    socketConfig.CR = W5100_ADDR_S0_CR;
    socketConfig.SR = W5100_ADDR_S0_SR;
}

Write protocol type to MR register
Write port to PORT register
Write CMD_OPEN to CR register

Use client mode:

Write server IP to DIPR register
Write server Port to DPORT register
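
For reference, a TCP client sequence on socket 0 usually looks something like the sketch below: OPEN, wait for SOCK_INIT, write DIPR/DPORT, then CONNECT and wait for SOCK_ESTABLISHED. The function names and the simulated register file are assumptions for illustration; in particular, the simulated `w5100_write` changes the status immediately, which the real chip does not do. Addresses, command codes and status codes are taken from the W5100 datasheet.

```c
#include <assert.h>
#include <stdint.h>

/* Socket 0 registers, commands and status codes, per the W5100 datasheet. */
#define W5100_ADDR_S0_MR     0x0400u
#define W5100_ADDR_S0_CR     0x0401u
#define W5100_ADDR_S0_SR     0x0403u
#define W5100_ADDR_S0_PORT0  0x0404u
#define W5100_ADDR_S0_DIPR0  0x040Cu
#define W5100_ADDR_S0_DPORT0 0x0410u
#define SN_MR_TCP        0x01u
#define CMD_OPEN         0x01u
#define CMD_CONNECT      0x04u
#define SOCK_INIT        0x13u
#define SOCK_ESTABLISHED 0x17u

/* Hypothetical simulation of the chip: commands take effect instantly. */
static uint8_t sim_regs[0x0420];

static void w5100_write(uint16_t addr, uint8_t value)
{
    assert(addr < sizeof sim_regs);
    if (addr == W5100_ADDR_S0_CR) {
        if (value == CMD_OPEN && sim_regs[W5100_ADDR_S0_MR] == SN_MR_TCP) {
            sim_regs[W5100_ADDR_S0_SR] = SOCK_INIT;
        } else if (value == CMD_CONNECT &&
                   sim_regs[W5100_ADDR_S0_SR] == SOCK_INIT) {
            sim_regs[W5100_ADDR_S0_SR] = SOCK_ESTABLISHED;
        }
        return; /* CR reads back as 0 once a command is accepted */
    }
    sim_regs[addr] = value;
}

static uint8_t w5100_read(uint16_t addr)
{
    assert(addr < sizeof sim_regs);
    return sim_regs[addr];
}

/* Poll S0_SR a bounded number of times; 0 on success, -1 on timeout. */
static int w5100_wait_status(uint8_t wanted, unsigned tries)
{
    while (tries-- > 0) {
        if (w5100_read(W5100_ADDR_S0_SR) == wanted) {
            return 0;
        }
    }
    return -1;
}

static int w5100_socket0_connect(const uint8_t server_ip[4],
                                 uint16_t server_port, uint16_t local_port)
{
    w5100_write(W5100_ADDR_S0_MR, SN_MR_TCP);
    w5100_write(W5100_ADDR_S0_PORT0, (uint8_t)(local_port >> 8));
    w5100_write(W5100_ADDR_S0_PORT0 + 1, (uint8_t)(local_port & 0xFFu));
    w5100_write(W5100_ADDR_S0_CR, CMD_OPEN);
    if (w5100_wait_status(SOCK_INIT, 100) != 0) {
        return -1;
    }
    for (uint16_t i = 0; i < 4; i++) {
        w5100_write((uint16_t)(W5100_ADDR_S0_DIPR0 + i), server_ip[i]);
    }
    w5100_write(W5100_ADDR_S0_DPORT0, (uint8_t)(server_port >> 8));
    w5100_write(W5100_ADDR_S0_DPORT0 + 1, (uint8_t)(server_port & 0xFFu));
    w5100_write(W5100_ADDR_S0_CR, CMD_CONNECT);
    return w5100_wait_status(SOCK_ESTABLISHED, 100);
}
```

On real hardware the polling loops would need a proper timeout, since CONNECT can take as long as the retry settings allow.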

Now that we’ve got a tcp connection in client mode, start loop.

Check socket status from SR register
If there’s any data on the line, call W5100_Socket_ReceiveData method
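
One iteration of such a loop could be sketched like this. `w5100_socket0_poll` is a hypothetical helper and the register file is simulated; the S0_RX_RSR address (received-size register) is from the datasheet. The function reports how many bytes are waiting, or -1 if the socket is no longer established.

```c
#include <assert.h>
#include <stdint.h>

/* Socket 0 status and received-size registers, per the W5100 datasheet. */
#define W5100_ADDR_S0_SR      0x0403u
#define W5100_ADDR_S0_RX_RSR0 0x0426u
#define W5100_ADDR_S0_RX_RSR1 0x0427u
#define SOCK_ESTABLISHED      0x17u

/* Hypothetical stand-in for the SPI layer: a RAM image of the register map. */
static uint8_t sim_regs[0x0430];

static uint8_t w5100_read(uint16_t addr)
{
    assert(addr < sizeof sim_regs);
    return sim_regs[addr];
}

/* Returns the number of received bytes waiting in the RX buffer,
 * or -1 when the socket is not in the ESTABLISHED state. */
static int32_t w5100_socket0_poll(void)
{
    if (w5100_read(W5100_ADDR_S0_SR) != SOCK_ESTABLISHED) {
        return -1;
    }
    uint16_t pending = (uint16_t)((w5100_read(W5100_ADDR_S0_RX_RSR0) << 8) |
                                  w5100_read(W5100_ADDR_S0_RX_RSR1));
    return (int32_t)pending;
}
```

A caller would invoke the receive routine when the result is greater than 0 and treat -1 as "connection gone".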

This works fine for some time (latest run: 10 minutes), but then suddenly the socket status returns 0x00 and the connection is closed.

Is there still anything important that I may have forgotten?

Why is the socket closed?

Thank you very much for providing any hints!

I do not see CONNECT command on the list. Do you try to connect to the server?
Most probably socket is closed because server says to close it, sending FIN packet.
Use wireshark to trace packets on the network - the best is to have it installed and run on the server (hopefully Linux or Windows).

Sorry, I just forgot to paste the Connect command, but it is there!

I’ve tried both, client and server mode, but even in server mode the connection is closed after some time, although I never send the FIN/Close socket command explicitly. :confused:

It would be very helpful if you can make a wireshark log to see what happens on the network - who sends the closing packet, and what happens before it.
When a connection is established, there are still packets being sent from one node to the other to keep it alive. These packets may get lost, and then one of the hosts thinks that the connection is broken and closes its end, sending FIN to the other device.

Here is a Wireshark log (please notice, there was a ping command running all the time until it dropped!) - you’ll see that there is no close command from any of the devices. The connection simply breaks at a random time.

I wrote a debug register dump method to output all the registers AFTER the connection broke - all the essential registers (GAR, SUBR, SIPR, PORT, DIPR) are 0 (zero!) at the time the connection breaks - or better, the connection breaks because of that. The reset pin is connected to 3V3 through a resistor / capacitor. It seems like the W5100 suddenly loses its settings, even the ping isn’t working any further (how should it as the W5100 lost its network settings)…

ws_export_ping.zip (47.5 KB)

Good! You took very good troubleshooting steps. IMHO common registers’ contents loss is not normal and the cause can be:

  1. Hardware reset - as you already said, you have a cap and a pull-up resistor, which should be OK if the power source is stable, but I would still prefer a level-controlled reset input driven by an external reset signal or a supervisor device like the MAX809/MAX810.
  2. Software reset - your code accidentally writes bit 7 of the common MR and soft resets the chip.
  3. Power is bad, soldering is bad (e.g. 1V8 is not soldered well).
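
To check cause 2, one option is to route every register write through a guard that refuses (and counts) writes setting bit 7 of the common MR outside of init. This is only a sketch with made-up names (`w5100_write_guarded`, `reset_allowed`, `blocked_resets`) and a simulated register file; a real driver would do the SPI transfer where the array is written.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define W5100_ADDR_MR 0x0000u
#define W5100_MR_RST  0x80u

/* Hypothetical stand-in for the SPI layer: a RAM image of the register map. */
static uint8_t sim_regs[0x0020];

static bool reset_allowed;       /* set to true only inside the init routine */
static unsigned blocked_resets;  /* how many stray reset writes were caught  */

/* Returns false, without writing, if the write would soft-reset the chip
 * while resets are not explicitly allowed. */
static bool w5100_write_guarded(uint16_t addr, uint8_t value)
{
    assert(addr < sizeof sim_regs);
    if (addr == W5100_ADDR_MR && (value & W5100_MR_RST) != 0 && !reset_allowed) {
        blocked_resets++;
        return false;
    }
    sim_regs[addr] = value;
    return true;
}
```

If `blocked_resets` ever becomes non-zero during normal operation, something in the code is issuing a soft reset.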

Thus your action plan would be:

  1. examine board and chip soldering thoroughly. As board is not manually assembled, I doubt this is the cause of the issue, but anyway, use magnifier to see and ensure there’re no visible issues;
  2. hang your scope onto the power (if you have dual scope - to both 3V3 and 1V8) and set it into slow mode so that you see the level continuously displayed on the scope’s screen, and can press “suspend” button to fix the picture. Wait until W5100 becomes un-ping-able and ensure power does not disrupt around this time;
  3. redesign your code to do nothing after connection is established - literally nothing, just dead loop - to ensure that it does not perform any actions on anything in W5100. Do not forget to disable interrupts (as ISR can also do some changes). Then wait the time it usually takes for chip communication failure. Or wait 2x times of that, or 3x times of that. If it will not fail, probably issue is somewhere in your code damaging chip’s configuration. If it will fail again, then most probably your code is not an issue (and then we will see further what we can do).

Thank you very much for your quick answer. I’ve updated the main loop to do nothing. But the problem still occurs after some minutes. I’ve just pinged the device and it suddenly breaks. The debugger was inside the main loop, so no hardfault happened. Power is stable.

Change your W5100’s MAC address to 18:15:15:15:AA:AA and try again.

Thanks for the hint, I’m running a test with another MAC address at the moment.

I also wired up a quick, but otherwise identical, setup using an Arduino Uno and the official arduino-ethernet library to exclude a software bug in my driver.

I’ll edit my post if anything occurs.

Edit: Changing the Mac-Address on my STM-board didn’t help me out. The fault is so hard to isolate, that I just can’t provide a real solution - worse, I can’t even find out what’s wrong…

My Arduino has been running for half an hour without crashing, but it broke once, so I am not sure whether it’s really working…

So, finally my arduino broke as well, so that I come to the conclusion that my WIZ812MJ board is rubbish. :confused:

You mean that both boards exhibit the same behavior and both are suspected to be malfunctioning?

Let’s revise everything again:

  1. you open TCP connection, do not send anything over it for some time;
  2. in 10 minutes (or so) W5100-based board becomes unavailable for TCP packets and for ICMP packets (ping);
  3. when it becomes unavailable, you make a dump of all common and socket registers and see that they are “as default” - SIPR, SHAR, GAR, port numbers are zeroed, memory allocation registers are 0x55 (both RX and TX).
  4. and we are sure your application does not perform any operations on W5100 registers just before and after “failure” (except performing following read for register dump).

Is above correct?

0. you have exponential signal on the reset line after power up, and perform software reset writing 0x80 to common registers’ MR.

No, not my boards. I tested the W5100-breakout board with two different boards (STM32F4 Discovery and Arduino Uno) - on STM32 with my own driver and on Arduino with the official ethernet driver - and on both boards with two different drivers the W5100 crashed - my hardware is in good condition.

I write the reset command as the first thing in my init-method.

W5100_SPI_writeRegister(&w5100Config, W5100_ADDR_MR, W5100_MR_RST);

where W5100_MR_RST is defined as:

#define W5100_MR_RST    ((uint8_t)0x80)

I opened a TCP connection over socket0 and tried all scenarios - w/ and w/o interrupts. I even removed all of the socket stuff for one test run, running just a very plain W5100 initialization method and stored the network settings so that I could ping the device.

The time varies a lot, from 1 minute up to 30 minutes - again something that I can’t reliably measure. Not only does the socket stop working, I also can’t ping the device any more, but SPI still works as I’m able to read out the registers:

Registers after crash:

MR		0
GAR0 	0
GAR1	0
GAR2	0
GAR3	0
SIPR and SHAR: same as GAR (all 0)
IR		0x80
IMR		0
RTR0		0x07
RTR1		0x00
RCR		0x08
RMSR	0x55
TMSR	0x55
UPORT	0x00

all 0x00 but FSR0	0x08
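
Note that RCR reads 0x08 in this dump, which is the datasheet default, although 0x05 was written during init. A check along the following lines could flag the moment the configuration is lost. This is a sketch only: `w5100_config_lost` is a hypothetical helper and the register file is simulated in RAM.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define W5100_ADDR_SIPR0 0x000Fu
#define W5100_ADDR_RCR   0x0019u

/* Hypothetical stand-in for the SPI layer: a RAM image of the register map. */
static uint8_t sim_regs[0x0020];

static void w5100_read_block(uint16_t addr, uint8_t *dst, size_t len)
{
    assert((size_t)addr + len <= sizeof sim_regs);
    memcpy(dst, &sim_regs[addr], len);
}

/* True if the IP address or retry count read back from the chip no longer
 * matches what the driver wrote during initialization. */
static bool w5100_config_lost(const uint8_t expected_ip[4], uint8_t expected_rcr)
{
    uint8_t ip[4];
    uint8_t rcr;
    w5100_read_block(W5100_ADDR_SIPR0, ip, sizeof ip);
    w5100_read_block(W5100_ADDR_RCR, &rcr, 1);
    return memcmp(ip, expected_ip, sizeof ip) != 0 || rcr != expected_rcr;
}
```

Polling this from a timer and toggling a GPIO on failure would give a trigger point for the oscilloscope.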

I do not explicitly write any register that shouldn’t be written.

Again, thanks for your help!

Hold on I am trying to simulate the issue on my board.

Meanwhile questions -

  • where is the other network device located in the topology? What is between that device and the W5100?
  • which protocol is used within the TCP connection? What is the application at the other end?

The board is plugged into a D-Link DES-1016D network switch over around 2 meters of CAT6 cable. It is in our company network.

There’s no protocol running - I don’t use any application at the moment, but for testing the TCP stack I used Hercules (hw-group.com/products/hercules/index_en.html) as a server and/or client.

Ok I did the following test.

My setup:

  • windows PC with Wireshark, TCPView and Apache 2 server installed. Apache is configured with “KeepAlive On” and “KeepAliveTimeout 3600”;
  • W5100-based device with instant register dump;
  • interconnected by the ASUS simple Ethernet switch (not a router).

I connected to the Apache server using HTTP/1.1, fetched some HTTP content, and the connection stayed alive for exactly 1 hour - TCPView showed the connection as established, and the W5100 had its S0 SR at 0x17 (SOCK_ESTABLISHED). After an hour Apache closed the connection by sending FIN (seen in TCPView and in Wireshark), and the W5100 responded with ACK, setting its S0 SR to 0x1C (SOCK_CLOSE_WAIT).
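
For anyone reading register dumps, a small decoder for the socket status values mentioned in this thread might help. The function name is made up for this sketch; the codes are the Sn_SR values from the W5100 datasheet, and only a subset is covered.

```c
#include <stdint.h>

/* Map a W5100 Sn_SR value to its datasheet name (subset only). */
static const char *w5100_status_name(uint8_t sr)
{
    switch (sr) {
    case 0x00: return "SOCK_CLOSED";
    case 0x13: return "SOCK_INIT";
    case 0x14: return "SOCK_LISTEN";
    case 0x17: return "SOCK_ESTABLISHED";
    case 0x1C: return "SOCK_CLOSE_WAIT";
    default:   return "OTHER";
    }
}
```

With this, the 0x00 seen after the failure decodes as SOCK_CLOSED, while a FIN from the peer would have left the socket in SOCK_CLOSE_WAIT (0x1C).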

Thus I state that it works as expected at my end.

Thus at your end:

  1. your router may be too intelligent and break the connection. Remove the router, connect the endpoints with a direct cable, and try again (ensure their network interfaces are correctly configured though). Or use a dumb network device instead of the router.
  2. your code still does something wrong, rewriting registers or resetting the chip (however, you seem to have checked that).

Thanks for your effort in testing everything!

However, if I don’t configure any socket nor TCP, the device should still be running and ACK the ping command. For that, there should not be any need for a keep alive signal or something like that. Think of a plain, low-level network connection where the W5100 only replies to the ping command. That should work like forever…

One thing that points strongly to a hardware fault/damage is that the essential network registers are reset to 0x00, which should not happen even if my switch/router disconnected the device. Currently, all my code does is set the registers so that the W5100 boots up, and then it goes into an infinite while-loop doing literally NOTHING.

I’ve attached an oscilloscope at the moment and I now provide the voltage through an external power supply - there is definitely something strange happening on the wire…

You are right, I just had to have some setup which should work, and I described it above. I did not experience same issue as you do.

Here you are also right.

Focus on reset circuit - I would replace it with square pulse. Don’t you have any reset signals in your device? Can you use it?
I recall back in 2012 I had some issues with the chip, when I had its reset just pulled up. Then I was advised to connect it to proper reset signal, and it solved that issue.

Sorry about the delay.
The oscilloscope shows some voltage drops, but the voltage stays far above the level that would trigger a hardware reset.
We decided to buy another breakout board, but this time the WIZ811MJ, just in case. The driver should also work with it. It’s very sad, as everything seems to work apart from the sudden crash. Anyway, I will update this thread if I find out anything of interest. Cheers.