WIZnet Developer Forum

Losing interrupts because SN_IR_RECV to SN_IR fails (silicon glitch?)

For our more time-critical applications, receiving data is not triggered by polling registers but is purely interrupt-driven.

When an interrupt is received, we write SN_IR_RECV to SN_IR, then read the RSR for that socket twice ("stable mode") and start transferring payloads.

We have a problem where the system sometimes stops processing W5500 data. We tracked it down to a write of SN_IR_RECV to SN_IR that does not bring the interrupt line low, so we lose an interrupt edge and communication stops.

There are no other sockets active, and post-mortem checks confirm that SN_IR for the active socket remains 4.

We retried issuing the SN_IR write twice, and then see cases where the first write doesn't reset the interrupt pin but the second one does (visible in the scope picture below: the first 4 bytes are the first SN_IR command, to which the interrupt doesn't react, while it does react to the second set of bytes, which also writes 4 to the socket's SN_IR).

These two commands are about 4 µs apart, so I assume the global state can't have changed much (and IMR=0 anyway).

Of course, doubling the command doesn't really solve the problem; it only makes it less frequent.

What could be happening here and is there a failsafe solution?

We tried various INTLEVEL settings (0x10,0x100,0x200), without result.
SIMR=0xFF
EIMR=0

Only sockets 0 and 1 are used: 0 is receive-only, and 1 is send-only and disabled in this program.

The packet rate is two machines each doing 10 writes to the socket in quick succession, twice per second. So the total is just 40 packets/s, but they arrive close together.

We dumped all registers after it stopped reacting (via serial), and the register state seems to be what we expect, except for a full buffer and SN_IR=SN_IR_RECV


Btw, it is chip version 04

The main issue in your design is

using an edge to trigger the interrupt. You should use the low level of the interrupt line, strobed with a clock signal. You should NOT use the interrupt signal as a clock, because the interrupt line's state may change while interrupts are disabled, which can cause timing violations and metastability in your design.

This applies to the whole WIZnet chip family. You need to flush all data from the buffer before clearing bit 2; otherwise the bit will not clear, because there is still data in the RX buffer, and the interrupt line will stay low. If you use a level-triggered interrupt (instead of edge-triggered), then when you exit the interrupt service routine and re-enable interrupts, the routine is invoked again, and this will keep happening until there is no data left in the RX buffer.

[quote=“Eugeny”]The main issue in your design is

using an edge to trigger the interrupt. You should use the low level of the interrupt line, strobed with a clock signal. You should NOT use the interrupt signal as a clock, because the interrupt line's state may change while interrupts are disabled, which can cause timing violations and metastability in your design.
[/quote]

I don't disable interrupts on the W5500 nor on my micro (dsPIC33E).

That is not documented as such for the W5500, I think. But reading this prompted me to reread the W5100/W5200 docs, and indeed it is there. I'm gearing up on WIZnet again after nearly two years of relative idleness, so thanks for that.

However, the problem is that I don't see that happening. I reset SN_IR(SN_IR_RECV) on every receive to clear the interrupt status, but I only adjust the buffer pointers once every 2 kB (while the average packet is much smaller), for latency and throughput.

So there are nearly always bytes left in the buffer (SN_RX_RSR), in which case SN_IR(SN_IR_RECV) should always stay set and shouldn't be resettable, which is not what I see. I only see this behaviour (SN_IR_RECV not being cleared) if interrupts arrive during the SN_IR_RECV-resetting packet itself, which leads me to assume there is a race there.

I assumed this was something that could not easily be fixed in silicon (the buffer being updated by the Ethernet side very close to the moment I try to reset the interrupt), and that a known race condition was left in because fixing it would require increasing the clock. I was looking for an official source for the mitigation of this problem, just like "stable mode" is the fix for unshadowed 16-bit pointer reads.

My current workaround is simply to check the level of the interrupt line after the combined SN_IR_RECV reset and double RSR read (stable mode), and if it hasn't been reset, to simulate an interrupt.

That’s what I originally did, but there are two problems with that, which is why I changed it to the current situation.

  • because you retrieve RSR before the SN_IR_RECV reset, your RSR value does not account for receives between the RSR retrieval and the SN_IR_RECV reset, so you lose event information there too. While that information will be picked up on the next interrupt, the data received in between will be stale, something you often only notice if you timestamp your packets.
  • moreover, if what I think is true, there is also a danger of SN_IR_RECV not resetting if data arrives at a certain point during that reset, so the risk remains that SN_IR(SN_IR_RECV) stays set and an interrupt is lost again.

Either way, this can only be resolved by additional polling of registers, which is a shame. (I now have a 14-byte DMA packet followed by a 2+ byte payload DMA packet to drain the buffer, and additional polling would hurt throughput and latency.)

I can only conclude that my workaround might not be so bad, because that way I only poll when needed. Still, it would be good to have an official source for all this, since I assume it is a common problem when implementing a low-latency, interrupt-driven application.

[quote]
If you use a level-triggered interrupt (instead of edge-triggered), then when you exit the interrupt service routine and re-enable interrupts, the routine is invoked again, and this will keep happening until there is no data left in the RX buffer.[/quote]

I don't disable interrupts, and I have a hardware INT peripheral on the W5500 INT pin that can probably detect pin changes in the 10 ns range (140 MHz peripheral clock).

In general we are mightily impressed by the W5500, especially since we optimize it for low-latency UDP I/O slaves on a local network, which is not the most common goal in these IoT times. The nicest part is the low CPU burden once the program is rewritten as event- (interrupt-) driven DMA: less than 10-15 µs of CPU time to receive, process and reply (one small packet in and one out) on a relatively puny 70 MIPS dsPIC. The wall time (from interrupt until the send process is completed) can be as low as 100 µs, but once every 2-3 kB received it will be 20 µs more.

I'm currently ironing out the last remaining kinks, because we are working on bringing the design-in from 2 years ago into production, and hopefully our experiences will be helpful for people trying to do something similar. If somebody knows of more advanced examples for any micro for low latency and low CPU use, or even just descriptions of them, I would be extremely interested.

In retrospect, back then I didn't realize what the main problem is: the W5500 INT pin is a level-based interrupt instead of an edge interrupt. I have since used other parts with level-based interrupts and realized it is the same behaviour as the W5500. Eugeny seems to hint at that in his message, but I didn't understand it at the time.

A level-based interrupt remains asserted if the condition is not resolved, without generating a new edge. Level interrupts are more common in faster parts (like >100-200 MHz PIC32/MIPS and ARMs) that have caches and peripheral buses.

If you know that, the workaround is simple: you just rerun the service routine as long as the interrupt is still active after the action that should clear it (writing SN_IR).

I do this via a separate variable (socketinterrupt in the fragment below) so the entire main loop is rerun (and other interrupts can be serviced).

So the interrupt routine looks like this (schematic only; I use a Microchip dsPIC):

volatile unsigned int socketinterrupt = 0;

/* change interrupt on the pin wired to the W5500 INT line */
void _w5500pininterrupt(void)
{
   socketinterrupt++;   // ask the main loop to run the service routine
   pininterrupt = 0;    // clear the MCU's pin-interrupt flag
}

and the main loop checks the service routine:

int main()
{ int rcvsiz, res;
  // lots of initialization

  while (1)
   {
     if (socketinterrupt)
      { socketinterrupt--;  // declare interrupt as "handled".
        res = check_for_read_data(&rcvsiz);  // writes SN_IR_RECV to SN_IR and reads Sn_RX_RSR twice until first read == second read.

        // !!!!!!!!!!!!!!! handle w5500 level instead of edge interrupt.
        if (W5500INTActive() && (socketinterrupt == 0))   // if INT still active and no interrupts pending, retrigger.
          { socketinterrupt++; }
        if (res)
          {
            res = udpgetpayload(&rcvsiz);  // get and process one payload. returns true if there are more unprocessed bytes in the RX buffer.
            if (res)
              { socketinterrupt++; }
          }
      } // if socketinterrupt
     if (something_else)   // other main-loop work
      {
      }
   } // while
} // main

Copyright © 2017 WIZnet Co., Ltd. All Rights Reserved.