UDP packet RX pointer not at header

I’m using a port of the Wiznet ioLibrary driver (from here: GitHub - Wiznet/ioLibrary_Driver: ioLibrary_Driver can be used for the application design of WIZnet TCP/IP chips as W5500, W5300, W5200, W5100 W5100S.), running on a Cypress PSoC.

When receiving a UDP packet (actually a DHCP packet) in the socket.c function recvfrom I’m encountering a scenario where the header is corrupt (or the RX pointer is pointing to something that is not the head of the message). In any case the result is that the pack_len is some random number depending on what point of the packet it happens to land. In my case that random number is very big, which of course is larger than the buffer I’m writing to and therefore destroys a bunch of memory and crashes the application.

One solution is to implement the packet length check that’s eluded to in the comment immediately before reading the RX buffer that says “//Need to packet length check” (or check if the IP and port in the header are valid) but that raises 2 questions:

  1. If I run that test what do we do if we encounter and invalid value, there’s no way to flush the buffer and start over and if I read the entire buffer I risk never getting the pointer back to the head of the UDP message.

  2. How could this be happening in the first place? This library is pretty old and presumably used by others so I’d be surprised if something like this had been missed if it happened frequently.

For what it’s worth this only happens when I’m running an application that handles a large number of TCP packets on a different socket and even then it only happens after a few minutes/hours of runtime, so it’s inconsistent. The DHCP packets only come in every minute or so, is it possible that something else is actually stomping on the RX pointer or RX buffer (in the W5500) for this socket?

I can’t see how the problem could happen as you described. If the buffer length (len) is less than sock_remained_size[sn], then pack_len = len, otherwise, pack_len = sock_remained_size[sn]. Can you add a console printf to make sure the contents of sock_remained_size[sn] is reasonable after this 16-bit value has been loaded from head[6] and head[7]?

But because the initial read does not read the correct location I’m not reading the actual header but rather some random location in the RX buffer. Therefore head[6] and head [7] don’t contain the actual length of the packet but rather random data from within the RX buffer.

Even the header read is wrong, how can it “…destroys a bunch of memory and crashes the application…” as you have claimed? You control the size of the receive buffer. I don’t see how recvfrom() can put more data into the buffer than you have allocated.

Duh, you are correct in that, I read that piece of code a bunch of times and for some reason thought it was doing the opposite. Well that’s what I get for trying to debug on a Friday afternoon!!

It looks like in the DHCP code the message buffer size, RIP_MSG_SIZE isn’t big enough to hold the entire buffer, that has never been an issue before though, hmm. I’m going to have to step through that code when I’m back at the bench again on Monday to see what’s going on.

Do you read RX buffer pointer every time before accessing RX buffer, or use your own cached value?

I’m using the library pretty much unchanged (other than a few specific calls to peripherals like the SPI interface). Unfortunately the code is completely lacking in comments so it’s hard to tell what’s going on. I believe it does read the RX buffer pointer each time using the getSn_RX_RD macro.

The only place I can see where there should be another check is in the parseDHCPMSG function where it reads the size of the data in the RX buffer (using the getSn_RX_RSR macro) and then attempts to read in that much data regardless of whether or not it will fit in the allocated DHCP message buffer (which isn’t big enough to read the entire W5500 RX buffer).

I can’t see the actual length returned from that read (and passed into the recvfrom function), but I can see that the content of sock_remained_size is very large (33363) and is equal to pack_len. As acpie360 pointed out the only way this could happen is if both sock_remained_size and len contained the same value, but the only way that could happen would be if the read from getSn_RX_RSR actually returned that value and passed it into recvfrom.

Unfortunately stepping through the code isn’t an option as it takes a long time for this bug to show up. I can only attach to a running system after it has crashed so I can only see what’s in the call stack. It’s entirely possible that something else in the code is corrupting memory but it’s a weird manifestation and it’s always in the same location with the same symptoms (I’m running this test on 5 different devices).

I am afraid we need more details. You must perform log of the packets using Wireshark, and perform log of all socket operations from your processor/MCU (packet sizes, RX/TX pointers).

Looking into the code of parseDHCPMSG I see it uses buffer pDHCPMSG, and is expected to read whole DHCP message.

You must catch which library routine has pointers corrupt, and then consider it for exceptions/bugs.

But you can log required information to trace the operation?

W5500 can support up to 8 sockets and each socket’s RX and TX buffers are 2KB by default. The DHCP buffer is initialized by DHCP_init(). The library allocates a 2KB buffer for DHCP.

#define DATA_BUF_SIZE   2048
uint8_t gDATABUF[DATA_BUF_SIZE];

...
	/* DHCP client Initialization */
	if(gWIZNETINFO.dhcp == NETINFO_DHCP)
	{
		DHCP_init(SOCK_DHCP, gDATABUF);
		// if you want different action instead default ip assign, update, conflict.
		// if cbfunc == 0, act as default.
		reg_dhcp_cbfunc(my_ip_assign, my_ip_assign, my_ip_conflict);

		run_user_applications = false; 	// flag for running user's code
	}

If everything is default, the maximum size getSn_RX_RSR() can return is 2KB. I still don’t see how the buffer overflow could occur unless you have memory corruption elsewhere.

Ok I see that now. Since the DHCP buffer is just a pointer to a RIP_MSG type I defined it to be of size RIP_MSG_SIZE, assuming that the type we were using was large enough to hold the entire buffer, I realised in debugging that it was not which is why I mentioned that the DHCP buffer wasn’t large enough.

FWIW I was able to read the value returned by getSn_RX_RSR and the debugger was indicating a value of 0xFFF8 in that register. I don’t think it’s possible for that to be a real response from the W5500 so I have to assume the entire stack and memory is corrupt by the time I get to it after a fault condition.

Further testing indicates this only happens in a very narrow use case for the device so I suspect the actual cause might be elsewhere in the code although it is still odd that it fails during a DHCP read, possibly a buffer stored in adjacent memory.

This is interesting and maybe Eugeny can help - I don’t know how the W5500 internal receive buffer operates. According to the comment in w5500.h

/**
 * @ingroup Socket_register_group
 * @brief Received data size register(R)
 * @details @ref Sn_RX_RSR indicates the data size received and saved in Socket n RX Buffer.
 * @ref Sn_RX_RSR does not exceed the @ref Sn_RXBUF_SIZE and is calculated as the difference between
 * �Socket n RX Write Pointer (@ref Sn_RX_WR)and �Socket n RX Read Pointer (@ref Sn_RX_RD)
 */
#define Sn_RX_RSR(N)       (_W5500_IO_BASE_ + (0x0026 << 8) + (WIZCHIP_SREG_BLOCK(N) << 3))

So what happens if the RX write pointer wraps around? The number 0xFFF8 could simply be a (-6), indicating a buffer overrun situation.

What is this “very narrow use case” if you don’t mind sharing? Also, are you using non-blocking mode (SF_IO_NONBLOCK) for TCP sockets?