W5500 DNS Timeout Issue

Hi All,

I have a weird issue where I resolve a hostname ok the first time, I get two attempts with timeout errors and from that point on the hostnames resolve fine. This behaviour is consistent and repeatable.
I’m using the dns.c library from ioLibrary_Driver/dns.c at master · Wiznet/ioLibrary_Driver · GitHub

The only change that I’ve made is replacing my old cheap router with a Unifi UDM Pro SE unit. Prior to that however I’m not able to absolutely confirm that it worked correctly but believe it did.

Sorry, I don’t have a great deal of information to contribute but hoping someone also came across a similar issue.

Thanks,
Paul

Probably related to something like the Cisco’s “fast-start” port option. Dig into the router manual. You goal is to make port ready at the router’s side as fast as possible when link is up from W5500 side.

I’ll have a look at that side also, although that doesn’t really explain why the first lookup is successful. I did try leaving the unit alone for around 10 minutes or so in case it was some strange timing thing. Same behaviour where the first lookup works, 2 and 3 return error and subsequent lookups are fine.

First request may go through because port startup settings match the line and before router starts port training. This is a guess of course, but it somehow may explain the behavior. On normal steady link situations like yours must not happen.

I think I’ve eliminated the port as the reason for failure. It was orginally connected to an old cisco switch to the router. I tried directly connected to the router (interestingly faster obtaining an IP address) with the same result and then to a check switch and still had issues.

It isn’t making a lot of sense so far, the only other thing I can look at is putting the old router back in as a test but not looking forward to doing that :slight_smile:

A short log view shows exactly the same pattern each time I run the program. I’m using some buttons to instigate the DNS call.

First attempt work, the next two fail and then works from that point on. Interestingly, on the failures it still returns the correct IP address for that domain.

UNIT	W5500 IP Address: 10.1.1.42
UNIT	W5500 DHCP Server IP: 10.1.1.1
UNIT	W5500 Lease time: 86400 secs

UNIT	Short Press 4
Try to resolve freelanceshoes.com.au
> DNS Query to DNS Server : 10.1.1.1 -> freelanceshoes.com.au
dns_makequery return: 39
sendto parameters: DNS_SOCKET: 4 len: 39 IPPORT_DOMAIN: 53
> Receive DNS message from 10.1.1.1(53). len = 55
question section cp: 192
answer section cp: 0
> DNS Server return from parseDNSMSG : 1 -> 23.227.38.65
Resolved freelanceshoes.com.au 23.227.38.65

UNIT	Short Press 3
Try to resolve cba.com.au
> DNS Query to DNS Server : 10.1.1.1 -> cba.com.au
dns_makequery return: 28
sendto parameters: DNS_SOCKET: 4 len: 28 IPPORT_DOMAIN: 53
> DNS Timeout
> DNS Timeout
> DNS Server is not responding : 10.1.1.1
2022-10-21T09:41:50	ip_mgr_resolve_name: timeout or parsing error - Name: cba.com.au  IP: 18.67.111.92
Error resolving cba.com.au

UNIT	Short Press 3
Try to resolve cba.com.au
> DNS Query to DNS Server : 10.1.1.1 -> cba.com.au
dns_makequery return: 28
sendto parameters: DNS_SOCKET: 4 len: 28 IPPORT_DOMAIN: 53
> DNS Timeout
> DNS Timeout
> DNS Server is not responding : 10.1.1.1
2022-10-21T09:42:00	ip_mgr_resolve_name: timeout or parsing error - Name: cba.com.au  IP: 18.67.111.92
Error resolving cba.com.au

UNIT	Short Press 3
Try to resolve cba.com.au
> DNS Query to DNS Server : 10.1.1.1 -> cba.com.au
dns_makequery return: 28
sendto parameters: DNS_SOCKET: 4 len: 28 IPPORT_DOMAIN: 53
> Receive DNS message from 10.1.1.1(53). len = 92
question section cp: 192
answer section cp: 192
answer section cp: 192
answer section cp: 192
answer section cp: 0
> DNS Server return from parseDNSMSG : 1 -> 18.67.111.92
Resolved cba.com.au 18.67.111.92

UNIT	Short Press 2
Try to resolve test.mosquitto.org
> DNS Query to DNS Server : 10.1.1.1 -> test.mosquitto.org
dns_makequery return: 36
sendto parameters: DNS_SOCKET: 4 len: 36 IPPORT_DOMAIN: 53
> Receive DNS message from 10.1.1.1(53). len = 52
question section cp: 192
answer section cp: 0
> DNS Server return from parseDNSMSG : 1 -> 91.121.93.94
Resolved test.mosquitto.org 91.121.93.94

UNIT	Short Press 1
Try to resolve urbanpersonaltraining.com.au
> DNS Query to DNS Server : 10.1.1.1 -> urbanpersonaltraining.com.au
dns_makequery return: 46
sendto parameters: DNS_SOCKET: 4 len: 46 IPPORT_DOMAIN: 53
> Receive DNS message from 10.1.1.1(53). len = 62
question section cp: 192
answer section cp: 0
> DNS Server return from parseDNSMSG : 1 -> 221.121.157.27
Resolved urbanpersonaltraining.com.au 221.121.157.27

Cheers,
Paul

I just pulled the tcp traffic off the router and this confirms that the request/response seems fine. Each reply contains the DNS result as shown in the attached file. I’m tending back towards the code but that isn’t making a lot of sense either.

dns_packets.zip (507 Bytes)

Agree, packets are ok. What is the version of library? Please ensure you use the latest as it seems there were bugs fixed throughout code life. Library does not cover calling DNS_time_handler(), which is expected to count seconds, and by default DNS_WAIT_TIME is 3 (3 seconds) and MAX_DNS_RETRY is 2 (two attempts). We see two attempts, but does the timing of each attempt match (3 seconds)?

Yes, the timing is correct with 2 retries of 3 seconds, athough it appears that I’ve solved the issue.

The ports are allocated by a function in w5500_iface.c that were added I believe by my Nigerian programmer (yeah, there is one :-)) as we were havig issues with port allocations at one point, so was changed to the following:

uint16_t w5500_get_unused_port(void)
{
if (++local_port_offset < W5500_LOCAL_PORT_OFFSET)
{
local_port_offset = W5500_LOCAL_PORT_OFFSET;
}
return local_port_offset;
}

anything requiring a port was calling this function . . . although not everything.

socket.c had the following in case the port wasn’t send in for whatever reason, and it was allocating, you guessed it, 2 ports on startup.

if(!port)
{
    port = w5500_get_unused_port();

// if(sock_any_port == 0xFFF0)
// {
// sock_any_port = SOCK_ANY_PORT_NUM;
// }
// LOGLN_INFO(“Sock Allocated Port: %d”, port);
}

So the fix was to simply call the function w5500_get_unused_port(); in w5500_iface.c to ensure duplicate ports were not being issued. I’m not sure how far away from the original code this is, I don’t think it’s far off from a quick check, however certain mods were required to make everything work.

So, bottom line, mystery solved.

OK, one coffee later and thinking about it a little more it seems that if the two ports allocated on startup (and I suspect without checking that they are DHCP and NTP clients) are not being closed, then this issue will eventually happen twice at some point in the future when the cycle comes around again.

My project is current hosting a webserver for configuration, and clients for DHCP, NTP, Syslog, DNS and MQTT so there’s a bit going on.

So, should ports ALWAYS be closed when the call has been made or should those allocated be tracked somehow as they’re allocated and released. When allocating a port one would do a lookup to see if a port in the pool is currently in use. Seems like effort compared to just closing the port.

1 Like

Good to hear!

I am not closely familiar with the library and how it actually performs; on the lowest socket level it is wise to open socket, perform task, and close and release the socket if it is not needed on regular basis and you are short on sockets. W5500 has limited number of sockets, so it is up to you as system designer to decide if sockets hijacked for e.g. NTP or MQTT are always allocated, if you perform NTP sync regularly and use MQTT broker connecting regularly. In this example you have two less sockets for other operations.