Page 1 of 1

Detecting when DHCP struggles

Posted: Sun Feb 11, 2024 9:39 am
by floatyb
I have an ESP32 using esp-idf v5.3-dev-1724-ga5b261f69980, which connects over Wi-Fi to a TI WL18xx based access point.

Like all access points, they're unreliable. One of the issues with this access point is the firmware on it locks up, watchdogs and restarts, which is not a problem for other Linux and Windows based platforms - they reconnect and carry on as if nothing happened. The ESP32 however doesn't, and attempts to DHCP, but it seems the ESP32 can't process the reply. The result is hours of the ESP32 spamming the access point with DHCP requests and discovers until it eventually times out. Sometimes it never recovers.

In the log below, "xx:xx:xx:42:fc:66" is the laptop, "xx:xx:xx:ee:11:fe" is a phone, and "64:b7:08:cc:d1:54" is the ESP32.

The log from the access point:
Feb 10 18:19:46 gw dnsmasq-dhcp[12232]: DHCPREQUEST(brlan) 192.168.248.203 xx:xx:xx:42:fc:66
Feb 10 18:19:46 gw dnsmasq-dhcp[12232]: DHCPACK(brlan) 192.168.248.203 xx:xx:xx:42:fc:66
Feb 10 18:21:04 gw wpa_supplicant[366]: wlan0: AP-STA-DISCONNECTED xx:xx:xx:42:fc:66
Feb 10 18:21:04 gw kernel: wlcore: WARNING Unable to flush all TX buffers, timed out (timeout 500 ms
Feb 10 18:21:06 gw kernel: wlcore: ERROR SW watchdog interrupt received! starting recovery.
...
Feb 10 18:21:10 gw dnsmasq-dhcp[12232]: DHCPREQUEST(brlan) 192.168.248.193 xx:xx:xx:ee:11:fe
Feb 10 18:21:10 gw dnsmasq-dhcp[12232]: DHCPACK(brlan) 192.168.248.193 xx:xx:xx:ee:11:fe
Feb 10 18:21:11 gw wpa_supplicant[366]: wlan0: AP-STA-CONNECTED xx:xx:xx:42:fc:66
Feb 10 18:21:11 gw wpa_supplicant[366]: wlan0: EAPOL-4WAY-HS-COMPLETED xx:xx:xx:42:fc:66
Feb 10 18:21:11 gw dnsmasq-dhcp[12232]: DHCPREQUEST(brlan) 192.168.248.203 xx:xx:xx:42:fc:66
...
Feb 10 18:51:10 gw wpa_supplicant[366]: wlan0: AP-STA-DISCONNECTED xx:xx:xx:ee:11:fe
...
Feb 10 19:05:57 gw dnsmasq-dhcp[12232]: DHCPREQUEST(brlan) 192.168.248.85 64:b7:08:cc:d1:54
Feb 10 19:05:57 gw dnsmasq-dhcp[12232]: DHCPACK(brlan) 192.168.248.85 64:b7:08:cc:d1:54 espressif
Feb 10 19:05:58 gw dnsmasq-dhcp[12232]: DHCPREQUEST(brlan) 192.168.248.85 64:b7:08:cc:d1:54
Feb 10 19:05:58 gw dnsmasq-dhcp[12232]: DHCPACK(brlan) 192.168.248.85 64:b7:08:cc:d1:54 espressif
(This repeats every second from now on, until...)

Feb 10 19:16:03 gw dnsmasq-dhcp[12232]: DHCPDISCOVER(brlan) 64:b7:08:cc:d1:54
Feb 10 19:16:03 gw dnsmasq-dhcp[12232]: DHCPOFFER(brlan) 192.168.248.85 64:b7:08:cc:d1:54
Feb 10 19:16:04 gw dnsmasq-dhcp[12232]: DHCPDISCOVER(brlan) 64:b7:08:cc:d1:54
Feb 10 19:16:04 gw dnsmasq-dhcp[12232]: DHCPOFFER(brlan) 192.168.248.85 64:b7:08:cc:d1:54
(This now repeats every second until the last one at 22:39:00).

Feb 10 19:26:12 gw wpa_supplicant[366]: wlan0: AP-STA-CONNECTED xx:xx:xx:ee:11:fe
Feb 10 19:26:12 gw wpa_supplicant[366]: wlan0: EAPOL-4WAY-HS-COMPLETED xx:xx:xx:ee:11:fe
Feb 10 19:26:12 gw dnsmasq-dhcp[12232]: DHCPDISCOVER(brlan) xx:xx:xx:ee:11:fe
Feb 10 19:26:12 gw dnsmasq-dhcp[12232]: DHCPOFFER(brlan) 192.168.248.193 xx:xx:xx:ee:11:fe
Feb 10 19:26:12 gw dnsmasq-dhcp[12232]: DHCPREQUEST(brlan) 192.168.248.193 xx:xx:xx:ee:11:fe
Feb 10 19:26:12 gw dnsmasq-dhcp[12232]: DHCPACK(brlan) 192.168.248.193 xx:xx:xx:ee:11:fe

Feb 10 22:39:00 gw dnsmasq-dhcp[12232]: DHCPDISCOVER(brlan) 64:b7:08:cc:d1:54
Feb 10 22:39:00 gw dnsmasq-dhcp[12232]: DHCPOFFER(brlan) 192.168.248.85 64:b7:08:cc:d1:54
Feb 10 22:45:10 gw wpa_supplicant[366]: wlan0: AP-STA-POLL-OK 64:b7:08:cc:d1:54
Feb 10 22:50:26 gw wpa_supplicant[366]: wlan0: AP-STA-POLL-OK 64:b7:08:cc:d1:54
(This continues every five minutes until 03:55:10)
Feb 11 03:55:10 gw wpa_supplicant[366]: wlan0: AP-STA-POLL-OK 64:b7:08:cc:d1:54
Feb 11 04:00:17 gw wpa_supplicant[366]: wlan0: AP-STA-DISCONNECTED 64:b7:08:cc:d1:54

and at this point, the ESP32 is never seen on the Wi-Fi network again.

My event handling for Wi-Fi is:

Code: Select all

        switch (event_id) {
        case WIFI_EVENT_STA_START:
                esp_wifi_scan_start(NULL, false);
                break;

        case WIFI_EVENT_SCAN_DONE:
                wifi_event_scan_done(ctx, event_data);
                break;

        case WIFI_EVENT_STA_CONNECTED:
                ESP_LOGI("wifi", "connected");
                esp_netif_create_ip6_linklocal(netif);
                break;

        case WIFI_EVENT_STA_DISCONNECTED:
                ESP_LOGI("wifi", "disconnected");
                esp_wifi_scan_start(NULL, false);
                break;
        }
So a disconnection off the Wi-Fi should trigger a scan, which should then find the network, and via my wifi_event_scan_done() function, will cause the Wi-Fi to be reconnected (this same path is used for the initial connection. It looks up the found networks in a table, and if a matching SSID is found, credentials are set via esp_wifi_set_config() and esp_wifi_connect() is called. If no matching network is found, it restarts scanning. So after the disconnect at 4am this morning, it should have reconnected but for some reason didn't. I'm not sure I understand why that didn't happen.

Overall, it looks to me like when the TI WL18xx firmware restarts, it loses some of the encryption state with clients, allowing it to continue receiving packets from clients, but packets that are sent to the clients can't be decrypted, but that is mere conjecture.

I am not able to monitor the console on the ESP32 because this only sporadically happens and I have no way to leave the laptop permanently connected to monitor the console.

My questions are:
1. how can I detect when DHCP starts to struggle, so I can trigger a more timely reconnect of the Wi-Fi ?
2. how can I configure the DHCP retries and timeouts (e.g. to give up quicker)
3. at what point would I receive an IP_EVENT_STA_GOT_IP event with a zero ip_info.ip.addr during this?

Thanks.