ESP-WIFI-MESH: 2nd layer node failure after root reboot

maxmaxk
Posts: 7
Joined: Mon Jan 29, 2024 6:06 pm

Postby maxmaxk » Thu Feb 01, 2024 11:40 am

I'm using the ip_internal_network example from ESP-IDF release v5.2 on my ESP32-C6 devkits, without any changes.
So far I'm testing with 2 devices. Everything runs smoothly until I reset/reboot my root node. When I check it afterwards, I can see that it has reconnected to my home router. I expect the 2nd layer node to fail to send a few packets and then recover. However, it does not seem to recover.
Per recommendations from GitHub issues, I set `esp_mesh_send_block_time(5000)`.
To me, this happens because the root node reset and reconnected to my home router so quickly that the 2nd layer node treats it as a brief connection failure rather than a root reset.
The root node's logs show that the 2nd layer node is not in its routing table, so the node did not try to reconnect to the root.

Code: Select all

I (849779) mesh_main: Sending routing table to [0] 40:4c:ca:41:b7:90: sent with err code: 0
I (849779) mesh_main: Received Routing table [0] 40:4c:ca:41:b7:90
The 2nd layer node's logs show that MQTT failed and that mesh_send also returns errors. They also show that the node has the IP address 10.0.0.2, which means it assumes it is still connected to the root node, but the root's logs show that the root has no information about this node!

Code: Select all

W (1039709) mesh: [mesh_schedule.c,3130] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:268, no_wnd_count:0, timeout_count:749
E (1039709) mesh_netif: Send with err code 16394 ESP_ERR_MESH_TIMEOUT
E (1039709) esp-tls: Failed to open new connection
E (1039719) transport_base: Failed to open a new connection
E (1039719) mqtt_client: Error transport connect
I (1039729) mesh_mqtt: MQTT_EVENT_ERROR
I (1039729) mesh_mqtt: MQTT_EVENT_DISCONNECTED
I (1039739) mesh_mqtt: sent publish returned msg_id=15518
W (1041409) mesh: [mesh_schedule.c,3130] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:269, no_wnd_count:0, timeout_count:750
I (1041749) mesh_main: Tried to publish layer:2 IP:10.0.0.2
I (1041749) mesh_mqtt: sent publish returned msg_id=8804
W (1042609) mesh: [mesh_schedule.c,3130] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:269, no_wnd_count:0, timeout_count:751

maxmaxk
Posts: 7
Joined: Mon Jan 29, 2024 6:06 pm

Re: ESP-WIFI-MESH: 2nd layer node failure after root reboot

Postby maxmaxk » Fri Feb 02, 2024 10:03 am

How can we check whether there is connectivity at the link layer? I mean something like an ARP mechanism, or a function that reports the link-layer connectivity status when given the hardware (MAC) address of the root.
I still don't understand how the node can assume that it stayed authenticated and connected after the root reset!
Since we don't have the source of the mesh implementation in IDF, I would like to know the algorithm that mesh children use to check parent liveness. The Wi-Fi mesh documentation is quite high level. This makes it difficult to build app-layer workarounds for a problem like this, because we don't know how they will behave once the network becomes large.

maxmaxk
Posts: 7
Joined: Mon Jan 29, 2024 6:06 pm

Re: ESP-WIFI-MESH: 2nd layer node failure after root reboot

Postby maxmaxk » Fri Feb 02, 2024 8:23 pm

I'm posting more details here: https://github.com/espressif/esp-idf/issues/12856, as I haven't received any response on the forum.

zhangyanjiao
Posts: 34
Joined: Mon Aug 28, 2017 3:27 am

Re: ESP-WIFI-MESH: 2nd layer node failure after root reboot

Postby zhangyanjiao » Tue Feb 06, 2024 3:44 am

Code: Select all

esp_mesh_send_block_time(5000)
This API sets the block time of `esp_mesh_send()`. With this setting, `esp_mesh_send()` returns a timeout error if the mesh packet has not been sent successfully within 5 s.
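One app-layer use of that timeout, sketched under assumptions: count consecutive `ESP_ERR_MESH_TIMEOUT` results from `esp_mesh_send()` and flag the parent link as suspect once a threshold is reached. The `send_watchdog_t` type, its two functions, and the threshold are hypothetical names for illustration; the error value 16394 (0x400A) is the one printed in the 2nd layer node log earlier in this thread. What to do once the watchdog trips is left to the application and is not something the mesh library prescribes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Values as defined by ESP-IDF; redefined here only so the sketch also
 * builds off-target without the IDF headers. */
#ifndef ESP_OK
#define ESP_OK 0
#endif
#ifndef ESP_ERR_MESH_TIMEOUT
#define ESP_ERR_MESH_TIMEOUT 0x400A /* 16394, as printed in the node log */
#endif

/* Hypothetical helper: tracks consecutive send timeouts. */
typedef struct {
    uint32_t consecutive_timeouts;
    uint32_t threshold;
} send_watchdog_t;

static void send_watchdog_init(send_watchdog_t *w, uint32_t threshold)
{
    assert(w != NULL && threshold > 0);
    w->consecutive_timeouts = 0;
    w->threshold = threshold;
}

/* Feed the return value of every esp_mesh_send() call.
 * Returns true once `threshold` timeouts have been seen in a row.
 * A successful send resets the count; other errors leave it unchanged. */
static bool send_watchdog_feed(send_watchdog_t *w, int err)
{
    if (err == ESP_ERR_MESH_TIMEOUT) {
        if (w->consecutive_timeouts < UINT32_MAX) {
            w->consecutive_timeouts++;
        }
    } else if (err == ESP_OK) {
        w->consecutive_timeouts = 0;
    }
    return w->consecutive_timeouts >= w->threshold;
}
```

With a 5 s block time and a threshold of 3, such a watchdog would trip after roughly 15 s of failed sends, which is only a rough figure since each call may return earlier or later than the block time.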

zhangyanjiao
Posts: 34
Joined: Mon Aug 28, 2017 3:27 am

Re: ESP-WIFI-MESH: 2nd layer node failure after root reboot

Postby zhangyanjiao » Tue Feb 06, 2024 4:02 am

After a station connects to an AP, it keeps receiving beacons from that AP. If the station does not receive a beacon frame from the connected AP for 6 s, it disconnects from the AP. You can call the following API to set this inactive time.

Code: Select all

/**
  * @brief     Set the inactive time of the STA or AP
  *
  * @attention 1. For Station, If the station does not receive a beacon frame from the connected SoftAP during the inactive time,
  *               disconnect from SoftAP. Default 6s.
  * @attention 2. For SoftAP, If the softAP doesn't receive any data from the connected STA during inactive time,
  *               the softAP will force deauth the STA. Default is 300s.
  * @attention 3. The inactive time configuration is not stored into flash
  *
  * @param     ifx  interface to be configured.
  * @param     sec  Inactive time. Unit seconds.
  *
  * @return
  *    - ESP_OK: succeed
  *    - ESP_ERR_WIFI_NOT_INIT: WiFi is not initialized by esp_wifi_init
  *    - ESP_ERR_WIFI_NOT_STARTED: WiFi is not started by esp_wifi_start
  *    - ESP_ERR_INVALID_ARG: invalid argument, For Station, if sec is less than 3. For SoftAP, if sec is less than 10.
  */
esp_err_t esp_wifi_set_inactive_time(wifi_interface_t ifx, uint16_t sec);
In your case, when the root device reboots, it sends a deauth/disassoc frame to the 2nd layer node. If the 2nd layer node does not receive the deauth/disassoc frame, it waits 6 s before it discovers that the root node has left. If the root has reconnected to the router within those 6 s, the 2nd layer node will not be aware that the root left. But when the 2nd layer node then tries to send a packet to the root node, the root node will refuse it and start the SA query process to trigger the reconnection.
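A minimal sketch of calling that API on a mesh node, assuming the node's STA interface is its uplink to the parent and that the call is made after `esp_wifi_start()`. The wrapper name `mesh_shorten_parent_timeout` is hypothetical. The off-target branch is only a stand-in that mirrors the argument limits quoted in the doc comment above, so the sketch can be compiled without the IDF; on a device the real `esp_wifi.h` is used.

```c
#include <stdint.h>

#ifdef ESP_PLATFORM
#include "esp_err.h"
#include "esp_wifi.h"
#else
/* Off-target stand-ins (assumptions for illustration only). */
typedef int esp_err_t;
typedef enum { WIFI_IF_STA = 0, WIFI_IF_AP = 1 } wifi_interface_t;
#define ESP_OK 0
#define ESP_ERR_INVALID_ARG 0x102
/* Mirrors only the documented limits: STA >= 3 s, SoftAP >= 10 s. */
static esp_err_t esp_wifi_set_inactive_time(wifi_interface_t ifx, uint16_t sec)
{
    uint16_t min_sec = (ifx == WIFI_IF_STA) ? 3 : 10;
    return (sec < min_sec) ? ESP_ERR_INVALID_ARG : ESP_OK;
}
#endif

/* Hypothetical wrapper: shorten how long a child waits without beacons
 * from its parent before it disconnects (default 6 s per the doc comment). */
static esp_err_t mesh_shorten_parent_timeout(uint16_t sta_sec)
{
    return esp_wifi_set_inactive_time(WIFI_IF_STA, sta_sec);
}
```

For example, `mesh_shorten_parent_timeout(3)` requests the documented minimum for a station. Whether 3 s is robust against ordinary beacon loss in a busy environment is not something this thread establishes.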

maxmaxk
Posts: 7
Joined: Mon Jan 29, 2024 6:06 pm

Re: ESP-WIFI-MESH: 2nd layer node failure after root reboot

Postby maxmaxk » Tue Feb 06, 2024 12:06 pm

Thanks for your explanation.
This means that for mesh, we should lower the inactivity time, or delay more than 6s the mesh nodes to re-connect to the router. This way the mesh nodes will try to send an SA query after 6 s, instead of staying confused for some minutes until they receive a de-auth.
Do you have a better solution? Will you consider resolving this issue in a newer version of the mesh lib?
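The second option (keeping a rebooted node quiet for longer than the children's inactive time) could be sketched like this. Everything here is an assumption for illustration: the helper name `root_rejoin_delay_ms`, the safety margin, and the simplification that the node's uptime roughly equals the time since its SoftAP stopped beaconing.

```c
#include <stdint.h>

/* Hypothetical helper: how many more milliseconds a rebooted node should
 * wait before bringing the mesh back up, so that its former children have
 * already hit their beacon inactive timeout.
 *
 * child_inactive_sec: inactive time configured on the children (6 s default)
 * margin_ms:          extra safety margin chosen by the application
 * uptime_ms:          time since this node booted
 */
static uint32_t root_rejoin_delay_ms(uint16_t child_inactive_sec,
                                     uint32_t margin_ms,
                                     uint32_t uptime_ms)
{
    uint32_t quiet_ms = (uint32_t)child_inactive_sec * 1000u + margin_ms;
    return (uptime_ms >= quiet_ms) ? 0u : quiet_ms - uptime_ms;
}
```

On the device, the uptime could come from `esp_timer_get_time() / 1000` and the wait could be a `vTaskDelay(pdMS_TO_TICKS(delay))` placed before `esp_mesh_start()`. Whether that placement suits the ip_internal_network example has not been tested here.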

maxmaxk
Posts: 7
Joined: Mon Jan 29, 2024 6:06 pm

Re: ESP-WIFI-MESH: 2nd layer node failure after root reboot

Postby maxmaxk » Tue Feb 06, 2024 2:18 pm

I think there should be a well-structured document about the relationship between ESP-WIFI-MESH and the Wi-Fi configuration, especially since the mesh library is closed source and we don't know how it works in detail. There is a general overview and a guide, but they are not enough for us to understand its behavior.
Looking at `esp_wifi_set_inactive_time` and the typical scenario logs I'm getting from the STA (here, the 2nd layer node) and the SoftAP (here, the root) in my mesh network, it seems that it is the SoftAP that sends the de-auth frame to the STA after 300 seconds (5 minutes). This means that the root didn't receive any data from the 2nd layer node for 300 seconds. But why? Probably because it changes the router to 00:00:00:00:00:00, 10 seconds after the timeout occurs at its Mesh layer. So I am a little confused here: what is the correct sequence of events?
I need to remind you that since the root node goes off suddenly (reset), there is no de-auth frame from the root to the 2nd layer nodes.
To be clearer, the timeline is like this:

Times: 100, 120, 124, 127?, 427, 428
Events, in order: Mesh already started; root reset (AP); router change (STA); Deauth from AP; SA query (STA)
Marked below the axis: root reconnects to Router

maxmaxk
Posts: 7
Joined: Mon Jan 29, 2024 6:06 pm

Re: ESP-WIFI-MESH: 2nd layer node failure after root reboot

Postby maxmaxk » Tue Feb 06, 2024 2:27 pm

Sorry, the timeline was messed up in my previous post.
Here it is:
[Image]
Or, if the forum does not allow the image, here is the link: https://imgur.com/BjAZlgk

zhangyanjiao
Posts: 34
Joined: Mon Aug 28, 2017 3:27 am

Re: ESP-WIFI-MESH: 2nd layer node failure after root reboot

Postby zhangyanjiao » Tue Feb 20, 2024 2:43 am

"This means that for mesh, we should lower the inactivity time, or delay more than 6s the mesh nodes to re-connect to the router."
Yes, this is the solution.

To understand the behavior in your case, please provide the logs for the root and the node with absolute timestamps, like this:

Code: Select all

[01-19 14:36:26:522]:
I (100868) wifi:mode : sta (60:55:f9:f6:a5:bc) + softAP (60:55:f9:f6:a5:bd)
I (100869) wifi:Total power save buffer number: 16
Here `[01-19 14:36:26:522]` is the absolute timestamp, and `I (100869)` is the relative time in milliseconds since the device started.
With those logs we can work out the timeline for your case.
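When lining up root and node logs by hand, a small host-side helper can pull the relative time out of each line. This is only a sketch: the function names are hypothetical, and it assumes the default log prefix shown above (a level letter followed by the uptime in parentheses).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: extract the relative timestamp from a log line such
 * as "I (100868) wifi:mode : ...". Returns false if the line does not start
 * with a level letter (E/W/I/D/V) followed by "(<number>". */
static bool parse_log_uptime_ms(const char *line, uint32_t *uptime_ms)
{
    char level;
    unsigned long ms;

    if (line == NULL || uptime_ms == NULL) {
        return false;
    }
    if (sscanf(line, " %c (%lu)", &level, &ms) != 2) {
        return false;
    }
    if (level != 'E' && level != 'W' && level != 'I' &&
        level != 'D' && level != 'V') {
        return false;
    }
    *uptime_ms = (uint32_t)ms;
    return true;
}

/* Signed distance between two relative timestamps from the same boot. */
static int64_t log_elapsed_ms(uint32_t from_ms, uint32_t to_ms)
{
    return (int64_t)to_ms - (int64_t)from_ms;
}
```

Relative times from two different devices (or from before and after a reset) are not comparable on their own, which is why the absolute timestamp added by the serial monitor is needed as a common reference.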
