Page 1 of 1

ESP MESH invalid state seems to cause lockup

Posted: Thu Mar 18, 2021 7:46 am
by Snedig
Hi,

I've been using ESP MESH for some time, but recently a strange problem has surfaced.

Sometimes my devices get hung up with the following messages:

I (19:45:13.501) mesh_net: <GOT ROOT ADDRESS> addr:f0:08:d1:85:ea:ed
I (12983903) mesh: 5080<assoc>parent layer:1, channel:1, rssi:-78, assoc:1, rssi threshold<-78,-82,-85>
W (12985112) mesh: [mesh_schedule.c,3072] [WND-RX]1200 ms timeout, seqno:0, xseqno:1, no_wnd_count:0, timeout_count:0
W (12986313) mesh: [mesh_schedule.c,3072] [WND-RX]1200 ms timeout, seqno:0, xseqno:1, no_wnd_count:0, timeout_count:1
I (12986453) wifi:new:<1,0>, old:<1,0>, ap:<1,0>, sta:<1,0>, prof:1
I (12986454) wifi:station: bc:dd:c2:f7:f3:1d join, AID=1, lr, 20
I (12986608) mesh: 1307[recv]invalid child bc:dd:c2:f7:f3:1d
I (12986609) mesh: 1307[recv]invalid child bc:dd:c2:f7:f3:1d
I (12986610) mesh: 1804[XON]async, from child bc:dd:c2:f7:f3:1d
I (12986610) mesh: 1307[recv]invalid child bc:dd:c2:f7:f3:1d
I (12986619) mesh: 1307[recv]invalid child bc:dd:c2:f7:f3:1d
I (12986681) mesh: 1307[recv]invalid child bc:dd:c2:f7:f3:1d
I (12986681) mesh: 1307[recv]invalid child bc:dd:c2:f7:f3:1d
I (12986682) mesh: 1804[XON]async, from child bc:dd:c2:f7:f3:1d

Once this starts, it seems to go on indefinitely (I tried waiting for about a day, at which point the mesh_schedule timeout count was in the multiple millions).

This seems to be blocking other tasks, and it also seems to be "infectious" to a certain extent, causing more nodes in the network to fail once one has failed.

I have no access to the mesh schedule file, so I can't see what is going on, but is there any way to detect such an error?

Re: ESP MESH invalid state seems to cause lockup

Posted: Wed May 12, 2021 4:58 am
by the-remote-engineer
I am seeing the same issue. (full disclosure: I'm running router-less code with esp-idf v4.2) Here is my debug:
I (15903) mesh: [scan]new scanning time:600ms, beacon interval:500ms
I (15913) mesh: 1990<arm>parent monitor, my layer:2(cap:6)(node), interval:11259ms, retries:1<normal connected>
I (15923) mesh_main: <MESH_EVENT_PARENT_CONNECTED>layer:0-->2, parent:10:52:1c:66:98:61<layer2>, ID:77:77:77:77:77:77, duty:0
I (15933) mesh_main: <MESH_EVENT_TODS_REACHABLE>state:0
I (15933) mesh_main: layer:2, rtableSize:1, NODE
I (15943) mesh_main: <MESH_EVENT_ROOT_ADDRESS>root address:10:52:1c:66:98:61
I (15993) mesh_main: Station MAC: 10:52:1c:69:48:00
I (15993) mesh_main: Parent MAC: 10:52:1c:66:98:61
E (15993) mesh_main: ROUTING TABLE: idx:0 Neighbor:10:52:1c:69:48:00
I (15993) mesh_main: NODE TASK EVENT {"Hello":"World"}
W (17803) mesh: [mesh_schedule.c,3072] [WND-RX]1800 ms timeout, seqno:0, xseqno:1, no_wnd_count:0, timeout_count:0
I (18403) wifi:pm start, type: 1

I (18923) wifi:AP's beacon interval = 512000 us, DTIM period = 1
W (19603) mesh: [mesh_schedule.c,3072] [WND-RX]1800 ms timeout, seqno:0, xseqno:1, no_wnd_count:0, timeout_count:1
W (21403) mesh: [mesh_schedule.c,3072] [WND-RX]1800 ms timeout, seqno:0, xseqno:1, no_wnd_count:0, timeout_count:2
W (23203) mesh: [mesh_schedule.c,3072] [WND-RX]1800 ms timeout, seqno:0, xseqno:1, no_wnd_count:0, timeout_count:3
W (25003) mesh: [mesh_schedule.c,3072] [WND-RX]1800 ms timeout, seqno:0, xseqno:1, no_wnd_count:0, timeout_count:4
I (25943) mesh_main: layer:2, rtableSize:1, NODE
W (26803) mesh: [mesh_schedule.c,3072] [WND-RX]1800 ms timeout, seqno:0, xseqno:1, no_wnd_count:0, timeout_count:5
I (27453) mesh: 5120<active>parent layer:1(node), channel:1, rssi:-25, assoc:2(cnx rssi threshold:-120)my_assoc:0
W (28603) mesh: [mesh_schedule.c,3072] [WND-RX]1800 ms timeout, seqno:0, xseqno:1, no_wnd_count:0, timeout_count:6
I (28953) mesh: 5909<scan>parent layer:1, rssi:-24, assoc:3(cnx rssi threshold:-120)
I (28953) mesh: [SCAN][ch:1]AP:3, other(ID:0, RD:0), MAP:3, idle:0, candidate:3, root:1, topMAP:0[c:2,i:2][00:00:00:00:00:00][FIXED-ROOT]<weak>
I (28963) mesh: 7331[weak]try rssi_threshold:-120, backoff times:0, max:5<-78,-82,-85>
I (28973) mesh: 701[monitor]no change, parent:10:52:1c:66:98:61, rssi:-24
I (28983) mesh: 1990<arm>parent monitor, my layer:2(cap:6)(node), interval:7817ms, retries:2<>
W (30403) mesh: [mesh_schedule.c,3072] [WND-RX]1800 ms timeout, seqno:0, xseqno:1, no_wnd_count:0, timeout_count:7
W (32203) mesh: [mesh_schedule.c,3072] [WND-RX]1800 ms timeout, seqno:0, xseqno:1, no_wnd_count:0, timeout_count:8
W (34003) mesh: [mesh_schedule.c,3072] [WND-RX]1800 ms timeout, seqno:0, xseqno:1, no_wnd_count:0, timeout_count:9
W (35803) mesh: [mesh_schedule.c,3072] [WND-RX]1800 ms timeout, seqno:0, xseqno:1, no_wnd_count:0, timeout_count:10
I (35943) mesh_main: layer:2, rtableSize:1, NODE
I (36993) mesh: 5120<active>parent layer:1(node), channel:1, rssi:-25, assoc:3(cnx rssi threshold:-120)my_assoc:0
W (37603) mesh: [mesh_schedule.c,3072] [WND-RX]1800 ms timeout, seqno:0, xseqno:1, no_wnd_count:0, timeout_count:11
I (38493) mesh: 5909<scan>parent layer:1, rssi:-26, assoc:3(cnx rssi threshold:-120)
I (38493) mesh: [SCAN][ch:1]AP:3, other(ID:0, RD:0), MAP:3, idle:0, candidate:3, root:1, topMAP:2[c:0,i:2][00:00:00:00:00:00][FIXED-ROOT]<weak>
I (38503) mesh: 7331[weak]try rssi_threshold:-120, backoff times:0, max:5<-78,-82,-85>
I (38513) mesh: 701[monitor]no change, parent:10:52:1c:66:98:61, rssi:-26
I (38523) mesh: 1990<arm>parent monitor, my layer:2(cap:6)(node), interval:5360ms, retries:3<>
W (39403) mesh: [mesh_schedule.c,3072] [WND-RX]1800 ms timeout, seqno:0, xseqno:1, no_wnd_count:0, timeout_count:12
W (41203) mesh: [mesh_schedule.c,3072] [WND-RX]1800 ms timeout, seqno:0, xseqno:1, no_wnd_count:0, timeout_count:13
W (43003) mesh: [mesh_schedule.c,3072] [WND-RX]1800 ms timeout, seqno:0, xseqno:1, no_wnd_count:0, timeout_count:14
I (44023) mesh: 5120<active>parent layer:1(node), channel:1, rssi:-25, assoc:3(cnx rssi threshold:-120)my_assoc:0
W (44803) mesh: [mesh_schedule.c,3072] [WND-RX]1800 ms timeout, seqno:0, xseqno:1, no_wnd_count:0, timeout_count:15
I (45523) mesh: 5909<scan>parent layer:1, rssi:-27, assoc:3(cnx rssi threshold:-120)
I (45523) mesh: [SCAN][ch:1]AP:3, other(ID:0, RD:0), MAP:3, idle:0, candidate:3, root:1, topMAP:2[c:0,i:2][00:00:00:00:00:00][FIXED-ROOT]<weak>
I (45533) mesh: 7331[weak]try rssi_threshold:-120, backoff times:0, max:5<-78,-82,-85>
I (45543) mesh: 701[monitor]no change, parent:10:52:1c:66:98:61, rssi:-27
I (45543) mesh: 1990<arm>parent monitor, my layer:2(cap:6)(node), interval:395191ms, retries:3<>
I (45943) mesh_main: layer:2, rtableSize:1, NODE
W (46603) mesh: [mesh_schedule.c,3072] [WND-RX]1800 ms timeout, seqno:0, xseqno:1, no_wnd_count:0, timeout_count:16
W (48403) mesh: [mesh_schedule.c,3072] [WND-RX]1800 ms timeout, seqno:0, xseqno:1, no_wnd_count:0, timeout_count:17
W (50203) mesh: [mesh_schedule.c,3072] [WND-RX]1800 ms timeout, seqno:0, xseqno:1, no_wnd_count:0, timeout_count:18
W (52003) mesh: [mesh_schedule.c,3072] [WND-RX]1800 ms timeout, seqno:0, xseqno:1, no_wnd_count:0, timeout_count:19
W (53803) mesh: [mesh_schedule.c,3072] [WND-RX]1800 ms timeout, seqno:0, xseqno:1, no_wnd_count:0, timeout_count:20
W (55603) mesh: [mesh_schedule.c,3072] [WND-RX]1800 ms timeout, seqno:0, xseqno:1, no_wnd_count:0, timeout_count:21
I (55943) mesh_main: layer:2, rtableSize:1, NODE
W (57403) mesh: [mesh_schedule.c,3072] [WND-RX]1800 ms timeout, seqno:0, xseqno:1, no_wnd_count:0, timeout_count:22
W (59203) mesh: [mesh_schedule.c,3072] [WND-RX]1800 ms timeout, seqno:0, xseqno:1, no_wnd_count:0, timeout_count:23
W (61003) mesh: [mesh_schedule.c,3072] [WND-RX]1800 ms timeout, seqno:0, xseqno:1, no_wnd_count:0, timeout_count:24
W (62803) mesh: [mesh_schedule.c,3072] [WND-RX]1800 ms timeout, seqno:0, xseqno:1, no_wnd_count:0, timeout_count:25
W (64603) mesh: [mesh_schedule.c,3072] [WND-RX]1800 ms timeout, seqno:0, xseqno:1, no_wnd_count:0, timeout_count:26

Re: ESP MESH invalid state seems to cause lockup

Posted: Sun Jan 09, 2022 4:45 pm
by shantha
I am using IDF ver 4.3 and facing the same issue. here is my debug output.

I (7509) mesh: [scan]new scanning time:600ms, beacon interval:300ms
I (7519) mesh: 2004<arm>parent monitor, my layer:3(cap:6)(node), interval:7642ms, retries:1<normal connected>
I (00:00:06.904) MAIN: <MESH_EVENT_PARENT_CONNECTED>layer:0-->3, parent:94:b9:7e:e9:52:b1, ID:77:77:77:77:77:00
I (00:00:06.915) mesh_netif: It was a wifi station removing stuff
I (7789) wifi:AP's beacon interval = 307200 us, DTIM period = 1
W (8739) mesh: [mesh_schedule.c,3130] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:1, no_wnd_count:0, timeout_count:0
W (9939) mesh: [mesh_schedule.c,3130] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:1, no_wnd_count:0, timeout_count:1
W (11139) mesh: [mesh_schedule.c,3130] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:1, no_wnd_count:0, timeout_count:2
I (11509) wifi:state: run -> init (fc0)
I (11509) wifi:pm stop, total sleep time: 0 us / 3998117 us

I (11509) wifi:new:<1,0>, old:<1,1>, ap:<1,1>, sta:<1,1>, prof:1
W (12339) mesh: [mesh_schedule.c,3130] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:1, no_wnd_count:0, timeout_count:3
W (13539) mesh: [mesh_schedule.c,3130] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:1, no_wnd_count:0, timeout_count:4
W (14739) mesh: [mesh_schedule.c,3130] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:1, no_wnd_count:0, timeout_count:5
I (15349) mesh: 5151<active>parent layer:2(node), channel:1, rssi:-54, assoc:0(cnx rssi threshold:-120)my_assoc:0
W (15939) mesh: [mesh_schedule.c,3130] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:1, no_wnd_count:0, timeout_count:6
W (17139) mesh: [mesh_schedule.c,3130] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:1, no_wnd_count:0, timeout_count:7
W (18339) mesh: [mesh_schedule.c,3130] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:1, no_wnd_count:0, timeout_count:8
W (19539) mesh: [mesh_schedule.c,3130] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:1, no_wnd_count:0, timeout_count:9
W (20739) mesh: [mesh_schedule.c,3130] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:1, no_wnd_count:0, timeout_count:10
W (21939) mesh: [mesh_schedule.c,3130] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:1, no_wnd_count:0, timeout_count:11
W (23139) mesh: [mesh_schedule.c,3130] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:1, no_wnd_count:0, timeout_count:12
W (24339) mesh: [mesh_schedule.c,3130] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:1, no_wnd_count:0, timeout_count:13

Re: ESP MESH invalid state seems to cause lockup

Posted: Fri Jan 14, 2022 12:29 pm
by ESP_Sunshine
This is because the underlying window of the wifi mesh has expired, so you are continuously sending data, and the underlying buffer will be full, but your data cannot be sent to the peer.

There are two reasons for this:
1. The disconnection between nodes leads to poor communication, which causes this problem

2. It may be that the amount of data is too large and the buffer is full, resulting in congestion at the bottom layer, and this result will also occur.

Can you explain how this happened to you.

Re: ESP MESH invalid state seems to cause lockup

Posted: Sun Jan 23, 2022 4:40 pm
by shantha
Hi Jyf_sunshine,

Thanks for your reply.

We have observed this happening when we restart our Root node, another node becomes the root and previous root node is going to this looping status. Also, we have observed sometimes this is happening to other nodes as well.

Appreciate if you can explain how the two possible reasons you mentioned related our case.
Also, if you can suggest the possible solution to overcome the issue (modification to mesh code or any configuration changes) will be very much helpful.

Thank you.

Shantha.

Re: ESP MESH invalid state seems to cause lockup

Posted: Wed Nov 02, 2022 1:23 pm
by bradkeifer
I too am experiencing this problem with esp-idf v4.4.2.

I am running code based on the IP Internal Network Example https://github.com/espressif/esp-idf/tr ... al_network.

The problem occurs on a non-root node very soon after I experience a brownout detector reset on my root node.

Is there any way that we can deal with this problem in app code or configuration, or is this a bug in the esp-mesh idf code?

Re: ESP MESH invalid state seems to cause lockup

Posted: Thu Nov 03, 2022 6:35 pm
by bradkeifer
Further to the above post, I should add that I am logging the following stats via mqtt on a regular basis, and they show no signs of a developing problem. Obviously, the stats from the locked up non-root node stop being received by the mqtt broker from the point of failure. These stats are:
  • # of tasks
  • mesh layer #
  • self organized status
  • IP address
  • free heap
  • min free heap
  • total mesh rx pending (always 0)
  • total mesh tx pending (always 0)
  • mesh active duty cycle (always 100 on the root, 10 on the non-root)
When I power cycle the non-root node after this failure, it is unable to reconnect to the root node. Here are the relevant debug logs.

Firstly, the non-root node:

Code: Select all

I (00:01:13.183) mesh_main: <MESH_EVENT_PARENT_DISCONNECTED>reason:4
I (73970) mesh: 5200[disconnect]looking for a new parent, [L:0]try layer:0[scan]
I (74272) mesh: [SCAN][ch:6]AP:3, other(ID:0, RD:0), MAP:1, idle:0, candidate:1, root:1, topMAP:0[c:0,i:0][28:cf:e9:83:0b:04]router found<>
I (74273) mesh: 7369[selection]try rssi_threshold:-78, backoff times:0, max:5<-78,-82,-85>
I (74282) mesh: [DONE]connect to parent:ESPM_51B348, channel:6, rssi:-55, 30:c6:f7:51:b3:49[layer:1, assoc:1], my_vote_num:0/voter_num:0, rc[00:00:00:00:00:00/-120/0]
I (74647) wifi:new:<6,1>, old:<6,0>, ap:<6,1>, sta:<6,1>, prof:6
I (75382) wifi:state: init -> auth (b0)
I (75388) wifi:state: auth -> assoc (0)
I (76389) wifi:state: assoc -> init (400)
I (76389) wifi:new:<6,0>, old:<6,1>, ap:<6,1>, sta:<6,1>, prof:6
I (76391) mesh: [wifi]disconnected reason:4(assoc expire), continuous:1/max:12, non-root, vote(,stopped)<><>
I (00:01:15.654) mesh_main: <MESH_EVENT_PARENT_DISCONNECTED>reason:4
I (76398) wifi:new:<6,1>, old:<6,0>, ap:<6,1>, sta:<6,1>, prof:6
I (76410) wifi:state: init -> auth (b0)
I (76416) wifi:state: auth -> assoc (0)
I (77417) wifi:state: assoc -> init (400)
I (77418) wifi:new:<6,0>, old:<6,1>, ap:<6,1>, sta:<6,1>, prof:6
I (77419) mesh: [wifi]disconnected reason:4(assoc expire), continuous:2/max:12, non-root, vote(,stopped)<><>
I (00:01:16.682) mesh_main: <MESH_EVENT_PARENT_DISCONNECTED>reason:4
I (77426) wifi:new:<6,1>, old:<6,0>, ap:<6,1>, sta:<6,1>, prof:6
I (77439) wifi:state: init -> auth (b0)
I (77444) wifi:state: auth -> assoc (0)
I (78446) wifi:state: assoc -> init (400)
I (78446) wifi:new:<6,0>, old:<6,1>, ap:<6,1>, sta:<6,1>, prof:6
I (78448) mesh: [wifi]disconnected reason:4(assoc expire), continuous:3/max:12, non-root, vote(,stopped)<><>
I (00:01:17.711) mesh_main: <MESH_EVENT_PARENT_DISCONNECTED>reason:4
I (78455) wifi:new:<6,1>, old:<6,0>, ap:<6,1>, sta:<6,1>, prof:6
I (78467) wifi:state: init -> auth (b0)
I (78474) wifi:state: auth -> assoc (0)
I (79475) wifi:state: assoc -> init (400)
I (79475) wifi:new:<6,0>, old:<6,1>, ap:<6,1>, sta:<6,1>, prof:6
I (79477) mesh: [wifi]disconnected reason:4(assoc expire), continuous:4/max:12, non-root, vote(,stopped)<><>
I (00:01:18.740) mesh_main: <MESH_EVENT_PARENT_DISCONNECTED>reason:4
I (79484) wifi:new:<6,1>, old:<6,0>, ap:<6,1>, sta:<6,1>, prof:6
I (79496) wifi:state: init -> auth (b0)
I (79502) wifi:state: auth -> assoc (0)
I (80503) wifi:state: assoc -> init (400)
I (80504) wifi:new:<6,0>, old:<6,1>, ap:<6,1>, sta:<6,1>, prof:6
I (80505) mesh: [wifi]disconnected reason:4(assoc expire), continuous:5/max:12, non-root, vote(,stopped)<><>
I (00:01:19.769) mesh_main: <MESH_EVENT_PARENT_DISCONNECTED>reason:4
I (80513) wifi:new:<6,1>, old:<6,0>, ap:<6,1>, sta:<6,1>, prof:6
I (80525) wifi:state: init -> auth (b0)
I (80531) wifi:state: auth -> assoc (0)
I (81532) wifi:state: assoc -> init (400)
I (81533) wifi:new:<6,0>, old:<6,1>, ap:<6,1>, sta:<6,1>, prof:6
And here is the root node logs for the corresponding period:

Code: Select all

I (18:29:24.248) mesh_main: <MESH_EVENT_PS_CHILD_DUTY>duty:0, child aid:1, child 4c:eb:d6:7c:c9:90
I (26604284) wifi:new:<6,1>, old:<6,1>, ap:<6,1>, sta:<6,0>, prof:6
I (26604284) wifi:station: 4c:eb:d6:7c:c9:90 join, AID=1, bgn, 40U
W (26604285) wifi:m f assoc rsp l=205

I (18:29:25.277) mesh_main: <MESH_EVENT_PS_CHILD_DUTY>duty:0, child aid:1, child 4c:eb:d6:7c:c9:90
I (26606750) wifi:new:<6,1>, old:<6,1>, ap:<6,1>, sta:<6,0>, prof:6
I (26606751) wifi:station: 4c:eb:d6:7c:c9:90 join, AID=1, bgn, 40U
W (26606751) wifi:m f assoc rsp l=205

I (18:29:27.744) mesh_main: <MESH_EVENT_PS_CHILD_DUTY>duty:0, child aid:1, child 4c:eb:d6:7c:c9:90
I (26607779) wifi:new:<6,1>, old:<6,1>, ap:<6,1>, sta:<6,0>, prof:6
I (26607780) wifi:station: 4c:eb:d6:7c:c9:90 join, AID=1, bgn, 40U
W (26607780) wifi:m f assoc rsp l=205

I (18:29:28.773) mesh_main: <MESH_EVENT_PS_CHILD_DUTY>duty:0, child aid:1, child 4c:eb:d6:7c:c9:90
I (26608810) wifi:new:<6,1>, old:<6,1>, ap:<6,1>, sta:<6,0>, prof:6
I (26608810) wifi:station: 4c:eb:d6:7c:c9:90 join, AID=1, bgn, 40U
W (26608811) wifi:m f assoc rsp l=205

I (18:29:29.803) mesh_main: <MESH_EVENT_PS_CHILD_DUTY>duty:0, child aid:1, child 4c:eb:d6:7c:c9:90
I (26609837) wifi:new:<6,1>, old:<6,1>, ap:<6,1>, sta:<6,0>, prof:6
I (26609838) wifi:station: 4c:eb:d6:7c:c9:90 join, AID=1, bgn, 40U
W (26609839) wifi:m f assoc rsp l=205

I (18:29:30.831) mesh_main: <MESH_EVENT_PS_CHILD_DUTY>duty:0, child aid:1, child 4c:eb:d6:7c:c9:90
I (26610866) wifi:new:<6,1>, old:<6,1>, ap:<6,1>, sta:<6,0>, prof:6
I (26610867) wifi:station: 4c:eb:d6:7c:c9:90 join, AID=1, bgn, 40U
W (26610868) wifi:m f assoc rsp l=205
If I then power cycle the root-node, the mesh network is restored and the cycle is able to repeat.

This tends to indicate to me that the root cause of the problem perhaps lies with the root node. Any pointers or assistance would be greatly appreciated.