Problem on MESH_NODEs using IDF WIFI MESH for MQTT and OTA via remote HTTPS Server

Postby MichaelS » Mon Aug 26, 2024 8:37 am

I have been developing a preparatory application to confirm that the ESP32 is suitable for a mesh-based IoT project using MQTT.
So far everything has run well: I have taken example projects from the ESP-IDF installation, tested them, and worked them into my application.
I have an MQTT5 TLS host interface over WIFI, with provisioning supported, along with FATFS and OTA updates from a remote HTTPS server.
Everything has been stable, functional and performs well.

Now I have added WIFI Mesh using the example project ip_internal_network installed with the IDF framework under VSCode.
A couple of alarm bells early on:
README.md states "This example uses experimental NAT feature to translate addresses/ports from an internal subnet, that is created by the root node running a DHCP server."
In order to build, I found I had to enable "IP forwarding", "NAT", and "NAT Port Mapping". The latter two are marked new/experimental in MenuConfig.
I got it to run and things looked OK, apart from some network healing issues. Putting that aside for now, I integrated it into my main test app and tested on one node. MQTT and OTA updates, which I had working on WIFI, continued to work flawlessly on WIFI mesh when running from the MESH_ROOT node.

The problems started when I added some MESH_NODEs running the exact same software, as follows.
MQTT continued to work, but I found that if I publish more than two messages at a time, the broker disconnects for a few seconds and then reconnects and recovers. To avoid this I had to throttle publishing with an "inflight" counter, but it slows things down considerably, and I am only testing one MESH_NODE one layer deep from the MESH_ROOT.
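
A minimal sketch of the kind of "inflight" counter described above, not the exact code from my app. The names inflight_gate_t, inflight_gate_init, inflight_gate_try_acquire and inflight_gate_release are made up for illustration, and it assumes the release is driven by the broker's acknowledgement (for example an MQTT_EVENT_PUBLISHED event for QoS 1/2 publishes):

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical in-flight gate: caps the number of unacknowledged publishes. */
typedef struct {
    size_t max_inflight; /* e.g. 2, the point where the disconnects started */
    size_t inflight;     /* publishes sent but not yet acknowledged */
} inflight_gate_t;

static void inflight_gate_init(inflight_gate_t *g, size_t max_inflight)
{
    g->max_inflight = max_inflight;
    g->inflight = 0;
}

/* Call before publishing; returns false when the caller should wait. */
static bool inflight_gate_try_acquire(inflight_gate_t *g)
{
    if (g->inflight >= g->max_inflight) {
        return false;
    }
    g->inflight++;
    return true;
}

/* Call when a publish is acknowledged (assumed: from the MQTT event handler). */
static void inflight_gate_release(inflight_gate_t *g)
{
    if (g->inflight > 0) {
        g->inflight--;
    }
}
```

The sketch is not thread-safe as written. If the publish path and the MQTT event handler run in different tasks, the counter would need a mutex or a counting semaphore around it.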

OTA updates do not work on the MESH_NODE, but still work fine on the MESH_ROOT.

On the Mesh Node I get:
  D (00:00:35.417) HTTP_CLIENT: Begin connect to: https://192.168.1.2:8070
  D (00:00:35.421) aOTA: OTA started
  D (00:00:35.434) esp-tls: host:192.168.1.2: strlen 11   // this is my HTTPS server address
  D (00:00:35.435) esp-tls: [sock=55] Resolved IPv4 address: 192.168.1.2
  D (00:00:35.448) esp-tls: Enable TCP keep alive. idle: 5, interval: 5, count: 3
  D (00:00:35.450) esp-tls: [sock=55] Connecting to server. HOST: 192.168.1.2, Port: 8070
  D (00:00:35.463) mesh_netif: Sending to root, dest addr: 30:30:f9:33:49:bd, size: 58
  D (00:00:35.495) mesh_netif: Node received: from: 48:ca:43:9b:54:c0 to 30:30:f9:33:49:bd size: 58
  D (00:00:35.499) mesh_netif: Sending to root, dest addr: 30:30:f9:33:49:bd, size: 54
  D (00:00:35.520) esp-tls: handshake in progress...
  D (00:00:35.523) mesh_netif: Sending to root, dest addr: 30:30:f9:33:49:bd, size: 278
  D (00:00:35.566) mesh_netif: Node received: from: 48:ca:43:9b:54:c0 to 30:30:f9:33:49:bd size: 58
  D (00:00:35.569) mesh_netif: Sending to root, dest addr: 30:30:f9:33:49:bd, size: 54
  D (36612) wifi:bssid equal: ss_state=0x4
  V (00:00:35.763) transport_base: poll_read: select - Timeout before any socket was ready!
  V (00:00:35.766) transport_base: poll_read: select - Timeout before any socket was ready!
Monitoring the MESH_ROOT at the same time, I get the log below. It is as if the MESH_ROOT is echoing back the MESH_NODE messages within 100 ms, e.g. message lengths received and sent back of 58, 58, 54, 278.
Also note that my MESH_NODE MAC is 48:ca:43:9b:54:c0 and my MESH_ROOT MAC is 30:30:f9:33:49:bd, whereas the first line of the log below seems to record these back to front.
  D (00:01:45.437) mesh_netif: Root received: from: 30:30:f9:33:49:bd to 48:ca:43:9b:54:c0 size: 58
  D (00:01:45.449) mesh_netif: Sending to node: 48:ca:43:9b:54:c0, size: 58
  D (00:01:45.472) mesh_netif: Root received: from: 30:30:f9:33:49:bd to 48:ca:43:9b:54:c0 size: 54
  D (00:01:45.489) mesh_netif: Root received: from: 30:30:f9:33:49:bd to 48:ca:43:9b:54:c0 size: 278
  D (00:01:45.506) mesh_netif: Sending to node: 48:ca:43:9b:54:c0, size: 1494
  E (00:01:45.508) mesh_netif: P2P Send with err code 16392 ESP_ERR_MESH_ARGUMENT
  D (58077) wifi:bssid equal: ss_state=0x4
  D (00:01:45.522) mesh_netif: Sending to node: 48:ca:43:9b:54:c0, size: 58
  D (00:01:45.545) mesh_netif: Root received: from: 30:30:f9:33:49:bd to 48:ca:43:9b:54:c0 size: 54
  D (00:01:45.567) mesh_netif: Sending to node: 48:ca:43:9b:54:c0, size: 1494
  E (00:01:45.569) mesh_netif: P2P Send with err code 16392 ESP_ERR_MESH_ARGUMENT
  D (58177) wifi:bssid equal: ss_state=0x4
  D (58277) wifi:bssid equal: ss_state=0x4
  D (58377) wifi:bssid equal: ss_state=0x4
  D (00:01:45.868) mesh_netif: Sending to node: 48:ca:43:9b:54:c0, size: 1494
  E (00:01:45.870) mesh_netif: P2P Send with err code 16392 ESP_ERR_MESH_ARGUMENT
I can't see why OTA should fail when running on a MESH_NODE when the same software works fine on the MESH_ROOT.
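
One thing that stands out in the MESH_ROOT log is that every failed send has size 1494, while the 58-byte sends go through. A minimal sketch to check the numbers, assuming ESP_ERR_MESH_BASE is 0x4000 and the mesh maximum payload size (MESH_MPS) is 1472 bytes. Both values are from my reading of the IDF headers and should be verified against the installed version; the helper names are made up for illustration:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Assumed values, recalled from esp_err.h / esp_mesh.h; verify for your IDF version. */
#define ASSUMED_ESP_ERR_MESH_BASE 0x4000
#define ASSUMED_MESH_MPS          1472 /* max payload per mesh packet, in bytes */

/* True when a frame of `size` bytes fits in a single mesh packet. */
static bool fits_mesh_payload(size_t size)
{
    return size <= ASSUMED_MESH_MPS;
}

/* Offset of an error code within the mesh error range. */
static int mesh_err_offset(int err)
{
    return err - ASSUMED_ESP_ERR_MESH_BASE;
}

/* Print whether a frame size seen in the log is within the assumed limit. */
static void report_frame(size_t size)
{
    printf("size %zu: %s\n", size,
           fits_mesh_payload(size) ? "fits" : "too large");
}
```

Under those assumptions, 16392 is the mesh error base plus 8, and the 1494-byte frames are the only ones in the log above the payload limit. That matches the sends that fail, so the MESH_ROOT may be rejecting full-size frames coming back from the HTTPS server rather than OTA itself being broken on the MESH_NODE.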

From reading the ESP-WIFI-MESH documentation, this reads as a proven, functional, stable product, but I need to resolve these issues before I can move forward.

I would like some help to know whether my approach of using the "ip_internal_network" example project as a basis is correct, or whether there is a better, more proven solution, e.g. Mwifi, or a different approach. Mwifi does not seem to be up to date with API v5.3, so I have not looked at it yet. I don't think I have the stomach to write my own mesh app from scratch using the API.
Debugging is tricky because this is really not my code: I am just calling the OTA API functions and the magic happens (on the MESH_ROOT, anyway).

Please help
