MQTT Client causes Chip Hardlock

EarthAndy
Posts: 12
Joined: Fri Jan 19, 2024 6:53 pm

MQTT Client causes Chip Hardlock

Postby EarthAndy » Tue May 14, 2024 7:27 pm

I have a system based on an ESP32-S3 (N8) that uses Ethernet and the ESP-IDF MQTT client ) and I experience hardlocks in my system in a way I want to say is intermittent, but happen at the same place in my code in a deterministic manner.

Code: Select all

PRIV_REQUIRES mqttclient 
#include "mqtt_client.h"
The hardlocks occur when the system hits the broker: on primary connection, on add subscription, on publish message. The "hardlock" in this case is that the chip completely stops - all threads quit reporting to the console log, all external communications stop, no apparent processing takes place.

I am unable to reproduce the issue consistently, all I know is that some builds "just hardlock". I'll do a full code build, flash it to chip, and usually if the issue occurs it happens early in the boot process (usually at my connection init or adding subscribers). Sometimes it won't happen until I get a subscribed message come in.

I've been dealing with this issue for a few months, back then I saw something that said it could have to do with WiFi, but now I have my hardware which has a Wiz5500 Ethernet module and I'm still seeing the hardlocks.

Now, how I've gotten around this issue may provide some hints to what is happening, I hope: since this seems to be deterministic, the hardlocks occur at the exact same place IF it is going to happen in a build. The way that I've been able to get around this is to "make the code different enough", and I'm doing that with adding or removing ESP_LOGX() log lines around in my code.

I do not currently have JTAG debugging working as I can't get my development OpenOCD to see my chip via USB (I've tried just about any of the fixes I've found on the web, but nothing seems to be working). When I was looking for solutions to this a few months back I ran across a similar issue where folks with working JTAG were showing that the system was locked in some kind of mqtt wait (I've since looked for those threads to no avail).

I've tried to turn up logging on the mqtt related features, but I don't seem to get any more info from them:

Code: Select all

    esp_log_level_set( "mqtt5_client", ESP_LOG_VERBOSE);
    esp_log_level_set( "MQTT5EventHandler", ESP_LOG_VERBOSE);
but those don't seem to spit out any more information. If there are any other modules I could log that could be helpful, please send those along.

Some other solutions I've tried: I've tried moving these between cores - kept all networking on Core0, moved networking to Core1, changed priorities of MQTT and Networking tasks, etc.

In general, my code is not much more than the example code, I connect to a broker, set some subscriptions, and publish some messages. I'm not doing anything that I consider especially esoteric.

Thanks for reading, any help is welcome!

A

ESP_Sprite
Posts: 9757
Joined: Thu Nov 26, 2015 4:08 am

Re: MQTT Client causes Chip Hardlock

Postby ESP_Sprite » Tue May 14, 2024 11:51 pm

That is odd. Can you maybe configure the task watchdog to panic if it times out? (It's a menuconfig option somewhere.) With a bit of luck, that'll give you a backtrace that leads to the culprit.

EarthAndy
Posts: 12
Joined: Fri Jan 19, 2024 6:53 pm

Re: MQTT Client causes Chip Hardlock

Postby EarthAndy » Wed May 15, 2024 9:38 pm

I tried quickly to turn on the Task Watchdogs but that borked all of my TWAI/CAN stuff - TWAI driver quit working altogether. :shock:

ESP_Sprite
Posts: 9757
Joined: Thu Nov 26, 2015 4:08 am

Re: MQTT Client causes Chip Hardlock

Postby ESP_Sprite » Thu May 16, 2024 12:58 am

That's... errm... disconcerning; the task watchdog shouldn't do anything unless a task is hogging the CPU for several seconds. What did the panic from that look like?

EarthAndy
Posts: 12
Joined: Fri Jan 19, 2024 6:53 pm

Re: MQTT Client causes Chip Hardlock

Postby EarthAndy » Thu May 16, 2024 2:10 pm

I wasn't able to reproduce a mqtt hardlock in this configuration. It's nearly impossible to say "I want to reproduce it now", it's more of a "well, this build has the problem". Sometimes I can go hours without it showing up, and sometimes I can barely figure out a combination of tweaks to make it work.

Problem is that I can't do anything with my system without CAN, so that limits that side of testing things.

EarthAndy
Posts: 12
Joined: Fri Jan 19, 2024 6:53 pm

Re: MQTT Client causes Chip Hardlock

Postby EarthAndy » Thu May 16, 2024 6:35 pm

I was just able to catch a hardlock. After I had a build that I knew would lock up, I tried to turn on the Task Watchdog, and see what I could see (and if it would still lock up):

Code: Select all

I (4896) DEV_MQTT_Client: [DEV] MQTT_EVENT_BEFORE_CONNECT
I (4896) MCU_MQTT_Client_ESP32:    MCU_MQTT_Client_ESP32 Connected with Client Handle: [0x3fcc8114]
I (4896) DEV_MQTT_Client: [0x3fcc7fd0] DEV_MQTT_Client Connection Started... [oMQTTClientHandle=0x3fcc8114]
I (4906) AppIpMqttClient: MQTT Connection Started! [0x3fcc7fd0]
E (4906) McuCanBusChannel_Esp32: FAILED to Queue TWAI Message for Transmit!
E (4926) McuCanBusChannel_Esp32: FAILED to Queue TWAI Message for Transmit!
E (4926) McuCanBusChannel_Esp32: FAILED to Queue TWAI Message for Transmit!
E (4946) McuCanBusChannel_Esp32: FAILED to Queue TWAI Message for Transmit!
E (5036) McuCanBusChannel_Esp32: FAILED to Queue TWAI Message for Transmit!
E (5036) McuCanBusChannel_Esp32: FAILED to Queue TWAI Message for Transmit!
E (5046) McuCanBusChannel_Esp32: FAILED to Queue TWAI Message for Transmit!
E (5056) McuCanBusChannel_Esp32: FAILED to Queue TWAI Message for Transmit!
ESP-ROM:esp32s3-20210327
Build:Mar 27 2021
rst:0x7 (TG0WDT_SYS_RST),boot:0x28 (SPI_FAST_FLASH_BOOT)
Saved PC:0x4200575f
SPIWP:0xee
mode:DIO, clock div:1
load:0x3fce3818,len:0x1260
load:0x403c9700,len:0x4
load:0x403c9704,len:0xa64
load:0x403cc700,len:0x2c14
SHA-256 comparison failed:
Calculated: 78e47e1324eafbfe6f1399dc795f7191e3919b542e063713b7d365584f61cc3a
Expected: ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff
Attempting to boot anyway...
The code hardlocked at the initial connection to the MQTT broker. Clearly this killed my CAN driver, but I was able to reproduce the hardlock event - and nothing happened but the watchdog reset the chip after the 5 second timeout.


I tried with both PANIC enabled and not, and both produced the trace as above.

Code: Select all

CONFIG_ESP_TASK_WDT_EN=y
CONFIG_ESP_TASK_WDT_INIT=y
CONFIG_ESP_TASK_WDT_PANIC=y
CONFIG_ESP_TASK_WDT_TIMEOUT_S=5
CONFIG_ESP_TASK_WDT_CHECK_IDLE_TASK_CPU0=y
CONFIG_ESP_TASK_WDT_CHECK_IDLE_TASK_CPU1=y

I managed to get OpenOCD working, and was able to flash over JTAG (boy, that's slow), but I've never used the ESP32 debugger in VSCode (or at all, for that matter), so I'm not sure where to start with using that for profiling.

Jonathan2892
Posts: 45
Joined: Tue Dec 07, 2021 4:04 pm

Re: MQTT Client causes Chip Hardlock

Postby Jonathan2892 » Thu May 16, 2024 7:02 pm

Hi,

I think I have the same issue with the MQTT client:
I use an esp32-s3-wroom-1-n16r8 with idf 5.2.1.
On core 1 I have a task, which uses SPI. On core 0 I have wifi and mqtt running.
But as soon as I try to connect to the broker the esp crahses without any useful output (can post the logs tomorrow). Without the SPI stuff everything works fine...
Sometimes I even can connect to the broker, but then I can not subscribe and crash.

Edit: Same problem on idf 5.1.2 and 4.4.3

EarthAndy
Posts: 12
Joined: Fri Jan 19, 2024 6:53 pm

Re: MQTT Client causes Chip Hardlock

Postby EarthAndy » Fri May 31, 2024 4:04 pm

Back with a bump and another plea for direction on this issue.

I've reached a build where I cannot get MQTT to not-crash with any of my admittedly hacky methods to mitigate the issue. I've mentioned them above, but did not clearly show what I was using to get around this issue. Prepare to be nauseated!

I have a set of magic defines:

Code: Select all

#undef MAGIC_HARDLOCK_FIX_A
#undef MAGIC_HARDLOCK_FIX_B
#undef MAGIC_HARDLOCK_FIX_C
#undef MAGIC_HARDLOCK_FIX_D
Then in my low-level interfaces, which are *basically* the ESP32 example code, I have #ifdefs that block or unblock ESP_LOG lines:

Code: Select all

//==============================================================================
// MQTT Client Connection Process
//==============================================================================
esp_mqtt_client_handle_t MCU_MQTT_Client_ESP32::MCU_init(void)
{

#ifdef MAGIC_HARDLOCK_FIX_A
    ESP_LOGV(MCU_LOG_Tag, "esp_mqtt_client_init...");
#endif
    client_handle = esp_mqtt_client_init(MQTT5_ConnectionConfig);

#ifdef MAGIC_HARDLOCK_FIX_B
    ESP_LOGV(MCU_LOG_Tag, "esp_mqtt5_client_set_user_properties...");
#endif
    esp_mqtt5_client_set_user_property(&MQTT5_ConnectionProperty->user_property, UserPropertyArray, UserPropertyArrayLength);
    esp_mqtt5_client_set_user_property(&MQTT5_ConnectionProperty->will_user_property, UserPropertyArray, UserPropertyArrayLength);

#ifdef MAGIC_HARDLOCK_FIX_C
    ESP_LOGV(MCU_LOG_Tag, "esp_mqtt5_client_set_connect_property...");
#endif
    esp_mqtt5_client_set_connect_property(client_handle, MQTT5_ConnectionProperty);

#ifdef MAGIC_HARDLOCK_FIX_D
    ESP_LOGV(MCU_LOG_Tag, "esp_mqtt5_client_delete_user_property...");
#endif
    esp_mqtt5_client_delete_user_property(MQTT5_ConnectionProperty->user_property);
    esp_mqtt5_client_delete_user_property(MQTT5_ConnectionProperty->will_user_property);

#ifdef MAGIC_HARDLOCK_FIX_A
    ESP_LOGV(MCU_LOG_Tag, "esp_mqtt_client_register_event...");
#endif
    esp_mqtt_client_register_event(client_handle, (esp_mqtt_event_id_t)ESP_EVENT_ANY_ID, MCU_Default_MQTT5EventRouter, (void *)this);


#ifdef MAGIC_HARDLOCK_FIX_B
    ESP_LOGD(MCU_LOG_Tag, "MCU_MQTT_Client_ESP32::MCU_init() == %p", client_handle );
#endif

    return client_handle;
}
By twiddling between #define and #undef on those magic defines, I usually can find a combination that keeps MQTT from locking up. This gives me 16 different options. On my latest code, none of the 16 combinations keeps MQTT from locking the chip.

Since I'm really stuck now, I've moved on to trying to modify the esp32 mqttclient options in the menuconfig. I have tried just about any of the reasonable combinations of settings there: moved cores, changed task priorities, disabled API locks (my app has a single task that runs MQTT functionality, so I'm not concerned about collisions). I've changed from enqueued publish to blocking publish. With all of these changes the code crashes at the exact same place that is defined by those magic defines I listed above.

I tried to fire up OpenOCD this morning, but can't get it started again.

My work-arounds are no longer working-around, so I'm a little stuck.

I'm willing at this point to pivot to another MQTT Library, but my first pass all pointed to Arduino libraries and I'm not sure the steps to get those into a C++ CMake environment. I'd like to keep within the ESP32 API ecosystem, so getting this working is still my priority.

Thanks in advance.

Edited to Add:
This really feels like a code partitioning problem - my "magic fixes" seem to just be realigning code. I've tried several different methods to impact the alignment but none work like turning on and off the ESP_LOG lines.

Is there some sort of pragma option that would allow for more formal code alignment with the MQTT packages? Just brainstorming...

Jonathan2892
Posts: 45
Joined: Tue Dec 07, 2021 4:04 pm

Re: MQTT Client causes Chip Hardlock

Postby Jonathan2892 » Sat Jun 01, 2024 6:21 am

Hi,

I was able to solve my issue by moving some stuff like .bss to the SPIRAM. I did also some changes in the menuconfig, I can post them in a few days, right now I am traveling.
But you don’t have SPIRAM, right? Well, maybe it will help somehow.

Best

MicroController
Posts: 1725
Joined: Mon Oct 17, 2022 7:38 pm
Location: Europe, Germany

Re: MQTT Client causes Chip Hardlock

Postby MicroController » Sat Jun 01, 2024 11:39 am

EarthAndy wrote:
Fri May 31, 2024 4:04 pm
Back with a bump and another plea for direction on this issue.
Well, obvoiusly(?), you didn't provide enough of your code for us to be of much help.
E.g., this log message
"E (5056) McuCanBusChannel_Esp32: FAILED to Queue TWAI Message for Transmit!"
before the reboot, is that expected to be happening?
If not, I'd first go looking into where/why this happens, checking (error) return values to locate and identify the problem.

Who is online

Users browsing this forum: markkuk and 272 guests