I think it's broken, but maybe not?
Posted: Mon Aug 07, 2023 4:14 am
I have been having fun with what I now think is a broken M5 STAMP-Pico, which uses an ESP32 Pico chip. I spent a lot of time looking for a software error before checking for bad hardware, because I'm actively developing and the module is brand new. I'm pretty sure it's a bad module, since other modules work fine (or as designed, anyway), but the symptoms did not look like a hardware fault. I'd be curious to know whether some configuration unknown to me is causing this, or what else it could be.
The application I am developing (call it the slave) connects its STA interface to another ESP32's softAP (call it the master), obtains its address via DHCP, then connects to an MQTT server/broker in the master. The master also provides a NAPT routing function to an external NTP server, MQTT server and DNS server. The slave can be configured with the addresses of these servers, either as an mDNS hostname (for the MQTT server in the master), a DNS domain name, or an IP address. This all works pretty well, for the state of development. The project is an Arduino project developed using Visual Micro, but with some IDF functions also used.
The problem with the suspect module first appeared as very slow responses from the HTTP-based admin function. The console output (which seems to respond normally) showed that the module was connecting to the master softAP as expected, but then had a problem connecting to the servers. Changing the config to use hostnames or IP addresses made no difference. The module would occasionally connect to the MQTT broker in the master, send a couple of messages, then report that the connection was down and retry (as designed). As far as I can tell, the WiFi was still connected but the TCP connection was dropping. The rest of the slave app, not related to WiFi or TCP, seems to be functioning perfectly. The same effects occur whether the module is installed in the target PCB or is standalone. Other modules seem to connect OK, and stay connected.
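One way to narrow down symptoms like these would be to log the WiFi state, the signal strength and the TCP state together, then see which layer is actually dropping. The following is only a rough sketch of that idea, written as plain C++ so the logic can be checked off-target. The names `LinkSample`, `Suspect`, `classify` and the -80 dBm threshold are all made up for illustration. On the module itself I would assume the fields get filled from `WiFi.status() == WL_CONNECTED`, `WiFi.RSSI()` and the TCP/MQTT client's `connected()` call.

```cpp
#include <cstddef>
#include <vector>

// One periodic observation from the slave (hypothetical structure).
// Assumed source of the values on the ESP32: WiFi.status(), WiFi.RSSI()
// and the client's connected() call.
struct LinkSample {
    bool wifiUp;   // STA still associated with the master softAP
    int rssiDbm;   // signal strength reported at sample time
    bool tcpUp;    // TCP/MQTT connection reported as up
};

enum class Suspect { None, WifiLink, WeakSignal, TcpOnly };

// Illustrative threshold only; not a value taken from any datasheet.
constexpr int kWeakRssiDbm = -80;

// Decide which layer looks most suspicious across a set of samples.
Suspect classify(const std::vector<LinkSample>& samples) {
    std::size_t wifiDrops = 0;     // association itself was lost
    std::size_t weakTcpDrops = 0;  // TCP down, WiFi up, signal weak
    std::size_t tcpOnlyDrops = 0;  // TCP down, WiFi up, signal fine

    for (const LinkSample& s : samples) {
        if (!s.wifiUp) {
            ++wifiDrops;
        } else if (!s.tcpUp) {
            if (s.rssiDbm < kWeakRssiDbm) {
                ++weakTcpDrops;
            } else {
                ++tcpOnlyDrops;
            }
        }
    }

    if (wifiDrops == 0 && weakTcpDrops == 0 && tcpOnlyDrops == 0) {
        return Suspect::None;
    }
    if (wifiDrops >= weakTcpDrops + tcpOnlyDrops) {
        return Suspect::WifiLink;
    }
    if (weakTcpDrops >= tcpOnlyDrops) {
        return Suspect::WeakSignal;
    }
    return Suspect::TcpOnly;
}
```

If the suspect module mostly came out as `WeakSignal` while a good module in the same spot reported a healthy RSSI, that might point at the RF side of the hardware rather than at the application, since heavy packet loss could plausibly break TCP while the association stays up. A result of `TcpOnly` would push the suspicion back towards software or configuration.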
The module is brand new, but I tried erasing the flash, with no change.
So it still looks to me like a software problem related to making server connections, but my substitution tests seem to have eliminated this. I'm at a loss to explain how a hardware fault could have such a specific effect and be so consistent. At this point, I'm really just curious to know what might be going on here.