device bricked in the field

phatpaul
Posts: 110
Joined: Fri Aug 24, 2018 1:14 pm

device bricked in the field

Postby phatpaul » Mon Jan 06, 2020 9:43 pm

## Environment

- Development Kit: [none, custom product using wrover module.]
- Module or chip used: [ESP32-WROVER 16M]
- IDF version: v3.1.3
- Build System: [Make]
- Compiler version: xtensa-esp32-elf-gcc.exe (crosstool-NG crosstool-ng-1.22.0-80-g6c4433a5) 5.2.0
- Operating System: [Windows]
- (Windows only) environment type: [MSYS2 mingw32].
- Power Supply: [on-board 3.3V]

## Problem Description
A unit was bricked in the field. The user may have been performing an OTA update when it failed.

My symptoms seem similar to another post, but I'm not using flash encryption: viewtopic.php?f=12&t=10346&hilit=bootloader+corrupt

### Expected Behavior
Unit can't be bricked. Factory restore procedure should always fix an otherwise bricked unit.

### Actual Behavior
The reset button does nothing. The factory restore sequence (hold reset button while power-on) does nothing.

### Steps to reproduce
I'm unable to reproduce the error in my lab. I might consider implementing some stress test in the future.

The product is using the http-based OTA method available from the libesphttpd project. See here: https://github.com/chmorgan/esphttpd-freertos

## Debug Logs

I cut open the device enclosure to access the serial programming pins. The serial output is:

Code:

ets Jun 8 2016 00:22:57

rst:0x1 (POWERON_RESET),boot:0x33 (SPI_FAST_FLASH_BOOT)
configsip: 0, SPIWP:0xea
clk_drv:0x00,q_drv:0x00,d_drv:0x00,cs0_drv:0x00,hd_drv:0x00,wp_drv:0x00
mode:DIO, clock div:1
load:0x3fff0018,len:4
load:0x3fff001c,len:5068
load:0x40078000,len:9244
ho 0 tail 12 room 4
load:0x40080000,len:5772
0x40080000: _WindowOverflow4 at C:/msys32/home/labview/lumitec-dev-fw-plibox-esp32/Tools/esp-idf/components/freertos/xtensa_vectors.S:1779

entry 0x40080068
0x40080068: _xt_alloca_exc at C:/msys32/home/labview/lumitec-dev-fw-plibox-esp32/Tools/esp-idf/components/freertos/xtensa_vectors.S:1835

Fatal exception (0): IllegalInstruction
epc1=0x40080068, epc2=0x00000000, epc3=0x00000000, excvaddr=0x00000000, depc=0x00000000
0x40080068: _xt_alloca_exc at C:/msys32/home/labview/lumitec-dev-fw-plibox-esp32/Tools/esp-idf/components/freertos/xtensa_vectors.S:1835

��@�ets Jun 8 2016 00:22:57

rst:0x10 (RTCWDT_RTC_RESET),boot:0x33 (SPI_FAST_FLASH_BOOT)
configsip: 0, SPIWP:0xea
clk_drv:0x00,q_drv:0x00,d_drv:0x00,cs0_drv:0x00,hd_drv:0x00,wp_drv:0x00
mode:DIO, clock div:1
load:0x3fff0018,len:4
load:0x3fff001c,len:5068
load:0x40078000,len:9244
ho 0 tail 12 room 4
load:0x40080000,len:5772
0x40080000: _WindowOverflow4 at C:/msys32/home/labview/lumitec-dev-fw-plibox-esp32/Tools/esp-idf/components/freertos/xtensa_vectors.S:1779

entry 0x40080068
0x40080068: _xt_alloca_exc at C:/msys32/home/labview/lumitec-dev-fw-plibox-esp32/Tools/esp-idf/components/freertos/xtensa_vectors.S:1835

Fatal exception (0): IllegalInstruction
epc1=0x40080068, epc2=0x00000000, epc3=0x00000000, excvaddr=0x00000000, depc=0x00000000
0x40080068: _xt_alloca_exc at C:/msys32/home/labview/lumitec-dev-fw-plibox-esp32/Tools/esp-idf/components/freertos/xtensa_vectors.S:1835

ets Jun 8 2016 00:22:57

rst:0x10 (RTCWDT_RTC_RESET),boot:0x23 (DOWNLOAD_BOOT(UART0/UART1/SDIO_REI_REO_V2))
waiting for download
Here's another trace showing a different error, which appeared only on the first boot attempt: a checksum error.

Code:

ets Jun 8 2016 00:22:57

rst:0x1 (POWERON_RESET),boot:0x33 (SPI_FAST_FLASH_BOOT)
configsip: 0, SPIWP:0xea
clk_drv:0x00,q_drv:0x00,d_drv:0x00,cs0_drv:0x00,hd_drv:0x00,wp_drv:0x00
mode:DIO, clock div:1
load:0x3fff0018,len:4
load:0x3fff001c,len:5068
load:0x40078000,len:9244
ho 0 tail 12 room 4
load:0x40080000,len:5772
csum err:0x8e!=0x8a
ets_main.c 371
ets Jun 8 2016 00:22:57

rst:0x10 (RTCWDT_RTC_RESET),boot:0x33 (SPI_FAST_FLASH_BOOT)
configsip: 0, SPIWP:0xea
clk_drv:0x00,q_drv:0x00,d_drv:0x00,cs0_drv:0x00,hd_drv:0x00,wp_drv:0x00
mode:DIO, clock div:1
load:0x3fff0018,len:4
load:0x3fff001c,len:5068
load:0x40078000,len:9244
ho 0 tail 12 room 4
load:0x40080000,len:5772
entry 0x40080268
Fatal exception (0): IllegalInstruction
epc1=0x4008026b, epc2=0x00000000, epc3=0x00000000, excvaddr=0x00000000, depc=0x00000000
ets Jun 8 2016 00:22:57

rst:0x10 (RTCWDT_RTC_RESET),boot:0x23 (DOWNLOAD_BOOT(UART0/UART1/SDIO_REI_REO_V2))
waiting for download
## My troubleshooting

I dumped the flash and inspected with a hex editor (FlexHEX).

I inspected the partition table and each partition, comparing them to a working unit. I didn't see any issues with the partitions themselves, but the bootloader is corrupt; see below.
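For comparing whole partitions against a working unit, hashing the dumps is a quick first pass. A minimal sketch; the filenames are hypothetical stand-ins, and the demo writes two tiny files itself rather than using real esptool dumps:

```python
import hashlib

def sha256_of(path, chunk_size=64 * 1024):
    """Stream a flash dump through SHA-256 so large files don't load into RAM."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Stand-in "partition dumps" differing in the last byte (real use: esptool read_flash output)
with open("part_good.bin", "wb") as f:
    f.write(b"\xe9" + b"\x00" * 15)
with open("part_bad.bin", "wb") as f:
    f.write(b"\xe9" + b"\x00" * 14 + b"\x01")

print(sha256_of("part_good.bin") == sha256_of("part_bad.bin"))  # differing dumps compare unequal
```

A mismatching hash only tells you *that* a region differs; a byte-level diff (as below with fc) is still needed to find *where*.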

Focusing on the bootloader: dump just the bootloader region to a file and compare it with the original "bootloader.bin":

Code:

esptool.py -p COM21 -b 921600 read_flash 0x1000 0x4EE0 dump-bootloader.bin

Diffing the bootloader: There are only 2 bytes different.

Code:

>fc dump-bootloader.bin bootloader.bin
Comparing files dump-bootloader.bin and BOOTLOADER.BIN
00000004: 68 E8
00000008: EA EE
Interestingly, the EA vs. EE difference matches what appears in the serial output.
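The fc comparison can be reproduced in a few lines of Python. The bytes below are stand-ins that recreate just the two observed differences at offsets 4 and 8, not the real dumps:

```python
def diff_bytes(a_path, b_path):
    """Return (offset, byte_a, byte_b) for every position where two dumps differ."""
    with open(a_path, "rb") as fa, open(b_path, "rb") as fb:
        a, b = fa.read(), fb.read()
    return [(i, x, y) for i, (x, y) in enumerate(zip(a, b)) if x != y]

# Stand-in data reproducing the two observed differences (offsets 4 and 8):
with open("dump-bl.bin", "wb") as f:
    f.write(bytes([0xE9, 0x04, 0x02, 0x3F, 0x68, 0x00, 0x08, 0x40, 0xEA]))
with open("orig-bl.bin", "wb") as f:
    f.write(bytes([0xE9, 0x04, 0x02, 0x3F, 0xE8, 0x00, 0x08, 0x40, 0xEE]))

for off, x, y in diff_bytes("dump-bl.bin", "orig-bl.bin"):
    print(f"{off:08X}: {x:02X} {y:02X}")
# prints:
# 00000004: 68 E8
# 00000008: EA EE
```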

The failed device outputs: (notice the 0xea)

Code:

ets Jun 8 2016 00:22:57

rst:0x1 (POWERON_RESET),boot:0x33 (SPI_FAST_FLASH_BOOT)
configsip: 0, SPIWP:0xea
A working device outputs: (notice 0xee)

Code:

ets Jun 8 2016 00:22:57

rst:0x1 (POWERON_RESET),boot:0x33 (SPI_FAST_FLASH_BOOT)
configsip: 0, SPIWP:0xee
So my next steps could be to:

- reflash the bootloader to correct the 2 bytes that are different.
- or clear the ota_data partition to force boot from factory app.

I'll try to reflash the bootloader. (BTW, I saved the corrupted one here: dump-bootloader.bin)

Code:

esptool.py -p COM21 -b 921600 --chip esp32 --before default_reset --after no_reset write_flash -z --flash_mode dio --flash_freq 80m --flash_size 16MB 0x1000 bootloader.bin
After reflashing the bootloader, the device boots properly (booting from OTA0).

Next, to be sure, flash the corrupted bootloader to a known-working device and test:

Code:

esptool.py -p COM21 -b 921600 --chip esp32 --before default_reset --after no_reset write_flash -z --flash_mode dio --flash_freq 80m --flash_size 16MB 0x1000 dump-bootloader.bin

After flashing the corrupt bootloader to a previously working device, it no longer boots with similar error.

So I'm confident that this corruption is the issue. Now to figure out how/why it got corrupted.

ESP_Angus
Posts: 2344
Joined: Sun May 08, 2016 4:11 am

Re: device bricked in the field

Postby ESP_Angus » Tue Jan 07, 2020 12:46 am

Hi Paul,

Sorry to hear about the device failure.

I think you're right that the bootloader was corrupted. Most likely the byte at offset 4 is the reason it crashes immediately. This byte is part of the address of the "entry point" where the ROM bootloader will jump to on load:

Code:

00000004: 68 E8
...
entry 0x40080068
              ^^
I am guessing if you check the equivalent log line from a working device it will log entry 0x400800E8, which is the "real" entry point address. On the bad device it's jumping to 0x40080068 and immediately crashing, as this isn't a valid instruction.

(The source backtraces are unfortunately wrong as they look up the addresses in the app not the bootloader, can ignore these.)

So in both cases a bit has been cleared, 0xE8 & ~0x80 == 0x68 and 0xEE & ~0x04 == 0xEA.
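To illustrate why the byte at offset 4 is so damaging: in the ESP32 image format, bytes 4-7 of the header hold the 32-bit little-endian entry address, so a corrupted offset-4 byte is exactly the low byte of the entry point. A sketch with made-up values for the other header fields (segment count, flash mode, size/freq):

```python
import struct

def entry_point(header8):
    """Parse the first 8 bytes of an ESP32 image header: magic (0xE9), segment
    count, flash mode, flash size/freq, then the little-endian entry address."""
    magic, _segs, _mode, _sizefreq, entry = struct.unpack("<BBBBI", header8)
    assert magic == 0xE9, "not an ESP32 image"
    return entry

# Header fields other than the entry address are illustrative placeholders.
good = bytes([0xE9, 0x04, 0x02, 0x3F]) + (0x400800E8).to_bytes(4, "little")
bad  = bytes([0xE9, 0x04, 0x02, 0x3F]) + (0x40080068).to_bytes(4, "little")

print(hex(entry_point(good)))  # 0x400800e8
print(hex(entry_point(bad)))   # 0x40080068

# Both corrupted bytes are single cleared bits:
print(0xE8 & ~0x80 == 0x68, 0xEE & ~0x04 == 0xEA)  # True True
```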

There are three possible ways this could have happened:
  • The software OTA routines have a bug which wrote to a bootloader address. I don't think this is likely, looks like the esphttpd component uses the OTA APIs and these are partition focused. The default ESP-IDF config has a flag set where the flash APIs will not accept writing to the bootloader range at all. Does anything else write to flash from software?
  • The SPI signals were corrupted electrically when sending the address bits to the flash chip, so it latched the wrong address and then wrote to it. This seems unlikely to me because only two bytes were written incorrectly, and the write size is much bigger than this. If the app itself is valid in the flash (ie flash readback of the app matches the .bin that was sent) then you can probably rule this out, as a "wrong address" would mean some part of the app was not written as expected so you'd expect to see an unwritten part of the app somewhere.
  • The flash chip is faulty, or had a bad erase/write cycle. This has caused two bits to fail. This seems like the most likely explanation at this stage.
    • Has the WROVER module spent any time in conditions that might have been out of spec? (ie unusual voltage ranges, unstable supply voltage, or unusual temperature ranges?)
    • At the time initial factory flash happened to write the bootloader, is it possible the 3.3V power to the module was unstable? (If the initial "erase" which sets all bytes to 0xFF happens out of spec then it's possible some bits were not reliably erased to 1 at the physical level, and have floated back to 0 over time).
One more debugging question: How long before the OTA update did the in-field device last successfully boot? Asking to rule out the possibility that the corruption happened before the OTA event, but was only exposed after the post-OTA reset.
After flashing the corrupt bootloader to a previously working device, it no longer boots with similar error.
Just checking: This is a typo, yes? Flashing the corrupt bootloader to a working device causes it to fail with the same error you saw on the original failed device?

phatpaul
Posts: 110
Joined: Fri Aug 24, 2018 1:14 pm

Re: device bricked in the field

Postby phatpaul » Tue Jan 07, 2020 3:29 pm

Thanks for your detailed and prompt reply.
Does anything else write to flash from software?
Yes, I am using NVS storage and a FAT filesystem on a partition "internalfs". I guess there could be a bug there too...
Has the WROVER module spent any time in conditions that might have been out of spec? (ie unusual voltage ranges, unstable supply voltage, or unusual temperature ranges?)
I can't say for sure, but my current electrical design lacks a voltage supervisor. (added it to the next hardware rev)
At the time initial factory flash happened to write the bootloader, is it possible the 3.3V power to the module was unstable? (If the initial "erase" which sets all bytes to 0xFF happens out of spec then it's possible some bits were not reliably erased to 1 at the physical level, and have floated back to 0 over time).
I'll scope the 3.3V during programming to confirm.

BTW, my factory programming script does not do a flash_erase at the beginning. It only does flash_write.
Stupid question: doesn't the flash_write operation effectively erase the flash before writing it?
If not, is it possible that the '1' flash bits were not cleared during my factory programming, but remained (weakly) '1' from the flash chip manufacturer?
Just checking: This is a typo, yes? Flashing the corrupt bootloader to a working device causes it to fail with the same error you saw on the original failed device?
What you said is correct. What I wrote seemed correct to me, but now I see it's confusing.
One more debugging question: How long before the OTA update did the in-field device last successfully boot? Asking to rule out the possibility that the corruption happened before the OTA event, but was only exposed after the post-OTA reset.
I checked with some people and now I'm not sure that the customer ever did an OTA. By examining the flash dump, the newer firmware is present in the flash on OTA0 partition and verified OK. But it is likely that we performed an OTA update here before shipping to the customer. (because we can't re-flash the factory app after we seal the device enclosure - that's another issue...)

This device was given to one of our important customers. Our customer claims they used it frequently to test and demonstrate functionality. It was working for a while and was then passed to one of their employees for testing, but he claimed it never worked when he received it.
So it seems that it could have failed during normal operation, not an OTA.

We have shipped approx. 150 units since the first prototypes in 2018. This is the first failure of this kind. We are in the process of ramping up production, so I hope we don't see more failures.

The similarities with this other person's issue are suspicious: viewtopic.php?f=12&t=10346&hilit=bootloader+corrupt
He also saw corruption in the bootloader area, but the root cause was never found.
I would love to conclude that he and I both received a bad batch of flash chips. But he is supplying his own flash chip on his board, whereas I am using the integrated WROVER module.
Also, when reading through his thread I learned that there are methods for the bootloader to re-write itself (I guess used for encryption?).

Do you have some suggestions for stress-testing the device?

ESP_Angus
Posts: 2344
Joined: Sun May 08, 2016 4:11 am

Re: device bricked in the field

Postby ESP_Angus » Tue Jan 07, 2020 10:53 pm

Hi paul,

Your next steps for debugging sound like good ones. Have a couple of comments below:
phatpaul wrote:
Tue Jan 07, 2020 3:29 pm
Does anything else write to flash from software?
Yes, I am using NVS storage and a FAT filesystem on a partition "internalfs". I guess there could be a bug there too...
Could be. If you have the config option noted above set, then I would expect any similar bug to be of the "electrical corruption causes a bad address" kind, not the "software bug writes to the wrong address" kind.

But the more times a write command is being sent to the flash chip, the more opportunity for this kind of noise/glitch to be introduced.
phatpaul wrote:
Tue Jan 07, 2020 3:29 pm
BTW, my factory programming script does not do a flash_erase at the beginning. It only does flash_write.
Stupid question: doesn't the flash_write operation effectively erase the flash before writing it?
If not, is it possible that the '1' flash bits were not cleared during my factory programming, but remained (weakly) '1' from the flash chip manufacturer?
Not a stupid question. NOR flash can only do two things - you can erase any 4KB sector of the flash to all 0xFF (all 1 bits), and you can write any bit from 1 to 0.

The esptool write_flash command will erase any individual sectors before writing to them, but it doesn't erase the whole chip (the flash_erase command would do that). NVS and FAT (plus the IDF wear levelling layer) will erase sectors & write bytes as necessary to store the data which is needed. They both use techniques to minimize the number of erase cycles the flash experiences, as these are finite.

So during the factory "write_flash" there would be a step where the 4KB sectors of the bootloader are erased to all 0xFF, and then a step where the freshly erased sectors get written with the bootloader binary data. If the first step happened under less-than-ideal conditions, it's possible there were some "weak 1 bits" in that region.
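A toy model of the erase/write semantics described above (not the real flash driver, just an illustration that NOR writes can only clear bits, never set them):

```python
class NorSector:
    """Toy model of one 4 KB NOR flash sector: erase sets all bits to 1,
    and a write can only clear bits (new = old & data)."""
    SIZE = 4096

    def __init__(self):
        self.data = bytearray([0xFF] * self.SIZE)  # erased state

    def erase(self):
        self.data = bytearray([0xFF] * self.SIZE)

    def write(self, offset, payload):
        for i, b in enumerate(payload):
            # NOR cells can't be driven back to 1 without an erase:
            self.data[offset + i] &= b

s = NorSector()
s.write(0, b"\xE8")    # freshly erased: 0xFF & 0xE8 == 0xE8, as expected
print(hex(s.data[0]))  # 0xe8
s.write(0, b"\x0F")    # overwriting without an erase only clears more bits
print(hex(s.data[0]))  # 0x8  (0xE8 & 0x0F)
```

This is also why a "weak 1" from a marginal erase can only ever decay toward 0, matching the two cleared bits seen in the dump.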

(Also possible the flash chip is simply faulty, rather than this being anything environmental.)
phatpaul wrote:
Tue Jan 07, 2020 3:29 pm
The similarities with this other guy's issue are suspicious. viewtopic.php?f=12&t=10346&hilit=bootloader+corrupt
He also saw corruption in the bootloader area, but the root cause was never found.
I would be love to conclude that he and I both received a bad batch of flash chips. But he is supplying his own flash chip on his board, whereas I am using the integrated WROVER module.
Agree on both counts. It's possible these are both faulty flash chips from different sources; it's also possible it's something else. In that case it looked very likely that noisy power played a role.
phatpaul wrote:
Tue Jan 07, 2020 3:29 pm
Also when reading through his thread I learned that there are methods for the bootloader to re-write itself (I guess used for encryption?).
If flash encryption is enabled using the default workflow then the bootloader encrypts itself on first boot, in the factory. This doesn't happen under any other circumstances.

If the "safety" config option mentioned above is disabled then an app can call the SPI flash APIs to erase any part of the flash and rewrite it, including rewriting the bootloader. This is not safe because if there's a reset or a crash during these steps, there's no recovery option - so we don't support doing it. But it's technically possible.
phatpaul wrote:
Tue Jan 07, 2020 3:29 pm
Do you have some suggestions for stress-testing the device?
Without knowing more about your product it's hard to make a suggestion that will approximate real-world conditions, but you could try setting up an automated OTA loop that continually updates the app with a new binary and then resets? If your product spends time in extreme temperatures or with noisier-than-average power sources then try to introduce those.

You could also try generating some test patterns of random bytes, size of the full flash chip, and use "esptool write_flash --verify ..." to write and then verify the pattern, under something close to the factory conditions. This doesn't account for the possibility of "weakly erased" bits that flip later on, but if the flash chip is generally flaky (or subject to noise) then you might pick something else up.
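As a sketch of the test-pattern idea (sizes and filenames are placeholders; `random.randbytes` needs Python 3.9+), a fixed seed makes the pattern reproducible, so the identical file can be regenerated later to compare against a readback:

```python
import random

FLASH_SIZE = 16 * 1024 * 1024  # 16 MB WROVER flash; adjust to your part

def make_pattern(path, size=FLASH_SIZE, seed=0):
    """Write a reproducible pseudo-random test pattern in 64 KB chunks.
    Re-running with the same seed regenerates a byte-identical file."""
    rng = random.Random(seed)
    with open(path, "wb") as f:
        remaining = size
        while remaining:
            chunk = min(remaining, 64 * 1024)
            f.write(rng.randbytes(chunk))
            remaining -= chunk

make_pattern("pattern.bin", size=8192)  # small demo size; use FLASH_SIZE for real tests
```

The resulting file could then be flashed and verified with esptool's write_flash/read_flash under factory-like conditions.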

scimon
Posts: 2
Joined: Fri Jun 03, 2022 1:43 am

Re: device bricked in the field

Postby scimon » Sun Jun 05, 2022 3:34 pm

I'm currently analyzing an issue which has some suspicious similarities with this topic:
  • two devices got bricked in the field after (not necessarily due to) performing an OTA update
  • a third one failed in the lab with similar symptoms (that was not expected, and no OTA was involved)
  • failed devices experienced an ESP-IDF update (old bootloader with new app?)
  • writing a "corrupted"* [working] bootloader image to 0x1000 bricks [fixes] a device with symptoms
  • * I'm not convinced the bootloader is actually corrupted, as the binary data looks reasonable and I have no explanation of how corruption could have produced this particular binary. Also, the "corrupted" bootloader I got from the bricked device still works with app images created with older versions.
For my analysis, I gathered several bootloader images (of different versions) and compiled a simple "hello world" firmware (also with different versions). Through systematically testing those against each other, I found that there are incompatibilities between certain versions. The result in all positive cases was either a first-stage bootloop (sometimes rapid, sometimes the RTC WDT kicking in every 9 seconds) or the second-stage bootloader not starting.

Besides that, I'm going to have a customer return 2 of our devices, where one has already had the issue while the second is still intact. But as those devices should be practically identical (sequential serial numbers, same configuration by the customer), I expect to have a device where I'll be able to monitor the incident live in the lab.

Based on my observations up to this point, a device being bricked after an OTA (where only the app image, but not the bootloader, is updated) seems generally to be expected.

If anyone is interested, I'd be pleased to share my data and analysis about the issue so far.
