Improving programming speed

Bryght-Richard · Postby **Bryght-Richard** » Tue Jun 04, 2024 7:26 pm

We're using the ESP32-S3-WROOM-1 (N16R8). I've increased the baud rate to 1000000 and adjusted the flash mode to QSPI/QIO, yet our ~3.5 megabyte firmware still takes roughly 40 seconds to program onto our PCB with a single esptool.py call. Example output for one partition:


Compressed 3396656 bytes to 1390603...
Wrote 3396656 bytes (1390603 compressed) at 0x00220000 in 22.9 seconds (effective 1189.0 kbit/s)...
Hash of data verified.

Increasing the baud rate further does not seem to improve programming time. Do you think our programming process is limited by the QSPI flash programming rate rather than by the UART?
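One way to reason about the UART question is to compute the lower bound the wire alone imposes. The sketch below assumes 8N1 framing (10 bits on the wire per byte) and ignores protocol overhead such as SLIP escaping and acknowledgements; both are assumptions, not something the log states. The byte count and the measured time are taken from the log above.

```python
def uart_floor_seconds(payload_bytes: int, baud: int, bits_per_byte: int = 10) -> float:
    """Minimum time to push payload_bytes over a UART at the given baud rate.

    bits_per_byte=10 assumes 8N1 framing (1 start + 8 data + 1 stop bit).
    Protocol overhead (escaping, packet headers, acks) is ignored.
    """
    return payload_bytes * bits_per_byte / baud


compressed = 1_390_603   # bytes actually sent, from the log above
measured = 22.9          # seconds reported by esptool

floor = uart_floor_seconds(compressed, 1_000_000)
print(f"UART floor at 1 Mbaud: {floor:.1f} s of {measured} s measured")
print(f"UART floor at 2 Mbaud: {uart_floor_seconds(compressed, 2_000_000):.1f} s")
```

Under these assumptions the wire accounts for about 13.9 of the 22.9 seconds at 1 Mbaud. If the UART were the only limit, doubling the baud rate should save several seconds, so seeing no change at all hints that something else (flash erase and program time, or a serial bridge that does not really run faster) sets the pace.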

Also, I tried using DFU-mode for programming, but it took much longer, several minutes. I was hoping that using USB-DFU might be faster - is there a way to improve this?

If you've got a high volume product in production, what are you using for programming? Do you see anything I've missed? Thank you.

RandomInternetGuy · Postby **RandomInternetGuy** » Wed Jun 05, 2024 3:00 pm

You can play some games like using multiple partitions and keeping the part of your code that you're not constantly changing (e.g. resources) in another partition that you don't reload on each build. For example, put your .png files in a LittleFS and just don't keep reloading them all the time.
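A minimal sketch of the "don't reload what hasn't changed" idea for a development loop: hash each partition image and only hand the changed ones to the flasher. The function name, the offsets, and the idea of keeping digests in a dict are all hypothetical; nothing here is part of esptool.

```python
import hashlib
from pathlib import Path


def changed_images(images, last_flashed):
    """Return (changed, digests).

    images: list of (flash_offset, path) pairs.
    last_flashed: dict mapping flash_offset -> SHA-256 hex digest of the
        image flashed there last time (empty on the first run).
    changed: the pairs whose file content differs from last_flashed.
    digests: the current digest for every offset, to be stored only
        after the flash actually succeeded.
    """
    changed, digests = [], {}
    for offset, path in images:
        digest = hashlib.sha256(Path(path).read_bytes()).hexdigest()
        digests[offset] = digest
        if last_flashed.get(offset) != digest:
            changed.append((offset, path))
    return changed, digests
```

The `changed` pairs map directly onto the address/file pairs that `esptool.py write_flash` accepts. This only makes sense when the same board is reflashed repeatedly; on a production line every device is blank, so everything has to be written once.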

An appreciable part of that time is the sector erasing of the flash. Remember, programming can only flip bits in one direction (on NOR flash, from 1 to 0), so every sector you rewrite really does have to be erased first. That's why tinkering with the bit rate just doesn't help much beyond 921600 or so. There are combinations that claim to do 2 Mbps, but the additional gaps "between the bits" dominate the time, so it's not like it really doubles the speed.
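To make the erase requirement concrete, here is a toy model of a NOR sector, assuming the usual convention that erase sets all bits to 1 and programming can only clear them. The 4096-byte sector size is an assumption (a common SPI NOR granularity), not a figure from this thread.

```python
SECTOR_SIZE = 4096  # assumed erase granularity


def erase(sector: bytearray) -> None:
    """Erase sets every bit in the sector to 1."""
    sector[:] = b"\xff" * len(sector)


def program(sector: bytearray, offset: int, data: bytes) -> None:
    """Programming can only clear bits: each cell ends up as old AND new."""
    for i, byte in enumerate(data):
        sector[offset + i] &= byte


sector = bytearray(SECTOR_SIZE)
erase(sector)
program(sector, 0, b"\x0f")   # an erased cell takes the value as written
first = sector[0]
program(sector, 0, b"\xf0")   # overwriting without an erase in between
second = sector[0]
print(hex(first), hex(second))  # prints: 0xf 0x0
```

The second write does not produce 0xf0; it produces 0x00, because the bits cleared by the first write stay cleared until the whole sector is erased again.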

On other SoCs, I've had good luck using JTAG to just blast the image right into RAM, but that's pretty fragile if you have code that normally stores constant (read-only) data in flash. It's probably possible to fiddle with the linker scripts and the JTAG config and create such an environment (esp. if you have something like an 8MB OSPI part where RAM is fast and "plentiful") but at some point, you're not running what you're actually shipping, so you have to be cautious about exotic build configurations. I've not tried it on ESP32, but it's definitely possible on other SoCs I've used. If you're used to other systems where everything is stored in flash and then copied to RAM to boot and run (e.g. IDT/MIPS, most embedded RISC-V parts, or, I think, STM32) you basically "just" need to create the second half of the memory configuration after the big memcpy and before the jump to the RAM image you just created/copied/uncompressed.

The Legacy ESP32s also have a landmine of goofy rules about what memory can be used for what purposes, so defining the right linker maps might not be trivial.

Depending on the type of development you're doing, it might also be possible to use the QEMU simulator and run your code on a beefy virtualized host, but if you're doing any real amount of hardware twiddling - which is kind of the point of these parts - I just can't imagine that working well. Here's your chance to earn valuable internet points and be the hero of your peers!

Regardless of whether you implement the 'copy to memory' mode, to make it rock you really have to get out of erasing and rewriting flash as much as you can. There may be an awesome tutorial on how to do this, but a cursory search - even thinking I know most of the key words - doesn't turn up a great HOWTO on this topic, so there's probably an opportunity to improve the ESP32 world if you'd sit down for a weekend and really work through it all and document it for the world's ESP32 developers.

Personally, I just try to write as much C++ as I can that I can unit test on a Real Computer so when I build it for ESP32, I then have some reasonable expectations of not being in a save-build-run loop.

Bryght-Richard · Postby **Bryght-Richard** » Wed Jun 05, 2024 4:04 pm

I should've clarified, I'm mostly interested from a production-programming focus. But your ideas for development are quite helpful too!

I'm afraid that a large part of our factory programming time consists of the NOR erasing and the NOR programming, and if so there may not be much else to optimize.

Postby **ESP_Sprite** » Fri Jun 07, 2024 1:00 am

The 'zen' answer is that programming goes fastest if you don't do it at all. More down to earth, given a certain MOQ, you can order ESP modules from Espressif that are pre-programmed with whatever code you wish; we handle the programming for you. If you're interested, I suggest you ask sales@espressif.com for more details.

Bryght-Richard · Postby **Bryght-Richard** » Tue Jun 11, 2024 4:26 pm

Thank you ESP_Sprite. We're also considering per-device flash encryption, so it seems like for our case pre-programming might not help too much if we have to re-encrypt the flash, which might take a similar amount of time? Each device passes through a programming and test station, so for now, the station does both jobs. But we'll consider this.

Another thing I have noticed is that programming long sequences of 0xFF bytes is faster than programming random data, but not by much, and only because the 0xFF bytes compress well. If programming 0xFF were fast, I could combine non-full partitions and transfer them together. Here's what I mean, with a 4MB flash area and esptool.py v4.7.0:


>esptool -b 115200 --before default_reset --after hard_reset --chip esp32s3 erase_region 0x220000 0x400000
...
Erase completed successfully in 9.6 seconds.


>esptool -b 115200 --before default_reset --after hard_reset --chip esp32s3 write_flash --flash_mode keep --flash_size 16MB --flash_freq 80m 0x220000 ff.bin
...
Flash will be erased from 0x00220000 to 0x0061ffff...
Compressed 4194304 bytes to 4086...
Wrote 4194304 bytes (4086 compressed) at 0x00220000 in 19.1 seconds (effective 1752.7 kbit/s)...
...


>esptool -b 115200 --before default_reset --after hard_reset --chip esp32s3 write_flash --flash_mode keep --flash_size 16MB --flash_freq 80m 0x220000 rand.bin
esptool.py v4.7.0
...
Flash will be erased from 0x00220000 to 0x0061ffff...
Compressed 4194304 bytes to 4195590...
Wrote 4194304 bytes (4195590 compressed) at 0x00220000 in 32.8 seconds (effective 1021.7 kbit/s)...
...
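For anyone who wants to repeat the comparison, the two payloads are easy to reproduce on a host. This sketch builds in-memory stand-ins for ff.bin and rand.bin and reports how far each shrinks. It assumes zlib at level 9; whether that matches esptool's exact settings is an assumption, so the byte counts may differ slightly from the logs.

```python
import os
import zlib


def compressed_size(data: bytes, level: int = 9) -> int:
    """Size of data after zlib compression at the given level."""
    return len(zlib.compress(data, level))


SIZE = 4 * 1024 * 1024          # 4194304 bytes, as in the logs above
ff_image = b"\xff" * SIZE       # stands in for ff.bin
rand_image = os.urandom(SIZE)   # stands in for rand.bin

ff_size = compressed_size(ff_image)
rand_size = compressed_size(rand_image)
print("ff  :", ff_size, "bytes on the wire")
print("rand:", rand_size, "bytes on the wire")
```

In the logs the all-0xFF image needs about a thousand times fewer bytes on the wire (4086 versus 4195590), yet the write only drops from 32.8 to 19.1 seconds. That gap is consistent with erase and program time, rather than transfer time, making up most of the 0xFF case.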

On another system, I was able to modify the firmware-flashing executable's decompressor so that it could check whether the next page was entirely 0xFF without decompressing it, then skip programming that page and drop it from the decompressor. However, it looks like esptool is communicating with the ROM only? If so, perhaps Espressif might consider this for a future chip. Or is it possible for me to transfer my own flash-loader program into SRAM, in the same way the RF modulator test tool does?
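The page-skipping idea can be illustrated on the host side. This is only a sketch of the decision logic, not the stub modification itself: it works on already-decompressed data, whereas the approach described above avoided decompressing blank pages at all. The 256-byte page size and the function name are assumptions.

```python
PAGE_SIZE = 256  # assumed program-page size; check the flash datasheet


def pages_to_program(image: bytes, base: int = 0, page_size: int = PAGE_SIZE):
    """Yield (flash_address, page_bytes) for every page that is not all 0xFF.

    A freshly erased NOR page already reads back as 0xFF, so a page that
    consists only of 0xFF needs no program operation after the erase.
    """
    blank = b"\xff" * page_size
    for off in range(0, len(image), page_size):
        page = image[off:off + page_size]
        if page != blank[:len(page)]:   # also handles a short final page
            yield base + off, page
```

For example, an image made of one blank page, one page starting with 0x12, and a blank tail yields a single entry at `base + 256`. This only holds if the target region has really been erased first.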

Postby **ESP_Sprite** » Wed Jun 12, 2024 12:58 am

You could; esptool actually uses a 'stub' for most chips, a bit of programming that is uploaded to SRAM to speed up the protocol by a bit. You can probably modify that and/or the python code to implement what you want.

Bryght-Richard · Postby **Bryght-Richard** » Thu Jul 11, 2024 8:40 pm

Thanks ESP_Sprite!

I hooked up three GPIOs so I could monitor the flasher stub's progress through decompression, flash erase, and flash program with a logic analyzer. Then I compared with and without trimming leading and trailing 0xFF bytes from each decompressed block. These numbers are from programming a 4MB file of 0xFF to a new ESP32-S3-WROOM-1:

No 0xFF Trimming: 23.2 seconds
Trimming entire file: 14.1 seconds
Erasing same region: 13.3 seconds
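The trimming step itself is small. Here is a host-side sketch in Python of what "trim leading and trailing 0xFF from a block" could look like; the real change lives in the flasher stub, and a real implementation would probably also have to respect the flash's write alignment, which this sketch ignores. The function name is hypothetical.

```python
def trim_ff(block: bytes):
    """Strip leading and trailing 0xFF bytes from a block.

    Returns (skip, payload): skip is the number of leading bytes dropped,
    so payload belongs at block_address + skip. A block that is entirely
    0xFF gives (len(block), b""), meaning nothing needs programming.
    """
    without_lead = block.lstrip(b"\xff")
    skip = len(block) - len(without_lead)
    return skip, without_lead.rstrip(b"\xff")
```

Going by the figures above, trimming removed roughly 9.1 of the 23.2 seconds for an all-0xFF file, which leaves the run within a second of the bare erase time.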

So skipping the programming of large blocks of 0xFF can help, and it gets us close to the erase time. But does it help factory programming? For a real-world 10MB contiguous disk image, it only helps by about 10%, and no more than programming the partitions separately does.

10MB image no trimming: Wrote 10485760 bytes (3024354 compressed) in 66.8 seconds (effective 1255.3 kbit/s)
10MB image FF trimming: Wrote 10485760 bytes (3024354 compressed) in 59.9 seconds (effective 1401.4 kbit/s)
7 separate partitions sum/avg: Wrote 3926192 bytes (1931980 compressed) in 30.6 seconds (effective 1002 kbit/s)

The trade-off is that if the gaps between the end of one partition's data and the start of the next partition are large enough, it's better to program each partition as a separate segment, so the unused space is never erased. Of course, it will vary with the flash IC, but for us, as you said, the 'zen' answer is that erasing goes fastest if you don't do it at all.
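That trade-off can be turned into a small preprocessing step: cut a contiguous image into separately flashed segments wherever a blank gap is long enough. The sketch below assumes 4096-byte erase sectors and an arbitrary threshold of 16 blank sectors; both values and the function name are assumptions, and the threshold would need tuning against measurements on the actual flash part.

```python
SECTOR = 4096  # assumed erase granularity


def split_segments(image: bytes, base: int, min_gap_sectors: int = 16,
                   sector: int = SECTOR):
    """Split image into (address, data) segments, dropping long blank runs.

    A run of at least min_gap_sectors all-0xFF sectors ends the current
    segment; shorter blank runs stay inside it. Leading and trailing
    blank sectors are dropped entirely.
    """
    n = -(-len(image) // sector)  # number of sectors, rounded up

    def is_blank(i: int) -> bool:
        chunk = image[i * sector:(i + 1) * sector]
        return chunk.count(0xFF) == len(chunk)

    spans = []
    start = last_used = None
    for i in range(n):
        if is_blank(i):
            continue
        if start is not None and i - last_used - 1 >= min_gap_sectors:
            spans.append((start, last_used))   # gap is long enough: cut here
            start = None
        if start is None:
            start = i
        last_used = i
    if start is not None:
        spans.append((start, last_used))
    return [(base + s * sector, image[s * sector:(e + 1) * sector])
            for s, e in spans]
```

Each returned segment can be written to its own file and passed to `esptool.py write_flash` as a separate address/file pair, so the sectors in the dropped gaps are neither transferred nor erased.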
