Faster, optimized ESP-IDF fork + PSRAM Issues
-
- Posts: 4
- Joined: Mon Dec 10, 2018 4:32 pm
Faster, optimized ESP-IDF fork + PSRAM Issues
During the development of low.js, a Node.js port for ESP32 boards (https://www.lowjs.org/), we had a few challenges to overcome. I would like to give back to the ESP-IDF community with two things:
1. ESP-IDF modified to use dlmalloc
The default memory allocator in ESP-IDF appears to have been written in-house by Espressif. It is not very fast, and it becomes very slow when memory gets fragmented. This problem becomes evident when using SPI RAM.
esp-idf-dlmalloc is a fork of ESP-IDF which was modified to use dlmalloc, an industry-standard memory allocator. It is almost twice as fast as the default memory allocator, and does not slow down noticeably with fragmented memory.
The fork has its own GitHub repository here: https://github.com/neonious/esp-idf-dlmalloc
Hopefully Espressif is interested in switching the memory allocator in the default branch of ESP-IDF too (and hey, Espressif, while you are here, maybe you want to take a look at low.js; it might be interesting to officially back, too).
2. Still cache issues with PSRAM
This is not really a gift to the community, but might become one when our report helps fix this problem:
We stumbled upon the fact that cache issues with PSRAM still exist, even in the newest development environment. This can produce random crashes, even if the code is 100% valid.
You can find the project showcasing the problem here: https://github.com/neonious/memcrash-esp32
Thomas
-
- Posts: 47
- Joined: Thu Dec 20, 2018 9:47 am
Re: Faster, optimized ESP-IDF fork + PSRAM Issues
@neoniousTR If you think the allocator is better why not create a pull request on GitHub?
Re: Faster, optimized ESP-IDF fork + PSRAM Issues
Just want to comment that I'm following this thread with great interest. I have noticed rare, random crashes that I cannot explain, and my project makes extensive use of PSRAM. Additionally, I had to create my own simple block allocator to deal with the slowness of the default allocator. Lots of interesting things here for me.
-
- Posts: 4
- Joined: Mon Dec 10, 2018 4:32 pm
Re: Faster, optimized ESP-IDF fork + PSRAM Issues
Regarding a pull request: The fork works, but some debugging features are not reimplemented. It's more of a proof of concept. As I am not paid by Espressif, I can't do more than that.
-
- Posts: 9708
- Joined: Thu Nov 26, 2015 4:08 am
Re: Faster, optimized ESP-IDF fork + PSRAM Issues
FYI, we'll look into the weird PSRAM issue; we may not have a response for that until the new year has started, though.
Re: Faster, optimized ESP-IDF fork + PSRAM Issues
ESP_Sprite wrote: Fri Dec 28, 2018 3:32 pm
FYI, we'll look into the weird PSRAM issue; we may not have a response for that until the new year has started, though.
Hi,
I just want to know whether there are any critical issues with using PSRAM, as we are planning to use it in one of our products. It would be helpful to hear about any known critical issues, as that would reduce the time we spend debugging.
Regards,
Ritesh Prajapati
-
- Posts: 9708
- Joined: Thu Nov 26, 2015 4:08 am
Re: Faster, optimized ESP-IDF fork + PSRAM Issues
I can't say at this point; need to reproduce the issue first. As it is, I can say that this probably is not an issue that happens very often, as I know of multiple commercial products using an ESP32 with PSRAM that run without issue. When I know more about this particular issue, I'll post it here.
Re: Faster, optimized ESP-IDF fork + PSRAM Issues
ESP_Sprite wrote: Sat Dec 29, 2018 8:22 am
I can't say at this point; need to reproduce the issue first. As it is, I can say that this probably is not an issue that happens very often, as I know of multiple commercial products using an ESP32 with PSRAM that run without issue. When I know more about this particular issue, I'll post it here.
OK. No issues. Thanks for the update.
Regards,
Ritesh Prajapati
Re: Faster, optimized ESP-IDF fork + PSRAM Issues
(This reply is about esp-idf-dlmalloc, not the PSRAM report. To avoid parallel, overlapping discussions, I suggest we move further discussion of PSRAM to the GitHub issue also raised by neoniousTR.)
Thanks again for pointing out the performance issues with the IDF heap allocator and for posting the fork with dlmalloc.
I also want to note for anyone reading this thread: it's not until people say "This feature isn't fast enough for what I'm doing!" that we have any reason to prioritize performance work over the other things we could be doing. So it's worth speaking up about these kinds of issues in any case.
The multi_heap allocator currently used in ESP-IDF is very simple, and was designed to minimize size overhead. The algorithm is basically the same as that used by umm_malloc and the FreeRTOS "heap_4.c" implementation - a singly linked free list, and small heap chunk headers (4 bytes) to reduce heap accounting size overhead. (The original plan was to adapt umm_malloc to support multiple heaps, but it turned out to be easier to write a new implementation.)
This kind of simple "Microcontroller style" allocator does all operations in O(n) time where n is the number of chunks in the freelist. This means performance degrades as the heap fragments. Worst case performance is seen when a heavily fragmented heap is in PSRAM, due to cache thrashing - the allocator has to walk the freelist each time, and it has to fill any cache misses via SPI, and any cache miss means filling a 32 byte cache line to read a 4 byte header.
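To make the O(n) walk concrete, here is a minimal sketch in plain C of a first-fit search over a singly linked free list. It is not the real multi_heap code: free_chunk_t, first_fit and worst_case_bytes_fetched are hypothetical names, and the node layout is simplified for illustration (the real headers are smaller).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical free-list node; NOT the real multi_heap header layout. */
typedef struct free_chunk {
    size_t size;              /* usable bytes in this free chunk */
    struct free_chunk *next;  /* next free chunk, or NULL at the end */
} free_chunk_t;

/* First-fit search: return the first chunk with at least `want` bytes,
 * or NULL if none fits. *visited receives the number of chunks inspected,
 * which grows with the length of the free list (i.e. with fragmentation). */
static free_chunk_t *first_fit(free_chunk_t *head, size_t want, size_t *visited)
{
    assert(visited != NULL);
    *visited = 0;
    for (free_chunk_t *c = head; c != NULL; c = c->next) {
        (*visited)++;
        if (c->size >= want) {
            return c;
        }
    }
    return NULL;
}

/* Worst case for a heap in PSRAM: every inspected header misses the cache,
 * so a whole cache line is fetched over SPI for each visited chunk. */
static size_t worst_case_bytes_fetched(size_t visited, size_t cache_line_bytes)
{
    return visited * cache_line_bytes;
}
```

With the 32 byte cache line and 4 byte header mentioned above, walking 1000 free chunks could fetch up to 32000 bytes over SPI in order to read 4000 bytes of headers, assuming every header misses the cache.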
dlmalloc improves on this in at least two key ways - doubly linked freelists mean free operations are O(1) not O(n), and binning of allocations (linked lists for small allocations, trie structures for large ones) means finding the best free chunk is much faster and scales much better as the heap fragments.
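The two ideas can be sketched roughly as follows. This is a simplified illustration, not code taken from dlmalloc: dl_chunk_t and the list_* helpers are hypothetical names, and the bin constants (8 byte spacing, 32 small bins) are assumptions chosen to resemble a typical 32-bit configuration. The trie used for large chunks is left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical free chunk with both links, so it can be removed from its
 * list without walking the list to find its predecessor. */
typedef struct dl_chunk {
    size_t size;
    struct dl_chunk *prev;
    struct dl_chunk *next;
} dl_chunk_t;

/* Each bin is a circular list headed by a sentinel node. */
static void list_init(dl_chunk_t *sentinel)
{
    sentinel->prev = sentinel;
    sentinel->next = sentinel;
}

/* Insert c right after the sentinel: O(1). */
static void list_push(dl_chunk_t *sentinel, dl_chunk_t *c)
{
    c->next = sentinel->next;
    c->prev = sentinel;
    sentinel->next->prev = c;
    sentinel->next = c;
}

/* Remove c from whatever list it is on: O(1), no traversal needed. */
static void list_unlink(dl_chunk_t *c)
{
    c->prev->next = c->next;
    c->next->prev = c->prev;
    c->prev = c;
    c->next = c;
}

/* Assumed binning constants, for illustration only. */
#define SMALL_BIN_SHIFT 3u   /* bins are spaced 8 bytes apart */
#define SMALL_BIN_COUNT 32u  /* covers chunk sizes below 256 bytes */

/* Map a chunk size to its small-bin index, or -1 if the size would be
 * handled by the large-chunk structure instead. */
static int small_bin_index(size_t size)
{
    size_t idx = size >> SMALL_BIN_SHIFT;
    return idx < SMALL_BIN_COUNT ? (int)idx : -1;
}
```

Because every small size class has its own list, an allocation of a given small size can look directly in the matching bin instead of scanning unrelated free chunks.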
When we originally evaluated heap implementations, anything this complex was ruled out due to the additional size overhead of larger chunk headers and more accounting data structures.
For example, comparing multi_heap to dlmalloc, the free heap size in the ESP-IDF hello_world example is 298572 vs 296624 bytes (-2KB), in https_request (mid request) it is 187336 vs 184272 bytes (-3KB), and in gatt_server it is 203304 vs 198720 bytes (-4.5KB). Overhead should scale roughly linearly with the number of allocations. This additional usage is probably a non-issue for many users (especially those with PSRAM), but it matters to users who are already squeezing out every last kilobyte.
dlmalloc also adds about 6KB of IRAM usage (approximately triple the multi_heap IRAM usage). Again, this isn't an issue for many users, but it will be an issue for some. An option to place heap code into flash is already planned, but this may roll back some of the performance benefit of using dlmalloc in the first place (more code to cache thrash on).
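The rounded figures can be checked with a few lines of C. The sketch below assumes that the larger number in each pair is the free heap with multi_heap and the smaller one is the free heap with dlmalloc, which is what the negative deltas suggest; heap_sample_t and the function names are made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Free heap sizes in bytes, as quoted in the post. Which allocator each
 * number belongs to is an assumption (larger value = multi_heap). */
typedef struct {
    const char *example;
    long multi_heap_free;
    long dlmalloc_free;
} heap_sample_t;

static const heap_sample_t SAMPLES[] = {
    {"hello_world",   298572, 296624},
    {"https_request", 187336, 184272},
    {"gatt_server",   203304, 198720},
};

#define SAMPLE_COUNT (sizeof SAMPLES / sizeof SAMPLES[0])

/* Extra heap consumed when switching to dlmalloc, in bytes. */
static long overhead_bytes(const heap_sample_t *s)
{
    return s->multi_heap_free - s->dlmalloc_free;
}

/* Print the overhead of every sample in bytes and in KB (1 KB = 1024 B). */
static void print_overheads(void)
{
    for (size_t i = 0; i < SAMPLE_COUNT; i++) {
        long bytes = overhead_bytes(&SAMPLES[i]);
        printf("%-14s %5ld bytes (%.2f KB)\n",
               SAMPLES[i].example, bytes, (double)bytes / 1024.0);
    }
}
```

The exact differences are 1948, 3064 and 4584 bytes, which match the rounded 2KB, 3KB and 4.5KB figures.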
There are at least 3 different approaches ESP-IDF could take:
- Use dlmalloc in ESP-IDF (after adding the remaining debugging features for parity). This is difficult, at least without concurrently freeing up >5KB of DRAM and 6KB of IRAM in some other way.
- Allow configurable/pluggable heap implementations. Let users choose size vs performance based on their needs. Maybe even allow separate implementations for internal vs external RAM. Extra complexity is always a downside, but this could work.
- Optimise the existing multi_heap to try and get a "happy medium" between dlmalloc and simple allocator algorithms. This probably means adding binning of freelists.
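As a rough illustration of the second option, a pluggable heap could be modelled as a table of function pointers chosen per memory region. Everything below is hypothetical and is not an ESP-IDF API: heap_ops_t, region_t, the bump allocator and the libc-backed allocator are stand-ins that only show the shape such an interface might take.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical allocator interface; not part of ESP-IDF. */
typedef struct heap_ops {
    const char *name;
    void *(*alloc)(void *ctx, size_t size);
    void (*release)(void *ctx, void *ptr);
} heap_ops_t;

/* Stand-in "small footprint" backend: a bump allocator that never reuses
 * memory. It only keeps the example short. */
typedef struct {
    unsigned char *base;
    size_t capacity;
    size_t used;
} bump_ctx_t;

static void *bump_alloc(void *ctx, size_t size)
{
    bump_ctx_t *b = ctx;
    size_t aligned = (size + 7u) & ~(size_t)7u;  /* round up to 8 bytes */
    if (aligned < size || aligned > b->capacity - b->used) {
        return NULL;  /* overflow or out of space */
    }
    void *p = b->base + b->used;
    b->used += aligned;
    return p;
}

static void bump_release(void *ctx, void *ptr)
{
    (void)ctx;
    (void)ptr;  /* a bump allocator cannot free individual blocks */
}

/* Stand-in "fast" backend: simply forwards to the C library. */
static void *libc_alloc(void *ctx, size_t size)
{
    (void)ctx;
    return malloc(size);
}

static void libc_release(void *ctx, void *ptr)
{
    (void)ctx;
    free(ptr);
}

static const heap_ops_t SMALL_FOOTPRINT_HEAP = {"bump", bump_alloc, bump_release};
static const heap_ops_t GENERAL_HEAP = {"libc", libc_alloc, libc_release};

/* A memory region (e.g. internal or external RAM) paired with the
 * implementation selected for it. */
typedef struct {
    const heap_ops_t *ops;
    void *ctx;
} region_t;

static void *region_alloc(region_t *r, size_t size)
{
    assert(r != NULL && r->ops != NULL);
    return r->ops->alloc(r->ctx, size);
}

static void region_free(region_t *r, void *ptr)
{
    assert(r != NULL && r->ops != NULL);
    r->ops->release(r->ctx, ptr);
}
```

In this sketch each region carries its own heap_ops_t pointer, so a size-optimised implementation could serve internal RAM while a faster one serves external RAM. The cost is an indirect call on every allocation, plus the extra complexity mentioned above.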
Re: Faster, optimized ESP-IDF fork + PSRAM Issues
It has been almost 4 months since this bug was reported, but there has been no movement on the GitHub issue.
This is a fundamental random memory corruption fault that is causing all sorts of issues. Can we get a resolution, or at least an update as to what is going on?