PSRAM read-only performance
Hello,
Finding this forum very helpful as a silent reader, I now hope to get my own question answered as well.
I'm working on an E-Ink driver board based on the ESP32 (https://hackaday.io/project/168193-epdi ... controller). Those displays are driven by scanning an active matrix and applying a voltage multiple times to reach the desired grey value. Thus, performance is critical for reasonable update speeds.
As a 1200x825 4-bit framebuffer gets quite large (0.5MB), I use a WROVER module with external SPI RAM, where the framebuffers are stored.
This framebuffer has to be read for every matrix scan, making this data transfer the bottleneck of my application. Using one core solely for copying data, the throughput is ~20MB/s, which seems to be what I can expect when going through the cache (viewtopic.php?t=8492).
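For scale, the arithmetic behind those numbers can be written as a small C sketch. It assumes the framebuffer is packed with no row padding and takes 1 MB as 10^6 bytes; framebuffer_bytes and read_time_ms are illustrative names, not from any library.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes for a packed width x height framebuffer at bits_per_pixel bits each.
 * Assumption: rows are packed back to back with no padding. */
uint32_t framebuffer_bytes(uint32_t width, uint32_t height, uint32_t bits_per_pixel)
{
    assert(bits_per_pixel > 0 && bits_per_pixel <= 8);
    uint64_t bits = (uint64_t)width * height * bits_per_pixel;
    return (uint32_t)((bits + 7) / 8); /* round up to whole bytes */
}

/* Milliseconds needed to read `bytes` once at `bytes_per_second`. */
double read_time_ms(uint32_t bytes, double bytes_per_second)
{
    assert(bytes_per_second > 0.0);
    return 1000.0 * (double)bytes / bytes_per_second;
}
```

Under these assumptions, framebuffer_bytes(1200, 825, 4) gives 495000 bytes, and read_time_ms(495000, 20e6) gives 24.75 ms for one full read at ~20 MB/s, versus 12.375 ms at 40 MB/s. Because the framebuffer is read on every matrix scan, that cost is paid once per scan.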
However, some threads mention 40MB/s as the theoretical maximum (https://esp32.com/viewtopic.php?t=13356, viewtopic.php?t=7158). Is that possible to achieve in practice somehow? E.g. by bypassing the cache or talking to the RAM directly? Maybe a way to issue larger reads?
Just curious to see if there's still room for improvement.
Thanks!
Re: PSRAM read-only performance
In theory you should already get close to 40MByte/sec when only reading the PSRAM. The 20MByte/sec figure is only for writes.
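The 40 MB/s figure is the raw bus rate, assuming the PSRAM sits on a 4-bit (quad) SPI bus: at 80 MHz that moves 80e6 × 4 / 8 = 40e6 bytes per second, and any clocks spent on command, address or wait cycles per burst come out of that budget. A minimal C model of this follows; spi_bytes_per_second is a hypothetical helper, and overhead_clocks is an illustrative knob, not a value taken from the ESP32 datasheet.

```c
#include <assert.h>

/* Effective bytes/s of a SPI-style bus that moves `lanes` bits per clock
 * and spends `overhead_clocks` (command/address/wait; hypothetical here)
 * on every burst of `burst_bytes` payload bytes. */
double spi_bytes_per_second(double clock_hz, unsigned lanes,
                            unsigned burst_bytes, unsigned overhead_clocks)
{
    assert(clock_hz > 0.0 && lanes > 0 && burst_bytes > 0);
    double data_clocks = (double)burst_bytes * 8.0 / lanes; /* clocks carrying payload */
    return clock_hz * burst_bytes / (data_clocks + overhead_clocks);
}
```

With zero overhead, spi_bytes_per_second(80e6, 4, 32, 0) reproduces the 40 MB/s ceiling. If the overhead per 32-byte burst were as long as its 64-clock data phase, the effective rate would halve; that overhead value is purely illustrative.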
Re: PSRAM read-only performance
Hi, sorry to revive this, but I'm still struggling.
With the code below I still only get about 21 MB/s. The disassembly of the inner read loop contains only an l32i.n (load) instruction, so there are no writes as far as I can tell.
The following code:
Code: Select all
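#include <stdio.h>
#include <stdlib.h>
#include <inttypes.h>
#include "esp_attr.h"          // IRAM_ATTR
#include "esp_timer.h"         // esp_timer_get_time()
#include "esp_heap_caps.h"     // heap_caps_malloc(), MALLOC_CAP_SPIRAM
#include "freertos/FreeRTOS.h"
#include "freertos/task.h"     // vTaskDelay()

// Current time in seconds
double GetTime() { return (double)esp_timer_get_time() / 1000000; }

int IRAM_ATTR RamTest()
{
    int rs[] = { 1,2,4,8,16,32,64,128,256,512,1024,2048,3600 };  // array sizes in KB
    printf("Ram Speed Test!\n\n");
    uint32_t xx = 0;  // OR of all words read; printed at the end so the reads are not optimized away
    for (int a = 0; a < 13; a++)
    {
        printf("Read Speed 32bit ArraySize %4dkb ", rs[a]);
        int ramsize = rs[a] * 1024;
        const int * rm = (const int*)heap_caps_malloc(ramsize, MALLOC_CAP_SPIRAM);
        if (rm == NULL) {
            printf(" allocation failed\n");
            continue;
        }
        int iters = 10; // Few enough iterations that the task watchdog does not fire
        if (rs[a] < 512) iters = 50;
        double st = GetTime();
        for (int b = 0; b < iters; b++) {
            const int * test = rm;
            for (int c = 0; c < ramsize/4; c++)
                xx |= *(test++);
        }
        st = GetTime() - st;
        vTaskDelay(1); // Yield so the idle task can feed the watchdog
        double speed = ((double)(iters*ramsize) / (1024 * 1024)) / (st);
        printf(" time: %2.1f %2.1f mb/sec \n", st, speed);
        free((void *)rm);
    }
    printf("Test done!\n");
    printf("%" PRIu32 "\n", xx);
    return 0;
}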
Produces the following output:
Code: Select all
Read Speed 32bit ArraySize 1kb time: 0.0 93.0 mb/sec
Read Speed 32bit ArraySize 2kb time: 0.0 96.9 mb/sec
Read Speed 32bit ArraySize 4kb time: 0.0 96.4 mb/sec
Read Speed 32bit ArraySize 8kb time: 0.0 96.7 mb/sec
Read Speed 32bit ArraySize 16kb time: 0.0 96.8 mb/sec
Read Speed 32bit ArraySize 32kb time: 0.0 95.9 mb/sec
Read Speed 32bit ArraySize 64kb time: 0.1 21.8 mb/sec
Read Speed 32bit ArraySize 128kb time: 0.3 21.7 mb/sec
Read Speed 32bit ArraySize 256kb time: 0.6 21.7 mb/sec
Read Speed 32bit ArraySize 512kb time: 0.2 21.7 mb/sec
Read Speed 32bit ArraySize 1024kb time: 0.5 21.7 mb/sec
Read Speed 32bit ArraySize 2048kb time: 0.9 21.7 mb/sec
Read Speed 32bit ArraySize 3600kb time: 1.6 21.6 mb/sec
Test done!
Flash is set to 80MHz and QIO, PSRAM is set to 80MHz.
Is there something I have overlooked?
Re: PSRAM read-only performance
Interestingly, when I switch the above code to use memcpy() instead of reading in a loop with *(test++), I get up to 25.7 MB/s. So I can go above 20 MB/s, but cannot quite reach 40. I already tried reducing the FreeRTOS tick rate etc. to keep context switches from interfering, but that doesn't change anything.
I also tried different silicon revisions (1 and 3), with no change there either. The same goes for forcing different cache modes or running in single-core mode.
Any idea what else I can do?
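One way to apply the memcpy() observation to the framebuffer scan is to stage each row in internal RAM with a single memcpy() and do all per-pixel work on that copy. The sketch below is plain, host-compilable C; scan_framebuffer, row_fn and checksum_row are hypothetical names. On ESP-IDF the line buffer would presumably be an internal allocation (for example heap_caps_malloc(row_bytes, MALLOC_CAP_INTERNAL)) rather than a caller-provided array, and whether this pattern is faster for a given workload would need measuring.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical per-row callback: consumes one row that now lives in fast memory. */
typedef void (*row_fn)(const uint8_t *row, size_t row_bytes, void *ctx);

/* Copy each row of `fb` (e.g. in PSRAM) into `line_buf` (e.g. internal RAM)
 * with one memcpy(), then hand the copy to `fn`. */
void scan_framebuffer(const uint8_t *fb, size_t row_bytes, size_t rows,
                      uint8_t *line_buf, row_fn fn, void *ctx)
{
    assert(fb != NULL && line_buf != NULL && fn != NULL);
    for (size_t r = 0; r < rows; r++) {
        memcpy(line_buf, fb + r * row_bytes, row_bytes);
        fn(line_buf, row_bytes, ctx);
    }
}

/* Example callback: adds every byte of the row into *(uint32_t *)ctx. */
void checksum_row(const uint8_t *row, size_t row_bytes, void *ctx)
{
    uint32_t *sum = ctx;
    for (size_t i = 0; i < row_bytes; i++)
        *sum += row[i];
}
```

Keeping the PSRAM access in one bulk copy per row separates the slow, cache-bound transfer from the per-pixel logic, which then only touches internal memory.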
Re: PSRAM read-only performance
When you switched to the V3 chip, did you disable the PSRAM workaround in menuconfig?
Re: PSRAM read-only performance
Hi WiFive,
Yes, the workaround is disabled in menuconfig and the minimum chip revision is set to 3.
The serial debug output on startup seems reasonable:
Code: Select all
I (32) boot: chip revision: 3
I (36) qio_mode: Enabling default flash chip QIO
I (41) boot.esp32: SPI Speed : 80MHz
I (46) boot.esp32: SPI Mode : QIO
I (51) boot.esp32: SPI Flash Size : 4MB
....
I (186) psram: This chip is ESP32-D0WD
I (186) spiram: Found 64MBit SPI RAM device
I (186) spiram: SPI RAM mode: flash 80m sram 80m
I (189) spiram: PSRAM initialized, cache is in low/high (2-core) mode.
Re: PSRAM read-only performance
Interesting. If anything, I can replicate your results. The issue here may be that the cache in the ESP32 isn't super-smart compared to, e.g., the ESP32-S2's cache: from what I know, it loads the entire cache line from PSRAM before letting the CPU continue. This means that the cache line fill and the CPU reading from the cache don't happen at the same time, as the CPU is effectively halted on the first word of a cache line until the whole line has been loaded. The memcpy() result supports that: because memcpy() is optimized, the code between cache fills executes faster, so the memory transfers happen faster overall. This is also indicated by changing your inner loop to this:
Code: Select all
for (int b = 0; b < iters; b++) {
    const int * test = rm;
    // Read one 32-byte cache line (8 words) per iteration;
    // assumes ramsize is a multiple of 32 bytes (true for every size in rs[])
    for (int c = 0; c < ramsize/4; c += (32/4)) {
        xx |= test[0];
        xx |= test[1];
        xx |= test[2];
        xx |= test[3];
        xx |= test[4];
        xx |= test[5];
        xx |= test[6];
        xx |= test[7];
        test += 8;
    }
}
nets me a cool 24.9 MB/s. (This code is faster because the compiler can turn it into 8 loads with constant offsets instead of doing a load+add+loop every iteration.)
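The unrolled loop above assumes the buffer length is a multiple of 32 bytes, which holds for every size in the test. A sketch of the same idea as a reusable function, with a scalar tail loop for other lengths (or_reduce_unrolled is a hypothetical name):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* OR together n_words 32-bit words, eight per iteration (one 32-byte
 * cache line, the stride the unrolled loop uses), so the compiler can
 * emit eight loads with constant offsets per loop step. */
uint32_t or_reduce_unrolled(const uint32_t *p, size_t n_words)
{
    assert(p != NULL || n_words == 0);
    uint32_t acc = 0;
    size_t i = 0;
    for (; i + 8 <= n_words; i += 8) {
        acc |= p[i + 0]; acc |= p[i + 1];
        acc |= p[i + 2]; acc |= p[i + 3];
        acc |= p[i + 4]; acc |= p[i + 5];
        acc |= p[i + 6]; acc |= p[i + 7];
    }
    for (; i < n_words; i++) /* leftover words when n_words % 8 != 0 */
        acc |= p[i];
    return acc;
}
```

The tail loop only runs for the last few words, so the fast unrolled path still covers almost all of a large buffer.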
The unfortunate bit is that if this is the limit, I'm not quite sure how to go faster. The ESP32-S2 and later chips have knobs that allow cache pre-reads (and have smarter cache handling in general, so this should be less of a bottleneck in the first place), but the ESP32 lacks most of those. The only workaround I can think of is effectively sacrificing one of the CPU cores: let one core do the 'pre-read' by reading and discarding one word per PSRAM cache line, so the other core can then do its PSRAM accesses at full speed. I'll be the first to admit that's a somewhat harebrained and impractical scheme, though.
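A minimal sketch of the 'pre-read' half of that scheme, assuming 32-byte cache lines (the stride the unrolled loop uses). It shows only the touch loop in portable C. On the ESP32 it would have to run in a task pinned to the other core (for example via xTaskCreatePinnedToCore()) and stay slightly ahead of the consumer; that synchronization is not shown, and whether lines fetched by one core actually help the other depends on the cache configuration and is not verified here. touch_cache_lines and CACHE_LINE_BYTES are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed cache line size: 32 bytes. */
#define CACHE_LINE_BYTES 32u

/* Read and discard one byte per cache line of [buf, buf + len).
 * Per the explanation above, the cache loads the whole line on any access.
 * For full coverage buf should be cache-line aligned.
 * Returns the number of lines touched. */
size_t touch_cache_lines(const void *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    const volatile uint8_t *p = (const volatile uint8_t *)buf;
    size_t lines = 0;
    for (size_t off = 0; off < len; off += CACHE_LINE_BYTES) {
        uint8_t discard = p[off]; /* volatile load cannot be optimized away */
        (void)discard;            /* value unused; only the cache fill matters */
        lines++;
    }
    return lines;
}
```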
Re: PSRAM read-only performance
Hi, ESP_Sprite,
Thanks for the informative answer, that's good to know. Unfortunately I need the second core for computation, so switching to the single-core S2 wouldn't help much. But at least I know I'm not leaving that performance on the table out of stupidity.
Another question, if you don't mind: do you know of any way to disable the cache workaround for one function only? Assume I have a function that I know will only ever write to (and read from) internal memory buffers. With the workaround enabled, it is littered with memw instructions. Is there any annotation, macro, etc. that I can use to prevent the compiler from inserting them in this function?
Re: PSRAM read-only performance
There isn't, sorry. A workaround would be to put the function into a separate file and then somehow tell CMake not to pass the cache workaround flag (-mfix-esp32-psram-cache-issue) to gcc when compiling that file.
Re: PSRAM read-only performance
All right, I'll try that. Thank you for your help!