PSRAM read-only performance

Post by **vroland** » Sun Jun 14, 2020 11:50 am

Hello,

Finding this forum very helpful as a silent reader, I now hope to get my own question answered as well.

I'm working on an E-Ink driver board based on the ESP32 (https://hackaday.io/project/168193-epdi ... controller). Those displays are driven by scanning an active matrix and applying some voltage multiple times to reach a desired grey value. Thus, performance is critical for reasonable update speeds.

As a 1200x825 4-bit framebuffer gets quite large (0.5MB), I use a WROVER module with external SPIRAM, where the framebuffers are stored.
This framebuffer has to be read for every matrix scan, making this data transfer the bottleneck of my application. Using one core solely for copying data, the throughput is ~20MB/s, which seems to be what I can expect when going through the cache (viewtopic.php?t=8492).
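For a rough sense of scale, the arithmetic behind these figures can be sketched as below. The helper names are made up for illustration, and 1 MB is taken as 10^6 bytes, which is an assumption about how the throughput figures were measured.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: bytes needed for a tightly packed framebuffer. */
size_t framebuffer_bytes(size_t width, size_t height, size_t bits_per_pixel)
{
    assert(bits_per_pixel > 0 && bits_per_pixel <= 8);
    size_t bits = width * height * bits_per_pixel;
    return (bits + 7) / 8; /* round up to whole bytes */
}

/* Hypothetical helper: milliseconds needed to read `bytes` once at the
 * given rate. Assumes 1 MB = 1e6 bytes. */
double read_time_ms(size_t bytes, double mbytes_per_sec)
{
    assert(mbytes_per_sec > 0.0);
    return (double)bytes / (mbytes_per_sec * 1e6) * 1e3;
}
```

With the numbers above, framebuffer_bytes(1200, 825, 4) is 495000 bytes, so one full pass over the buffer takes roughly 24.75 ms at 20MB/s and roughly 12.4 ms at 40MB/s.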

However, some threads mention 40MB/s as the theoretical maximum (https://esp32.com/viewtopic.php?t=13356, viewtopic.php?t=7158). Is that possible to achieve in practice somehow? E.g. by bypassing the cache or talking to the RAM directly? Maybe a way to issue larger reads?
Just curious to see if there's still room for improvement.

Thanks!

Post by **ESP_Sprite** » Mon Jun 15, 2020 10:29 am

In theory you should already get close to 40MByte/sec when only reading the PSRAM. The 20MByte/sec figure is only for writes.

Post by **vroland** » Mon Jan 25, 2021 4:57 pm

Hi, sorry to revive this, but I'm still struggling.
The following code:

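Code:

#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>
#include "esp_attr.h"
#include "esp_heap_caps.h"
#include "esp_timer.h"
#include "freertos/FreeRTOS.h"
#include "freertos/task.h"

// Current time in seconds.
double GetTime() { return (double)esp_timer_get_time() / 1000000; }

int IRAM_ATTR RamTest()
{
	// Buffer sizes to test, in KiB.
	int rs[] = { 1,2,4,8,16,32,64,128,256,512,1024,2048,3600 };
	printf("Ram Speed Test!\n\n");
	uint32_t xx = 0; // accumulates every word read so the loads are not optimized away
	for (int a = 0; a < 13; a++)
	{
		printf("Read Speed 32bit ArraySize %4dkb ", rs[a]);
		int ramsize = rs[a] * 1024;
		// The buffer is left uninitialized; only the read timing matters.
		const int * rm = (const int*)heap_caps_malloc(ramsize, MALLOC_CAP_SPIRAM);
		if (rm == NULL) {
			printf("allocation failed\n");
			continue;
		}

		int iters = 10; // few enough passes to stay clear of the watchdog
		if (rs[a] < 512) iters = 50;
		double st = GetTime();
		for (int b = 0; b < iters; b++) {
			const int * test = rm;
			for (int c = 0; c < ramsize/4; c++)
				xx |= *(test++);
		}
		st = GetTime() - st;
		vTaskDelay(1); // yield so the watchdog gets fed
		double speed = ((double)(iters*ramsize) / (1024 * 1024)) / (st);
		printf(" time: %2.1f %2.1f mb/sec  \n", st, speed);
		heap_caps_free((void*)rm);
	}
	printf("Test done!\n");
	printf("%" PRIu32 "\n", xx);
	return 0;
}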

Produces the following output:

Code:

Read Speed 32bit ArraySize    1kb  time: 0.0 93.0 mb/sec  
Read Speed 32bit ArraySize    2kb  time: 0.0 96.9 mb/sec  
Read Speed 32bit ArraySize    4kb  time: 0.0 96.4 mb/sec  
Read Speed 32bit ArraySize    8kb  time: 0.0 96.7 mb/sec  
Read Speed 32bit ArraySize   16kb  time: 0.0 96.8 mb/sec  
Read Speed 32bit ArraySize   32kb  time: 0.0 95.9 mb/sec  
Read Speed 32bit ArraySize   64kb  time: 0.1 21.8 mb/sec  
Read Speed 32bit ArraySize  128kb  time: 0.3 21.7 mb/sec  
Read Speed 32bit ArraySize  256kb  time: 0.6 21.7 mb/sec  
Read Speed 32bit ArraySize  512kb  time: 0.2 21.7 mb/sec  
Read Speed 32bit ArraySize 1024kb  time: 0.5 21.7 mb/sec  
Read Speed 32bit ArraySize 2048kb  time: 0.9 21.7 mb/sec  
Read Speed 32bit ArraySize 3600kb  time: 1.6 21.6 mb/sec  
Test done!

So still only 21 MB/s. The disassembly of the inner loop only contains an l32i.n instruction, so no writes as far as I can tell.
Flash is set to 80MHz and QIO, PSRAM is set to 80MHz.
Is there something I have overlooked?

Post by **vroland** » Sat Jan 30, 2021 1:04 pm

Interestingly, when switching the above code to use memcpy() instead of reading *(test++) in a loop, I get up to 25.7 MB/s. So I can go above 20MB/s, but cannot quite reach 40. I already tried to reduce the FreeRTOS tick rate, etc. to prevent context switches from interfering, but that doesn't change anything.
I also tried different silicon revisions (1 and 3), with no change either. Same when forcing different cache modes or running in single-core mode.
Any idea what else I can do?
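The exact memcpy() variant was not posted, so the following is only a guess at its general shape: the source buffer is pulled through a small local buffer in chunks, and every copied byte is consumed so the reads are not dead code. The function name and the 256-byte chunk size are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define LINE_BUF_BYTES 256 /* assumed chunk size, not taken from the thread */

/* Hypothetical helper: read `len` bytes from `src` (e.g. a PSRAM buffer)
 * via memcpy() into a small local buffer and return the sum of all bytes. */
uint32_t checksum_via_memcpy(const uint8_t *src, size_t len)
{
    assert(src != NULL || len == 0);
    uint8_t line[LINE_BUF_BYTES];
    uint32_t sum = 0;

    while (len > 0) {
        size_t n = len < LINE_BUF_BYTES ? len : LINE_BUF_BYTES;
        memcpy(line, src, n);            /* bulk read from the source */
        for (size_t i = 0; i < n; i++)   /* work on the local copy */
            sum += line[i];
        src += n;
        len -= n;
    }
    return sum;
}
```

Timing a call to this on a buffer obtained from heap_caps_malloc(..., MALLOC_CAP_SPIRAM), in place of the inner loop of the benchmark, would be one way to compare the two access patterns.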

Post by **WiFive** » Sat Jan 30, 2021 4:07 pm

When you switched to V3 chip did you disable the psram workaround in menuconfig?

Post by **vroland** » Sat Jan 30, 2021 4:51 pm

Hi WiFive,
Yes, the workaround is disabled in menuconfig and the minimum chip revision is set to 3.
The serial debug output on startup seems reasonable:

Code:

I (32) boot: chip revision: 3
I (36) qio_mode: Enabling default flash chip QIO
I (41) boot.esp32: SPI Speed      : 80MHz
I (46) boot.esp32: SPI Mode       : QIO
I (51) boot.esp32: SPI Flash Size : 4MB
....
I (186) psram: This chip is ESP32-D0WD
I (186) spiram: Found 64MBit SPI RAM device
I (186) spiram: SPI RAM mode: flash 80m sram 80m
I (189) spiram: PSRAM initialized, cache is in low/high (2-core) mode.

Post by **ESP_Sprite** » Mon Feb 01, 2021 4:24 am

Interesting. If anything, I can replicate your results. The issue here may be that the cache in the ESP32 isn't super-smart compared to e.g. the ESP32S2's cache: from what I know, it tries to load the entire cache line from PSRAM before continuing. This means that the cache load and the CPU reading from the cache won't happen at the same time: the CPU is effectively halted on the first word of a cache line until that line is fully loaded. The memcpy() result supports that: as memcpy() is optimized, the code executes a bit faster, so the overall transfer is faster as well. This is also indicated by the fact that changing your inner loop to this

Code:

		for (int b = 0; b < iters; b++) {
			const int * test = rm;
			for (int c = 0; c < ramsize/4; c+=(32/4)) {
				xx |= test[0];
				xx |= test[1];
				xx |= test[2];
				xx |= test[3];
				xx |= test[4];
				xx |= test[5];
				xx |= test[6];
				xx |= test[7];
				test+=8;
			}
		}

nets me a cool 24.9MByte/sec. (This code is faster as the compiler can turn it into 8 load instructions with fixed offsets instead of doing a load+add+loop for every word.)

The unfortunate bit is that if this is the limit, I'm not quite sure how to go faster... The ESP32S2 and later chips have knobs that allow you to do cache preread (and have smarter cache handling in general, so this should be less of a bottleneck in the first place), but the ESP32 lacks a lot of those. The only workaround I can think of to get more speed is to effectively sacrifice one of the CPU cores: let one core do the 'pre-read' by reading and discarding one word per cache line of PSRAM, so that the other CPU can do PSRAM operations at full speed. I'll be the first to admit that's a somewhat harebrained and impractical scheme, though.
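A minimal sketch of the 'pre-read' part of that idea, untested on hardware: one word is read and discarded per cache line. The function name is made up, and the 32-byte line size is an assumption taken from the 32-byte stride used in the unrolled loop above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_BYTES 32 /* assumed line size, matching the stride above */

/* Hypothetical helper: touch one 32-bit word per cache line of `buf` so
 * that each line gets pulled into the cache. Returns the number of lines
 * touched. */
size_t touch_cache_lines(const volatile uint32_t *buf, size_t bytes)
{
    assert(buf != NULL || bytes == 0);
    const size_t words_per_line = CACHE_LINE_BYTES / sizeof(uint32_t);
    const size_t words = bytes / sizeof(uint32_t);
    size_t lines = 0;

    for (size_t w = 0; w < words; w += words_per_line) {
        (void)buf[w]; /* volatile read: the load happens, the value is dropped */
        lines++;
    }
    return lines;
}
```

On the ESP32 this would have to run in a task on the sacrificed core, staying slightly ahead of the core that consumes the data. How far ahead it may run before lines are evicted again, and how the two cores stay in step, is left open here.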

Post by **vroland** » Mon Feb 01, 2021 1:58 pm

Hi, ESP_Sprite,
thanks for the informative answer. That's good to know. I need the second core for computation unfortunately, so switching to the S2 wouldn't help much. But at least I know I'm not leaving that performance on the table because of stupidity.

Another question if you don't mind: Do you know of any way to disable the cache workaround for one function only? Assume I have a function where I know it will only ever write to (and read from) internal memory buffers. If I enable the workaround, it is littered with memw. Is there any annotation, macro, etc that I can use to prevent the compiler from inserting them in this function?

Post by **ESP_Sprite** » Tue Feb 02, 2021 1:37 am

There isn't, sorry. A workaround would be to put the function into a separate file and then somehow tell CMake not to pass gcc the cache workaround command-line flag when compiling that file.

Post by **vroland** » Tue Feb 02, 2021 10:18 am

All right, I'll try that. Thank you for your help!
