
Unexpectedly low floating-point performance in C

Posted: Fri Dec 23, 2016 1:48 pm
by aschweiz
Hi,

according to the datasheet, the ESP32 seems to contain a hardware FPU, but in my tests, I get very bad FPU performance. For example, the following addition of two "float" variables and assignment to a third "float" variable (all 3 declared as volatile so that they don't get optimized away) takes 8.6 microseconds (measured on the GPIO output):

Code:

    volatile float f1 = 1.11111111111111111111;
    volatile float f2 = 4.44444444444444444444;
    volatile float fSum;
...
    gpio_set_level(GPIO_NUM_18, 1);
    fSum = f1 + f2;
    gpio_set_level(GPIO_NUM_18, 0);
Is there a known issue with the FPU (I couldn't find anything in the errata)? Or maybe the compiler has not yet been optimized for floating-point calculations, or needs additional compilation flags? (I've used the default ESP-IDF configuration with a 240 MHz clock speed.)

Cheers
Andreas

Re: Unexpectedly low floating-point performance in C

Posted: Fri Dec 23, 2016 1:51 pm
by aschweiz
Additional info: assembly output of the compiler, generated with -save-temps, plus some comments from the Xtensa instruction set architecture reference manual:

Code:

	call8	vTaskSuspendAll
.LVL110:
	.loc 1 272 0
	movi.n	a11, 1
	movi.n	a10, 0x12
	call8	gpio_set_level    <------ t=0
.LVL111:
	.loc 1 273 0
	l32r	a3, .LC43          <-- variable f1, "32-bit load PC-relative"
	l32r	a2, .LC44          <-- variable f2
.LVL112:
	memw
	lsi	f1, a3, 0           <-- "load single-precision immediate"
	memw
	lsi	f0, a2, 0
	l32r	a4, .LC45          <-- variable fSum
	add.s	f0, f1, f0        <-- *** "single-precision add" ***
	.loc 1 274 0
	movi.n	a11, 0           <-- "load register with 12-bit signed constant"
	movi.n	a10, 0x12
	.loc 1 273 0
	memw
	ssi	f0, a4, 0           <-- "store from floating-point register to memory"
	.loc 1 274 0
	call8	gpio_set_level    <------ t=8.4µs
.LVL113:

Re: Unexpectedly low floating-point performance in C

Posted: Fri Dec 23, 2016 4:10 pm
by kolban
Andreas,
I wondered how you were measuring performance down to the microsecond, and then I noticed that you were changing GPIO values before and after the arithmetic. If one then used a logic analyzer, one could see the time from when the signal went high to when it went low. Nice ... I wouldn't have thought of that.

However, in my naive thinking, I am presuming that the call to gpio_set_level() is not instantaneous. As such it seems to me you might be measuring:

The time within gpio_set_level after the setting to logic 1 till the function ends +
The time for arithmetic +
The time within gpio_set_level before the start of the call to setting to logic 0

What if you replicated the arithmetic statement (say) 100 or 1000 times? Then the error introduced by the calls to gpio_set_level() might be reduced and we might get a new number.

Again ... I may be all washed up here ... but I'd be interested in your thoughts.
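
The idea above can be sketched as a small helper that subtracts the fixed cost of the two gpio_set_level() calls and divides what remains by the repetition count. This is only a sketch in plain C: per_op_ns and its parameter names are made up for illustration (they are not part of ESP-IDF), and the overhead value is assumed to be measured separately, for example by toggling the pin with nothing in between.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not an ESP-IDF function): estimate the cost of one
 * operation from a measurement that wraps `reps` repetitions of it.
 *   total_ns    - pulse width seen on the logic analyzer
 *   overhead_ns - pulse width with nothing between the two GPIO calls
 *   reps        - how many times the operation was repeated
 */
static double per_op_ns(uint64_t total_ns, uint64_t overhead_ns, uint32_t reps)
{
    assert(reps > 0);
    assert(total_ns >= overhead_ns);
    return (double)(total_ns - overhead_ns) / (double)reps;
}
```

With a single repetition, any error in the overhead estimate lands in the result in full; with 1000 repetitions the same error is divided by 1000.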

Re: Unexpectedly low floating-point performance in C

Posted: Fri Dec 23, 2016 4:29 pm
by ESP_Sprite
Not sure if this applies here, but the first FPU calculation (either the first in a task, or the first after another task that uses the FPU has run) is slower than you'd expect. This is because FreeRTOS on the Xtensa does lazy context switching of the FPU registers. Basically, it initially assumes no task will ever use the FPU, and it disables the FPU to make sure of this. Once a task does use the FPU, the fact that it is disabled generates an exception. In this exception handler, the Xtensa FreeRTOS code scrambles to get the FPU into a usable state, and returns once that is done. It may be that you're measuring this initial startup delay as well. You should only see it the first time you use the FPU.
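
To make the mechanism concrete, here is a toy model of lazy FPU switching in plain C. It is not the Xtensa FreeRTOS code: task_t, switch_to and use_fpu are invented names, and the actual save and restore of registers is left out. It only shows which FPU uses would hit the slow exception path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: these types and functions are invented for illustration. */
typedef struct {
    bool fpu_enabled;     /* may this task execute FPU instructions right now? */
    unsigned trap_count;  /* how often this task took the "FPU disabled" trap */
} task_t;

static task_t *fpu_owner = NULL;  /* task whose registers are live in the FPU */

/* Context switch: FPU registers are NOT saved here. The incoming task gets
 * the FPU enabled only if its registers are still the ones in the FPU. */
static void switch_to(task_t *next)
{
    next->fpu_enabled = (fpu_owner == next);
}

/* Called before every FPU instruction in this model.
 * Returns true if the slow path (the exception) was taken. */
static bool use_fpu(task_t *t)
{
    if (t->fpu_enabled) {
        return false;     /* fast path: no exception */
    }
    /* Slow path: save the previous owner's registers (omitted), restore
     * this task's registers (omitted), then enable the FPU. */
    fpu_owner = t;
    t->fpu_enabled = true;
    t->trap_count++;
    return true;
}
```

In this model a task traps on its first FPU use and again after another task has taken over the FPU in between, while back-to-back FPU instructions in the same task take the fast path.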

Re: Unexpectedly low floating-point performance in C

Posted: Fri Dec 23, 2016 4:43 pm
by aschweiz
Hi Neil,

that was also my first guess, but it turns out that it just takes 200 nanoseconds to toggle the output high and low again.

Meanwhile, I also tried the idea with the loop and the performance is much better. Doing an "f1 += f2" 100 times takes only 64 microseconds, 640 nanoseconds per addition.
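
For comparison with instruction timings, the per-addition figure can be converted to CPU cycles at the 240 MHz clock mentioned in the first post. A small sketch; ns_to_cycles is a made-up helper name:

```c
#include <assert.h>

/* Hypothetical helper: convert a duration in nanoseconds to CPU cycles
 * for a clock given in MHz (cycles = ns * MHz / 1000). */
static double ns_to_cycles(double ns, double cpu_mhz)
{
    assert(cpu_mhz > 0.0);
    return ns * cpu_mhz / 1000.0;
}
```

ns_to_cycles(640.0, 240.0) gives about 154 cycles per loop iteration, and ns_to_cycles(8600.0, 240.0) gives about 2064 cycles for the original single-shot measurement.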

@ESP_Sprite, thank you for the information. Actually, I did a couple of different floating-point operations in sequence (see here: https://blog.classycode.com/esp32-float ... .icfif348q) and wondered why the multiplication was faster than the addition (4.1µs vs. 8.7µs). The initialisation you describe could explain this.

I wonder if maybe these "memw" instructions play a role? Are these some sort of memory barrier or cache flushes?

cheers
Andreas

Re: Unexpectedly low floating-point performance in C

Posted: Fri Dec 23, 2016 11:38 pm
by ESP_igrr
Yes, these memory barriers are added by the compiler when you use volatile variables.
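
Since the barriers come with the volatile accesses, one way to keep them out of the timed loop is to accumulate in an ordinary local variable and touch volatile storage only where it is really needed. The sketch below contrasts the two forms in portable C; the function names are made up, and how much difference this makes on the ESP32 has not been measured here.

```c
#include <assert.h>

/* Accumulate through volatile variables: every read and write of f1 and f2
 * must really go to memory, so the running sum cannot stay in a register. */
static float sum_volatile(float start, float step, int reps)
{
    volatile float f1 = start;
    volatile float f2 = step;
    assert(reps >= 0);
    for (int i = 0; i < reps; i++) {
        f1 += f2;         /* load f1, load f2, add, store f1 */
    }
    return f1;
}

/* Same arithmetic with a plain local accumulator: the compiler is free to
 * keep acc in a register for the whole loop. */
static float sum_local(float start, float step, int reps)
{
    float acc = start;
    assert(reps >= 0);
    for (int i = 0; i < reps; i++) {
        acc += step;
    }
    return acc;
}
```

Note that with optimisation enabled the compiler may simplify the second loop considerably, so it is worth checking the generated assembly again before trusting a timing.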

Re: Unexpectedly low floating-point performance in C

Posted: Sat Dec 24, 2016 1:07 am
by Greenja
Hello Andreas,

The big question is, are you running an RTOS on the STM32F7?

Thanks,

Re: Unexpectedly low floating-point performance in C

Posted: Tue Dec 27, 2016 11:58 pm
by ESP_Angus
Hi Andreas,

Could you please post your test code (for ARM & esp-idf)? We'd be interested to take a look.

Angus

Re: Unexpectedly low floating-point performance in C

Posted: Sat Jan 07, 2017 12:59 pm
by aschweiz
Hi Angus,

attached is the test code. Let me know if you need more information or other files.

cheers
Andreas

Re: Unexpectedly low floating-point performance in C

Posted: Sat Jan 07, 2017 1:19 pm
by aschweiz
Hi Greenja,

good point :)

Indeed, the code on the STM32F767 was run without RTOS.

However, I need to disappoint you: I've repeated the test with the code running in a FreeRTOS task on the STM32F767 and the numbers are still more or less the same.

Attached is the assembly output of the compiler. The left side is without FreeRTOS, the right side with FreeRTOS. I've disabled compiler optimisations (-O0) in both cases.

N.B.: Don't be misled by the comment "7400ns --> 7800ns". I first thought that the code with FreeRTOS was slightly slower, but the reason was that I had initially compiled it with -O0 vs. -O3 for the non-FreeRTOS version. So, 7400ns for "pow" is with -O3, 7800ns with -O0.

cheers
Andreas