Unexpectedly low floating-point performance in C
Re: Unexpectedly low floating-point performance in C
There are several things to consider in this floating point test.
1. The ESP32 runs out of FLASH plus 64K of cached RAM. The RAM runs at full speed.
The FLASH is clocked at 80MHz and runs in DUAL mode (2 bits per clock cycle), which works out to 200ns for a 32-bit word and 150ns for a 24-bit word.
The ESP32 has both 24-bit and 32-bit instructions.
2. The first time through a set of instructions the CPU must fetch the code from FLASH.
It stores it in the 64K cache for use next time.
The cached code can get overwritten if there is a lot of other code between the first read and the second use. If this happens the code must be fetched from FLASH again.
3. There are 16 floating point registers, some number of which are used during a floating point operation.
These need to be loaded and unloaded with data. If a register already contains the value, it can be used again.
4. There are 38 single precision floating point instructions.
These native instructions include add, subtract, multiply, and multiply with add or subtract (useful for FIR filters in DSP code).
They do not include divide, or any higher level operations like sqrt, pow, sin, etc. If these higher level operations are needed, they typically run in a math library.
5. The FPU must be attached to the task the first time it is needed.
6. GPIO routines are generally written for general use and may involve one or more subroutine calls.
The ESP32 has special registers that allow set, clear, in, and out with just one instruction.
Even so, since the GPIO are I/O devices, they require multiple CPU clock cycles per access.
I wrote several floating point test routines. Some are very short, testing only the native FP operations. Others are more complex, using higher level operations like DIVIDE, POW, SQRT, and SIN.
All routines use loops, so the first iteration tests the operation fetched from FLASH and the second iteration tests the cached instructions running out of RAM.
Note that even the first use of code may already be cached due to line fills and prefetched code.
This shows up with instructions that take longer to execute: they give the FLASH time to run ahead and store the next instructions in the cache before they are needed.
This is why MUL looks faster than ADD in Aschweiz's test.
With all that said here are some of my results.
1. The GPIO operations to set and clear a bit on a port take about 62.4ns and must be subtracted from each test's run time, since I set and clear a bit around every FP instruction tested.
2. Starting the FP processor the first time takes ~5.6172us. This only happens once on a single-task system.
3. A single precision ADD takes 2.065us the first time through the loop.
The second time, running out of cache, it takes only 62.4ns.
4. A single precision MUL takes 4.0656us the first time through the loop.
The second time, running out of cache, it takes only 62.4ns.
5. A single precision MUL+ADD takes 2.09us the first time through the loop, since some of the FLASH code is already cached due to the longer execution time of the previous instruction.
The second time, running out of cache, it takes only 100.4ns.
All other operations use multiple FP instructions, and the code runs out of cache. Here are the single precision results:
1. DIVIDE takes 1.158us
2. SQRT takes 8.155us
3. POW takes 55.8us
4. SIN takes 15.776us
Since there are no double precision instructions, all operations use multiple FP instructions, and the code runs out of cache. Here are the double precision results:
1. DADD takes 400ns
2. DMUL takes 787ns
3. DMUL+DADD takes 1.11us
4. DDIVIDE takes 4.085us
5. DSQRT takes 7.88us, same as single precision
6. DPOW takes 55.37us, same as single precision
7. DSIN takes 15.776us, same as single precision
I was using the Arduino IDE and its math libraries, which only accept single precision inputs to SQRT, POW, and SIN; this is why those results are the same as single precision.
From the clock tick counter it looks like I am running at 240MHz. Looking at the shortest execution time, it takes 15 clock cycles to do an ADD.
I assume this is due to the loading and unloading of the floating point registers, since all of the variables are declared volatile.
If I remove volatile from one input and let it be cached in one of the 16 FP registers, then only one input and one output need to be loaded/unloaded from the FP registers. The ADD and MUL drop to 6 clock cycles, or about 25ns.
Doing two ADDs, each with one volatile input and a non-volatile output, and then using those two results for a third ADD whose output is declared volatile, allows three values to be cached in the FP registers.
The last ADD uses two of the results directly from the FP registers and only unloads one result, since its output is declared volatile. This reduces the ADD to 3 clock cycles, or about 12.5ns.
One thing about using GPIO for speed measurements: you can only toggle a GPIO so fast, and below that limit the GPIO will not toggle.
In that case you need to run the code with and without the instruction under test and take the difference, or use multiple iterations.
In my case the lower limit looks to be around 100ns. After subtracting the 62.4ns for the GPIO calls, this leaves 37.6ns, or 9 clock cycles.
Re: Unexpectedly low floating-point performance in C
So you're basically saying that the ESP32 does a very crappy job when it comes to FP and math functions.
I think someone from Espressif should see how they can eliminate all the wasted clock cycles, both in firmware and in the chip itself. To be fair though, I'm not sure a Cortex F7 and the Tensila Lx106(8?) would be an apples-to-apples comparison.
Good job though on the thorough summary. It is definitely something to take note of.
Thanks,
Re: Unexpectedly low floating-point performance in C
For a $6.50 part with all the I/O and WiFi and BT, even with a crappy FP unit, I cannot complain.
Re: Unexpectedly low floating-point performance in C
@russbarr Great work on profiling FP! Would you be so kind as to post your code so that we could re-create the tests?
Re: Unexpectedly low floating-point performance in C
I measured 700 ns for a double multiplication and 10 ns for a single multiplication in this testing code (a modification of the ESP Hello World example) compiled with no optimization:
This matches the numbers given above by @RussBar. This single precision performance is OK for me.
Code: Select all
#include <stdio.h>
#include <time.h>
#include "freertos/FreeRTOS.h"
#include "freertos/task.h"
#include "esp_system.h"
//typedef float ftype;
typedef double ftype;

ftype f = 1;
const int N = 100000;
const ftype C = 1.00001;
const ftype CI = (1 / 1.00001);

void hello_task(void *pvParameter)
{
    printf("Hello world!\n");
    for (int i = 10; i >= 0; i--) {
        printf("Restarting in %d seconds...\n", i);
        clock_t start = clock();
        if (i % 2 == 0)
            for (int i = 0; i < N; i++)
                f *= C;
        else
            for (int i = 0; i < N; i++)
                f *= CI;
                // f /= C;
        printf("%lf\n", (double) f);
        printf("%lf us\n", (clock() - start) / (double) N * 1000);
        vTaskDelay(1000 / portTICK_PERIOD_MS);
    }
    printf("Restarting now.\n");
    fflush(stdout);
    esp_restart();
}

void app_main()
{
    xTaskCreate(&hello_task, "hello_task", 2048, NULL, 5, NULL);
}
Re: Unexpectedly low floating-point performance in C
There is the possibility to define functions to be located in IRAM by using the "IRAM_ATTR" attribute. In this case the code will not be fetched from slow serial FLASH.
Could you "benchmark folks" test the execution speed again using the IRAM_ATTR attribute?
Regards
"Whoever believes to be someone has stopped becoming someone"
Sokrates
Re: Unexpectedly low floating-point performance in C
Digging up this old thread...
I ran some similar tests comparing the ESP32 to the STM32F767, and was similarly perplexed.
I also tested integer arithmetic, and in that case the ESP32 did even worse. The F7 is 3-4x faster on integer adds, while running at 80% of the clock speed, 192 MHz (it can max out at 216, but I didn't bother changing the clock settings).
Since the benchmarking code is so small, that pretty much eliminates issues with caching and the heavily limited bandwidth for code fetches on the ESP32. I ran 1,000,000 loops, each with 10 operations, so the initial code fetches and loop overhead are negligible. Adding IRAM_ATTR doesn't make a difference: the code stays in cache until it needs to be flushed, so IRAM only matters on the initial fetch.
After doing some research into the core architectures, I think I've found the reason for the difference:
*The Cortex M7 is a superscalar core*
Practically speaking, this means it can issue and execute two instructions per clock cycle. This is the only thing I can find that would explain it. Beyond that, the M7 also has a branch predictor which speeds up the loop itself (which is why I ran 10 ops per loop to try to mitigate that effect).
I would expect to see a similar difference between the M7 and M4.
I think this is a case of "you get what you pay for". The Xtensa is cheap. The ARM is fast.
Re: Unexpectedly low floating-point performance in C
I have taken a look at the Xtensa instruction set, and it should not be that slow, compared to what ARM uses.
Maybe the compiler is not optimized for this.
Code: Select all
task_t coffeeTask()
{
    while(atWork){
        if(!xStreamBufferIsEmpty(mug)){
            coffeeDrink(mug);
        } else {
            xTaskCreate(sBrew, "brew", 9000, &mug, 1, NULL);
            xSemaphoreTake(sCoffeeRdy, portMAX_DELAY);
        }
    }
    vTaskDelete(NULL);
}
Re: Unexpectedly low floating-point performance in C
So what is the result? Did you try the suggestion?
Re: Unexpectedly low floating-point performance in C
Vader_Mester wrote: I have taken a look at the Xtensa instruction set, and it should not be that slow compared to what ARM uses.
Maybe the compiler is not optimized for this.
It would be nice if someone tested it with the new toolchain that supports GCC 8.2, a better compiler.
I am sure it makes a difference.