Unexpectedly low floating-point performance in C

Oleg Endo
Posts: 18
Joined: Fri Sep 28, 2018 1:48 pm

Re: Unexpectedly low floating-point performance in C

Postby Oleg Endo » Tue Dec 18, 2018 3:20 pm

I've tried the following modified example, compiled with the GCC 8 toolchain, running on an ESP32 CPU core at 240 MHz. The SPI flash/RAM is 80 MHz QSPI, but that doesn't matter, as the code and data easily fit into the cache.

The benchmark loops have been modified to do 4 independent instructions per loop, as opposed to 4 sequentially dependent instructions. If some instructions can be executed in parallel, this should hopefully make for a better test. I've also made sure that the compiler actually generates tight arithmetic code.

Code: Select all

#include <cstdio>
#include <chrono>
#include <array>
#include <functional>


#if 0
  typedef unsigned int ftype;

  ftype f0 = 1;
  ftype f1 = 2;
  ftype f2 = 3;
  ftype f3 = 4;

  const int N = 10000;
  const ftype C = 1.00001;
  const ftype CI = (100/1.00001);
  const ftype CII = 13;

#else
  typedef float ftype;
//  typedef double ftype;

  const int N = 10000;
  const ftype C = 1.00001;
  const ftype CI = (1/1.00001);
  const ftype CII = 0.001;
#endif

  const int M = 3;

std::array<ftype, 4> test_values;

[[gnu::noinline, gnu::optimize ("fast-math")]]
void test_addition (void)
{
  auto f0 = test_values[0];
  auto f1 = test_values[1];
  auto f2 = test_values[2];
  auto f3 = test_values[3];

  for (int j = 0; j < N/4; j++)
  {
    f0 += C;
    f1 += C;
    f2 += C;
    f3 += C;
  }

  test_values[0] = f0;
  test_values[1] = f1;
  test_values[2] = f2;
  test_values[3] = f3;
}

[[gnu::noinline, gnu::optimize ("fast-math")]]
void test_multiplication (void)
{
  auto f0 = test_values[0];
  auto f1 = test_values[1];
  auto f2 = test_values[2];
  auto f3 = test_values[3];

  for (int j = 0; j < N/4; j++)
  {
    f0 *= CI;
    f1 *= CI;
    f2 *= CI;
    f3 *= CI;
  }

  test_values[0] = f0;
  test_values[1] = f1;
  test_values[2] = f2;
  test_values[3] = f3;
}

[[gnu::noinline, gnu::optimize ("fast-math")]]
void test_multiply_accumulate (void)
{
  auto f0 = test_values[0];
  auto f1 = test_values[1];
  auto f2 = test_values[2];
  auto f3 = test_values[3];

  for (int j = 0; j < N/4; j++)
  {
    f0 = f0 + f3 * CII;
    f1 = f1 + f2 * CII;
    f2 = f2 + f1 * CII;
    f3 = f3 + f0 * CII;
  }

  test_values[0] = f0;
  test_values[1] = f1;
  test_values[2] = f2;
  test_values[3] = f3;
}


void run_test (const char* name, std::function<void (void)> func)
{
  std::printf ("%s ... \n", name);

  for (unsigned int i = 0; i < M; ++i)
  {
    auto start_time = std::chrono::high_resolution_clock::now ();

    if (func)
      func ();
    else
      __builtin_unreachable ();

    auto end_time = std::chrono::high_resolution_clock::now ();
    auto duration = std::chrono::duration_cast<std::chrono::nanoseconds> (end_time - start_time);

    std::printf ("f0 = %lf f1 = %lf f2 = %lf f3 = %lf\n",
		 (double)test_values[0], (double)test_values[1],
		 (double)test_values[2], (double)test_values[3]);

    std::printf ("%lf ns / insn\n", duration.count () / (double)N);
    std::printf ("%lf MOPS\n\n", (1'000'000'000 / (duration.count () / (double)N)) / 1'000'000);
  }
}



extern "C"
void app_main (void)
{
  test_values[0] = 1;
  test_values[1] = 2;
  test_values[2] = 3;
  test_values[3] = 4;

  run_test ("addition", &test_addition);

  run_test ("multiplication", &test_multiplication);

  run_test ("multiply-accumulate", &test_multiply_accumulate);
}
The following are the results.
At 240 MHz, 1 clock cycle ≈ 4.17 ns.

Integer Addition
1.600000 ns / insn
625.000000 MOPS

Integer Multiplication
23.400000 ns / insn
42.735043 MOPS
This test is meaningless. The mul* instructions are not used and the compiler does some other arithmetic optimizations.

Integer Multiply-Accumulate
22.500000 ns / insn
44.444444 MOPS
This test is meaningless. The addx instruction is not used and the compiler does some other arithmetic optimizations.

Float Addition
5.800000 ns / insn
172.413793 MOPS

Float Multiplication
5.800000 ns / insn
172.413793 MOPS

Float Multiply-Accumulate
9.900000 ns / insn
101.010101 MOPS
This case actually does use the madd.s instruction.

Since there is no hardware support for double-precision floating point, there's little point in testing its performance, but just for completeness' sake ...

Double Addition
246.500000 ns / insn
4.056795 MOPS

Double Multiplication
456.600000 ns / insn
2.190101 MOPS

Double Multiply-Accumulate
667.200000 ns / insn
1.498801 MOPS

Note that this is a synthetic benchmark. It looks a little like the compiler has some issues with register allocation in larger code snippets, so the actual computation gets slowed down by register moves and the like. As usual with these things, real-world performance will be lower than the numbers above, except for hand-tuned compute functions.

So yeah, hmm .. FP performance is a little low on the ESP32.

MartinJ
Posts: 2
Joined: Thu Mar 25, 2021 8:30 pm

Re: Unexpectedly low floating-point performance in C

Postby MartinJ » Thu Mar 25, 2021 9:21 pm

I just thought I would post here because a search for esp32 fp performance leads here.
I updated the test code in the previous post to fix a few things. My code and results are at the end.
The main point is that, in the best case, integer adds and multiplies each take one cycle, while 32-bit float adds and multiplies take two.
For multiply-accumulate I count the multiply and the add as two operations; there is no integer multiply-accumulate instruction, and for some reason an integer multiply followed by an add takes 3 cycles. I'd say the single-precision FP performance is very good for a microcontroller. In many situations floats are probably as fast as fixed-point integer operations, since the fixed-point code needs extra shifts (see the sketch below).
Definitely don't use doubles unless you absolutely have to.
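To make the fixed-point comparison concrete, here is a minimal sketch (my own illustration, not part of the benchmark; the Q16.16 format and the q_mul/f_mul helpers are just assumptions for the example). The fixed-point multiply needs a widening multiply plus a shift back, while the float multiply compiles to a single mul.s:

Code: Select all

#include <stdint.h>

typedef int32_t q16_16;   /* assumed Q16.16 fixed-point format, for illustration only */

/* Fixed-point multiply: widening 64-bit multiply, then shift back into Q16.16. */
static inline q16_16 q_mul(q16_16 a, q16_16 b)
{
    return (q16_16)(((int64_t)a * b) >> 16);
}

/* Float multiply: a single mul.s on the ESP32 FPU. */
static inline float f_mul(float a, float b)
{
    return a * b;
}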

Also note that the ESP32 CPU cores do have hardware FP sqrt and divide instructions, but they are not currently generated by the compiler and are only used in more recent versions of the C maths library; even there the performance isn't great.
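Since float division stays slow either way, the usual workaround is to hoist the divide out of the loop and multiply by the reciprocal. A minimal sketch (my own example, not from the benchmark); note that the result can differ in the last bit, which is why the compiler only does this transformation for you under -ffast-math:

Code: Select all

/* Divide inside the loop: one slow division per element. */
void scale_slow(float *x, int n, float d)
{
    for (int i = 0; i < n; i++)
        x[i] /= d;
}

/* Hoist the division: one divide up front, then cheap multiplies. */
void scale_fast(float *x, int n, float d)
{
    const float r = 1.0f / d;   /* single division outside the loop */
    for (int i = 0; i < n; i++)
        x[i] *= r;
}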

Summary (all in MOP/s, running at 240 MHz):

                      int    float   double
addition              240    120     6.47
multiplication        240    120     2.24
multiply-accumulate   160    120     5.92

Code: Select all

#include <cstdio>
#include <chrono>
#include <array>
#include <functional>
#include <esp_attr.h>
#include <freertos/FreeRTOS.h>
#include "freertos/task.h"


const int N = 3200000;

const int M = 3;

static double test_values[4];

template <typename T>
//[[gnu::noinline, gnu::optimize ("fast-math")]]
void IRAM_ATTR test_addition (void)
{
  T f0 = (T)test_values[0];
  T f1 = (T)test_values[1];
  T f2 = (T)test_values[2];
  T f3 = (T)test_values[3];

  for (int j = 0; j < N/16; j++)
  {
    f0 += f3;
    f1 += f2;
    f2 += f1;
    f3 += f0;
    f0 += f3;
    f1 += f2;
    f2 += f1;
    f3 += f0;
    f0 += f3;
    f1 += f2;
    f2 += f1;
    f3 += f0;
    f0 += f3;
    f1 += f2;
    f2 += f1;
    f3 += f0;
  }

  test_values[0] = (double)f0;
  test_values[1] = (double)f1;
  test_values[2] = (double)f2;
  test_values[3] = (double)f3;
}

template <typename T>
//[[gnu::noinline, gnu::optimize ("fast-math")]]
void IRAM_ATTR test_multiplication (void)
{
  T f0 = (T)test_values[0];
  T f1 = (T)test_values[1];
  T f2 = (T)test_values[2];
  T f3 = (T)test_values[3];

  for (int j = 0; j < N/16; j++)
  {
    f0 *= f3;
    f1 *= f2;
    f2 *= f1;
    f3 *= f0;
    f0 *= f3;
    f1 *= f2;
    f2 *= f1;
    f3 *= f0;
    f0 *= f3;
    f1 *= f2;
    f2 *= f1;
    f3 *= f0;
    f0 *= f3;
    f1 *= f2;
    f2 *= f1;
    f3 *= f0;
  }

  test_values[0] = (double)f0;
  test_values[1] = (double)f1;
  test_values[2] = (double)f2;
  test_values[3] = (double)f3;
}

template <typename T>
//[[gnu::noinline, gnu::optimize ("fast-math")]]
void IRAM_ATTR test_multiply_accumulate (void)
{
  T f0 = test_values[0];
  T f1 = test_values[1];
  T f2 = test_values[2];
  T f3 = test_values[3];

  for (int j = 0; j < N/32; j++)
  {
    f0 += f3*f3;
    f1 += f2*f0;
    f2 += f1*f1;
    f3 += f0*f2;
    f0 += f3*f3;
    f1 += f2*f0;
    f2 += f1*f1;
    f3 += f0*f2;
    f0 += f3*f3;
    f1 += f2*f0;
    f2 += f1*f1;
    f3 += f0*f2;
    f0 += f3*f3;
    f1 += f2*f0;
    f2 += f1*f1;
    f3 += f0*f2;
  }

  test_values[0] = f0;
  test_values[1] = f1;
  test_values[2] = f2;
  test_values[3] = f3;
}


void run_test (const char* name, std::function<void (void)> func)
{
  std::printf ("%s ... \n", name);

  for (unsigned int i = 0; i < M; ++i)
  {
    test_values[0] = 1.0;
    test_values[1] = 1.0;
    test_values[2] = 1.0;
    test_values[3] = 1.0;
    auto start_time = std::chrono::high_resolution_clock::now ();

    if (func)
      func ();
    else
      __builtin_unreachable ();

    auto end_time = std::chrono::high_resolution_clock::now ();
    auto duration = std::chrono::duration_cast<std::chrono::nanoseconds> (end_time - start_time);

    std::printf ("f0 = %lf f1 = %lf f2 = %lf f3 = %lf\n",
		 (double)test_values[0], (double)test_values[1],
		 (double)test_values[2], (double)test_values[3]);

    std::printf ("%lf ns / insn\n", duration.count () / (double)N);
    std::printf ("%lf MOPS\n\n", (1000000000.0 / (duration.count () / (double)N)) / 1000000.0);
    vTaskDelay(100);
  }
}



extern "C"
void app_main (void)
{


  run_test ("int addition", &test_addition<int>);
  run_test ("int multiplication", &test_multiplication<int>);
  run_test ("int multiply-accumulate", &test_multiply_accumulate<int>);
  run_test ("float addition", &test_addition<float>);
  run_test ("float multiplication", &test_multiplication<float>);
  run_test ("float multiply-accumulate", &test_multiply_accumulate<float>);
  run_test ("double addition", &test_addition<double>);
  run_test ("double multiplication", &test_multiplication<double>);
  run_test ("double multiply-accumulate", &test_multiply_accumulate<double>);
}
/* results:   

Summary:
                    int       float     double
addition            240       120       6.47
multiplication      240       120       2.24
multiply-accumulate 160       120       5.92

int addition ... 
f0 = 1857072413.000000 f1 = 1857072413.000000 f2 = 556049240.000000 f3 = 556049240.000000
4.191250 ns / insn
238.592305 MOPS

f0 = 1857072413.000000 f1 = 1857072413.000000 f2 = 556049240.000000 f3 = 556049240.000000
4.173125 ns / insn
239.628576 MOPS

f0 = 1857072413.000000 f1 = 1857072413.000000 f2 = 556049240.000000 f3 = 556049240.000000
4.173125 ns / insn
239.628576 MOPS

int multiplication ... 
f0 = 1.000000 f1 = 1.000000 f2 = 1.000000 f3 = 1.000000
4.180625 ns / insn
239.198684 MOPS

f0 = 1.000000 f1 = 1.000000 f2 = 1.000000 f3 = 1.000000
4.173125 ns / insn
239.628576 MOPS

f0 = 1.000000 f1 = 1.000000 f2 = 1.000000 f3 = 1.000000
4.173125 ns / insn
239.628576 MOPS

int multiply-accumulate ... 
f0 = -30414264.000000 f1 = -485382322.000000 f2 = -1429759061.000000 f3 = -1349136873.000000
6.266563 ns / insn
159.577121 MOPS

f0 = -30414264.000000 f1 = -485382322.000000 f2 = -1429759061.000000 f3 = -1349136873.000000
6.257813 ns / insn
159.800250 MOPS

f0 = -30414264.000000 f1 = -485382322.000000 f2 = -1429759061.000000 f3 = -1349136873.000000
6.257813 ns / insn
159.800250 MOPS

float addition ... 
f0 = inf f1 = inf f2 = inf f3 = inf
8.351250 ns / insn
119.742554 MOPS

f0 = inf f1 = inf f2 = inf f3 = inf
8.342500 ns / insn
119.868145 MOPS

f0 = inf f1 = inf f2 = inf f3 = inf
8.341250 ns / insn
119.886108 MOPS

float multiplication ... 
f0 = 1.000000 f1 = 1.000000 f2 = 1.000000 f3 = 1.000000
8.350938 ns / insn
119.747034 MOPS

f0 = 1.000000 f1 = 1.000000 f2 = 1.000000 f3 = 1.000000
8.341250 ns / insn
119.886108 MOPS

f0 = 1.000000 f1 = 1.000000 f2 = 1.000000 f3 = 1.000000
8.341250 ns / insn
119.886108 MOPS

float multiply-accumulate ... 
f0 = inf f1 = inf f2 = inf f3 = inf
8.349687 ns / insn
119.764961 MOPS

f0 = inf f1 = inf f2 = inf f3 = inf
8.342500 ns / insn
119.868145 MOPS

f0 = inf f1 = inf f2 = inf f3 = inf
8.341250 ns / insn
119.886108 MOPS

double addition ... 
f0 = inf f1 = inf f2 = inf f3 = inf
154.618750 ns / insn
6.467521 MOPS

f0 = inf f1 = inf f2 = inf f3 = inf
154.602812 ns / insn
6.468188 MOPS

f0 = inf f1 = inf f2 = inf f3 = inf
154.606250 ns / insn
6.468044 MOPS

double multiplication ... 
f0 = 1.000000 f1 = 1.000000 f2 = 1.000000 f3 = 1.000000
446.328438 ns / insn
2.240503 MOPS

f0 = 1.000000 f1 = 1.000000 f2 = 1.000000 f3 = 1.000000
446.312188 ns / insn
2.240584 MOPS

f0 = 1.000000 f1 = 1.000000 f2 = 1.000000 f3 = 1.000000
446.312188 ns / insn
2.240584 MOPS

double multiply-accumulate ... 
f0 = inf f1 = inf f2 = inf f3 = inf
168.985938 ns / insn
5.917652 MOPS

f0 = inf f1 = inf f2 = inf f3 = inf
168.968438 ns / insn
5.918265 MOPS

f0 = inf f1 = inf f2 = inf f3 = inf
168.967813 ns / insn
5.918287 MOPS


*/
Here's the assembly code just to prove that the loops are doing the correct instructions.

Code: Select all


400d09c8 <_Z13test_additionIiEvv>:
400d09c8:	004136        	entry	a1, 32
400d09cb:	fd9641        	l32r	a4, 400d0024 <_stext+0x4>
400d09ce:	14b8      	l32i.n	a11, a4, 4
400d09d0:	04a8      	l32i.n	a10, a4, 0
400d09d2:	fd9581        	l32r	a8, 400d0028 <_stext+0x8>
400d09d5:	0008e0        	callx8	a8
400d09d8:	34b8      	l32i.n	a11, a4, 12
400d09da:	0a6d      	mov.n	a6, a10
400d09dc:	24a8      	l32i.n	a10, a4, 8
400d09de:	fd9281        	l32r	a8, 400d0028 <_stext+0x8>
400d09e1:	0008e0        	callx8	a8
400d09e4:	54b8      	l32i.n	a11, a4, 20
400d09e6:	0a5d      	mov.n	a5, a10
400d09e8:	44a8      	l32i.n	a10, a4, 16
400d09ea:	fd8f81        	l32r	a8, 400d0028 <_stext+0x8>
400d09ed:	0008e0        	callx8	a8
400d09f0:	0a3d      	mov.n	a3, a10
400d09f2:	74b8      	l32i.n	a11, a4, 28
400d09f4:	64a8      	l32i.n	a10, a4, 24
400d09f6:	fd8c81        	l32r	a8, 400d0028 <_stext+0x8>
400d09f9:	0008e0        	callx8	a8
400d09fc:	fd8981        	l32r	a8, 400d0020 <_stext>
400d09ff:	0a2d      	mov.n	a2, a10
400d0a01:	1f8876        	loop	a8, 400d0a24 <_Z13test_additionIiEvv+0x5c>
400d0a04:	662a      	add.n	a6, a6, a2
400d0a06:	553a      	add.n	a5, a5, a3
400d0a08:	353a      	add.n	a3, a5, a3
400d0a0a:	262a      	add.n	a2, a6, a2
400d0a0c:	662a      	add.n	a6, a6, a2
400d0a0e:	553a      	add.n	a5, a5, a3
400d0a10:	335a      	add.n	a3, a3, a5
400d0a12:	226a      	add.n	a2, a2, a6
400d0a14:	662a      	add.n	a6, a6, a2
400d0a16:	553a      	add.n	a5, a5, a3
400d0a18:	335a      	add.n	a3, a3, a5
400d0a1a:	226a      	add.n	a2, a2, a6
400d0a1c:	662a      	add.n	a6, a6, a2
400d0a1e:	553a      	add.n	a5, a5, a3
400d0a20:	335a      	add.n	a3, a3, a5
400d0a22:	226a      	add.n	a2, a2, a6
400d0a24:	06ad      	mov.n	a10, a6
400d0a26:	fd8181        	l32r	a8, 400d002c <_stext+0xc>
400d0a29:	0008e0        	callx8	a8
400d0a2c:	04a9      	s32i.n	a10, a4, 0
400d0a2e:	14b9      	s32i.n	a11, a4, 4
400d0a30:	05ad      	mov.n	a10, a5
400d0a32:	fd7e81        	l32r	a8, 400d002c <_stext+0xc>
400d0a35:	0008e0        	callx8	a8
400d0a38:	24a9      	s32i.n	a10, a4, 8
400d0a3a:	34b9      	s32i.n	a11, a4, 12
400d0a3c:	03ad      	mov.n	a10, a3
400d0a3e:	fd7b81        	l32r	a8, 400d002c <_stext+0xc>
400d0a41:	0008e0        	callx8	a8
400d0a44:	44a9      	s32i.n	a10, a4, 16
400d0a46:	54b9      	s32i.n	a11, a4, 20
400d0a48:	02ad      	mov.n	a10, a2
400d0a4a:	fd7881        	l32r	a8, 400d002c <_stext+0xc>
400d0a4d:	0008e0        	callx8	a8
400d0a50:	64a9      	s32i.n	a10, a4, 24
400d0a52:	74b9      	s32i.n	a11, a4, 28
400d0a54:	f01d      	retw.n
	...

400d0a58 <_Z19test_multiplicationIiEvv>:
400d0a58:	004136        	entry	a1, 32
400d0a5b:	fd7241        	l32r	a4, 400d0024 <_stext+0x4>
400d0a5e:	14b8      	l32i.n	a11, a4, 4
400d0a60:	04a8      	l32i.n	a10, a4, 0
400d0a62:	fd7181        	l32r	a8, 400d0028 <_stext+0x8>
400d0a65:	0008e0        	callx8	a8
400d0a68:	34b8      	l32i.n	a11, a4, 12
400d0a6a:	0a6d      	mov.n	a6, a10
400d0a6c:	24a8      	l32i.n	a10, a4, 8
400d0a6e:	fd6e81        	l32r	a8, 400d0028 <_stext+0x8>
400d0a71:	0008e0        	callx8	a8
400d0a74:	54b8      	l32i.n	a11, a4, 20
400d0a76:	0a5d      	mov.n	a5, a10
400d0a78:	44a8      	l32i.n	a10, a4, 16
400d0a7a:	fd6b81        	l32r	a8, 400d0028 <_stext+0x8>
400d0a7d:	0008e0        	callx8	a8
400d0a80:	0a3d      	mov.n	a3, a10
400d0a82:	74b8      	l32i.n	a11, a4, 28
400d0a84:	64a8      	l32i.n	a10, a4, 24
400d0a86:	fd6881        	l32r	a8, 400d0028 <_stext+0x8>
400d0a89:	0008e0        	callx8	a8
400d0a8c:	fd6581        	l32r	a8, 400d0020 <_stext>
400d0a8f:	0a2d      	mov.n	a2, a10
400d0a91:	2f8876        	loop	a8, 400d0ac4 <_Z19test_multiplicationIiEvv+0x6c>
400d0a94:	826620        	mull	a6, a6, a2
400d0a97:	825530        	mull	a5, a5, a3
400d0a9a:	822620        	mull	a2, a6, a2
400d0a9d:	823530        	mull	a3, a5, a3
400d0aa0:	826620        	mull	a6, a6, a2
400d0aa3:	825530        	mull	a5, a5, a3
400d0aa6:	822260        	mull	a2, a2, a6
400d0aa9:	823350        	mull	a3, a3, a5
400d0aac:	826620        	mull	a6, a6, a2
400d0aaf:	825530        	mull	a5, a5, a3
400d0ab2:	822260        	mull	a2, a2, a6
400d0ab5:	823350        	mull	a3, a3, a5
400d0ab8:	826620        	mull	a6, a6, a2
400d0abb:	825530        	mull	a5, a5, a3
400d0abe:	822260        	mull	a2, a2, a6
400d0ac1:	823350        	mull	a3, a3, a5
400d0ac4:	06ad      	mov.n	a10, a6
400d0ac6:	fd5981        	l32r	a8, 400d002c <_stext+0xc>
400d0ac9:	0008e0        	callx8	a8
400d0acc:	04a9      	s32i.n	a10, a4, 0
400d0ace:	14b9      	s32i.n	a11, a4, 4
400d0ad0:	05ad      	mov.n	a10, a5
400d0ad2:	fd5681        	l32r	a8, 400d002c <_stext+0xc>
400d0ad5:	0008e0        	callx8	a8
400d0ad8:	24a9      	s32i.n	a10, a4, 8
400d0ada:	34b9      	s32i.n	a11, a4, 12
400d0adc:	03ad      	mov.n	a10, a3
400d0ade:	fd5381        	l32r	a8, 400d002c <_stext+0xc>
400d0ae1:	0008e0        	callx8	a8
400d0ae4:	44a9      	s32i.n	a10, a4, 16
400d0ae6:	54b9      	s32i.n	a11, a4, 20
400d0ae8:	02ad      	mov.n	a10, a2
400d0aea:	fd5081        	l32r	a8, 400d002c <_stext+0xc>
400d0aed:	0008e0        	callx8	a8
400d0af0:	64a9      	s32i.n	a10, a4, 24
400d0af2:	74b9      	s32i.n	a11, a4, 28
400d0af4:	f01d      	retw.n
	...

400d0af8 <_Z24test_multiply_accumulateIiEvv>:
400d0af8:	004136        	entry	a1, 32
400d0afb:	fd4a21        	l32r	a2, 400d0024 <_stext+0x4>
400d0afe:	12b8      	l32i.n	a11, a2, 4
400d0b00:	02a8      	l32i.n	a10, a2, 0
400d0b02:	fd4981        	l32r	a8, 400d0028 <_stext+0x8>
400d0b05:	0008e0        	callx8	a8
400d0b08:	32b8      	l32i.n	a11, a2, 12
400d0b0a:	0a6d      	mov.n	a6, a10
400d0b0c:	22a8      	l32i.n	a10, a2, 8
400d0b0e:	fd4681        	l32r	a8, 400d0028 <_stext+0x8>
400d0b11:	0008e0        	callx8	a8
400d0b14:	52b8      	l32i.n	a11, a2, 20
400d0b16:	0a5d      	mov.n	a5, a10
400d0b18:	42a8      	l32i.n	a10, a2, 16
400d0b1a:	fd4381        	l32r	a8, 400d0028 <_stext+0x8>
400d0b1d:	0008e0        	callx8	a8
400d0b20:	0a4d      	mov.n	a4, a10
400d0b22:	72b8      	l32i.n	a11, a2, 28
400d0b24:	62a8      	l32i.n	a10, a2, 24
400d0b26:	fd4081        	l32r	a8, 400d0028 <_stext+0x8>
400d0b29:	0008e0        	callx8	a8
400d0b2c:	fd4181        	l32r	a8, 400d0030 <_stext+0x10>
400d0b2f:	0a3d      	mov.n	a3, a10
400d0b31:	4f8876        	loop	a8, 400d0b84 <_Z24test_multiply_accumulateIiEvv+0x8c>
400d0b34:	82c330        	mull	a12, a3, a3
400d0b37:	cc6a      	add.n	a12, a12, a6
400d0b39:	82bc40        	mull	a11, a12, a4
400d0b3c:	bb5a      	add.n	a11, a11, a5
400d0b3e:	82abb0        	mull	a10, a11, a11
400d0b41:	aa4a      	add.n	a10, a10, a4
400d0b43:	829ca0        	mull	a9, a12, a10
400d0b46:	993a      	add.n	a9, a9, a3
400d0b48:	826990        	mull	a6, a9, a9
400d0b4b:	c6ca      	add.n	a12, a6, a12
400d0b4d:	825ac0        	mull	a5, a10, a12
400d0b50:	b5ba      	add.n	a11, a5, a11
400d0b52:	824bb0        	mull	a4, a11, a11
400d0b55:	a4aa      	add.n	a10, a4, a10
400d0b57:	823ca0        	mull	a3, a12, a10
400d0b5a:	939a      	add.n	a9, a3, a9
400d0b5c:	826990        	mull	a6, a9, a9
400d0b5f:	66ca      	add.n	a6, a6, a12
400d0b61:	825a60        	mull	a5, a10, a6
400d0b64:	55ba      	add.n	a5, a5, a11
400d0b66:	824550        	mull	a4, a5, a5
400d0b69:	44aa      	add.n	a4, a4, a10
400d0b6b:	823640        	mull	a3, a6, a4
400d0b6e:	339a      	add.n	a3, a3, a9
400d0b70:	829330        	mull	a9, a3, a3
400d0b73:	696a      	add.n	a6, a9, a6
400d0b75:	829460        	mull	a9, a4, a6
400d0b78:	595a      	add.n	a5, a9, a5
400d0b7a:	829550        	mull	a9, a5, a5
400d0b7d:	494a      	add.n	a4, a9, a4
400d0b7f:	829640        	mull	a9, a6, a4
400d0b82:	393a      	add.n	a3, a9, a3
400d0b84:	06ad      	mov.n	a10, a6
400d0b86:	fd2981        	l32r	a8, 400d002c <_stext+0xc>
400d0b89:	0008e0        	callx8	a8
400d0b8c:	02a9      	s32i.n	a10, a2, 0
400d0b8e:	12b9      	s32i.n	a11, a2, 4
400d0b90:	05ad      	mov.n	a10, a5
400d0b92:	fd2681        	l32r	a8, 400d002c <_stext+0xc>
400d0b95:	0008e0        	callx8	a8
400d0b98:	22a9      	s32i.n	a10, a2, 8
400d0b9a:	32b9      	s32i.n	a11, a2, 12
400d0b9c:	04ad      	mov.n	a10, a4
400d0b9e:	fd2381        	l32r	a8, 400d002c <_stext+0xc>
400d0ba1:	0008e0        	callx8	a8
400d0ba4:	42a9      	s32i.n	a10, a2, 16
400d0ba6:	52b9      	s32i.n	a11, a2, 20
400d0ba8:	03ad      	mov.n	a10, a3
400d0baa:	fd2081        	l32r	a8, 400d002c <_stext+0xc>
400d0bad:	0008e0        	callx8	a8
400d0bb0:	62a9      	s32i.n	a10, a2, 24
400d0bb2:	72b9      	s32i.n	a11, a2, 28
400d0bb4:	f01d      	retw.n
	...

400d0bb8 <_Z13test_additionIfEvv>:
400d0bb8:	006136        	entry	a1, 48
400d0bbb:	fd1a21        	l32r	a2, 400d0024 <_stext+0x4>
400d0bbe:	12b8      	l32i.n	a11, a2, 4
400d0bc0:	02a8      	l32i.n	a10, a2, 0
400d0bc2:	fd1c81        	l32r	a8, 400d0034 <_stext+0x14>
400d0bc5:	0008e0        	callx8	a8
400d0bc8:	fa3a50        	wfr	f3, a10
400d0bcb:	32b8      	l32i.n	a11, a2, 12
400d0bcd:	22a8      	l32i.n	a10, a2, 8
400d0bcf:	004133        	ssi	f3, a1, 0
400d0bd2:	fd1881        	l32r	a8, 400d0034 <_stext+0x14>
400d0bd5:	0008e0        	callx8	a8
400d0bd8:	fa2a50        	wfr	f2, a10
400d0bdb:	52b8      	l32i.n	a11, a2, 20
400d0bdd:	42a8      	l32i.n	a10, a2, 16
400d0bdf:	014123        	ssi	f2, a1, 4
400d0be2:	fd1481        	l32r	a8, 400d0034 <_stext+0x14>
400d0be5:	0008e0        	callx8	a8
400d0be8:	fa1a50        	wfr	f1, a10
400d0beb:	72b8      	l32i.n	a11, a2, 28
400d0bed:	62a8      	l32i.n	a10, a2, 24
400d0bef:	024113        	ssi	f1, a1, 8
400d0bf2:	fd1081        	l32r	a8, 400d0034 <_stext+0x14>
400d0bf5:	0008e0        	callx8	a8
400d0bf8:	fd0a81        	l32r	a8, 400d0020 <_stext>
400d0bfb:	000133        	lsi	f3, a1, 0
400d0bfe:	010123        	lsi	f2, a1, 4
400d0c01:	020113        	lsi	f1, a1, 8
400d0c04:	fa0a50        	wfr	f0, a10
400d0c07:	f03d      	nop.n
400d0c09:	2f8876        	loop	a8, 400d0c3c <_Z13test_additionIfEvv+0x84>
400d0c0c:	0a3300        	add.s	f3, f3, f0
400d0c0f:	0a2210        	add.s	f2, f2, f1
400d0c12:	0a0300        	add.s	f0, f3, f0
400d0c15:	0a1210        	add.s	f1, f2, f1
400d0c18:	0a3300        	add.s	f3, f3, f0
400d0c1b:	0a2210        	add.s	f2, f2, f1
400d0c1e:	0a0030        	add.s	f0, f0, f3
400d0c21:	0a1120        	add.s	f1, f1, f2
400d0c24:	0a3300        	add.s	f3, f3, f0
400d0c27:	0a2210        	add.s	f2, f2, f1
400d0c2a:	0a0030        	add.s	f0, f0, f3
400d0c2d:	0a1120        	add.s	f1, f1, f2
400d0c30:	0a3300        	add.s	f3, f3, f0
400d0c33:	0a2210        	add.s	f2, f2, f1
400d0c36:	0a0030        	add.s	f0, f0, f3
400d0c39:	0a1120        	add.s	f1, f1, f2
400d0c3c:	faa340        	rfr	a10, f3
400d0c3f:	004103        	ssi	f0, a1, 0
400d0c42:	024113        	ssi	f1, a1, 8
400d0c45:	014123        	ssi	f2, a1, 4
400d0c48:	fcfc81        	l32r	a8, 400d0038 <_stext+0x18>
400d0c4b:	0008e0        	callx8	a8
400d0c4e:	010123        	lsi	f2, a1, 4
400d0c51:	0062a2        	s32i	a10, a2, 0
400d0c54:	0162b2        	s32i	a11, a2, 4
400d0c57:	faa240        	rfr	a10, f2
400d0c5a:	fcf781        	l32r	a8, 400d0038 <_stext+0x18>
400d0c5d:	0008e0        	callx8	a8
400d0c60:	020113        	lsi	f1, a1, 8
400d0c63:	22a9      	s32i.n	a10, a2, 8
400d0c65:	32b9      	s32i.n	a11, a2, 12
400d0c67:	faa140        	rfr	a10, f1
400d0c6a:	fcf381        	l32r	a8, 400d0038 <_stext+0x18>
400d0c6d:	0008e0        	callx8	a8
400d0c70:	000103        	lsi	f0, a1, 0
400d0c73:	42a9      	s32i.n	a10, a2, 16
400d0c75:	52b9      	s32i.n	a11, a2, 20
400d0c77:	faa040        	rfr	a10, f0
400d0c7a:	fcef81        	l32r	a8, 400d0038 <_stext+0x18>
400d0c7d:	0008e0        	callx8	a8
400d0c80:	62a9      	s32i.n	a10, a2, 24
400d0c82:	72b9      	s32i.n	a11, a2, 28
400d0c84:	f01d      	retw.n
	...

400d0c88 <_Z19test_multiplicationIfEvv>:
400d0c88:	006136        	entry	a1, 48
400d0c8b:	fce621        	l32r	a2, 400d0024 <_stext+0x4>
400d0c8e:	12b8      	l32i.n	a11, a2, 4
400d0c90:	02a8      	l32i.n	a10, a2, 0
400d0c92:	fce881        	l32r	a8, 400d0034 <_stext+0x14>
400d0c95:	0008e0        	callx8	a8
400d0c98:	fa3a50        	wfr	f3, a10
400d0c9b:	32b8      	l32i.n	a11, a2, 12
400d0c9d:	22a8      	l32i.n	a10, a2, 8
400d0c9f:	004133        	ssi	f3, a1, 0
400d0ca2:	fce481        	l32r	a8, 400d0034 <_stext+0x14>
400d0ca5:	0008e0        	callx8	a8
400d0ca8:	fa2a50        	wfr	f2, a10
400d0cab:	52b8      	l32i.n	a11, a2, 20
400d0cad:	42a8      	l32i.n	a10, a2, 16
400d0caf:	014123        	ssi	f2, a1, 4
400d0cb2:	fce081        	l32r	a8, 400d0034 <_stext+0x14>
400d0cb5:	0008e0        	callx8	a8
400d0cb8:	fa1a50        	wfr	f1, a10
400d0cbb:	72b8      	l32i.n	a11, a2, 28
400d0cbd:	62a8      	l32i.n	a10, a2, 24
400d0cbf:	024113        	ssi	f1, a1, 8
400d0cc2:	fcdc81        	l32r	a8, 400d0034 <_stext+0x14>
400d0cc5:	0008e0        	callx8	a8
400d0cc8:	fcd681        	l32r	a8, 400d0020 <_stext>
400d0ccb:	000133        	lsi	f3, a1, 0
400d0cce:	010123        	lsi	f2, a1, 4
400d0cd1:	020113        	lsi	f1, a1, 8
400d0cd4:	fa0a50        	wfr	f0, a10
400d0cd7:	f03d      	nop.n
400d0cd9:	2f8876        	loop	a8, 400d0d0c <_Z19test_multiplicationIfEvv+0x84>
400d0cdc:	2a3300        	mul.s	f3, f3, f0
400d0cdf:	2a2210        	mul.s	f2, f2, f1
400d0ce2:	2a0300        	mul.s	f0, f3, f0
400d0ce5:	2a1210        	mul.s	f1, f2, f1
400d0ce8:	2a3300        	mul.s	f3, f3, f0
400d0ceb:	2a2210        	mul.s	f2, f2, f1
400d0cee:	2a0030        	mul.s	f0, f0, f3
400d0cf1:	2a1120        	mul.s	f1, f1, f2
400d0cf4:	2a3300        	mul.s	f3, f3, f0
400d0cf7:	2a2210        	mul.s	f2, f2, f1
400d0cfa:	2a0030        	mul.s	f0, f0, f3
400d0cfd:	2a1120        	mul.s	f1, f1, f2
400d0d00:	2a3300        	mul.s	f3, f3, f0
400d0d03:	2a2210        	mul.s	f2, f2, f1
400d0d06:	2a0030        	mul.s	f0, f0, f3
400d0d09:	2a1120        	mul.s	f1, f1, f2
400d0d0c:	faa340        	rfr	a10, f3
400d0d0f:	004103        	ssi	f0, a1, 0
400d0d12:	024113        	ssi	f1, a1, 8
400d0d15:	014123        	ssi	f2, a1, 4
400d0d18:	fcc881        	l32r	a8, 400d0038 <_stext+0x18>
400d0d1b:	0008e0        	callx8	a8
400d0d1e:	010123        	lsi	f2, a1, 4
400d0d21:	0062a2        	s32i	a10, a2, 0
400d0d24:	0162b2        	s32i	a11, a2, 4
400d0d27:	faa240        	rfr	a10, f2
400d0d2a:	fcc381        	l32r	a8, 400d0038 <_stext+0x18>
400d0d2d:	0008e0        	callx8	a8
400d0d30:	020113        	lsi	f1, a1, 8
400d0d33:	22a9      	s32i.n	a10, a2, 8
400d0d35:	32b9      	s32i.n	a11, a2, 12
400d0d37:	faa140        	rfr	a10, f1
400d0d3a:	fcbf81        	l32r	a8, 400d0038 <_stext+0x18>
400d0d3d:	0008e0        	callx8	a8
400d0d40:	000103        	lsi	f0, a1, 0
400d0d43:	42a9      	s32i.n	a10, a2, 16
400d0d45:	52b9      	s32i.n	a11, a2, 20
400d0d47:	faa040        	rfr	a10, f0
400d0d4a:	fcbb81        	l32r	a8, 400d0038 <_stext+0x18>
400d0d4d:	0008e0        	callx8	a8
400d0d50:	62a9      	s32i.n	a10, a2, 24
400d0d52:	72b9      	s32i.n	a11, a2, 28
400d0d54:	f01d      	retw.n
	...

400d0d58 <_Z24test_multiply_accumulateIfEvv>:
400d0d58:	006136        	entry	a1, 48
400d0d5b:	fcb221        	l32r	a2, 400d0024 <_stext+0x4>
400d0d5e:	12b8      	l32i.n	a11, a2, 4
400d0d60:	02a8      	l32i.n	a10, a2, 0
400d0d62:	fcb481        	l32r	a8, 400d0034 <_stext+0x14>
400d0d65:	0008e0        	callx8	a8
400d0d68:	fa3a50        	wfr	f3, a10
400d0d6b:	32b8      	l32i.n	a11, a2, 12
400d0d6d:	22a8      	l32i.n	a10, a2, 8
400d0d6f:	004133        	ssi	f3, a1, 0
400d0d72:	fcb081        	l32r	a8, 400d0034 <_stext+0x14>
400d0d75:	0008e0        	callx8	a8
400d0d78:	fa2a50        	wfr	f2, a10
400d0d7b:	52b8      	l32i.n	a11, a2, 20
400d0d7d:	42a8      	l32i.n	a10, a2, 16
400d0d7f:	014123        	ssi	f2, a1, 4
400d0d82:	fcac81        	l32r	a8, 400d0034 <_stext+0x14>
400d0d85:	0008e0        	callx8	a8
400d0d88:	fa1a50        	wfr	f1, a10
400d0d8b:	72b8      	l32i.n	a11, a2, 28
400d0d8d:	62a8      	l32i.n	a10, a2, 24
400d0d8f:	024113        	ssi	f1, a1, 8
400d0d92:	fca881        	l32r	a8, 400d0034 <_stext+0x14>
400d0d95:	0008e0        	callx8	a8
400d0d98:	fca681        	l32r	a8, 400d0030 <_stext+0x10>
400d0d9b:	000133        	lsi	f3, a1, 0
400d0d9e:	010123        	lsi	f2, a1, 4
400d0da1:	020113        	lsi	f1, a1, 8
400d0da4:	fa0a50        	wfr	f0, a10
400d0da7:	f03d      	nop.n
400d0da9:	2f8876        	loop	a8, 400d0ddc <_Z24test_multiply_accumulateIfEvv+0x84>
400d0dac:	4a3000        	madd.s	f3, f0, f0
400d0daf:	4a2310        	madd.s	f2, f3, f1
400d0db2:	4a1220        	madd.s	f1, f2, f2
400d0db5:	4a0310        	madd.s	f0, f3, f1
400d0db8:	4a3000        	madd.s	f3, f0, f0
400d0dbb:	4a2130        	madd.s	f2, f1, f3
400d0dbe:	4a1220        	madd.s	f1, f2, f2
400d0dc1:	4a0310        	madd.s	f0, f3, f1
400d0dc4:	4a3000        	madd.s	f3, f0, f0
400d0dc7:	4a2130        	madd.s	f2, f1, f3
400d0dca:	4a1220        	madd.s	f1, f2, f2
400d0dcd:	4a0310        	madd.s	f0, f3, f1
400d0dd0:	4a3000        	madd.s	f3, f0, f0
400d0dd3:	4a2130        	madd.s	f2, f1, f3
400d0dd6:	4a1220        	madd.s	f1, f2, f2
400d0dd9:	4a0310        	madd.s	f0, f3, f1
400d0ddc:	faa340        	rfr	a10, f3
400d0ddf:	004103        	ssi	f0, a1, 0
400d0de2:	024113        	ssi	f1, a1, 8
400d0de5:	014123        	ssi	f2, a1, 4
400d0de8:	fc9481        	l32r	a8, 400d0038 <_stext+0x18>
400d0deb:	0008e0        	callx8	a8
400d0dee:	010123        	lsi	f2, a1, 4
400d0df1:	0062a2        	s32i	a10, a2, 0
400d0df4:	0162b2        	s32i	a11, a2, 4
400d0df7:	faa240        	rfr	a10, f2
400d0dfa:	fc8f81        	l32r	a8, 400d0038 <_stext+0x18>
400d0dfd:	0008e0        	callx8	a8
400d0e00:	020113        	lsi	f1, a1, 8
400d0e03:	22a9      	s32i.n	a10, a2, 8
400d0e05:	32b9      	s32i.n	a11, a2, 12
400d0e07:	faa140        	rfr	a10, f1
400d0e0a:	fc8b81        	l32r	a8, 400d0038 <_stext+0x18>
400d0e0d:	0008e0        	callx8	a8
400d0e10:	000103        	lsi	f0, a1, 0
400d0e13:	42a9      	s32i.n	a10, a2, 16
400d0e15:	52b9      	s32i.n	a11, a2, 20
400d0e17:	faa040        	rfr	a10, f0
400d0e1a:	fc8781        	l32r	a8, 400d0038 <_stext+0x18>
400d0e1d:	0008e0        	callx8	a8
400d0e20:	62a9      	s32i.n	a10, a2, 24
400d0e22:	72b9      	s32i.n	a11, a2, 28
400d0e24:	f01d      	retw.n
	...


MartinJ
Posts: 2
Joined: Thu Mar 25, 2021 8:30 pm

Re: Unexpectedly low floating-point performance in C

Postby MartinJ » Thu Aug 19, 2021 12:54 am

An update to this. I realised why my previous test code wasn't always running as fast as it could: the result of an FP operation isn't available for a few instructions after it issues, so the test sequences weren't scheduled well.
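To show what that means in practice, here is a minimal sketch (my own illustration, not the benchmark code below): a dependent chain stalls on every add, while interleaving several independent accumulators lets the FPU latency be hidden.

Code: Select all

/* Hypothetical illustration of the scheduling issue, not the benchmark itself. */
float dependent_chain(float a, float b)
{
    for (int i = 0; i < 1000; i++)
        a += b;              /* every add waits for the previous result */
    return a;
}

float independent_chains(float a0, float a1, float a2, float a3, float b)
{
    for (int i = 0; i < 250; i++) {
        a0 += b;             /* four independent accumulators: while a0's */
        a1 += b;             /* result is still in flight, the FPU can    */
        a2 += b;             /* already start on a1, a2 and a3            */
        a3 += b;
    }
    return a0 + a1 + a2 + a3;
}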
Here's some new code (C instead of C++).

This gives the following results:

Code: Select all

Integer Addition         239.772217 MOP/S       CPI=1.000482
Integer Multiply         239.826126 MOP/S       CPI=1.000561
Integer Division         119.944527 MOP/S       CPI=2.000431
Integer Multiply-Add     239.826126 MOP/S       CPI=1.000509
Float Addition           239.808151 MOP/S       CPI=1.000586
Float Multiply           239.808151 MOP/S       CPI=1.000573
Float Division           5.043214 MOP/S         CPI=1.394347
Float Multiply-Add       479.544434 MOP/S       CPI=1.000682
Double Addition          6.629294 MOP/S         CPI=1.760988
Double Multiply          2.260351 MOP/S         CPI=1.620783
Double Division          0.487446 MOP/S         CPI=1.590582
Double Multiply-Add      5.771100 MOP/S         CPI=1.858178

Code: Select all

#include <stdio.h>
#include <esp_attr.h>
#include <esp_timer.h>
#include <freertos/FreeRTOS.h>
#include <freertos/task.h>

static double tv[8];
const int N = 3200000;

/* TEST expands to one benchmark function: load 8 working values from tv[],
   run N unrolled operations in a loop, then store the values back. */
#define TEST(type,name,ops) void IRAM_ATTR name (void) {\
    type f0 = tv[0],f1 = tv[1],f2 = tv[2],f3 = tv[3];\
    type f4 = tv[4],f5 = tv[5],f6 = tv[6],f7 = tv[7];\
    for (int j = 0; j < N/16; j++) {\
        ops \
    }\
    tv[0] = f0;tv[1] = f1;tv[2] = f2;tv[3] = f3;\
    tv[4] = f4;tv[5] = f5;tv[6] = f6;tv[7] = f7;\
    }
    
/* fops expands to 8 interleaved, mostly independent operations, so each
   result is not needed again for several instructions. */
#define fops(op1,op2) f0 op1##=f1 op2 f2;f1 op1##=f2 op2 f3;\
    f2 op1##=f3 op2 f4;f3 op1##=f4 op2 f5;\
    f4 op1##=f5 op2 f6;f5 op1##=f6 op2 f7;\
    f6 op1##=f7 op2 f0;f7 op1##=f0 op2 f1;

#define addops fops(,+) fops(,+)
#define divops fops(,/) fops(,/)
#define mulops fops(,*) fops(,*)
#define muladdops fops(+,*)

TEST(int,mulint,mulops)
TEST(float,mulfloat,mulops)
TEST(double,muldouble,mulops)
TEST(int,addint,addops)
TEST(float,addfloat,addops)
TEST(double,adddouble,addops)
TEST(int,divint,divops)
TEST(float,divfloat,divops)
TEST(double,divdouble,divops)
TEST(int,muladdint,muladdops)
TEST(float,muladdfloat,muladdops)
TEST(double,muladddouble,muladdops)

void timeit(char *name,void fn(void)) {
    vTaskDelay(1);
    tv[0]=tv[1]=tv[2]=tv[3]=tv[4]=tv[5]=tv[6]=tv[7]=1;
    // get time since boot in microseconds
    uint64_t time=esp_timer_get_time();
    unsigned ccount,icount,ccount_new;
    RSR(CCOUNT,ccount);      // read the cycle counter special register
    WSR(ICOUNT, 0);          // reset the instruction counter
    WSR(ICOUNTLEVEL,2);      // count instructions executed at interrupt levels below 2
    fn();
    RSR(CCOUNT,ccount_new);
    RSR(ICOUNT,icount);
    time=esp_timer_get_time()-time;
    float cpi=(float)(ccount_new-ccount)/icount;
    printf ("%s \t %f MOP/S   \tCPI=%f\n",name, (float)N/time,cpi);
}

void app_main() {
    timeit("Integer Addition",addint);
    timeit("Integer Multiply",mulint);
    timeit("Integer Division",divint);
    timeit("Integer Multiply-Add",muladdint);

    timeit("Float Addition ", addfloat);
    timeit("Float Multiply ", mulfloat);
    timeit("Float Division ", divfloat);
    timeit("Float Multiply-Add", muladdfloat);

    timeit("Double Addition", adddouble);
    timeit("Double Multiply", muldouble);
    timeit("Double Division", divdouble);
    timeit("Double Multiply-Add", muladddouble);
}

thefury
Posts: 20
Joined: Thu Sep 05, 2019 5:25 pm

Re: Unexpectedly low floating-point performance in C

Postby thefury » Sat Mar 05, 2022 11:58 pm

MartinJ wrote:
Thu Aug 19, 2021 12:54 am
An update to this. I realised why my previous test code wasn't always running as fast as it could: the result of an FP operation isn't available for a few instructions after it issues, so the test sequences weren't scheduled well.
Here's some new code (C instead of C++).

This gives the following results:

Code: Select all

Integer Addition         239.772217 MOP/S       CPI=1.000482
Integer Multiply         239.826126 MOP/S       CPI=1.000561
Integer Division         119.944527 MOP/S       CPI=2.000431
Integer Multiply-Add     239.826126 MOP/S       CPI=1.000509
Float Addition           239.808151 MOP/S       CPI=1.000586
Float Multiply           239.808151 MOP/S       CPI=1.000573
Float Division           5.043214 MOP/S         CPI=1.394347
Float Multiply-Add       479.544434 MOP/S       CPI=1.000682
Double Addition          6.629294 MOP/S         CPI=1.760988
Double Multiply          2.260351 MOP/S         CPI=1.620783
Double Division          0.487446 MOP/S         CPI=1.590582
Double Multiply-Add      5.771100 MOP/S         CPI=1.858178
ESP32-S3 results:

Code: Select all

Integer Addition 	 239.826126 MOP/S   	CPI=1.000361
Integer Multiply 	 239.862076 MOP/S   	CPI=1.000316
Integer Division 	 119.944527 MOP/S   	CPI=2.000295
Integer Multiply-Add 	 159.904053 MOP/S   	CPI=1.500432
Float Addition  	 239.862076 MOP/S   	CPI=1.000338
Float Multiply  	 239.862076 MOP/S   	CPI=1.000328
Float Division  	 4.504891 MOP/S   	CPI=1.434796
Float Multiply-Add 	 479.760132 MOP/S   	CPI=1.000149
Double Addition 	 6.232046 MOP/S   	CPI=1.485305
Double Multiply 	 2.388438 MOP/S   	CPI=1.480190
Double Division 	 0.548398 MOP/S   	CPI=1.403033
Double Multiply-Add 	 5.652522 MOP/S   	CPI=1.522879

possan
Posts: 3
Joined: Tue Jul 04, 2017 1:16 pm

Re: Unexpectedly low floating-point performance in C

Postby possan » Wed Mar 30, 2022 8:33 am

And some ESP32-C3 results (without working instruction counters)

Code: Select all

Integer Addition 	 127.907906 MOP/S   	CPI=42.000000
Integer Multiply 	 127.907906 MOP/S   	CPI=42.000000
Integer Division 	 4.660063 MOP/S   	CPI=42.000000
Integer Multiply-Add 	 121.821228 MOP/S   	CPI=42.000000
Float Addition  	 1.944126 MOP/S   	CPI=42.000000
Float Multiply  	 1.246818 MOP/S   	CPI=42.000000
Float Division  	 0.798524 MOP/S   	CPI=42.000000
Float Multiply-Add 	 1.771781 MOP/S   	CPI=42.000000
Double Addition 	 1.525539 MOP/S   	CPI=42.000000
Double Multiply 	 0.707934 MOP/S   	CPI=42.000000
Double Division 	 0.415843 MOP/S   	CPI=42.000000
Double Multiply-Add 	 1.427685 MOP/S   	CPI=42.000000

thefury
Posts: 20
Joined: Thu Sep 05, 2019 5:25 pm

Re: Unexpectedly low floating-point performance in C

Postby thefury » Wed Aug 14, 2024 6:32 pm

ESP32-P4 results at 360 MHz

Code: Select all

Integer Addition         319.520721 MOP/S       CPI=1.001140
Integer Multiply         319.680328 MOP/S       CPI=1.000406
Integer Division         319.680328 MOP/S       CPI=1.000401
Integer Multiply-Add     319.744202 MOP/S       CPI=1.000381
Float Addition           318.249634 MOP/S       CPI=1.004953
Float Multiply           319.648376 MOP/S       CPI=1.000538
Float Division           89.943222 MOP/S        CPI=3.554117
Float Multiply-Add       575.436096 MOP/S       CPI=1.000296
Double Addition          13.113088 MOP/S        CPI=1.196955
Double Multiply          4.381245 MOP/S         CPI=1.069479
Double Division          1.527300 MOP/S         CPI=1.043547
Double Multiply-Add      10.800448 MOP/S        CPI=1.171753
With RISC-V cycle/instruction counters. Code: https://www.irccloud.com/pastebin/2QIVt ... nchmarks.c

Interesting to note that it does not implement zero-overhead hardware loops the way the xtensa-esp32 (& S3) compiler does, so there is some optimization that could be had in the compiler for the P4. This benchmark loses about 1/9th of its performance to checking whether the for loop is done yet (16 unrolled operations plus a couple of loop-control instructions per iteration works out to roughly that).

DedeHai
Posts: 1
Joined: Thu Oct 17, 2024 4:47 am

Re: Unexpectedly low floating-point performance in C

Postby DedeHai » Thu Oct 17, 2024 4:55 am

Very interesting results, thank you!

Why is the "Integer Division" performance on the ESP32-C3 so bad? According to the datasheet the C3 CPU has a "32-bit multiplier and 32-bit divider". Is this not (yet) supported by the compiler, or is it intentional due to a known hardware bug?

tizio1234
Posts: 2
Joined: Mon Oct 02, 2023 6:50 pm

Re: Unexpectedly low floating-point performance in C

Postby tizio1234 » Sat Oct 19, 2024 7:57 pm

Very useful data. Has anyone done this kind of test on an ESP32-C6? I'm choosing an ESP32 for my RC car and I'd like to have assisted steering with a gyro on the straights. Is an ESP32-C3 fine considering its floating-point performance, or should I opt for an ESP32-S3? What about the ESP32-C6?

ESP_Sprite
Posts: 9757
Joined: Thu Nov 26, 2015 4:08 am

Re: Unexpectedly low floating-point performance in C

Postby ESP_Sprite » Mon Oct 21, 2024 1:17 am

tizio1234 wrote:
Sat Oct 19, 2024 7:57 pm
Very useful data. Has anyone done this kind of test on an ESP32-C6? I'm choosing an ESP32 for my RC car and I'd like to have assisted steering with a gyro on the straights. Is an ESP32-C3 fine considering its floating-point performance, or should I opt for an ESP32-S3? What about the ESP32-C6?
From what I recall, the CPU cores of the C3 and the C6 are the same, so they should give similar performance numbers.
