Integer division performance ????

ESP_Angus
Posts: 2344
Joined: Sun May 08, 2016 4:11 am

Re: Integer division performance ????

Postby ESP_Angus » Tue Dec 10, 2019 1:51 am

Baldhead wrote:
Tue Dec 10, 2019 12:53 am
"Built with default config (ie default compiler optimisation level which is -Os, if you want max performance then set this to Performance)."
Where i can change to "The “Performance” setting will add the -O2 flag to CFLAGS".

In menuconfig -> compiler options -> Optimization Level -> Debug (-Og) or Release (-Os) options only.
Right, sorry. This optimization level was only recently added on master branch of IDF, it's not in any stable release version yet.

Configuring for Release/-Os is almost as good and may be even better in some situations where smaller code size helps with flash cache utilization. The default on master is actually -Og not -Os, I wrote the wrong thing in my post - will fix.
Baldhead wrote:
Tue Dec 10, 2019 12:53 am
Also in-O2 mode i don't know if i can use printf to debug.
No... As long as your C code doesn't invoke C language Undefined Behaviour, everything should function the same regardless of optimisation level.

Baldhead
Posts: 440
Joined: Sun Mar 31, 2019 5:16 am

Re: Integer division performance ????

Postby Baldhead » Tue Dec 10, 2019 2:18 am

Hi ESP_Angus,

Below is the function i want to optimize processing time.
It can be called 600 times per second.


When len = 15360 bytes the function takes 4,1 us.

I'm finding a lot of time for 240 Mhz.

Code: Select all

static DMA_ATTR lldesc_t dma_desc_buf_a[ 16 ];

static inline int fill_dma_descriptor_a ( uint32_t len )  //  uint32_t len in bytes. When len = 15360 bytes the function takes 4,1 us.    
{
    static uint32_t length = 0;
    
    // size_in_bytes = 15360 bytes.

    if ( len > size_in_bytes )  return -1;     //  if remove this 2 if's saves 62 ns of processing.     
    if ( len == 0 )                   return -2;    //  May be remove return of function to save some processing.


    length = 4 * len;    //  ( 4 * len ) = convert byte(8 bits) to word(32bits).  


    if ( length <= 4092 )    // Only need one single descriptor.   
    {          
        dma_desc_buf_a[0].length = length;	
	dma_desc_buf_a[0].eof = 1;	    // indicate that are the last node of linked list.	    		
	  
        return 1;   
    }

    // if ( length > 4092 )  // Need more that a single descriptor. 

    static uint32_t fullBufferNum;                    // Descriptors number with buffer full(4092 bytes).     
    fullBufferNum = (uint32_t) length / 4092;   // Integer division.

    static uint32_t remainderBufferNum;          
    remainderBufferNum = ( length % 4092 );  // Returns the remainder of the division length/4092 ( integer number ).
                                                 // Example: length/4092 = 14002 / 4092 = 3 and remainder = 1726.  3 * 4092 + 1726 = 14002.     


    static uint32_t i = 0;    // for iteration. 

    for ( i = 0 ; i < fullBufferNum ; i++ )     
    {	
        dma_desc_buf_a[i].length = 4092; 	
	dma_desc_buf_a[i].eof = 0;	      // indicate that are not the last node of linked list. 		    
    }

    if ( remainderBufferNum == 0 )    // Remainder of division are 0. Don't need to allocate more one descriptor.      
    {
        dma_desc_buf_a[ fullBufferNum - 1 ].eof = 1;    // indicate that are the last node of linked list.
        
        return 2;  
    }
    else    // Need to allocate (statically) 1 more descriptor.
    {        
        dma_desc_buf_a[fullBufferNum].length = remainderBufferNum;	
	dma_desc_buf_a[fullBufferNum].eof = 1;	    // indicate that are the last node of linked list. 	 	    

        return 3;       
    }   
}

Thank's.

ESP_Sprite
Posts: 9599
Joined: Thu Nov 26, 2015 4:08 am

Re: Integer division performance ????

Postby ESP_Sprite » Tue Dec 10, 2019 3:28 am

Erm... 600 times per second, 4.1uS per iteration, is 2.4 mS per second that is used by this routine, or 0.2% of your CPU time of one core. Are you *sure* you're optimizing the right thing here?

ESP_Angus
Posts: 2344
Joined: Sun May 08, 2016 4:11 am

Re: Integer division performance ????

Postby ESP_Angus » Tue Dec 10, 2019 4:02 am

Here's some stuff to get you started:
  • I also get ~919 cycles with Debug (-Og) & 240MHz, this is ~4.2us
  • Reduces to 516 cycles when built with Release (-Os)
  • Reduces to 495 cycles by removing unnecessary "static" declarations from all local variables
  • Reduces to 255 cycles by rewriting all of the lldesc_t operations as single writes (instead of two read-modify-write cycles), like this:

    Code: Select all

            dma_desc_buf_a[i] = (lldesc_t) {
                .length = 4092,
                .eof = 0,
            };
    
    (Note this will clear any other values in the struct, but that doesn't matter here.)
Also note that calling a "static inline" function with a constant argument allows the compiler to perform additional optimisations that it can't do in the generic case, so some code paths will be optimised out entirely. Performance for all of the above will be worse if the value of "len" is not known at compile time.

I tried rewriting the function more concisely and without the divide or modulo operations, but benchmarking showed the same performance with "len" known at compile time and that your implementation was faster if "len" was not known at compile time. The compiler is good at its job.

BTW, if this function needs to be called at 600Hz and the unoptimised version took 4.1us then it's unclear why it necessarily needs optimising. 600Hz * 4.1us = 2.4ms in each second. This is 0.24% of your available CPU time. Probably something else will use more CPU time.

BTW 2, you may find you need to mark the dma_desc_buf_a array as "volatile" if it's going to be concurrently accessed by hardware. You'll find things get noticeably slower when you do that.

ESP_Angus
Posts: 2344
Joined: Sun May 08, 2016 4:11 am

Re: Integer division performance ????

Postby ESP_Angus » Tue Dec 10, 2019 4:03 am

Sprite got to the point much quicker than me. :)

Baldhead
Posts: 440
Joined: Sun Mar 31, 2019 5:16 am

Re: Integer division performance ????

Postby Baldhead » Tue Dec 10, 2019 5:50 am

Hi,

Thank's by the answers.

Maybe 2.4 ms are ok, but it is to atualize a lcd display, so the faster the better.
320x480x2 bytes in 30 hz.

With respect to static variables i thought that it saves time because don't have to create that variable through stack all time that function are called.

Baldhead
Posts: 440
Joined: Sun Mar 31, 2019 5:16 am

Re: Integer division performance ????

Postby Baldhead » Tue Dec 10, 2019 6:58 pm

Hi ESP_Angus,

You wrote:
"(Note this will clear any other values in the struct, but that doesn't matter here.)".

it does matter yes, because the "lldesc_t dma_desc_buf_a[ i ].xxxx" are only initialized at esp32 startup and after startup i only change .length and .eof for the sake of performance.

Thank's.

Baldhead
Posts: 440
Joined: Sun Mar 31, 2019 5:16 am

Re: Integer division performance ????

Postby Baldhead » Tue Dec 10, 2019 7:21 pm

Hi,

I changed the static variable to no static variable.

ie:
static uint32_t length = 0;
uint32_t length = 0;

And the time saved was:

From 4.1us to 3us ( -Og compiler optimization ).
From 4.1us to 1,9 us ( -Os compiler optimization ).

len are not constant.
When len = 15360 bytes.

I always thought static variable was faster than stack variable, because it didn't need to use the stack.


Thank's.

ESP_Angus
Posts: 2344
Joined: Sun May 08, 2016 4:11 am

Re: Integer division performance ????

Postby ESP_Angus » Tue Dec 10, 2019 10:56 pm

Baldhead wrote:
Tue Dec 10, 2019 6:58 pm
it does matter yes, because the "lldesc_t dma_desc_buf_a[ i ].xxxx" are only initialized at esp32 startup and after startup i only change .length and .eof for the sake of performance.
Are the other values constant? If so it's probably faster to write them each time in the same assignment, rather than read-modify-write each value twice each time.
Baldhead wrote:
Tue Dec 10, 2019 7:21 pm
I always thought static variable was faster than stack variable, because it didn't need to use the stack.
No, local variables can be stored in CPU registers.

ESP_Angus
Posts: 2344
Joined: Sun May 08, 2016 4:11 am

Re: Integer division performance ????

Postby ESP_Angus » Tue Dec 10, 2019 11:11 pm

Probably the very fastest way to write this routine is to set every entry to .length = 4092, .eof = 0, etc at initialization time, then set a variable "static int lastEndIndex = -1;".

Each time the function is called:

- Calculate the index of the last descriptor entry (int index) based on the length of the data (ie length / 4092).
- If (lastEndIndex != index) then write the lastEndIndex entry back to length = 4092, eof = 0 .
- On the entry "index", set .eof = 1 and set length to the final descriptor length for just this entry (ie length % 4092)
- Update "lastEndIndex = index;" ready for next time

This way you only need to write at most two descriptors each time you update the length.

However, like Sprite says, it's pretty unclear that this function is a bottleneck. "Premature optimisation is the root of all evil", etc.

Who is online

Users browsing this forum: No registered users and 117 guests