ESP32 Forum

Posted: **Sun Apr 28, 2024 6:46 pm**

Since Xtensa cores do not have usual FLAGS (or somwhere called status) register, which usually contains C - carry bit (for add instructions)/ B -(borrow bit for sub instructions) O - (overflow bit) S - (sign bit) etc which are affected after any integer arithmetic operation, writting of routines of 64bit arithmetic on 32bit cores which always take the same execution path (not involving any branching instruction) is quite challanging. I found that C code which involve two 64bit arrguments will always be compiled using some of those branch instructions at place where carry (in the case of addition/subtraction) has to be set to 1 or cleared to 0. It means that such routine executes differently depending on values of two arguments which are involved in addition. For example:

https://godbolt.org/z/6WfPTdh96

Is there a way to force compiler to generate "always the same execution path" code without branch instructions for 64bit arithmetic. In that example, a14 register is always preloaded with value 1. Then comparation in between a10 and a9 is performed, telling should value of a14 has to stay at 1 by skipping next instruction or it should be 0 by executing next instruction. And that makes divergent paths of execution depending on input values, because for some pair of values that SKIP is going to be performed while for other pairs skip is not performed but one extra instruction is executed. That means, execution time of routine, depends on input values which is not acceptable in critical synchronization routines which involve 64bit math.

I know I can write "by hand" such a code by replacing that part with something like:

Code: Select all

sub a14, a10, a9 // if a10 is less than a9 highest bit in a14 is going to be set, otherwise it will be reset after subtraction
srl  a14, a14, 31 // logical shift it to first bit position (funneling zeroes into higher bits) 
			 // that is the full replacement leaving 0 or 1 in a14

Off course, for performance reasons, those two instructions should not appear one after another to avoid stale or lock states of srl instruction which has to wait the result of previous operation. Previous example allows us to nest them like this

Code: Select all

main:
        entry   sp, 48
        l32r    a8, .LC0
        memw
        l32i.n  a12, sp, 0
        memw
        l32i.n  a10, a8, 0     // modified, exchanged order of loading registers
        memw
        l32i.n  a13, sp, 4     // modified, exchanged order of loading registers
        sub     a9, a10, a12  // so we can execute this here
        memw
        l32i.n  a11, a8, 4
        sub     a14, a10, a9       // and here is our replacement (a9 is modified 3 instructions earlier so no stalls) 
//        movi.n  a14, 1
//        bltu    a10, a9, .L2
//        movi.n  a14, 0
//.L2:
        sub     a12, a11, a13  
        srl      a14, a14, 31      // and here is the shift          
        memw
        s32i.n  a9, a8, 0
        sub     a11, a12, a14   // here is the actual usage of a14
        memw
        s32i.n  a11, a8, 4
        memw
        l32i.n  a12, a8, 0
        l32r    a10, .LC2
        memw
        l32i.n  a13, a8, 4
        call8   printf
        retw.n

(I wrote this as an example while writing this question just to illustrate the question. I didn't test it, however there is no reason of why it shouldn't work)

But, is there a way to force the compiler to generate such a non-branching code of int64 arithmetic keeping routine execution time constant and independent on input values?

Posted: **Sun Apr 28, 2024 8:27 pm**

Probably not.
Even if gcc knows an alternative, branchless variant of an algorithm, there's not much you can do to force it to use one algorithm or the other.

However, it seems that only recently the SALT[ U ] instruction ("Set AR if Less Than [Unsigned]") was added to the ISA (core architecture) documentation. Might be worth checking out to replace

Code: Select all

movi.n  a8, 1
bltu    a2, a4, .L2
movi.n  a8, 0
.L2:

with

Code: Select all

saltu a8, a2, a4

as long as gcc isn't yet aware of that instruction.

I recommend to actually measure execution times though instead of guessing which instruction sequence would be faster. And maybe look into finding an algorithm which doesn't depend on +/- 1 clock cycle of runtime.

Edit:
You can also try 'manually' coding in 32 bits. gcc is smart enough to not generate any extra instructions for things like "(uint32_t)(u64 >> 32)" or "((uint64_t)u32)<<32".

Posted: **Sun Apr 28, 2024 11:31 pm**

Thanks for fast response.
Yes, sgtu on RISC-V (Store Great Than) [unsigned] is conditional MOV and if you add other 32bit targets, gcc compiler of the same C code generates code without branch instruction for all of them. The same is on RISC-V or MIPS, it uses that conditional move. On x86-32 there is borrow flag and it uses sub/sbb instructions. So BRANCH exists only on Xtensa even both LX6 and LX7 cores are also cappable of conditional mov instructions.

https://godbolt.org/z/s1sYPdE38

In Xtensa documentation about conditional MOV

The conditional move instructions shown in Table 3–19 are used for branch avoidance.

MOVEQZ RRR Conditional move if zero
MOVNEZ RRR Conditional move if non-zero
MOVLTZ RRR Conditional move if less than zero
MOVGEZ RRR Conditional move if greater than or equal to zero

but there is no explicit timing mentioned.

So why gcc doesn't use those conditional move instructions (like on other 32bit platforms) but place branch (which produce divergent path, dependent on input values)? Is there any switch in compiling to turn on usage of those conditional move instructions? Or some settings in xtensa-config.h?

Posted: **Mon Apr 29, 2024 8:39 am**

djixon wrote: ↑
Sun Apr 28, 2024 11:31 pm
So why gcc doesn't use those conditional move instructions (like on other 32bit platforms) but place branch (which produce divergent path, dependent on input values)? Is there any switch in compiling to turn on usage of those conditional move instructions? Or some settings in xtensa-config.h?

In the case of "u64 - u64" there's probably no "fast" way to use the conditional move instructions because they only compare against 0.
gcc does know/use them in other cases though: https://godbolt.org/z/jhcvhK5av

P.S.: gcc's instruction cost model for both Xtensa and RISC-V in places seems to be somewhat off for the ESP32s, which makes gcc sometimes 'optimize' in the wrong direction. (https://godbolt.org/z/xKWq91jcP)

Posted: **Sun Jun 02, 2024 12:38 pm**

In the case of "u64 - u64" there's probably no "fast" way to use the conditional move instructions because they only compare against 0.

Well, in the case of 64 bit addition/subtraction operations, the value of the carry/borrow bit can be only 0 or 1 (depends only on lower 32 bits of both operands), so that should be an ideal case to use conditional move instruction instead those compare-and-branch which produce divergent execution paths for different operands values.

Posted: **Sun Jun 02, 2024 4:37 pm**

djixon wrote: ↑
Sun Jun 02, 2024 12:38 pm

In the case of "u64 - u64" there's probably no "fast" way to use the conditional move instructions because they only compare against 0.
Well, in the case of 64 bit addition/subtraction operations, the value of the carry/borrow bit can be only 0 or 1 (depends only on lower 32 bits of both operands), so that should be an ideal case to use conditional move instruction

Uhm... no.
How would one of Xtensa's conditional moves help in determining if the carry is 1 or 0?

ESP32 Forum

(edited) How to force compiler of C code with 64bit arithmetics to generate routines with non-divergent execution path?

(edited) How to force compiler of C code with 64bit arithmetics to generate routines with non-divergent execution path?

Re: (edited) How to force compiler of C code with 64bit arithmetics to generate routines with non-divergent execution pa

Re: (edited) How to force compiler of C code with 64bit arithmetics to generate routines with non-divergent execution pa

Re: (edited) How to force compiler of C code with 64bit arithmetics to generate routines with non-divergent execution pa

Re: (edited) How to force compiler of C code with 64bit arithmetics to generate routines with non-divergent execution pa

Re: (edited) How to force compiler of C code with 64bit arithmetics to generate routines with non-divergent execution pa