(edited) How to force compiler of C code with 64bit arithmetics to generate routines with non-divergent execution path?
Posted: Sun Apr 28, 2024 6:46 pm
Since Xtensa cores do not have usual FLAGS (or somwhere called status) register, which usually contains C - carry bit (for add instructions)/ B -(borrow bit for sub instructions) O - (overflow bit) S - (sign bit) etc which are affected after any integer arithmetic operation, writting of routines of 64bit arithmetic on 32bit cores which always take the same execution path (not involving any branching instruction) is quite challanging. I found that C code which involve two 64bit arrguments will always be compiled using some of those branch instructions at place where carry (in the case of addition/subtraction) has to be set to 1 or cleared to 0. It means that such routine executes differently depending on values of two arguments which are involved in addition. For example:
https://godbolt.org/z/6WfPTdh96
Is there a way to force compiler to generate "always the same execution path" code without branch instructions for 64bit arithmetic. In that example, a14 register is always preloaded with value 1. Then comparation in between a10 and a9 is performed, telling should value of a14 has to stay at 1 by skipping next instruction or it should be 0 by executing next instruction. And that makes divergent paths of execution depending on input values, because for some pair of values that SKIP is going to be performed while for other pairs skip is not performed but one extra instruction is executed. That means, execution time of routine, depends on input values which is not acceptable in critical synchronization routines which involve 64bit math.
I know I can write "by hand" such a code by replacing that part with something like:
Off course, for performance reasons, those two instructions should not appear one after another to avoid stale or lock states of srl instruction which has to wait the result of previous operation. Previous example allows us to nest them like this
(I wrote this as an example while writing this question just to illustrate the question. I didn't test it, however there is no reason of why it shouldn't work)
But, is there a way to force the compiler to generate such a non-branching code of int64 arithmetic keeping routine execution time constant and independent on input values?
https://godbolt.org/z/6WfPTdh96
Is there a way to force compiler to generate "always the same execution path" code without branch instructions for 64bit arithmetic. In that example, a14 register is always preloaded with value 1. Then comparation in between a10 and a9 is performed, telling should value of a14 has to stay at 1 by skipping next instruction or it should be 0 by executing next instruction. And that makes divergent paths of execution depending on input values, because for some pair of values that SKIP is going to be performed while for other pairs skip is not performed but one extra instruction is executed. That means, execution time of routine, depends on input values which is not acceptable in critical synchronization routines which involve 64bit math.
I know I can write "by hand" such a code by replacing that part with something like:
Code: Select all
sub a14, a10, a9 // if a10 is less than a9 highest bit in a14 is going to be set, otherwise it will be reset after subtraction
srl a14, a14, 31 // logical shift it to first bit position (funneling zeroes into higher bits)
// that is the full replacement leaving 0 or 1 in a14
Code: Select all
main:
entry sp, 48
l32r a8, .LC0
memw
l32i.n a12, sp, 0
memw
l32i.n a10, a8, 0 // modified, exchanged order of loading registers
memw
l32i.n a13, sp, 4 // modified, exchanged order of loading registers
sub a9, a10, a12 // so we can execute this here
memw
l32i.n a11, a8, 4
sub a14, a10, a9 // and here is our replacement (a9 is modified 3 instructions earlier so no stalls)
// movi.n a14, 1
// bltu a10, a9, .L2
// movi.n a14, 0
//.L2:
sub a12, a11, a13
srl a14, a14, 31 // and here is the shift
memw
s32i.n a9, a8, 0
sub a11, a12, a14 // here is the actual usage of a14
memw
s32i.n a11, a8, 4
memw
l32i.n a12, a8, 0
l32r a10, .LC2
memw
l32i.n a13, a8, 4
call8 printf
retw.n
(I wrote this as an example while writing this question just to illustrate the question. I didn't test it, however there is no reason of why it shouldn't work)
But, is there a way to force the compiler to generate such a non-branching code of int64 arithmetic keeping routine execution time constant and independent on input values?