Hi and thanks for the quick reply.
Do you mean that while performing calculations, the RSA peripheral internally repeatedly accesses the memory and that is what causes the slowdown?
I have compared with the original ESP32 as well. Apart from the new modular exponentiation settings, which can significantly speed up such operations, I see that normal modular multiplications are around 50% slower on the ESP32-S2 than on the ESP32, which then means around 75% slower on the ESP32-S3/C3 than on the ESP32. On the ESP32, a 64-word modular multiplication (2048 bits) runs in 28 microseconds at 160 MHz and 19 microseconds at 240 MHz.
Let me give some more background. I'm working on an ECC library for the ESP32 that I intend to make public, since I see a lack of efficient ECC implementations both in esp-idf and publicly available online. Mbedtls with the "hw-accelerated" ports in esp-idf is absurdly slow (like 100 ms or so for Curve25519). Right now I have variable-base implementations (i.e. shared secret calculations) for Curve25519 and P-256 that run in 7 ms and 8 ms respectively on the ESP32-S2, and fixed-base implementations (i.e. keygen) that are ~60% faster. I am now about to write implementations for the original ESP32 as well, which, based on some timing experiments, I expect to be ~50% faster.
There are three reasons the original ESP32 is faster:
First, on S2/S3/C3, the modular multiplication operation always performs two modular multiplications instead of one (including two Montgomery reductions, so the result is also multiplied by R^-2). So 50% of the multiplications will be "dummy" multiplications. On the ESP32 we can perform one single modular multiplication with Montgomery reduction (so the result will be multiplied by R^-1). Judging by the manual, I assume the rationale is to make it easier for users to multiply numbers without having to understand how to deal with "Montgomery classes", but this unfortunately costs a 50% performance hit. Normally, when you work with algorithms requiring modular multiplications, you multiply by R^2 initially to transform the numbers into "Montgomery classes", then perform thousands of modular multiplications with Montgomery reduction (each of which introduces an R^-1 factor), and at the end of the computation perform one final reduction to bring the number back to a normal one. The approach described in the TRMs is to do this conversion for every multiplication, which requires twice as many multiplications. In the ESP32 manual, the user is told to perform these two steps manually, but for S2/S3/C3 this has unfortunately been hardcoded so that two multiplications are always performed. Fortunately, in a few cases in the ECC formulas there are A*B*C calculations, where no dummy multiplication is needed, but these are few.
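To illustrate the cost difference described above, here is a minimal software sketch using a toy single-word Montgomery multiplication (R = 2^32, odd modulus below 2^31). It is only a model of the arithmetic, not of how the peripheral is implemented, and all names (mont_ctx, montmul, chain_convert_each_time, chain_stay_in_form) are made up for this example. One function converts out of Montgomery form after every product (two Montgomery multiplications per product); the other converts once at the start and once at the end.

```c
#include <assert.h>
#include <stdint.h>

static uint64_t montmul_calls;   /* counts Montgomery multiplications */

typedef struct {
    uint32_t n;       /* odd modulus, 3 <= n < 2^31 (toy restriction) */
    uint32_t nprime;  /* -n^-1 mod 2^32 */
    uint32_t r2;      /* R^2 mod n, with R = 2^32 */
} mont_ctx;

static void mont_init(mont_ctx *c, uint32_t n)
{
    assert((n & 1u) && n >= 3u && n < 0x80000000u);
    uint32_t inv = n;                 /* n*n == 1 mod 8, so 3 bits are correct */
    for (int i = 0; i < 5; i++)
        inv *= 2u - n * inv;          /* each Newton step doubles the correct bits */
    uint64_t r = (1ULL << 32) % n;    /* R mod n */
    c->n = n;
    c->nprime = 0u - inv;
    c->r2 = (uint32_t)((r * r) % n);
}

/* Returns a*b*R^-1 mod n for a, b < n. */
static uint32_t montmul(const mont_ctx *c, uint32_t a, uint32_t b)
{
    uint64_t t = (uint64_t)a * b;
    uint32_t m = (uint32_t)t * c->nprime;          /* makes t + m*n divisible by R */
    uint64_t u = (t + (uint64_t)m * c->n) >> 32;   /* no overflow because n < 2^31 */
    montmul_calls++;
    return (uint32_t)(u >= c->n ? u - c->n : u);
}

/* a^(k+1) mod n, leaving Montgomery form after every product:
 * 2 Montgomery multiplications per product, 2k in total. */
static uint32_t chain_convert_each_time(const mont_ctx *c, uint32_t a, unsigned k)
{
    uint32_t acc = a;
    for (unsigned i = 0; i < k; i++)
        acc = montmul(c, montmul(c, acc, a), c->r2);  /* (acc*a/R) * R^2 / R */
    return acc;
}

/* Same value, staying in Montgomery form: k + 2 Montgomery multiplications. */
static uint32_t chain_stay_in_form(const mont_ctx *c, uint32_t a, unsigned k)
{
    uint32_t am = montmul(c, a, c->r2);   /* a*R mod n */
    uint32_t acc = am;
    for (unsigned i = 0; i < k; i++)
        acc = montmul(c, acc, am);        /* product keeps exactly one factor R */
    return montmul(c, acc, 1);            /* strip the remaining factor R */
}
```

Both functions return the same value, but for a chain of k products the first needs 2k Montgomery multiplications and the second k + 2, which is where the roughly 50% penalty comes from.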
The second issue, which I have not found any solution to, is that transferring words to/from the RSA peripheral is significantly slower on S2/S3/C3 than on the original ESP32. Writing/reading one word to/from RSA_X_MEM, RSA_Y_MEM and so on takes something like 20 CPU cycles here vs 1 CPU cycle or so on ESP32. Since these memory locations are not double buffered, the RSA engine must stay idle much of the time while we transfer data. Around 40% of the time during an ECC calculation on ESP32-S2 is spent transferring data to/from the RSA peripheral. This is even slower on S3/C3, since they lack the "PeriBus1", which seems slightly faster on S2 when reading from RSA_Z_MEM. On ESP32, the time spent on memory transfers is significantly lower, leading to a speedup of the overall ECC operation.
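For anyone who wants to reproduce the per-word figures, this is roughly the shape of the measurement, as a sketch only. The helper names (copy_words, cycles_per_word) are made up for this example, and reading the CPU cycle counter is left to the caller, since the exact call depends on the SDK version. The destination is declared volatile so the compiler cannot drop or merge the stores, and the cycle difference is computed in unsigned 32-bit arithmetic so a counter wrap-around between the two readings does not matter.

```c
#include <stddef.h>
#include <stdint.h>

/* Copy `words` 32-bit words to a (possibly memory-mapped) destination.
 * `dst` is volatile so every store is actually issued, in order. */
static void copy_words(volatile uint32_t *dst, const uint32_t *src, size_t words)
{
    for (size_t i = 0; i < words; i++)
        dst[i] = src[i];
}

/* Average cycles per word between two cycle-counter readings taken
 * just before and just after the copy. Unsigned subtraction gives the
 * right elapsed count even if the 32-bit counter wrapped once. */
static uint32_t cycles_per_word(uint32_t start, uint32_t end, size_t words)
{
    if (words == 0)
        return 0;
    uint32_t elapsed = (uint32_t)(end - start);
    return elapsed / (uint32_t)words;
}
```

On target, one would read the cycle counter, call copy_words with the peripheral memory block as dst, read the counter again, and pass both readings to cycles_per_word. The result includes loop overhead, so it is an upper bound on the bus cost per word.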
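The third reason is that the RSA peripheral on ESP32 seems to scale with the CPU clock frequency, so at 240 MHz it will be about 50% faster than at 160 MHz. The 64-word timings above are consistent with this: 28 microseconds at 160 MHz and 19 microseconds at 240 MHz both come to roughly 4500 CPU cycles.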
The only thing that makes ESP32 worse for ECC operations than the newer S2/S3/C3 is that the operand size is limited to multiples of 512 bits. Despite this, 256-bit ECC operations are still much faster on this platform than on the newer chips.
Please let me know if there are some clock/peripheral bus settings I have missed, secret registers that perform only one modular multiplication instead of two, or anything else that could make my ECC calculations faster.
Thanks