RSA peripheral 50% slower on ESP32-S3/C3 than S2?
Hi. I'm trying to use the RSA peripheral, as specified in the Technical Reference Manuals.
I'm enabling it via the esp_mpi_enable_hardware_hw_op() function that calls periph_module_enable(PERIPH_RSA_MODULE).
The issue/question I'm having is why the performance is ~50% slower on ESP32-S3 and ESP32-C3 compared to ESP32-S2. In the technical reference manual I see that as long as the CPU clock (for example 160 MHz) is based on the PLL frequency, the clock for the CRYPTO peripheral is 160 MHz, and for ESP32-S2 it is always 160 MHz.
For example, executing one Modular Multiplication operation with size=64 words takes 56 microseconds on ESP32-S2 but 108 microseconds on ESP32-S3. The RSA_DATE_REG contains the same value for both the S3 and S2 according to the technical reference manual, so I assume that means it contains the exact same peripheral.
What could be the cause for the time differences?
Re: RSA peripheral 50% slower on ESP32-S3/C3 than S2?
I've poked a colleague who knows more about this to take a look... Meanwhile, would you be able to post your testing code somewhere?
Re: RSA peripheral 50% slower on ESP32-S3/C3 than S2?
You are correct that the peripheral is the same, but the way we access the memory has changed and unfortunately this means the RSA on C3/S3 will be slower than S2, but still an improvement from ESP32. 50% seems to be within the expected range.
I see that the TRM RSA performance data for S3/C3 do not reflect this; I'll make sure we fix this.
Re: RSA peripheral 50% slower on ESP32-S3/C3 than S2?
Hi and thanks for the quick reply.
Do you mean that while performing calculations, the RSA peripheral internally repeatedly accesses the memory and that is what causes the slowdown?
I have compared with the original ESP32 as well, but apart from the new modular exponentiation settings that can significantly speed up such operations, I see normal modular multiplications are around 50% slower on the ESP32-S2 than the ESP32, which then means 75% slower on the ESP32-S3/C3 than the ESP32. On ESP32, a 64 word modular multiplication (2048 bits) runs in 28 microseconds at 160 MHz and 19 microseconds at 240 MHz.
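To make these percentages concrete, here is a small calculation that uses only the timings quoted in this thread for a 64-word modular multiplication at 160 MHz. The names are purely illustrative, and reading "slower" as "lower throughput" is an assumption about how the figures were meant.

```python
# Timings quoted in this thread for one 64-word (2048-bit) modular
# multiplication at 160 MHz, in microseconds.
time_us = {"ESP32": 28, "ESP32-S2": 56, "ESP32-S3": 108}

def throughput_drop(slow, fast):
    """Fractional loss in operations per second of `slow` relative to `fast`."""
    return 1 - time_us[fast] / time_us[slow]

s2_vs_esp32 = throughput_drop("ESP32-S2", "ESP32")   # 0.5
s3_vs_esp32 = throughput_drop("ESP32-S3", "ESP32")   # about 0.74
s3_vs_s2 = throughput_drop("ESP32-S3", "ESP32-S2")   # about 0.48

for name, value in [("S2 vs ESP32", s2_vs_esp32),
                    ("S3 vs ESP32", s3_vs_esp32),
                    ("S3 vs S2", s3_vs_s2)]:
    print(f"{name}: {value:.0%} lower throughput")
```

Under that reading, the S2 does half as many multiplications per second as the ESP32, and the S3 roughly a quarter.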
Let me explain some more background info. I'm working on an ECC library for the ESP32 that I intend to make public, since I see a lack of efficient ECC implementations both in esp-idf and publicly available online. Mbed TLS with the "hw-accelerated" ports in esp-idf is absurdly slow (like 100 ms or so for Curve25519). Right now I have written variable base implementations (i.e. shared secret calculations) for Curve25519 and P-256 that run in 7 ms and 8 ms respectively on the ESP32-S2, and fixed base implementations (i.e. keygen) that are ~60% faster. I am now about to write implementations for the original ESP32 as well, which, based on some timing experiments, I expect to be ~50% faster.
There are three reasons the original ESP32 is faster:
First, on S2/S3/C3, the modular multiplication operation always performs two modular multiplications instead of one (including two Montgomery reductions, so the result is also multiplied by R^-2). So 50% of the multiplications will be "dummy" multiplications. On the ESP32 we can perform one single modular multiplication with Montgomery reduction (so the result will be multiplied by R^-1). Judging from the manual, I assume the rationale is to make it easier for users to multiply numbers without having to understand how to deal with "Montgomery classes", but this unfortunately comes with a 50% performance hit. Normally, when you work with algorithms requiring modular multiplications, you multiply by R^2 initially to transform the numbers into "Montgomery classes", then perform thousands of modular multiplications with Montgomery reduction (which creates an R^-1 factor every time), and at the end of the computation perform a final reduction to bring the number back to a normal one. The approach described in the TRMs is to do this for every multiplication, which requires twice as many multiplications. In the ESP32 manual, the user is told to perform these two steps manually, but for S2/S3/C3 this has unfortunately been hardcoded so that two multiplications are always performed. Fortunately, in a few cases the ECC formulas contain A*B*C calculations, where no dummy multiplication is needed, but these are few.
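The cost difference described here can be illustrated with a toy model in plain Python integers. This is only a sketch of the arithmetic, not the peripheral's interface: the function names, the choice of modulus, and the radix R = 2^256 are all assumptions made for the example.

```python
# Toy model using Python integers. Not the peripheral's API.
M = 2**255 - 19          # an odd modulus (the Curve25519 prime)
R = 2**256               # assumed Montgomery radix: 8 words of 32 bits
R_INV = pow(R, -1, M)    # modular inverse, needs Python 3.8+
R2 = R * R % M

mont_mul_count = 0

def mont_mul(a, b):
    """One multiplication with Montgomery reduction: a * b * R^-1 mod M."""
    global mont_mul_count
    mont_mul_count += 1
    return a * b * R_INV % M

def square_chain_per_product(x, k):
    """Return to the normal domain after every product (two mont_mul each)."""
    acc = x
    for _ in range(k):
        acc = mont_mul(mont_mul(acc, acc), R2)   # (acc^2 * R^-1) * R^2 * R^-1
    return acc

def square_chain_montgomery_domain(x, k):
    """Convert once, square k times in the Montgomery domain, convert back once."""
    acc = mont_mul(x, R2)            # x * R
    for _ in range(k):
        acc = mont_mul(acc, acc)     # stays in the form (value * R)
    return mont_mul(acc, 1)          # strip the remaining factor R

x, k = 9, 1000

mont_mul_count = 0
result_a = square_chain_per_product(x, k)
count_a = mont_mul_count

mont_mul_count = 0
result_b = square_chain_montgomery_domain(x, k)
count_b = mont_mul_count

assert result_a == result_b == pow(x, 2**k, M)
print(count_a, count_b)   # 2000 and 1002 Montgomery multiplications
```

Both chains give the same result, but the per-product variant needs about twice as many Montgomery multiplications once the chain is long.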
The second issue, which I have not found any solution to, is that transferring words to/from the RSA peripheral is significantly slower on S2/S3/C3 than on the original ESP32. Writing/reading one word to/from RSA_X_MEM, RSA_Y_MEM and so on takes something like 20 CPU cycles here vs 1 CPU cycle or so on ESP32. Since these memory locations are not double buffered, a lot of the time the RSA engine must stay idle while we transfer data. Around 40% of the time during an ECC calculation on ESP32-S2 is spent transferring data to/from the RSA peripheral. This is even slower on S3/C3, since there we don't have the "PeriBus1", which seems slightly faster on S2 when reading from RSA_Z_MEM. On ESP32, the time spent on memory transfers is significantly lower, leading to a speedup of the overall ECC operation.
The third reason is that the RSA peripheral on ESP32 seems to change with the CPU clock frequency, so at 240 MHz, it will be 50% faster.
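As a quick sanity check of this observation, the two ESP32 timings quoted earlier in the thread can be converted into clock cycles. If the peripheral clock tracks the CPU clock, the cycle count should come out roughly the same at both frequencies. The variable names are illustrative only.

```python
# ESP32 timings quoted earlier in this thread for a 64-word modular multiplication.
measurements = [(160, 28), (240, 19)]   # (CPU clock in MHz, time in microseconds)

# MHz * microseconds = clock cycles
cycles = [mhz * us for mhz, us in measurements]
spread = (max(cycles) - min(cycles)) / min(cycles)

print(cycles, f"spread {spread:.1%}")
```

The two cycle counts differ by less than 2%, which is within the rounding of the microsecond figures.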
The only thing that makes ESP32 worse for ECC operations than the newer S2/S3/C3 is that the operand size is limited to multiples of 512 bits, but despite this fact, 256-bit ECC operations are still much faster on this platform than the newer chips.
Please let me know if there are some clock/peripheral bus settings I have missed, secret registers that perform only one modular multiplication instead of two, or whatever that could make my ECC calculations faster.
Thanks
Re: RSA peripheral 50% slower on ESP32-S3/C3 than S2?
Sorry, as far as I know there are no secret registers that allow you to perform just one modular multiplication instead of two on these newer chips.
I'll forward the question to the guy on our digital team responsible for this peripheral and see if he has any ideas, but most likely there isn't any way to speed this up.
Re: RSA peripheral 50% slower on ESP32-S3/C3 than S2?
Hi EmilenL, I'm very interested in your ECC optimised library!
Currently I'm using the Trezor C library for my EC math in my ESP32 Ethereum bridge library (https://github.com/AlphaWallet/Web3E), and while it's optimised for embedded devices, it only does ECDSA. It would be great to add ED25519 for Corda/Attestation crypto, and maybe the BNE curves too.
What stage of development are you at? Would you be targeting Secp256k1?
Re: RSA peripheral 50% slower on ESP32-S3/C3 than S2?
Elemental wrote (Tue Oct 26, 2021 11:20 am):
> Hi EmilenL, I'm very interested in your ECC optimised library!
> Currently I'm using the Trezor C library for my EC math in my ESP32 Ethereum bridge library (https://github.com/AlphaWallet/Web3E), and while it's optimised for embedded devices it only does ECDSA. Would be great to add ED25519 for Corda/Attestation crypto maybe also the BNE curves too.
> What stage of development are you at? Would you be targeting Secp256k1?
I've done some sample implementations for secp256r1 (P-256), Curve25519 and Ed25519 on the newer ESP32-XX devices (S2/C3/S3) so far. I'm planning to implement secp256k1 too (even though I don't really know why people use that curve when Curve25519/Ed25519 is far superior), as well as Curve448, secp384r1 and secp521r1, and to add ESP32 support.
What I'm thinking about right now is what the API should look like and how memory should be dealt with: for example, use stack allocation, let the user pass in a buffer, or allocate dynamically inside the lib. I would like to keep the API simple but still avoid having return codes for "memory allocation failure". I'm also thinking about how this could one day be integrated into, or used with, esp-idf so that standard TLS connections can use it. I'm also waiting a bit for the "digital team responsible for this peripheral" before continuing, to make sure I do it the best way.
Let me know if you have any ideas!
Re: RSA peripheral 50% slower on ESP32-S3/C3 than S2?
ESP-Marius wrote (Fri Oct 22, 2021 3:20 am):
> Sorry, as far as I know there is no secret registers that allow you to perform just one modular multiplication instead of two on these newer chips.
> I'll forward the question to the guy on our digital team responsible for this peripheral and see if he has any ideas, but most likely there isn't any way to speed this up.
After doing some experiments I found out that the Modular Exponentiation operation seems to always initially multiply X by Z, perform a Montgomery reduction and then put that result in X. If we set Y to 0, RSA_CONSTANT_TIME_REG to 0, RSA_SEARCH_ENABLE_REG to 1 and RSA_SEARCH_POS_REG to 0, the RSA peripheral will then stop immediately after writing X*Z*(R^-1) to X and writing 1 to Z. For sufficiently large operands, the time saving is almost 50% compared to a standard modular multiplication operation (which in fact performs two modular multiplications). On ESP32-S3, for a 256-bit operation, this reduces the time from 2.5 microseconds to 1.5 microseconds. However, an additional 1.5 microseconds are still needed for the CPU to write the operands and read the result while the RSA peripheral is idling, so ~50% of the time is still "wasted", unfortunately. Is there maybe any way to use DMA to speed this up somehow?
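The time budget for the 256-bit case can be worked out from the figures quoted in this post. This is only a back-of-the-envelope sketch: it assumes the 1.5 microsecond transfer cost is the same for both variants, which the post does not state explicitly, and the variable names are illustrative.

```python
# ESP32-S3 figures quoted in this post for one 256-bit operation, in microseconds.
standard_compute = 2.5   # standard modular multiplication (two Montgomery steps)
trick_compute = 1.5      # modular exponentiation with Y = 0 (one Montgomery step)
transfer = 1.5           # CPU writing operands and reading the result (assumed equal for both)

standard_total = standard_compute + transfer
trick_total = trick_compute + transfer

compute_saving = 1 - trick_compute / standard_compute   # peripheral time only
total_saving = 1 - trick_total / standard_total         # including transfers
idle_fraction = transfer / trick_total                  # share spent moving data

print(f"compute saving {compute_saving:.0%}, "
      f"end-to-end saving {total_saving:.0%}, "
      f"transfer share {idle_fraction:.0%}")
```

At this operand size the peripheral time drops by 40%, but under the stated assumption the end-to-end gain is closer to 25%, because the transfer time is unchanged and now makes up half of each operation.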
Re: RSA peripheral 50% slower on ESP32-S3/C3 than S2?
Hello! Is there any news on this? Did you release any code on GitHub? Thanks!
Re: RSA peripheral 50% slower on ESP32-S3/C3 than S2?
I've implemented some more curves, such as Curve448, secp384r1 and secp521r1, but have a few things to complete before I put it on GitHub. What curves are you interested in and what ESP platform would you like to run it on?