ESP32 TWAI/CAN bus errors cause bus off state
Posted: Mon May 08, 2023 11:57 pm
I have multiple devices using the ESP32-WROOM-32E and TWAI (CAN). They would hit bus errors and then fail to recover via twai_initiate_recovery(). I solved the recovery issue by enabling all "CONFIG_TWAI_ERRATA_FIX*" options; however, I still have the issue that causes bus off in the first place.
The issue is rare and most likely related to electrical noise. One device goes bus off roughly every 3-4 months, another went bus off after about 2 weeks, and a third generated about 44k errors and then went bus off just last week. These devices operate near generators and electrical equipment. It is a noisy environment; however, the cables are all shielded, the power supplies are isolated, and the transceivers (ISO1042) are rated for high common mode voltage.
This last failure showed about 44k bus errors out of 150M frames, with no tx errors and no rx errors. Almost all of the errors happened when the TWAI controller went to the bus off state. After a 3 second delay, recovery was initiated and communication continued as if nothing had happened. I had set the 3 second delay before starting recovery because that is what the esp-idf example uses. While the device was bus off, no control was available, which is not acceptable for the application this device is used in.
I believe that the issue may be down to the default timings for 250kbps TWAI (.brp = 16, .tseg_1 = 15, .tseg_2 = 4, .sjw = 3, .triple_sampling = false). By my calculations, this puts the sample point at 80% (16 Tq out of 20 Tq per bit, counting the sync segment), and the sjw allows re-sync by up to 3 Tq. J1939 recommends a sample point at 87.5% and an sjw of 1.
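To make the arithmetic above easy to check, here is a minimal sketch of how I am calculating it. It assumes the time quanta are derived from an 80 MHz clock, which matches the 200ns Tq for brp = 16. The struct and function names are my own helpers, not part of the TWAI driver.

```c
#include <stdint.h>

/* Assumption: Tq is derived from an 80 MHz source clock (gives 200 ns at brp = 16). */
#define TWAI_SRC_CLK_HZ 80000000u

/* Hypothetical helper type holding only the fields needed for the calculation. */
typedef struct {
    uint32_t brp;     /* baud rate prescaler */
    uint32_t tseg_1;  /* Tq before the sample point (excluding the sync segment) */
    uint32_t tseg_2;  /* Tq after the sample point */
} bit_timing_t;

/* One bit = sync segment (always 1 Tq) + tseg_1 + tseg_2. */
uint32_t tq_per_bit(const bit_timing_t *t)
{
    return 1u + t->tseg_1 + t->tseg_2;
}

/* Length of one time quantum in nanoseconds. */
uint32_t tq_ns(const bit_timing_t *t)
{
    return (uint32_t)((uint64_t)t->brp * 1000000000ull / TWAI_SRC_CLK_HZ);
}

/* Resulting bit rate in bits per second. */
uint32_t bitrate_bps(const bit_timing_t *t)
{
    return TWAI_SRC_CLK_HZ / (t->brp * tq_per_bit(t));
}

/* Sample point in tenths of a percent (800 means 80.0%). */
uint32_t sample_point_permille(const bit_timing_t *t)
{
    return (1u + t->tseg_1) * 1000u / tq_per_bit(t);
}
```

With the default values {16, 15, 4} this gives 200ns per Tq, 20 Tq per bit, 250kbps and a sample point of 80.0%. With 20 Tq per bit an exact 87.5% is not reachable (it would need 17.5 Tq before the sample point). A hypothetical alternative such as {20, 13, 2} gives 250ns per Tq, 16 Tq per bit, 250kbps and 87.5%, but I have not verified that setting on hardware.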
Questions:
1. Is it possible that the default timings could allow the controller to get out of sync and create bus errors until it went bus off?
2. Is the 80% sample point chosen to allow for propagation delay on the bus and in the transceiver, so that it is effectively closer to 87.5% once that is taken into account? (ISO1042 loop time = 152ns, 1 Tq = 200ns)
3. Can I shorten the delay before bus recovery to limit how long the device is in bus off? (Try instant recovery; if that fails, try again in 100ms, then double the delay until recovered, with a max of say 5 seconds.)
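The retry schedule I have in mind for question 3 can be sketched as below. This is only the delay calculation; the function name and constants are my own, and the actual recovery call and alert handling are left out.

```c
#include <stdint.h>

#define RECOVERY_FIRST_RETRY_MS 100u   /* delay before the first retry */
#define RECOVERY_MAX_DELAY_MS   5000u  /* upper bound on any single delay */

/*
 * Delay to wait before recovery attempt number `attempt` (0-based).
 * Attempt 0 is immediate, attempt 1 waits 100 ms, and each later attempt
 * doubles the previous delay until it is capped at 5000 ms.
 */
uint32_t recovery_delay_ms(uint32_t attempt)
{
    if (attempt == 0u) {
        return 0u;  /* try instant recovery first */
    }

    uint32_t delay = RECOVERY_FIRST_RETRY_MS;
    for (uint32_t i = 1u; i < attempt; i++) {
        delay *= 2u;
        if (delay >= RECOVERY_MAX_DELAY_MS) {
            return RECOVERY_MAX_DELAY_MS;  /* cap reached, stop doubling */
        }
    }
    return delay;
}
```

This yields 0, 100, 200, 400, 800, 1600, 3200 and then 5000ms for every further attempt. The idea would be to wait recovery_delay_ms(n), call twai_initiate_recovery(), and move on to attempt n + 1 only if the controller has not recovered. As far as I understand the CAN protocol, recovery itself needs 128 occurrences of 11 consecutive recessive bits, so at 250kbps even an instant attempt cannot complete in less than about 5.6ms.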
Thanks for taking the time to read, let me know if you have any questions.