SPI re-entrancy

Posted: Wed Nov 07, 2018 10:55 am
by PeterR
I have started integrating my device drivers.
I have two SPI devices: a CAN controller on VSPI and a FLASH on HSPI.

The CAN transactions use spi_device_transmit() (DMA channel = 0). CAN SPI access is driven on core 1 (to ensure low latency).
The SPI test harness runs on core 0 and sends commands to core 1 using a queue.

The FLASH transactions use DMA via spi_device_queue_trans() and spi_device_get_trans_result(). The device was created with flag SPI_DEVICE_HALFDUPLEX and transactions use SPI_TRANS_VARIABLE_ADDR | SPI_TRANS_MODE_DIO.
FLASH SPI is accessed, and its test harness runs, on core 0. See https://github.com/lllucius/esp32_extflash for driver details.

Each driver has a stress test which runs within its own task. The drivers have been stress tested independently for many days without issue.

When I put the two tasks together the CAN stress test quickly fails. It's always the CAN stress test which fails. It seems that the CAN stress test is hung waiting for an SPI transaction to complete (each task does sleep!).

The test harness also profiles PSRAM performance. If I remove the FLASH test then the CAN test works whilst PSRAM is being accessed.

Clearly I have some debugging to do but wanted to check that what I am doing is supported & especially if anyone else is doing similar.

EDIT: After copying spi_device_transmit() and reducing the timeouts to 5 seconds, I can see that spi_device_get_trans_result() times out.
Before adding this code spi_device_transmit() would not return.
The CAN SPI transaction is started from a task on cpu 1. The task is sent a 'service device' command from a GPIO interrupt connected to the CAN device.
CAN interrupts were enabled with gpio_config(), gpio_install_isr_service(0) and gpio_isr_handler_add() run from cpu 1 after the SPI device was created.

So I guess that somehow my SPI CAN interrupt has been masked or forgotten.

Re: SPI re-entrancy

Posted: Thu Nov 08, 2018 2:53 am
by ESP_Sprite
Not sure if it's an issue here, but the SPI driver has a very simple priority scheme: it'll always service the earliest-registered SPI device that needs servicing first. So, if you register the flash device first and make sure its transaction queue is always filled, you'll effectively starve the CAN driver.
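
For illustration, the starvation effect described above can be sketched with a toy model. This is not the ESP-IDF driver code; the function names, the two-device setup and the "refill after every slot" rule are all assumptions made for the example. It only applies to devices sharing one SPI host.

```c
#include <stddef.h>

/* Toy model only: NOT the ESP-IDF driver. Devices are indexed in
 * registration order; the lowest index with pending work is always picked. */

/* Returns the index of the earliest-registered device with pending work,
 * or -1 if every queue is empty. */
int pick_next_device(const int pending[], size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (pending[i] > 0) {
            return (int)i;
        }
    }
    return -1;
}

/* Runs up to `rounds` service slots. Device 0 queues a new transaction after
 * every slot (its queue is "always filled"); other queues are never refilled.
 * serviced[i] counts how many slots device i received. */
void simulate(int pending[], int serviced[], size_t n, int rounds)
{
    for (int r = 0; r < rounds; r++) {
        int dev = pick_next_device(pending, n);
        if (dev < 0) {
            break;
        }
        pending[dev]--;
        serviced[dev]++;
        pending[0]++; /* device 0 immediately queues another transaction */
    }
}
```

Starting with one pending transaction on device 0 and three on device 1, device 1 never gets a slot in this model, however many rounds are run.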

Re: SPI re-entrancy

Posted: Thu Nov 08, 2018 3:31 am
by WiFive
@ESP_Sprite They are on different hosts though

Re: SPI re-entrancy

Posted: Thu Nov 08, 2018 8:58 am
by ESP_Sprite
Ah, derp, read over that. Then I don't really have a suggestion.

Re: SPI re-entrancy

Posted: Thu Nov 08, 2018 10:22 am
by PeterR
Thanks for the interest.
I have started adding xPortGetCoreID() to debug statements so that I am sure.
There are sleeps in each harness however.

I should add that the CAN test runs the SPI controller back to back with the ESP32 CAN.
The program is also >> bigger than IRAM and I have not set IRAM properties on interrupt handlers (yet) so I guess that interrupt latency could be quite high from time to time.

Re: SPI re-entrancy

Posted: Thu Nov 08, 2018 11:25 am
by ESP_Sprite
Possibly, yes, however high interrupt latency should not result in missed SPI transactions. Also note that with 'host' we mean SPI hosts (HSPI/VSPI), not CPU cores. Unless there's a bug in the SPI driver, the core you run your code on should not matter.

Re: SPI re-entrancy

Posted: Thu Nov 08, 2018 11:40 am
by PeterR
Agreed. SPI latency should not matter except that it opens a larger concurrent event window.
Yes, 'host' means SPI host. 'core' means CPU.
Note also that the CAN SPI device was created before the SPI FLASH device (not that that should matter as different hosts).

Moved CAN SPI SCHEDULER to core 0 (all my activity is now on core 0) & tested again.

Task priorities were/are (highest first): SPI CAN SCHEDULER (only task which accesses the CAN SPI bus), SPI CAN TEST (does not exercise hardware), FLASH SPI test (uses FLASH bus), PS RAM profiler

This seems much more solid. I guess that the tests (& so SPI, INTs and DMA) will be much more sequential in this configuration.
There is now only a slim chance of SPI CAN SCHEDULER events being concurrent with FLASH SPI.

The CAN SPI transactions are quite short, around 120 us.
I have reduced the SPI FLASH clock to 5 MHz and transfer 4 KB at a time. So about 6 ms.
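
As a rough check of the 6 ms figure, the payload time can be estimated from the byte count and clock rate. The helper below is a hypothetical sketch (the name and parameters are not from any driver) and it ignores command/address phases and driver overhead.

```c
#include <stdint.h>

/* Time in microseconds to clock `bytes` of payload at `clock_hz`,
 * moving `bits_per_clock` bits per SPI clock (1 = single line, 2 = dual).
 * Ignores command/address phases and any driver or DMA overhead. */
double spi_transfer_time_us(uint32_t bytes, uint32_t clock_hz,
                            uint32_t bits_per_clock)
{
    double clocks = (double)bytes * 8.0 / (double)bits_per_clock;
    return clocks * 1e6 / (double)clock_hz;
}
```

With 4096 bytes at 5 MHz and one bit per clock this gives about 6.55 ms, in line with the estimate above. If the data phase really does run on two lines (as SPI_TRANS_MODE_DIO suggests), the same formula gives roughly half that.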

So in making the change I think that I have lost concurrent SPI servicing as CAN SPI is highest priority and has very short transactions.
I need to read more about SPI interrupts and interrupt re-entrancy.

First though I am going to go back & remove ESP32 CAN as that is new.

Re: SPI re-entrancy

Posted: Thu Nov 08, 2018 2:00 pm
by PeterR
I removed the ESP32 CAN driver and the PSRAM profiler.

The application still locks up waiting for a CAN SPI response (VSPI).
I believe that VSPI and HSPI each access their queues from their respective interrupt handlers. That said, I don't see how a conflict could arise.
Stacks are 4096 so I doubt that's an issue.

Re: SPI re-entrancy

Posted: Thu Nov 08, 2018 2:54 pm
by vonnieda
See this bug report I filed back in February: https://github.com/espressif/esp-idf/issues/1651

I had the exact same issue, except that mine is Flash on one and an LCD on another. I have never been able to find an answer and have had to disable the second core for my ESP to remain stable.

There seems to be some kind of SPI deadlock issue when used with multiple cores.

I was not able to provide a self contained example that would demonstrate the behavior. Perhaps you can?

Jason

Re: SPI re-entrancy

Posted: Fri Nov 09, 2018 10:12 am
by PeterR
Thanks, yes your report seems very similar, but my application still fails when SPI activity is pinned to core 0 (it achieved 30 min, which is the longest run so far).
I will go back & ensure that the SPI bus & devices are created from core 0.

I also use Ethernet RMII. There are some broadcasts on my test network and also my application logs via UDP - there will be regular Ethernet interrupts and DMA during normal testing.
With Ethernet removed & all SPI transactions on core 0 I achieved 18 hrs last night & it is still running.

Did you have any other interrupt and/or DMA activity in your test program?

EDIT:
HSPI on core 0, VSPI on core 1, no Ethernet, main application: seems to work
HSPI on core 0, VSPI on core 0, Ethernet, main application: seems to work
HSPI on core 0, VSPI on core 1, Ethernet, main application: fails quickly
HSPI on core 0, VSPI on core 1, Ethernet, MWE: seems to work

Thinking that cache may be related I created a large IRAM_ATTR array and tried to read from it.
I get '0x40089a1c: _xt_nmi at ??:?'
How would I use cache & so flush the program out of cache & simulate 'real life'?
EDIT: Fixed. Needed to use 32 bit access.
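
A minimal sketch of what "32 bit access" can look like when individual bytes are wanted from such an array. The helper name and the buffer layout are assumptions for illustration: the idea is that every load is a whole aligned 32-bit word, with the byte extracted by shifting afterwards.

```c
#include <stddef.h>
#include <stdint.h>

/* Reads one byte from a word-aligned buffer using only 32-bit loads.
 * Byte 0 is taken as the least significant byte of word 0 (little-endian
 * byte numbering, as on the ESP32). The helper name is hypothetical. */
uint8_t read_byte_via_word(const volatile uint32_t *words, size_t byte_index)
{
    uint32_t word = words[byte_index / 4];           /* one aligned 32-bit load */
    unsigned shift = (unsigned)(byte_index % 4) * 8; /* bit offset of the byte */
    return (uint8_t)(word >> shift);
}
```

On the target the buffer would be declared as a uint32_t array with IRAM_ATTR, so the compiler never emits byte or halfword loads against it.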