SPI Rx chunk size in DMA mode

Post by **powerbroker** » Sun Jun 09, 2024 6:45 am

hi,

looking into the SPI Master driver API, i see the note "Written by 4 bytes-unit if DMA is used" in the description of the "void *rx_buffer" structure member.

in practice, if rx_buffer is declared e.g. as 'unsigned char Rx[2]', data is NOT being transferred into it from MISO line. but 'unsigned char Rx[4]' does the trick - even if only a single Rx byte is required.

is it SPI driver issue or DMA hardware limitation? for example, STM32 DMA is capable of byte-by-byte transfer.

the thing is, i have structures whose fields are laid out exactly in the order the SPI MISO data arrives. so the obvious thing to say is: "do DMA from SPI into RAM starting at that structure address and be happy".

but... the structure size is not always a multiple of 4 bytes.
so what now - pad even 1-byte structures to a multiple of 4 bytes to enable DMA on them? or is an intermediate buffer sized to a multiple of 4 bytes a must for DMA Rx? (btw, does malloc align data so that DMA works fine even on 1-, 2-, 3-, 5- or 7-byte buffers?)

Post by **ESP_Sprite** » Sun Jun 09, 2024 7:01 am

What I think that remark says is that DMA always writes out 32 bits, so e.g. if you receive 8 bits in a buf[4], whatever was in buf[1..3] will also be overwritten even if only buf[0] contains the actual received data.

Aside from that, what you say doesn't make much sense to me - the DMA subsystem has no way of knowing how you happened to declare your buffer. Can you post some code to illustrate what you mean?

Post by **powerbroker** » Sun Jun 09, 2024 11:15 am

ESP_Sprite wrote: ↑
Sun Jun 09, 2024 7:01 am
What I think that remark says is that DMA always writes out 32 bits, so e.g. if you receive 8 bits in a buf[4], whatever was in buf[1..3] will also be overwritten even if only buf[0] contains the actual received data.

Aside from that, what you say doesn't make much sense to me - the DMA subsystem has no way of knowing how you happened to declare your buffer. Can you post some code to illustrate what you mean?

well, i try and try, but still cannot reproduce the initial behavior.
and what i see now:

Code: Select all

/**
 * Transmits and receives through SPI. OVERWRITES *data with Rx values,
 *  starting from data[0]. REQUIRES room for 4-byte chunks in the data buffer
 *  to Rx successfully through DMA (to Rx 1 byte, the buffer must have room for 4).
 * @param spiDevice SPI device handle preconfigured by spi_bus_add_device(...).
 * @param data      Data buffer. Note the 4-byte chunk size when DMA is used.
 * @param rxSize    Actual number of bytes to receive
 * @param totalSize Total number of bytes to transmit/receive
 * @return SPI IO result (ESP_OK == 0, etc.)
 */
static int SPI_IO(spi_device_handle_t spiDevice, unsigned char *data, unsigned int rxSize, unsigned int totalSize){
 
    int result;
 
    spi_transaction_t t;
    memset(&t, 0, sizeof(t));       //Zerofill the transaction
 
    t.length = 8 * totalSize;
    t.tx_buffer = data;
 
    if(rxSize){
        t.rxlength = 8 * rxSize;
        t.rx_buffer = data;
    }
 
    result = spi_device_polling_transmit(spiDevice, &t);
 
    return result;
}
 
/**
 * Read single register of BME280
 * @param hwIdx Device index in BME_Dev[x] array.
 * @param reg Register address
 * @param data 1-byte variable to read to
 * @return SPI IO result
 */
int BME_reg_read(unsigned char hwIdx, unsigned char reg, unsigned char *data){
    struct {
        unsigned char ioBuffer[4];    // DMA Rx requires 4-byte chunks, may fail!!!
        int8_t result;
    } wc;
    wc.ioBuffer[0] = BME_register(reg, false);
    wc.result = 0xFF;             // reads back as -1; never assigned from SPI_IO(), so it acts as a canary for DMA overwrites
    SPI_IO((spi_device_handle_t)BME_Dev[hwIdx].ioId, wc.ioBuffer, 1, 2);
    ESP_LOGI("SPI Rx test", "Rx buffer: %d %d", wc.ioBuffer[0], wc.ioBuffer[1]);
    ESP_LOGI("SPI Rx test", "Result: %d", wc.result);
    if(wc.result == ESP_OK){
        data[0] = wc.ioBuffer[1];
    }
    return wc.result;
}


SPI IO succeeds, and with the 'unsigned char ioBuffer[4]' structure member i see:

Code: Select all

I (15043) SPI Rx test: Rx buffer: 0 96
I (15043) SPI Rx test: Result: -1

which is expected(SPI device responds '96') and 100% correct. but if i declare 'unsigned char ioBuffer[2]' instead, i see:

Code: Select all

I (15043) SPI Rx test: Rx buffer: 0 96
I (15043) SPI Rx test: Result: 0

i.e. as you say, 'wc.result' value gets destroyed by DMA, writing 4-bytes chunk to 'wc.ioBuffer' address.
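The overwrite described here can be reproduced on a host machine with a small model. This is only a sketch under stated assumptions: `sim_dma_rx`, `struct small_io` and `struct padded_io` are hypothetical names, the padding bytes are modeled as zero (consistent with the `Result: 0` log above, but not a documented guarantee), and `small_io` carries an explicit `pad` byte so the simulated 4-byte store stays inside the object.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical host-side model of a DMA engine that stores whole 32-bit
 * words only: the byte count is rounded up to a multiple of 4 and the
 * padding bytes are written as zero (an assumption of this model). */
static size_t sim_dma_rx(uint8_t *dst, const uint8_t *rx, size_t rx_len)
{
    assert(dst != NULL && rx != NULL);
    size_t written = (rx_len + 3u) & ~(size_t)3u;  /* round up to 4 */
    memset(dst, 0, written);                       /* padding reaches memory too */
    memcpy(dst, rx, rx_len);
    return written;
}

/* Like the struct in BME_reg_read with ioBuffer[2]: 'result' shares the
 * first 32-bit word with the buffer. 'pad' keeps the simulation in bounds. */
struct small_io  { uint8_t ioBuffer[2]; int8_t result; uint8_t pad; };

/* With ioBuffer[4], 'result' lives past the first 32-bit word. */
struct padded_io { uint8_t ioBuffer[4]; int8_t result; };

/* Returns the canary value after a simulated 2-byte receive. */
static int8_t canary_after_rx_small(void)
{
    struct small_io wc = { {0, 0}, -1, 0 };
    const uint8_t rx[2] = { 0, 96 };
    sim_dma_rx((uint8_t *)&wc, rx, sizeof rx);
    return wc.result;
}

static int8_t canary_after_rx_padded(void)
{
    struct padded_io wc = { {0, 0, 0, 0}, -1 };
    const uint8_t rx[2] = { 0, 96 };
    sim_dma_rx(wc.ioBuffer, rx, sizeof rx);
    return wc.result;
}
```

In this model `canary_after_rx_small()` loses its -1 while `canary_after_rx_padded()` keeps it, which mirrors the two logs above.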

so, there is a question still remaining: if we malloc(2), is the block DWORD-aligned so DMA can handle it, and is there enough spare space after it to do DMA safely?
and one more question: when we use the same buffer for Tx and Rx, does DMA destroy the remaining 3 bytes of a chunk after transmitting/receiving the first one? e.g. we are transmitting '10 2F EE 04' - is it overwritten with the received 4-byte result only after the 4th byte is sent, or does it turn into '00 00 00 00' after the 1st byte Tx/Rx?

Post by **ok-home** » Sun Jun 09, 2024 12:56 pm

Hi
it makes no sense to use DMA mode when the data length is less than 32 bits
you will incur unnecessary overhead on DMA initialization, plus DMA always reads/writes memory in 32-bit units
SPI always works through its internal buffer
set the flags SPI_TRANS_USE_TXDATA and SPI_TRANS_USE_RXDATA, and the driver itself will align the data internally and give you exactly the number of bytes you requested without changing the high bytes in the word,
the transaction processing speed will decrease,
the CPU load will not change because everything will be processed in the spi buffer
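One way to read this advice is as a size-based choice inside a common wrapper. The sketch below is only that choice, written so it compiles on a host: `spi_path_t`, `SPI_INLINE_MAX_BYTES` and `spi_choose_path` are hypothetical names, not ESP-IDF API, and the 4-byte limit is taken from the `tx_data[4]` / `rx_data[4]` members of `spi_transaction_t`.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names for illustration; none of these exist in ESP-IDF. */
typedef enum {
    SPI_PATH_INLINE,  /* tx_data/rx_data + SPI_TRANS_USE_TXDATA | SPI_TRANS_USE_RXDATA */
    SPI_PATH_BUFFER   /* tx_buffer/rx_buffer pointers to caller-provided memory */
} spi_path_t;

/* Capacity of the inline tx_data[4] / rx_data[4] members. */
#define SPI_INLINE_MAX_BYTES 4u

/* Picks the transfer path for a transaction of 'total_bytes' bytes. */
static spi_path_t spi_choose_path(size_t total_bytes)
{
    assert(total_bytes > 0);
    return (total_bytes <= SPI_INLINE_MAX_BYTES) ? SPI_PATH_INLINE
                                                 : SPI_PATH_BUFFER;
}
```

A wrapper would presumably set the two flags and copy through `tx_data` / `rx_data` on the inline path, and fall back to buffer pointers otherwise; that part depends on the driver and is not shown here.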

Post by **powerbroker** » Sun Jun 09, 2024 4:23 pm

ok-home wrote: ↑
Sun Jun 09, 2024 12:56 pm
Hi
it makes no sense to use DMA mode when the data length is less than 32 bits
you will incur unnecessary overhead on DMA initialization, plus DMA always reads/writes memory in 32-bit units
SPI always works through its internal buffer
set the flags SPI_TRANS_USE_TXDATA and SPI_TRANS_USE_RXDATA, and the driver itself will align the data internally and give you exactly the number of bytes you requested without changing the high bytes in the word,
the transaction processing speed will decrease,
the CPU load will not change because everything will be processed in the spi buffer

indeed, thanks

the thing is, it would be great to have a single routine for SPI IO of any size, used everywhere in the app: the only place where SPI IO can break and the only place containing SPI-related bugs, if any

i was wondering: if there is an internal buffer, why does the driver write only in 4-byte chunks...
well, until i saw this internal buffer magic in spi_master.h

Code: Select all

struct spi_transaction_t {
    uint32_t flags;
    uint16_t cmd;
    uint64_t addr;
    size_t length; 
    size_t rxlength;
    void *user;
    union {
        const void *tx_buffer;      ///< Pointer to transmit buffer, or NULL for no MOSI phase
        uint8_t tx_data[4];         ///< If SPI_TRANS_USE_TXDATA is set, data set here is sent directly from this variable.
    };
    union {
        void *rx_buffer;            ///< Pointer to receive buffer, or NULL for no MISO phase. Written by 4 bytes-unit if DMA is used.
        uint8_t rx_data[4];         ///< If SPI_TRANS_USE_RXDATA is set, data is received directly to this variable
    };
} ;


of course, driver aligns nothing: transmitting length = 2 and rxLength = 1 results in '00 96 00 00' in rx_data when SPI device responds '96' after getting first byte.
and why am i not surprised...
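In a full-duplex exchange, the byte clocked in while the command is still going out carries no reply, so '00 96 00 00' is what one would expect to see, and the caller skips the leading byte. A minimal sketch of that step, with `spi_extract_reply` being a hypothetical helper rather than anything from the driver:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copies the reply part of a full-duplex exchange
 * into 'out', skipping the 'cmd_len' bytes that were received while the
 * command/address was being transmitted. Returns the bytes copied. */
static size_t spi_extract_reply(const uint8_t *rx, size_t total_len,
                                size_t cmd_len, uint8_t *out, size_t out_cap)
{
    assert(rx != NULL && out != NULL);
    if (cmd_len >= total_len) {
        return 0;                   /* nothing was received after the command */
    }
    size_t n = total_len - cmd_len;
    if (n > out_cap) {
        n = out_cap;                /* never write past the caller's buffer */
    }
    memcpy(out, rx + cmd_len, n);
    return n;
}
```

For the 2-byte register read above this would be called with `total_len = 2` and `cmd_len = 1`, leaving the value 96 in `out[0]`.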

Post by **ESP_Sprite** » Mon Jun 10, 2024 12:37 am

powerbroker wrote: ↑
Sun Jun 09, 2024 11:15 am
i.e. as you say, 'wc.result' value gets destroyed by DMA, writing 4-bytes chunk to 'wc.ioBuffer' address.

Yeah, that sounds like it; seems whatever your struct is squashes multiple fields into one 32-bit word.

so, there is a question still remaining: if we malloc(2), is the block DWORD-aligned so DMA can handle it, and is there enough spare space after it to do DMA safely?

Yes. Malloc always allocates on a 32-bit boundary, and the size you get back is rounded up to the next multiple of 32 bits.

and one more question: when we use the same buffer for Tx and Rx, does DMA destroy the remaining 3 bytes of a chunk after transmitting/receiving the first one? e.g. we are transmitting '10 2F EE 04' - is it overwritten with the received 4-byte result only after the 4th byte is sent, or does it turn into '00 00 00 00' after the 1st byte Tx/Rx?

I'd generally advise against using the same buffer for Tx and Rx. In practice, DMA has a bunch of FIFOs and buffering, so I think it'd only start writing data out after it has already read up to 16 words or so into its internal buffers, so it would work out, but I don't think we'd guarantee that works the same in all future chips.

Post by **powerbroker** » Mon Jun 10, 2024 5:18 am

ESP_Sprite wrote: ↑
Mon Jun 10, 2024 12:37 am
Malloc always allocates on a 32-bit boundary, and the size you get back is rounded up to the next multiple of 32 bits.

I'd generally advise against using the same buffer for Tx and Rx. In practice, DMA has a bunch of FIFOs and buffering, so I think it'd only start writing data out after it has already read up to 16 words or so into its internal buffers, so it would work out, but I don't think we'd guarantee that works the same in all future chips.

thanks: malloc'ing an intermediate buffer is a lifesaver here

regarding future chips: it would be absolutely terrific to have byte- and word-length capable DMA as well (like e.g. STM32 has), so users don't need to care about all this DWORD alignment and the damage to nearby data caused by the minimal chunk size.
in many cases that means one intermediate memory copy less.

Post by **MicroController** » Mon Jun 10, 2024 10:37 am

I am 100% positively not sure what you're actually saying while throwing several unrelated things in. (What do the unions declared inside struct spi_transaction_t have to do with anything?)
Are you saying that the SPI driver writes more data to the user-provided RX memory than what the documentation says it should? If so, I guess that'd potentially be a bug in the driver (or documentation), but I am under the impression that the SPI driver will handle all cases correctly, or at least as documented. Can't tell though because, well, your posts are confusing and lacking essential information. (E.g. if you're using half- or full-duplex mode.)

If you want a certain memory alignment, you should explicitly ask for it, e.g.
a) uint8_t my_buffer[BUF_SIZE] __attribute__((aligned( 4 )))
b) uint8_t* my_buffer = (uint8_t*) aligned_alloc(4, BUF_SIZE)
c) uint8_t* my_buffer = (uint8_t*) heap_caps_aligned_alloc(4, BUF_SIZE, MALLOC_CAP_DMA)
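Alignment alone does not cover the 4-byte write granularity discussed earlier in the thread; the buffer length has to be rounded up as well. Below is a host-side sketch of option b) with that rounding added. `dma_padded_size` and `alloc_rx_buffer` are hypothetical helper names; on target, option c) with `heap_caps_aligned_alloc` and `MALLOC_CAP_DMA` would be the analogous call.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Rounds a byte count up to the next multiple of 4. */
static size_t dma_padded_size(size_t n)
{
    return (n + 3u) & ~(size_t)3u;
}

/* Hypothetical helper: allocates a 4-byte-aligned receive buffer whose
 * capacity is a multiple of 4, so a word-sized write past 'wanted' bytes
 * still lands inside the allocation. Caller releases it with free(). */
static uint8_t *alloc_rx_buffer(size_t wanted, size_t *capacity)
{
    assert(wanted > 0 && capacity != NULL);
    size_t padded = dma_padded_size(wanted);
    /* C11 aligned_alloc: the size must be a multiple of the alignment. */
    uint8_t *buf = aligned_alloc(4, padded);
    if (buf != NULL) {
        *capacity = padded;
    }
    return buf;
}
```

Requesting 3 bytes this way yields a 4-byte buffer, and requesting 9 yields 12, so the tail of the last word never belongs to some other object.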

Post by **eriksl** » Tue Jun 11, 2024 9:50 am

Are you replying to what ESP_Sprite says? I think he was quite clear.

Anyway, I don't think it makes sense to start up DMA for just two (or four) bytes. The overhead will be huge and most of the handling will be done by the SPI module anyway. I remember having seen a table in the documentation with break-even points for amounts of data between handling the data directly from/to the module or using DMA, and I also remember the break-even point wasn't very small, something around 128 bytes.

Also, ESP_Sprite already said that malloc (and variants) return aligned memory blocks. This is by concept a requirement of malloc. It will always return an address aligned to the largest alignment requirement of the processor.

If you want a static or auto (stack) to be aligned, you should indeed use the alignment attributes OR use the dirty trick to declare a "large" type (like uint32_t) and cast it to a char or small char array [sizeof(uint32_t)].
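The "large type" trick might look like the sketch below. The names `RX_BYTES`, `RX_WORDS`, `rx_words`, `rx_bytes` and `rx_capacity` are made up for illustration, and 9 bytes is just an example payload size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RX_BYTES 9u                        /* payload size wanted (example) */
#define RX_WORDS ((RX_BYTES + 3u) / 4u)    /* same size in whole 32-bit words */

/* Declaring the storage as uint32_t gives it 32-bit alignment and a
 * length that is a multiple of 4 bytes. */
static uint32_t rx_words[RX_WORDS];

/* Byte view of the same storage; reading an object through a uint8_t
 * pointer is permitted by the C aliasing rules. */
static uint8_t *rx_bytes(void)
{
    return (uint8_t *)rx_words;
}

/* Usable capacity in bytes (12 for a 9-byte payload). */
static size_t rx_capacity(void)
{
    return sizeof rx_words;
}
```

Compared with the alignment attribute, this version also pads the length to whole words, which matters when whatever writes the buffer works in 4-byte units.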

Post by **powerbroker** » Sat Jun 15, 2024 3:24 pm

MicroController wrote: ↑
Mon Jun 10, 2024 10:37 am
I am 100% positively not sure what you're actually saying while throwing several unrelated things in. (What do the unions declared inside struct spi_transaction_t have to do with anything?)
Are you saying that the SPI driver writes more data to the user-provided RX memory than what the documentation says it should? If so, I guess that'd potentially be a bug in the driver (or documentation), but I am under the impression that the SPI driver will handle all cases correctly, or at least as documented. Can't tell though because, well, your posts are confusing and lacking essential information. (E.g. if you're using half- or full-duplex mode.)

If you want a certain memory alignment, you should explicitly ask for it, e.g.
a) uint8_t my_buffer[BUF_SIZE] __attribute__((aligned( 4 )))
b) uint8_t* my_buffer = (uint8_t*) aligned_alloc(4, BUF_SIZE)
c) uint8_t* my_buffer = (uint8_t*) heap_caps_aligned_alloc(4, BUF_SIZE, MALLOC_CAP_DMA)

about these unions inside spi_transaction_t: our colleague mentioned some intermediate buffer(s) the SPI driver uses when user-provided ones aren't dword-aligned. well... if the driver uses e.g. the room of the rx_buffer pointer as a 4-byte data buffer, it most likely relies on structure member alignment and most likely doesn't care whether the user aligned the buffer area in the rx_buffer-as-pointer case (AFAIK there is a DMA failure code indicating buffer misalignment - so, align-it-yourself again).
the kind of alignment one could imagine here is dropping the data received during the transmission stage: e.g. when the SPI master transmits a 1-byte address or command and the slave starts responding at the second byte, the first received byte is meaningless and the data could be shifted one byte left. that's not what happens here either.

let us ensure correctness of SPI driver behavior here together: 13984.
in the case of 9-byte full-duplex I/O, only 8 bytes of the user rx buffer are populated. i could believe this is correct behavior, but the hardware DOES clock 9 bytes (it's full-duplex and the scope has only 2 channels, so there is no MOSI line on screen, sorry):

Attachment: SPI-IO-9bytes.png (13.43 KiB)

so where is the 9th byte, which is processed by the hardware too?
i think it's getting lost somewhere inside static inline void spi_ll_read_buffer(...) in spi_ll.h, near its for(...) loop, isn't it?

and all these explicit alignments are really useful, thanks
