Performance Profile Interpretation

DrMickeyLauer
Posts: 168
Joined: Sun May 22, 2022 2:42 pm

Performance Profile Interpretation

Postby DrMickeyLauer » Tue May 07, 2024 4:30 pm

I don't have a performance problem. Still, in order to analyze whether there is optimization potential, I used OpenOCD to gather a profile of my app during a typical workload:

Code: Select all

gprof ECUconnectOS-s3devkit.elf /tmp/gmon.out

Flat profile:
 
Each sample counts as 0.0625 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  Ts/call  Ts/call  name
 33.12     21.07    21.07                             esp_cpu_stall
 21.72     34.89    13.82                             esp_cpu_unstall
  7.24     39.49     4.61                             esp_gpio_reserve_pins
  3.34     41.62     2.13                             get_call_stack
  3.10     43.59     1.97                             esp_ptr_dma_ext_capable
  2.07     44.91     1.32                             esp_cpu_wait_for_intr
  1.27     45.72     0.81                             xPortEnterCriticalTimeout
  1.12     46.43     0.71                             spi_hal_setup_trans
  1.04     47.09     0.66                             esp_event_isr_post_to
  1.04     47.75     0.66                             spi_hal_prepare_data
  0.70     48.20     0.45                             esp_cpu_compare_and_set
  0.67     48.63     0.43                             _WindowOverflow8
  0.67     49.05     0.43                             _WindowUnderflow8
  0.66     49.47     0.42                             vPortExitCritical
  0.52     49.80     0.33                             esp_ptr_byte_accessible
  0.51     50.13     0.32                             _xt_alloca_exc
  0.46     50.42     0.29                             _WindowOverflow12
  0.41     50.68     0.26                             __MCP251XFD_ReadDataNormal
  0.39     50.93     0.25                             spi_ll_master_set_line_mode
  0.38     51.18     0.24                             xQueueSemaphoreTake
  0.36     51.41     0.23                             prvTaskPriorityRestore
  0.35     51.63     0.22                             pxPortInitialiseStack
  0.34     51.84     0.22                             tlsf_malloc
  0.31     52.04     0.20                             __MCP251XFD_WriteDataNormal
  0.31     52.24     0.20                             spi_ll_write_buffer
  0.31     52.44     0.20                             vApplicationStackOverflowHook
  0.31     52.63     0.19                             heap_caps_malloc_base
  0.26     52.80     0.17                             xTaskGenericNotifyWait
  0.26     52.96     0.16                             tlsf_free
  0.25     53.13     0.16                             spi_hal_user_start
  0.25     53.29     0.16                             _WindowOverflow4
  0.25     53.44     0.16                             dram_alloc_to_iram_addr
  0.23     53.59     0.14                             vTaskDelete
  0.19     53.71     0.12                             systimer_hal_get_counter_value
  0.19     53.83     0.12                             MCP251XFD_TransmitMessageToFIFO
  0.18     53.95     0.12                             void std::vector<unsigned char, std::allocator<unsigned char> >::_M_range_insert<unsigned char*>(__gnu_cxx::__normal_iterator<unsigned char*, std::vector<unsigned char, std::allocator<unsigned char> > >, unsigned char*, unsigned char*, std::forward_iterator_tag)
  ...
So the heaviest 6 functions seem to be

Code: Select all

 33.12     21.07    21.07                             esp_cpu_stall
 21.72     34.89    13.82                             esp_cpu_unstall
  7.24     39.49     4.61                             esp_gpio_reserve_pins
  3.34     41.62     2.13                             get_call_stack
  3.10     43.59     1.97                             esp_ptr_dma_ext_capable
  2.07     44.91     1.32                             esp_cpu_wait_for_intr
I did not check what these are doing, but I guess they're OS and hardware level management functions that ­are already heavily optimized.

If I analyze this correctly, then it looks the device is idle most of the time, waiting for peripheral devices to answer.

The first entry that actually shows my application code is way below...

Code: Select all

  0.41     50.68     0.26                             __MCP251XFD_ReadDataNormal
So with 0.26 seconds of a total of 60 seconds which I've captured, I guess all optimization potential is moot.

MicroController
Posts: 1701
Joined: Mon Oct 17, 2022 7:38 pm
Location: Europe, Germany

Re: Performance Profile Interpretation

Postby MicroController » Tue May 07, 2024 9:23 pm

Stall/unstall/get_call_stack... I'd expect these functions to never be called during normal operation. Makes me doubt the usefulness of the profiling data, because to me it looks like the debugger/profiler is mostly profiling its own instrumentation.

Interesting would be the time spent in esp_cpu_wait_for_intr as that's basically the time the CPU was idle, but you'd likely get more reliable data on that from FreeRTOS's runtime stats for the IDLE task(s).

DrMickeyLauer
Posts: 168
Joined: Sun May 22, 2022 2:42 pm

Re: Performance Profile Interpretation

Postby DrMickeyLauer » Wed May 08, 2024 9:41 am

FWIW, here's another profile where I didn't forget to turn off heap tracing (*cough*) and switched to -Os:

Code: Select all

Flat profile:

Each sample counts as 0.0625 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  Ts/call  Ts/call  name
 25.40     26.67    26.67                             esp_cpu_unstall
 20.05     47.72    21.05                             esp_ptr_byte_accessible
 14.03     62.46    14.74                             gdma_default_tx_isr
  9.36     72.28     9.82                             esp_gpio_reserve_pins
  8.02     80.70     8.42                             esp_stack_ptr_in_extram
  4.01     84.91     4.21                             esp_ptr_dma_ext_capable
  4.01     89.12     4.21                             esp_ptr_external_ram
  2.67     91.93     2.81                             esp_cpu_wait_for_intr
  0.72     92.69     0.76                             xPortEnterCriticalTimeout
  0.49     93.20     0.51                             spi_hal_setup_trans
  0.26     93.47     0.27                             vTaskSwitchContext
  0.25     93.73     0.26                             tlsf_free
  0.22     93.96     0.23                             esp_cpu_stall
  0.19     94.16     0.20                             vPortExitCritical
  0.16     94.33     0.17                             esp_cpu_compare_and_set
  0.13     94.47     0.13                             pxPortInitialiseStack
  0.11     94.59     0.12                             systimer_hal_set_alarm_period
  0.10     94.69     0.11                             multi_heap_aligned_alloc_impl_offs
  0.10     94.80     0.11                             _WindowOverflow12
  0.10     94.91     0.11                             _WindowOverflow8
  0.10     95.02     0.11                             _WindowUnderflow8
  0.10     95.12     0.10                             multi_heap_aligned_free
  0.09     95.22     0.10                             spi_device_polling_end
  0.09     95.32     0.10                             CANChannelManager::handleIncomingFrame(CANMessage&)
  0.09     95.41     0.09                             systimer_hal_set_alarm_target
  0.09     95.50     0.09                             vApplicationStackOverflowHook
  0.09     95.60     0.09                             systimer_hal_get_counter_value
  0.09     95.69     0.09                             xQueueGenericSend
  0.08     95.77     0.09                             systimer_hal_counter_value_advance
  0.08     95.86     0.09                             multi_heap_malloc
  0.08     95.95     0.09                             spi_device_polling_start
  0.08     96.04     0.09                             vTaskResume
  0.08     96.12     0.08                             CANyonero::ISOTP::Transceiver::parseFlowControlFrame(std::vector<unsigned char, std::allocator<unsigned char> > const&)
  0.08     96.20     0.08                             MCP251XFD_ReadData
  0.08     96.28     0.08                             _xt_alloca_exc
  0.07     96.36     0.08                             multi_heap_realloc_impl
  0.07     96.43     0.07                             systimer_hal_select_alarm_mode
  0.07     96.50     0.07                             tlsf_malloc
  0.07     96.57     0.07                             uxTaskGetSystemState
  0.06     96.64     0.06                             ESP_MCP251XFD::handleIRQ()
  0.06     96.70     0.06                             spi_hal_prepare_data
  0.06     96.76     0.06                             CANyonero::ISOTP::Transceiver::Action::Action(CANyonero::ISOTP::Transceiver::Action&&)
  0.05     96.82     0.06                             prvAddCurrentTaskToDelayedList
  0.05     96.87     0.05                             spi_hal_setup_device
  0.05     96.91     0.05                             netconn_write_vectors_partly
  0.05     96.96     0.05                             vPortYield
  0.04     97.01     0.05                             _frxt_dispatch
  0.04     97.05     0.04                             CANController::send(unsigned long, std::vector<unsigned char, std::allocator<unsigned char> >)
  0.04     97.09     0.04                             spi_hal_user_start
  0.04     97.13     0.04                             CANyonero::ISOTP::Transceiver::parseDataFrame(std::vector<unsigned char, std::allocator<unsigned char> > const&)
  0.04     97.17     0.04                             vPortTCBPreDeleteHook
  0.03     97.21     0.04                             CANyonero::PDU::containsPDU(std::vector<unsigned char, std::allocator<unsigned char> > const&)
    0.03     97.25     0.04                             _xt_lowint1
  0.03     97.28     0.04                             prvIsYieldRequiredSMP
  0.03     97.32     0.03                             multi_heap_aligned_alloc_offs
  0.03     97.35     0.03                             multi_heap_internal_lock
  0.03     97.38     0.03                             xPortInIsrContext
  0.03     97.42     0.03                             esp_event_isr_post_to
  0.03     97.45     0.03                             esp_vfs_register_fd_range
  0.03     97.48     0.03                             add_entry
  0.03     97.52     0.03                             xTaskPriorityDisinherit
  0.03     97.55     0.03                             __gnu_cxx::__mutex::unlock()
  0.03     97.58     0.03                             prvCopyDataToQueue
  0.03     97.61     0.03                             xPortStartScheduler
  0.03     97.64     0.03                             xTaskPriorityInherit
  0.03     97.67     0.03                             recv_tcp
  0.03     97.70     0.03                             heap_caps_malloc_prefer
  0.03     97.73     0.03                             vTaskDelete
  0.03     97.76     0.03                             multi_heap_internal_unlock
  0.03     97.78     0.03                             SysTickIsrHandler
  0.03     97.81     0.03                             _xt_medint2
  0.03     97.84     0.03                             _WindowUnderflow4
  0.03     97.87     0.03                             CANyonero::ISOTP::Transceiver::didReceiveFrame(std::vector<unsigned char, std::allocator<unsigned char> > const&)
  0.03     97.89     0.03                             systimer_hal_init
  0.03     97.92     0.03                             xQueueGetMutexHolder
  0.02     97.94     0.02                             vPortYieldFromInt
  0.02     97.97     0.02                             xTaskGetSchedulerState
  0.02     97.99     0.02                             esp_task_wdt_init
  0.02     98.02     0.02                             vSystimerSetup
  0.02     98.04     0.02                             xQueueGenericCreate
  0.02     98.06     0.02                             lwip_unlink_select_cb
  0.02     98.09     0.02                             xQueueCreateCountingSemaphore
  0.02     98.11     0.02                             tcp_recved
  0.02     98.13     0.02                             heap_caps_malloc_default
  0.02     98.15     0.02                             spi_bus_lock_bg_clear_req
  0.02     98.17     0.02                             uxListRemove
  0.02     98.19     0.02                             systimer_hal_set_tick_rate_ops
  0.02     98.21     0.02                             mcp251x_spi_transfer(void*, unsigned char, unsigned char*, unsigned char*, unsigned int)
  0.02     98.23     0.02                             LittleFS::partition()
  0.02     98.25     0.02                             esp_task_wdt_reset
  0.02     98.26     0.02                             rmt_dma_tx_eof_cb
  0.02     98.28     0.02                             uninstall_priv_desc
  0.02     98.30     0.02                             _GLOBAL__sub_D__ZN13CANControllerC2EP17CANChannelManager
  0.02     98.32     0.02                             alloc_socket
  0.02     98.33     0.02                             eTaskGetState
  0.02     98.35     0.02                             multi_heap_get_allocated_size
  0.02     98.37     0.02                             sha_hal_read_digest
  0.02     98.39     0.02                             tcpip_send_msg_wait_sem
  0.02     98.40     0.02                             tcpip_try_callback
  0.02     98.42     0.02                             tlsf_block_size
  0.02     98.44     0.02                             tlsf_block_size_max
  0.02     98.46     0.02                             vPortYieldOtherCore
  0.02     98.47     0.02                             vTaskPlaceOnEventList
  0.02     98.49     0.02                             esp_task_wdt_print_triggered_tasks
  0.02     98.51     0.02                             _xt_context_restore
  0.02     98.52     0.02                             _xt_coproc_release
  0.01     98.54     0.02                             spi_hal_usr_is_done
  0.01     98.55     0.02                             tcpip_api_call
  0.01     98.57     0.02                             vTaskPlaceOnUnorderedEventList
  . . .
Again, with my code not showing up until the 20-ish-th position, it probably makes just a nice learning experience rather than an opportunity for optimization.

MicroController
Posts: 1701
Joined: Mon Oct 17, 2022 7:38 pm
Location: Europe, Germany

Re: Performance Profile Interpretation

Postby MicroController » Thu May 09, 2024 6:02 pm

DrMickeyLauer wrote:
Wed May 08, 2024 9:41 am
FWIW, here's another profile where I didn't forget to turn off heap tracing (*cough*) and switched to -Os:
;-)
Again, with my code not showing up until the 20-ish-th position, it probably makes just a nice learning experience rather than an opportunity for optimization.
I don't agree with this unconditionally. After all, calling even a perfectly "optimized" function is still slower than not calling a function at all. So if you're after performance/efficiency, you may want to check if/where you may be calling (expensive) functions more often than necessary, like, for instance, when repeatedly malloc'ing+free'ing the 'same' buffer.

DrMickeyLauer
Posts: 168
Joined: Sun May 22, 2022 2:42 pm

Re: Performance Profile Interpretation

Postby DrMickeyLauer » Fri May 10, 2024 12:27 pm

True. My code would definitely benefit from an area allocator, but I'm not sure I want to go into that rabbit hole.

A problem with OpenOCD's profile is that I don't get the actual number a function is called. I need to bring that with the OpenOCD team, perhaps they can fix it.

MicroController
Posts: 1701
Joined: Mon Oct 17, 2022 7:38 pm
Location: Europe, Germany

Re: Performance Profile Interpretation

Postby MicroController » Fri May 10, 2024 12:43 pm

DrMickeyLauer wrote:
Fri May 10, 2024 12:27 pm
My code would definitely benefit from an area allocator, but I'm not sure I want to go into that rabbit hole.
You could look into using one of the many "memory pool" or "object pool" libraries out there, which could also help reduce heap fragmentation.

Bryght-Richard
Posts: 40
Joined: Thu Feb 22, 2024 3:59 pm

Re: Performance Profile Interpretation

Postby Bryght-Richard » Tue Jul 30, 2024 5:58 pm

Mickey, OpenOCD has an internal limiter that causes it to round program counter samples once the sampled memory space exceeds the limit. On ESP and other semi-sparse memory map systems, it can be tricky to get a useful sampling with OpenOCD, because as soon as samples cross two memories(ROM, PSRAM, Flash, IRAM), then the span of samples is very wide, and rounding occurs. This may be the cause of some of your unexpected PC samples.

You can increase this limit inside OpenOCD, here's a patch to do so, though for ESP chips it may need increased further.

Who is online

Users browsing this forum: No registered users and 67 guests