
Proper netif/ppp disconnect/close? TCP timeout retry is crashing my app

Posted: Mon Sep 04, 2023 9:44 pm
by tpbedford
Hi, we are using a Quectel EC21 modem in NB-IoT mode (slow! dial-up speeds with latency measured in seconds).

What's the proper process for initiating a close/disconnect/destroy of our modem driver task?

Our issue is that if we close an active data connection and tear down the DTE, DCE, PPP, etc., then about 5 s later the TCP/IP thread invariably attempts some kind of retry after a timeout and tries to access destroyed/freed objects.

We're using ESP-IDF 5.1 (lwIP, PPP, esp-netif), and our modem driver was built on some esp-modem example code that we found.

Our stack trace causing the error is this:

Code:

0x4014021d: pppos_low_level_output at C:/Espressif/frameworks/esp-idf/components/esp_netif/lwip/esp_netif_lwip_ppp.c:195
0x40135048: pppos_output_last at C:/Espressif/frameworks/esp-idf/components/lwip/lwip/src/netif/ppp/pppos.c:878
0x401351e9: pppos_write at C:/Espressif/frameworks/esp-idf/components/lwip/lwip/src/netif/ppp/pppos.c:241
0x40134afa: ppp_write at C:/Espressif/frameworks/esp-idf/components/lwip/lwip/src/netif/ppp/ppp.c:996
0x4013e07b: fsm_sdata at C:/Espressif/frameworks/esp-idf/components/lwip/lwip/src/netif/ppp/fsm.c:796
0x4013e0dc: fsm_timeout at C:/Espressif/frameworks/esp-idf/components/lwip/lwip/src/netif/ppp/fsm.c:282
0x40130f35: sys_check_timeouts at C:/Espressif/frameworks/esp-idf/components/lwip/lwip/src/core/timeouts.c:401
0x4012ac8a: tcpip_timeouts_mbox_fetch at C:/Espressif/frameworks/esp-idf/components/lwip/lwip/src/api/tcpip.c:109
0x4012ad46: tcpip_thread at C:/Espressif/frameworks/esp-idf/components/lwip/lwip/src/api/tcpip.c:142
0x4008cf12: vPortTaskWrapper at C:/Espressif/frameworks/esp-idf/components/freertos/FreeRTOS-Kernel/portable/xtensa/port.c:162  
The top-most line there is:

Code:

esp_err_t ret = esp_netif_transmit(netif, data, len);
where esp_netif_transmit is defined as:

Code:

esp_err_t esp_netif_transmit(esp_netif_t *esp_netif, void* data, size_t len)
{
    return (esp_netif->driver_transmit)(esp_netif->driver_handle, data, len);
}
(This function is clearly inlined, so it doesn't show in the stack trace.) I believe the esp_netif->driver_transmit dereference is the issue, because the netif instance has already been destroyed/freed by the time the timeout fires.
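
To make the suspected failure mode concrete, here is a minimal standalone C model of it. This is not ESP-IDF code: fake_netif_t, fake_timer_t and every function below are hypothetical stand-ins, and the assumption is simply that a pending retry timer still holds a pointer to the netif when it is freed. The sketch shows the guarded version, where the timer is disarmed before the free so a late expiry does nothing.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for esp_netif_t: only the transmit hook matters here. */
typedef struct {
    int (*driver_transmit)(void *handle, const void *data, size_t len);
    void *driver_handle;
} fake_netif_t;

/* Hypothetical stand-in for one pending retry timeout that references the netif. */
typedef struct {
    bool armed;
    fake_netif_t *netif;
} fake_timer_t;

static int sent_count = 0;

/* Dummy driver: just counts how many frames were "sent". */
static int fake_transmit(void *handle, const void *data, size_t len)
{
    (void)handle; (void)data; (void)len;
    sent_count++;
    return 0;
}

fake_netif_t *fake_netif_create(void)
{
    fake_netif_t *n = malloc(sizeof(*n));
    if (n != NULL) {
        n->driver_transmit = fake_transmit;
        n->driver_handle = NULL;
    }
    return n;
}

/* What the timer thread does on expiry. Returns -1 if there is nothing to do. */
int fake_timer_fire(fake_timer_t *t)
{
    if (!t->armed || t->netif == NULL) {
        return -1; /* timer was cancelled, so the netif is never touched */
    }
    t->armed = false;
    return t->netif->driver_transmit(t->netif->driver_handle, "retry", 5);
}

/* Guarded teardown: disarm any timer that references the netif, then free it. */
void fake_netif_destroy_safe(fake_netif_t **netif, fake_timer_t *t)
{
    if (t->netif == *netif) {
        t->armed = false;
        t->netif = NULL;
    }
    free(*netif);
    *netif = NULL;
}
```

Without the disarm step in fake_netif_destroy_safe(), fake_timer_fire() would dereference freed memory, which matches the shape of the stack trace above.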

Our close/disconnect sequence is this:

Code:

// we call this from our task
esp_modem_stop_ppp(dte); // which posts ESP_MODEM_EVENT_PPP_STOP
// the above event leads to these being executed by the driver:
esp_netif_stop()
esp_netif_stop_ppp()
// here we see   --> NETIF_PPP_PHASE_TERMINATE   --> NETIF_PPP_PHASE_NETWORK  --> NETIF_PPP_PHASE_ESTABLISH
// we then call deinit DCE
dce->deinit(dce);
ec21_deinit(dce); // frees DCE instance

// then we destroy netif
esp_modem_netif_clear_default_handlers(modem_netif_adapter);
esp_modem_netif_teardown(modem_netif_adapter);
esp_netif_destroy(esp_netif);
// finally clean up dte
dte->deinit(dte);
... and about 5 s after the esp_netif_destroy() call, the TCP/IP thread hits a timeout and retries, and we crash with the stack trace shown earlier.
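
A rough standalone model of one ordering that might avoid this: only destroy the netif after the link has reported a final "dead" phase. Everything here is hypothetical (model_link_t, the PHASE_* values, and both functions are invented for illustration), and whether lwIP/esp-netif actually reports such a phase before teardown becomes safe is an assumption, not something confirmed in this setup.

```c
#include <stdbool.h>

/* Hypothetical phases, loosely modelled on the NETIF_PPP_PHASE_* names above. */
typedef enum { PHASE_RUNNING, PHASE_TERMINATE, PHASE_DEAD } model_phase_t;

typedef struct {
    model_phase_t phase;   /* last phase reported by the (modelled) PPP stack */
    bool netif_destroyed;  /* set once teardown has actually happened */
} model_link_t;

/* Would be called from the phase-change event handler. */
void model_on_phase(model_link_t *link, model_phase_t phase)
{
    link->phase = phase;
}

/* Destroys the netif only once the dead phase has been seen.
 * Returns true if destroyed, false if the caller has to wait and retry. */
bool model_try_destroy(model_link_t *link)
{
    if (link->phase != PHASE_DEAD) {
        return false;
    }
    link->netif_destroyed = true;
    return true;
}
```

In a real driver the "wait and retry" would more likely be a blocking wait on a semaphore or event group with a timeout, rather than polling.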