SOLVED: Lost TLS connection causes apparent LWIP memory leak

jcolebaker
Posts: 64
Joined: Thu Mar 18, 2021 12:23 am

SOLVED: Lost TLS connection causes apparent LWIP memory leak

Postby jcolebaker » Mon Jul 31, 2023 4:17 am

Bear with me, I've been trying to debug what looked like a memory leak in LWIP for the last week and it has been a journey.

We're using TLS to connect to an AWS MQTT broker from an IoT device. We've been testing on a cellular Wi-Fi hotspot which sometimes has poor signal, and this has revealed a problem with LWIP consuming heap memory and not releasing it.

Here's what happens:
* Device establishes Wi-Fi connection to cellular Wi-Fi hotspot.
* Device establishes TLS connection to MQTT broker. All works OK.
* Cellular signal fluctuates, we lose the network connection, and MQTT ping is missed (but Wi-Fi is still up).
* We disconnect the TLS connection and re-connect.
* MQTT works again.

This works, but we lose ~1900 bytes of free heap each time the TLS connection is lost and re-established. We don't have a lot of spare heap, so after ~5-10 reconnections we don't have enough free heap to start another TLS connection and are forced to reboot.

Here is the code used to create/open and close the TLS connection. From the header, it looks like this is all we should need to do (i.e. no other cleanup required):

Code: Select all

// Create / Connect:
esp_tls_t* pxTls = esp_tls_init();
esp_tls_conn_new_sync(...);

// Disconnect / Clean up:
esp_tls_conn_destroy(pxTls);
Note that internally, ESP-TLS uses the "sockets" interface to open and close the TCP connection, i.e. "socket()", "connect()", "close()".
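
For context, here is a filled-out sketch of those calls. This is a sketch only: the hostname and port are placeholders, and the cfg contents depend on your certificates (return conventions per esp_tls.h, where esp_tls_conn_new_sync returns 1 on success):

Code: Select all

#include <string.h>
#include "esp_tls.h"

// Sketch only: placeholder hostname/port; cfg needs your CA cert etc.
static void tls_connect_example(void)
{
    esp_tls_cfg_t cfg = { 0 };

    esp_tls_t* pxTls = esp_tls_init();
    if (pxTls == NULL) {
        return;  // Out of memory.
    }

    const char* host = "mqtt.example.com";
    if (esp_tls_conn_new_sync(host, strlen(host), 8883, &cfg, pxTls) == 1) {
        // Connected: run MQTT over pxTls.
    }

    // Disconnect / clean up (closes the underlying socket too):
    esp_tls_conn_destroy(pxTls);
}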

Heap tracing and a lot of code analysis show that we are accumulating the MQTT ping messages sent after the data connection is lost. These are stored as "struct tcp_seg" structures in LWIP (see esp-idf\components\lwip\lwip\src\core\tcp.c, etc.). They sit in the "unacked" segment list under the Protocol Control Block struct (struct tcp_pcb) for the connection. The PCB is referenced by the global list of active connection PCBs, "tcp_active_pcbs", and also via the "sockets" list while it is open.
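
For anyone following along in the LWIP source, here's a heavily trimmed sketch of the structures involved (the real definitions are in lwip/src/include/lwip/tcp.h and tcp_priv.h; most fields omitted):

Code: Select all

/* Trimmed sketch, not the real definitions. */
struct tcp_seg {
  struct tcp_seg *next;    /* segments are kept in linked lists */
  struct pbuf *p;          /* buffer holding the segment data */
  /* ... */
};

struct tcp_pcb {
  /* ... */
  struct tcp_seg *unsent;  /* segments queued but not yet sent */
  struct tcp_seg *unacked; /* sent but un-ACKed: where our MQTT pings pile up */
  /* ... */
};

/* Global list of PCBs for active connections (tcp.c / tcp_priv.h). */
extern struct tcp_pcb *tcp_active_pcbs;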

In the case where the data connection is lost and the MQTT ping times out, we close the connection with esp_tls_conn_destroy as above. I thought these un-acked segments were never freed after that, but in fact they are: when I re-connected and left the new connection running for ~20-25 minutes, the old PCB and its segments were eventually freed and removed from "tcp_active_pcbs" by the "TCP slow timer" function ("tcp_slowtmr"), and the "lost" heap was restored. However, this seems much too long to clean up a dead connection (which has already been closed via socket -> close()!), and in our case it frequently leads to running out of heap due to reconnections on dodgy cellular connections.

I've instrumented the LWIP code to trace the state of the dead connection. When "close()" is called, the PCB is placed in the "FIN_WAIT_1" state and remains in the "tcp_active_pcbs" list. Maybe it is waiting for an ACK or FIN from the other end, which it never receives?

In this state, "tcp_slowtmr" runs every 500 ms and periodically re-sends the un-acked segments, using an exponentially increasing retry interval and (by default) 12 retries. After 12 retries, the PCB is finally cleaned up.

However, the "retry" timeout (pcb->rto) starts at typically 2, 3, or 4 "ticks" (1 tick = 500 mS slow timer) and DOUBLES every retry up to a certain max:

Code: Select all

  /* pcb->sa / pcb->sv are LWIP's smoothed RTT average / variance estimates;
     the shift doubles the timeout on each retry, capping at << 7 (x128). */
  static const u8_t tcp_backoff[13] = { 1, 2, 3, 4, 5, 6, 7, 7, 7, 7, 7, 7, 7};
  ...
  int calc_rto = ((pcb->sa >> 3) + pcb->sv) << tcp_backoff[backoff_idx];
The end result is that the dead connection is not cleaned up for at least 15 minutes, and potentially over 30 minutes.
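
A rough back-of-envelope (my numbers, assuming the base term ((pcb->sa >> 3) + pcb->sv) comes to 2 ticks; the real value depends on the measured RTT):

Code: Select all

// 1 tick = 500 ms; retry interval = base << tcp_backoff[n], capping at << 7.
// With base = 2 ticks (1 s), successive intervals are roughly:
//   1 s, 4 s, 8 s, 16 s, 32 s, 64 s, then ~128 s for the remaining retries
//   total over 12 retries ~ 1 + 4 + 8 + 16 + 32 + 64 + 6*128 = 893 s ~ 15 min
// With base = 4 ticks (2 s), everything doubles: ~30 minutes.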

Is this logic correct? It seems like a very long time for LWIP to keep retrying segments on a dead connection. We really need it to give up and clean up the old connection after a couple of minutes MAX, so that we don't run out of free heap when the connection is unstable! Fifteen minutes gives us plenty of time to disconnect and reconnect to the TLS server a few times and run out of heap.

Other Info:

We are using ESP-IDF 4.4.4

The LWIP config provides these options which look relevant:

LWIP_TCP_MSL - Maximum segment lifetime (MSL) - Default 60000 ms
LWIP_TCP_FIN_WAIT_TIMEOUT - Maximum FIN segment lifetime - Default 20000 ms
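
(For reference, these should appear in sdkconfig as follows, with the default values in ms; names inferred from the option names above:)

Code: Select all

CONFIG_LWIP_TCP_MSL=60000
CONFIG_LWIP_TCP_FIN_WAIT_TIMEOUT=20000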

However:
* LWIP_TCP_MSL is only used for connection state "LAST_ACK" or PCBs in the "TIME-WAIT" list.
* LWIP_TCP_FIN_WAIT_TIMEOUT is only used when the connection state is "FIN_WAIT_2", which we never reach with "dead" connections.
Last edited by jcolebaker on Wed Aug 02, 2023 3:06 am, edited 1 time in total.

jcolebaker
Posts: 64
Joined: Thu Mar 18, 2021 12:23 am

Re: Lost TLS connection causes apparent LWIP memory leak

Postby jcolebaker » Wed Aug 02, 2023 3:04 am

I have found a solution for this issue.

By default, the "socket linger" socket option implementation is not enabled for LWIP. Instead, the code always tries to re-send un-acked data after the socket is closed (as described above) and this can take 30 minutes or so, so it is a long time before the associated memory is freed.

To fix the problem, I enabled "SO_LINGER" support in the sdkconfig section for LWIP ("CONFIG_LWIP_SO_LINGER=y").

I then set the SO_LINGER time to 0 for the file descriptor used by the TLS connection, using setsockopt, as follows:

Code: Select all

struct linger linger_opt =
{
    .l_onoff = 1,   // Enable linger: Must be 1 for l_linger time to be used.
    .l_linger = 0,  // 0 = abortive close: purge un-acked segments immediately.
};

if (setsockopt(fd, SOL_SOCKET, SO_LINGER, &linger_opt, sizeof(linger_opt)) != 0) {
    ESP_LOGE(TAG, "Fail to setsockopt SO_LINGER");
    return ESP_ERR_ESP_TLS_SOCKET_SETOPT_FAILED;
}
Note that I wanted to set this option as soon as possible after the socket was created, so I added code to the "esp_tls_set_socket_options" function in esp_tls.
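
If you'd rather not patch IDF: since SO_LINGER only takes effect when close() is called, it should also be fine to set it from application code once the connection is up. Here's an untested sketch of that, assuming your IDF version provides esp_tls_get_conn_sockfd():

Code: Select all

#include <sys/socket.h>
#include "esp_log.h"
#include "esp_tls.h"

// Untested sketch: set SO_LINGER(0) on an established esp_tls connection,
// avoiding the need to modify esp_tls_set_socket_options() inside IDF.
static void set_abortive_close(esp_tls_t* pxTls)
{
    int fd = -1;
    if (esp_tls_get_conn_sockfd(pxTls, &fd) != ESP_OK) {
        return;
    }

    struct linger linger_opt =
    {
        .l_onoff = 1,   // Enable linger.
        .l_linger = 0,  // 0 => purge un-acked data immediately on close().
    };

    if (setsockopt(fd, SOL_SOCKET, SO_LINGER, &linger_opt, sizeof(linger_opt)) != 0) {
        ESP_LOGE("TLS", "setsockopt(SO_LINGER) failed");
    }
}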

With SO_LINGER set to 0, the LWIP code doesn't try to re-send any un-acked segments after the socket is closed. Instead, it purges the stored segments and frees the RAM immediately, which means our application doesn't run out of heap.

The disadvantage is that any data still being sent when the socket is closed is lost, and the TCP connection might not be shut down cleanly if close() is called on an active connection. This doesn't matter for us, since we only close the socket after a comms error, when we have already lost contact with the other end. Setting a small but reasonable SO_LINGER time (e.g. 10 s) should allow the connection to shut down properly in most cases (if my understanding is correct), but I haven't tried this.
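
For completeness, that untried graceful variant would just be:

Code: Select all

struct linger linger_opt =
{
    .l_onoff = 1,
    .l_linger = 10,  // Allow up to 10 s for a clean shutdown before purging.
};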

GianniM
Posts: 1
Joined: Thu Nov 30, 2023 12:13 pm

Re: SOLVED: Lost TLS connection causes apparent LWIP memory leak

Postby GianniM » Thu Nov 30, 2023 12:23 pm

Thanks for posting the solution! This also solved the issue I was facing in IDF v4.4.5.
