The context
Several nodes running as APs connecting to each other in a tree-like network; all nodes run the same code.
There's a pretty basic TCP server that listens for small packets of data; the server is running a 'classic' soccket() -> bind() -> listen() -> select() - accept() -> recv() scenario, using the recv() in a separate task; the socket file descriptor is released immediately after the recv completes.
The server processes an average of 2 packets per second. When a packet is received, it is relayed to some of the stations that are connected to the module, or to the 'parent' node depending on data.
The problem
Every now and then, the recv() or send() calls return error code 9, EBADF, and I can't seem to figure out why ...
Sometimes everything goes well for minutes, then there's a 'burst' of EBADFs; sometimes there's one EBADF every minute ort so... there doesn't seem to be a particular pattern.
Any help about debugging this issue is appeciated.
Thanks.
Random ERRNO 9 (EBADF) on recv() or send()
-
- Posts: 9764
- Joined: Thu Nov 26, 2015 4:08 am
Re: Random ERRNO 9 (EBADF) on recv() or send()
Are you sure you're handling your file descriptors correctly? No race conditions there?
Re: Random ERRNO 9 (EBADF) on recv() or send()
Well... That's part of the problem; there none that I could find. Stripping all error handling and other minor stuff, here's how it's done (in a simplified, c-like language)
It's pretty basic. This is how it's supposed to be done, right ?
There is some synchronization done in the processing section, but it does not involve the socket; besides, the socket fd is sent to the task and should become available for another operation after the close occurrs in the recvTask.
Code: Select all
function serverTask() {
server_socket = socket();
bind(server_socket);
listen(server_socket);
while (1) {
if (select(server_socket) >= 0) {
server_accept = accept(server_socket);
createTask(recvTask, server_accept);
}
}
}
function recvTask(server_accept) {
if (recv(server_accept) > 0) {
process(received_data);
}
closeSocket(server_accept)
}
There is some synchronization done in the processing section, but it does not involve the socket; besides, the socket fd is sent to the task and should become available for another operation after the close occurrs in the recvTask.
-
- Posts: 3
- Joined: Thu Oct 24, 2024 12:37 pm
Re: Random ERRNO 9 (EBADF) on recv() or send()
Did you ever get a grip on this issue?
I am seeing something similar with an AP setup on ESP32-S3 using ESP-IDF v5.3.1. I use the BSD sockets api to setup a TCP server:
- assign server socket file descriptor (serverFD) using bind() and then listen() with O_NONBLOCK flag set
- periodically poll file descriptors using select() with non-blocking timeout (0)
- if serverFD is set for reading - attempt to accept() a client connection as client socket file descriptor clientFD. If clientFD == -1 - close serverFD, otherwise assign clientFD
- if clientFD is set for reading - attempt to read data from clientFD using recv(). If zero bytes read - client closed the connection so close clientFD, if less than zero bytes read (and errno != EAGAIN/EINPROGRESS/EWOULDBLOCK) - connection error so close clientFD, if bytesRead - TCP data received from client - handle and queue responseBuffer
- if clientFD is set for writing and responseBuffer has data - attempt to transmit data to clientFD using send(). This is where I intermittently get errno == 9 (EBADF), which does not make sense because I have just received data on that FD and it has not closed yet
With the exact same sockets code using a STA Wifi connection, I DO NOT have the same issue.
I am seeing something similar with an AP setup on ESP32-S3 using ESP-IDF v5.3.1. I use the BSD sockets api to setup a TCP server:
- assign server socket file descriptor (serverFD) using bind() and then listen() with O_NONBLOCK flag set
- periodically poll file descriptors using select() with non-blocking timeout (0)
- if serverFD is set for reading - attempt to accept() a client connection as client socket file descriptor clientFD. If clientFD == -1 - close serverFD, otherwise assign clientFD
- if clientFD is set for reading - attempt to read data from clientFD using recv(). If zero bytes read - client closed the connection so close clientFD, if less than zero bytes read (and errno != EAGAIN/EINPROGRESS/EWOULDBLOCK) - connection error so close clientFD, if bytesRead - TCP data received from client - handle and queue responseBuffer
- if clientFD is set for writing and responseBuffer has data - attempt to transmit data to clientFD using send(). This is where I intermittently get errno == 9 (EBADF), which does not make sense because I have just received data on that FD and it has not closed yet
With the exact same sockets code using a STA Wifi connection, I DO NOT have the same issue.
-
- Posts: 3
- Joined: Thu Oct 24, 2024 12:37 pm
Re: Random ERRNO 9 (EBADF) on recv() or send()
Here is the LWIP Debug logs when this happens:FrancoisB-HEX wrote: ↑Thu Oct 24, 2024 1:03 pmDid you ever get a grip on this issue?
I am seeing something similar with an AP setup on ESP32-S3 using ESP-IDF v5.3.1. I use the berkley sockets api to setup a TCP server:
- assign server socket file descriptor (serverFD) using bind() and then listen() with O_NONBLOCK flag set
- periodically poll file descriptors using select() with non-blocking timeout (0)
- if serverFD is set for reading - attempt to accept() a client connection as client socket file descriptor clientFD. If clientFD == -1 - close serverFD, otherwise assign clientFD
- if clientFD is set for reading - attempt to read data from clientFD using recv(). If zero bytes read - client closed the connection so close clientFD, if less than zero bytes read (and errno != EAGAIN/EINPROGRESS/EWOULDBLOCK) - connection error so close clientFD, if bytesRead - TCP data received from client - handle and queue responseBuffer
- if clientFD is set for writing and responseBuffer has data - attempt to transmit data to clientFD using send(). This is where I intermittently get errno == 9 (EBADF), which does not make sense because I have just received data on that FD and it has not closed yet
With the exact same sockets code using a STA Wifi connection, I DO NOT have the same issue.
Code: Select all
D (174239) lwip: lwip_recvfrom(56): addr=
D (174239) lwip: 192.168.4.2
D (174240) lwip: port=51689 len=2
D (174284) lwip: lwip_select(57, 0x3fca8310, 0x3fca8318, 0x0, tvsec=0 tvusec=0)
D (174285) lwip: lwip_selscan: fd=56 ready for reading
D (174285) lwip: ip_input: iphdr->dest 0xa493214 netif->ip_addr 0x104a8c0 (0x493214, 0x4a8c0, 0xa000000)
D (174286) lwip: ip_input: iphdr->dest 0xa493214 netif->ip_addr 0x100007f (0x14, 0x7f, 0xa493200)
D (174286) lwip: ip4_input: packet not for us.
D (174287) lwip: lwip_selscan: fd=56 ready for writing
D (174287) lwip: lwip_select: nready=2
D (174287) lwip: lwip_recvfrom(56, 0x3fcb7084, 1440, 0x1, ..)
D (174288) lwip: lwip_recv_tcp: top while sock->lastdata=0x0
D (174288) lwip: lwip_recv_tcp: netconn_recv err=-15, pbuf=0x0
D (174289) lwip: lwip_recv_tcp: p == NULL, error is "Connection closed."!
D (174289) lwip: lwip_recvfrom(56): addr=
D (174289) lwip: 192.168.4.2
D (174289) lwip: port=51689 len=0
D (174290) lwip: lwip_close(56)
D (174290) lwip: tcp_close: closing in
D (174290) lwip: State: CLOSE_WAIT
D (174290) lwip: ip4_output_if: ap2
D (174291) lwip: IP header:
D (174291) lwip: +-------------------------------+
D (174291) lwip: | 4 | 5 | 0x00 | 68 | (v, hl, tos, len)
D (174291) lwip: +-------------------------------+
D (174292) lwip: | 179 |000| 0 | (id, flags, offset)
D (174292) lwip: +-------------------------------+
D (174292) lwip: | 64 | 6 | 0xf0ad | (ttl, proto, chksum)
D (174292) lwip: +-------------------------------+
D (174292) lwip: | 192 | 168 | 4 | 1 | (src)
D (174293) lwip: +-------------------------------+
D (174293) lwip: | 192 | 168 | 4 | 2 | (dest)
D (174293) lwip: +-------------------------------+
D (174293) lwip: ip4_output_if: call netif->output()
D (174295) lwip: lwip_send(56, data=0x3fcf21e4, size=4, flags=0x0)
W (174295) ./main/hsm/wifi/tcp-server-hsm.c: tcpServerCb_ClientTransmit: Error sending to client 56 on instance 1: 9
-
- Posts: 3
- Joined: Thu Oct 24, 2024 12:37 pm
Re: Random ERRNO 9 (EBADF) on recv() or send()
OK so my issue was that the client socket was in fact just closed before calling send() on its file descriptor. The sequence was:FrancoisB-HEX wrote: ↑Fri Oct 25, 2024 9:54 amHere is the LWIP Debug logs when this happens:FrancoisB-HEX wrote: ↑Thu Oct 24, 2024 1:03 pmDid you ever get a grip on this issue?
I am seeing something similar with an AP setup on ESP32-S3 using ESP-IDF v5.3.1. I use the berkley sockets api to setup a TCP server:
- assign server socket file descriptor (serverFD) using bind() and then listen() with O_NONBLOCK flag set
- periodically poll file descriptors using select() with non-blocking timeout (0)
- if serverFD is set for reading - attempt to accept() a client connection as client socket file descriptor clientFD. If clientFD == -1 - close serverFD, otherwise assign clientFD
- if clientFD is set for reading - attempt to read data from clientFD using recv(). If zero bytes read - client closed the connection so close clientFD, if less than zero bytes read (and errno != EAGAIN/EINPROGRESS/EWOULDBLOCK) - connection error so close clientFD, if bytesRead - TCP data received from client - handle and queue responseBuffer
- if clientFD is set for writing and responseBuffer has data - attempt to transmit data to clientFD using send(). This is where I intermittently get errno == 9 (EBADF), which does not make sense because I have just received data on that FD and it has not closed yet
With the exact same sockets code using a STA Wifi connection, I DO NOT have the same issue.Code: Select all
D (174239) lwip: lwip_recvfrom(56): addr= D (174239) lwip: 192.168.4.2 D (174240) lwip: port=51689 len=2 D (174284) lwip: lwip_select(57, 0x3fca8310, 0x3fca8318, 0x0, tvsec=0 tvusec=0) D (174285) lwip: lwip_selscan: fd=56 ready for reading D (174285) lwip: ip_input: iphdr->dest 0xa493214 netif->ip_addr 0x104a8c0 (0x493214, 0x4a8c0, 0xa000000) D (174286) lwip: ip_input: iphdr->dest 0xa493214 netif->ip_addr 0x100007f (0x14, 0x7f, 0xa493200) D (174286) lwip: ip4_input: packet not for us. D (174287) lwip: lwip_selscan: fd=56 ready for writing D (174287) lwip: lwip_select: nready=2 D (174287) lwip: lwip_recvfrom(56, 0x3fcb7084, 1440, 0x1, ..) D (174288) lwip: lwip_recv_tcp: top while sock->lastdata=0x0 D (174288) lwip: lwip_recv_tcp: netconn_recv err=-15, pbuf=0x0 D (174289) lwip: lwip_recv_tcp: p == NULL, error is "Connection closed."! D (174289) lwip: lwip_recvfrom(56): addr= D (174289) lwip: 192.168.4.2 D (174289) lwip: port=51689 len=0 D (174290) lwip: lwip_close(56) D (174290) lwip: tcp_close: closing in D (174290) lwip: State: CLOSE_WAIT D (174290) lwip: ip4_output_if: ap2 D (174291) lwip: IP header: D (174291) lwip: +-------------------------------+ D (174291) lwip: | 4 | 5 | 0x00 | 68 | (v, hl, tos, len) D (174291) lwip: +-------------------------------+ D (174292) lwip: | 179 |000| 0 | (id, flags, offset) D (174292) lwip: +-------------------------------+ D (174292) lwip: | 64 | 6 | 0xf0ad | (ttl, proto, chksum) D (174292) lwip: +-------------------------------+ D (174292) lwip: | 192 | 168 | 4 | 1 | (src) D (174293) lwip: +-------------------------------+ D (174293) lwip: | 192 | 168 | 4 | 2 | (dest) D (174293) lwip: +-------------------------------+ D (174293) lwip: ip4_output_if: call netif->output() D (174295) lwip: lwip_send(56, data=0x3fcf21e4, size=4, flags=0x0) W (174295) ./main/hsm/wifi/tcp-server-hsm.c: tcpServerCb_ClientTransmit: Error sending to client 56 on instance 1: 9
- select(readFDs,writeFDs) -> readFD set for clientSocketFD, and writeFD set for clientSocketFD (and writeBuffer has data to be sent)
- read(clientSocketFD) with 0 bytes -> clientSocket closed and clientSocketFD set for dereferencing
- write(clientSocketFD) -> EBADF since socket closed
My underlying issue is still that TCP data is getting adversely delayed when using Soft AP which causes the TCP client to close the connection. With STA WiFi this does not occur. I have also disabled Nagle's algorithm (with TCP_NODELAY) which improves the situation, but does not solve it for this scenario.
Who is online
Users browsing this forum: Baidu [Spider], brushlow18 and 117 guests