Socket issues

Posted: Mon Aug 07, 2017 3:25 pm
by permal
Hi,

Firstly, let me state that this question is about establishing connections at the TCP/IP level; I mention higher-level details only to describe how I see the issue.

I'm experiencing intermittent issues with connecting to my MQTT broker (mosquitto) from my ESP32, while at the same time I am able to connect to an instance of netcat on another machine (both within the same subnet). Sometimes, if I let it retry enough times, the MQTT connection succeeds after ~10 retries over a minute or so (the socket to netcat remains connected and transmits/receives data during this time).

However, sometimes it never seems to succeed until I either
1) restart the ESP
or
2) disconnect the Wifi link to the ESP32 from within my Wifi AP (which forces re-initialization of the Wifi station and new sockets to be created).

During the time that the MQTT connection fails, I can connect with other MQTT clients to that same broker so there is no issue on that end.

Usually the connection failure messages I get are "Connection reset by peer" or "Software caused connection abort". Has anyone else experienced this kind of behavior, and did you figure out how to solve it?

I don't know if it matters, but I am initiating the connections immediately after receiving the SYSTEM_EVENT_STA_GOT_IP.

Re: Socket issues

Posted: Mon Aug 07, 2017 4:53 pm
by kolban
From a "cold start" does the connection succeed the first time? If yes, this would lead me to look for some state in the environment.

A not uncommon story is when a previous TCP connection is not cleanly shut down and the TCP layer performs a "timed wait", waiting for potential previously in-transit IP packets to arrive. A timed wait can result in a port not being able to be re-allocated for a listener until after that period. There are ways to alleviate this if it turns out to be the problem.
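
One common way to alleviate the timed-wait problem for a listener is the SO_REUSEADDR socket option. The following is only a minimal sketch using BSD-style socket calls; the function name make_reusable_listener_socket is made up for illustration, and on the ESP32's lwIP stack the option may need to be enabled in the build configuration. Note that it helps a listening port being re-bound, which may not be the situation on the connecting (client) side.

```c
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: create a TCP socket with SO_REUSEADDR set, so that a
 * later bind() can succeed even if old connections on the same port are
 * still in TIME_WAIT. Returns the socket descriptor, or -1 on failure. */
int make_reusable_listener_socket(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) {
        return -1;
    }

    int enable = 1;
    if (setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &enable, sizeof(enable)) < 0) {
        close(fd);
        return -1;
    }
    return fd; /* caller continues with bind(), listen() and accept() */
}
```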

While I hear you loud and clear saying that we are thinking at the sockets level as opposed to the higher-level layers, are you actually making TCP/IP sockets calls, or are you asking MQTT client code to make the sockets calls for you? Diagnosing the problem at the sockets level without knowing how they are used inside MQTT might be fruitless until we can fully eliminate MQTT as a component. Do you have an ESP32 app that tests the ability to make a TCP connection, send and receive some data and then disconnect, and that runs (or doesn't) as expected?

Re: Socket issues

Posted: Mon Aug 07, 2017 5:01 pm
by permal
It often fails on the first try as well. I'm doing all the socket-level calls myself so I am certain that they are performed.

Both connections use the exact same code to make the initial TCP connection, but one fails the other doesn't. I'm using raw IPs so no DNS lookups that may interfere.

Re: Socket issues

Posted: Mon Aug 07, 2017 6:30 pm
by kolban
I am assuming the failure occurs when you make a sockets "connect()" call. Can you confirm what the value of the "errno" global is when connect returns? Which FreeRTOS task are you making the connection request from? If you target another network endpoint that is listening on a port for connections (other than the one that is failing), does it too fail?
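
A minimal sketch of how the errno value could be captured right after connect(), assuming BSD-style sockets and a raw IPv4 address as described earlier in the thread. The helper name try_connect and its err_out parameter are hypothetical. The point is to copy errno before any other call (such as close()) can overwrite it.

```c
#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: attempt one blocking TCP connect to a dotted-quad
 * IPv4 address. Returns a connected socket, or -1 with the failure code
 * stored in *err_out. */
int try_connect(const char *ip, uint16_t port, int *err_out)
{
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    *err_out = 0;

    if (inet_pton(AF_INET, ip, &addr.sin_addr) != 1) {
        *err_out = EINVAL; /* not a valid IPv4 address string */
        return -1;
    }

    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) {
        *err_out = errno;
        return -1;
    }

    if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        *err_out = errno; /* save before close() can change errno */
        close(fd);
        return -1;
    }
    return fd;
}
```

The saved value can then be printed together with strerror(err) to see which error the stack reports.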

Re: Socket issues

Posted: Mon Aug 07, 2017 7:43 pm
by ESP_igrr
capturing the packets using Wireshark / tcpdump would likely help to narrow down the issue.

Re: Socket issues

Posted: Mon Aug 07, 2017 7:54 pm
by permal
ESP_igrr wrote: capturing the packets using Wireshark / tcpdump would likely help to narrow down the issue.
Yes, I'm mainly asking in case someone has experienced this kind of issue previously. I will do analysis on a packet level too.

Re: Socket issues

Posted: Wed Aug 09, 2017 12:33 pm
by permal
Ok, so after spending many hours debugging this issue and making several improvements to my code, I've come to a point where I can reproduce the issue nearly consistently. Normally a reproducible problem is easy to fix, but I must admit that I'm out of ideas on how to tackle this.

The setup is now as follows:

- ESP32 at address 192.168.10.24
- A remote endpoint at address 192.168.10.247

Properly working sequence of events:

1. Remote endpoint is up and listening for connections.
2. ESP32 starts
3. Wifi connected and IP received from DHCP.
4. Connection to remote endpoint initiated.
5. Connected.
6. Remote endpoint shut down.
7. Socket disconnected and shutdown/closed.
8. Re-connection using a new socket.
9. Connection fails, repeat 7-9 an arbitrary number of times.
10. Remote endpoint brought back up.
11. Re-connection succeeds.

Sequence of events leading to failure:

1. Remote endpoint is DOWN.
2. ESP32 starts
3. Wifi connected and IP received from DHCP.
4. Connection to remote endpoint initiated.
5. Connection fails
6. Socket shutdown/closed.
7. Re-connection using a new socket.
8. Connection fails, repeat 6-8 an arbitrary number of times.
9. Remote endpoint brought back up.
10. Expected result is that the connection succeeds, but it never does and 6-8 is repeated indefinitely.
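
The connect / close / retry-with-a-new-socket part of the sequences above can be sketched roughly as follows. This is not the poster's actual code, only an illustration assuming BSD-style blocking sockets; the function name connect_with_retry and its parameters are hypothetical.

```c
#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: try to connect up to max_attempts times. Every
 * attempt uses a brand-new socket, and a failed socket is closed before
 * the next try. Returns a connected socket, or -1 if all attempts fail. */
int connect_with_retry(const char *ip, uint16_t port,
                       int max_attempts, unsigned delay_seconds)
{
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    if (inet_pton(AF_INET, ip, &addr.sin_addr) != 1) {
        return -1; /* not a valid IPv4 address string */
    }

    for (int attempt = 1; attempt <= max_attempts; ++attempt) {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        if (fd >= 0) {
            if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) == 0) {
                return fd; /* connected */
            }
            int err = errno; /* save before other calls overwrite it */
            fprintf(stderr, "attempt %d failed: errno %d (%s)\n",
                    attempt, err, strerror(err));
            close(fd);
        }
        if (attempt < max_attempts) {
            sleep(delay_seconds);
        }
    }
    return -1;
}
```

For example, connect_with_retry("192.168.10.247", 1883, 10, 5) would make up to ten attempts, five seconds apart, against the remote endpoint address used in this setup (1883 being the standard MQTT port).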

It's as if, when the remote endpoint is not available on the first connection attempt, it prevents all other connections too. I had a hard time believing this conclusion, so I set up another remote endpoint and did the same tests again, which resulted in the expected behavior. Based on this I looked closer at the connection sequence for the two remote endpoints and found the following:

- On the first endpoint (mosquitto running in a docker container), when mosquitto is shut down I get the error "Software caused connection abort". Turning on mosquitto after the first connection attempt does not result in a connection success later on.

- On the second endpoint (mosquitto running directly on my Ubuntu), when mosquitto is shut down I get the error "Connection reset by peer". Turning on mosquitto after the first connection attempt DOES result in a connection success later on.

The behavior is only nearly consistent, because on very rare occasions the first endpoint behaves as expected too; see below.

Looking at the network layer (using tshark, the CLI version of Wireshark) I noticed that when connections to endpoint one fail, there is a single incoming packet to the endpoint from the first connection attempt and no incoming packet from any subsequent connect() call (all of which use a new socket). Also note the total lack of retransmissions and replies to the SYN packet.

363 458.544800197 192.168.10.24 -> 192.168.10.247 TCP 60 minger > mqtt [SYN] Seq=0 Win=5744 Len=0 MSS=1436

On the rare occasion that things go as expected, i.e. starting the first endpoint results in a connection success, I also see the TCP retransmissions while the endpoint is still off:

364 626.309045066 192.168.10.24 -> 192.168.10.247 TCP 60 14076 > mqtt [SYN] Seq=0 Win=5744 Len=0 MSS=1436
365 629.189142011 192.168.10.24 -> 192.168.10.247 TCP 60 [TCP Retransmission] 14076 > mqtt [SYN] Seq=0 Win=5744 Len=0 MSS=1436
366 632.071114129 192.168.10.24 -> 192.168.10.247 TCP 60 [TCP Retransmission] 14076 > mqtt [SYN] Seq=0 Win=5744 Len=0 MSS=1436
367 634.948764234 192.168.10.24 -> 192.168.10.247 TCP 60 [TCP Retransmission] 14076 > mqtt [SYN] Seq=0 Win=5744 Len=0 MSS=1436
368 637.832879899 192.168.10.24 -> 192.168.10.247 TCP 60 [TCP Retransmission] 14076 > mqtt [SYN] Seq=0 Win=5744 Len=0 MSS=1436

So, my question is thus "what prevents the underlying TCP layer in the ESP32 from sending SYN (and retransmissions thereof)?" My two tasks are running at priority tskIDLE_PRIORITY + 1, so I don't see them starving the Wifi/TCP tasks. What other factors are there? Is the consistent difference in error messages of importance?

Re: Socket issues

Posted: Wed Aug 09, 2017 2:00 pm
by kolban
What is the computer/device that the remote endpoint is running on?
What is the application that is the remote endpoint?
Is the remote endpoint listening on a fixed port number?
During the times when the ESP32 can't connect to the remote endpoint, can another device/PC? (don't test connecting to the endpoint from the machine running the endpoint, introduce a test on a 3rd device)
When the ESP32 starts, is it allocated the same IP address each time? If not try a test with the ESP32 with a static IP address.
Can you try an alternate endpoint application ... eg. "nc" that simply accepts an incoming connection? Maybe test with an endpoint that pre-exists on the Internet?

Re: Socket issues

Posted: Wed Aug 09, 2017 2:44 pm
by permal
kolban wrote:...
The answers to those questions can be found in my previous post. It is always the same IP numbers and the first endpoint is reachable and connectable from other computers when the ESP32 can't connect to it.

Re: Socket issues

Posted: Wed Aug 09, 2017 5:53 pm
by enitalp
I'm playing a lot with TCP and UDP sockets on my custom ESP32 board, and discovered that I had a hardware problem with my custom wifi antenna. So I started to write some unit tests with an ESP32, a PC and a router, to identify where my packet repeats and losses were coming from. At first I blamed the ESP32, but after some tests: nope, not its fault.

My first test was :

Take the start time. ESP32 sends "ping" -> wifi router -> PC; when the PC receives "ping" it sends "pong" -> router -> ESP32 receives "pong". Display the time elapsed between start and now.
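
A rough sketch of the timing side of such a ping/pong test, assuming an already connected stream socket and a peer that answers "ping" with a 4-byte "pong". The function name measure_rtt_ms is made up for illustration and this is not the poster's actual test code.

```c
#define _POSIX_C_SOURCE 200809L /* for clock_gettime on some toolchains */
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <time.h>

/* Hypothetical helper: send "ping" on a connected socket, wait for a
 * 4-byte "pong" reply, and return the round-trip time in milliseconds,
 * or -1.0 on any error or unexpected reply. */
double measure_rtt_ms(int fd)
{
    struct timespec start, end;
    char reply[4];
    size_t got = 0;

    if (clock_gettime(CLOCK_MONOTONIC, &start) != 0) {
        return -1.0;
    }
    if (send(fd, "ping", 4, 0) != 4) {
        return -1.0;
    }
    /* A stream socket may deliver the reply in pieces, so loop. */
    while (got < sizeof(reply)) {
        ssize_t n = recv(fd, reply + got, sizeof(reply) - got, 0);
        if (n <= 0) {
            return -1.0;
        }
        got += (size_t)n;
    }
    if (clock_gettime(CLOCK_MONOTONIC, &end) != 0) {
        return -1.0;
    }
    if (memcmp(reply, "pong", 4) != 0) {
        return -1.0;
    }
    return (end.tv_sec - start.tv_sec) * 1000.0 +
           (end.tv_nsec - start.tv_nsec) / 1e6;
}
```

Calling this in a loop and logging each result would make both the typical round-trip time and the occasional long lags visible.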

I usually got about 120 ms, with a lot of lag spikes of up to 2 s.
I used Wireshark to look at the repeated packets and the repeated ACKs. The repeats occurred in both directions, not only in one.

I tried a lot of different ESP32 CPU frequencies, dual core/single core, socket configs and code changes; always the same. Sometimes it worked with no problem for 1 min (still 100+ ms) and then suddenly there was a lot of packet loss and lag.

Then i tried another unit test.

Put the ESP32 in AP mode, with my PC as the client, without going through my router.

Result: 8 to 15 ms!! Repeats were very, very rare.

Everything works as intended.

My router was in the same place as my PC and not connected to the internet; only the ESP and the PC were connected to it.

So the ESP32 wifi sockets work perfectly and my test router is crap (Linksys/Cisco WRT54G).

I ran the tests on a SparkFun ESP32 Thing, an Adafruit ESP32 Huzzah, and 2 versions of my custom board using the ESP32-D2WD with our antenna design.

I also found out that when switching to a different wifi network or wifi config, it's better to erase the ESP32 flash (and that is a bug of the ESP32, in my opinion).