8 second stalls when TCP running, less than 10k/sec throughput

Greg Corson
Posts: 35
Joined: Sun May 20, 2018 9:16 pm

8 second stalls when TCP running, less than 10k/sec throughput

Postby Greg Corson » Thu May 24, 2018 8:49 pm

So I just ported over a small arduino program to ESP32. This program creates a single WiFiServer and WiFiClient object.

A remote program on a PC connects to this server and then the ESP32 sends a 100byte message 100 times per second. The message is sent as a single client.write(buffer, length) call each time through the main loop. The same message is also sent out serial.

This worked great on an Adafruit Feather M0 WiFi board. When I ported it to their ESP32 board the TCP started having big problems.

The main loop has a spin lock, while(micros()<next_run_time); to lock it to 100hz. Sending messages out the serial is fine but when I have it start sending messages out TCP, performance is horrible. The main loop experiences unpredictable stalls of as long as 1-8 SECONDS frequently. If I stop sending TCP messages, performance goes back to normal.

Seems like the ESP32 shouldn't have problems sending just 10k/second (100 packets/second). What could I be doing wrong? The same code was running fine on the other board.

Greg Corson
Posts: 35
Joined: Sun May 20, 2018 9:16 pm

Re: 8 second stalls when TCP running, less than 10k/sec throughput

Postby Greg Corson » Fri May 25, 2018 2:55 pm

Forgot to mention, what I am doing is reading an I2C sensor at 100hz and then writing the values it returns to a TCP stream at 100hz.

This works fine all day long if I leave WiFi enabled but write the data to serial instead of TCP.

Just wanted to add that because I saw a problem mentioned elsewhere where it was a bad interaction between I2C and wifi.

Still digging into this and will try to put a simple reproduce case together.

Greg Corson
Posts: 35
Joined: Sun May 20, 2018 9:16 pm

Re: 8 second stalls when TCP running, less than 10k/sec throughput

Postby Greg Corson » Fri May 25, 2018 5:09 pm

Here's a bit of test code, I chopped it down as much as possible and still have it outputting the same text in the same way. There is no I2C anymore. I just updated to the latest ESP32 arduino stuff this morning. In my case the AP/router is only 6 feet away from the computers and ESP32 and no other devices are using it.

To run the test just add the name of your AP and it's password below, compile and run. You should almost immediately get strings of numbers out the serial port. Then connect to port 90 using a TCP terminal program like Putty, you should get the same stream of numbers there. Let it run and every once in a while the stream of numbers will just stop for 8 seconds or more.

The first number in the string of numbers being printed is the loop time in us, it should be around 10000 but when there is a stall it will be a LOT larger.

The other thing that is odd, but you can't see with Putty, is that even when it's running smoothly the TCP traffic is very bursty. I have a PC telemetry program that ingests and graphs the stream in real time, when I use it with ESP32 the graph stutters and updates in irregular spurts. With a different MCU the graph moves smoothly.

Have I missed some setup thing that needs to be done to get this to work? Maybe buffer sizes, turning nagle mode off or something?

Code: Select all

#define FOR_FEATHER_ESP32
#include <WiFi.h>

//-----------------------------------------------------------------------------------------------------------------
// WiFi Access point name and password.
//-----------------------------------------------------------------------------------------------------------------

const char ssid[] = "yourAP";
const char pass[] = "yourPASS";

#define TIME_BETWEEN_SENSOR_READS  (10000)   // Loop time in microseconds
#define TIME_BETWEEN_PID_COMPUTES     (10)   // Same as above, but in miliseconds, for the PID library


uint8_t
  system_state  = 1;     // current state of the robots state machine

// These are tunable parameters, what works for one bot will probably not work for another, even if the design is
// exactly the same.  For a new bot tt is best to set all these to zero and start tuning from there.
// When you have a tune that works well, put the values below so your bot will start up in-tune.

float
  tp,ti,td;                        // calculated components of the pid for telemetry
double
  balanceSetpoint = 2.7,          // Natural balance point of the robot (usually 0 unless robot is lopsided)
  totalSetpoint = 0.0,
  sensorPitch = 0.0,              // The current pitch value from the sensor
  pitchError = 0.0,               // Calculated pitch error (balanceSetpoint-sensorPitch)
  pidOutput = 0.0;

int
  bias_test_speed = 1;
  
// Timers and stats

unsigned long
  next_blynk,
  next_rssi,
  loop_start,       // Time at the top of the main loop
  last_loop_start,  // Time at the top of the previous main loop
  sensor_frequency, // Time between sensor reads
  last_sensor,      // Time of last sensor read
  pid_frequency,    // Time between PID updates
  last_pid,         // Time of last PID update
  blynk_time,       // Time to execute the blynk processing in the main loop
  sensor_time,      // Time to execute the sensor read portion of the loop
  next_sensor,      // Next time the sensors should be read
  last_loop_time;   // Time spent inside the previous loop (report 1 frame late so we can capture the time required to do telemetry)

// Sensor related globals

uint8_t
  sys           = 0,              // Calibration state of the IMU
  gyro          = 0,              // gyro calibration state
  accel         = 0,              // accelerometer calibration state
  mag           = 0;              // magnitometer calibration state


//-----------------------------------------------------------------------------------------------------------------
// For the TCP based telemetry and control on port 90
// NOTE: if more than one client connects or this will get confused.
//-----------------------------------------------------------------------------------------------------------------
//

WiFiServer 
  server(90);                   // Creates Server on port 90
WiFiClient 
  client;                       // First client connected
bool 
  clientConnected  = false,     // TRUE when client is connected
  clientCommand    = false;     // TRUE when there is a complete command in the buffer
char 
  clientBuffer[80];             // Buffer for incomming commands
int 
  clientBufsize    = 0;         // Size of the current line of text in the buffer

//-----------------------------------------------------------------------------------------------------------------
// For the telemetry system, used for both serial and wifi
//-----------------------------------------------------------------------------------------------------------------

// printBuffer is only initialized for test purposes.  You can delete the initializer to save space.
//                     123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890

uint8_t 
  printBuffer[200]  = "15004,359.94,85.31,-3.94,1,0,3,0,0,15004,2818,15,0,0.00,-83.31,17.00,48.00,0.60,FAKE\n";
int 
  printBufferLen    = 0;          // Length of string in print buffer 85
bool 
  serialTelemetryOn = true,      // true to turn on serial telemetry at startup
  tcpTelemetryOn    = true;      // true to turn on TCP telemetry as soon as a connection is made to port 90

//-----------------------------------------------------------------------------------------------------------------
// void setup(void) --- automatically called by arduino at startup to set up the entire system
//-----------------------------------------------------------------------------------------------------------------
void setup(void)
{
  // Setup Serial, to be used with the USB-serial telemetry app bit rate in the telemetry app needs to match.
  
  Serial.begin(2000000);
  Serial.println(F("BalanceBotMultiplatform7")); 
  
  // Initialize  WiFi

    Serial.println();
    Serial.print("Connecting to ");
    Serial.println(ssid);

    WiFi.begin(ssid, pass);

    while (WiFi.status() != WL_CONNECTED) {
        delay(500);
        Serial.print(".");
    }

    Serial.println("");
    Serial.println("WiFi connected.");
    Serial.println("IP address: ");
    Serial.println(WiFi.localIP());
      

    server.begin();
    Serial.print("connected to wifi at ");
    IPAddress myAddress=WiFi.localIP();
    Serial.println(myAddress);

  
  // Initialize all the time keeping variables.
  // These are setup so the times reported first time through the loops will be close
  // to zero rather than unusually large (if they were set before blink/wifi was initialized
  // they would be very big because it takes blynk a lot of time to connect to wifi)

  last_loop_start = micros();
  last_sensor     = micros();
  last_pid        = micros();
  pid_frequency   = 0;        // Time between PID updates
  blynk_time      = 0,        // Time to execute the blynk processing in the main loop
  sensor_time     = 0,        // Time to execute the sensor portion of the loop
  next_sensor     = 0;        // Next time the sensors should be read

}

//-----------------------------------------------------------------------------------------------------------------
// void loop(void) --- It is very important that this loop executes exactly on time.  The PID library and other
// code assumes this.  Confirm with telemetry that you are not going over time and adjust the TIME_BETWEEN_SENSOR_READS
// to make sure you will always run on time.
//
void loop(void)
{
  // Start with a Spinlock to make sure this runs every TIME_BETWEEN_SENSOR_READS microseconds
  // This could also be scheduled with an interrupt.
  // Microsecond resolution is probably not required, only consistency, so this code could use
  // the milisecond timer to avoid problems with wrap-around.
  // I continue to use microseconds because I need the high resolution to time some of the 
  // execution time stats accurately.
  
  while(micros() < next_sensor);

  // Remember loop start time
  
  loop_start = micros();
  
  // If it's time to get a new sensor event, get it

  sensor_time       = micros();                                 // Start timer on sensor read 
  last_sensor       = sensor_time;
  next_sensor       = sensor_time + TIME_BETWEEN_SENSOR_READS;  // Compute time of next sensor read
  sensor_time       = micros()-sensor_time;                     // Compute how long this read took
    
  // State handling
  
  if(!clientConnected)
  {
    // Check to see if a client has connected
    if(client = server.available())
    {
      clientConnected = true;
      clientCommand   = false;
      clientBufsize   = 0;
      Serial.println("Connected to TCP telemetry/command." );
    }
  }
  
  // Only generate the string if serial or TCP telemetry is on, this speeds up the loop when you aren't using telemetry.
  if(tcpTelemetryOn || serialTelemetryOn)
  {
    // Calculate the MCU battery voltage (for feather M0)
    
    float measuredvbat = 0;
    printBufferLen = sprintf((char*)printBuffer,"%d,%4.2f,%4.2f,%4.2f,%d,%d,%d,%d,%d,%d,%d,%d,%d,%4.2f,%4.2f,%4.2f,%4.2f,%4.2f,%d,%4.2f,%4.2f,%4.2f,%4.2f,%4.2f\n",
      (loop_start - last_loop_start),   // Total loop time, measured from start of previous loop to start of this one
      (1),              // Sensor fusion x
      (2),              // Sensor fusion y
      (3),              // Sensor fusion Z
//      euler.x(),
//      euler.y(),
//      euler.z(),    
      (system_state),                     // System state machine
      (sys),                              // IMU calibration states
      (gyro),
      (accel),
      (mag),
      (sensor_frequency),                 // Time between sensor reads
      (sensor_time),                      // Time to do a sensor read
      (blynk_time),                       // Time to do a blynk call
      (pid_frequency),                    // Time between PID calculations
      (pidOutput),                        // Output of a PID calculation
      (totalSetpoint - sensorPitch),      // Pitch error
      (1),                    // PID tuning parameters
      (2),
      (3),
      (last_loop_time),                   // time since loop start not counting write of stats
      measuredvbat,                       // battery voltage
      totalSetpoint,                      // setpoint including joystick contribution
      tp,
      ti,
      td);
  }

  // Send telemetry buffer out serial using Serial.write

  if(serialTelemetryOn && printBufferLen > 0)
  {
    Serial.write(printBuffer,printBufferLen);
  }

  // send telemetry buffer out TCP

  if(clientConnected && tcpTelemetryOn)
  {
    pid_frequency = micros();
    client.write(printBuffer,printBufferLen);
    pid_frequency = micros()-pid_frequency;
  }

// Update timing stats, note that "last loop time" is the time of the PREVIOUS loop execution


  printBufferLen = 0;
  last_loop_time = micros()-loop_start;
  last_loop_start = loop_start;
 
 }

Greg Corson
Posts: 35
Joined: Sun May 20, 2018 9:16 pm

Re: 8 second stalls when TCP running, less than 10k/sec throughput

Postby Greg Corson » Fri May 25, 2018 10:27 pm

I did some wireshark captures and am seeing what look like bad retransmits due to missing ACKs or timeouts.

It usually starts with a "previous segment not captured" error, followed by a number of duplicate ACKs, a TCP retransmission and then a "spurious retransmission" error.

I need to check more carefully but it appears that either the ESP dropped an ACK or timed out and decided to retransmit, but according to wireshark the PC seemed to have received and acked the data the first time.

Is there any way I can change the ESP buffer size or other TCP timeout values that might resolve this?

jweigelt
Posts: 1
Joined: Sun Apr 14, 2019 10:01 pm

Re: 8 second stalls when TCP running, less than 10k/sec throughput

Postby jweigelt » Sun Apr 14, 2019 11:01 pm

Sorry for digging up this old thread again. I just wasted a week on a smiliar problem with the solution being rather simple, so maybe I can save someone a bit of hassle.

The short answer is: disable nagle.

The problem is caused by the nagle algorithm as it will only allow to push out data if the previous ACK was received.
When a single ACK is dropped, lwip will wait until the ack timeout expires before resending the packet as "spurious retransmission". As lwip's default ACK timeout is not very strict this will cause a "stall" lasting multiple seconds while lwip waits for the lost ack to arrive before resending. For sending "live" data, this is obviously very bad.
The timeout can be reduced to 500ms at which point lwip's tcp polling rate will prevent setting it any lower. By rebuilding lwip, this rate can be increased.

For sending a steady stream of data, I found that disabling nagle completely works best. When an ack is dropped it'll just send the next packet. If a packet is dropped the client will detect that it's out of sequence and resend the lost packet ("TCP fast retransmission").
The connection usually recovers within a few milliseconds without causing any notable stalls.

Disabling nagle obviously comes at the price of reduced payload throughput.

I'm not entirely sure how you would disable it using arduino. Using lwip's raw tcp api you can just call

Code: Select all

tcp_nagle_disable(pcb);

Who is online

Users browsing this forum: No registered users and 94 guests