Hi jcsbanks,
Just to make sure I understand: when you look at IP layer packet captures you see some TCP packets contain the results of multiple calls to netconn_send() or netconn_write(), yes?
I think a 40ms delay is probably due to task timing rather than LWIP deliberately "nagling" the packets.
When you send to a socket from a task (using either the netconn or BSD socket APIs), the data is added to a queue for the TCP/IP task to handle. When the TCP/IP task runs, it will send all of the waiting data that it can for a particular socket. If other tasks or interrupts in the system prevent the TCP/IP task from running until after multiple writes have been done to that particular socket, the TCP/IP task will combine these writes into a single TCP packet (which is desirable, to reduce packet overhead).
i.e. the TCP/IP task will send packets as fast as it can, but only when it's able to run.
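To illustrate the queueing behaviour described above, here is a toy model in plain C. It is not LWIP code: toy_socket_t, toy_write() and toy_tcpip_task_run() are hypothetical names made up for this sketch, and it ignores the MSS, the send window and ACKs. It only shows that the number of writes which pile up before the "TCP/IP task" runs decides how many end up in one packet.

```c
#include <stddef.h>
#include <string.h>

#define QUEUE_CAP 256 /* hypothetical per-socket send buffer size */

/* Toy stand-in for one socket's unsent data; not an LWIP structure. */
typedef struct {
    unsigned char pending[QUEUE_CAP];
    size_t pending_len;   /* bytes queued but not yet sent */
    size_t segments_sent; /* packets "transmitted" so far */
} toy_socket_t;

/* Application task side: queue data for the TCP/IP task.
 * Returns 0 on success, -1 if the buffer has no room. */
int toy_write(toy_socket_t *s, const void *data, size_t len)
{
    if (len > QUEUE_CAP - s->pending_len) {
        return -1;
    }
    memcpy(s->pending + s->pending_len, data, len);
    s->pending_len += len;
    return 0;
}

/* TCP/IP task side: whenever it gets to run, everything that is
 * waiting goes out as one packet. Returns that packet's payload size. */
size_t toy_tcpip_task_run(toy_socket_t *s)
{
    size_t len = s->pending_len;
    if (len > 0) {
        s->segments_sent++;
        s->pending_len = 0;
    }
    return len;
}
```

If toy_tcpip_task_run() is called after every toy_write() of 6 bytes, each call returns 6. If three writes happen before it gets a chance to run, a single call returns 18.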
The other possibility is that if an ACK is lost or delayed, the LWIP stack will start queueing up packets to be sent after the un-acked packet in the stream. So this may cause some combining of data.
I wrote a quick bit of test code and was actually unable to make LWIP combine any writes at all, with or without tcp_nagle_disable() - all packets had 6 byte payloads. I put this down partly to a fast network, but mostly to nothing else being active on the ESP32 while the task was running.
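This is not the test code I used, just a minimal sketch of the same idea using the BSD socket option TCP_NODELAY instead of tcp_nagle_disable(). The helper names set_nodelay(), get_nodelay() and send_small_writes() are made up for the example, and the headers are the POSIX ones (I'm assuming your build exposes the same calls, e.g. via lwip/sockets.h). It expects a socket you have already connected.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <unistd.h>

/* Set TCP_NODELAY: enable = 1 turns Nagle's algorithm off, 0 turns it
 * back on. Returns 0 on success, -1 on error. */
int set_nodelay(int fd, int enable)
{
    return setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &enable, sizeof(enable));
}

/* Read the option back: 1 if Nagle is disabled, 0 if enabled, -1 on error. */
int get_nodelay(int fd)
{
    int value = 0;
    socklen_t len = sizeof(value);
    if (getsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &value, &len) != 0) {
        return -1;
    }
    return value != 0;
}

/* Do `count` separate 6 byte writes on an already connected socket.
 * Returns 0 if every write was accepted in full, -1 otherwise. */
int send_small_writes(int fd, int count)
{
    static const char payload[6] = { 'a', 'b', 'c', 'd', 'e', 'f' };
    for (int i = 0; i < count; i++) {
        if (send(fd, payload, sizeof(payload), 0) != (ssize_t)sizeof(payload)) {
            return -1;
        }
    }
    return 0;
}
```

Run send_small_writes() once with set_nodelay(fd, 1) and once with set_nodelay(fd, 0), then compare the payload sizes in a packet capture. What you see will depend on what else is running on the chip at the time.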
The best thing you can do is to lower the priority of other task(s) you are running in the system (and reduce the frequency of any interrupts, if you can), to give the TCP/IP task the maximum possibility of running.