Prompted by poor performance of the SMB mount (CORE filesystem), the network was checked for issues.
(The investigation started on 06 April 2021.)
(Together with Stefan Dietrich, Juergen Hannappel, Sebastian Mleczek and Tim Schoof)
The first observation was that CORE transfers are a factor of 2 slower than BLFS transfers (on Windows).
Next, iperf3 measurements were done.
The results were very surprising, as they showed a dependence on the sender's TCP port. (From then on, we concentrated on iperf3 and did not check SMB anymore.)
(The SMB issue was found by Stefan and colleagues: signing of SMB packets is bad in terms of data rate → now changed to encryption.)
As the network on the DESY campus is built from several redundant connections combined into port channels, there is a chance that one of these lines has an issue.
Every network frame is analyzed by dst-IP, src-IP, dst-port and src-port (Layer 3+4 hashing) and, depending on the result, sent on a particular physical link.
Sebastian checked almost all transceivers and replaced some with "bad" signal strength, but there was no change at all.
Very confusing was the fact that these data-rate fluctuations are only visible when using a Windows PC as the iperf3 client. On closer inspection, Linux also has issues, which are counted as retransmissions within the TCP stack.
Linux seems to have a better / different algorithm to recover from packet loss.
Packet loss / SACK / retransmissions are visible all the time in Wireshark traces.
Up to this stage, all tests were done on a server connected via a port channel (bonding of physical network links to form a logical interface) to two switches (MLAG, Multi-Chassis Link Aggregation).
A test with a single-link server doesn't show these data-rate fluctuations.
To be able to run tests every day, a dedicated test system was installed on 08 April.
The results are in line with the results from the proxy-node.
Layer 3+4 hashing and the distribution of traffic over the two links work as expected; differences in speed are also visible.
Sometimes the data rate changes during a run:
Offload functions (the NIC performs some calculations on the frames and also combines them) can help a lot to reduce CPU (and IRQ) load. On the other hand, they can interfere with some functions of the OS stack.
A test with the offload functions disabled looked successful at first, but it finally turned out that this doesn't solve the problem reliably.
There was one run (0@3gbit/s, cyan trace) which was fine, but afterwards it became bad again.
In/out on same NIC?
At some point, it became noticeable that the bad data rates occur once the IN data (payload from the iperf3 client) and the OUT data (mainly ACKs) are handled by the same NIC (or the same physical link of the port channel).
This is true for the bad data rates from Windows machines and also for the high rate of retransmissions from Linux machines.
Interrupt moderation rate / Interrupt delay
As data comes in small pieces (MTU = 1500 bytes), every frame generates an interrupt. To reduce this load on the server, the NIC can moderate the interrupt rate, or one can define a maximum rate.
Changing this value helped a lot on the test server but was not successful on the proxy node (maybe there was already a reasonable configuration).
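To get a feeling for the numbers, here is a rough estimate of the interrupt load without moderation. The 10 Gbit/s link speed and the 50 µs moderation interval are assumptions chosen for illustration, not the values configured on the test server.

```python
def frames_per_second(link_bit_per_s, mtu_bytes=1500, overhead_bytes=38):
    """Full-size Ethernet frames per second at line rate.

    overhead_bytes: 14 header + 4 FCS + 8 preamble + 12 inter-frame gap.
    """
    return link_bit_per_s / ((mtu_bytes + overhead_bytes) * 8)

# Assumption: 10 Gbit/s link, one interrupt per received frame
rate_10g = frames_per_second(10e9)
print(f"unmoderated: {rate_10g:,.0f} interrupts/s")

# Assumption: moderation to at most one interrupt per 50 microseconds
max_irq = 1 / 50e-6
print(f"moderated:   {max_irq:,.0f} interrupts/s, "
      f"about {rate_10g / max_irq:.0f} frames per interrupt")
```

The price of moderation is latency: frames wait in the NIC until the next interrupt fires, which also delays the ACKs going back to the sender.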
RSS, Flow Director, Queues, IRQBalance
As the IRQ changes helped a lot on the test server, this topic has to be checked in more detail.
As the NICs are multi-queue devices, they calculate a hash (like the switches do for the port channels) and send RX data to one of the queues.
Intel network cards have advanced features to control how the data is distributed, and the OS takes care of distributing the IRQs equally over all available cores.
In addition, the queue selection follows the core affinity of the application that will receive the data (ATR, Application Targeted Routing).
For some reason this mechanism is not always perfect.
A feature called Flow Director accepts user-defined rules for how to distribute the data to the queues.
By enabling this feature, all problems are gone...
It turned out that massive switching of the core affinity of the application (iperf3 in this case) is the reason for the packet loss.
While switching between cores, the application can't handle the data and the buffer overflows.
In addition, the ATR mechanism tries to follow this application core switching (or the other way round...) and switches the queue in order to match the configured core.
irqbalance then also tries to change the core affinity to get a more even distribution of load over the cores.
In this case, there is just one process (iperf3 -s) that consumes a lot of CPU time, plus the NIC-based interrupts.
There are different ways to handle this:
- Pin processes and NIC-interrupts to fixed cores (also disable irqbalance)
- Disable ATR (frames are just sorted to queues based on the hash-code)
The first approach might be very interesting for dedicated servers with a multi-socket CPU configuration (to prevent data movement to a core which is not directly connected to the hardware).
The second seems to be a bit less demanding on administration.
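A minimal sketch of the first approach on Linux, assuming Python is available on the server. `os.sched_setaffinity` is Linux-only. The helper names `pin_to_core` and `pin_irq` are invented for this example, and the IRQ number and core have to be looked up on the real machine (for example in /proc/interrupts). The second approach corresponds to the "ntuple on (ATR off)" setting used in the proxy-node test further down; as far as we know, this is switched with `ethtool -K <interface> ntuple on`.

```python
import os

def pin_to_core(core, pid=0):
    """Pin a process (default: the calling one) to a single core. Linux only."""
    os.sched_setaffinity(pid, {core})
    return os.sched_getaffinity(pid)

def pin_irq(irq, core):
    """Pin one interrupt to a single core.

    Needs root and a stopped irqbalance, otherwise the setting may be
    overwritten again. Not called in this sketch.
    """
    with open(f"/proc/irq/{irq}/smp_affinity_list", "w") as f:
        f.write(str(core))

# Pick a core this process is allowed to run on (the choice is arbitrary here)
core = min(os.sched_getaffinity(0))
print(pin_to_core(core))
```

For an already running `iperf3 -s`, the same call works with its PID instead of 0.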
Why is Windows more affected?
TCP transfers are ACKed by the receiver. If the sender doesn't receive the ACK for sent data within some time, it considers the packet lost.
TCP doesn't need an ACK per packet; it can send out up to MB (GB in theory) without getting an ACK. This is adapted during the running transfer via the window sizes.
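How much unACKed data is needed follows from the bandwidth-delay product. The numbers below (10 Gbit/s, 1 ms round-trip time) are assumptions for illustration only and were not measured on this network.

```python
def window_bytes(rate_bit_per_s, rtt_s):
    """Bytes that must be in flight (unACKed) to keep the link busy."""
    return rate_bit_per_s * rtt_s / 8

# Assumed values: 10 Gbit/s link, 1 ms round-trip time
print(f"{window_bytes(10e9, 1e-3) / 1e6:.2f} MB in flight")

# Without window scaling the TCP window field has 16 bits (max 65535 bytes)
print(f"{65535 * 8 / 1e-3 / 1e9:.2f} Gbit/s at most with an unscaled window")
```

With the window-scale option the 16-bit window can be shifted by up to 14 bits, which is where the theoretical limit of about 1 GB comes from.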
TCP has built-in methods to deal with congestion. For example, a defined number (2 or 3 depending on the OS) of duplicate ACKs tells the sender to slow down.
TCP assumes that packet loss is due to congestion, so the sender lowers the data rate and then increases it step by step until the next packet loss occurs (a real network changes its capacity dynamically).
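The resulting sawtooth can be sketched with a textbook additive-increase / multiplicative-decrease loop. This is a strong simplification and not the algorithm of any particular OS; real stacks use more elaborate schemes, and all parameter values here are arbitrary.

```python
def aimd(capacity, rounds, cwnd=1.0, increase=1.0, decrease=0.5):
    """Congestion window per round trip under simplified AIMD."""
    history = []
    for _ in range(rounds):
        history.append(cwnd)
        if cwnd > capacity:
            # Loss: the path cannot carry this much, so back off
            cwnd = max(1.0, cwnd * decrease)
        else:
            # No loss: probe for a bit more bandwidth
            cwnd += increase
    return history

trace = aimd(capacity=10, rounds=30)
print(trace)
```

If losses are caused by something other than congestion, as in this investigation, the sender still backs off every time and never stays at the full rate.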
Windows seems to be a bit more gentle to the network and, after probing for high speed, stays at a lower speed.
But what are the facts behind this behavior?
Windows seems to retransmit data even if it has been ACKed. This happens on a time scale of a few ms. In addition to the extra load on the wire, there is also a mechanism which asks the sender to slow down in case of Dup ACKs. In general this is a sign of packet loss due to congestion.
What about Linux?
As defined in RFCxyz, the receiver has to send a Dup ACK once a packet is missing. (It can't know that the packet will arrive just ns later, due to network-related reordering.)
The transmitting Linux host just continues after getting the Dup ACK and then the ACK for all packets up to this point. (The Dup ACK was not necessary, as the missing packet was not missing but just out of order.)
So there is also no need to slow down the transmission rate.
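How reordering alone produces a Dup ACK can be shown with a toy receiver that sends one cumulative ACK per arriving segment. Segment numbers replace byte sequence numbers, and SACK and delayed ACKs are ignored, so this is a sketch of the principle only.

```python
def receiver_acks(arrival_order):
    """Cumulative ACKs sent by a receiver, one per arriving segment.

    Segments are numbered 0, 1, 2, ...; an ACK carries the number of the
    next segment the receiver expects.
    """
    received = set()
    expected = 0
    acks = []
    for seg in arrival_order:
        received.add(seg)
        while expected in received:
            expected += 1
        acks.append(expected)
    return acks

# Segment 2 is overtaken by segment 3 on the way
print(receiver_acks([0, 1, 3, 2, 4]))  # [1, 2, 2, 4, 5]
```

The repeated 2 is the Dup ACK. The next ACK jumps straight to 4, which tells the sender that nothing was actually lost.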
After turning off SACK on the Linux receiver (for Windows I have no idea how to do this), the results looked much better (in terms of data rate).
But looking at the Wireshark traces (no screenshot available yet), there are still a lot of Dup ACKs on the wire (but no retransmissions like with SACK enabled).
There is still some friction in the system due to the jumping core affinity and the jumping RX-queue affinity.
Only with everything together, i.e. process-to-core pinning, ATR off and SACK off (more important for Windows hosts), does everything run smoothly at line speed with almost no packet loss.
This is as it should be in such a network!
Further tests on other applications
Data transfer from eiger to GPFS.
Windows acts as man-in-the-middle. There is a port forwarding that just pipes through all traffic on port 80 (to make use of a 10GE SFP+ transceiver with a max cable length shorter than 30 m).
SMB mount of GPFS on a Windows PC
SMB mount of core-fs on a 1GE Windows PC
There are also similar reports on the Internet (found by Tim Schoof).
Tests with asap3-bl-prx04 (p07 proxy node)
Current situation with ntuple off (ATR is on): a lot of out-of-order packets
Test with ntuple on (ATR off): NOT A SINGLE ERROR
Test with new 100GE prx20
No problems with data transfer by using iperf3
Maybe some tuning of the TCP parameters is needed, but the overall data rate was at line speed.