Network Readiness - Part 1: Physical Network MTU


Dear readers

Welcome to a new series of blogs about network readiness. As you may already be aware, NSX-T requires mainly two things from the physical underlay network:

  • IP Connectivity – IP connectivity between all NSX-T components and the compute hosts. This includes the Geneve Tunnel Endpoint (TEP) interfaces and the management interfaces (typically vmk0) of the hosts, as well as the management interfaces of the NSX-T Edge nodes, both bare metal and virtual.
  • Jumbo Frame Support – The minimum required MTU is 1600 bytes; an MTU of 1700 bytes is recommended to cover the full variety of functions and to future-proof the environment for an expanding Geneve header. To get the most out of your VMware SDDC, your physical underlay network should support an MTU of at least 9000 bytes. A rough overhead breakdown follows below.
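
To see where these numbers come from, here is a rough breakdown of the Geneve encapsulation overhead, assuming an IPv4 underlay and no Geneve option TLVs (options can add up to another 252 bytes, which is why 1600 bytes is the minimum and 1700 bytes the recommendation):

Geneve Encapsulation Overhead (IPv4 underlay, no options)

  1500 bytes  inner payload (standard VM MTU)
+   14 bytes  inner Ethernet header
+    8 bytes  Geneve base header
+    8 bytes  outer UDP header
+   20 bytes  outer IPv4 header
= 1550 bytes  minimum underlay MTU before Geneve options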

This blog focuses on the MTU readiness for NSX-T. Besides the VMkernel interface used for the Geneve overlay encapsulation, there are others, such as vSAN or vMotion, that also perform better with a higher MTU, so we keep this MTU discussion more general. Physical network vendors, like Cisco with the Nexus Data Center switch family, typically support an MTU of 9216 bytes; other vendors may have a similar upper limit.

 

This blog is about the correct MTU configuration and its verification within a data center spine-leaf architecture built with Nexus 3K switches running NX-OS. Let's have a look at a very basic lab spine-leaf topology with only three Nexus N3K-C3048TP-1GE switches:

Lab Spine Leaf Topology.png

Out of the box, the Nexus 3048 switches are configured with an MTU of 1500 bytes only. For an MTU of 9216 bytes we need to configure three pieces:

  • Layer 3 Interface MTU Configuration – This type of interface is used between the Leaf-10 and the Borderspine-12 switch, and likewise between the Leaf-11 and the Borderspine-12 switch. On these interfaces we run OSPF to announce the Loopback0 interface used for the iBGP peering. As an example, the Layer 3 MTU configuration of interface e1/49 on Leaf-10 is shown below, followed by a quick verification of the interface MTU:
Nexus 3048 Layer 3 Interface MTU Configuration

NY-N3K-LEAF-10# show run inter e1/49

---snip---

interface Ethernet1/49

  description **L3 to NY-N3K-BORDERSPINE-12**

  no switchport

  mtu 9216

  no ip redirects

  ip address 172.16.3.18/30

  ip ospf network point-to-point

  no ip ospf passive-interface

  ip router ospf 1 area 0.0.0.0

NY-N3K-LEAF-10#
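
To confirm that the interface MTU is active, the interface status can be filtered for its MTU line. This is a minimal sketch; the exact output wording and bandwidth values depend on the platform and NX-OS release.

Nexus 3048 Layer 3 Interface MTU Verification (sketch)

NY-N3K-LEAF-10# show interface ethernet 1/49 | include MTU

  MTU 9216 bytes, BW 1000000 Kbit, DLY 10 usec

NY-N3K-LEAF-10#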

 

  • Layer 3 Switch Virtual Interface (SVI) MTU Configuration – This type of interface is required, for example, to establish IP connectivity between the Leaf-10 and Leaf-11 switches when the interfaces between the Leaf switches are configured as Layer 2 interfaces. We use a dedicated SVI on VLAN 3 for the OSPF neighborship and the iBGP peering between Leaf-10 and Leaf-11. In this lab topology the interfaces e1/51 and e1/52 are configured as a dot1q trunk to carry multiple VLANs (including VLAN 3), and these two interfaces are combined into a port channel running LACP for redundancy; a sketch of that trunk configuration follows the SVI example below. As an example, the MTU configuration of the SVI for VLAN 3 on Leaf-10 is shown below:
Nexus 3048 Switch Virtual Interface (SVI) MTU Configuration

NY-N3K-LEAF-10# show run inter vlan 3

---snip---

interface Vlan3

  description *iBGP-OSPF-Peering*

  no shutdown

  mtu 9216

  no ip redirects

  ip address 172.16.3.1/30

  ip ospf network point-to-point

  no ip ospf passive-interface

  ip router ospf 1 area 0.0.0.0

NY-N3K-LEAF-10#
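
For reference, the dot1q trunk and LACP port channel between Leaf-10 and Leaf-11 could look roughly like the sketch below; the port-channel number is an assumption and the allowed VLAN list is abbreviated to VLAN 3 for illustration. Note that there is no interface-level MTU command here, because on these switches the Layer 2 MTU comes from the global network-qos policy described in the next bullet.

Nexus 3048 Layer 2 Trunk and Port-Channel Configuration (sketch)

interface Ethernet1/51-52
  description **L2 trunk to NY-N3K-LEAF-11**
  switchport mode trunk
  switchport trunk allowed vlan 3
  channel-group 3 mode active

interface port-channel3
  switchport mode trunk
  switchport trunk allowed vlan 3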

 

  • Global Layer 2 Interface MTU Configuration – This global configuration is required for this type of Nexus switch and a few other Nexus models (please see footnote 1 for more details). The Nexus 3000 does not support an individual Layer 2 interface MTU configuration; the MTU for Layer 2 interfaces must be configured via a network-qos policy. All interfaces configured as access or trunk ports for host connectivity, as well as the dot1q trunk between the Leaf switches (e1/51 and e1/52), require the network-qos configuration shown below:
Nexus 3048 Global MTU QoS Policy Configuration

NY-N3K-LEAF-10#show run

---snip---

policy-map type network-qos POLICY-MAP-JUMBO

  class type network-qos class-default

   mtu 9216

system qos

  service-policy type network-qos POLICY-MAP-JUMBO

NY-N3K-LEAF-10#

 

The global network-qos MTU configuration can then be verified with the command shown below:

Nexus 3048 Global MTU QoS Policy Verification

NY-N3K-LEAF-10# show queuing interface ethernet 1/51-52 | include MTU

HW MTU of Ethernet1/51 : 9216 bytes

HW MTU of Ethernet1/52 : 9216 bytes

NY-N3K-LEAF-10#

 

The end-to-end MTU of 9216 bytes within the physical network should typically be verified before you attach your first ESXi hypervisor hosts. Please keep in mind that the vSphere Distributed Switch (vDS) and the NSX-T N-VDS (e.g. the uplink profile MTU setting) today support up to 9000 bytes. This MTU includes the overhead for the Geneve encapsulation. As you can see in the table below from an ESXi host, the MTU is set to the maximum of 9000 bytes for the VMkernel interfaces used for Geneve (unfortunately still labeled vxlan) as well as for vMotion and IP storage.

ESXi Host MTU VMkernel Interface Verification

[root@NY-ESX50A:~] esxcfg-vmknic -l

Interface  Port Group/DVPort/Opaque Network        IP Family IP Address      Netmask         Broadcast       MAC Address       MTU     TSO MSS   Enabled Type     NetStack           

vmk0       2                                       IPv4      172.16.50.10    255.255.255.0   172.16.50.255   b4:b5:2f:64:f9:48 1500    65535     true    STATIC   defaultTcpipStack  

vmk2       17                                      IPv4      172.16.52.10    255.255.255.0   172.16.52.255   00:50:56:63:4c:85 9000    65535     true    STATIC   defaultTcpipStack  

vmk10      10                                      IPv4      172.16.150.12   255.255.255.0   172.16.150.255  00:50:56:67:d5:b4 9000    65535     true    STATIC   vxlan              

vmk50      910dba45-2f63-40aa-9ce5-85c51a138a7d    IPv4      169.254.1.1     255.255.0.0     169.254.255.255 00:50:56:69:68:74 1500    65535     true    STATIC   hyperbus           

vmk1       8                                       IPv4      172.16.51.10    255.255.255.0   172.16.51.255   00:50:56:6c:7c:f9 9000    65535     true    STATIC   vmotion            

[root@NY-ESX50A:~]
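
The MTU at the vDS/N-VDS switch level can be cross-checked as well. This is a minimal sketch using esxcli; the exact fields and values depend on the ESXi release and the configured switches.

ESXi Host Distributed Switch MTU Verification (sketch)

[root@NY-ESX50A:~] esxcli network vswitch dvs vmware list | grep MTU

   MTU: 9000

[root@NY-ESX50A:~]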

 

I still highly recommend verifying the end-to-end MTU between two ESXi hosts by sending VMkernel pings with the don't-fragment bit set, as shown below.
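
The payload size of 8972 bytes is not arbitrary: 9000 bytes VMkernel MTU minus 20 bytes IPv4 header minus 8 bytes ICMP header. Run from the Geneve (vxlan) netstack against the TEP address of the peer host (172.16.150.13 in this lab), the check looks like this:

ESXi VMkernel Ping MTU Verification

[root@NY-ESX50A:~] vmkping ++netstack=vxlan -d -c 3 -s 8972 -I vmk10 172.16.150.13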

 

But for a serious end-to-end verification of the 9216-byte MTU in the physical network we need a tool other than the VMkernel ping. In my case I simply use BGP, which is already running on the Nexus 3048 switches. BGP runs on top of TCP, and TCP supports the "Maximum Segment Size" option to maximize the size of the TCP segments.

 

The TCP Maximum Segment Size (MSS) is a parameter in the options field of the TCP header that specifies, in bytes, the largest amount of data a host is willing to receive in a single TCP segment. This information is exchanged during the SYN phase of the TCP three-way handshake, as the Wireshark sniffer trace below shows.

Wireshark-MTU9216-MSS-TCP.png

The TCP MSS defines the maximum amount of data that an IPv4 endpoint is willing to accept in a single TCP/IPv4 datagram. RFC 879 explicitly mentions that the MSS counts only data octets in the segment, not the TCP header or the IP header. In the Wireshark trace example the two IPv4 endpoints (Loopbacks 172.16.3.10 and 172.16.3.12) have agreed on an MSS of 9176 bytes during the TCP three-way handshake on a physical Layer 3 link with an MTU of 9216 bytes. The difference of 40 bytes corresponds to the default TCP header of 20 bytes plus the IP header of another 20 bytes (9216 - 20 - 20 = 9176).

Please keep in mind that a small MSS value reduces or eliminates IP fragmentation for any TCP-based application, but results in higher overhead. This is also true for BGP messages.

BGP update messages carry the BGP prefixes as part of the Network Layer Reachability Information (NLRI). For optimal BGP performance in a spine-leaf architecture running BGP, it is advisable to set the MSS for BGP to the maximum value that still avoids fragmentation. As defined in RFC 879, all IPv4 endpoints are required to handle an MSS of at least 536 bytes (= MTU of 576 bytes minus 20 bytes TCP header** minus 20 bytes IP header).

But are these Nexus switches using an MSS of only 536 bytes? Nope!

These Nexus 3048 switches running NX-OS 7.0(3)I7(6) are configured by default to discover the maximum path MTU between the two IPv4 endpoints, leveraging the Path MTU Discovery (PMTUD) feature. Other Nexus switches may require the global command "ip tcp path-mtu-discovery" to enable PMTUD.
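
On Nexus platforms where PMTUD is not enabled by default, turning it on would look roughly like this (a sketch; verify the availability of the command for your platform and NX-OS release):

Nexus Global PMTUD Configuration (sketch)

NY-N3K-LEAF-10# configure terminal

NY-N3K-LEAF-10(config)# ip tcp path-mtu-discovery

NY-N3K-LEAF-10(config)# end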

 

MSS is sometimes mistaken for PMTUD. MSS is a Transport Layer concept used by TCP that specifies the largest amount of data a device can receive in a single TCP segment, while PMTUD determines the largest packet size that can be sent over a given path without suffering fragmentation.

 

But how can we verify the MSS used for the BGP peering sessions between the Nexus 3048 switches?

Nexus 3048 switches running NX-OS allow the administrator to check the MSS of the BGP TCP sessions with the following command: show sockets connection tcp detail.

Below we see two BGP TCP sessions between the IPv4 endpoints (the switch loopback interfaces), and each session shows an MSS of 9164 bytes.

BGP TCP Session Maximum Segment Size Verification

NY-N3K-LEAF-10# show sockets connection tcp local 172.16.3.10 detail

 

---snip---

 

Kernel Socket Connection:

State      Recv-Q Send-Q        Local Address:Port          Peer Address:Port

 

ESTAB      0      0               172.16.3.10:24415          172.16.3.11:179    ino:78187 sk:ffff88011f352700

 

     skmem:(r0,rb262144,t0,tb262144,f0,w0,o0) ts sack cubic wscale:2,2 rto:210 rtt:12.916/14.166 ato:40 mss:9164 cwnd:10 send 56.8Mbps rcv_space:18352

 

 

ESTAB      0      0               172.16.3.10:45719          172.16.3.12:179    ino:79218 sk:ffff880115de6800

 

     skmem:(r0,rb262144,t0,tb262144,f0,w0,o0) ts sack cubic wscale:2,2 rto:203.333 rtt:3.333/1.666 ato:40 mss:9164 cwnd:10 send 220.0Mbps rcv_space:18352

 

 

NY-N3K-LEAF-10#

Please always reset the BGP session after you change the MTU, as the MSS is only negotiated during the initial TCP three-way handshake; a sketch of such a reset is shown below.
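
A minimal sketch of resetting the peering towards Leaf-11, using the neighbor address from the output above (a hard reset is disruptive, and the exact clear syntax can vary slightly between NX-OS releases):

Nexus BGP Session Reset (sketch)

NY-N3K-LEAF-10# clear bgp ipv4 unicast 172.16.3.11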

 

The MSS value of 9164 bytes confirms that the physical underlay network is ready with an end-to-end MTU of 9216 bytes. But why is the MSS value of the BGP session (9164) 12 bytes smaller than the TCP MSS value (9176) negotiated during the TCP three-way handshake?

Again, in many TCP/IP stack implementations we see an MSS of 1460 bytes for an interface MTU of 1500 bytes, or an MSS of 9176 bytes for an interface MTU of 9216 bytes (a difference of 40 bytes), but there are other factors that can change this. For example, if both sides support RFC 1323/7323 (enhanced timestamps, window scaling, PAWS***), the timestamps option adds 12 bytes to the TCP header, reducing the payload to 1448 bytes respectively 9164 bytes.

And indeed, the NX-OS TCP/IP stack used for BGP supports the TCP enhanced timestamps option by default and leverages the PMTUD (RFC 1191) feature to account for the extra 12 bytes, hence reducing the maximum payload (the BGP messages in our case) to an MSS of 9164 bytes (9216 - 20 - 20 - 12).

 

The Wireshark sniffer trace below confirms the extra 12 bytes used for the TCP timestamps option.

Wireshark-TCP-12bytes-Option-timestamps.png

I hope you had a bit of fun reading this small Network Readiness write-up.

 

Footnote 1: Configure and Verify Maximum Transmission Unit on Cisco Nexus Platforms - Cisco

** A 20-byte TCP header is only correct when no TCP header options are used; RFC 1323 - TCP Extensions for High Performance (replaced by RFC 7323 - TCP Extensions for High Performance) defines TCP extensions which require up to 12 bytes more.

*** PAWS = Protect Against Wrapped Sequences

 

Software Inventory:

vSphere version: VMware ESXi, 6.5.0, 15256549

vCenter version: 6.5.0, 10964411

NSX-T version: 2.5.1.0.0.15314288 (GA)

Cisco Nexus 3048 NX-OS version: 7.0(3)I7(6)

 

Blog history:

Version 1.0 - 23.03.2020 - first published version

