Dear readers,
I was recently at a customer site where we discussed the details of NSX-T north/south connectivity with active/active edge node virtual machines to maximize throughput and resiliency. Achieving the highest bandwidth in both directions, north to south and south to north, requires multiple edge nodes deployed in active/active mode leveraging ECMP routing.
But let's first take a basic look at an NSX-T ECMP deployment.
In a typical deployment, the physical router is a Layer 3 leaf switch acting as a Top of Rack (ToR) device. Two of them are required to provide redundancy. NSX-T basically supports two edge node deployment options: active/standby and active/active. For maximizing throughput with the highest level of resiliency, the active/active deployment option is the right choice. NSX-T is able to install up to eight paths leveraging ECMP routing. As you are most likely already familiar with NSX-T, you know that NSX-T requires the Service Router (SR) component on each individual edge node (VM or bare metal) to set up the BGP peering with the physical router. But have you ever thought about what eight ECMP path entries really mean? Are these eight paths counted on the Tier0 logical router, on the edge node itself, or somewhere else?
Before we talk about the eight ECMP paths, let us have a closer look at the physical setup. For this exercise I have only four ESXi hosts available in my lab. Each host is equipped with four 1Gbit/s pNICs. Two of these ESXi hosts are purely used to provide CPU and memory resources to the edge node VMs; the other two ESXi hosts are prepared with NSX-T (NSX-T VIBs installed). The two "Edge" ESXi hosts have two vDS, each configured with two pNICs. The first vDS is used for vmk0 management, vMotion and IP storage; the second vDS is used for the Tunnel End Point (TEP) encapsulated GENEVE traffic and the routed uplink traffic towards the ToR switches. The edge node VMs act as NSX-T transport nodes and typically have two or three N-VDS embedded (a future release will support a single N-VDS per edge node). The two compute hosts are prepared with NSX-T and also act as transport nodes, but they have a slightly different vSwitch setup. The first vSwitch is again a vDS with two pNICs and is used for vmk0 management, vMotion and IP storage. The other two pNICs are assigned to the NSX-T N-VDS, which is responsible for the TEP traffic. The diagram below shows the simplified physical setup.
As you can easily see, the two "Edge" vSphere hosts have a total of eight edge node VMs installed. This is a purpose-built "Edge" vSphere cluster that serves edge node VMs only. Is this kind of deployment recommended in a real customer deployment? It depends :-)
Having four pNICs is probably a good choice, but for high bandwidth throughput, 10Gbit/s or 25Gbit/s interfaces are most likely preferred, or even required, instead of 1Gbit/s interfaces. When you host more than one edge node VM per ESXi host, I recommend using at least 25Gbit/s interfaces. As our focus is on maximizing throughput and resiliency, a customer deployment would likely have four or more ESXi hosts in the "Edge" vSphere cluster. Other aspects should be considered as well, like the storage system used (e.g. vSAN), operational aspects (e.g. maintenance mode) or vSphere cluster settings. For this lab, "small" sized edge node VMs are used; a real deployment should use "large" sized edge node VMs where maximal throughput is required. A dedicated purpose-built "Edge" vSphere cluster can be considered a best practice when maximal throughput and highest resiliency along with operational simplification are required. Here are two additional diagrams of the edge node VM deployment in my lab.
Now that we have an idea of how the physical environment looks, it is time to move forward and dig into the logical routing design.
To keep it simple, the diagram shows only a single compute transport node (NY-ESX70A) and only six of the eight edge node VMs. All eight edge node VMs are assigned to a single NSX-T edge cluster, and this edge cluster is assigned to the Tier0 logical router. The logical design shows a two-tier architecture with a Tier0 logical router and two Tier1 logical routers. This is a very common design. Centralized services are not deployed at the Tier1 level in this exercise. A Tier0 logical router consists in almost all cases (as you normally want to use static or dynamic routing to reach the physical world) of a Service Router (SR) and a Distributed Router (DR). Only edge nodes can host the Service Router (SR). As already said, the Tier1 logical routers in this exercise have only the DR component instantiated; a Service Router (SR) is not required, as centralized services (e.g. load balancer) are not configured. Each SR has two eBGP peerings with the physical routers. Please keep in mind that only the two overlay segments green-240 and blue-241 are user-configured segments. Workload VMs are attached to these overlay segments, which provide VM mobility across physical boundaries. The segment between the Tier0 SR and DR and the segments between the Tier0 DR and the Tier1 DRs are overlay segments automatically configured by NSX-T, including the IP addressing assignment.
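If you want to verify these eBGP peerings, you can do so directly on the edge node CLI. The following is just a minimal sketch based on the NSX-T 2.4 CLI; the edge node name NY-EDGE01, the prompts and the VRF ID 1 are placeholders, and the correct VRF ID of the Tier0 SR has to be taken from the get logical-routers output first.

NY-EDGE01> get logical-routers
NY-EDGE01> vrf 1
NY-EDGE01(tier0_sr)> get bgp neighbor summary

On every single edge node both eBGP sessions towards the ToR switches should be in the Established state.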
Meanwhile, you might have already recognized that eight edge nodes could equal eight ECMP paths. Yes, this is true....but where are these eight ECMP paths installed in the routing and forwarding tables? These eight paths are installed neither on the Tier0 logical router as a logical construct nor on a single edge node. The eight ECMP paths are installed on the Tier0 DR component of each individual compute transport node, in our case on the NY-ESX70A Tier0 DR and the NY-ESX71A Tier0 DR. The CLI output below shows the forwarding table on the compute transport node NY-ESX70A.
IPv4 Forwarding Table NY-ESX70A Tier0 DR

NY-ESX70A> get logical-router e4a0be38-e1b6-458a-8fad-d47222d04875 forwarding ipv4
Logical Routers Forwarding Table - IPv4
--------------------------------------------------------------------------------
Flags Legend: [U: Up], [G: Gateway], [C: Connected], [I: Interface]
              [H: Host], [R: Reject], [B: Blackhole], [F: Soft Flush], [E: ECMP]

Network            Gateway         Type    Interface UUID
================================================================================
0.0.0.0/0          169.254.0.2     UGE     48d83fc7-1117-4a28-92c0-7cd7597e525f
0.0.0.0/0          169.254.0.3     UGE     48d83fc7-1117-4a28-92c0-7cd7597e525f
0.0.0.0/0          169.254.0.4     UGE     48d83fc7-1117-4a28-92c0-7cd7597e525f
0.0.0.0/0          169.254.0.5     UGE     48d83fc7-1117-4a28-92c0-7cd7597e525f
0.0.0.0/0          169.254.0.6     UGE     48d83fc7-1117-4a28-92c0-7cd7597e525f
0.0.0.0/0          169.254.0.7     UGE     48d83fc7-1117-4a28-92c0-7cd7597e525f
0.0.0.0/0          169.254.0.8     UGE     48d83fc7-1117-4a28-92c0-7cd7597e525f
0.0.0.0/0          169.254.0.9     UGE     48d83fc7-1117-4a28-92c0-7cd7597e525f
100.64.48.0/31     0.0.0.0         UCI     03ae946a-bef4-45f5-a807-8e74fea878b6
100.64.48.2/31     0.0.0.0         UCI     923cbdaf-ad8a-45ce-9d9f-81d984c426e4
169.254.0.0/25     0.0.0.0         UCI     48d83fc7-1117-4a28-92c0-7cd7597e525f
--snip--
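In case you wonder where the logical router UUID in the command above comes from: the host transport node CLI (nsxcli) lists all DR instances locally installed on the hypervisor. Again just a short sketch based on the NSX-T 2.4 CLI, where <Tier0-DR-UUID> is a placeholder for the UUID taken from the first command.

NY-ESX70A> get logical-routers
NY-ESX70A> get logical-router <Tier0-DR-UUID> forwarding ipv4

The first command shows every DR instantiated on the host, including the Tier0 DR with its UUID, which is then used for the forwarding table lookup shown above.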
Each compute transport node can distribute traffic sourced from the attached workload VMs from south to north across these eight paths (as we have eight different next hops), a single path per Service Router. With such an active/active ECMP deployment we maximize the forwarding bandwidth from south to north. This is shown in the diagram below.
On the other hand, from north to south, each ToR switch has eight paths installed (indicated with "multipath") to reach the destination networks green-240 and blue-241. The ToR switch distributes the traffic from the physical world across all eight next hops. Here we achieve the maximum throughput from north to south as well. Let's have a look at the routing table of the two ToR switches for the destination network green-240.
BGP Table for "green" prefix 172.16.240.0/24 on RouterA and RouterB
NY-CAT3750G-A#show ip bgp 172.16.240.0/24
BGP routing table entry for 172.16.240.0/24, version 189
Paths: (9 available, best #8, table Default-IP-Routing-Table)
Multipath: eBGP
Flag: 0x1800
  Advertised to update-groups:
     1          2
  64513
    172.16.160.20 from 172.16.160.20 (172.16.160.20)
      Origin incomplete, metric 0, localpref 100, valid, external, multipath
  64513
    172.16.160.22 from 172.16.160.22 (172.16.160.22)
      Origin incomplete, metric 0, localpref 100, valid, external, multipath
  64513
    172.16.160.23 from 172.16.160.23 (172.16.160.23)
      Origin incomplete, metric 0, localpref 100, valid, external, multipath
  64513
    172.16.160.21 from 172.16.160.21 (172.16.160.21)
      Origin incomplete, metric 0, localpref 100, valid, external, multipath
  64513
    172.16.160.27 from 172.16.160.27 (172.16.160.27)
      Origin incomplete, metric 0, localpref 100, valid, external, multipath
  64513
    172.16.160.26 from 172.16.160.26 (172.16.160.26)
      Origin incomplete, metric 0, localpref 100, valid, external, multipath
  64513
    172.16.160.25 from 172.16.160.25 (172.16.160.25)
      Origin incomplete, metric 0, localpref 100, valid, external, multipath
  64513
    172.16.160.24 from 172.16.160.24 (172.16.160.24)
      Origin incomplete, metric 0, localpref 100, valid, external, multipath, best
  64513
    172.16.3.11 (metric 11) from 172.16.3.11 (172.16.3.11)
      Origin incomplete, metric 0, localpref 100, valid, internal
NY-CAT3750G-A#

NY-CAT3750G-B#show ip bgp 172.16.240.0/24
BGP routing table entry for 172.16.240.0/24, version 201
Paths: (9 available, best #9, table Default-IP-Routing-Table)
Multipath: eBGP
Flag: 0x1800
  Advertised to update-groups:
     1          2
  64513
    172.16.161.20 from 172.16.161.20 (172.16.160.20)
      Origin incomplete, metric 0, localpref 100, valid, external, multipath
  64513
    172.16.161.23 from 172.16.161.23 (172.16.160.23)
      Origin incomplete, metric 0, localpref 100, valid, external, multipath
  64513
    172.16.161.21 from 172.16.161.21 (172.16.160.21)
      Origin incomplete, metric 0, localpref 100, valid, external, multipath
  64513
    172.16.161.26 from 172.16.161.26 (172.16.160.26)
      Origin incomplete, metric 0, localpref 100, valid, external, multipath
  64513
    172.16.161.22 from 172.16.161.22 (172.16.160.22)
      Origin incomplete, metric 0, localpref 100, valid, external, multipath
  64513
    172.16.161.27 from 172.16.161.27 (172.16.160.27)
      Origin incomplete, metric 0, localpref 100, valid, external, multipath
  64513
    172.16.161.25 from 172.16.161.25 (172.16.160.25)
      Origin incomplete, metric 0, localpref 100, valid, external, multipath
  64513
    172.16.3.10 (metric 11) from 172.16.3.10 (172.16.3.10)
      Origin incomplete, metric 0, localpref 100, valid, internal
  64513
    172.16.161.24 from 172.16.161.24 (172.16.160.24)
      Origin incomplete, metric 0, localpref 100, valid, external, multipath, best
NY-CAT3750G-B#
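The "multipath" flag in the output above does not show up by default; the ToR switches have to be configured for eBGP multipath. The snippet below is only a minimal Cisco IOS sketch for RouterA and not the actual lab configuration: the local AS 64512 and the iBGP neighbor statement are assumptions, while the eBGP neighbor addresses and the NSX-T AS 64513 are taken from the BGP table above.

router bgp 64512
 bgp log-neighbor-changes
 ! one eBGP session per edge node SR uplink towards this ToR
 neighbor 172.16.160.20 remote-as 64513
 neighbor 172.16.160.21 remote-as 64513
 neighbor 172.16.160.22 remote-as 64513
 neighbor 172.16.160.23 remote-as 64513
 neighbor 172.16.160.24 remote-as 64513
 neighbor 172.16.160.25 remote-as 64513
 neighbor 172.16.160.26 remote-as 64513
 neighbor 172.16.160.27 remote-as 64513
 ! iBGP session to the other ToR (visible as the "internal" path in the output above)
 neighbor 172.16.3.11 remote-as 64512
 ! install up to eight equal eBGP paths instead of only the single best path
 maximum-paths 8

Without the maximum-paths statement the ToR would still receive all eight paths but would only install the single best one into its routing table.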
Traffic arriving at the Service Router (SR) from the ToR switches is kept local on the edge node before it is forwarded, GENEVE encapsulated, to the destination VM. This is shown in the next diagram below.
And what is the final conclusion of this little lab exercise?
Each single Service Router on an edge node provides exactly one next hop to each individual compute transport node. The number of BGP peerings per edge node VM is not relevant for the eight ECMP paths; the number of edge nodes is what matters. Theoretically, a single eBGP peering from each edge node would achieve the same number of ECMP paths. But please keep in mind that two BGP sessions per edge node provide better resiliency. I hope you had a little bit of fun reading this NSX-T ECMP edge node write-up.
Software Inventory:
vSphere version: 6.5.0, build 13635690
vCenter version: 6.5.0, build 10964411
NSX-T version: 2.4.1.0.0.13716575
Blog history
Version 1.0 - 06.08.2019 - first published version