EVPN VXLAN Design Guide
eos.arista.com/evpn-vxlan-design-guide
A Detailed Overview of the EVPN & VxLAN Protocols, Route Types, Use-Cases and
Architectures
Contents
1. Introduction
2. VXLAN Overview
2.1 VXLAN Bridging
2.2 VXLAN Routing
3. EVPN Overview
3.1 EVPN Operational Benefits
3.2 EVPN Terminology
3.3 EVPN Address Family and Routes
3.4 EVPN Service Models
4. EVPN Core Operations
4.1 MAC Address Learning
4.2 ARP Suppression
4.3 MAC Mobility
4.4 MAC Address Damping
4.5 Broadcast and Multicast Traffic
4.6 Integrated Routing and Bridging
4.7 EVPN Type 5 Routes – IP Prefix advertisement
4.8 Summary Comparison of Route Type-2 and Type-5 Prefix Announcements
4.9 Auto RT and Auto RD For VLAN-Based EVIs
5. Deployment Models
5.1 Underlay and Overlay Design Options
5.2 Site Topology Design Options
5.3 Layer 2 VPN deployment model
5.4 EVPN VXLAN Layer 2 With Layer 3 IRB Integration
5.5 Pure Layer 3 VPN Deployment Model (Type-5 only)
7. Configuration Guides & Further Reading
7.1 General Collateral
7.2 Layer 2 EVPN VXLAN Configuration Guides
7.3 IRB EVPN VXLAN Configuration Guides
7.4 Layer 3 EVPN VXLAN Configuration Guides
7.5 Related EVPN VxLAN Services and Functions
7.6 Configuring & Managing EVPN VXLAN Using CloudVision
1. Introduction
This document describes the operation and configuration of BGP EVPN Services over a
VXLAN (Virtual eXtensible LAN) overlay on Arista platforms.
The focus of this design guide is VXLAN as the data-plane encapsulation for the overlay tunnels, and the Multiprotocol BGP (MP-BGP) EVPN address family for control-plane signaling in the overlay. MP-BGP EVPN is not only used for advertising MAC addresses, MAC and IP bindings, and IP prefixes across the overlay; it also makes learning in the overlay more efficient and enables enhanced active/active topologies. Examples of these efficiencies and enhancements include:
Control-plane learning of MAC and IP information, in contrast to existing Layer 2 VPN technologies, such as VPLS, which learn only through the data plane and have no Layer 3 awareness.
Active-active forwarding into dual-homed environments, enabled by control-plane learning together with the split-horizon and designated-forwarder mechanisms. Again this contrasts with VPLS, which is exclusively active-standby because it lacks a capability to detect loops.
The use of BGP route policies to control MAC (and MAC+IP) advertisements, much like IP VPNs, unlike existing Layer 2 VPN technologies that rely on data-plane MAC filtering locally on each device.
2. VXLAN Overview
Firstly, let's review the VXLAN data-plane encapsulation used to provide the overlay tunnels when BGP EVPN runs over an IP-only underlay.
The VXLAN protocol is an IETF standard (RFC 7348) co-authored by Arista. The standard defines a MAC-in-IP encapsulation protocol that allows layer 2 domains to be stretched across a layer 3 IP infrastructure. The protocol is typically deployed as a data center technology to create overlay network topologies within and across data centers for:
1. Providing layer 2 connectivity between racks, or halls of the data center without
requiring an underlying layer 2 infrastructure
2. Linking geographically dispersed data centers as a data center Interconnect (DCI)
technology
3. Replacement for traditional MPLS technologies in MAN and WAN environments, to
provide Layer 2 and Layer 3 VPN services across an IP only infrastructure
To perform the packet encapsulation and forwarding within the overlay network, the standard
introduces a set of new components and functions to the traditional network forwarding and
control plane.
Figure 2.1 VXLAN Encapsulation
Virtual Tunnel End-point (VTEP): The VTEP acts as the entry and exit point into and out of a VXLAN overlay network. The task of the VTEP is to encapsulate locally received traffic destined for nodes learnt behind a remote VTEP with a VXLAN header. For traffic received from a remote VTEP, it decapsulates the traffic and forwards it to the relevant locally attached nodes using standard layer 2 forwarding techniques.
Software switch: The software virtual switch within the hypervisor of a physical server can also provide VXLAN forwarding for directly attached Virtual Machines (VMs).
Virtual Tunnel Interface (VTI): The IP interface of the VTEP. The originating (local) VTEP uses this as the source IP address for any traffic to be VXLAN encapsulated, and it is the destination IP address for any VXLAN encapsulated traffic destined to the VTEP.
VXLAN Frame: The outer IP header added by the VTEP is a standard IP/UDP header, containing the source IP address of the local VTEP and the destination IP address of the remote VTEP. To provide a level of entropy for load-balancing the inner IP packet across the network, the source port of the UDP header is a hash of the inner frame.
Virtual Network Identifier (VNI): The VNI is a 24-bit field contained within the VXLAN header
of the encapsulated frame and is the logical layer 2 network identifier for the overlay network.
The use of a 24-bit field for the VNI provides the ability to scale the layer 2 domains within
the overlay beyond the 4k limit of traditional 802.1Q VLANs, providing support for potentially
16 million layer 2 domains.
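The header layout just described can be sketched in a few lines of Python. This is an illustrative model rather than a production encoder: the 8-byte VXLAN header carries the I flag and the 24-bit VNI as RFC 7348 defines them, and the outer UDP source port is derived from a hash of the inner frame (the exact hash is implementation specific; CRC32 is used here purely as a stand-in).

```python
import struct
import zlib

VXLAN_UDP_PORT = 4789  # IANA-assigned destination port (RFC 7348)

def vxlan_header(vni: int) -> bytes:
    """Build the 8-byte VXLAN header: flags word with the I bit set,
    followed by the 24-bit VNI in bits 8..31 of the second word."""
    if not 0 <= vni < 2 ** 24:
        raise ValueError("VNI must fit in 24 bits")
    flags = 0x08000000  # 'I' flag: the VNI field is valid
    return struct.pack("!II", flags, vni << 8)

def entropy_src_port(inner_frame: bytes) -> int:
    """Derive the outer UDP source port from a hash of the inner frame,
    giving underlay ECMP per-flow entropy (CRC32 is a stand-in hash)."""
    return 49152 + (zlib.crc32(inner_frame) % 16384)  # dynamic port range

hdr = vxlan_header(1010)
assert int.from_bytes(hdr[4:7], "big") == 1010  # the 24-bit VNI field
```

Packing the VNI into a dedicated 24-bit field is what lifts the 4k VLAN ceiling to roughly 16 million layer 2 domains.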
2.1 VXLAN Bridging
VXLAN bridging is the use of the VXLAN protocol to provide layer 2 connectivity across a layer 3 infrastructure. On an Arista VTEP this is achieved by taking a traditional layer 2 domain, defined by an access or trunked interface, and mapping it to a VXLAN VNI. With a pair of VTEPs deployed, layer 2 connectivity can then be achieved between the VTEPs across the layer 3 infrastructure.
The configuration of VXLAN bridging on an Arista switch and the concepts involved are
covered in the document available at the following link:
https://eos.arista.com/vxlan-with-mlag-configuration-guide/
2.2 VXLAN Routing
As the default gateway for Serv-1, traffic destined to Serv-2 is routed by leaf-1 into Serv-2's subnet (10.10.20.0/20). The destination MAC for Serv-2 has been learnt behind a remote VTEP (VTEP-2), so to forward the traffic to Serv-2, leaf-1 VXLAN encapsulates the frame and VXLAN bridges it to the remote VTEP.
On receiving the frame, VTEP-2 decapsulates the VXLAN frame and, based on its local configuration, maps VNI 1020 to VLAN 20. A MAC address lookup in VLAN 20 results in the packet being forwarded to Serv-2. The packet walk of the traffic flow is illustrated below.
The traffic flow from Serv-1 to Serv-2 therefore results in layer 3 routing of the original frame at leaf-1 and VXLAN bridging to the remote VTEP, VTEP-2. Traffic flow in the opposite direction, with the default gateway for Serv-2 being 10.10.20.1, results in VXLAN bridging between VTEP-2 and VTEP-1 and routing of the inner frame on leaf-1 for local forwarding to the final destination, Serv-1.
The configuration of VXLAN Routing on an Arista switch and the concepts involved are
covered in the document available at the following link:
https://eos.arista.com/vxlan-routing-with-mlag/
3. EVPN Overview
EVPN is a standards-based BGP control plane for advertising MAC addresses, MAC and IP bindings, and IP prefixes. The standard was first defined in RFC 7432 for an MPLS data plane; that work has since been extended in the BESS (BGP Enabled ServiceS) working group, with additional documents defining its operation in the context of Network Virtualization Overlays (NVO) for VXLAN, NVGRE and MPLS-over-GRE data planes (RFC 8365). This design document focuses on EVPN and its operation with a VXLAN data plane, which has become the de facto standard for building overlay networks in the data center.
A number of control planes exist today for VXLAN, based on specific use cases, whether the requirement is to integrate with an SDN overlay controller or to operate in a standards-based flood-and-learn control-plane model.
Current flood-and-learn models operate either with a multicast control plane or with ingress replication, where the operator manually configures the remote VTEPs in the flood list. Both of these are data-plane driven, that is, MACs are learnt via flooding. In the IP multicast model, MACs are learnt via flooding to an IP multicast group in the underlay, while ingress replication (head-end replication, HER) floods to configured VTEP endpoints and no IP multicast is required in the underlay.
In the controller-based solution with CloudVision eXchange (CVX), locally learned MACs are published to a centralized controller, which then programs them on all participating VTEPs.
Finally, there is what's known as "controller-less" BGP EVPN MAC learning, where a standards-based control plane (MP-BGP) is used to discover remote VTEPs and advertise MAC addresses and MAC/IP bindings in the VXLAN overlay, eliminating the flood-and-learn paradigms of the multicast and HER approaches mentioned above. As a standards-based approach, the discovery and advertisement of the EVPN service models can interoperate across multiple vendors.
The initial EVPN standard is RFC 7432, which defines the BGP EVPN control plane and specifies an MPLS data plane. The control plane was later extended to cover additional data-plane encapsulation models, including VXLAN, NVGRE and MPLS over GRE.
This highlights an important and powerful advantage of BGP EVPN: it is a single control plane for multiple data-plane encapsulations, and it defines both layer 2 and layer 3 VPN services. As network operators drive toward simplicity and automation, having one control-plane protocol and address family for all data planes and VPN services proves extremely powerful.
The BESS working group has defined a number of standards and draft proposals for the operation of EVPN. The relevant standards discussed in this document, in the context of a VXLAN data plane, are summarised below:
RFC 7432 (BGP MPLS-Based Ethernet VPN): https://tools.ietf.org/html/rfc7432
RFC 8365 (A Network Virtualization Overlay Solution Using EVPN): https://tools.ietf.org/html/rfc8365
Inter-subnet forwarding in EVPN: https://tools.ietf.org/html/draft-ietf-bess-evpn-inter-subnet-forwarding-03
IP prefix advertisement in EVPN: https://tools.ietf.org/html/draft-ietf-bess-evpn-prefix-advertisement-04
3.1 EVPN Operational Benefits
A standards-based BGP control plane for VXLAN, providing support for multi-vendor interoperability.
Reuse of well-known and mature MP-BGP concepts, Route Targets and Route Distinguishers, to deliver multi-tenant layer 2 and layer 3 VPNs.
MAC address learning in the control plane using BGP rather than flood-and-learn, making the operation more akin to that of an IP (L3) VPN service.
Optional ARP (MAC-to-IP) learning and suppression to reduce traffic flooding across layer 2 domains.
3.2 EVPN Terminology
Network Virtualization Overlay (NVO): The overlay network used to deliver the layer 2 and layer 3 VPN services. For VXLAN encapsulation this defines a VXLAN domain, which includes one or more VNIs, for the transport of tenant traffic over a common IP underlay infrastructure.
Network Virtualization End-Point (NVE): The provider edge node within the NVO
environment responsible for the encapsulation of tenant traffic into the overlay
network. For a VXLAN data plane, this defines the Virtual Tunnel End-Point (VTEP)
Virtual Network Identifier (VNI): The label identifier within the VXLAN encapsulated
frame, defining a layer 2 domain in the overlay network
EVPN instance (EVI): A logical switch within the EVPN domain which spans and
interconnects multiple VTEPs to provide tenant layer 2 and layer 3 connectivity.
MAC-VRF: A Virtual Routing and Forwarding table for storing Media Access Control
(MAC) addresses on a VTEP for a specific tenant.
To provide multi-tenancy, the standard uses the aforementioned traditional VPN methods to
control the import and export of routes and provide support for overlapping IP addresses
between tenants.
Multi-protocol BGP for EVPN: A new AFI and SAFI have been defined for EVPN. These are AFI 25 (L2VPN) and SAFI 70 (EVPN).
EVPN L2/L3 Tenant Segmentation: Similar to standard MPLS VPN configurations, Route Distinguishers (RDs) and Route Targets (RTs) are defined for the VPN.
Route Target (RT): To control the import and export of routes across VRFs, EVPN routes are advertised with Route Target (RT) BGP extended communities. The RT can be auto-derived to simplify configuration; typically it is based on the AS number and the VNI of the MAC-VRF.
3.3 EVPN Address Family and Routes
As illustrated in figure 3.4, the original EVPN MPLS RFC (7432) and the subsequent IP prefix draft (draft-ietf-bess-evpn-prefix-advertisement-04) introduce five unique EVPN route types.
The Ethernet A-D per ESI route announces the reachability of a multi-homed Ethernet Segment. This route type is used for fast convergence (i.e. 'mass withdraw'), as well as for the split-horizon filtering used in active-active multi-homing.
The Ethernet A-D per EVI route is used to implement the Aliasing and Backup Path features of EVPN associated with active-active multi-homing.
The type-2 route is used to advertise the reachability of a MAC address, or optionally a MAC and IP binding, as learnt by a specific EVI. With the advertisement of the host's optional IP address, EVPN gives VTEPs the ability to perform ARP suppression and ARP proxy to reduce flooding within the layer 2 VPN.
The type-3 route is used to advertise membership of a specific layer 2 domain (a VNI within the VXLAN domain), allowing the dynamic discovery of remote VTEPs in a specific VNI and the population of each VTEP's ingress flood list for the forwarding of Broadcast, Unknown unicast and Multicast (BUM) traffic.
The type-4 route is specific to VTEPs supporting the EVPN multi-homing model, for active-active and active-standby forwarding. The route is used to discover VTEPs attached to the same shared Ethernet Segment; additionally, it is used in the Designated Forwarder (DF) election process.
The type-5 route is used to advertise IP prefixes rather than the MAC and IP host addresses of the type-2 route. The advertisement of prefixes into the EVPN domain provides the ability to build classic layer 3 VPN topologies.
A detailed description of the function of each of these route types in the operation of EVPN, providing multi-tenant layer 2 and layer 3 VPN services, is given in Section 4 of this document.
While this guide focuses on EVPN with VXLAN data-plane encapsulation, it is important to note that, in addition to the new route types, a BGP Encapsulation extended community is included in all advertisements to indicate the data-plane encapsulation. The Encapsulation extended community is defined in RFC 5512. The different IANA-registered tunnel types for an NVO environment are summarized in the table below:
Figure 3.5 Defined Data-Plane Encapsulations
3.4 EVPN Service Models
In the VLAN-based service there is a one-to-one mapping between the VLAN ID and the MAC-VRF of the EVPN instance. With the MAC-VRF mapping directly to the associated VLAN, there is a single bridge table within the MAC-VRF. The VLAN tag is not carried in any route update, and the VNI label in the route advertisement uniquely identifies the bridge domain of the MAC-VRF in the VXLAN forwarding plane.
With a one-to-one mapping between the VLAN ID and the MAC-VRF of the EVI, the EVI represents an individual tenant subnet/VLAN in the overlay. The one-to-one mapping also means the route-target associated with the MAC-VRF uniquely identifies the tenant's subnet/VLAN, allowing granular importing of MAC routes on a per-VLAN basis on each VTEP.
In this service, the associated MAC-VRF table is identified by the Route Target in the control plane and by the VNI in the data plane, and the MAC-VRF table corresponds to a single VLAN bridge domain.
In the VLAN-aware bundle service there is a many-to-one mapping between the VLAN IDs and the MAC-VRF of the EVPN instance. The MAC-VRF contains a unique layer 2 bridge table for each associated VLAN ID and a unique VNI label for each bridge domain. Because the MAC-VRF contains multiple layer 2 bridge tables, the VLAN tag is carried in every EVPN route update to allow mapping to the correct tenant bridge table within the MAC-VRF. Only the unique VNI label is carried in the VXLAN data plane, to allow forwarding to the correct VLAN within the MAC-VRF.
In this service, the MAC-VRF of the EVI represents multiple subnets/VLANs of the tenant. The layer 2 bridge table of the MAC-VRF is identified by the combination of the Route Target and the Ethernet tag in the control plane, and by the unique VNI in the VXLAN data plane.
This service type is a common DCI/WAN deployment, where a tenant's VLANs are bundled into a single EVI while VLAN "awareness" is retained in the EVPN service, as the VNI tag advertised in the MAC-IP route now identifies the VLAN within the EVI.
Bundling into a service like this reduces the number of EVIs that need to be configured, reducing complexity and the control-plane signaling between PEs.
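The contrast between the two service models can be sketched as a small data model (all names here are hypothetical, for illustration only): a VLAN-based EVI holds exactly one bridge table and one VNI, while a VLAN-aware bundle EVI maps many VLANs to per-VLAN bridge tables and VNIs, with the Ethernet tag in the route selecting the table.

```python
# Hypothetical data model contrasting the two service types described above.
# VLAN-based: one EVI per VLAN, a single bridge table, Ethernet tag = 0.
# VLAN-aware bundle: one EVI for many VLANs; the Ethernet tag in the route
# selects the bridge table, and each VLAN carries its own VNI.

vlan_based_evis = {
    # EVI (MAC-VRF) keyed by VLAN: exactly one bridge table and one VNI each
    10: {"vni": 1010, "route_target": "1010:1010", "bridge_table": {}},
    11: {"vni": 1011, "route_target": "1011:1011", "bridge_table": {}},
}

vlan_aware_bundle_evi = {
    "route_target": "2000:2000",      # one RT identifies the whole bundle
    "vlans": {                         # ethernet-tag -> per-VLAN bridge domain
        10: {"vni": 1010, "bridge_table": {}},
        11: {"vni": 1011, "bridge_table": {}},
    },
}

def lookup_bundle_bridge_table(evi: dict, ethernet_tag: int) -> dict:
    """In the bundle service, the Ethernet tag carried in the EVPN route
    selects the tenant bridge table within the shared MAC-VRF."""
    return evi["vlans"][ethernet_tag]["bridge_table"]

table = lookup_bundle_bridge_table(vlan_aware_bundle_evi, 10)
table["MAC-1"] = "Ethernet1"  # a MAC learn lands only in VLAN 10's table
```

Either way the data plane carries only the VNI; the structural difference is purely in how the control plane scopes route-targets and bridge tables.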
4. EVPN Core Operations
4.1 MAC Address Learning
The route advertisements are EVPN type-2 routes, which can advertise just the MAC address of the host, or optionally the MAC and IP address of the host. The format of the type-2 route is illustrated in the figure below, along with the mandatory and optional extended communities attached to the route.
Figure 4.2 EVPN Type 2 MAC and IP route format
Ethernet Segment Identifier (ESI): this field is populated when the VTEP is participating in a multi-homed topology, as discussed in the following sections.
Ethernet tag ID: 0 for a VLAN-based service, and the customer VLAN ID in a VLAN-aware bundle service.
IP address of the host associated with the advertised MAC address. The advertisement of the host's IP address is optional.
Label: in the context of a VXLAN forwarding plane, the VNI associated with the MAC-VRF/layer 2 domain the advertised MAC address has been learnt on.
Route Target: associated with the MAC-VRF and advertised with the route to allow control of the import and export of routes.
The MAC mobility extended community, as discussed in the following section, is used during MAC moves to update all VTEPs with the new location of the host.
Importantly, the optional MAC and IP route can be advertised separately from the MAC-only type-2 route. This is done so that if the MAC and IP route is cleared, i.e. the ARP entry is flushed, or the ARP timeout is set lower than the MAC timeout, the MAC-only route still exists.
4.3 MAC Mobility
When a host moves from one VTEP to another, the new VTEP advertises the MAC from its location while the original advertisement may still be active. To cater for this situation, a sequence number is attached to the new MAC advertisement, ensuring an EVI-wide refresh of the MAC table, with VTEPs updating their forwarding tables to point to the advertising VTEP as the new next-hop for the MAC address.
Figure 4.3 EVPN type-2 MAC Mobility Behaviour
When a MAC address is learnt and advertised for the first time, it is advertised without a sequence number and the receiving VTEPs assume the sequence to be zero. On detection of a MAC move, i.e. a MAC is learnt locally while the same MAC route is active via a type-2 advertisement, the sequence number is incremented by one and the MAC route is advertised to the remote peers. The original advertising VTEP receives the MAC route with a now higher sequence number and withdraws its own local MAC route. All other VTEPs flush the original MAC route and update their tables with the new, higher-sequence-number route.
4.4 MAC Address Damping
On advertising a locally learned MAC, the VTEP starts an M-second timer (default 180 seconds). If the VTEP detects N MAC moves (default 5) for the route within the M-second window, it generates a syslog message and stops sending and processing any further updates for the route.
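The sequence-number and damping behaviour above can be sketched as follows. This is a simplified model under stated assumptions: the class name and structure are my own, the defaults (N = 5 moves within M = 180 seconds) come from the text, and real implementations differ in how a damped MAC is eventually released (a sliding window is assumed here).

```python
import time

MOVE_LIMIT = 5        # N: MAC moves tolerated...
WINDOW_SECONDS = 180  # M: ...within this window (defaults from the text)

class MacMobilityTracker:
    """Sketch of the MAC mobility sequence-number and damping behaviour."""

    def __init__(self):
        self.sequence = {}  # mac -> current mobility sequence number
        self.moves = {}     # mac -> timestamps of recent moves

    def local_learn(self, mac, now=None):
        """Return the sequence number to advertise, or None if damped."""
        now = time.monotonic() if now is None else now
        # Keep only the moves inside the sliding M-second window.
        recent = [t for t in self.moves.get(mac, []) if now - t < WINDOW_SECONDS]
        if len(recent) >= MOVE_LIMIT:
            print(f"MAC-MOVE: {mac} damped, updates suppressed")  # syslog stand-in
            return None
        seq = self.sequence.get(mac)
        if seq is None:
            self.sequence[mac] = 0        # first advertisement: implicit seq 0
        else:
            self.sequence[mac] = seq + 1  # MAC move: bump the sequence number
            recent.append(now)
        self.moves[mac] = recent
        return self.sequence[mac]
```

The key point mirrored from the text: the first advertisement carries an implicit sequence of zero, each detected move increments it, and a burst of moves inside the window triggers damping instead of another advertisement.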
4.5 Broadcast and Multicast Traffic
Broadcast, Unknown unicast and Multicast (BUM) traffic is handled within the EVPN forwarding model using ingress replication, where the BUM frame is replicated on the ingress VTEP to each of the remote VTEPs in the associated EVI/VNI. The VTEP replication list for the EVI is dynamically populated based on type-3 (Inclusive Multicast Ethernet Tag) route advertisements, with VTEPs advertising a type-3 route for each EVI they are members of.
Figure 4.4 EVPN type-3 IMET route behavior for ingress replication
From the figure above, the salient fields of the type-3 route are:
Multiprotocol Reachable NLRI (MP_REACH_NLRI): this attribute carries the next-hop for the advertised route. In the context of a VXLAN forwarding plane, this is the source address (VTI) of the advertising VTEP.
Ethernet tag: 0 for a VLAN-based service, and the MAC-VRF VNI for a VLAN-aware bundle service.
Route Target: associated with the MAC-VRF, or with the EVI in a VLAN-aware bundle service.
PMSI Tunnel Attribute: advertises the replication model the VTEP supports. The options defined within the standard are ingress replication and IP multicast.
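The type-3 mechanics reduce to maintaining a per-VNI flood list and replicating BUM frames to it, which can be sketched as below (function and variable names are hypothetical):

```python
# Illustrative sketch of how type-3 IMET routes drive head-end replication:
# each route contributes the advertising VTEP's address to the flood list for
# its VNI, and BUM frames are unicast-replicated to every entry in that list.

from collections import defaultdict

flood_lists = defaultdict(set)  # vni -> set of remote VTEP addresses

def on_type3_route(vni, originator_vtep, withdraw=False):
    """Update the ingress-replication flood list from an IMET route."""
    if withdraw:
        flood_lists[vni].discard(originator_vtep)
    else:
        flood_lists[vni].add(originator_vtep)

def replicate_bum(vni, frame):
    """Return one copy of the frame per remote VTEP in the EVI; each copy
    would be VXLAN encapsulated toward that VTEP's address."""
    return [(vtep, frame) for vtep in sorted(flood_lists[vni])]

on_type3_route(1010, "10.0.0.2")
on_type3_route(1010, "10.0.0.3")
copies = replicate_bum(1010, b"\xff" * 6)  # a broadcast frame
assert [vtep for vtep, _ in copies] == ["10.0.0.2", "10.0.0.3"]
```

A route withdrawal removes the VTEP from the list, so the flood list always tracks current EVI membership without any operator-configured VTEP list.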
4.6 Integrated Routing and Bridging
To provide a more optimal forwarding model and avoid traffic tromboning, the EVPN inter-subnet draft (draft-sajassi-l2vpn-evpn-inter-subnet-forwarding) proposes integrating the routing and bridging (IRB) functionality directly onto the VTEP, allowing the routing operation to occur as close to the end host as possible. The draft proposes two forwarding models for the IRB functionality, termed asymmetric IRB and symmetric IRB, which are described in the following sections.
Asymmetric IRB
In the asymmetric IRB model, the inter-subnet routing functionality is performed by the ingress VTEP, with the packet VXLAN bridged to the destination VTEP after the routing action. The egress VTEP then only needs to remove the VXLAN header and forward the packet onto the local layer 2 domain based on the VNI-to-VLAN mapping. On the return path the roles are reversed, with the destination VTEP now performing the ingress routing and VXLAN bridging operation, hence the term asymmetric IRB.
Figure 4.6 EVPN Asymmetrical IRB
To provide inter-subnet routing on all VTEPs for all subnets, an anycast IP address is utilized for each subnet and configured on every VTEP. The anycast IP acts as the default gateway for the hosts; therefore, regardless of where a host resides, its directly attached VTEP can act as its default gateway. The host MAC and MAC-to-IP bindings are learned by each VTEP based on a combination of local learning/ARP snooping and type-2 route advertisements from remote VTEPs. In a typical implementation, the optional MAC and IP type-2 route is advertised separately from the MAC-only type-2 route, so that if the MAC and IP route is cleared, for example when the ARP entry is flushed or the ARP timeout is set lower than the MAC timeout, the MAC-only route still exists.
The format of the two advertised type-2 routes for Server-1 is illustrated below, where the RD IP-A:1010 and route-target 1010:1010 distinguish the uniqueness of the route and allow it to be imported into the correct remote MAC-VRF based on the route-target import policy of the VTEP.
Figure 4.7 EVPN Comparison of MAC & MAC+IP Type 2 Route in Asymmetrical IRB
The packet flow for the asymmetric model is illustrated in the figure below, where two subnets are configured: subnet-10/VNI 1010 (green) and subnet-11/VNI 1011 (blue). For the traffic flow between Server-1 in subnet-10 and Server-4 in subnet-11, the ingress VTEP (VTEP-1) locally routes the packet into subnet-11/VNI 1011 and then VXLAN bridges the frame, inserting VNI 1011 into the VXLAN header with an inner DMAC equal to that of the destination host, Server-4. The receiving VTEP (VTEP-4) then only needs to perform a local layer 2 lookup, based on the VNI-to-VLAN mapping, for the DMAC of Server-4.
Figure 4.8 EVPN Asymmetrical IRB VxLAN Data-plane Forwarding Detail
For the asymmetric model to operate, the sending VTEP needs the information for all of the tenant's hosts (MAC and MAC-to-IP bindings) to route and bridge the packet. This means the VTEP needs to be a member of all the tenant's subnets/VNIs and have an associated SVI with an anycast IP for each subnet, and this is required on all VTEPs participating in the routing functionality for the tenant. This introduces scaling issues on multiple fronts:
1. VNI scaling: The number of VNIs supported on a hardware VTEP is finite, so not all VNIs can reside on all VTEPs. This is especially true in data center deployments, where the ToRs have traditionally been more resource constrained than chassis-based edge systems.
2. Forwarding memory scaling: The VTEP needs to store all host MACs and ARP entries for all subnets in the network. On a leaf switch this is a hardware resource, again finite and defined by the specific hardware platform deployed at the leaf.
Symmetric IRB
To address the scale issues of the asymmetric model, in the symmetric model the VTEP is only configured with the subnets of its directly attached hosts; connectivity to non-local subnets on a remote VTEP is achieved through an intermediate IP-VRF. The resulting forwarding model for symmetric IRB is illustrated in the figure below, for traffic between Server-1 on subnet-10 (green) and Server-4 on the remote subnet-11 (blue). In this model, the ingress VTEP routes the traffic between the local subnet (subnet-10) and the IP-VRF, of which both VTEPs are members; the egress VTEP then routes the frame from the IP-VRF to the destination subnet. The forwarding model results in both VTEPs performing a routing function, hence the term symmetric IRB.
To provide the inter-subnet routing when a subnet is stretched across multiple VTEPs, an anycast IP address is utilised for each subnet, but configured only on the VTEPs where the subnet exists. The host MAC and MAC-to-IP bindings are learnt by each VTEP based on a combination of local learning/ARP snooping and type-2 route advertisements.
In the symmetric IRB model the type-2 (MAC and IP) route is advertised with two labels and two route-targets, corresponding to the MAC-VRF the MAC address is learnt in and the IP-VRF. Remote VTEPs receiving the route import the IP host route into the corresponding IP-VRF based on the IP-VRF route-target, and if the corresponding MAC-VRF exists on the VTEP, the MAC address is imported into the local MAC-VRF based on the MAC-VRF's Route Target. The import behavior for the type-2 route is illustrated in the diagrams below for the host Server-1.
If the MAC-VRF exists locally on the receiving router, the IP host route is installed in the IP-VRF and the MAC address is installed in the MAC-VRF, as shown in Figure 4.10. With both a MAC route in the MAC-VRF and an IP host route in the IP-VRF, the VNI used in the data path depends on whether the traffic is being VXLAN bridged between hosts in the same VNI (1010) or VXLAN routed (VNI 2000).
Figure 4.10 EVPN Type 2 Route in Symmetrical IRB – MAC-VRF on Both VTEPs
Compare this to Figure 4.11, where the MAC-VRF does not exist on the receiving VTEP (VTEP-2). In this case the MAC route is not installed and is ignored, as there is no corresponding Route Target on the VTEP. In this scenario only the IP-VRF host route is installed on VTEP-2, and traffic from VTEP-2 destined to hosts on subnet-10 is therefore always VXLAN routed via the IP-VRF, VNI 2000.
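The import behaviour of Figures 4.10 and 4.11 can be sketched as follows; this is a hypothetical model, not EOS internals, reusing the route-targets and VNIs from the example:

```python
# Sketch of symmetric IRB type-2 import: the IP host route is always
# installed in the IP-VRF (matching IP-VRF RT), while the MAC route is
# installed only when a local MAC-VRF imports the route's MAC-VRF RT.
# The data path then bridges via the L2 VNI when possible, else routes
# via the L3 VNI.

local_ip_vrf = {"rt": "2000:2000", "vni": 2000, "host_routes": {}}
local_mac_vrfs = {"1010:1010": {"vni": 1010, "macs": {}}}  # RT -> MAC-VRF

def import_type2(route):
    """Install a received type-2 (MAC and IP) route per the text above."""
    if route["ip_vrf_rt"] == local_ip_vrf["rt"]:
        local_ip_vrf["host_routes"][route["ip"]] = (route["nexthop"],
                                                    route["l3_vni"])
    mac_vrf = local_mac_vrfs.get(route["mac_vrf_rt"])
    if mac_vrf is not None:  # Figure 4.10 case: bridging is possible
        mac_vrf["macs"][route["mac"]] = (route["nexthop"], route["l2_vni"])
    # else: Figure 4.11 case - MAC route ignored, traffic is always routed

route = {"ip": "10.10.10.1", "mac": "MAC-1", "nexthop": "VTEP-1",
         "l2_vni": 1010, "l3_vni": 2000,
         "mac_vrf_rt": "1010:1010", "ip_vrf_rt": "2000:2000"}
import_type2(route)
assert local_ip_vrf["host_routes"]["10.10.10.1"] == ("VTEP-1", 2000)
assert local_mac_vrfs["1010:1010"]["macs"]["MAC-1"] == ("VTEP-1", 1010)
```

Dropping the MAC-VRF entry when its route-target is absent is exactly what lets a VTEP stay out of subnets it has no local hosts in, which is the scaling win of the symmetric model.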
Figure 4.11 EVPN Type 2 Route in Symmetrical IRB – MAC-VRF Only Exists on Sending
VTEP.
The symmetric IRB type-2 route contains a number of additional extended community attributes compared to the asymmetric IRB type-2 route. The salient fields of the route are summarised below:
Route Distinguisher: the RD of the advertising node's MAC-VRF. For Server-1 in the example above this would be IP-A:1010.
MAC address: the 48-bit MAC address of the host being advertised. For Server-1 this would be MAC-1.
IP address and length: the IP address and 32-bit mask of the host being advertised. For Server-1 this would be IP-1.
MAC-VRF label: the VNI number (label) corresponding to the local layer 2 domain/MAC-VRF the host MAC was learnt on. For Server-1 this would be VNI 1010.
IP-VRF label: the VNI number (label) corresponding to the MAC-VRF's associated IP-VRF. For MAC-VRF 10 in the example above this would be the IP-VRF VNI 2000.
Extended community Route Target for the IP-VRF: the route-target of the IP-VRF associated with the learnt MAC address.
Extended community Router MAC: the system MAC of the advertising VTEP, used as the DMAC for any packet sent to the VTEP via the IP-VRF.
Extended community Route Target for the MAC-VRF: the route-target of the MAC-VRF associated with the learnt MAC address.
4.7 EVPN Type 5 Routes – IP Prefix Advertisement
The type-5 route was introduced to decouple the advertisement of an IP prefix from any specific MAC address, providing the ability to support floating IP addresses, optimising the mechanism for advertising external IP prefixes, and reducing churn when withdrawing IP prefixes.
The format of the new type-5 IP-prefix route is illustrated in the figure below:
The IP prefix draft defines a number of specific use cases for the type-5 route, which consequently affect the format and content of the fields within the route. The different deployment scenarios and use cases defined within the draft are summarised below:
Support for layer 2 appliances acting as a "bump in the wire" with no IP addresses configured, where instead of an IP next-hop the appliance has only a MAC next-hop.
IP-VRF-to-IP-VRF model, which is similar to inter-subnet forwarding for host routes (detailed in the symmetric/asymmetric IRB sections), except that only type-5 routes and IP prefixes are advertised, allowing the announcement of IP prefixes into a tenant's EVI domain for external connectivity outside the domain. The IP-VRF-to-IP-VRF model is further divided in the draft into three distinct use cases.
Interface-less
In interface-less mode, the IP prefixes within the type-5 route, whether local or learned from a connected router, are advertised to remote peers via the shared IP-VRF, as illustrated in the figure below.
As illustrated, the IP prefix (subnet-A) residing behind the router (Rtr-1) is learned via an IGP in EVI-1 on VTEP-1. The prefix is announced via a type-5 route and learnt by the remote VTEPs residing in the same EVI. The type-5 route is advertised with a route-target (2000:2000) and a VNI label (2000) equal to those of the IP-VRF which interconnects the VTEPs in the EVI; the router-mac extended community of the route defines the inner DMAC (equal to the system MAC of VTEP-1) for any VXLAN frame destined to the advertised IP prefix.
From a forwarding perspective, hosts residing on subnet-B communicating with a host on subnet-A send traffic to their default gateway, which is the IRB interface on VTEP-2 in VLAN 11/VNI 1011. VTEP-2 performs a route lookup for the destination subnet (subnet-A), which has been learnt in the IP-VRF with a next-hop of VTEP-1 and a VNI label of 2000. The packet is thus VXLAN encapsulated with a VNI label of 2000 and an inner DMAC of A (VTEP-1's system/router MAC), and routed to VTEP-1, the next-hop for the prefix. On receiving the frame, VTEP-1 decapsulates the packet and, as the inner DMAC is its own router MAC, performs a local route lookup for the destination subnet (subnet-A), which has been learnt with a next-hop of Rtr-1. The frame is forwarded directly to Rtr-1, which subsequently routes the packet to the local host on subnet-A. The format of the type-5 route in interface-less mode is illustrated in the figure below:
In this model, the VTEPs forming the EVI are interconnected via an IP-VRF, meaning there is
no IRB interface (MAC and IP) created for the interconnection on each of the VTEPs, hence
the term “interface-less”. With no IRB interface, the gateway IP address within the type-5
route is set to zero; traffic is routed to the prefix based on the next-hop of the route (the
VTEP IP) as well as the MAC address conveyed within the Router MAC extended community,
which represents the inner destination MAC of the VXLAN encapsulated frame.
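As a sketch of the interface-less model described above, the EOS-style configuration on VTEP-1 might look as follows. The VRF name (TEN-A), AS number, and addressing are illustrative assumptions only, and exact CLI syntax varies by EOS release:

```
! IP-VRF carrying the type-5 routes; no IRB interface is defined for it
vrf instance TEN-A
!
ip routing vrf TEN-A
!
interface Vxlan1
   vxlan source-interface Loopback1
   vxlan udp-port 4789
   ! Map the IP-VRF to the routing VNI (2000) referenced in the example
   vxlan vrf TEN-A vni 2000
!
router bgp 65001
   vrf TEN-A
      rd 1.1.1.1:2000
      ! Route-target 2000:2000 as used in the example above
      route-target import evpn 2000:2000
      route-target export evpn 2000:2000
      ! Originate type-5 routes for connected prefixes in the VRF
      redistribute connected
```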
4.8 Summary Comparison of Route Type-2 and Type-5 Prefix
Announcements
Although both type-2 and type-5 routes can announce IP prefixes, each is used for a specific
operation. Type-2 routes announce MAC and IP bindings, and are used for MAC mobility and
ARP resolution. Importantly, the next-hop of the prefix (which is always a host route) is fixed
to the associated advertised MAC address.
Type-5 routes are used to advertise IP prefixes with an associated next-hop IP address. As
discussed, this prefix announcement does not need to be bound to a MAC address: although
in interface-less mode the gateway MAC is sent in an extended community, the floating-IP
and interface-full modes do not explicitly set the MAC in the type-5 update, and instead rely
on a type-2 route to provide this resolution.
4.9 Auto RT and Auto RD For VLAN-Based EVIs
The auto-RT is derived from the local BGP AS number, the overlay index type, and the VNI in
the case of VXLAN (the normalized VLAN ID in the case of MPLS). This feature is only
supported with 2-byte AS numbers and with iBGP overlay peerings. In addition, solutions
like the CloudVision EVPN Configlet Builder enable customers to fully automate EVPN
deployments.
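For reference, a manually configured VLAN-based EVI in EOS is sketched below (VLAN, VNI, AS number and addresses are illustrative assumptions); with auto-RD/auto-RT, the rd and route-target lines would instead be derived automatically from the AS number and VNI as described above:

```
! Hypothetical MAC-VRF for VLAN 10, mapped to VNI 1010
interface Vxlan1
   vxlan vlan 10 vni 1010
!
router bgp 65001
   vlan 10
      rd 1.1.1.1:1010
      route-target both 1010:1010
      ! Advertise locally learnt MACs as type-2 routes
      redistribute learned
```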
5. Deployment Models
5.1 Underlay and Overlay Design Options
As the overlay network with EVPN VXLAN runs over plain IP, there is no requirement to run
an IGP to support a label distribution protocol such as LDP, or the IGP extensions required
for RSVP-TE. Customers therefore often use eBGP in the underlay because it is scalable,
predictable, and provides a high degree of route control via policies.
Given eBGP is used in the underlay, it is a simple extension to use the same eBGP
session to advertise the EVPN routes as well, or alternatively to use a separate multi-hop
eBGP session between loopbacks to advertise the EVPN routes, providing a logical
separation between the advertised underlay and overlay prefixes. A second design option for
the overlay EVPN prefixes is an iBGP topology on the loopbacks, with the spine switches
acting as resilient route reflectors. The third option outlined below is to use an IGP such as
OSPF or IS-IS as the underlay routing protocol, providing reachability between VTEP and
spine loopbacks, with an MP-iBGP overlay session for EVPN route advertisement. This
design would normally employ the spines as route reflectors to ease the BGP configuration.
With reference to Figure 5.1, eBGP is deployed in the underlay, running on the physical
interfaces of the leaf and spine switches for the VTEP prefixes, while a separate eBGP
session is configured to peer (via loopback interfaces) with one or all spine switches to
advertise the overlay EVPN prefixes. The spine switches transparently re-advertise the
EVPN prefixes to the other leaf routers without changing the next-hop, retaining any
advertised communities. This role is also known as an EVPN transit router, reflecting the
EVPN routes between leafs. Technically, it is not a route server because it is not transparent
in the AS path.
Figure 5.1 eBGP Underlay and eBGP Overlay Using Spine as EVPN Transit Routers
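A minimal sketch of the spine-side overlay peering for this transit role is shown below. The peer-group name, AS numbers and addresses are assumptions, and exact syntax varies by EOS release:

```
router bgp 65100
   neighbor EVPN-OVERLAY peer group
   ! Multi-hop eBGP between loopbacks for the overlay session
   neighbor EVPN-OVERLAY update-source Loopback0
   neighbor EVPN-OVERLAY ebgp-multihop 3
   ! Reflect EVPN routes without rewriting the next-hop (the VTEP IP)
   neighbor EVPN-OVERLAY next-hop-unchanged
   neighbor EVPN-OVERLAY send-community extended
   neighbor 10.255.0.1 peer group EVPN-OVERLAY
   neighbor 10.255.0.1 remote-as 65001
   !
   address-family evpn
      neighbor EVPN-OVERLAY activate
```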
With reference to Figure 5.2, eBGP is used in the underlay, with a separate iBGP topology in
the overlay to advertise the EVPN routes. The choice to run iBGP in the overlay is normally
so a route reflector can be used. Route reflector functionality is supported in EOS and would
typically be deployed on two of the spine switches, so there is no need for a full iBGP mesh
between all the leaf switches in the topology. The iBGP sessions in the diagram below are
configured on the loopback interfaces of the leaf and spine switches, with an iBGP peering
configured between the MLAG peers, and the “local-as” parameter configured on the iBGP
neighbors so that the same AS number is used for the iBGP EVPN sessions.
Figure 5.2 eBGP Underlay and iBGP Overlay Using Spine Route Reflectors
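A spine acting as an overlay route reflector could be sketched as follows (the peer-group name, AS number and addressing are assumptions):

```
router bgp 65001
   neighbor VTEP-PEERS peer group
   neighbor VTEP-PEERS remote-as 65001
   neighbor VTEP-PEERS update-source Loopback0
   ! Reflect EVPN routes between the leaf switches
   neighbor VTEP-PEERS route-reflector-client
   neighbor VTEP-PEERS send-community extended
   neighbor 10.255.0.1 peer group VTEP-PEERS
   !
   address-family evpn
      neighbor VTEP-PEERS activate
```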
With reference to Figure 5.3, an IGP such as OSPF or IS-IS is used in the underlay, with
iBGP in the overlay to advertise the EVPN routes. The choice to run MP-iBGP in the overlay
is normally so a route reflector can be used. Route reflector functionality is supported in EOS
and would typically be deployed on two of the spine switches, so there is no need for a full
iBGP mesh between all the leaf switches in the topology. The iBGP sessions in the diagram
below are configured on the loopback interfaces of the leaf and spine switches.
Figure 5.3 IGP Underlay and iBGP Overlay Using Spine Route Reflectors
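The IGP underlay portion of this design could be sketched as below, providing loopback reachability for the overlay iBGP sessions (the interfaces and addressing are illustrative assumptions):

```
interface Ethernet1
   no switchport
   ip address 10.1.1.0/31
   ! Point-to-point avoids DR/BDR election on fabric links
   ip ospf network point-to-point
!
interface Loopback0
   ip address 10.255.0.1/32
!
router ospf 100
   router-id 10.255.0.1
   passive-interface Loopback0
   network 10.1.1.0/31 area 0.0.0.0
   network 10.255.0.1/32 area 0.0.0.0
```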
To provide support for active-active multi-homing while preventing any disruption to the
existing leaf spine topology and cabling, EVPN operates in conjunction with Arista’s standard
MLAG leaf topology. While providing support for multi-homing via MLAG, the solution, as
documented in section 4.11, interoperates with any leaf running the EVPN multihoming
model, with type-1 route advertisements.
An MLAG leaf topology interworking with EVPN is illustrated in the diagram below, where the
physical switches are configured with a single shared logical VTEP (the next-hop for any
advertised EVPN routes) while running separate BGP EVPN sessions with the spines. Each
leaf advertises the same locally learnt type-2 and type-5 routes with the same next-hop: the
logical VTEP IP address.
Figure 5.4 Active-active forwarding with MLAG
With the same next-hop set by both leaf switches in the MLAG pair, they are able to operate
in active-active mode, with traffic load-balanced to both physical switches via the ECMP
topology of the leaf-spine architecture.
To provide resiliency in the event of a leaf losing connectivity to all four spine switches, an
iBGP session is run across the peer link interconnecting the two MLAG leaf switches, where
both underlay prefixes and overlay EVPN routes are exchanged.
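The shared logical VTEP and peer-link iBGP session can be sketched as follows; the configuration is assumed to be mirrored on both MLAG peers, and the addresses are illustrative:

```
! Same logical VTEP address configured on BOTH MLAG peers
interface Loopback1
   ip address 10.255.1.12/32
!
interface Vxlan1
   vxlan source-interface Loopback1
!
! iBGP session across the MLAG peer link (assumed peer-link SVI address)
router bgp 65001
   neighbor 192.0.2.1 remote-as 65001
```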
Just as in non-EVPN domains, MAC and ARP aging occurs for locally connected hosts; the
only difference with EVPN is that the remotely learnt BGP EVPN MAC and ARP entries are
programmed as static entries. To avoid the locally learnt MACs being flushed after the default
timeout (5 minutes) due to a lack of traffic, it is advised to configure the ARP aging time
(default 4 hours) to a value less than the configured MAC timeout. This configuration forces
an ARP refresh, and consequently a re-learning of the MAC entry, before the MAC is flushed.
The ARP aging timer is configured at the interface level with the CLI command ‘arp timeout
<60-65535 seconds>’; the MAC timeout is a global parameter configured with the CLI
command ‘mac address-table aging-time <10-1000000 seconds>’. This is particularly
important if there are “quiet” hosts in the domain and one needs to ensure MAC entries are
not flushed (and then relearnt) unnecessarily.
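For example, using the two commands quoted above, the ARP timer can be set below the MAC aging time (the values and SVI here are illustrative):

```
! Global MAC aging raised from the 5-minute default to 30 minutes
mac address-table aging-time 1800
!
interface Vlan10
   ! ARP refresh at 25 minutes, i.e. before the MAC entry would age out
   arp timeout 1500
```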
5.2 Site Topology Design Options
Another common variable is the site design: whether the VTEP acts as a default gateway for
a downstream Layer 2 domain, or peers at Layer 3 with downstream L3-aware nodes. This
becomes more relevant at the edge of the DC, where the VTEP is deployed for DCI or WAN
connectivity; instead of end-nodes being connected directly to the VTEP, it could be a Layer
2 switching domain, or the DC core routers, that peer with the VTEP.
Figure 5.5 Layer 2 POD Design
Firstly, the Layer 2 site: in this topology, the VTEP is the default gateway for the Layer 2
domain, and as such it can snoop the ARPs from the connected hosts, generate the type-2
MAC+IP routes for ARP suppression, and provide Layer 2 connectivity across sites. Layer 3
VPN services between sites can also be provided in this design, by advertising type-5 routes
for the local prefixes.
The second site design option is to have routers peering with the VTEPs as detailed below.
Figure 5.6 Layer 3 POD Design
In this topology, the southbound interfaces on the VTEP are configured as routed ports (no
switchport) and most commonly the routers peer using BGP. Again, any prefixes learnt via
this BGP session are advertised on to remote VTEPs as type-5 prefix routes when
redistribute learned is configured in the VRF. This provides Layer 3 VPN services between
sites.
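A sketch of such a Layer 3 handoff is shown below; the VRF name, interface, AS numbers and addressing are assumptions:

```
! Routed southbound interface placed in the tenant VRF
interface Ethernet10
   no switchport
   vrf TEN-A
   ip address 172.16.1.0/31
!
router bgp 65001
   vrf TEN-A
      ! eBGP peering with the downstream POD router
      neighbor 172.16.1.1 remote-as 65201
      redistribute connected
```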
5.3 Layer 2 VPN deployment model
EVPN BGP sessions are configured on all leafs in either a full-mesh multi-hop eBGP
topology between leaf/VTEP switches, or in a partial mesh using route server capabilities on
the spine. These BGP EVPN peering sessions advertise the dynamically learnt, and
statically configured, MAC addresses to all remote VTEPs. A sequence number is included
in these MAC address advertisements to suppress MAC flapping and MAC spoofing.
Broadcast, unknown unicast and multicast (BUM) traffic is flooded via head-end replication
(HER) to remote VTEPs using the BGP EVPN type 3 route.
The Arista leaf VTEPs can be configured in an active/active dual-homing configuration using
the standard MLAG configuration, and MAC addresses advertised via BGP EVPN updates to
remote VTEPs, with a next-hop of the shared virtual VTEP.
A configuration guide for the Layer 2 EVPN design with type-2 routes is available at:
https://eos.arista.com/evpn-configuration-layer-2-evpn-design-with-type-2-routes/
Figure 5.7 below illustrates the Layer 2 EVPN model for multiple VNIs, with two of the Arista
VTEPs configured as an active/active MLAG pair. One spine switch is configured as the
transit route server for the EVPN overlay routes, while a non-EVPN-aware gateway router
provides inter-VNI routing capabilities and external access.
Figure 5.8 illustrates the same topology as in Figure 5.7, except that now a 3rd party VTEP is
providing gateway functionality. The 3rd party GW VTEPs can be configured either as an ESI
active/standby or an active/active site.
Figure 5.8 Layer 2 EVPN model – Third-Party Layer 3 VTEP
5.4 EVPN VXLAN Layer 2 With Layer 3 IRB Integration
As detailed in Section 4.6, one of the fundamental concepts to understand in EVPN VXLAN
is inter-subnet routing: firstly, what it is, and secondly, what the different modes of inter-
subnet routing are.
Conversely, VXLAN overlay traffic that stays within the same VLAN, or VNI, is known as
intra-subnet forwarding, or intra-VLAN/intra-VNI forwarding.
Figure 5.8 Logical Intra VNI Topology
As shown in the diagrams above, intra-VNI forwarding only needs the destination MAC to
forward over the VTEP. In EVPN VXLAN this information is gleaned from the mandatory
MAC address in the type-2 route. The associated IP address may be included as a separate
MAC/IP route if an SVI is configured, and as discussed previously, this MAC/IP route allows
for ARP suppression. Intra-VNI forwarding is simple, and any vendor that supports EVPN
VXLAN will advertise MAC routes.
The complication arises when traffic needs to be routed between VNIs. In this case, there
are two methods for providing this functionality, Asymmetric forwarding and Symmetric
forwarding.
For inter-subnet routing to happen, Integrated Routing and Bridging (IRB) needs to be
enabled. In EOS, this means configuring an SVI for the VLAN.
Asymmetric IRB Forwarding – Arista’s “Direct” Routing Model
In this mode, the ingress VTEP must route the packet locally, then bridge over the VTEP so
that the receiving side only needs to do a VXLAN header strip and a direct Layer 2 forward
onto the receiving host.
As detailed in the figure above, the ingress VTEP locally routes the packet into VNI 1010
(orange) and then bridges the packet over the VTEP, with the VXLAN header carrying the
correct VNI for the destination host (VNI 1010) and the DMAC of the receiving host inserted
in the inner packet, as this MAC is known via a type-2 route in VNI 1010. This requires the
receiving VTEP, VTEP-4, to perform only a Layer 2 lookup for the locally connected hosts.
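To illustrate this requirement, a sketch of the per-VTEP configuration is shown below — note that every VLAN, anycast SVI, and VNI mapping must be repeated on every participating VTEP (the VLANs, subnets and virtual MAC are illustrative assumptions):

```
! Shared anycast gateway MAC, identical on all VTEPs
ip virtual-router mac-address 00:1c:73:00:00:01
!
vlan 10-11
!
interface Vlan10
   ip address virtual 10.10.10.1/24
!
interface Vlan11
   ip address virtual 10.10.11.1/24
!
interface Vxlan1
   ! All VLAN-to-VNI mappings must be present on every VTEP
   vxlan vlan 10 vni 1010
   vxlan vlan 11 vni 1011
```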
The major drawback with this approach is that the sending VTEP needs all the information
for every host in every VNI to be able to build this packet. This means all VNIs, and their
associated SVIs with anycast IPs, need to be configured on all participating VTEPs, which
raises the following concerns:
1. VNI scaling: Only a limited number of VNIs are supported on some hardware, so not
all VNIs can reside on all VTEPs. This is especially true in data center deployments,
where the TORs have traditionally been more resource-constrained than chassis-based
edge systems.
2. IP next-hop scaling: In this mode each host has a MAC+IP binding, meaning each /32
prefix has its own next-hop. This is inefficient; in symmetric mode this is avoided by
having the remote IP host routes all point to the router MAC they are located behind.
To address these concerns around VNI scaling and host MAC and MAC+IP state bloat on
VTEPs, symmetric mode was proposed to optimize this type-2 MAC/IP inter-subnet host
routing.
As shown in the figure above, a shared “routing” VNI is now used to forward inter-VNI
traffic, so not all MAC-VRFs and associated SVIs need to be configured on all VTEPs.
The model is now: route, VXLAN bridge, then route again.
When a host in the green subnet needs to communicate with a host in the orange subnet, it
sends traffic to its default gateway (the VTEP-1 router MAC in VLAN green). The ingress
VTEP then does a lookup in the routing VRF (VNI 2000), and swaps the SMAC to the local
router MAC and the DMAC to the destination router MAC (VTEP-4 in this case). This
forwarding is standard Layer 3 routing. The next step is to forward the traffic over VNI 2000;
this is VXLAN bridging, so the VNI is 2000 and the source and destination IPs are simply the
VTEP endpoints.
The receiving router now needs to remove the VXLAN tunnel header and perform a Layer 3
lookup on the received packet; in this case it resolves to subnet orange. Finally, the
receiving router performs a lookup for the DMAC of the host, with the SMAC set to the MAC
of VTEP-4.
This addresses the scaling issues seen in asymmetric mode, because all IP host routes now
have a next-hop of the remote router MAC, dramatically lowering the number of next-hops in
the system. Additionally, there are fewer MACs in the system overall, as MACs in VNIs not
local to a VTEP are neither known nor installed locally.
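A sketch of the symmetric model on one VTEP is shown below: only the locally attached VLAN is configured, with the shared routing VNI (2000, as in the example above) mapped to the tenant IP-VRF. The names and addresses are assumptions:

```
vrf instance TEN-A
!
ip routing vrf TEN-A
!
! Only the locally attached VLAN/SVI is needed on this VTEP
interface Vlan10
   vrf TEN-A
   ip address virtual 10.10.10.1/24
!
interface Vxlan1
   vxlan vlan 10 vni 1010
   ! Shared routing VNI used for inter-subnet traffic
   vxlan vrf TEN-A vni 2000
```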
It must be noted that even if the two MAC VRFs exist on the same VTEP, inter-subnet routing
will go via the shared routing VRF VNI.
5.5 Pure Layer 3 VPN Deployment Model (Type-5 only)
The common use cases for a Layer 3 EVPN service are detailed below. In Figure 5.12,
multiple Layer 2 PODs are interconnected at Layer 3 using an EVPN L3VPN. The IP
prefixes of each tenant are advertised in their respective EVPN L3VPN instance.
Figure 5.12 Layer 3 EVPN Inter-POD VRF
Figures 5.13 and 5.14 illustrate the DCI/WAN use case, where each tenant’s prefixes in each
DC are advertised in a separate EVPN L3VPN instance. Within each DC the sites can be
connected at Layer 2, such that the edge BGP EVPN speakers are gateways for the local
DC subnets and advertise these subnets to remote DCs, as shown in Figure 5.13.
Alternatively, the subnets can be learnt from local peering routers, such that the BGP
EVPN speakers advertise these learned local IP prefixes, as well as connected prefixes, to
remote DCs, as shown in Figure 5.14.
Figure 5.14 Layer 3 EVPN DCI – Layer 3 Handoff
6. Conclusion
As customers move resources to the cloud and/or expand their current cloud-based
resources, an architecture that is scalable, secure, and standards-based is a necessity.
Furthermore, data center architectures now demand a high degree of flexibility and the
rapid on-ramping of services anytime and anywhere.
EVPN provides this workload mobility, optimized forwarding and routing capabilities and the
ability to extend these services in DCI and WAN interconnects for both layer 2 and layer 3
VPN services, over multiple data plane encapsulations. This allows customers to
standardize on BGP EVPN as the unified service control plane, thus simplifying and de-
risking their deployments and operations.
EOS supports a full suite of BGP EVPN service types and deployment models to support
both these layer 2 and layer 3 VPN services over VXLAN for both DC and DCI/WAN
topologies.
7. Configuration Guides & Further Reading
7.1 General Collateral
BGP L3LS Fabric Design (used for general BGP best practices): L3LS Design Guide
7.2 Layer 2 EVPN VXLAN Configuration Guides
Layer 2 configuration guide – and general concepts:
https://eos.arista.com/eos-4-18-1f/evpn-vxlan/
Multi-Tenant EVPN VXLAN IRB Configuration & Verification Guide (iBGP Overlay eBGP
Underlay): https://eos.arista.com/multi-tenant-evpn-vxlan-irb-configuration-verification-guide-
ibgp-overlay-ebgp-underlay/
Multi-Tenant EVPN VXLAN IRB Configuration & Verification Guide (eBGP Overlay eBGP
Underlay): https://eos.arista.com/multi-tenant-evpn-vxlan-irb-configuration-verification-guide-
ebgp-overlay-underlay/
Enabling DHCP relay in Multi-tenant IRB EVPN VXLAN Overlays: https://eos.arista.com/eos-
4-20-5f/dhcprelay-anycast/
Inserting host-routes into underlay with Asymmetric IRB VXLAN in default VRF:
https://eos.arista.com/eos-4-20-1f/hostinject/