By Gur Shatz, Cato Networks
Anyone with hands-on experience setting up long-haul VPNs over the Internet knows it’s not a pleasant exercise. Even factoring out the complexity of appliances and the need to work with old relics like IPsec, managing latency, packet loss and high availability remains a huge problem. Service providers know this too, and make billions on MPLS.
The bad news is, it doesn’t matter that available capacity has gone up. The problem is twofold: the way providers are interconnected and mismanagement of global routes. The same architecture that allowed the Internet to cost-effectively scale to billions of devices also set its limits.
Addressing these challenges requires a deep restructuring of the Internet’s fabric and its core routing, and that restructuring should form the foundation for any possible solution. No shiny new router is going to magically solve it all.
IP Routing’s Historical Baggage: Simplistic Data Plane
When the IP packets that carry Internet traffic are lost along their routes, it is the responsibility of higher-level protocols such as TCP to recover them. Packets hop from router to router, only aware of their next hop and their ultimate destination.
In other words, routers make all the decisions, according to their routing tables. When a router receives a packet, it performs a calculation according to its routing table to identify the best next hop for the packet.
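The lookup described above can be sketched in a few lines. This is a minimal illustration, assuming the usual longest-prefix-match rule; the table contents, addresses and the `next_hop` helper are hypothetical, and the linear scan stands in for the specialized lookup structures a real router would use.

```python
import ipaddress

# Hypothetical routing table: (prefix, next hop). All addresses are illustrative.
ROUTES = [
    ("0.0.0.0/0", "192.0.2.1"),    # default route
    ("10.0.0.0/8", "192.0.2.2"),
    ("10.1.0.0/16", "192.0.2.3"),
]

def next_hop(destination, routes=ROUTES):
    """Return the next hop of the most specific prefix containing destination."""
    addr = ipaddress.ip_address(destination)
    best = None  # (network, hop) of the longest matching prefix so far
    for prefix, hop in routes:
        network = ipaddress.ip_network(prefix)
        if addr in network and (best is None or network.prefixlen > best[0].prefixlen):
            best = (network, hop)
    return best[1] if best else None
```

Note that nothing in the table says how healthy each next hop is: the decision depends only on the destination address.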
From the early days of the Internet, routers were shaped by technical constraints. There has always been a shortage of processing power available to move packets along their path, otherwise known as the data plane. Routers had to rely on custom hardware that performed minimal processing per packet and had no state management. Communication with this restricted data plane was simple and infrequent.
Routing decisions were then moved out to a separate process, the control plane, which pushed its decision about the next router back into the data plane.
This separation of control and data planes allowed architects to build massively scalable routers, handling millions of packets per second. However, even as processing power increased on the data plane, it wasn’t really used. The control plane makes all the decisions, the data plane executes the routing table, and they rarely communicate.
A modern router therefore has no idea how long it actually took a packet to reach its next hop, or whether it reached it at all. The router doesn’t know if the way is congested, and even if it had information to share, it cannot communicate back to the control plane, where routing decisions are actually made.
Enter BGP – A Routing Decisions Protocol
Border Gateway Protocol (BGP) is the routing protocol that glues the Internet together. In very simple terms, its task is to communicate the knowledge of where an IP address (or a whole IP subnet) originates. BGP routers connect with their peers, exchange information about which IP subnets they originate, and also “gossip” about IP subnets they learned about from other peers. As these rumors propagate between peers and across the globe, each one accumulates the path it has traveled from the originator.
BGP has a straightforward way of assessing distance: it uses the AS-Path, so a route learned first-hand is assumed to be the closest. Rumored routes are considered farther away as the hearsay “distance” increases. The general assumption is that the router that reported the closest rumor is also the best choice to send packets to. BGP doesn’t know whether a specific path has 0% or 20% packet loss. Using the AS-Path to select the lowest-latency path is also quite limited: it’s like calculating the shortest path between two points on a map by counting traffic lights, instead of miles, along the way.
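To make the traffic-light analogy concrete, here is a small sketch that compares a choice based on AS-Path length with one based on measured latency. The `Route` class, the AS numbers (taken from the range reserved for documentation) and the latency and loss figures are all invented for illustration; real BGP has no access to the last two fields.

```python
from dataclasses import dataclass

@dataclass
class Route:
    as_path: list       # AS numbers, nearest first
    latency_ms: float   # hypothetical measurement, invisible to BGP
    loss_pct: float     # hypothetical measurement, invisible to BGP

def pick_by_as_path(routes):
    """Choose the way this article describes BGP: fewest AS hops wins."""
    return min(routes, key=lambda r: len(r.as_path))

def pick_by_latency(routes):
    """Choose by measured latency instead (not something BGP can do)."""
    return min(routes, key=lambda r: r.latency_ms)

short_but_slow = Route(as_path=[64500, 64510], latency_ms=180.0, loss_pct=2.0)
long_but_fast = Route(as_path=[64501, 64502, 64510], latency_ms=95.0, loss_pct=0.1)
candidates = [short_but_slow, long_but_fast]
```

With these made-up numbers, `pick_by_as_path(candidates)` returns the two-hop route even though the three-hop route has roughly half the latency and far less loss.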
BGP Abused: The Commercial Internet
To work around the limits of BGP’s algorithms, the protocol allows network engineers to tweak problematic routes by hand and to alleviate congestion. A host of manual controls can manipulate “next best hop” decisions: weight, local preference (prioritizing routes from specific peers), communities (which let peers add custom attributes that may then affect the decisions of other peers along the path), and AS-Path prepending (which manipulates the propagated AS-Path).
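A rough sketch of how these knobs interact follows. It assumes a simplified tie-breaking order (highest weight, then highest local preference, then shortest AS-Path); actual decision processes have more steps and vary by vendor. The `best_path` and `prepend` helpers, the dictionary layout and the AS numbers are hypothetical.

```python
def prepend(as_path, own_asn, times):
    """Return a copy of as_path with own_asn repeated in front, making it look longer."""
    return [own_asn] * times + list(as_path)

def best_path(candidates):
    """Simplified selection: highest weight, then highest local_pref, then shortest AS-Path."""
    return min(
        candidates,
        key=lambda c: (-c["weight"], -c["local_pref"], len(c["as_path"])),
    )

# A longer path wins because the operator raised its local preference.
customer = {"name": "customer", "weight": 0, "local_pref": 200,
            "as_path": [64500, 64501, 64502]}
peer = {"name": "peer", "weight": 0, "local_pref": 100, "as_path": [64510]}

# With equal preferences, prepending pushes traffic away from a route.
plain = {"name": "plain", "weight": 0, "local_pref": 100,
         "as_path": [64500, 64520]}
padded = {"name": "padded", "weight": 0, "local_pref": 100,
          "as_path": prepend([64501, 64520], 64501, 2)}
```

In this sketch `best_path([customer, peer])` picks the customer route despite its longer AS-Path, and `best_path([plain, padded])` picks the route that was not prepended. None of the inputs reflects measured path quality.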
The relationship between BGP peers on the Internet is a reflection of commercial contracts of ISPs. Customers pay for Internet traffic. Smaller service providers pay larger providers, and most pay tier-1 providers. Any non-commercial relationship has to be mutually beneficial, or very limited.
BGP gives service providers the tools to implement these financial agreements:
- Service providers usually prefer routing traffic for “paying” connections.
- Service providers want to quickly get rid of “unpaid” packets, rather than carrying them across their backbone (so called “hot potato” routing).
- Sometimes, service providers will carry the packets over long distances just to get the most financially beneficial path.
All this comes at the expense of best path selection.
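The “hot potato” behavior can be illustrated with a toy comparison. The exit names and millisecond figures below are invented, and the two functions are only a sketch of the trade-off: one looks solely at the provider’s own cost to reach an exit, the other at the whole path.

```python
# Hypothetical exit points from a provider's network toward the same destination.
# internal_ms: time spent on the provider's own backbone to reach the exit.
# external_ms: time the packet still needs after it has been handed off.
EXITS = {
    "nearby-exit": {"internal_ms": 5, "external_ms": 140},
    "far-exit": {"internal_ms": 70, "external_ms": 10},
}

def hot_potato(exits):
    """Hand the packet off as soon as possible: minimize the provider's own cost."""
    return min(exits, key=lambda name: exits[name]["internal_ms"])

def end_to_end(exits):
    """Minimize the total delay the end user actually experiences."""
    return min(
        exits,
        key=lambda name: exits[name]["internal_ms"] + exits[name]["external_ms"],
    )
```

With these numbers the hot-potato choice is `nearby-exit` (145 ms in total), while the end-to-end choice is `far-exit` (80 ms in total).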
The MPLS Racket
To address these problems, service providers came up with an alternative offering: private networks, built on their own backbones, using Multiprotocol Label Switching (MPLS) to forward traffic.
MPLS is in many ways the opposite of BGP. Instead of an open architecture, MPLS uses policy-based, end-to-end routing. A packet’s path through the network is predetermined, which makes it suitable only for private networks. This is why MPLS is sold by a single provider, even if the provider patched together multiple networks behind the scenes to reach customer premises.
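The idea of a predetermined path can be sketched as a chain of label tables. This is a deliberately simplified model: the router names, label values and the `trace_lsp` helper are hypothetical, and details of real deployments are left out.

```python
# Hypothetical per-router label tables: incoming label -> (outgoing label, next router).
# (None, None) means this router removes the label and delivers the packet.
LABEL_TABLES = {
    "PE1": {100: (200, "P1")},
    "P1": {200: (300, "PE2")},
    "PE2": {300: (None, None)},
}

def trace_lsp(ingress, label):
    """Follow the pre-installed label entries from the ingress router to the egress."""
    path = [ingress]
    router = ingress
    while True:
        out_label, next_router = LABEL_TABLES[router][label]
        if next_router is None:
            return path
        path.append(next_router)
        router, label = next_router, out_label
```

Here `trace_lsp("PE1", 100)` always yields the same sequence of routers, because the path was fixed when the tables were installed rather than chosen per packet.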
But MPLS is ultimately a control plane protocol. It has many of the same limitations as BGP: Routing is decided by policy, not real traffic conditions such as latency or packet loss. Providers are careful about bandwidth management to maintain their SLAs.
The combination of single-vendor lock-in and the need for planning and overprovisioning to maintain SLAs makes these private networks a premium, expensive product. As the rest of the Internet, with its open architecture, has become increasingly competitive and cost-efficient, MPLS now faces pressure. As a backbone implementation, it is not likely to ever become affordable.
A Way Forward
The Internet just works. Not flawlessly, not optimally, but packets generally reach their destination. The basic structure of the Internet has not changed much over the past few decades, and has proven itself probably beyond the wildest expectations of its designers.
However, it is time to expose its key limitations:
- The data plane is clueless. Routers, which form the data plane, are built for traffic load; they are therefore stateless and have no notion of individual packets or traffic flows.
- Control plane intelligence is limited. Because the control plane and the data plane are not communicating, the routing decisions are not aware of packet loss, latency, congestion, or actual best routes.
- Shortest path selection is abused. Service providers’ commercial relationships often work against the end user’s interest in best path selection.
The limited exchange between the control and data planes has been taken to the extreme in OpenFlow and Software-defined Networking (SDN): the separation of the control plane and data plane into two different machines. This might be a good solution for cutting costs in the data center, but to improve global routing, it makes more sense to substantially increase information sharing between the control plane and the data plane.
To solve the limitations of the Internet as a global traffic transport mechanism, and open up more possibilities for cost-efficiency, it’s time to converge the data and control planes to work more closely together. When you do that, not only can the network dynamically select the best path for better latency control, but something equally, if not more, important happens: Network engineers will be able to integrate security directly into the data plane, a process that has eluded the industry for years. This in turn addresses a more systemic problem: the perfect storm of dissolving network perimeters, an explosion of enterprise security point solutions and the lack of people to run it all. Infrastructure must adapt to today’s reality of threat velocity and a distributed, cloud-centric enterprise.
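As a rough illustration of what feeding data-plane observations back into path selection could look like, the sketch below keeps a smoothed latency and loss estimate per path and picks the path with the best combined score. Everything in it is an assumption made for illustration: the `PathMonitor` class, the smoothing factor, the loss penalty and the path names are hypothetical and do not describe any particular product or protocol.

```python
class PathMonitor:
    """Keeps smoothed latency and loss per path, fed by data-plane measurements."""

    def __init__(self, alpha=0.5):
        self.alpha = alpha   # weight given to the newest sample
        self.latency = {}    # path name -> smoothed latency in ms
        self.loss = {}       # path name -> smoothed loss ratio (0..1)

    def record(self, path, latency_ms, lost):
        """Fold one probe result into the running estimates for a path."""
        sample_loss = 1.0 if lost else 0.0
        if path not in self.loss:
            self.loss[path] = sample_loss
        else:
            self.loss[path] += self.alpha * (sample_loss - self.loss[path])
        if not lost:  # a lost probe carries no latency information
            if path not in self.latency:
                self.latency[path] = latency_ms
            else:
                self.latency[path] += self.alpha * (latency_ms - self.latency[path])

    def score(self, path, loss_penalty_ms=1000.0):
        """Lower is better: smoothed latency plus a penalty proportional to loss."""
        return self.latency.get(path, float("inf")) + loss_penalty_ms * self.loss[path]

    def best(self):
        """Return the measured path with the lowest score."""
        return min(self.loss, key=self.score)

monitor = PathMonitor()
for _ in range(3):
    monitor.record("transit-a", 180.0, lost=False)
    monitor.record("transit-b", 90.0, lost=False)
first_choice = monitor.best()      # transit-b: lower latency, no loss yet
monitor.record("transit-b", 0.0, lost=True)
second_choice = monitor.best()     # transit-a: transit-b is now penalized for loss
```

The point of the sketch is the feedback loop itself: the choice changes as soon as measurements change, which the policy-driven selection described earlier cannot do.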
Gur Shatz is co-founder and CTO of Cato Networks. Prior to Cato Networks, he was the co-founder and CEO of Incapsula Inc., a cloud-based web application security and acceleration company. Before Incapsula, Gur was Director of Product Development, Vice President of Engineering and Vice President of Products at Imperva, a web application security and data security company. Gur holds a BSc in Computer Science from Tel Aviv College.