Monday, February 6, 2012

Hedera

by Mohammad Al-Fares, Sivasankar Radhakrishnan, Barath Raghavan, Nelson Huang and Amin Vahdat.

What is Hedera?

This paper is proposes Hedera, a dynamic network flow scheduling system that improves upon existing IP multi-pathing protocols such as ECMP which use static hashing to assign routes to flows.

What problem does Hedera aim to solve?

In a multi-rooted datacenter setup such as a Fat-Tree, the network interconnect is constructed to allow for full bisection bandwidth. However, to fully take advantage of this infrastructure, there needs to be a routing protocol which can spread the net work flows among the multiple paths between the hosts to minimize congestion.

Before Hedera, the state of the art was to use equal cost multipath (ECMP), which hashes a field from the TCP header and uses the hash value to statically assign the flow to a route. The problem with ECMP is that sometimes large flows can hash to the same value and thus be assigned the same route, causing long lasting congestion over that route while there is unused bandwidth along another route.

How does it work?

The basic idea is a three step approach: detect large flows, compute a good path for those flows, and then install that path on the switches (using OpenFlow). For small flows, the standard ECMP approach is used because the overhead of computing the paths is too large to benefit when flow is not very long.

The authors examine two different scheduling algorithms for determining the routes: global first fit, and simulated annealing. Global first fit is simple: upon detecting a large flow, the scheduler linearly searches possible paths to find one which can accomodate the flow, and then installs that route into the switches along that path for the duration of the flow. Simulated annealing is more complicated: for each scheduler tick (about every 5 seconds) it probabilistically searches the space of possible routes to improve upon the previously calculated flow matrix. To limit the search space, the authors propose assigning a specific core switch to each host.

In addition to the dynamic routing, Hedera uses PortLand mechanisms to provide fault tolerant routes: if routing hardware along a computed path fails, it will be re-routed.

Results

The authors test Hedera using both scheduling algorithms against a hypothetical non-blocking switch and the ECMP approach in a simulator. They show that in all cases, Hedera outperforms ECMP (up to 113% improvement), and that the simulated annealing scheduler provides higher bisection bandwidth than does the global first fit scheduler. The simulated performance of Hedera with simulated annealing shows that 96% of optimal bisection bandwidth can be achieved.

Discussion

While this is an interesting paper, I think that in a real datacenter, this approach will not have the benefits shown by the simulations. The scheduler can only run so often, and the 5 second intervals that are used in this paper are simply too large for the types of traffic observed in real datacenters. In fact, most flows are less than 100 milliseconds, and even large flows are less than one second. Given the small amount of large flows, it's unlikely that re-routing such flows will have much of an impact.

Another problem with Hedera is that it only focuses on network-limited flows. Consider a large transfer of data from disk across the network. Such a transfer is likely to be host limited, yet still long-running. These long running flows will be ignore by the scheduler, and other network limited flows could be scheduled on the same route. A similarly bad situation can occur if there is a host-limited flow that barely exceeds the threshold and is picked up by the scheduler, but then does not use all of the bandwidth.

Lastly, achieving acceptable schedule latencies would require specialized hardware which is expensive.

No comments:

Post a Comment