Wednesday, February 8, 2012

Multi-path TCP

Multipath TCP by Costin Raiciu, Sebastien Barre, Christopher Pluntke,
Adam Greenhalgh, Damon Wischik, Mark Handley

Multipath TCP is a modification to TCP that takes advantage of multiple routes between two hosts on a network. The idea is to stripe a single connection's data across multiple subflows, which may take different routes through a network that employs some sort of ECMP mechanism. Each subflow is treated like its own TCP connection, with its own sequence and acknowledgment numbers, and the bytes are reassembled on the receiving end to give the application listening on that socket the illusion that regular old TCP has been used.
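To make the two sequence spaces concrete, here is a toy sketch of my own (not the paper's wire format): data is striped across subflows, each subflow keeps its own sequence numbers, and the receiver reassembles by a separate data-level sequence number.

    # Toy illustration of striping a byte stream across subflows and
    # reassembling it in order at the receiver. Chunk sizes, round-robin
    # scheduling, and the tuple layout are all assumptions for the example.

    def stripe(data: bytes, n_subflows: int, chunk: int = 4):
        """Assign consecutive chunks of the stream to subflows round-robin.
        Each segment carries (subflow_seq, data_seq, payload)."""
        subflows = [[] for _ in range(n_subflows)]
        for i in range(0, len(data), chunk):
            data_seq = i                      # data-level sequence number
            payload = data[i:i + chunk]
            sf = (i // chunk) % n_subflows    # pick a subflow (round-robin here)
            subflow_seq = len(subflows[sf])   # per-subflow sequence number
            subflows[sf].append((subflow_seq, data_seq, payload))
        return subflows

    def reassemble(subflows):
        """Receiver merges segments from all subflows by data-level sequence."""
        segments = [seg for sf in subflows for seg in sf]
        segments.sort(key=lambda s: s[1])     # order by data_seq, not subflow_seq
        return b"".join(payload for _, _, payload in segments)

    if __name__ == "__main__":
        msg = b"the quick brown fox jumps over the lazy dog"
        assert reassemble(stripe(msg, n_subflows=3)) == msg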

Fairness issues

This idea isn't new; in fact, you can get much of the same effect without a new protocol by just opening multiple TCP connections in an application. However, that approach causes fairness issues, because the bundle of connections grabs several times the bandwidth of a single TCP flow on the routes it shares with other traffic, regardless of how congested those routes are.

The authors propose a very interesting and simple way to ensure fairness while using MPTCP. The congestion window increase on each subflow is a function of both that subflow's current window and the sum of the windows across all of the connection's subflows. With this coupling, MPTCP backs off on routes that are more congested and shifts traffic to less congested paths.
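A rough sketch of this coupled ("linked increases") rule, simplified from the MPTCP congestion-control work and written in packets rather than bytes; the class layout and helper names are mine:

    # Each subflow's additive increase is scaled by the total window across
    # subflows, so subflows on congested paths (smaller windows, more losses)
    # end up carrying less traffic. Losses still halve only the affected subflow.

    class Subflow:
        def __init__(self, cwnd=1.0, rtt=1.0):
            self.cwnd = cwnd  # congestion window, in packets
            self.rtt = rtt    # smoothed round-trip time, arbitrary units

    def coupled_increase(subflows, i):
        """Window increase for subflow i on receiving one ACK."""
        total = sum(s.cwnd for s in subflows)
        # 'a' couples the subflows so the aggregate is no more aggressive
        # than a single TCP flow on the best available path.
        best = max(s.cwnd / (s.rtt ** 2) for s in subflows)
        denom = sum(s.cwnd / s.rtt for s in subflows) ** 2
        a = total * best / denom
        # The 1/cwnd_i cap keeps any one subflow from being more aggressive
        # than regular TCP would be on its own path.
        subflows[i].cwnd += min(a / total, 1.0 / subflows[i].cwnd)

    def on_loss(subflows, i):
        """Loss on subflow i halves only that subflow's window, as in TCP."""
        subflows[i].cwnd = max(subflows[i].cwnd / 2.0, 1.0)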

Discussion

MPTCP seems like a great fit for datacenters. It is much less heavyweight than the Hedera solution: it requires no centralized flow scheduler and no OpenFlow, and it can be deployed on currently available hardware in existing datacenters. The authors even show that it works by deploying it in one of Amazon EC2's datacenters. The only cost of deployment seems to be installing the modified kernel at each endpoint of the connection so that the nodes use MPTCP in place of regular TCP.

As Ashik pointed out, it would be quite interesting to see how the combination of MPTCP and DCTCP would perform in a datacenter.

Monday, February 6, 2012

Hedera

by Mohammad Al-Fares, Sivasankar Radhakrishnan, Barath Raghavan, Nelson Huang and Amin Vahdat.

What is Hedera?

This paper proposes Hedera, a dynamic network flow scheduling system that improves upon existing IP multipathing schemes such as ECMP, which use static hashing to assign flows to routes.

What problem does Hedera aim to solve?

In a multi-rooted datacenter topology such as a fat-tree, the interconnect is constructed to allow full bisection bandwidth. However, to fully take advantage of this infrastructure, there needs to be a routing mechanism that spreads the network flows among the multiple paths between hosts to minimize congestion.

Before Hedera, the state of the art was equal cost multipath (ECMP), which hashes fields of the packet header and uses the hash value to statically assign the flow to a route. The problem with ECMP is that large flows can hash to the same value and thus be assigned the same route, causing long-lasting congestion on that route while bandwidth sits unused along another route.
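A toy illustration of why static hashing can hurt; the 5-tuples and the four-path topology are made up for the example:

    # Flows are pinned to a path by hashing header fields, so two large flows
    # can collide on the same path while other paths sit idle.

    import hashlib

    N_PATHS = 4

    def ecmp_path(flow):
        """Statically map a flow (5-tuple) to one of N_PATHS equal-cost paths."""
        key = "|".join(str(f) for f in flow).encode()
        return int(hashlib.md5(key).hexdigest(), 16) % N_PATHS

    flows = [
        ("10.0.0.1", "10.0.1.1", 6, 34567, 5001),   # large flow A
        ("10.0.0.2", "10.0.1.2", 6, 40112, 5001),   # large flow B
        ("10.0.0.3", "10.0.1.3", 6, 51909, 5001),   # large flow C
    ]

    for f in flows:
        print(f, "-> path", ecmp_path(f))
    # Whenever two of these large flows land on the same path number, that
    # link stays congested for the lifetime of both flows, no matter how much
    # capacity the other paths have.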

How does it work?

The basic idea is a three-step approach: detect large flows, compute a good path for each of them, and install that path on the switches (using OpenFlow). Small flows keep using the standard ECMP approach, because the overhead of computing paths is too large to pay off for flows that do not last very long.
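A skeleton of that control loop as I understand it; the function names and the exact threshold are my own (the paper treats flows exceeding a fraction of the host NIC bandwidth as "elephants"):

    # Small flows never reach the scheduler and keep using the switches'
    # default ECMP forwarding; only large flows get explicit paths installed.

    ELEPHANT_THRESHOLD_BPS = 0.1 * 1e9   # e.g. 10% of a 1 Gbps NIC (assumed)

    def scheduler_tick(flow_stats, schedule_fn, install_fn):
        """One pass of the central scheduler.

        flow_stats : dict mapping flow -> estimated demand in bits per second
        schedule_fn: computes a path per large flow (global first fit or
                     simulated annealing)
        install_fn : pushes the chosen path into the switches (via OpenFlow)
        """
        large_flows = {f: d for f, d in flow_stats.items()
                       if d >= ELEPHANT_THRESHOLD_BPS}
        for flow, path in schedule_fn(large_flows).items():
            install_fn(flow, path)
        # Everything below the threshold is left to static ECMP hashing.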

The authors examine two scheduling algorithms for computing the routes: global first fit and simulated annealing. Global first fit is simple: upon detecting a large flow, the scheduler linearly searches the possible paths for one that can accommodate the flow, and installs that route into the switches along the path for the duration of the flow. Simulated annealing is more involved: on each scheduler tick (about every 5 seconds) it probabilistically searches the space of possible routes to improve on the previously computed flow assignment. To limit the search space, the authors propose assigning a specific core switch to each host.
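A minimal sketch of global first fit under my own simplifying assumptions: paths are lists of links, each link has a residual capacity, and a flow's demand is a single number. The real system works over the fat-tree topology and reserves capacity on every hop of the chosen path.

    def global_first_fit(demand, candidate_paths, residual):
        """Return the first candidate path that can fit `demand`, reserving
        capacity on its links; return None if no path fits."""
        for path in candidate_paths:
            if all(residual[link] >= demand for link in path):
                for link in path:
                    residual[link] -= demand   # reserve for the flow's lifetime
                return path
        return None

    # Example: two candidate paths through different core switches.
    residual = {"agg0-core0": 1.0, "core0-agg2": 0.2,
                "agg0-core1": 1.0, "core1-agg2": 1.0}
    paths = [["agg0-core0", "core0-agg2"], ["agg0-core1", "core1-agg2"]]
    print(global_first_fit(0.5, paths, residual))  # picks the second path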

In addition to dynamic routing, Hedera uses PortLand mechanisms to provide fault-tolerant routes: if routing hardware along a computed path fails, the affected flows are re-routed.

Results

The authors test Hedera with both scheduling algorithms in a simulator, against a hypothetical non-blocking switch and against ECMP. They show that Hedera outperforms ECMP in all cases (by up to 113%), and that the simulated annealing scheduler provides higher bisection bandwidth than the global first fit scheduler, achieving 96% of the optimal bisection bandwidth in simulation.

Discussion

While this is an interesting paper, I think that in a real datacenter this approach will not deliver the benefits shown in the simulations. The scheduler can only run so often, and the 5-second intervals used in this paper are simply too long for the kinds of traffic observed in real datacenters. In fact, most flows last less than 100 milliseconds, and even large flows last less than one second. Given the small number of large flows, it's unlikely that re-routing them will have much of an impact.

Another problem with Hedera is that it only considers network-limited flows. Consider a large transfer of data from disk across the network: such a transfer is likely to be host-limited, yet still long-running. These long-running flows are ignored by the scheduler, and other network-limited flows could be scheduled onto the same route. A similarly bad situation can occur when a host-limited flow barely exceeds the threshold, gets picked up by the scheduler, and then does not use all of the bandwidth reserved for it.

Lastly, achieving acceptable scheduling latencies would require specialized hardware, which is expensive.

Data Center TCP


Data Center TCP (DCTCP) by M. Alizadeh, A. Greenberg, D. Maltz, J. Padhye, P. Patel, B. Prabhakar, S. Sengupta, and M. Sridharan.

This paper aims to achieve high burst tolerance, low latency, and high throughput with commodity switches by modifying the Transmission Control Protocol (TCP) and optimizing it for data center traffic. They name the modified protocol Data Center TCP (DCTCP).

Challenges

This protocol targets three main performance issues that are common in a production cluster: the incast problem, queue buildup, and buffer pressure.

1. Incast problem: If a large number of flows converge on the same switch interface over a short period of time, the packets may exhaust either the switch memory or the maximum buffer allotted to that interface, resulting in packet losses. This can occur even when each individual flow is small.

2. Queue buildup: When long and short flows traverse the same queue, the packets of the short flows see increased latency, even when no packets are lost, because of the queue buildup caused by the long flows' packets.

3. Buffer pressure: When a few ports of a switch are overloaded with traffic, there is a direct impact on the traffic traversing the other ports of the same switch. The reason is that the ports share a common memory pool; the overloaded ports occupy most of it, causing packet loss on the other ports.

DCTCP Algorithm

In order to tackle these issues, DCTCP uses a simple marking scheme at the switches that sets the Congestion Experienced (CE) codepoint on packets as soon as buffer occupancy exceeds a fixed, small threshold K. The receiver echoes this flag back in its ACKs to notify the sender about the congestion. The sender then reduces its window size in proportion to the fraction of ACKs that arrive with the flag set. This is different from regular TCP, which cuts its window size in half whenever it detects congestion. By reacting in proportion to the extent of the congestion, DCTCP tries to get maximum throughput without incurring packet loss.
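A compact sketch of the sender-side reaction described above, in my own simplified form (the real implementation works per ACK and in bytes; the class and method names here are made up):

    # The sender keeps a running estimate `alpha` of the fraction of marked
    # packets and cuts its window in proportion to it, instead of always halving.

    class DctcpSender:
        def __init__(self, cwnd=10.0, g=1 / 16):
            self.cwnd = cwnd     # congestion window (packets)
            self.alpha = 0.0     # estimated fraction of marked packets
            self.g = g           # EWMA weight (the paper suggests g = 1/16)

        def on_window_of_acks(self, acked, marked):
            """Called once per window of data: `marked` of `acked` ACKs had
            the ECN-Echo flag set."""
            frac = marked / acked if acked else 0.0
            self.alpha = (1 - self.g) * self.alpha + self.g * frac
            if marked:
                # Proportional cut: mild congestion -> small cut; heavy
                # congestion (alpha near 1) -> halve, like regular TCP.
                self.cwnd = max(self.cwnd * (1 - self.alpha / 2), 1.0)
            else:
                self.cwnd += 1   # normal additive increase per RTT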

Results

By using DCTCP, the impact of the incast problem is reduced. However, when a large number of synchronized small flows hit the same queue, even a single packet from each flow can cause incast, so there isn't much any end-to-end scheme can do about that case.

Since DCTCP senders start reacting as soon as the queue length on an interface exceeds the threshold K, the queue buildup issue is resolved.

DCTCP also addresses the buffer pressure problem, because a congested port's queue never grows excessively large.

The results show that if DCTCP is used in a data center, it could handle 10 times larger query responses and 10 times larger background flows while performing better than the data center does with TCP today.

Reviewer's comments

This paper is very interesting and feels very promising, as it performs really well in test results based on real-world workloads. Implementing DCTCP in a data center requires about 30 lines of code change, which is quite amazing considering the performance gain it could achieve. The authors mention in the paper that the protocol might not be suitable for the WAN, and it would be interesting to see how it performs in the presence of conventional TCP traffic. Overall the authors have done a very good job.