A (In)Cast of Thousands: Scaling Datacenter TCP to Kiloservers and Gigabits (CMU-PDL-09-101)

2009 
This paper presents a practical solution to the problem of high-fan-in, high-bandwidth synchronized TCP workloads in datacenter Ethernets: the incast problem. In these networks, receivers often experience a drastic reduction in throughput when simultaneously requesting data from many servers using TCP. Inbound data overfills small switch buffers, leading to TCP timeouts lasting hundreds of milliseconds. For many datacenter workloads that have a synchronization requirement (e.g., filesystem reads and parallel data-intensive queries), incast can reduce throughput by up to 90%. Our solution for incast uses high-resolution timers in TCP to allow for microsecond-granularity timeouts. We show that this technique is effective in avoiding incast using simulation and real-world experiments. Last, we show that eliminating the minimum retransmission timeout bound is safe for all environments, including the wide-area.

Acknowledgements: We would like to thank Brian Mueller at Panasas Inc. for helping us to conduct experiments on their systems. We also would like to thank our partners in the Petascale Data Storage Institute, including Andrew Shewmaker, HB Chen, Parks Fields, Gary Grider, Ben McClelland, and James Nunez at Los Alamos National Lab for help with obtaining packet header traces. We thank the members and companies of the PDL Consortium (including APC, Cisco, Data Domain, EMC, Facebook, Google, Hewlett-Packard, Hitachi, IBM, Intel, LSI, NetApp, Oracle, Seagate, Sun Microsystems, Symantec, and VMware) for their interest, insights, feedback, and support. Finally, we would like to thank Michael Stroucken for his help managing the PDL cluster. This material is based on research sponsored in part by the National Science Foundation, via grants #CNS-0546551, #CNS-0326453 and #CCF-0621499, by the Army Research Office under agreement number DAAD19-02-1-0389, by the Department of Energy under Award Number #DE-FC02-06ER25767, and by DARPA under grant #HR00110710025.
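To illustrate why the minimum retransmission timeout (RTO) bound matters for the timeouts described in the abstract, the following is a minimal sketch of a standard RTO estimator in the style of RFC 6298. It is not the paper's implementation. The class name `RtoEstimator`, the 200 ms default bound, the 1 s initial RTO, and the 100 µs round-trip time are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RtoEstimator:
    """RFC 6298-style RTO estimator (hypothetical helper; times in seconds)."""

    min_rto: float = 0.2       # assumed lower bound of 200 ms
    initial_rto: float = 1.0   # assumed RTO before any RTT sample arrives
    alpha: float = 1 / 8       # smoothing gain for SRTT
    beta: float = 1 / 4        # smoothing gain for RTTVAR
    srtt: Optional[float] = None
    rttvar: float = 0.0

    def update(self, rtt_sample: float) -> float:
        """Fold one RTT measurement into the estimate and return the new RTO."""
        if self.srtt is None:
            # First measurement: SRTT = R, RTTVAR = R / 2.
            self.srtt = rtt_sample
            self.rttvar = rtt_sample / 2
        else:
            # RTTVAR is updated with the old SRTT, then SRTT is updated.
            self.rttvar = ((1 - self.beta) * self.rttvar
                           + self.beta * abs(self.srtt - rtt_sample))
            self.srtt = (1 - self.alpha) * self.srtt + self.alpha * rtt_sample
        return self.rto()

    def rto(self) -> float:
        """Current timeout: SRTT + 4 * RTTVAR, clamped below by min_rto."""
        if self.srtt is None:
            return max(self.min_rto, self.initial_rto)
        return max(self.min_rto, self.srtt + 4 * self.rttvar)


# Assumed datacenter round-trip time of 100 microseconds.
bounded = RtoEstimator(min_rto=0.2)
unbounded = RtoEstimator(min_rto=0.0)
bounded_rto = bounded.update(100e-6)
unbounded_rto = unbounded.update(100e-6)
print(f"bounded RTO:   {bounded_rto * 1e3:.3f} ms")
print(f"unbounded RTO: {unbounded_rto * 1e6:.1f} us")
```

Under these assumptions, the bounded estimator waits for the full minimum bound even though the measured round-trip time is orders of magnitude smaller, while the unbounded one produces a timeout on the scale of the round-trip time itself. Acting on such short timeouts also requires timers with microsecond resolution, which this sketch does not model.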