At the core of contemporary high-performance computer systems is the communication infrastructure. For this reason, there has been extensive work on providing low-latency, high-bandwidth communication subsystems for clusters. In this paper, we introduce MultiEdge, a connection-oriented communication system designed for high-speed commodity hardware. MultiEdge provides support for end-to-end flow control, ordering, and reliable transmission. It transparently supports multiple physical links within a single connection. We use MultiEdge to examine the behavior of edge-based protocols using both micro-benchmarks and real-life shared-memory applications. Our results show that MultiEdge is able to deliver about 88% of the nominal link throughput with a single 10-GBit/s link and more than 95% with multiple 1-GBit/s links. Our application results show that performing all of the communication protocol processing at the edge does not appear to degrade performance.
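To make the multi-link idea concrete, the following is a minimal sketch, assuming a connection that stripes fixed-size frames round-robin across several per-link sockets and tags each frame with a global sequence number so the receiver can restore ordering. The names (medge_conn, medge_send, frame_hdr) are hypothetical and do not reflect MultiEdge's actual interface or wire format.

```c
/* Illustrative sketch only: striping one logical connection over multiple
 * physical links, with a per-frame sequence number so the receiver can
 * restore global ordering.  All names here are hypothetical. */
#include <stdint.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <sys/socket.h>

#define MAX_LINKS     8
#define FRAME_PAYLOAD 4096

struct frame_hdr {
    uint64_t seq;   /* global sequence number across all links */
    uint32_t len;   /* payload bytes carried by this frame */
};

struct medge_conn {
    int      link_fd[MAX_LINKS];  /* one connected socket per physical link */
    int      nlinks;
    int      next_link;           /* round-robin cursor */
    uint64_t next_seq;            /* next sequence number to assign */
};

/* Send a message by striping FRAME_PAYLOAD-sized frames across the links.
 * Partial sends and retransmission are ignored for brevity. */
ssize_t medge_send(struct medge_conn *c, const char *buf, size_t len)
{
    size_t off = 0;
    while (off < len) {
        struct frame_hdr hdr;
        hdr.seq = c->next_seq++;
        hdr.len = (len - off > FRAME_PAYLOAD) ? FRAME_PAYLOAD
                                              : (uint32_t)(len - off);

        struct iovec iov[2] = {
            { &hdr, sizeof(hdr) },
            { (void *)(buf + off), hdr.len },
        };
        struct msghdr msg;
        memset(&msg, 0, sizeof(msg));
        msg.msg_iov = iov;
        msg.msg_iovlen = 2;

        if (sendmsg(c->link_fd[c->next_link], &msg, 0) < 0)
            return -1;

        c->next_link = (c->next_link + 1) % c->nlinks;
        off += hdr.len;
    }
    return (ssize_t)len;
}
```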
Computer systems keep increasing in size. Systems scale in the number of processing units, memories, and peripheral devices. This creates many and diverse architectural trade-offs that existing operating systems are not able to address. We are designing and implementing FenixOS, a new operating system that aims to improve the state of the art in scalability and reliability.
In this work we examine the implications of building a single logical link out of multiple physical links. We use MultiEdge to examine the throughput-CPU utilization trade-offs and to study how overheads and performance scale with the number and speed of links. We use low-level instrumentation to understand the associated overheads, we experiment with setups between one and eight 1-GBit/s links, and we contrast our results with a single 10-GBit/s link. We find that: (a) our base protocol achieves up to 65% of the nominal aggregate throughput; (b) replacing interrupts with polling significantly impacts only the multiple-link configurations, which reach 80% of the nominal throughput; (c) the impact of copying on CPU overhead is significant, and removing copying results in up to a 66% improvement in maximum throughput, reaching almost 100% of the nominal throughput; (d) scheduling packets over heterogeneous links requires simple but dynamic scheduling to account for different link speeds and varying load.
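Finding (d) suggests load- and speed-aware link selection rather than plain round-robin. The sketch below is one possible interpretation, assuming per-link byte accounting and choosing the link whose backlog drains soonest; the structures and pick_link() are hypothetical and not MultiEdge's implementation.

```c
/* Illustrative dynamic link scheduler: pick the link that minimizes the
 * estimated drain time queued_bytes / link_rate.  Names are hypothetical. */
#include <stdint.h>

struct link_state {
    uint64_t rate_bps;      /* nominal link speed, e.g. 1 or 10 GBit/s */
    uint64_t queued_bytes;  /* bytes handed to this link but not yet drained */
};

/* Return the index of the link with the smallest estimated drain time.
 * Compare q_i / r_i < q_j / r_j by cross-multiplying, avoiding
 * floating point: q_i * r_j < q_j * r_i. */
static int pick_link(const struct link_state *links, int nlinks)
{
    int best = 0;
    for (int i = 1; i < nlinks; i++) {
        if (links[i].queued_bytes * links[best].rate_bps <
            links[best].queued_bytes * links[i].rate_bps)
            best = i;
    }
    return best;
}

/* Caller accounting (also hypothetical): after queuing len bytes on link i,
 * add len to queued_bytes; when the transmit completes, subtract it.  This
 * makes the choice track both link speed and current load. */
```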
Multicore processors and systems are often constrained on power and hardware resources, so it matters how resources are spent. Dedicated hardware for floating-point (FP) operations requires valuable hardware resources and consumes power, and accelerators consume valuable chip area, which may lead to an overall reduction in the number of cores. Achieving acceleration of FP operations without spending valuable silicon area on big accelerators is therefore desirable.
Ethernet line rates are projected to reach 100 Gbits/s as soon as 2010. While in principle suitable for high-performance clustered and parallel applications, Ethernet requires matching improvements in the system software stack. In this paper we address several sources of CPU and memory-system overhead in the I/O path at line rates reaching 80 Gbits/s (bi-directional), using multiple 10-Gbit/s links per system node. The key contribution of our work is the design of a parallel high-performance communication protocol that uses context-independent page remapping to (a) reduce packet-processing overheads, (b) reduce thread-management and synchronization overheads, and (c) address affinity issues in NUMA multicore CPUs. Our design delivers the full 40 Gbits/s of available one-way Ethernet bandwidth and 57.6 Gbits/s (72%) of the 80 Gbits/s maximum bidirectional throughput (limited only by the memory system), while leaving ample CPU cycles for application processing.
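The paper's page remapping is a kernel-level, context-independent mechanism; purely as a user-space analogue, the Linux-specific sketch below uses mremap() to move a page-aligned "receive" buffer onto an application buffer, replacing a per-byte copy with page-table updates. The buffer names and sizes are illustrative assumptions, not the paper's implementation.

```c
/* User-space analogue of the page-remapping idea: instead of memcpy()ing a
 * page-aligned receive buffer into the application buffer, remap its pages
 * onto the destination address (Linux-only, requires _GNU_SOURCE). */
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t len  = 16 * page;              /* a 64 KiB "received" buffer */

    /* Stand-in for a page-aligned buffer filled by the NIC/driver. */
    char *rx = mmap(NULL, len, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    /* Destination region owned by the application. */
    char *dst = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (rx == MAP_FAILED || dst == MAP_FAILED)
        return 1;

    memset(rx, 0xab, len);                /* pretend data arrived */

    /* Move the received pages onto the destination address: no data copy,
     * only page-table updates (the old mapping at rx is gone afterwards). */
    if (mremap(rx, len, len, MREMAP_MAYMOVE | MREMAP_FIXED, dst) == MAP_FAILED)
        return 1;

    printf("first byte after remap: 0x%02x\n", (unsigned char)dst[0]);
    munmap(dst, len);
    return 0;
}
```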