Abstract
The performance and behavior of large-scale distributed applications is highly influenced by network properties such as latency, bandwidth, packet loss, and jitter. For instance, an engineer might need to answer questions such as: What is the impact of an increase in network latency in application response time? How does moving a cluster between geographical regions affect application throughput? What is the impact of network dynamics on application stability? Currently, answering these questions in a systematic and reproducible way is very hard due to the variability and lack of control over the underlying network. Unfortunately, state-of-the-art network emulation or testbed environments do not scale beyond a single machine or small cluster (i.e., MiniNet), are focused exclusively on the control-plane (i.e., CrystalNet) or lack support for network dynamics (i.e., EmuLab).
In this paper, we address these limitations with Kollaps, a fully distributed network emulator. Kollaps hinges on two key observations. First, from an application's perspective, what matters are the emergent end-to-end properties (e.g., latency, bandwidth, packet loss, and jitter) rather than the internal state of the routers and switches leading to those properties. This premise allows us to build a simpler, dynamically adaptable, emulation model that does not require maintaining the full network state. Second, this simplified model is amenable to be maintained in a fully decentralized way, allowing the emulation to scale with the number of machines required by the application.
Kollaps is fully decentralized, agnostic of the application language and transport protocol, scales to thousands of processes and is accurate when compared against a bare-metal deployment or state-of-the-art approaches that emulate the full state of the network. We showcase how Kollaps can accurately reproduce results from the literature and predict the behaviour of a complex unmodified distributed key-value store (i.e., Cassandra) under different deployments.