Quan Chen, Zhenning Wang, Jingwen Leng, Chao Li, Wenli Zheng, Minyi Guo
In Proc. of ACM International Conference on Supercomputing (ICS). June 2019.
Existing techniques for improving datacenter utilization while guaranteeing QoS assume that queries behave similarly. However, user queries in emerging compute-demanding services behave in significantly diverse ways and require adaptive parallelism. Our study shows that the end-to-end latency of a compute-demanding query is jointly determined by the system-wide load, its workload, its parallelism, contention on the shared cache, and memory bandwidth. When hosting such new services, current cross-query resource allocation results in either severe QoS violations or significant resource under-utilization.
To maximize hardware utilization while guaranteeing QoS, we present Avalon, a runtime system that allocates shared resources to each query independently. Avalon first provides an automatic feature identification tool, based on Lasso regression, that identifies the features relevant to a query's performance. It then builds models that precisely predict a query's duration under various resource configurations. Using these prediction models, Avalon proactively allocates "just-enough" cores and shared cache space to each query, so that the remaining resources can be assigned to best-effort applications. At runtime, Avalon monitors the progress of each query and mitigates possible QoS violations caused by memory bandwidth contention, occasional I/O contention, or unpredictable system interference. Our results show that Avalon improves utilization by 28.9% on average compared with state-of-the-art techniques while meeting the 99th-percentile latency target.
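The "just-enough" allocation idea can be illustrated with a minimal sketch: given some latency predictor, pick the smallest core count and cache allocation whose predicted latency still meets the QoS target. Everything in the block is an assumption made for illustration, not the paper's implementation: the function names `predict_latency_ms` and `just_enough`, the Amdahl-style latency formula and its constants, and the exhaustive search are all hypothetical stand-ins for Avalon's learned models and allocator.

```python
from itertools import product


def predict_latency_ms(work, cores, cache_ways, load):
    """Hypothetical latency model (illustrative only, not Avalon's model)."""
    # Amdahl-style split: 10% of the work is serial, 90% scales with cores.
    serial = 0.1 * work
    parallel = 0.9 * work / cores
    # Assumed penalty: fewer shared-cache ways means more misses.
    cache_penalty = 1.0 + 0.5 / cache_ways
    # Assumed penalty: system-wide load inflates latency linearly.
    load_penalty = 1.0 + load
    return (serial + parallel) * cache_penalty * load_penalty


def just_enough(work, load, qos_ms, max_cores=16, max_ways=20,
                predict=predict_latency_ms):
    """Return the smallest (cores, cache_ways) predicted to meet qos_ms.

    Candidates are compared as tuples, so fewer cores is preferred first,
    then fewer cache ways. Returns None if no configuration is predicted
    to meet the target.
    """
    best = None
    for cores, ways in product(range(1, max_cores + 1),
                               range(1, max_ways + 1)):
        if predict(work, cores, ways, load) <= qos_ms:
            candidate = (cores, ways)
            if best is None or candidate < best:
                best = candidate
    return best


if __name__ == "__main__":
    # An idle system needs fewer resources than a loaded one for the same query.
    print(just_enough(work=100, load=0.0, qos_ms=40))
    print(just_enough(work=100, load=0.5, qos_ms=40))
```

Whatever the allocator leaves unassigned would be available to best-effort applications. A real system would replace the toy predictor with a trained model and would likely avoid exhaustive search when the configuration space is large.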