Wenyi Zhao, Quan Chen, Hao Lin, Jianfeng Zhang, Jingwen Leng, Chao Li, Wenli Zheng, Li Li, Minyi Guo
In Proceedings of IEEE International Parallel & Distributed Processing Symposium (IPDPS), May 2019
Predicting the performance degradation of a GPU application when it is co-located with other applications on a spatial multitasking GPU, without prior application knowledge, is essential in public clouds. Prior work mainly targets CPU co-location, and is inaccurate and/or inefficient for predicting the performance of co-located applications on spatial multitasking GPUs. Our investigation shows that hardware event statistics caused by co-located applications, which can be collected with negligible overhead, strongly correlate with their slowdowns. Based on this observation, we present Themis, an online slowdown predictor that can precisely and efficiently predict application slowdown without prior application knowledge. We first train a precise slowdown model offline using hardware event statistics collected from representative co-locations. When new applications co-run, Themis collects event statistics and predicts their slowdowns simultaneously. Our evaluation shows that Themis has negligible runtime overhead and can precisely predict application-level slowdown with a prediction error smaller than 9.5%. Based on Themis, we also implement an SM allocation engine to rein in application slowdown under co-location. Case studies show that the engine successfully enforces fair sharing and QoS.
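
The two-phase flow described in the abstract (fit a slowdown model offline from event statistics, then apply it online to a new co-run) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the abstract does not state the model family, so a plain linear model fitted by gradient descent stands in for it, and the two feature names and all numbers are synthetic rather than measurements from the paper.

```python
# Hypothetical sketch of offline training + online prediction.
# ASSUMPTIONS (not from the paper): a linear model, two made-up normalized
# event statistics per co-location, and synthetic training data.

def predict(weights, bias, features):
    """Predicted slowdown = bias + weighted sum of event statistics."""
    return bias + sum(w * x for w, x in zip(weights, features))

def fit_linear(samples, targets, lr=0.3, epochs=5000):
    """Fit weights and bias by batch gradient descent on squared error."""
    n = len(samples)
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(samples, targets):
            err = predict(weights, bias, x) - y
            for j, xj in enumerate(x):
                grad_w[j] += err * xj
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias

# Offline phase: (memory_pressure, cache_miss_rate) per representative
# co-location, both hypothetical statistics normalized to [0, 1].
train_events = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0),
                (1.0, 1.0), (0.5, 0.5), (0.2, 0.8)]
# Synthetic slowdowns (co-run time / solo-run time) for those co-locations.
train_slowdowns = [1.0, 1.8, 1.4, 2.2, 1.6, 1.48]

weights, bias = fit_linear(train_events, train_slowdowns)

# Online phase: statistics observed for a previously unseen co-run.
observed = (0.6, 0.3)
predicted = predict(weights, bias, observed)
print(f"predicted slowdown: {predicted:.2f}x")
```

The online step is a single weighted sum over counters that are already being collected, which is consistent with the abstract's claim of negligible runtime overhead; a real predictor would use the hardware events and model the paper actually selects.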