Jingwen Leng, Alper Buyuktosunoglu, Ramon Bertran, Pradip Bose, Quan Chen, Minyi Guo, Vijay Janapa Reddi
In Proceedings of International Symposium on High Performance Computer Architecture (HPCA), Feb. 2020
Accelerators make the task of building systems that are resilient against transient errors like voltage noise and soft erors hard. Architects integrate accelerators into the system as black box third-party IP components. So a fault in one or more accelerators may threaten the system’s reliability if there are no established failure semantics for how an error propagates from the accelerator to the main CPU. Existing solutions that assure system reliability come at the cost of sacrificing accelerator generality, efficiency, and incur significant overhead, even in the absence of errors. To overcome these drawbacks, we examine reliability management of accelerator systems via hardware-software co-design, coupling an efficient architecture design with compiler and runtime support, to cope with transient errors. We introduce asymmetric resilience that architects reliability at the system level, centered around a hardened CPU, rather than at the accelerator level. At runtime, the system exploits task-level idempotency to contain accelerator errors and use memory protection instead of taking checkpoints to mitigate overheads. We also leverage the fact that errors rarely occur in systems, and exploit the trade-off between error recovery performance and improved error-free performance to enhance system efficiency. Using GPUs, which are at the forefront of accelerator systems, we demonstrate how our system architecture manages reliability in both integrated and discrete systems, under voltage-noise and soft-error related faults, leading to extremely low overhead (less than 1%) and substantial gains (20% energy savings on average).