Dakkak, Abdul; Li, Cheng; Garcia de Gonzalo, Simon

Deep neural networks (DNNs) have become pervasive within low-latency Function as a Service (FaaS) prediction pipelines, but they suffer from two major sources of latency overhead: 1) the round-trip network latency between the FaaS container and a remote model serving process; 2) Deep Learning (DL) framework runtime instantiation and model loading from storage to CPU or GPU memory. We characterize 2) for online DL model inference across popular DL frameworks, such as Caffe, Caffe2, MXNet, and TensorFlow, on both CPUs and GPUs, and show that the overhead is significant. While persistent model serving schemes remove 2), they do so by keeping models in memory indefinitely, which wastes resources and increases cost. Based on the observation that models are frequently shared within FaaS environments (for example, in image recognition, object detection, NLP, and speech synthesis), we propose TrIMS, a multi-tier software caching layer on top of FaaS worker machines to address the problem.

TrIMS consists of a persistent model store across the GPU, CPU, local storage, and cloud storage hierarchy, an efficient resource management layer that provides isolation, and a succinct set of application APIs and container technologies for easy and transparent integration with FaaS, DL frameworks, and user code. It enables sharing models across user processes within a system while guaranteeing isolation.
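
The tiered lookup described above can be illustrated with a minimal sketch. This is not the TrIMS implementation: the class names (`Tier`, `TieredModelStore`), the LRU eviction policy, the per-tier capacity counted in models, and the `fetch_remote` callback are all assumptions made for illustration. It only shows the general idea of checking the fastest tier first, falling back to slower tiers, and copying a model into the faster tiers once it is found.

```python
from collections import OrderedDict
from typing import Callable, List, Optional, Tuple


class Tier:
    """One level of the hierarchy (e.g. GPU, CPU, local disk).

    Assumption: LRU eviction and capacity measured in number of models.
    """

    def __init__(self, name: str, capacity: int) -> None:
        self.name = name
        self.capacity = capacity
        self.entries: "OrderedDict[str, bytes]" = OrderedDict()

    def get(self, key: str) -> Optional[bytes]:
        if key not in self.entries:
            return None
        self.entries.move_to_end(key)  # mark as most recently used
        return self.entries[key]

    def put(self, key: str, weights: bytes) -> None:
        self.entries[key] = weights
        self.entries.move_to_end(key)
        while len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used


class TieredModelStore:
    """Hypothetical model store; tiers are ordered fastest first."""

    def __init__(self, tiers: List[Tier],
                 fetch_remote: Callable[[str], bytes]) -> None:
        self.tiers = tiers
        self.fetch_remote = fetch_remote  # stands in for cloud storage

    def load(self, key: str) -> Tuple[bytes, str]:
        """Return the weights and the name of the tier that served them."""
        for depth, tier in enumerate(self.tiers):
            weights = tier.get(key)
            if weights is not None:
                self._promote(key, weights, depth)
                return weights, tier.name
        weights = self.fetch_remote(key)
        self._promote(key, weights, len(self.tiers))
        return weights, "cloud"

    def _promote(self, key: str, weights: bytes, depth: int) -> None:
        # Copy the model into every tier faster than the one that held it.
        for tier in self.tiers[:depth]:
            tier.put(key, weights)
```

With tiers of capacity 1 (GPU), 2 (CPU), and 4 (disk), the first request for a model is served from the remote store, a repeated request is served from the GPU tier, and a model evicted from the GPU tier by another model is later served from the CPU tier instead of being fetched again.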
We implement TrIMS within Apache MXNet and evaluate it on three systems that represent current cloud system offerings. Using 45 DL models, we show a speedup of up to 24× for small models and up to 210× for large models. When running concurrent inference, we increase the overall system throughput by up to 8× and are within 20% of the ideal speedup (where ideal means that model loading and data movement take no time). Our methodology, when applied to DL frameworks, offers advantages to both cloud providers and users. The isolation, along with the significant memory reduction through model sharing, enables cloud providers to over-provision hardware resources, thus decreasing the total cost of ownership. The benefits of TrIMS to cloud providers can be passed down to users in the form of reduced latency or lower cost of inference.

TrIMS is a generic memory sharing technique that can be used when computation requires a large number of constant parameters to be in situ on the CPU or GPU, while still maintaining isolation between users. As such, the proposed method can be easily generalized to any application or algorithm that spans multiple processes and requires a large amount of constant data. While we motivated our work with deep learning, other types of applications such as image processing, physical simulation, or in-memory databases can benefit from our approach. By removing the memory copy overhead in these use cases, we also make the GPU a more viable option for them.
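
As a loose illustration of sharing constant data between processes, the sketch below uses Python's standard `multiprocessing.shared_memory` module. It covers host memory only and is not the mechanism TrIMS uses; the helper names `publish_constants` and `read_constants` are hypothetical. One process publishes a block of constant bytes once, and any other process that knows the block's name can attach to it instead of loading its own copy from storage.

```python
from multiprocessing import shared_memory


def publish_constants(data: bytes) -> shared_memory.SharedMemory:
    """Create a named shared-memory block and fill it with constant data.

    The caller owns the block and should call close() and unlink() on it
    once no reader needs it any more.
    """
    shm = shared_memory.SharedMemory(create=True, size=len(data))
    shm.buf[:len(data)] = data
    return shm


def read_constants(name: str, size: int) -> bytes:
    """Attach to an existing block by name and read `size` bytes from it.

    `size` is passed explicitly because the operating system may round the
    block up to a multiple of the page size. For simplicity this sketch
    copies the bytes out; a real consumer would work on the buffer in place.
    """
    shm = shared_memory.SharedMemory(name=name)
    try:
        return bytes(shm.buf[:size])
    finally:
        shm.close()  # detach only; the owner is responsible for unlink()
```

A consumer would typically receive the block name and size from the publisher through some side channel (for example, a small request to a local daemon), which is where access checks for isolation could be enforced.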

Related papers:

"TrIMS: Transparent and Isolated Model Sharing for Low Latency Deep Learning Inference in Function-as-a-Service", Abdul Dakkak, Cheng Li, Simon Garcia de Gonzalo, Jinjun Xiong, Wen-mei Hwu, IEEE International Conference on Cloud Computing, July 8-13, 2019, Milan, Italy.
"TrIMS: Transparent and Isolated Model Sharing for Low Latency Deep Learning Inference in Function as a Service Environments", Abdul Dakkak, Cheng Li, Simon Garcia de Gonzalo, Jinjun Xiong, Wen-mei Hwu, Systems for ML at NIPS 2018.