TrIMS: Transparent and Isolated Model Sharing for Low Latency Deep Learning Inference in Function as a Service Environments
Publication Year: 2018
  Abdul Dakkak, Cheng Li, Simon Garcia de Gonzalo, Jinjun Xiong, Wen-mei Hwu
  Systems for ML at NIPS 2018

Deep neural networks (DNNs) have become pervasive within low latency Function as a Service (FaaS) prediction pipelines, but they suffer from two major sources of latency overhead: 1) the round-trip network latency between the FaaS container and a remote model serving process; 2) Deep Learning (DL) framework runtime instantiation and model loading from storage to CPU or GPU memory. While model servers solve the latter, they do so by persisting models in memory indefinitely, which wastes resources and increases cost. Within FaaS environments, models are frequently shared across applications such as image recognition, object detection, NLP, and speech synthesis. We propose TrIMS, a multi-tier software caching layer on top of FaaS worker machines, to solve this problem. Our solution manages models within caches that span GPUs, CPUs, local storage, and cloud storage through a resource management service. This enables sharing models across user processes within a system while guaranteeing isolation, and it provides a succinct set of APIs and container technologies for easy and transparent integration with FaaS, DL frameworks, and user code. Moreover, we show that by adopting this technique, we are able to oversubscribe the system without degrading the baseline latency. We evaluate our solution by interfacing TrIMS with the Apache MXNet framework and demonstrate up to 24× speedup in latency for image classification models.
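
The multi-tier caching idea described in the abstract can be illustrated with a minimal sketch. This is not the TrIMS implementation or its API: the class `TieredModelCache`, the tier names, and the `fetch_from_cloud` callback are hypothetical, and plain dictionaries stand in for GPU memory, CPU memory, and local storage. The sketch only shows the lookup order (fastest tier first) and the promotion of a model into faster tiers once it has been found.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical tier names, ordered from fastest to slowest.
# Cloud storage is treated as the fallback source, not as a cache tier.
TIERS: List[str] = ["gpu", "cpu", "local_storage"]


class TieredModelCache:
    """Toy multi-tier model cache (illustrative only, not the TrIMS API)."""

    def __init__(self, fetch_from_cloud: Callable[[str], bytes]) -> None:
        self._fetch_from_cloud = fetch_from_cloud
        # One in-memory dictionary per tier stands in for real device memory.
        self._tiers: Dict[str, Dict[str, bytes]] = {t: {} for t in TIERS}

    def lookup(self, model_id: str) -> Optional[str]:
        """Return the fastest tier that currently holds the model, or None."""
        for tier in TIERS:
            if model_id in self._tiers[tier]:
                return tier
        return None

    def load(self, model_id: str) -> bytes:
        """Return the model weights and promote them into all faster tiers."""
        hit = self.lookup(model_id)
        if hit is None:
            # Miss in every tier: fall back to cloud storage.
            weights = self._fetch_from_cloud(model_id)
            to_fill = TIERS
        else:
            weights = self._tiers[hit][model_id]
            # Only the tiers faster than the hit need to be filled.
            to_fill = TIERS[: TIERS.index(hit)]
        for tier in to_fill:
            self._tiers[tier][model_id] = weights
        return weights

    def evict(self, model_id: str, tier: str) -> None:
        """Drop a model from one tier, e.g. to reclaim GPU memory."""
        self._tiers[tier].pop(model_id, None)
```

In this sketch, a second `load` of the same model is served from the fastest tier without touching cloud storage, and evicting a model from the `gpu` tier leaves the copy in the `cpu` tier available for a later reload. A real system would additionally need the isolation guarantees and resource management that the abstract attributes to TrIMS.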