TrIMS: Transparent and Isolated Model Sharing for Low Latency Deep Learning Inference in Function-as-a-Service
Publication Year: 2019
  Abdul Dakkak, Cheng Li, Simon Garcia de Gonzalo, Jinjun Xiong, Wen-mei Hwu
  IEEE International Conference on Cloud Computing, July 8-13, 2019, Milan, Italy

Deep neural networks (DNNs) have become core computation components within low latency Function as a Service (FaaS) prediction pipelines. Cloud computing, as the de facto backbone of modern computing infrastructure, has to be able to handle user-defined FaaS pipelines containing diverse DNN inference workloads while maintaining isolation and latency guarantees with minimal resource waste. The current solution for guaranteeing isolation and latency within FaaS is inefficient. A major cause of the inefficiency is the need to move large amounts of data within and across servers. We propose TrIMS as a novel solution to address this issue. TrIMS is a generic memory sharing technique that enables constant data to be shared across processes or containers while still maintaining isolation between users. TrIMS consists of a persistent model store across the GPU, CPU, local storage, and cloud storage hierarchy, an efficient resource management layer that provides isolation, and a succinct set of abstractions, application APIs, and container technologies for easy and transparent integration with FaaS, Deep Learning (DL) frameworks, and user code. We demonstrate our solution by interfacing TrIMS with the Apache MXNet framework and show up to 24× speedup in latency for image classification models, up to 210× speedup for large models, and up to 8× system throughput improvement.
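
The model store hierarchy described in the abstract can be illustrated with a small sketch. This is not the TrIMS implementation or its API: the class name `TieredModelStore`, the tier names, and the promote-on-hit policy are all assumptions made for illustration, and plain dictionaries stand in for GPU memory, CPU memory, local storage, and cloud storage. It only shows the general idea that a model is fetched from the fastest tier that holds it, and that later requests reuse the same constant data instead of copying it again.

```python
class TieredModelStore:
    """Toy model store (hypothetical, not TrIMS): serve weights from the fastest tier."""

    # Tiers ordered from fastest to slowest; names are illustrative assumptions.
    TIERS = ("gpu", "cpu", "local_disk", "cloud")

    def __init__(self, cloud_models):
        # Each tier is a plain dict standing in for real memory or storage.
        self.tiers = {name: {} for name in self.TIERS}
        self.tiers["cloud"] = dict(cloud_models)

    def load(self, model_name):
        """Return (weights, tier_hit) and promote the weights to all faster tiers."""
        for i, tier in enumerate(self.TIERS):
            if model_name in self.tiers[tier]:
                weights = self.tiers[tier][model_name]
                # Promote: later requests hit a faster tier and share this object.
                for faster in self.TIERS[:i]:
                    self.tiers[faster][model_name] = weights
                return weights, tier
        raise KeyError(model_name)


# Placeholder bytes stand in for real model weights.
store = TieredModelStore({"resnet50": b"\x00" * 16})
first = store.load("resnet50")   # cold request: served from the slowest tier
second = store.load("resnet50")  # warm request: served from the fastest tier
```

In this sketch the second request returns the very same object as the first, which mirrors the abstract's point that constant model data is shared rather than moved repeatedly. Capacity limits, eviction, and per-user isolation are left out.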