Slow prediction time with GPUs on AWS

We have trained an XGBoost model and I am trying to deploy it on AWS SageMaker using a custom container.

Issue:
I am not running into any errors when I hit my SageMaker endpoint; however, the prediction times are much slower than on the endpoint I built that does not use GPUs (200 ms vs. 20 ms). When I check the metrics in CloudWatch, I can see GPU Memory Utilization but no GPU Utilization.

What I’ve tried:

  • I have set the predictor to ‘gpu_predictor’ and the tree_method to ‘gpu_hist’.
  • I am deploying on a single ml.g4dn.xlarge instance (a rough sketch of the deployment setup follows this list).
  • I tried building my container from the nvidia/cuda:10.1-cudnn7-runtime image and also from the AWS XGBoost image found here.
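
For reference, the deployment side of what I have tried looks roughly like this. This is only a sketch using SageMaker Python SDK v2 conventions; the ECR image URI, model artifact path, and IAM role are placeholders for my actual values.

```python
from sagemaker.model import Model

# Placeholders: substitute your own ECR image, model artifact, and IAM role.
model = Model(
    image_uri="<account>.dkr.ecr.<region>.amazonaws.com/xgboost-gpu:latest",
    model_data="s3://<bucket>/path/to/model.tar.gz",
    role="<sagemaker-execution-role-arn>",
)

# Single GPU instance, as described above.
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g4dn.xlarge",
)
```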

Questions:

  • Is gpu_predictor primarily designed for training/batch prediction, so that this is expected for single predictions?
  • To properly make use of GPUs for inference, does the model need to be trained on GPUs?
  • Has anyone else attempted this and run into similar issues?

Yes, this is expected for single predictions. You incur a fixed overhead for moving data from main (host) memory to GPU memory on every request, and for a single row that transfer cost dominates the actual prediction work, so the CPU endpoint comes out faster. That is also consistent with what you see in CloudWatch: GPU Memory Utilization shows up because the model is resident on the GPU, while GPU Utilization stays near zero because each single-row prediction keeps the GPU busy for only a tiny fraction of the request. Batching requests amortizes that overhead, as shown below.
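
As a rough illustration, here is a sketch that compares per-row calls against one batched call. The model file name and feature width are hypothetical, and the parameter names assume XGBoost 1.x; the timings themselves will vary with your hardware.

```python
import time

import numpy as np
import xgboost as xgb

# Hypothetical model artifact and feature width; adjust to your own setup.
booster = xgb.Booster()
booster.load_model("xgboost-model")
booster.set_param({"predictor": "gpu_predictor"})

rows = np.random.rand(1024, 50).astype(np.float32)

# One row per call: every call pays the fixed host-to-GPU copy overhead.
start = time.perf_counter()
for row in rows:
    booster.predict(xgb.DMatrix(row.reshape(1, -1)))
per_row_total = time.perf_counter() - start

# One batched call: the same overhead is paid once for all 1024 rows.
start = time.perf_counter()
booster.predict(xgb.DMatrix(rows))
batched_total = time.perf_counter() - start

print(f"1024 single-row calls: {per_row_total:.3f} s")
print(f"one 1024-row call:     {batched_total:.3f} s")
```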

No, this is not necessary. The trees XGBoost learns are the same regardless of the tree_method used, so a model trained on CPU can still be served with the GPU predictor, as sketched below.
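
A minimal sketch, using synthetic data and XGBoost 1.x parameter names, of a model built entirely on CPU and then switched to the GPU predictor for inference:

```python
import numpy as np
import xgboost as xgb

# Synthetic data purely for illustration.
X = np.random.rand(10_000, 50).astype(np.float32)
y = (np.random.rand(10_000) > 0.5).astype(np.int32)
dtrain = xgb.DMatrix(X, label=y)

# Train on CPU: no GPU involved at this stage.
booster = xgb.train(
    {"tree_method": "hist", "objective": "binary:logistic"},
    dtrain,
    num_boost_round=50,
)

# At inference time the same booster can use the GPU predictor;
# the learned trees are identical regardless of where they were built.
booster.set_param({"predictor": "gpu_predictor"})
preds = booster.predict(xgb.DMatrix(X[:8]))
print(preds)
```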