We have trained an XGBoost model and are trying to deploy it on AWS SageMaker using a custom container.
Issue:
I am not running into any errors when I hit my SageMaker endpoint; however, predictions are much slower than on the CPU-only endpoint I built (~200 ms vs ~20 ms per request). When I check the metrics in CloudWatch, I can see GPU Memory Utilization, but GPU Utilization stays at zero.
What I’ve tried:
- I have set `predictor` to `gpu_predictor` and `tree_method` to `gpu_hist`.
- I am deploying on a single ml.g4dn.xlarge instance.
- I tried building my container from the `nvidia/cuda:10.1-cudnn7-runtime` image and also from the AWS XGBoost image found here.
Questions:
- Is `gpu_predictor` primarily designed for training/batch prediction, and is this latency expected for single-row predictions?
- To properly make use of the GPU for inference, does the model need to have been trained on a GPU?
- Has anyone else attempted this and run into similar issues?