Container Performance Question


#docker #modeltraining #gridsearch #performance
I’ve recently had to migrate my team’s data science/ML environment from a cloud architecture to on-prem for one of my client projects (so not able to share code/output snippets). The cloud implementation is Anaconda 5.3. The on-prem implementation is based on a continuumio Docker image that is more or less Anaconda 5.3 as well.

We’ve conda installed xgboost in both places. Cloud is version 0.81 while on-prem (container) is 0.90. The cloud host OS is RHEL 7, while the on-prem host OS is RHEL 7.6. The container is Debian 9.

I have one particular model training workload that uses scikit-learn’s GridSearchCV() to tune the hyperparameters. Pretty standard and straightforward. Data size is relatively modest, and the host machines are both very large. The newer on-prem implementation of the same code is taking significantly more wall-time than the previous cloud implementation. I have only a modicum of experience with containers, so it’s definitely possible that I’m missing something at that level, but is there a reason that a containerized environment would be so much slower to train?


Did you try using the same XGBoost version in both environments? We would like to know whether the performance regression is due to the XGBoost version or to other factors.
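
To help separate the XGBoost version from other factors, it may be useful to run the same small report in both environments and compare the output. The sketch below is only a suggestion, not a diagnosis: the helper names (`package_version`, `usable_cpus`, `environment_report`) are made up for illustration, and the idea that CPU visibility or `OMP_NUM_THREADS` differs between the two setups is an assumption worth checking, not something established in this thread.

```python
import importlib
import os
import platform


def package_version(name):
    """Return the __version__ of an importable package, or None if unavailable."""
    try:
        module = importlib.import_module(name)
    except ImportError:
        return None
    return getattr(module, "__version__", None)


def usable_cpus():
    """Number of CPUs this process is allowed to run on.

    On Linux this reflects the CPU affinity mask, which can be smaller than
    os.cpu_count() inside a container. Elsewhere it falls back to os.cpu_count().
    """
    if hasattr(os, "sched_getaffinity"):
        return len(os.sched_getaffinity(0))
    return os.cpu_count()


def environment_report(packages=("xgboost", "sklearn", "numpy")):
    """Collect version and CPU details that are worth comparing across hosts."""
    report = {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "cpu_count": os.cpu_count(),
        "usable_cpus": usable_cpus(),
        # Unset (None) means the threading runtime picks its own default.
        "OMP_NUM_THREADS": os.environ.get("OMP_NUM_THREADS"),
    }
    for name in packages:
        report[name] = package_version(name)
    return report


if __name__ == "__main__":
    for key, value in environment_report().items():
        print(f"{key}: {value}")
```

If the two reports differ only in the xgboost line, pinning the container to 0.81 and re-timing the grid search would be the cleanest next test. Note that the affinity mask does not capture every kind of container CPU limit, so a matching `usable_cpus` value does not by itself rule out a resource cap.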