Container Performance Question

#docker #modeltraining #gridsearch #performance
I’ve recently had to migrate my team’s data science/ML environment from a cloud architecture to on-prem for one of my client projects (so I’m not able to share code/output snippets). The cloud implementation is Anaconda 5.3. The on-prem implementation is based on a continuumio Docker image that is more or less Anaconda 5.3 as well.

We’ve conda installed xgboost in both places. Cloud is version 0.81, while on-prem (container) is 0.90. The cloud host OS is RHEL 7, while the on-prem host OS is RHEL 7.6. The container is Debian 9.

I have one particular model training workload that uses scikit-learn’s GridSearchCV() to tune the hyperparameters. Pretty standard and straightforward. Data size is relatively modest, and the host machines are both very large. The newer on-prem implementation of the same code is taking significantly more wall time than the previous cloud implementation. I have a modicum of experience with containers, so it’s definitely possible that I’m missing something at that level, but is there a reason a containerized environment would be so much slower to train?
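For illustration (this is not the client code), one container-level thing worth checking is whether the CPU count Python reports matches the cores the container can actually use — if they differ, `n_jobs`-style parallelism plus XGBoost's own OpenMP threads can oversubscribe the limited cores. A minimal stdlib sketch, assuming Linux:

```python
import os

# Logical CPUs Python reports; joblib and OpenMP size their pools from this,
# and inside a container it is typically the host's full core count.
reported = os.cpu_count()

# CPUs this process may actually be scheduled on (Linux-only call);
# a cpuset-limited container reports the smaller, real number here.
schedulable = len(os.sched_getaffinity(0))

print(f"os.cpu_count(): {reported}, schedulable: {schedulable}")
if schedulable < reported:
    # More workers/threads than schedulable cores -> contention and slowdown.
    print("Container is CPU-limited; consider capping n_jobs / OMP_NUM_THREADS.")
```

If the two numbers disagree, explicitly setting `n_jobs` on GridSearchCV and `OMP_NUM_THREADS` for XGBoost so their product stays at or below the schedulable count is a reasonable first experiment.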

Did you try using the same XGBoost version in both environments? We would like to know whether the performance regression is due to the XGBoost version or to other factors.
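For example, pinning the container to the cloud version (assuming the conda channel you use still carries 0.81) would isolate the version variable:

```shell
# Inside the container: downgrade xgboost to match the cloud environment
conda install xgboost=0.81
```

Re-running the same grid search afterward would show whether the gap follows the version or the environment.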