Prediction time is independent of the number of trees?

I was recently measuring the prediction time of XGB as part of a research project and noticed some strange behavior. No matter how many trees I use, the prediction times are nearly the same. I would expect the time to grow proportionally to the ensemble size. For instance, below are some representative numbers:

Note: all times were measured using gettimeofday() on UNIX. A dataset of size 50k with 512 features was used.

So, the questions are:

  • Does XGB internally use parallel processing during prediction as well? If so, how can I force it to use a single core and a single thread (i.e., no parallelism at all) in the Python interface?
  • Does XGB use any “clever” speed-ups during prediction? For example, is a tree’s prediction computed by a matrix-vector product instead of recursive tree traversal? If this is a long answer, could you refer me to some papers where it is explained?
  • Yes, by default XGBoost uses all CPU cores to run prediction. To disable this, either set the nthread=1 parameter or set the environment variable OMP_NUM_THREADS=1.
  • No, XGBoost implements tree traversal.
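
To illustrate why plain tree traversal on a single thread should scale with ensemble size, here is a minimal pure-Python sketch. It is not XGBoost's actual code: the Node layout and the helpers predict_tree, predict_ensemble and make_stump are hypothetical names made up for this illustration. It counts the nodes visited per prediction as a rough proxy for work.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass
class Node:
    """One node of a binary regression tree (hypothetical layout, not XGBoost's)."""
    feature: int = -1               # index of the feature tested at a split node
    threshold: float = 0.0          # go left when x[feature] < threshold
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    value: float = 0.0              # leaf output; only meaningful for leaves

    @property
    def is_leaf(self) -> bool:
        return self.left is None and self.right is None


def predict_tree(node: Node, x: Sequence[float]) -> Tuple[float, int]:
    """Walk one tree from root to leaf; return (leaf value, nodes visited)."""
    visited = 1
    while not node.is_leaf:
        node = node.left if x[node.feature] < node.threshold else node.right
        visited += 1
    return node.value, visited


def predict_ensemble(trees: List[Node], x: Sequence[float]) -> Tuple[float, int]:
    """Sum the leaf values of all trees; also return the total nodes visited."""
    total, work = 0.0, 0
    for tree in trees:
        value, visited = predict_tree(tree, x)
        total += value
        work += visited
    return total, work


def make_stump(feature: int, threshold: float,
               left_value: float, right_value: float) -> Node:
    """Build a depth-1 tree: one split node with two leaves."""
    return Node(feature=feature, threshold=threshold,
                left=Node(value=left_value), right=Node(value=right_value))


x = [0.2, 0.9]
small = [make_stump(0, 0.5, 1.0, -1.0)] * 10    # 10 identical stumps
large = [make_stump(0, 0.5, 1.0, -1.0)] * 100   # 100 identical stumps

print(predict_ensemble(small, x))  # (10.0, 20)
print(predict_ensemble(large, x))  # (100.0, 200)
```

Ten times as many trees means ten times as many nodes visited, so with traversal on one thread the time per prediction should grow roughly linearly with the number of trees. If measured times stay flat, multithreading or fixed overhead is a likely explanation.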