XGBoost Booster vs. future for DaskXGBRegressor.predict()

From the xgboost docs re: xgboost.dask:

https://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.dask.predict

model (Union[Dict[str, Any], xgboost.core.Booster, distributed.Future]) – The trained model. It can be a distributed.Future so user can pre-scatter it onto all workers.

How can I get the distributed future of a model?

Both calling .get_booster() on an instance of DaskXGBRegressor (after running .fit()) and the output of xgb.dask.train() give me an xgboost.core.Booster.

Also, what do I lose if I supply the Booster rather than the future when using a Dask cluster?

I’m having trouble understanding the performance implications of not being able to scatter to the workers.

Hi, you can use client.scatter to obtain a future to the booster.

Before running prediction on each worker, the workers need to have a copy of the booster first. client.scatter does exactly that and returns a handle to the remote copy of the booster as a future. If you pass this future to XGBoost, then XGBoost doesn't need to make the copy itself, so multiple calls to the prediction function can reuse the same remote copy, which gives higher performance.
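For example, something like this (a minimal sketch, assuming a running distributed.Client named client and dask arrays X and y; the variable names are just for illustration):

import xgboost as xgb

output = xgb.dask.train(client, {"objective": "reg:squarederror"},
                        xgb.dask.DaskDMatrix(client, X, y))
booster = output["booster"]  # plain xgboost.core.Booster

# Scatter once; the future points at copies already living on the workers
booster_future = client.scatter(booster, broadcast=True)

# Each predict call reuses the same remote copies instead of
# shipping the booster to the workers again
pred_1 = xgb.dask.predict(client, booster_future, X)
pred_2 = xgb.dask.predict(client, booster_future, X)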

Thanks for the response! That does make sense about how this enhances performance.

My current code is:

regressor = xgboost.dask.DaskXGBRegressor()
regressor.fit(X, y)
booster = regressor.get_booster()
preds = xgboost.dask.predict(client, booster, dmatrix)

Based on your suggestion, I should add the following between the booster = and preds = lines:
booster_futures = client.scatter(booster, broadcast=True)

and then I’d change the preds line to:
preds = xgboost.dask.predict(client, booster_futures, dmatrix)
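So, putting it together, the full flow would be something like this (my own sketch; I'm assuming X_test is the dask collection holding my prediction features and that dmatrix is built from it):

regressor = xgboost.dask.DaskXGBRegressor()
regressor.fit(X, y)
booster = regressor.get_booster()

# Pre-scatter the booster so every worker already holds a copy
booster_futures = client.scatter(booster, broadcast=True)

# Prediction data (X_test is an assumption here)
dmatrix = xgboost.dask.DaskDMatrix(client, X_test)

preds = xgboost.dask.predict(client, booster_futures, dmatrix)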

I'm assuming broadcast should be True because I want to send the booster to all workers, and that I don't need to set direct since it will be determined automatically. Is that correct?

Also, I’m curious why this isn’t handled more directly by Dask. As a user, I would expect .get_booster() on a DaskXGBRegressor (or any xgboost.dask estimator) to return a Dask version of a booster that would then handle the data distribution to the workers. Do you know if there is a reason that isn’t possible? Or is it, maybe just not implemented yet?

Your code looks good.

Also, I’m curious why this isn’t handled more directly by Dask

If you meant why isn't it handled by the xgboost dask module: I did give it some thought, and feel free to make suggestions. There are some difficulties:

1. The current design was easier to implement, because the booster object stored in the sklearn interface is also used by other features (getting the n_features_in_ attribute, computing global feature contributions, etc.).
2. How do you select a worker from which to return the booster as a future? Some workers might have an empty booster due to empty training partitions.
3. The sklearn wrapper doesn't know how to maintain a coherent cache for the booster-as-future. What if you load a model from a file? What if you pickle the model? What if you set some new parameters? What if you want to continue the training?

In the end, I just kept it as a blocking function. Not super Dask-friendly, as you have pointed out, but at least it's robust.
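To make the third point concrete, here is a toy illustration (my own sketch, not how the wrapper actually works; other_model.json is just a placeholder):

# Imagine the wrapper cached a scattered future internally:
booster_future = client.scatter(regressor.get_booster(), broadcast=True)

# Later the local booster is mutated, e.g. by loading a different model
regressor.get_booster().load_model("other_model.json")  # placeholder path

# The remote copies behind booster_future still hold the old booster,
# so a prediction that reuses the cached future would silently use stale state
preds = xgboost.dask.predict(client, booster_future, dmatrix)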

Having said that, if you have some ideas on how to improve it, I'm excited to help.

Yea, what I meant by that is that I would expect the following code:

regressor = xgboost.dask.DaskXGBRegressor()
regressor.fit(X, y)
booster = regressor.get_booster()

would yield a booster of some hypothetical xgb.dask.Booster type (or a future, as when using client.scatter), instead of an xgboost.core.Booster. And when I call:

preds = xgboost.dask.predict(client, booster, dmatrix)

It would distribute the work.

I think I understand the difficulties you're describing, which answers what I was really trying to understand: "are there good reasons this hasn't been implemented, beyond it simply not having been looked into?" The answer seems to be yes; there are complicating factors that make it non-trivial.

Thanks for your responses!