Online training and larger-than-memory data

Hello all. I’ve used xgboost on occasion for several years now, and I’ve recently undertaken a complete rewrite of the Julia wrapper, which is nearly done.

Anyway, I find that a pattern which drastically simplifies the engineering side of training on large datasets is to perform online training of a model on one server, possibly with other machines prepping the data. I’ve come up with some hacks to achieve this in principle with xgboost, but I’m rather dubious about everything I’ve tried so far.
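
The general shape of what I mean is something like the following (Python API purely for illustration, since the Julia wrapper is mid-rewrite; `batch_source()` is a placeholder for whatever the data-prep machines hand over):

```python
import xgboost as xgb

params = {"tree_method": "hist", "max_depth": 6, "eta": 0.1}
booster = None

# batch_source() is hypothetical: it yields (X, y) batches as the
# data-prep machines finish them.
for X_batch, y_batch in batch_source():
    dbatch = xgb.DMatrix(X_batch, label=y_batch)
    # Passing the previous booster via xgb_model continues training,
    # appending new trees instead of starting from scratch.
    booster = xgb.train(params, dbatch, num_boost_round=10, xgb_model=booster)
```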

I have a few questions which could perhaps also help improve the documentation on this subject:

  • It seems that all the external-memory data iteration does is stash the data on disk before training begins… I might be missing something, because on its own that doesn’t seem particularly useful. Is there any safe way of running the data iteration concurrently with training? (It looks to me like iteration has to finish before training starts.) The iterator sketch after this list shows the interface I’m referring to.
  • I have read this issue but still feel a bit apprehensive about whether I know what I’m doing. Can someone provide more insight into what exactly can be done with the updater parameter? It isn’t discussed in the white paper (as far as I can tell). The refresh sketch after this list shows my current guess at how it’s meant to be used.
  • Yes, I realize that whatever can be hacked together for online learning will come with no guarantees of zero bias, etc., but I would personally find it hugely useful even just to be able to use xgboost for online training of random forests; I’m just not sure whether what I’ve been trying achieves that (see the random-forest sketch below).
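
For reference, the external-memory iteration I mean in the first bullet is the DataIter-style interface, sketched here with the Python API (`batches` is a placeholder for pre-prepared (X, y) pairs). As far as I can tell, the iterator is fully drained and cached when the DMatrix is constructed, before training ever starts:

```python
import os
import xgboost as xgb

class BatchIter(xgb.DataIter):
    """Feeds pre-prepared (X, y) batches to xgboost's external-memory cache."""

    def __init__(self, batches):
        self._batches = batches  # placeholder: a list of (X, y) pairs
        self._i = 0
        super().__init__(cache_prefix=os.path.join(".", "cache"))

    def next(self, input_data):
        # input_data is a callback supplied by xgboost; returning 0 signals
        # that the iterator is exhausted, 1 that it produced a batch.
        if self._i == len(self._batches):
            return 0
        X, y = self._batches[self._i]
        input_data(data=X, label=y)
        self._i += 1
        return 1

    def reset(self):
        self._i = 0

it = BatchIter(batches)   # `batches` assumed to have been prepared elsewhere
dtrain = xgb.DMatrix(it)  # the iterator is consumed and cached to disk here
booster = xgb.train({"tree_method": "hist"}, dtrain, num_boost_round=100)
```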
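
On the updater question, here is my current (possibly wrong) reading of the docs: with process_type set to "update" and the refresh updater, xgboost re-runs the existing trees over new data and refreshes their node statistics and leaf values instead of growing new trees. Something like the following, assuming `booster` and a DMatrix `dnew` of newly arrived data already exist; please correct me if I’m misreading how these are meant to be combined:

```python
import xgboost as xgb

refresh_params = {
    "process_type": "update",  # modify an existing model rather than grow a new one
    "updater": "refresh",      # recompute node statistics on the new data
    "refresh_leaf": 1,         # also update leaf values, not just internal node stats
}
booster = xgb.train(
    refresh_params,
    dnew,
    # my understanding is that the number of rounds must not exceed the
    # number of trees already in the model
    num_boost_round=booster.num_boosted_rounds(),
    xgb_model=booster,
)
```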
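
And the random-forest setup I’d like to train online is the usual num_parallel_tree configuration (again Python just for illustration; `dtrain`, `X_next`, and `y_next` are assumed to exist). Whether repeatedly continuing such a model on fresh batches amounts to a meaningful online random forest is exactly the part I’m unsure about:

```python
import xgboost as xgb

rf_params = {
    "num_parallel_tree": 100,  # grow a 100-tree forest in a single boosting round
    "subsample": 0.8,          # row subsampling per tree
    "colsample_bynode": 0.8,   # column subsampling per split
    "learning_rate": 1.0,      # no shrinkage, as for a plain random forest
    "tree_method": "hist",
}

# dtrain is a DMatrix built from the first batch (assumed to exist).
forest = xgb.train(rf_params, dtrain, num_boost_round=1)

# Continuing on a later batch appends another 100-tree forest to the model;
# this is the step whose statistical meaning I'm dubious about.
dnext = xgb.DMatrix(X_next, label=y_next)  # X_next, y_next prepared elsewhere
forest = xgb.train(rf_params, dnext, num_boost_round=1, xgb_model=forest)
```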