I have a question that I could not find answered anywhere (though I did see a reference to it, for example here). I have been reading the XGBoost code for a while, and I have noticed some dramatic changes, especially in “src/common/quantile.cu”, where the cuts are extracted in the current code. The previous snapshot I was reading was from March.
To be more specific, the function responsible for extracting cuts is now “PruneImpl”, whereas back then it was “ExtractCuts” (for the unweighted case). “PruneImpl” seems to check whether the column is categorical and, if it is, puts the “SketchEntry” directly into “out_cuts”.
Now my question is: is it still the case that categorical columns need to be one-hot encoded, or does this new “is_cat” boolean take care of categorical columns?
If the latter is the case, how is pruning done when the number of cuts has to be reduced to 256 (that is, when there are more than 256 categories)? I ask because, if categorical features have no order, how are 257+ categories distilled down to 256?
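To make the concern concrete, here is a toy Python sketch of what I mean. It is not XGBoost's code: `numeric_cuts` and `categorical_cuts` are hypothetical helpers I made up, and the rank-based pruning is only my simplified stand-in for the real quantile sketch. It just shows that an ordered feature can be thinned to a fixed budget, while a naive "one cut per category" rule cannot.

```python
def numeric_cuts(values, max_bin):
    """Return at most max_bin thresholds for an ordered (numerical) feature.

    Simplified stand-in for quantile pruning: if there are too many distinct
    values, keep only those at evenly spaced ranks of the sorted values.
    """
    ordered = sorted(set(values))
    if len(ordered) <= max_bin:
        return ordered
    step = len(ordered) / max_bin
    return [ordered[int(i * step)] for i in range(1, max_bin)]


def categorical_cuts(codes):
    """Hypothetical rule: every distinct category becomes its own cut.

    Nothing is pruned, because dropping or merging categories by rank
    would assume an order that categorical values do not have.
    """
    return sorted(set(codes))


# A numerical feature with 1000 distinct values fits within the 256 budget.
print(len(numeric_cuts(range(1000), 256)))

# A categorical feature with 300 categories does not.
print(len(categorical_cuts(range(300))))
```

With a budget of 256, the numerical feature ends up with fewer cuts than the budget, but the categorical one keeps all 300, which is exactly the case I am asking about.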
Also, as a side note: if categorical features are treated differently (rather than being one-hot encoded and then handled like any other feature), does that mean the split rule changes? Instead of each node holding a (feature, value) pair that sends a data point to the left child if x[feature] < value and to the right child if x[feature] >= value, would a node on a categorical feature send the point left if, for example, x[categ_feature] == value and right otherwise?
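In code, the two routing rules I am contrasting would look roughly like this. Again, this is only a sketch of my question, not how XGBoost implements it; the function names and the example row are hypothetical, and the one-vs-rest equality test is just the rule I am guessing at.

```python
def goes_left_numeric(x, feature, value):
    """Numerical split as I understand it: left if x[feature] < value."""
    return x[feature] < value


def goes_left_categorical(x, feature, value):
    """Guessed categorical split: left if x[feature] == value, right otherwise."""
    return x[feature] == value


# Hypothetical data point: one numerical feature, one integer-coded category.
row = {"age": 31.0, "color": 2}

print(goes_left_numeric(row, "age", 40.0))      # 31.0 < 40.0, so left
print(goes_left_categorical(row, "color", 2))   # category matches, so left
print(goes_left_categorical(row, "color", 5))   # category differs, so right
```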