XGBoost application on possible list data

Lu_Ste · June 6, 2023, 5:19pm

Hi Everyone!

I’m using XGBoost on a dataset that can easily be converted to a 2D table.
However, one column would be better represented as a list. It holds a feature with a variable number of entries, all describing the same row. The number of entries is not constant and the entries are not ordered, so a per-position one-hot encoding is very difficult to use (the same value is likely to end up in different columns for different rows). It is also not feasible to know all possible entries in advance, because more data will be added on the fly.

I can work around the varying number of entries. But how should I handle the varying order?

For instance:

row1 [apple, banana, orange, NaN]
row2 [orange, NaN, NaN, NaN]
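
To make the ordering problem concrete, here is a minimal sketch in plain Python. The row contents and the helper name `positional_columns` are made up for illustration; it only mimics what a one-hot encoding applied to each position separately would produce, and is not tied to any particular library.

```python
# Hypothetical rows: the same items, listed in a different order.
row_a = ["apple", "banana", "orange", None]
row_b = ["orange", "apple", "banana", None]

def positional_columns(row):
    """Return the one-hot column names a per-position encoder would switch on."""
    return {f"slot{i}_{value}" for i, value in enumerate(row) if value is not None}

cols_a = positional_columns(row_a)
cols_b = positional_columns(row_b)

# Columns switched on for both rows under the positional encoding.
shared = cols_a & cols_b

# Whether the two rows contain the same items once order is ignored.
same_items = set(row_a) == set(row_b)

print("shared positional columns:", shared)
print("same items ignoring order:", same_items)
```

Both rows contain exactly the same items, yet they share no positional one-hot column, which is the behaviour I would like to avoid.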

Do you have any suggestions? Or could you point me to another algorithm if you think XGBoost is not well suited for this task?
Also, how should I deal with “unseen” entries in this column at prediction time?