Question about libsvm file and feature names


#1

I use Python XGBoost 0.82 and train a model from a libsvm file like this:

0 1:0.08839 2:31 3:439 4:43.07175 5:12.53936 6:94 7:14 8:2 9:14.50000 10:203 11:159 12:1.55674 13:0.17610 14:28 15:67.05142 16:686.73000 17:1671.36000 18:18908.50000 19:0.08839 20:1713 21:95.20000 22:14 24:0.00227 25:0.00000 26:32 27:0.01367 28:6 29:0.00012 30:16830 31:0.00010 32:9602 33:0.00000 34:1524 35:0.00018 36:5689 37:80.28062 38:4149.94088 39:245.81713 40:0.98364 41:42 42:4 43:27.00000 44:0.16129 45:0.22437 46:28.49091 47:4.34392 48:0.09982 49:9056 50:409 51:0.04516 52:3.11340 53:0.31786 54:4.28571 55:0.09419 56:2004 57:103 58:0.05140 59:3.21429 60:0.19737 61:4.50000 62:0.10909 63:836 64:41 65:0.04904 66:3.50000 67:2.84835 68:0.06546 69:2.85714 70:0.06279 71:4.00000 72:0.09697 73:2.85938 74:0.06571 75:5.70773 76:0.13117 77:2.78571 78:0.06122 79:5.64286 80:0.12402 81:3.00000 82:0.07273 83:7.00000 84:0.16970 85:9.80000 86:16 87:2.06308 88:0.00855 89:0.72650 90:0.07942 91:56.83000 92:0.35000 93:3564.66000 94:576 95:0.00000 96:0.00000 97:37 98:0.49000 99:0.00000 100:0.01000 101:17 102:15.77193 103:0.06250 104:0.00736 105:0.87500 106:0.06919 107:1.00000 108:0.53446 109:0.87500 110:0.45791 111:20.49000 112:0.06250 113:0.06036 138:2.64800 211:6 217:23 218:1 219:1 220:0.04348 221:0.04348 222:1.00000 883:60 1082:2 1083:0 1084:0 1085:0.05882 1086:0.00000 1087:0.00000 1088:797 1089:69 1090:8 1091:0.08726 1092:0.12731 1093:29.88613 1094:14.41176 1095:139799 1096:17313 1097:2272 1098:0.12384 1099:0.13127 1100:36.54340 1101:84533 1102:11549 10002:1.00000 10013:1.00000 10015:1.00000 10019:1.00000 10021:1.00000 10024:1.00000 10028:1.00000 10036:1.00000 10037:1.00000 10042:1.00000 10044:1.00000 10051:1.00000 10054:1.00000 10057:1.00000 10060:1.00000 10063:1.00000 10066:1.00000 10069:1.00000 10087:1.00000 10099:1.00000 10102:1.00000 10105:1.00000 10107:1.00000 10146:73.00000 10147:488.00000 10148:17.00000 10149:0.14959 10150:0.03484 10151:0.10549 10152:0.02954 10153:138.00000 10154:1070.00000 10155:22.00000 10156:0.12897 10157:0.02056 10158:0.10909 10159:0.01636 
10160:18.00000 10161:115.00000 10162:4.00000 10163:0.15652 10164:0.03478 10165:0.13115 10166:0.01639 10167:34.00000 10168:228.00000 10169:4.00000 10170:0.14912 10171:0.01754 10172:0.15574 10173:0.00820 10174:2648.00000 10177:0 10178:428 10179:4.62200 10180:5.00000 10181:-2.03879 10183:1.00000 10184:24.20000 10185:238.40000 10186:0.05116 10187:0.04942 10188:8.50000 10189:61.10000 10190:0.04067 10191:0.03788 10192:16.20000 10193:152.40000 10194:0.05891 10195:0.06482 10196:0.79183 10197:0.07000 10198:0.10850 10199:0.79183 10200:0.07000 10201:0.10850 10202:14.41176 10203:69 10204:0.08726 10205:797 10206:29.88613 10207:8 10208:0.12731 10209:0 10210:0.05882 10211:2 10212:0.00000 10213:0 10214:0.00000 10215:4 10216:1.00000 10222:3 10223:8 10224:3 10226:1.00000 10229:1.00000 10232:1.00000 10240:0.00000 10241:0.00000 10242:0.00000 10243:0.00000 10244:0.00000 10245:0.00000 10246:1.00000 10247:0.00000 10248:0.00000 10249:0.00000 10250:10.00000 10251:10.00000 10252:10.00000 10253:1 10254:33 10255:34 10256:2 10257:34

The feature indices run from 1 to 10257.

My questions:

  1. num_col() shows 10258. Why? Is it a mistake? My feature indices start at 1.
  2. If I want to set feature names using xgb.DMatrix("libsvm_file_path", feature_names=XXX), do I need to pass a list of length 10258? Some indices are not used in my file; do I still need to give them names?
  3. When I use model.get_score(importance_type='gain').items(), I get "fXX" names. If I have set feature_names as above, can I get the true feature names from this function?

Thanks!


#2
  1. XGBoost uses 0-based indexing by default. To override, you can append ?indexing_mode=1 to indicate the use of 1-based indexing:
dmat = xgb.DMatrix('data.libsvm?indexing_mode=1')
  2. I think so.
  3. Yes.
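To make the last answer concrete: when feature_names is not passed to the DMatrix, the "fNN" keys from get_score() can also be mapped back to real names manually. A minimal stdlib-only sketch — rename_scores is a hypothetical helper (not part of the XGBoost API), and the score values are made-up placeholders:

```python
def rename_scores(scores, feature_names):
    """Replace keys like 'f12' with the name at that index."""
    return {feature_names[int(key[1:])]: gain for key, gain in scores.items()}

# feature_names[0] covers the implicit f0 column that 0-based parsing adds
feature_names = ["unused_f0"] + ["feat_%d" % i for i in range(1, 10258)]
scores = {"f1": 12.3, "f10257": 4.5}  # placeholder get_score() output
print(rename_scores(scores, feature_names))  # {'feat_1': 12.3, 'feat_10257': 4.5}
```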

#3

Thanks for your reply.
My question is:
Both my train and valid data have the same format as above (1-based indexing). Will my trained model work correctly without setting ?indexing_mode=1?

I want to understand what "0-based indexing by default" means. Does XGBoost's "f0" equal my "1:0.08839", or does XGBoost insert an "f0" with a default value at the front, so that "f1" equals my "1:0.08839" (hence num_col() == 10258)?


#4

XGBoost will insert an f0 column filled with missing values.
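This can be checked without XGBoost by parsing a 1-based libsvm line directly: under a 0-based reading, the column count is max index + 1, and column 0 never receives a value. A stdlib-only sketch (libsvm_columns is an illustrative helper, not an XGBoost function):

```python
def libsvm_columns(line):
    """Return (column count under 0-based reading, lowest index used)."""
    indices = [int(token.split(":")[0]) for token in line.split()[1:]]
    return max(indices) + 1, min(indices)

# Shortened stand-in for the 1-based line in post #1
ncols, lowest = libsvm_columns("0 1:0.08839 2:31 10257:34")
print(ncols, lowest)  # 10258 1 -> f0 exists but is entirely missing
```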


#5

Same problem here, but my libsvm file starts at index 0.

0 0:12 1:2 2:27 3:19 6:2499 7:1 8:12 13:1 14:2.255972623825073 15:2.3423912525177 16:2.202216148376465 17:1.657084226608276 18:1.574175834655762 19:1 20:1 21:1 22:1 23:1 24:4 25:2 26:6 27:3 28:7 29:586 30:368 31:361 32:487 33:728 35:1048 36:0.4193677604198456 37:-0.06002399325370789 38:0.1023342609405518 39:-0.1023342609405518 44:-1

There are 45 features.

dtrain = xgb.DMatrix('./feature/12_32_train_data.train#dtrain.cache', nthread=-1)

dtrain has 46 columns, and the feature names are f0 - f45.
After training the model, I plotted the feature importance; it also shows 46 features.
I'm confused.
Does the model use my label as a feature?
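For reference, the column count implied by the 0-based line above can be computed directly (a stdlib-only sketch; count_columns is an illustrative helper, not an XGBoost function):

```python
def count_columns(line):
    """Columns implied by a 0-based libsvm line: max feature index + 1."""
    indices = [int(token.split(":")[0]) for token in line.split()[1:]]
    return max(indices) + 1

# Shortened stand-in for the 0-based line above (indices 0..44)
print(count_columns("0 0:12 1:2 44:-1"))  # 45, one fewer than DMatrix reports
```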