Inconsistent feature_names in Python API

SpinOneThird · July 15, 2018, 10:04pm

Hello,

When using the Python API, the way feature names behave is wrong or inconsistent depending on how a DMatrix was created. This is bothersome as it makes it difficult to mix and match DMatrices created with different methods in train/test/predict.

The issues I see are:

1. When slicing a DMatrix, the feature_names are lost and get set to f0, f1, f2, …
2. When creating a DMatrix from a numpy array, passing in a list of feature names when creating the DMatrix has no effect.
3. When creating a DMatrix from a libsvm file (whether using external memory or not), the feature_names have to include the name of the label in addition to the features. Whereas when using pandas or numpy you don’t.

To reproduce:

import xgboost as xg
import pandas as pd
import numpy as np

from pathlib import Path
import sys, os

print('XGBoost version:', xg.__version__)
print('Python version:', sys.version)
print()

df = pd.DataFrame(np.arange(12).reshape((4,3)), columns=['a', 'b', 'c'])
m_df = xg.DMatrix(df)

print('DataFrame: ', m_df.feature_names)
print('DataFrame & slice: ', m_df.slice([0, 1]).feature_names)

m_np = xg.DMatrix(df.values)
print('np: ', m_np.feature_names)

m_np_set_features = xg.DMatrix(df.values, feature_names=['a', 'b', 'c'])
print('np & feature_names: ', m_np.feature_names)

for p in Path('.').glob('m.libsvm*'):
    os.remove(p)
with open('m.libsvm', 'w') as f:
    f.write("""\
0 1:1 2:2 3:3
0 1:4 2:5 3:6
0 1:7 2:8 3:9
    """)
m_libsvm = xg.DMatrix('m.libsvm')
print('libsvm:', m_libsvm.feature_names)

# Throws: it expects the features to include the label
# m_libsvm_set_feature = xg.DMatrix('m.libsvm', feature_names=['a', 'b', 'c'])

m_libsvm_set_feature = xg.DMatrix('m.libsvm', feature_names=['label', 'a', 'b', 'c'])
print('libsvm & feature_names:', m_libsvm_set_feature.feature_names)

print('libsvm & feature_names & slice', m_libsvm_set_feature.slice([0, 1]).feature_names)

m_ext_mem = xg.DMatrix('m.libsvm#m.cache', feature_names=['label', 'a', 'b', 'c'])
print('Ext_mem & feature_name:', m_ext_mem.feature_names)
print('Ext_mem & feature_name: & slice', m_ext_mem.slice([0, 1]).feature_names)

Output:

XGBoost version: 0.72.1
Python version: 3.6.5 | packaged by conda-forge | (default, Apr  6 2018, 13:44:09) 
[GCC 4.2.1 Compatible Apple LLVM 6.1.0 (clang-602.0.53)]

DataFrame:  ['a', 'b', 'c']
DataFrame & slice:  ['f0', 'f1', 'f2']
np:  ['f0', 'f1', 'f2']
np & feature_names:  ['f0', 'f1', 'f2']
libsvm: ['f0', 'f1', 'f2', 'f3']
libsvm & feature_names: ['label', 'a', 'b', 'c']
libsvm & feature_names & slice ['f0', 'f1', 'f2', 'f3']
Ext_mem & feature_name: ['label', 'a', 'b', 'c']
Ext_mem & feature_name: & slice ['f0', 'f1', 'f2', 'f3']

hcho3 · July 27, 2018, 11:23pm

Thanks for your report. We will take a look at it when we get a chance. For now, you can set validate_features=False when calling predict() in order to avoid issues with feature names.

hcho3 · October 7, 2018, 9:47am

@SpinOneThird I am working on a pull request to fix the bug. The first item is really a bug, so it will be fixed. As for the second item:

When creating a DMatrix from a numpy array, passing in a list of feature names when creating the DMatrix has no effect

You made a typo in the example script. The lines should have been

m_np_set_features = xg.DMatrix(df.values, feature_names=['a', 'b', 'c'])
print('np & feature_names: ', m_np_set_features.feature_names)
# prints ['a', 'b', 'c']

The third item

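When creating a DMatrix from a libsvm file (whether using external memory or not), the feature_names have to include the name of the label in addition to the features. Whereas when using pandas or numpy you don’t.

is expected behavior (NOT a bug) because XGBoost uses 0-based indexing for LIBSVM files. Your file uses indices 1 to 3, so XGBoost sees four feature columns (0 through 3), and the extra name you had to supply belongs to the unused column 0, not to the label. So your example should be changed to: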

with open('m.libsvm', 'w') as f:
    # feature index starts with 0
    f.write("""\
0 0:1 1:2 2:3
0 0:4 1:5 2:6
0 0:7 1:8 2:9
    """)
m_libsvm = xg.DMatrix('m.libsvm')
print('libsvm:', m_libsvm.feature_names)

m_libsvm_set_feature = xg.DMatrix('m.libsvm', feature_names=['a', 'b', 'c'])
print('libsvm & feature_names:', m_libsvm_set_feature.feature_names)
print('libsvm & feature_names & slice', m_libsvm_set_feature.slice([0, 1]).feature_names)

m_ext_mem = xg.DMatrix('m.libsvm#m.libsvm.cache', feature_names=['a', 'b', 'c'])
print('Ext_mem & feature_name:', m_ext_mem.feature_names)
print('Ext_mem & feature_name: & slice', m_ext_mem.slice([0, 1]).feature_names)
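
To make the column count concrete, here is a rough sketch in plain Python. It is not XGBoost's actual parser. It assumes the number of columns is the largest feature index plus one, which is consistent with the four names (f0 to f3) in the output of the original script; `infer_num_columns` is a made-up helper name.

```python
def infer_num_columns(libsvm_text):
    """Return (largest feature index + 1) over all rows of LIBSVM-style text.

    Assumption for illustration only: the column count is derived from the
    largest index seen, with indices counted from 0.
    """
    max_index = -1
    for line in libsvm_text.splitlines():
        tokens = line.split()
        # tokens[0] is the label; the remaining tokens are "index:value" pairs.
        for token in tokens[1:]:
            index = int(token.split(":", 1)[0])
            max_index = max(max_index, index)
    return max_index + 1


one_based = "0 1:1 2:2 3:3\n0 1:4 2:5 3:6\n0 1:7 2:8 3:9\n"
zero_based = "0 0:1 1:2 2:3\n0 0:4 1:5 2:6\n0 0:7 1:8 2:9\n"

print(infer_num_columns(one_based))   # 4 -> four feature names needed
print(infer_num_columns(zero_based))  # 3 -> three feature names needed
```

In both cases the label is the first token on each row and never takes up a feature column.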

hcho3 · October 7, 2018, 9:52am

https://github.com/dmlc/xgboost/pull/3766 fixes the first item.

SpinOneThird · October 9, 2018, 1:06pm

Thanks!

The examples of LIBSVM format I had seen were 1-based, which threw me off. As it is a sparse format I imagine it does not make much difference in most cases.
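
For anyone else starting from a 1-based file, a small sketch of one way to shift the indices down before loading. It is plain Python, and `shift_to_zero_based` is a hypothetical helper, not part of XGBoost. It assumes every token after the label is a plain `index:value` pair (no `qid:` fields or comments).

```python
def shift_to_zero_based(libsvm_text):
    """Rewrite 1-based LIBSVM-style text so that feature indices start at 0.

    Assumes each non-empty row is "<label> <index>:<value> ..." with integer
    indices >= 1; blank rows are dropped.
    """
    converted = []
    for line in libsvm_text.splitlines():
        tokens = line.split()
        if not tokens:
            continue  # skip blank or whitespace-only rows
        shifted = [tokens[0]]  # keep the label unchanged
        for pair in tokens[1:]:
            index, value = pair.split(":", 1)
            if int(index) < 1:
                raise ValueError("expected 1-based indices, got " + index)
            shifted.append(f"{int(index) - 1}:{value}")
        converted.append(" ".join(shifted))
    return "".join(row + "\n" for row in converted)


one_based = "0 1:1 2:2 3:3\n0 1:4 2:5 3:6\n0 1:7 2:8 3:9\n"
print(shift_to_zero_based(one_based), end="")
# 0 0:1 1:2 2:3
# 0 0:4 1:5 2:6
# 0 0:7 1:8 2:9
```

The converted text can be written to a new file and loaded with three feature names, as in the corrected example above.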