I ported an xgboost script from R to Python.
I expected the Python version to reproduce the predictions of the R script.
However, there is a considerable divergence between the predictions of the Python port and those of the model trained with the R script.
The workflow is simply: build a DMatrix from the training data, call train(), and then predict() on the prediction data.
The training data and the prediction data are of course identical in both cases, and the parameters are as follows.
param = {
    'booster': 'gbtree'
    , 'objective': 'multi:softmax'
    , 'eval_metric': 'merror'
    , 'gamma': 0
    , 'eta': 0.3
    , 'max_depth': 6
    , 'min_child_weight': 1
    , 'colsample_bytree': 0.9
    , 'subsample': 0.8
    , 'alpha': 1
    , 'num_class': 8
    , 'nthread': multiprocessing.cpu_count() - 1
}
The version of the xgboost package is 0.90 in both.
Is it simply impossible to get identical results from the R and Python implementations, even when they are kept as close as possible and randomness is eliminated?
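For reference, this is a minimal sketch of what I mean by eliminating randomness on the Python side (the settings below are my assumption of a deterministic configuration, not what the scripts above use). Note that np.random.seed only seeds NumPy; xgboost's own sampling is controlled by the 'seed' entry in the parameter dict, and the R package documentation notes that the seed parameter is ignored there and set.seed() should be used instead.

import multiprocessing
import xgboost as xgb

# Hypothetical deterministic configuration: disable row/column sampling and
# run single-threaded so the floating-point summation order is fixed.
det_param = {
    'booster': 'gbtree'
    , 'objective': 'multi:softmax'
    , 'eval_metric': 'merror'
    , 'gamma': 0
    , 'eta': 0.3
    , 'max_depth': 6
    , 'min_child_weight': 1
    , 'colsample_bytree': 1.0  # no column sampling
    , 'subsample': 1.0         # no row sampling
    , 'alpha': 1
    , 'num_class': 8
    , 'nthread': 1             # single thread
    , 'seed': 0                # xgboost's own RNG seed (np.random.seed does not reach it)
}
# bst = xgb.train(det_param, xd_train, num_boost_round=500)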
Postscript
Python Code
import numpy as np
import pandas as pd
import xgboost as xgb

x_train = pd.read_csv("./test/x_train.csv")
y_train = pd.read_csv("./test/y_train.csv")
np.random.seed(0)  # fix the seed
xd_train = xgb.DMatrix(
    data = x_train
    , label = y_train
)
# model parameter
# https://xgboost.readthedocs.io/en/latest/parameter.html
param = {
    'booster' : 'gbtree'
    , 'objective' : 'multi:softmax'
    , 'eval_metric' : 'merror'
    , 'gamma' : 0
    , 'eta' : 0.3
    , 'max_depth' : 6
    , 'min_child_weight' : 1
    , 'colsample_bytree' : 0.9
    , 'subsample' : 0.8
    , 'alpha' : 1
    , 'num_class' : 8
}
# XGBoost training
bst_fit_down = xgb.train(
    param,
    xd_train,
    num_boost_round = 500
)
print(xd_train.get_label())
print(xd_train.get_weight())
print(xd_train.get_base_margin())
print(xd_train.num_row())
bst_fit_down.dump_model('./test/model_P.txt', with_stats = True)
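The snippet above stops at dumping the model; the predict() step mentioned earlier looks like the following minimal sketch (predicting on the same data, as described; the variable name preds is mine).

# Predict on the same data used for training; with multi:softmax,
# predict() returns the predicted class labels directly.
preds = bst_fit_down.predict(xd_train)
print(preds[:10])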
R Code
library(data.table)
library(Matrix)
library(xgboost)

x_train <- fread("./test/x_train.csv", stringsAsFactors = F)
y_train <- fread("./test/y_train.csv", stringsAsFactors = F, header = T)
# fix the seed
set.seed(0)
# data.table matrix
dt_train <- data.table(x_train, keep.rownames = F)
## convert the features to sparse-matrix format
## (note: sparse.model.matrix(~ ., ...) adds an "(Intercept)" column by default)
smat_train <- sparse.model.matrix( ~ ., data = x_train)
xd_train <- xgb.DMatrix(
    data = smat_train
    , label = data.table(y_train, keep.rownames = F)$Y  # label data only
)
# parallel processing (confirmed that it does not affect the results)
require(doParallel)
registerDoParallel(detectCores() - 1)
# model parameter
param <- list(
    booster = "gbtree"
    , objective = "multi:softmax"
    , eval_metric = "merror"
    , gamma = 0
    , eta = 0.3
    , max_depth = 6
    , min_child_weight = 1
    , colsample_bytree = 0.9
    , subsample = 0.8
    , alpha = 1
    , num_class = 8
)
# XGBoost training (confirmed that xgb.train and xgboost give the same result)
bst_fit_down <- xgboost(
    data = xd_train,
    nrounds = 500,  # model.cv$best_iteration would give the best number of rounds
    params = param
)
# bst_fit_down <- xgb.train(data = xd_train,
#                           nrounds = 500,
#                           params = param)
print(getinfo(xd_train, "label"))
print(getinfo(xd_train, "weight"))
print(getinfo(xd_train, "base_margin"))
print(getinfo(xd_train, "nrow"))
xgb.dump(bst_fit_down, fname = "./test/model_R.txt"
         , dump_format = "text"
         , with_stats = T)
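Since both scripts dump the trained trees, a quick way to see where the two models first diverge is to compare the dump files directly. A minimal sketch (Python, using the file paths from the scripts above):

# Compare the two model dumps line by line and report the first difference.
with open("./test/model_P.txt") as f_py, open("./test/model_R.txt") as f_r:
    for i, (line_py, line_r) in enumerate(zip(f_py, f_r), start=1):
        if line_py != line_r:
            print(f"first divergence at line {i}:")
            print("  Python:", line_py.rstrip())
            print("  R:     ", line_r.rstrip())
            break
    else:
        print("dumps are identical line by line")

Keep in mind that the two dumps may label features differently even for identical trees, since the R matrix comes from sparse.model.matrix while the Python DMatrix is built from the data frame directly.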