To community users and developers,
I am using checkpointing during training with the distributed Scala Spark version, and I see two potential issues:
- I can see checkpointed models in the specified LOCAL path, with names like 100.model, 200.model, …; however, my checkpoint interval is 50. It looks like the numbers in the model names are always multiples of 2*x, where x = checkpoint interval.
- Training does not continue from the latest *.model file; it always restarts from the beginning, as if the *.model files in checkpointPath were ignored.
Version info: pre-compiled scala spark xgb version 0.81-criteo-20180821 on CDH 2.3.1
Since this version is not compiled with HDFS support, it only accepts a local file path. Everything above is based on .setCheckpointPath(localPath).
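To make the first issue easier to check, here is a small Scala sketch that compares the files found in the local checkpoint directory with the rounds one would naively expect for a given interval. It is only a diagnostic helper, not part of the xgb API. It assumes the number in each file name (e.g. 100.model) is the boosting round at which the checkpoint was written. The directory `/tmp/xgb-checkpoints` and the total of 200 rounds are hypothetical placeholders.

```scala
import java.io.File
import scala.util.Try

object CheckpointInspect {
  // Extract the number from names like "100.model"; None for any other file name.
  // Assumption: this number is the boosting round of the checkpoint.
  def roundOf(name: String): Option[Int] =
    if (name.endsWith(".model")) Try(name.stripSuffix(".model").toInt).toOption
    else None

  // Rounds found in a local checkpoint directory, sorted ascending.
  // Returns an empty sequence if the directory does not exist.
  def savedRounds(dir: File): Seq[Int] =
    Option(dir.listFiles()).map(_.toSeq).getOrElse(Seq.empty)
      .map(_.getName).flatMap(roundOf).sorted

  // Rounds one would naively expect if a checkpoint were written every `interval` rounds.
  def expectedRounds(interval: Int, numRound: Int): Seq[Int] =
    interval to numRound by interval

  def main(args: Array[String]): Unit = {
    // Hypothetical local path; pass the real checkpoint directory as the first argument.
    val dir = new File(args.headOption.getOrElse("/tmp/xgb-checkpoints"))
    val saved = savedRounds(dir)
    // interval = 50 as in this post; numRound = 200 is a placeholder.
    val expected = expectedRounds(interval = 50, numRound = 200)

    println(s"saved:    ${saved.mkString(", ")}")
    println(s"expected: ${expected.mkString(", ")}")
    println(s"missing:  ${expected.diff(saved).mkString(", ")}")
    println(s"latest:   ${saved.lastOption.map(_.toString).getOrElse("none")}")
  }
}
```

If the "missing" line lists 50 and 150 while "latest" shows a valid round, that would match both observations above: checkpoints are written only every second interval, and the latest one exists on disk but is not picked up when training restarts.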
Thanks
Yao