Move trained xgboost classifier from PySpark EMR notebook to S3

bennicholl · December 6, 2019, 5:45pm

I built a trained classifier in an AWS EMR notebook

bst = xgb.train(param, dtrain)
bst.save_model('s3://###-data-science/trained_xgBoost')

When I try to save this model to S3, I get the error:

[17:44:03] /workspace/dmlc-core/src/io.cc:57: Please compile with DMLC_USE_S3=1 to use S3
Stack trace:

Is there any way I can upload a trained model inside of an EMR notebook to S3?

hcho3 · December 6, 2019, 6:39pm

I think you can save the model in a local drive first and then use boto3 to upload it to S3.
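A minimal sketch of that two-step approach, assuming the notebook can write to /tmp and that boto3 can find credentials. The helper name `upload_model` and the bucket name `my-bucket` are made up for illustration; `upload_file` is the regular boto3 S3 client method.

```python
import os


def upload_model(local_path, bucket, key, s3_client=None):
    """Upload an already-saved model file to S3 and return its S3 URI."""
    if not os.path.isfile(local_path):
        # Fail early if save_model() did not write where we expected.
        raise FileNotFoundError(local_path)
    if s3_client is None:
        import boto3  # imported lazily; only needed when no client is passed in
        s3_client = boto3.client("s3")
    # boto3's upload_file takes (Filename, Bucket, Key).
    s3_client.upload_file(local_path, bucket, key)
    return f"s3://{bucket}/{key}"


# Usage in the notebook (bst is the trained Booster from the question;
# "my-bucket" is a placeholder bucket name):
#   bst.save_model("/tmp/classifier.model")
#   upload_model("/tmp/classifier.model", "my-bucket", "classifier.model")
```

Saving to a plain local path avoids the DMLC_USE_S3 code path entirely, so the stock XGBoost build should be enough.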

bennicholl · December 6, 2019, 7:29pm

If I save using

bst.save_model('/home/hadoop/myname')

and then go to load it back, I get a NoneType object.

hcho3 · December 6, 2019, 7:48pm

Are you able to locate the model file on the local disk?

bennicholl · December 6, 2019, 7:57pm

Yes, I can locate it using
loader = bst.load_model('path')
but then loader is of NoneType

Alternatively, once I have located it, I can push it to S3 using

s3_client.upload_file('/home/hadoop/###/classifier.model', "###-data-science", "classifier.model")

but when I go to download it from S3 using
s3_client.download_file('###-data-science', 'classifier.model', 'classifier.model')

I get the error:

[Errno 13] Permission denied: 'classifier.model.26B9A3Aa'
Traceback (most recent call last):

And I DO have permissions to read and write from S3

hcho3 · December 6, 2019, 8:01pm

Can you ensure that you have full read/write access to the local disk? If not, using /tmp may be a solution.
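A rough sketch of what that could look like, assuming the process can write to the system temp directory. The suffixed file name in the error looks like the temporary file boto3 writes next to the destination before renaming it, so a bare relative file name may be landing in a working directory the notebook cannot write to. The helper names `download_model` and `load_booster` are made up for illustration. Note also that `Booster.load_model` loads into the existing object and returns None, which would explain the NoneType seen earlier.

```python
import os
import tempfile


def download_model(bucket, key, s3_client=None, directory=None):
    """Download s3://bucket/key into a writable directory; return the local path."""
    if directory is None:
        directory = tempfile.gettempdir()  # usually /tmp on Linux
    local_path = os.path.join(directory, os.path.basename(key))
    if s3_client is None:
        import boto3  # imported lazily; only needed when no client is passed in
        s3_client = boto3.client("s3")
    # boto3's download_file takes (Bucket, Key, Filename).
    s3_client.download_file(bucket, key, local_path)
    return local_path


def load_booster(local_path):
    """Return a Booster populated from a saved model file."""
    import xgboost as xgb
    bst = xgb.Booster()
    bst.load_model(local_path)  # loads in place and returns None
    return bst  # return the Booster itself, not load_model's result
```

With these, something like `bst = load_booster(download_model("my-bucket", "classifier.model"))` would give back a usable Booster, where `my-bucket` is a placeholder bucket name.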

bennicholl · December 6, 2019, 8:03pm

Where would /tmp go?

hcho3 · December 6, 2019, 8:04pm

Do you have access to the local disk? As for /tmp, see https://superuser.com/questions/332610/where-is-the-temporary-directory-in-linux

bennicholl · December 6, 2019, 8:07pm

Yes, I have access to the local disk.

Anjala-ar · December 11, 2019, 11:37am

I have the exact same error. I am running the prediction with deploy mode = 'cluster'. I am guessing there is some issue with sharing resources between the master and the workers. The model runs fine (on small data) in client deploy mode.