Move trained XGBoost classifier from PySpark EMR notebook to S3


#1

I built a trained classifier in an AWS EMR notebook

bst = xgb.train(param, dtrain)
bst.save_model('s3://###-data-science/trained_xgBoost')

When I try to save this model to S3, I get the error:

[17:44:03] /workspace/dmlc-core/src/io.cc:57: Please compile with DMLC_USE_S3=1 to use S3
Stack trace:

Is there any way I can upload a trained model inside of an EMR notebook to S3?


#2

I think you can save the model in a local drive first and then use boto3 to upload it to S3.
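
A minimal sketch of that suggestion, assuming boto3 is installed and the notebook's role can write to the bucket. The helper name `save_and_upload` and the bucket/key values are placeholders, not anything from the original post:

```python
import os
import tempfile


def save_and_upload(booster, s3_client, bucket, key):
    """Save `booster` to a local temp file, then upload that file to S3."""
    # Hypothetical helper: `bucket` and `key` are supplied by the caller.
    with tempfile.TemporaryDirectory() as tmp_dir:
        local_path = os.path.join(tmp_dir, os.path.basename(key))
        booster.save_model(local_path)                  # write to local disk first
        s3_client.upload_file(local_path, bucket, key)  # then copy the file to S3
    return key


# Usage in the notebook (placeholder bucket name):
#   import boto3
#   save_and_upload(bst, boto3.client("s3"), "<your-bucket>", "classifier.model")
```

Passing the client in as an argument keeps the helper easy to try out without touching S3.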


#3

If I save using

bst.save_model('/home/hadoop/myname')

and then try to load it back, I get a NoneType object.


#4

Are you able to locate the model file on the local disk?


#5

Yes, I can locate it and load it using
loader = bst.load_model('path')
but then loader is of NoneType.

Alternatively, I can push it into S3 using

s3_client.upload_file('home/hadoop/###//classifier.model', "###-data-science", "classifier.model")

but when I go to download it from S3 using
s3_client.download_file('###-data-science', 'classifier.model', 'classifier.model')

I get the error:

[Errno 13] Permission denied: 'classifier.model.26B9A3Aa'
Traceback (most recent call last):

And I DO have permissions to read from and write to S3.
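
On the NoneType result: `Booster.load_model` loads the file into the object it is called on and returns `None`, so assigning its return value always gives `None`. A small sketch of the usual pattern, assuming xgboost is installed; the helper name `load_booster` is made up for illustration:

```python
def load_booster(path):
    """Return a Booster with the model stored at `path` loaded into it."""
    import xgboost as xgb  # assumes xgboost is installed where this runs

    bst = xgb.Booster()
    bst.load_model(path)  # loads in place; the call itself returns None
    return bst


# Usage (placeholder path):
#   bst = load_booster("/home/hadoop/myname")
#   preds = bst.predict(dtest)
```

So the loaded model is the `Booster` object itself, not the value returned by `load_model`.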


#6

Can you ensure that you have full read/write access to the local disk? If not, using /tmp may be a solution.
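
The random suffix in the error suggests that the download first writes a temporary file next to the destination, which here is the current working directory. A rough sketch of downloading into the system temp directory instead, assuming boto3 and a writable temp directory; `download_to_tmp` and its arguments are hypothetical names:

```python
import os
import tempfile


def download_to_tmp(s3_client, bucket, key):
    """Download s3://bucket/key into the system temp directory; return the local path."""
    # tempfile.gettempdir() is typically /tmp on Linux.
    local_path = os.path.join(tempfile.gettempdir(), os.path.basename(key))
    s3_client.download_file(bucket, key, local_path)
    return local_path


# Usage (placeholder bucket name):
#   import boto3
#   path = download_to_tmp(boto3.client("s3"), "<your-bucket>", "classifier.model")
```

If this works while the original call fails, the problem is local write permissions rather than S3 permissions.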


#7

Where would /tmp go?


#8

Do you have access to the local disk? As for /tmp, see https://superuser.com/questions/332610/where-is-the-temporary-directory-in-linux


#9

Yes, I have access to the local disk.


#10

I have the exact same error. I am running the prediction with deploy mode = 'cluster'. My guess is that there is an issue with sharing resources between the master and the workers. The model runs fine (on small data) in client deploy mode.