Cannot compile with NCCL on Ubuntu 16.04

I’m trying to compile from the latest master on a Google Compute Engine VM instance with 4 Tesla P100 GPUs. I can compile just fine with NCCL turned off, but the build fails at link time when I compile with it turned on.

Here’s the output from running cmake .. -DUSE_CUDA=ON -DUSE_NCCL=ON:

-- The C compiler identification is GNU 5.4.0
-- The CXX compiler identification is GNU 5.4.0
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Try OpenMP C flag = [-fopenmp]
-- Performing Test OpenMP_FLAG_DETECTED
-- Performing Test OpenMP_FLAG_DETECTED - Success
-- Try OpenMP CXX flag = [-fopenmp]
-- Performing Test OpenMP_FLAG_DETECTED
-- Performing Test OpenMP_FLAG_DETECTED - Success
-- Found OpenMP: -fopenmp  
-- Setting build type to 'Release' as none was specified.
-- Performing Test SUPPORT_CXX11
-- Performing Test SUPPORT_CXX11 - Success
-- Performing Test SUPPORT_CXX0X
-- Performing Test SUPPORT_CXX0X - Success
-- Performing Test SUPPORT_MSSE2
-- Performing Test SUPPORT_MSSE2 - Success
-- Could NOT find GTest (missing:  GTEST_LIBRARY GTEST_INCLUDE_DIR GTEST_MAIN_LIBRARY) 
CMake Warning at dmlc-core/test/unittest/CMakeLists.txt:37 (message):
  Google Test not found


-- Looking for pthread.h
-- Looking for pthread.h - found
-- Looking for pthread_create
-- Looking for pthread_create - not found
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE  
-- Found CUDA: /usr/local/cuda (found suitable version "9.0", minimum required is "8.0") 
-- Found Nccl: /usr/include  
cuda architecture flags: -gencode arch=compute_35,code=sm_35;-gencode arch=compute_50,code=sm_50;-gencode arch=compute_52,code=sm_52;-gencode arch=compute_60,code=sm_60;-gencode arch=compute_61,code=sm_61;-gencode arch=compute_70,code=sm_70;-gencode arch=compute_70,code=compute_70;
-- Configuring done
-- Generating done
-- Build files have been written to: /home/jmarkow/dev/xgboost/build

And then the output from make -j4:

[  1%] Building NVCC (Device) object CMakeFiles/gpuxgboost.dir/src/predictor/gpuxgboost_generated_gpu_predictor.cu.o
Scanning dependencies of target rabit
Scanning dependencies of target dmlc
[  2%] Building CXX object CMakeFiles/rabit.dir/rabit/src/allreduce_base.cc.o
[  4%] Building CXX object dmlc-core/CMakeFiles/dmlc.dir/src/data.cc.o
Scanning dependencies of target objxgboost
[  5%] Building CXX object CMakeFiles/objxgboost.dir/src/gbm/gbtree.cc.o
[  7%] Building CXX object CMakeFiles/rabit.dir/rabit/src/allreduce_robust.cc.o
[  8%] Building CXX object CMakeFiles/rabit.dir/rabit/src/engine.cc.o
[ 10%] Building CXX object CMakeFiles/rabit.dir/rabit/src/c_api.cc.o
[ 11%] Building CXX object CMakeFiles/objxgboost.dir/src/gbm/gbm.cc.o
[ 13%] Linking CXX static library librabit.a
[ 13%] Built target rabit
[ 14%] Building CXX object dmlc-core/CMakeFiles/dmlc.dir/src/io.cc.o
[ 15%] Building CXX object CMakeFiles/objxgboost.dir/src/gbm/gblinear.cc.o
[ 17%] Building CXX object dmlc-core/CMakeFiles/dmlc.dir/src/recordio.cc.o
[ 18%] Building CXX object dmlc-core/CMakeFiles/dmlc.dir/src/config.cc.o
[ 20%] Building CXX object dmlc-core/CMakeFiles/dmlc.dir/src/io/line_split.cc.o
[ 21%] Building CXX object CMakeFiles/objxgboost.dir/src/logging.cc.o
[ 23%] Building CXX object dmlc-core/CMakeFiles/dmlc.dir/src/io/recordio_split.cc.o
[ 24%] Building CXX object dmlc-core/CMakeFiles/dmlc.dir/src/io/indexed_recordio_split.cc.o
[ 26%] Building CXX object CMakeFiles/objxgboost.dir/src/objective/rank_obj.cc.o
[ 27%] Building CXX object dmlc-core/CMakeFiles/dmlc.dir/src/io/input_split_base.cc.o
[ 28%] Building CXX object dmlc-core/CMakeFiles/dmlc.dir/src/io/filesys.cc.o
[ 30%] Building CXX object dmlc-core/CMakeFiles/dmlc.dir/src/io/local_filesys.cc.o
[ 31%] Building NVCC (Device) object CMakeFiles/gpuxgboost.dir/src/objective/gpuxgboost_generated_regression_obj_gpu.cu.o
[ 33%] Building CXX object CMakeFiles/objxgboost.dir/src/objective/multiclass_obj.cc.o
[ 34%] Linking CXX static library libdmlc.a
[ 34%] Built target dmlc
[ 36%] Building NVCC (Device) object CMakeFiles/gpuxgboost.dir/src/tree/gpuxgboost_generated_updater_gpu.cu.o
[ 37%] Building CXX object CMakeFiles/objxgboost.dir/src/objective/regression_obj.cc.o
/home/jmarkow/dev/xgboost/src/tree/updater_gpu.cu(112): warning: function "__shfl_up(int, unsigned int, int)"
/usr/local/cuda/include/sm_30_intrinsics.hpp(175): here was declared deprecated ("__shfl_up() is not valid on compute_70 and above, and should be replaced with __shfl_up_sync().To continue using __shfl_up(), specify virtual architecture compute_60 when targeting sm_70 and above, for example, using the pair of compiler options: -arch=compute_60 -code=sm_70.")
          detected during instantiation of "void xgboost::tree::reduceScanByKey(xgboost::GradientPair *, xgboost::GradientPair *, const xgboost::GradientPair *, const int *, const xgboost::tree::NodeIdT *, int, int, int, xgboost::GradientPair *, int *, const int *, xgboost::tree::NodeIdT) [with BLKDIM_L1L3=256, BLKDIM_L2=512]" 
(615): here

[ 39%] Building CXX object CMakeFiles/objxgboost.dir/src/objective/hinge.cc.o
[ 40%] Building CXX object CMakeFiles/objxgboost.dir/src/objective/objective.cc.o
[ 42%] Building CXX object CMakeFiles/objxgboost.dir/src/learner.cc.o
/home/jmarkow/dev/xgboost/src/tree/updater_gpu.cu(112): warning: function "__shfl_up(int, unsigned int, int)"
/usr/local/cuda/include/sm_30_intrinsics.hpp(175): here was declared deprecated ("__shfl_up() is deprecated in favor of __shfl_up_sync() and may be removed in a future release (Use -Wno-deprecated-declarations to suppress this warning).")
          detected during instantiation of "void xgboost::tree::reduceScanByKey(xgboost::GradientPair *, xgboost::GradientPair *, const xgboost::GradientPair *, const int *, const xgboost::tree::NodeIdT *, int, int, int, xgboost::GradientPair *, int *, const int *, xgboost::tree::NodeIdT) [with BLKDIM_L1L3=256, BLKDIM_L2=512]" 
(615): here

[ 43%] Building CXX object CMakeFiles/objxgboost.dir/src/metric/metric.cc.o
/home/jmarkow/dev/xgboost/src/tree/updater_gpu.cu(112): warning: function "__shfl_up(int, unsigned int, int)"
/usr/local/cuda/include/sm_30_intrinsics.hpp(175): here was declared deprecated ("__shfl_up() is deprecated in favor of __shfl_up_sync() and may be removed in a future release (Use -Wno-deprecated-declarations to suppress this warning).")
          detected during instantiation of "void xgboost::tree::reduceScanByKey(xgboost::GradientPair *, xgboost::GradientPair *, const xgboost::GradientPair *, const int *, const xgboost::tree::NodeIdT *, int, int, int, xgboost::GradientPair *, int *, const int *, xgboost::tree::NodeIdT) [with BLKDIM_L1L3=256, BLKDIM_L2=512]" 
(615): here

[ 44%] Building CXX object CMakeFiles/objxgboost.dir/src/metric/rank_metric.cc.o
[ 46%] Building NVCC (Device) object CMakeFiles/gpuxgboost.dir/src/tree/gpuxgboost_generated_updater_gpu_hist.cu.o
/home/jmarkow/dev/xgboost/src/tree/updater_gpu.cu(112): warning: function "__shfl_up(int, unsigned int, int)"
/usr/local/cuda/include/sm_30_intrinsics.hpp(175): here was declared deprecated ("__shfl_up() is deprecated in favor of __shfl_up_sync() and may be removed in a future release (Use -Wno-deprecated-declarations to suppress this warning).")
          detected during instantiation of "void xgboost::tree::reduceScanByKey(xgboost::GradientPair *, xgboost::GradientPair *, const xgboost::GradientPair *, const int *, const xgboost::tree::NodeIdT *, int, int, int, xgboost::GradientPair *, int *, const int *, xgboost::tree::NodeIdT) [with BLKDIM_L1L3=256, BLKDIM_L2=512]" 
(615): here

[ 47%] Building CXX object CMakeFiles/objxgboost.dir/src/metric/elementwise_metric.cc.o
/home/jmarkow/dev/xgboost/src/tree/updater_gpu_hist.cu(628): warning: function "__ballot"
/usr/local/cuda/include/sm_20_intrinsics.h(407): here was declared deprecated ("__ballot() is not valid on compute_70 and above, and should be replaced with __ballot_sync().To continue using __ballot(), specify virtual architecture compute_60 when targeting sm_70 and above, for example, using the pair of compiler options: -arch=compute_60 -code=sm_70.")

[ 49%] Building NVCC (Device) object CMakeFiles/gpuxgboost.dir/src/linear/gpuxgboost_generated_updater_gpu_coordinate.cu.o
[ 50%] Building CXX object CMakeFiles/objxgboost.dir/src/metric/multiclass_metric.cc.o
[ 52%] Building CXX object CMakeFiles/objxgboost.dir/src/tree/updater_skmaker.cc.o
/home/jmarkow/dev/xgboost/src/tree/updater_gpu.cu(112): warning: function "__shfl_up(int, unsigned int, int)"
/usr/local/cuda/include/sm_30_intrinsics.hpp(175): here was declared deprecated ("__shfl_up() is deprecated in favor of __shfl_up_sync() and may be removed in a future release (Use -Wno-deprecated-declarations to suppress this warning).")
          detected during instantiation of "void xgboost::tree::reduceScanByKey(xgboost::GradientPair *, xgboost::GradientPair *, const xgboost::GradientPair *, const int *, const xgboost::tree::NodeIdT *, int, int, int, xgboost::GradientPair *, int *, const int *, xgboost::tree::NodeIdT) [with BLKDIM_L1L3=256, BLKDIM_L2=512]" 
(615): here

/home/jmarkow/dev/xgboost/src/tree/updater_gpu_hist.cu(628): warning: function "__ballot"
/usr/local/cuda/include/sm_20_intrinsics.h(407): here was declared deprecated ("__ballot() is deprecated in favor of __ballot_sync() and may be removed in a future release (Use -Wno-deprecated-declarations to suppress this warning).")

[ 53%] Building CXX object CMakeFiles/objxgboost.dir/src/tree/updater_colmaker.cc.o
/home/jmarkow/dev/xgboost/src/tree/updater_gpu.cu(112): warning: function "__shfl_up(int, unsigned int, int)"
/usr/local/cuda/include/sm_30_intrinsics.hpp(175): here was declared deprecated ("__shfl_up() is deprecated in favor of __shfl_up_sync() and may be removed in a future release (Use -Wno-deprecated-declarations to suppress this warning).")
          detected during instantiation of "void xgboost::tree::reduceScanByKey(xgboost::GradientPair *, xgboost::GradientPair *, const xgboost::GradientPair *, const int *, const xgboost::tree::NodeIdT *, int, int, int, xgboost::GradientPair *, int *, const int *, xgboost::tree::NodeIdT) [with BLKDIM_L1L3=256, BLKDIM_L2=512]" 
(615): here

/home/jmarkow/dev/xgboost/src/tree/updater_gpu_hist.cu(628): warning: function "__ballot"
/usr/local/cuda/include/sm_20_intrinsics.h(407): here was declared deprecated ("__ballot() is deprecated in favor of __ballot_sync() and may be removed in a future release (Use -Wno-deprecated-declarations to suppress this warning).")

[ 55%] Building CXX object CMakeFiles/objxgboost.dir/src/tree/tree_updater.cc.o
[ 56%] Building CXX object CMakeFiles/objxgboost.dir/src/tree/updater_sync.cc.o
[ 57%] Building CXX object CMakeFiles/objxgboost.dir/src/tree/updater_histmaker.cc.o
/home/jmarkow/dev/xgboost/src/tree/updater_gpu_hist.cu(628): warning: function "__ballot"
/usr/local/cuda/include/sm_20_intrinsics.h(407): here was declared deprecated ("__ballot() is deprecated in favor of __ballot_sync() and may be removed in a future release (Use -Wno-deprecated-declarations to suppress this warning).")

[ 59%] Building CXX object CMakeFiles/objxgboost.dir/src/tree/updater_fast_hist.cc.o
[ 60%] Building NVCC (Device) object CMakeFiles/gpuxgboost.dir/src/common/gpuxgboost_generated_host_device_vector.cu.o
[ 62%] Building CXX object CMakeFiles/objxgboost.dir/src/tree/updater_prune.cc.o
/home/jmarkow/dev/xgboost/src/tree/updater_gpu_hist.cu(628): warning: function "__ballot"
/usr/local/cuda/include/sm_20_intrinsics.h(407): here was declared deprecated ("__ballot() is deprecated in favor of __ballot_sync() and may be removed in a future release (Use -Wno-deprecated-declarations to suppress this warning).")

[ 63%] Building CXX object CMakeFiles/objxgboost.dir/src/tree/split_evaluator.cc.o
[ 65%] Building CXX object CMakeFiles/objxgboost.dir/src/tree/updater_refresh.cc.o
[ 66%] Building CXX object CMakeFiles/objxgboost.dir/src/tree/tree_model.cc.o
/home/jmarkow/dev/xgboost/src/tree/updater_gpu_hist.cu(628): warning: function "__ballot"
/usr/local/cuda/include/sm_20_intrinsics.h(407): here was declared deprecated ("__ballot() is deprecated in favor of __ballot_sync() and may be removed in a future release (Use -Wno-deprecated-declarations to suppress this warning).")

[ 68%] Building CXX object CMakeFiles/objxgboost.dir/src/linear/linear_updater.cc.o
[ 69%] Building CXX object CMakeFiles/objxgboost.dir/src/linear/updater_shotgun.cc.o
[ 71%] Building CXX object CMakeFiles/objxgboost.dir/src/linear/updater_coordinate.cc.o
[ 72%] Building CXX object CMakeFiles/objxgboost.dir/src/common/common.cc.o
[ 73%] Building CXX object CMakeFiles/objxgboost.dir/src/common/host_device_vector.cc.o
[ 75%] Building CXX object CMakeFiles/objxgboost.dir/src/common/hist_util.cc.o
[ 76%] Building CXX object CMakeFiles/objxgboost.dir/src/data/sparse_page_source.cc.o
[ 78%] Building CXX object CMakeFiles/objxgboost.dir/src/data/sparse_page_raw_format.cc.o
[ 79%] Building CXX object CMakeFiles/objxgboost.dir/src/data/simple_csr_source.cc.o
[ 81%] Building CXX object CMakeFiles/objxgboost.dir/src/data/sparse_page_writer.cc.o
[ 82%] Building CXX object CMakeFiles/objxgboost.dir/src/data/simple_dmatrix.cc.o
[ 84%] Building CXX object CMakeFiles/objxgboost.dir/src/data/data.cc.o
[ 85%] Building CXX object CMakeFiles/objxgboost.dir/src/data/sparse_page_dmatrix.cc.o
[ 86%] Building CXX object CMakeFiles/objxgboost.dir/src/predictor/cpu_predictor.cc.o
[ 88%] Building CXX object CMakeFiles/objxgboost.dir/src/predictor/predictor.cc.o
[ 89%] Building CXX object CMakeFiles/objxgboost.dir/src/c_api/c_api_error.cc.o
[ 91%] Building CXX object CMakeFiles/objxgboost.dir/src/c_api/c_api.cc.o
[ 92%] Building NVCC (Device) object CMakeFiles/gpuxgboost.dir/src/common/gpuxgboost_generated_common.cu.o
[ 94%] Building NVCC (Device) object CMakeFiles/gpuxgboost.dir/src/common/gpuxgboost_generated_hist_util.cu.o
[ 94%] Built target objxgboost
Scanning dependencies of target gpuxgboost
[ 95%] Linking CXX static library libgpuxgboost.a
[ 95%] Built target gpuxgboost
Scanning dependencies of target runxgboost
[ 97%] Building CXX object CMakeFiles/runxgboost.dir/src/cli_main.cc.o
[ 98%] Linking CXX executable ../xgboost
/usr/lib/gcc/x86_64-linux-gnu/5/../../../x86_64-linux-gnu/libnccl_static.a(all_gather_sum.o): In function `ncclAllGatherKernel_copy_i8(ncclColl)':
(.text+0x127): undefined reference to `__cudaPopCallConfiguration'
/usr/lib/gcc/x86_64-linux-gnu/5/../../../x86_64-linux-gnu/libnccl_static.a(all_gather_sum.o): In function `ncclAllGatherLLKernel_copy_i8(ncclColl)':
(.text+0x1e7): undefined reference to `__cudaPopCallConfiguration'
/usr/lib/gcc/x86_64-linux-gnu/5/../../../x86_64-linux-gnu/libnccl_static.a(all_gather_sum.o): In function `__device_stub__Z29ncclAllGatherLLKernel_copy_i88ncclColl(ncclColl&)':
(.text+0x2df): undefined reference to `__cudaPopCallConfiguration'
/usr/lib/gcc/x86_64-linux-gnu/5/../../../x86_64-linux-gnu/libnccl_static.a(all_gather_sum.o): In function `__device_stub__Z27ncclAllGatherKernel_copy_i88ncclColl(ncclColl&)':
(.text+0x38f): undefined reference to `__cudaPopCallConfiguration'
/usr/lib/gcc/x86_64-linux-gnu/5/../../../x86_64-linux-gnu/libnccl_static.a(all_reduce_max.o): In function `ncclAllReduceKernel_max_f64(ncclColl)':
(.text+0x577): undefined reference to `__cudaPopCallConfiguration'
/usr/lib/gcc/x86_64-linux-gnu/5/../../../x86_64-linux-gnu/libnccl_static.a(all_reduce_max.o):(.text+0x637): more undefined references to `__cudaPopCallConfiguration' follow
collect2: error: ld returned 1 exit status
CMakeFiles/runxgboost.dir/build.make:181: recipe for target '../xgboost' failed
make[2]: *** [../xgboost] Error 1
CMakeFiles/Makefile2:107: recipe for target 'CMakeFiles/runxgboost.dir/all' failed
make[1]: *** [CMakeFiles/runxgboost.dir/all] Error 2
Makefile:127: recipe for target 'all' failed
make: *** [all] Error 2

And finally, the error log from cmake:

Determining if the pthread_create exist failed with the following output:
Change Dir: /home/jmarkow/dev/xgboost/build/CMakeFiles/CMakeTmp

Run Build Command:"/usr/bin/make" "cmTC_d5357/fast"
/usr/bin/make -f CMakeFiles/cmTC_d5357.dir/build.make CMakeFiles/cmTC_d5357.dir/build
make[1]: Entering directory '/home/jmarkow/dev/xgboost/build/CMakeFiles/CMakeTmp'
Building C object CMakeFiles/cmTC_d5357.dir/CheckSymbolExists.c.o
/usr/bin/cc    -fPIC  -fPIE   -o CMakeFiles/cmTC_d5357.dir/CheckSymbolExists.c.o   -c /home/jmarkow/dev/xgboost/build/CMakeFiles/CMakeTmp/CheckSymbolExists.c
Linking C executable cmTC_d5357
/usr/bin/cmake -E cmake_link_script CMakeFiles/cmTC_d5357.dir/link.txt --verbose=1
/usr/bin/cc  -fPIC     CMakeFiles/cmTC_d5357.dir/CheckSymbolExists.c.o  -o cmTC_d5357 -rdynamic 
CMakeFiles/cmTC_d5357.dir/CheckSymbolExists.c.o: In function `main':
CheckSymbolExists.c:(.text+0x1b): undefined reference to `pthread_create'
collect2: error: ld returned 1 exit status
CMakeFiles/cmTC_d5357.dir/build.make:97: recipe for target 'cmTC_d5357' failed
make[1]: *** [cmTC_d5357] Error 1
make[1]: Leaving directory '/home/jmarkow/dev/xgboost/build/CMakeFiles/CMakeTmp'
Makefile:126: recipe for target 'cmTC_d5357/fast' failed
make: *** [cmTC_d5357/fast] Error 2

File /home/jmarkow/dev/xgboost/build/CMakeFiles/CMakeTmp/CheckSymbolExists.c:
/* */
#include <pthread.h>

int main(int argc, char** argv)
{
  (void)argv;
#ifndef pthread_create
  return ((int*)(&pthread_create))[argc];
#else
  (void)argc;
  return 0;
#endif
}

Determining if the function pthread_create exists in the pthreads failed with the following output:
Change Dir: /home/jmarkow/dev/xgboost/build/CMakeFiles/CMakeTmp

Run Build Command:"/usr/bin/make" "cmTC_1e4fe/fast"
/usr/bin/make -f CMakeFiles/cmTC_1e4fe.dir/build.make CMakeFiles/cmTC_1e4fe.dir/build
make[1]: Entering directory '/home/jmarkow/dev/xgboost/build/CMakeFiles/CMakeTmp'
Building C object CMakeFiles/cmTC_1e4fe.dir/CheckFunctionExists.c.o
/usr/bin/cc    -fPIC -DCHECK_FUNCTION_EXISTS=pthread_create -fPIE   -o CMakeFiles/cmTC_1e4fe.dir/CheckFunctionExists.c.o   -c /usr/share/cmake-3.5/Modules/CheckFunctionExists.c
Linking C executable cmTC_1e4fe
/usr/bin/cmake -E cmake_link_script CMakeFiles/cmTC_1e4fe.dir/link.txt --verbose=1
/usr/bin/cc  -fPIC -DCHECK_FUNCTION_EXISTS=pthread_create    CMakeFiles/cmTC_1e4fe.dir/CheckFunctionExists.c.o  -o cmTC_1e4fe -rdynamic -lpthreads 
/usr/bin/ld: cannot find -lpthreads
collect2: error: ld returned 1 exit status
CMakeFiles/cmTC_1e4fe.dir/build.make:97: recipe for target 'cmTC_1e4fe' failed
make[1]: *** [cmTC_1e4fe] Error 1
make[1]: Leaving directory '/home/jmarkow/dev/xgboost/build/CMakeFiles/CMakeTmp'
Makefile:126: recipe for target 'cmTC_1e4fe/fast' failed
make: *** [cmTC_1e4fe/fast] Error 2

Any thoughts?

See https://github.com/uber/horovod/issues/274. The idea is to install an NCCL 2 build that matches your CUDA version.
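A quick way to check for this kind of mismatch is sketched below. The undefined `__cudaPopCallConfiguration` references presumably mean that libnccl_static.a was built for a newer CUDA runtime than the installed 9.0 toolkit. The sketch assumes the NCCL package version carries a `+cudaX.Y` suffix and that `nvcc --version` prints a line containing `release X.Y`. The function names are hypothetical helpers, not part of any tool.

```shell
#!/bin/sh
# Hypothetical helpers for comparing the CUDA version an NCCL package was
# built for against the CUDA toolkit version. Names are illustrative only.

# Extract the CUDA suffix from a package version string,
# e.g. "2.2.12-1+cuda9.0" -> "9.0". Prints nothing if there is no suffix.
nccl_cuda_suffix() {
    printf '%s\n' "$1" | sed -n 's/.*+cuda\([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p'
}

# Extract "major.minor" from `nvcc --version` style output, which is assumed
# to contain a line such as "Cuda compilation tools, release 9.0, V9.0.176".
toolkit_cuda_version() {
    printf '%s\n' "$1" | sed -n 's/.*release \([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p'
}

# Succeed only if both versions could be parsed and are equal.
check_nccl_matches_cuda() {
    pkg=$(nccl_cuda_suffix "$1")
    tk=$(toolkit_cuda_version "$2")
    if [ -n "$pkg" ] && [ "$pkg" = "$tk" ]; then
        echo "match: NCCL built for CUDA $pkg"
    else
        echo "mismatch: NCCL built for CUDA ${pkg:-unknown}, toolkit is CUDA ${tk:-unknown}"
        return 1
    fi
}

# Example invocation on a Debian/Ubuntu machine (not run here):
#   check_nccl_matches_cuda \
#       "$(dpkg-query -W -f='${Version}' libnccl-dev)" \
#       "$(nvcc --version)"
```

If the check reports a mismatch, reinstalling libnccl2 and libnccl-dev with a version string whose suffix matches the toolkit is the fix suggested in the linked issue.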

That did the trick!

For the record, I am using CUDA 9.0 on Ubuntu 16.04, and I installed the matching version of NCCL with:

sudo apt install libnccl2=2.2.12-1+cuda9.0 libnccl-dev=2.2.12-1+cuda9.0
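In case it helps anyone pick the right version string for a different CUDA release, here is a rough sketch. It assumes the NVIDIA apt repository is already configured and that available versions carry a `+cudaX.Y` suffix like the one above. `select_nccl_versions` is a hypothetical helper name, and the input is expected in the `pkg | version | source` layout printed by `apt-cache madison`.

```shell
#!/bin/sh
# Hypothetical helper: keep only version strings built for one CUDA release.
# Reads `apt-cache madison` style lines ("pkg | version | source") on stdin
# and prints the unique versions that end in "+cuda<release>".
select_nccl_versions() {
    cuda="$1"
    awk -F'|' -v suffix="+cuda$cuda" '
        {
            v = $2
            gsub(/[ \t]/, "", v)      # trim padding around the version column
            n = length(suffix)
            # keep versions whose tail equals the requested suffix
            if (length(v) >= n && substr(v, length(v) - n + 1) == suffix)
                print v
        }' | sort -u
}

# Example invocation (not run here):
#   apt-cache madison libnccl2 | select_nccl_versions 9.0
```

Any version it prints can then be pinned for both libnccl2 and libnccl-dev in the same way as the command above.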

Thanks for the fast response!