Problem when selecting GPUs with CUDA_VISIBLE_DEVICES


Environment

Tesla P100 x 4

Red Hat 7.2 64-bit
CUDA 9.0
Python 3.6.3 (the system also ships with Python 2.7, which I left untouched)
mxnet-cu90 1.0.0.post2 (installed with pip3 install mxnet-cu90)

Problem description

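MXNet programs normally run fine on this machine; over the past few days I have been working on the Kaggle house-price competition.
I am now trying to control which GPUs are visible by setting CUDA_VISIBLE_DEVICES.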
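
For reference, the variable can also be set from inside a script instead of on the command line. This is only a minimal sketch: restrict_visible_gpus is a made-up helper name, not an mxnet function, and it assumes it runs before CUDA is initialized in the process (calling it before importing mxnet is the safe choice).

```python
import os


def restrict_visible_gpus(indices):
    """Expose only the given GPU indices to CUDA libraries used afterwards.

    Hypothetical helper for illustration. It only sets the environment
    variable, so it should run before CUDA is initialized in this process.
    """
    value = ",".join(str(i) for i in indices)
    os.environ["CUDA_VISIBLE_DEVICES"] = value
    return value


if __name__ == "__main__":
    # Intended to match running: CUDA_VISIBLE_DEVICES=0,1 python3 test1.py
    print(restrict_visible_gpus([0, 1]))
```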

Contents of test1.py:

import mxnet as mx
from mxnet import nd
a = nd.array([1,2,3], ctx=mx.gpu(1))
print(a)

Contents of test2.py:

import mxnet as mx
from mxnet import nd
a = nd.array([1,2,3], ctx=mx.gpu(2))
print(a)

Each of the following commands produces the correct output:
CUDA_VISIBLE_DEVICES=0,1 python3 test1.py
CUDA_VISIBLE_DEVICES=0,1,2 python3 test1.py
CUDA_VISIBLE_DEVICES=0,1,2,3 python3 test1.py
The correct output is:

[ 1.  2.  3.]
<NDArray 3 @gpu(1)>

However, each of the following commands fails:
CUDA_VISIBLE_DEVICES=1 python3 test1.py (I expected this to produce the correct output)
CUDA_VISIBLE_DEVICES=2 python3 test2.py (I expected this to produce the correct output)
CUDA_VISIBLE_DEVICES=1,2 python3 test2.py
CUDA_VISIBLE_DEVICES=0,2 python3 test2.py
CUDA_VISIBLE_DEVICES=0,2,3 python3 test2.py
CUDA_VISIBLE_DEVICES=2,3 python3 test2.py
CUDA_VISIBLE_DEVICES=1,2,3 python3 test2.py
The error is:

[11:37:53] /home/travis/build/dmlc/mxnet-distro/mxnet-build/dmlc-core/include/dmlc/logging.h:308: [11:37:53] src/storage/storage.cc:63: Check failed: e == cudaSuccess || e == cudaErrorCudartUnloading CUDA: invalid device ordinal

Stack trace returned 10 entries:
[bt] (0) /usr/local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x28d31c) [0x7ff38ef6031c]
[bt] (1) /usr/local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x29de4f6) [0x7ff3916b14f6]
[bt] (2) /usr/local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x29dd85b) [0x7ff3916b085b]
[bt] (3) /usr/local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x23ee481) [0x7ff3910c1481]
[bt] (4) /usr/local/lib/python3.6/site-packages/mxnet/libmxnet.so(MXNDArrayCreateEx+0x145) [0x7ff3910b6945]
[bt] (5) /usr/local/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so(ffi_call_unix64+0x4c) [0x7ff3fab02a1c]
[bt] (6) /usr/local/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so(ffi_call+0x165) [0x7ff3fab01b65]
[bt] (7) /usr/local/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so(_ctypes_callproc+0x283) [0x7ff3faaf9fb3]
[bt] (8) /usr/local/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so(+0x96ff) [0x7ff3faaf16ff]
[bt] (9) python3(_PyObject_FastCallDict+0xa2) [0x451242]

Traceback (most recent call last):
  File "test1.py", line 3, in <module>
    a = nd.array([1,2,3], ctx=mx.gpu(1))
  File "/usr/local/lib/python3.6/site-packages/mxnet/ndarray/utils.py", line 146, in array
    return _array(source_array, ctx=ctx, dtype=dtype)
  File "/usr/local/lib/python3.6/site-packages/mxnet/ndarray/ndarray.py", line 2244, in array
    arr = empty(source_array.shape, ctx, dtype)
  File "/usr/local/lib/python3.6/site-packages/mxnet/ndarray/ndarray.py", line 3415, in empty
    return NDArray(handle=_new_alloc_handle(shape, ctx, False, dtype))
  File "/usr/local/lib/python3.6/site-packages/mxnet/ndarray/ndarray.py", line 139, in _new_alloc_handle
    ctypes.byref(hdl)))
  File "/usr/local/lib/python3.6/site-packages/mxnet/base.py", line 146, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [11:37:53] src/storage/storage.cc:63: Check failed: e == cudaSuccess || e == cudaErrorCudartUnloading CUDA: invalid device ordinal

Stack trace returned 10 entries:
[bt] (0) /usr/local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x28d31c) [0x7ff38ef6031c]
[bt] (1) /usr/local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x29de4f6) [0x7ff3916b14f6]
[bt] (2) /usr/local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x29dd85b) [0x7ff3916b085b]
[bt] (3) /usr/local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x23ee481) [0x7ff3910c1481]
[bt] (4) /usr/local/lib/python3.6/site-packages/mxnet/libmxnet.so(MXNDArrayCreateEx+0x145) [0x7ff3910b6945]
[bt] (5) /usr/local/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so(ffi_call_unix64+0x4c) [0x7ff3fab02a1c]
[bt] (6) /usr/local/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so(ffi_call+0x165) [0x7ff3fab01b65]
[bt] (7) /usr/local/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so(_ctypes_callproc+0x283) [0x7ff3faaf9fb3]
[bt] (8) /usr/local/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so(+0x96ff) [0x7ff3faaf16ff]
[bt] (9) python3(_PyObject_FastCallDict+0xa2) [0x451242]
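
A note on "invalid device ordinal": the CUDA documentation describes CUDA_VISIBLE_DEVICES as renumbering the listed devices, so that inside the process they are addressed as 0, 1, 2, ... in the order given. Under that rule, CUDA_VISIBLE_DEVICES=1 leaves a single device that the process sees as ordinal 0, and mx.gpu(1) then points at a device that is not there. The sketch below only models this renumbering. ordinal_map is a hypothetical helper, and it assumes every entry is a numeric index of an existing GPU. It would account for the failing one- and two-device commands above, but not for the 0,2,3 and 1,2,3 cases, where three devices should be visible and ordinal 2 should be valid.

```python
def ordinal_map(cuda_visible_devices):
    """Map in-process ordinals to the indices listed in the variable.

    Hypothetical helper: assumes a comma-separated list of numeric indices
    that all refer to existing GPUs (UUID entries are not handled).
    """
    listed = [int(p) for p in cuda_visible_devices.split(",") if p.strip()]
    # Visible devices are renumbered from 0 in the order they are listed.
    return dict(enumerate(listed))


if __name__ == "__main__":
    # (variable value, ordinal passed to mx.gpu) taken from the commands above
    cases = [("0,1", 1), ("1", 1), ("2", 2), ("1,2", 2), ("0,2", 2), ("2,3", 2)]
    for value, ordinal in cases:
        mapping = ordinal_map(value)
        print(value, mapping, "ordinal", ordinal, "addressable:", ordinal in mapping)
```

If this model applies, CUDA_VISIBLE_DEVICES=2 python3 test2.py would need mx.gpu(0) in the script rather than mx.gpu(2).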

Output of nvidia-smi -L:

[tryai@P100v0 ~]$ nvidia-smi -L
GPU 0: Tesla P100-PCIE-16GB (UUID: GPU-bdfaaf53-cf7f-4930-ccc0-8fb94dfed819)
GPU 1: Tesla P100-PCIE-12GB (UUID: GPU-fcf7844a-6bfa-f826-536c-19423c0cb5ea)
GPU 2: Tesla P100-PCIE-12GB (UUID: GPU-709abe77-f2aa-1b94-a704-34be456c2330)
GPU 3: Tesla P100-PCIE-12GB (UUID: GPU-823f8869-76f4-395c-7734-d4311c95528e)
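
One more thing that may be worth checking: GPU 0 is a different model from the other three, and CUDA's default device order (fastest first, unless CUDA_DEVICE_ORDER=PCI_BUS_ID is set) is not guaranteed to match the order nvidia-smi prints. CUDA_VISIBLE_DEVICES also accepts GPU UUIDs, which sidesteps the index question. The sketch below pulls the UUIDs out of a listing like the one above; parse_gpu_list is a made-up helper and the regular expression assumes exactly the line format shown here.

```python
import re

# Matches lines such as: GPU 0: Tesla P100-PCIE-16GB (UUID: GPU-bdfa...)
LINE_RE = re.compile(r"^GPU (\d+): (.+?) \(UUID: (GPU-[0-9a-f-]+)\)$")


def parse_gpu_list(text):
    """Return (index, name, uuid) tuples from `nvidia-smi -L` style text.

    Hypothetical helper; lines that do not match the format are skipped.
    """
    gpus = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            gpus.append((int(match.group(1)), match.group(2), match.group(3)))
    return gpus


if __name__ == "__main__":
    sample = (
        "GPU 0: Tesla P100-PCIE-16GB (UUID: GPU-bdfaaf53-cf7f-4930-ccc0-8fb94dfed819)\n"
        "GPU 1: Tesla P100-PCIE-12GB (UUID: GPU-fcf7844a-6bfa-f826-536c-19423c0cb5ea)\n"
    )
    for index, name, uuid in parse_gpu_list(sample):
        print(index, name, uuid)
```

A UUID obtained this way could then be used in place of an index, for example CUDA_VISIBLE_DEVICES=GPU-709abe77-f2aa-1b94-a704-34be456c2330 python3 test2.py, with the script addressing the single visible device as mx.gpu(0).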

Hello, I recently ran into the same problem you described. How did you solve it in the end? Would you mind adding me on WeChat? wx:dongzhenguo2015