Distributed CPU deployment on Ubuntu following the official tutorial: single machine works, distributed run fails (solved)

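Could someone help take a look? The main problem is as stated in the title; the environment and a detailed log reproducing the problem are below. Solved: see the replies below for the analysis. Thanks everyone for the pointers :pray: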

Environment

node1 environment:

Output of the diagnose script, run from the tools directory:
root@node1:/home/admin/mxnet/tools# python diagnose.py
----------Python Info----------
('Version      :', '2.7.12')
('Compiler     :', 'GCC 5.4.0 20160609')
('Build        :', ('default', 'Nov 19 2016 06:48:10'))
('Arch         :', ('64bit', 'ELF'))
------------Pip Info-----------
('Version      :', '19.0.3')
('Directory    :', '/usr/local/lib/python2.7/dist-packages/pip')
----------MXNet Info-----------
('Version      :', '1.5.0')
('Directory    :', '/home/admin/mxnet/python/mxnet')
Hashtag not found. Not installed from pre-built package.
----------System Info----------
('Platform     :', 'Linux-4.4.0-93-generic-x86_64-with-Ubuntu-16.04-xenial')
('system       :', 'Linux')
('node         :', 'node1')
('release      :', '4.4.0-93-generic')
('version      :', '#116-Ubuntu SMP Fri Aug 11 21:17:51 UTC 2017')
----------Hardware Info----------
('machine      :', 'x86_64')
('processor    :', 'x86_64')
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                1
On-line CPU(s) list:   0
Thread(s) per core:    1
Core(s) per socket:    1
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 85
Model name:            Intel(R) Xeon(R) Platinum 8163 CPU @ 2.50GHz
Stepping:              4
CPU MHz:               2499.994
BogoMIPS:              4999.98
Hypervisor vendor:     KVM
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              1024K
L3 cache:              33792K
NUMA node0 CPU(s):     0
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl eagerfpu pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f rdseed adx smap avx512cd xsaveopt xsavec xgetbv1
----------Network Test----------
Setting timeout: 10
Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0002 sec, LOAD: 1.3775 sec.
Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0004 sec, LOAD: 2.9173 sec.
Timing for FashionMNIST: https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.5871 sec, LOAD: 1.2784 sec.
Timing for Conda: https://repo.continuum.io/pkgs/free/, DNS: 0.1930 sec, LOAD: 2.0958 sec.
Timing for Gluon Tutorial(en): http://gluon.mxnet.io, DNS: 0.5681 sec, LOAD: 2.3481 sec.
Timing for Gluon Tutorial(cn): https://zh.gluon.ai, DNS: 1.3597 sec, LOAD: 2.0816 sec.

node2 environment:

root@node2:/home/admin/mxnet/tools# python diagnose.py
----------Python Info----------
('Version      :', '2.7.12')
('Compiler     :', 'GCC 5.4.0 20160609')
('Build        :', ('default', 'Nov 19 2016 06:48:10'))
('Arch         :', ('64bit', 'ELF'))
------------Pip Info-----------
('Version      :', '19.0.3')
('Directory    :', '/usr/local/lib/python2.7/dist-packages/pip')
----------MXNet Info-----------
('Version      :', '1.5.0')
('Directory    :', '/home/admin/mxnet/python/mxnet')
Hashtag not found. Not installed from pre-built package.
----------System Info----------
('Platform     :', 'Linux-4.4.0-93-generic-x86_64-with-Ubuntu-16.04-xenial')
('system       :', 'Linux')
('node         :', 'node2')
('release      :', '4.4.0-93-generic')
('version      :', '#116-Ubuntu SMP Fri Aug 11 21:17:51 UTC 2017')
----------Hardware Info----------
('machine      :', 'x86_64')
('processor    :', 'x86_64')
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                1
On-line CPU(s) list:   0
Thread(s) per core:    1
Core(s) per socket:    1
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 79
Model name:            Intel(R) Xeon(R) CPU E5-2682 v4 @ 2.50GHz
Stepping:              1
CPU MHz:               2494.220
BogoMIPS:              4988.44
Hypervisor vendor:     KVM
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              40960K
NUMA node0 CPU(s):     0
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl eagerfpu pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx smap xsaveopt
----------Network Test----------
Setting timeout: 10
Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0001 sec, LOAD: 2.0057 sec.
Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.1153 sec, LOAD: 29.6195 sec.
Timing for FashionMNIST: https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.1215 sec, LOAD: 2.8773 sec.
Timing for Conda: https://repo.continuum.io/pkgs/free/, DNS: 0.1650 sec, LOAD: 2.7257 sec.
Timing for Gluon Tutorial(en): http://gluon.mxnet.io, DNS: 0.2725 sec, LOAD: 5.3097 sec.
Timing for Gluon Tutorial(cn): https://zh.gluon.ai, DNS: 0.2257 sec, LOAD: 2.7739 sec.

Problem log

Attempt 1: run launch.py from the tools directory with the following command:
python launch.py -n 2 -H hosts --launcher ssh `which python` ../example/image-classification/train_mnist.py

Error output 1:

root@node1:/home/admin/mxnet/tools# python launch.py -n 2 -H hosts --launcher ssh `which python` ../example/image-classification/train_mnist.py
Traceback (most recent call last):
  File "../example/image-classification/train_mnist.py", line 25, in <module>
    from common import find_mxnet, fit
  File "/home/admin/mxnet/example/image-classification/common/find_mxnet.py", line 20, in <module>
    import mxnet as mx
  File "/home/admin/mxnet/python/mxnet/__init__.py", line 91, in <module>
    from . import kvstore_server
  File "/home/admin/mxnet/python/mxnet/kvstore_server.py", line 85, in <module>
    _init_kvstore_server_module()
  File "/home/admin/mxnet/python/mxnet/kvstore_server.py", line 82, in _init_kvstore_server_module
    server.run()
  File "/home/admin/mxnet/python/mxnet/kvstore_server.py", line 73, in run
    check_call(_LIB.MXKVStoreRunServer(self.handle, _ctrl_proto(self._controller()), None))
  File "/home/admin/mxnet/python/mxnet/base.py", line 252, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [11:29:01] src/van.cc:291: Check failed: (my_node_.port) != (-1) bind failed

Stack trace returned 10 entries:
[bt] (0) /home/admin/mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::StackTrace[abi:cxx11]()+0x1bc) [0x7f62d3b7c54c]
[bt] (1) /home/admin/mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x28) [0x7f62d3b7d8c8]
[bt] (2) /home/admin/mxnet/python/mxnet/../../lib/libmxnet.so(ps::Van::Start(int)+0x9e8) [0x7f62d7076ea8]
[bt] (3) /home/admin/mxnet/python/mxnet/../../lib/libmxnet.so(ps::ZMQVan::Start(int)+0x6c) [0x7f62d708199c]
[bt] (4) /home/admin/mxnet/python/mxnet/../../lib/libmxnet.so(ps::Postoffice::Start(int, char const*, bool)+0x7c) [0x7f62d70714dc]
[bt] (5) /home/admin/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::kvstore::KVStoreDist::RunServer(std::function<void (int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)> const&)+0x1b4) [0x7f62d700c884]
[bt] (6) /home/admin/mxnet/python/mxnet/../../lib/libmxnet.so(MXKVStoreRunServer+0x65) [0x7f62d6f76315]
[bt] (7) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call_unix64+0x4c) [0x7f62dc5f0e40]
[bt] (8) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call+0x2eb) [0x7f62dc5f08ab]
[bt] (9) /usr/lib/python2.7/lib-dynload/_ctypes.x86_64-linux-gnu.so(_ctypes_callproc+0x48f) [0x7f62dc8003df]


Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/home/admin/mxnet/tools/../3rdparty/dmlc-core/tracker/dmlc_tracker/tracker.py", line 366, in <lambda>
    target=(lambda: subprocess.check_call(self.cmd, env=env, shell=True)), args=())
  File "/usr/lib/python2.7/subprocess.py", line 541, in check_call
    raise CalledProcessError(retcode, cmd)
CalledProcessError: Command '/usr/bin/python ../example/image-classification/train_mnist.py' returned non-zero exit status 1

root@node1:/home/admin/mxnet/tools# INFO:root:start with arguments Namespace(add_stn=False, batch_size=64, disp_batches=100, dtype='float32', gc_threshold=0.5, gc_type='none', gpus=None, image_shape='1, 28, 28', initializer='default', kv_store='device', load_epoch=None, loss='', lr=0.05, lr_factor=0.1, lr_step_epochs='10', macrobatch_size=0, model_prefix=None, mom=0.9, monitor=0, network='mlp', num_classes=10, num_epochs=20, num_examples=60000, num_layers=None, optimizer='sgd', profile_server_suffix='', profile_worker_suffix='', save_period=1, test_io=0, top_k=0, warmup_epochs=5, warmup_strategy='linear', wd=0.0001)
Traceback (most recent call last):
  File "../example/image-classification/train_mnist.py", line 25, in <module>
    from common import find_mxnet, fit
  File "/home/admin/mxnet/example/image-classification/common/find_mxnet.py", line 20, in <module>
    import mxnet as mx
  File "/home/admin/mxnet/python/mxnet/__init__.py", line 91, in <module>
    from . import kvstore_server
  File "/home/admin/mxnet/python/mxnet/kvstore_server.py", line 85, in <module>
    _init_kvstore_server_module()
  File "/home/admin/mxnet/python/mxnet/kvstore_server.py", line 82, in _init_kvstore_server_module
    server.run()
  File "/home/admin/mxnet/python/mxnet/kvstore_server.py", line 73, in run
    check_call(_LIB.MXKVStoreRunServer(self.handle, _ctrl_proto(self._controller()), None))
  File "/home/admin/mxnet/python/mxnet/base.py", line 252, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [11:29:01] src/van.cc:291: Check failed: (my_node_.port) != (-1) bind failed

Stack trace returned 10 entries:
[bt] (0) /home/admin/mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::StackTrace[abi:cxx11]()+0x1bc) [0x7f6169f8a54c]
[bt] (1) /home/admin/mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x28) [0x7f6169f8b8c8]
[bt] (2) /home/admin/mxnet/python/mxnet/../../lib/libmxnet.so(ps::Van::Start(int)+0x9e8) [0x7f616d484ea8]
[bt] (3) /home/admin/mxnet/python/mxnet/../../lib/libmxnet.so(ps::ZMQVan::Start(int)+0x6c) [0x7f616d48f99c]
[bt] (4) /home/admin/mxnet/python/mxnet/../../lib/libmxnet.so(ps::Postoffice::Start(int, char const*, bool)+0x7c) [0x7f616d47f4dc]
[bt] (5) /home/admin/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::kvstore::KVStoreDist::RunServer(std::function<void (int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)> const&)+0x1b4) [0x7f616d41a884]
[bt] (6) /home/admin/mxnet/python/mxnet/../../lib/libmxnet.so(MXKVStoreRunServer+0x65) [0x7f616d384315]
[bt] (7) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call_unix64+0x4c) [0x7f61729fee40]
[bt] (8) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call+0x2eb) [0x7f61729fe8ab]
[bt] (9) /usr/lib/python2.7/lib-dynload/_ctypes.x86_64-linux-gnu.so(_ctypes_callproc+0x48f) [0x7f6172c0e3df]


Traceback (most recent call last):
  File "../example/image-classification/train_mnist.py", line 25, in <module>
    from common import find_mxnet, fit
  File "/home/admin/mxnet/example/image-classification/common/find_mxnet.py", line 20, in <module>
    import mxnet as mx
  File "/home/admin/mxnet/python/mxnet/__init__.py", line 91, in <module>
    from . import kvstore_server
  File "/home/admin/mxnet/python/mxnet/kvstore_server.py", line 85, in <module>
    _init_kvstore_server_module()
  File "/home/admin/mxnet/python/mxnet/kvstore_server.py", line 82, in _init_kvstore_server_module
    server.run()
  File "/home/admin/mxnet/python/mxnet/kvstore_server.py", line 73, in run
    check_call(_LIB.MXKVStoreRunServer(self.handle, _ctrl_proto(self._controller()), None))
  File "/home/admin/mxnet/python/mxnet/base.py", line 252, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [11:29:02] src/van.cc:291: Check failed: (my_node_.port) != (-1) bind failed

Stack trace returned 10 entries:
[bt] (0) /home/admin/mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::StackTrace[abi:cxx11]()+0x1bc) [0x7f623a41254c]
[bt] (1) /home/admin/mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x28) [0x7f623a4138c8]
[bt] (2) /home/admin/mxnet/python/mxnet/../../lib/libmxnet.so(ps::Van::Start(int)+0x9e8) [0x7f623d90cea8]
[bt] (3) /home/admin/mxnet/python/mxnet/../../lib/libmxnet.so(ps::ZMQVan::Start(int)+0x6c) [0x7f623d91799c]
[bt] (4) /home/admin/mxnet/python/mxnet/../../lib/libmxnet.so(ps::Postoffice::Start(int, char const*, bool)+0x7c) [0x7f623d9074dc]
[bt] (5) /home/admin/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::kvstore::KVStoreDist::RunServer(std::function<void (int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)> const&)+0x1b4) [0x7f623d8a2884]
[bt] (6) /home/admin/mxnet/python/mxnet/../../lib/libmxnet.so(MXKVStoreRunServer+0x65) [0x7f623d80c315]
[bt] (7) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call_unix64+0x4c) [0x7f6242e86e40]
[bt] (8) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call+0x2eb) [0x7f6242e868ab]
[bt] (9) /usr/lib/python2.7/lib-dynload/_ctypes.x86_64-linux-gnu.so(_ctypes_callproc+0x48f) [0x7f62430963df]


INFO:root:start with arguments Namespace(add_stn=False, batch_size=64, disp_batches=100, dtype='float32', gc_threshold=0.5, gc_type='none', gpus=None, image_shape='1, 28, 28', initializer='default', kv_store='device', load_epoch=None, loss='', lr=0.05, lr_factor=0.1, lr_step_epochs='10', macrobatch_size=0, model_prefix=None, mom=0.9, monitor=0, network='mlp', num_classes=10, num_epochs=20, num_examples=60000, num_layers=None, optimizer='sgd', profile_server_suffix='', profile_worker_suffix='', save_period=1, test_io=0, top_k=0, warmup_epochs=5, warmup_strategy='linear', wd=0.0001)
INFO:root:Epoch[0] Batch [0-100] Speed: 21531.49 samples/sec accuracy=0.788057
INFO:root:Epoch[0] Batch [100-200] Speed: 21412.12 samples/sec accuracy=0.899062
INFO:root:Epoch[0] Batch [200-300] Speed: 22038.35 samples/sec accuracy=0.932031
INFO:root:Epoch[0] Batch [0-100] Speed: 25505.21 samples/sec accuracy=0.771813
INFO:root:Epoch[0] Batch [300-400] Speed: 20160.85 samples/sec accuracy=0.933594
INFO:root:Epoch[0] Batch [100-200] Speed: 22780.91 samples/sec accuracy=0.915937
INFO:root:Epoch[0] Batch [400-500] Speed: 22050.10 samples/sec accuracy=0.948750
INFO:root:Epoch[0] Batch [200-300] Speed: 25171.39 samples/sec accuracy=0.930000
INFO:root:Epoch[0] Batch [500-600] Speed: 22135.37 samples/sec accuracy=0.949063
INFO:root:Epoch[0] Batch [300-400] Speed: 25702.61 samples/sec accuracy=0.934219
INFO:root:Epoch[0] Batch [600-700] Speed: 21914.58 samples/sec accuracy=0.952969
INFO:root:Epoch[0] Batch [400-500] Speed: 23535.12 samples/sec accuracy=0.946406
INFO:root:Epoch[0] Batch [700-800] Speed: 22056.04 samples/sec accuracy=0.953750
INFO:root:Epoch[0] Batch [500-600] Speed: 25957.79 samples/sec accuracy=0.946719
INFO:root:Epoch[0] Batch [800-900] Speed: 21995.24 samples/sec accuracy=0.958125
INFO:root:Epoch[0] Batch [600-700] Speed: 25738.46 samples/sec accuracy=0.952500
INFO:root:Epoch[0] Train-accuracy=0.925323
INFO:root:Epoch[0] Time cost=2.781
INFO:root:Epoch[0] Validation-accuracy=0.955812
...




Attempt 2: run the following command from the image-classification directory:
python /home/admin/mxnet/tools/launch.py -n 2 -H hosts --sync-dst-dir /tmp/mxnet python train_mnist.py --kv-store dist_sync

Error output 2:


root@node1:/home/admin/mxnet/example/image-classification# python /home/admin/mxnet/tools/launch.py -n 2 -H hosts --sync-dst-dir /tmp/mxnet python train_mnist.py --kv-store dist_sync
2019-03-13 11:55:01,733 INFO rsync /home/admin/mxnet/example/image-classification/ -> 39.98.181.196:/tmp/mxnet
2019-03-13 11:55:01,734 INFO rsync /home/admin/mxnet/example/image-classification/ -> 39.107.232.93:/tmp/mxnet
Traceback (most recent call last):
  File "train_mnist.py", line 25, in <module>
    from common import find_mxnet, fit
  File "/home/admin/mxnet/example/image-classification/common/find_mxnet.py", line 20, in <module>
    import mxnet as mx
  File "/home/admin/mxnet/python/mxnet/__init__.py", line 91, in <module>
    from . import kvstore_server
  File "/home/admin/mxnet/python/mxnet/kvstore_server.py", line 85, in <module>
    _init_kvstore_server_module()
  File "/home/admin/mxnet/python/mxnet/kvstore_server.py", line 82, in _init_kvstore_server_module
    server.run()
  File "/home/admin/mxnet/python/mxnet/kvstore_server.py", line 73, in run
    check_call(_LIB.MXKVStoreRunServer(self.handle, _ctrl_proto(self._controller()), None))
  File "/home/admin/mxnet/python/mxnet/base.py", line 252, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [11:55:03] src/van.cc:291: Check failed: (my_node_.port) != (-1) bind failed

Stack trace returned 10 entries:
[bt] (0) /home/admin/mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::StackTrace[abi:cxx11]()+0x1bc) [0x7fee400ce54c]
[bt] (1) /home/admin/mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x28) [0x7fee400cf8c8]
[bt] (2) /home/admin/mxnet/python/mxnet/../../lib/libmxnet.so(ps::Van::Start(int)+0x9e8) [0x7fee435c8ea8]
[bt] (3) /home/admin/mxnet/python/mxnet/../../lib/libmxnet.so(ps::ZMQVan::Start(int)+0x6c) [0x7fee435d399c]
[bt] (4) /home/admin/mxnet/python/mxnet/../../lib/libmxnet.so(ps::Postoffice::Start(int, char const*, bool)+0x7c) [0x7fee435c34dc]
[bt] (5) /home/admin/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::kvstore::KVStoreDist::RunServer(std::function<void (int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)> const&)+0x1b4) [0x7fee4355e884]
[bt] (6) /home/admin/mxnet/python/mxnet/../../lib/libmxnet.so(MXKVStoreRunServer+0x65) [0x7fee434c8315]
[bt] (7) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call_unix64+0x4c) [0x7fee48b42e40]
[bt] (8) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call+0x2eb) [0x7fee48b428ab]
[bt] (9) /usr/lib/python2.7/lib-dynload/_ctypes.x86_64-linux-gnu.so(_ctypes_callproc+0x48f) [0x7fee48d523df]


Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/home/admin/mxnet/tools/../3rdparty/dmlc-core/tracker/dmlc_tracker/tracker.py", line 366, in <lambda>
    target=(lambda: subprocess.check_call(self.cmd, env=env, shell=True)), args=())
  File "/usr/lib/python2.7/subprocess.py", line 541, in check_call
    raise CalledProcessError(retcode, cmd)
CalledProcessError: Command 'python train_mnist.py --kv-store dist_sync' returned non-zero exit status 1

root@node1:/home/admin/mxnet/example/image-classification# Traceback (most recent call last):
  File "train_mnist.py", line 25, in <module>
    from common import find_mxnet, fit
  File "/tmp/mxnet/common/find_mxnet.py", line 20, in <module>
    import mxnet as mx
  File "/home/admin/mxnet/python/mxnet/__init__.py", line 91, in <module>
    from . import kvstore_server
Traceback (most recent call last):
  File "train_mnist.py", line 97, in <module>
    fit.fit(args, sym, get_mnist_iter)
  File "/tmp/mxnet/common/fit.py", line 156, in fit
    kv = mx.kvstore.create(args.kv_store)
  File "/home/admin/mxnet/python/mxnet/kvstore_server.py", line 85, in <module>
    _init_kvstore_server_module()
  File "/home/admin/mxnet/python/mxnet/kvstore_server.py", line 82, in _init_kvstore_server_module
    server.run()
  File "/home/admin/mxnet/python/mxnet/kvstore_server.py", line 73, in run
    check_call(_LIB.MXKVStoreRunServer(self.handle, _ctrl_proto(self._controller()), None))
  File "/home/admin/mxnet/python/mxnet/base.py", line 252, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [11:55:04] src/van.cc:291: Check failed: (my_node_.port) != (-1) bind failed

Stack trace returned 10 entries:
[bt] (0) /home/admin/mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::StackTrace[abi:cxx11]()+0x1bc) [0x7fe87fcd054c]
[bt] (1) /home/admin/mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x28) [0x7fe87fcd18c8]
[bt] (2) /home/admin/mxnet/python/mxnet/../../lib/libmxnet.so(ps::Van::Start(int)+0x9e8) [0x7fe8831caea8]
[bt] (3) /home/admin/mxnet/python/mxnet/../../lib/libmxnet.so(ps::ZMQVan::Start(int)+0x6c) [0x7fe8831d599c]
[bt] (4) /home/admin/mxnet/python/mxnet/../../lib/libmxnet.so(ps::Postoffice::Start(int, char const*, bool)+0x7c) [0x7fe8831c54dc]
[bt] (5) /home/admin/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::kvstore::KVStoreDist::RunServer(std::function<void (int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)> const&)+0x1b4) [0x7fe883160884]
[bt] (6) /home/admin/mxnet/python/mxnet/../../lib/libmxnet.so(MXKVStoreRunServer+0x65) [0x7fe8830ca315]
[bt] (7) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call_unix64+0x4c) [0x7fe888744e40]
[bt] (8) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call+0x2eb) [0x7fe8887448ab]
[bt] (9) /usr/lib/python2.7/lib-dynload/_ctypes.x86_64-linux-gnu.so(_ctypes_callproc+0x48f) [0x7fe8889543df]


  File "/home/admin/mxnet/python/mxnet/kvstore.py", line 674, in create
    ctypes.byref(handle)))
  File "/home/admin/mxnet/python/mxnet/base.py", line 252, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [11:55:04] src/van.cc:291: Check failed: (my_node_.port) != (-1) bind failed

Stack trace returned 10 entries:
[bt] (0) /home/admin/mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::StackTrace[abi:cxx11]()+0x1bc) [0x7f45d238c54c]
[bt] (1) /home/admin/mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x28) [0x7f45d238d8c8]
[bt] (2) /home/admin/mxnet/python/mxnet/../../lib/libmxnet.so(ps::Van::Start(int)+0x9e8) [0x7f45d5886ea8]
[bt] (3) /home/admin/mxnet/python/mxnet/../../lib/libmxnet.so(ps::ZMQVan::Start(int)+0x6c) [0x7f45d589199c]
[bt] (4) /home/admin/mxnet/python/mxnet/../../lib/libmxnet.so(ps::Postoffice::Start(int, char const*, bool)+0x7c) [0x7f45d58814dc]
[bt] (5) /home/admin/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::KVStore::Create(char const*)+0x701) [0x7f45d57f89b1]
[bt] (6) /home/admin/mxnet/python/mxnet/../../lib/libmxnet.so(MXKVStoreCreate+0x20) [0x7f45d5785510]
[bt] (7) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call_unix64+0x4c) [0x7f45dae00e40]
[bt] (8) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call+0x2eb) [0x7f45dae008ab]
[bt] (9) /usr/lib/python2.7/lib-dynload/_ctypes.x86_64-linux-gnu.so(_ctypes_callproc+0x48f) [0x7f45db0103df]


Traceback (most recent call last):
  File "train_mnist.py", line 97, in <module>
    fit.fit(args, sym, get_mnist_iter)
  File "/tmp/mxnet/common/fit.py", line 156, in fit
    kv = mx.kvstore.create(args.kv_store)
  File "/home/admin/mxnet/python/mxnet/kvstore.py", line 674, in create
    ctypes.byref(handle)))
  File "/home/admin/mxnet/python/mxnet/base.py", line 252, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [11:55:04] src/van.cc:291: Check failed: (my_node_.port) != (-1) bind failed

Stack trace returned 10 entries:
[bt] (0) /home/admin/mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::StackTrace[abi:cxx11]()+0x1bc) [0x7f5c8e1ed54c]
[bt] (1) /home/admin/mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x28) [0x7f5c8e1ee8c8]
[bt] (2) /home/admin/mxnet/python/mxnet/../../lib/libmxnet.so(ps::Van::Start(int)+0x9e8) [0x7f5c916e7ea8]
[bt] (3) /home/admin/mxnet/python/mxnet/../../lib/libmxnet.so(ps::ZMQVan::Start(int)+0x6c) [0x7f5c916f299c]
[bt] (4) /home/admin/mxnet/python/mxnet/../../lib/libmxnet.so(ps::Postoffice::Start(int, char const*, bool)+0x7c) [0x7f5c916e24dc]
[bt] (5) /home/admin/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::KVStore::Create(char const*)+0x701) [0x7f5c916599b1]
[bt] (6) /home/admin/mxnet/python/mxnet/../../lib/libmxnet.so(MXKVStoreCreate+0x20) [0x7f5c915e6510]
[bt] (7) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call_unix64+0x4c) [0x7f5c96c61e40]
[bt] (8) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call+0x2eb) [0x7f5c96c618ab]
[bt] (9) /usr/lib/python2.7/lib-dynload/_ctypes.x86_64-linux-gnu.so(_ctypes_callproc+0x48f) [0x7f5c96e713df]


Traceback (most recent call last):
  File "train_mnist.py", line 25, in <module>
    from common import find_mxnet, fit
  File "/tmp/mxnet/common/find_mxnet.py", line 20, in <module>
    import mxnet as mx
  File "/home/admin/mxnet/python/mxnet/__init__.py", line 91, in <module>
    from . import kvstore_server
  File "/home/admin/mxnet/python/mxnet/kvstore_server.py", line 85, in <module>
    _init_kvstore_server_module()
  File "/home/admin/mxnet/python/mxnet/kvstore_server.py", line 82, in _init_kvstore_server_module
    server.run()
  File "/home/admin/mxnet/python/mxnet/kvstore_server.py", line 73, in run
    check_call(_LIB.MXKVStoreRunServer(self.handle, _ctrl_proto(self._controller()), None))
  File "/home/admin/mxnet/python/mxnet/base.py", line 252, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [11:55:04] src/van.cc:291: Check failed: (my_node_.port) != (-1) bind failed

Stack trace returned 10 entries:
[bt] (0) /home/admin/mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::StackTrace[abi:cxx11]()+0x1bc) [0x7f6b1784854c]
[bt] (1) /home/admin/mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x28) [0x7f6b178498c8]
[bt] (2) /home/admin/mxnet/python/mxnet/../../lib/libmxnet.so(ps::Van::Start(int)+0x9e8) [0x7f6b1ad42ea8]
[bt] (3) /home/admin/mxnet/python/mxnet/../../lib/libmxnet.so(ps::ZMQVan::Start(int)+0x6c) [0x7f6b1ad4d99c]
[bt] (4) /home/admin/mxnet/python/mxnet/../../lib/libmxnet.so(ps::Postoffice::Start(int, char const*, bool)+0x7c) [0x7f6b1ad3d4dc]
[bt] (5) /home/admin/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::kvstore::KVStoreDist::RunServer(std::function<void (int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)> const&)+0x1b4) [0x7f6b1acd8884]
[bt] (6) /home/admin/mxnet/python/mxnet/../../lib/libmxnet.so(MXKVStoreRunServer+0x65) [0x7f6b1ac42315]
[bt] (7) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call_unix64+0x4c) [0x7f6b202bce40]
[bt] (8) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call+0x2eb) [0x7f6b202bc8ab]
[bt] (9) /usr/lib/python2.7/lib-dynload/_ctypes.x86_64-linux-gnu.so(_ctypes_callproc+0x48f) [0x7f6b204cc3df]


^C
root@node1:/home/admin/mxnet/example/image-classification# 

Steps to reproduce

1. Built MXNet on each node, with the config file modified as follows:
USE_OPENCV = 1
USE_BLAS = openblas
USE_CUDA = 0
USE_MKLDNN = 1
USE_DIST_KVSTORE = 1

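After building on node1, I copied MXNet to node2 with scp.

2. Installed the Python language packages.
3. The MNIST example runs fine on a single machine.

4. Set up passwordless SSH login between the nodes:
node1: ssh-copy-id id_rsa root@node2
node2: ssh-copy-id id_rsa root@node1

5. Changed the hostnames.
On node1:
Changed the contents of /etc/hostname to node1
Added a mapping line to /etc/hosts: <public IP> node1

Same for node2.

6. Created a hosts file in both the tools directory and the image-classification directory on each node, containing the public IP of every node (including the current one).

7. Ran this command from the tools directory:
python launch.py -n 2 -H hosts --launcher ssh `which python` ../example/image-classification/train_mnist.py

It prints errors, but MNIST training does run in the end (I am not sure whether it only ran on a single machine), and the MNIST run does not exit on its own after producing the final result.

8. Ran this command from the image-classification directory:
python /home/admin/mxnet/tools/launch.py -n 2 -H hosts --sync-dst-dir /tmp/mxnet python train_mnist.py --kv-store dist_sync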

Fixes I have tried

  1. Reinstalling MXNet
  2. Resetting the system
  3. Using ssh-agent

Current status:
1. Still unable to run distributed jobs

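Is this the tutorial you followed?

Which one do you mean?
For the single-machine setup I followed this one:
Installing MXNet on Ubuntu
For the distributed setup I followed this one:
Distributed Training in MXNet

Yes, the second one.

I did follow that one: I set up the ssh agent and built MXNet from source on every node with USE_DIST_KVSTORE = 1. But running launch.py still fails.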

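Could it be that the node 2 machine does not have the data? @eric-haibin-lin

I don't think so; I ran the example once on each node separately beforehand. Could it matter that I have not set up a shared network filesystem (NFS)?

Could you post your hosts file? The zeromq used inside ps-lite only supports IP addresses, so using hostnames can cause a bind error.
For details, see this issue: https://github.com/apache/incubator-mxnet/issues/14114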

Does your hosts file use hostnames or IP addresses? If it uses hostnames, try using IP addresses directly.
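A quick way to check this is to scan the hosts file for entries that are not literal IP addresses. The following is only a minimal sketch, assuming Python 3 and a hosts file with one entry per line; the function name `non_ip_entries` and the sample contents are made up for illustration.

```python
import ipaddress


def non_ip_entries(hosts_text):
    """Return the non-blank lines that are not literal IP addresses."""
    bad = []
    for line in hosts_text.splitlines():
        entry = line.strip()
        if not entry:
            continue  # ignore blank lines
        try:
            ipaddress.ip_address(entry)
        except ValueError:
            # Not parseable as IPv4/IPv6, so it is probably a hostname.
            bad.append(entry)
    return bad


if __name__ == "__main__":
    sample = "172.16.0.1\nnode2\n"  # hypothetical hosts file contents
    print(non_ip_entries(sample))
```

An empty result means every entry is a literal address; anything returned is worth replacing with the IP before launching again.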

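Hi, it uses IP addresses.

It uses IP addresses. One question: is it enough that the nodes can ssh to each other without a password, or is ssh-agent strictly required?

I see that you used hostnames when setting up passwordless ssh login above. You could try using the IP addresses directly.

Just tried it; that does not work either.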

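I am still fairly sure the bind error comes from the ps-lite side; your hosts file looks fine...
Could you also post the output of ifconfig?
One possibility is that ps-lite is trying to bind to the wrong network interface. As I recall, ps-lite by default starts trying to bind from the first network interface. This is a fairly rare situation, but it can cause problems if multiple network interfaces share the same IP.

Also, you configured passwordless ssh between the two machines, but did you configure passwordless ssh from each machine to itself?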

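Here is the ifconfig output from each node:


What does "ps-lite trying to bind to the wrong network interface" mean? I read in a blog post that MXNet uses random ports for distributed connections; is that related? As far as I can tell, ssh only opens port 22.

Passwordless ssh from each machine to itself is configured; both ssh localhost and ssh to the machine's own public IP work.

Also, when I try to create a kvstore on each node individually, I seem to hit a similar problem, shown below: it runs fine when the argument is local but throws an exception when the argument is dist: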

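For example, if your computer has both a wired connection and wifi, ifconfig will show other network interfaces besides eth0. You want the socket to bind to the wired connection on eth0, but it may end up bound to the wifi interface instead.
Your ifconfig output does not show any extra interfaces, though, so that is probably not the problem here.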

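Also, the ssh port has nothing to do with the ports used for communication. The mxnet + ps-lite logic works like this: a launcher such as ssh or mpi is only responsible for starting the mxnet Python processes (and passing in some environment variables such as num_workers). ps-lite uses its own sockets for the actual communication and searches randomly for a usable port to bind to and listen on. The communication errors you are seeing have nothing to do with the ssh launcher (unless ssh itself fails and the process on the remote side never starts).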
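To make the launcher's role concrete, here is a minimal sketch of the kind of environment a launcher hands to each process. As far as I know, `DMLC_ROLE`, `DMLC_PS_ROOT_URI`, `DMLC_PS_ROOT_PORT`, `DMLC_NUM_WORKER` and `DMLC_NUM_SERVER` are the variables ps-lite reads, but the exact set passed by the tracker may differ between versions. The helper name `ps_env`, the address and the port are hypothetical.

```python
VALID_ROLES = ("scheduler", "server", "worker")


def ps_env(role, root_uri, root_port, num_workers, num_servers):
    """Build the environment variables for one ps-lite process (sketch)."""
    if role not in VALID_ROLES:
        raise ValueError("unknown role: %r" % (role,))
    return {
        "DMLC_ROLE": role,                    # which part this process plays
        "DMLC_PS_ROOT_URI": root_uri,         # address of the scheduler
        "DMLC_PS_ROOT_PORT": str(root_port),  # port of the scheduler
        "DMLC_NUM_WORKER": str(num_workers),
        "DMLC_NUM_SERVER": str(num_servers),
    }


if __name__ == "__main__":
    # Hypothetical values: scheduler on 172.16.0.1:9091, 2 workers, 2 servers.
    for role in VALID_ROLES:
        print(role, ps_env(role, "172.16.0.1", 9091, 2, 2))
```

The point is that ssh only delivers these values and starts the process; every later connection is opened by ps-lite itself on other ports.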

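One more thing: the public IPs of node1 and node2 appear to differ from the IPs printed by ifconfig. I have not tried this myself, but that could also make bind fail.
Your two nodes are on the same subnet, right? Try changing the IP addresses in the hosts file (the public addresses starting with 39) to the two addresses starting with 172 that ifconfig prints.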
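One way to test this idea without MXNet is to ask the OS whether a given address can be bound at all. The sketch below assumes Python 3 and assumes the failure comes from binding to an address that is not assigned to any local interface (on many cloud VMs the public IP is mapped by NAT and never appears in ifconfig). The function name `can_bind_locally` is made up for illustration.

```python
import socket


def can_bind_locally(ip):
    """Return True if a TCP socket can be bound to `ip` on this machine."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((ip, 0))  # port 0 lets the OS pick any free port
        return True
    except OSError:
        # Typically "Cannot assign requested address" for a non-local IP.
        return False
    finally:
        sock.close()


if __name__ == "__main__":
    # Replace with the addresses from your hosts file and from ifconfig.
    for ip in ("127.0.0.1", "39.98.181.196"):
        print(ip, can_bind_locally(ip))
```

If the public address reports False on its own node while the 172 address reports True, the hosts file should use the 172 addresses.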

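That is expected behavior.
local means no distribution at all: a single local process runs, and neither ps-lite nor sockets are involved. local works even if you built without the distributed kvstore flag.
To start, dist requires a launcher (such as the ssh launcher you are using, or another one like the mpi launcher) that starts several processes at once and passes some environment variables to those Python processes. Your error, for example, says that the DMLC_NUM_WORKER environment variable is missing, so it cannot tell how many workers there are.
In short, you cannot use the dist kvstore from the Python command line like this; it has to go through a launcher.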

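One workable way to debug is to change all the IP addresses in the hosts file to 127.0.0.1 and start several processes on a single machine to run distributed mxnet, then see whether that succeeds. If you can, please try it and post the result.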
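A minimal sketch of preparing that loopback test, assuming Python 3: it writes a hosts file containing only 127.0.0.1 and assembles the same launch.py command used earlier in this thread, without running it. The helper names and the choice of one line per process are assumptions of mine.

```python
import os
import tempfile


def write_loopback_hosts(path, num_nodes):
    """Write a hosts file with one 127.0.0.1 line per process."""
    with open(path, "w") as f:
        for _ in range(num_nodes):
            f.write("127.0.0.1\n")


def launch_command(launch_py, hosts_path, num_nodes):
    """Assemble (not run) the launch.py command line from this thread."""
    return [
        "python", launch_py,
        "-n", str(num_nodes),
        "-H", hosts_path,
        "python", "train_mnist.py", "--kv-store", "dist_sync",
    ]


if __name__ == "__main__":
    hosts_path = os.path.join(tempfile.mkdtemp(), "hosts")
    write_loopback_hosts(hosts_path, 2)
    cmd = launch_command("/home/admin/mxnet/tools/launch.py", hosts_path, 2)
    print(" ".join(cmd))
```

Run the printed command from the image-classification directory. If it trains without the bind error, the build itself is fine and the problem lies in the addresses in the real hosts file.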