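Could someone please take a look? The main problem is as described in the title; the environment and the detailed logs for reproducing it are below. Update: solved. See the answer below for the analysis. Thanks to everyone for the pointers.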
Environment
node1 environment:
Output of the diagnose script, run from the tools directory:
root@node1:/home/admin/mxnet/tools# python diagnose.py
----------Python Info----------
('Version :', '2.7.12')
('Compiler :', 'GCC 5.4.0 20160609')
('Build :', ('default', 'Nov 19 2016 06:48:10'))
('Arch :', ('64bit', 'ELF'))
------------Pip Info-----------
('Version :', '19.0.3')
('Directory :', '/usr/local/lib/python2.7/dist-packages/pip')
----------MXNet Info-----------
('Version :', '1.5.0')
('Directory :', '/home/admin/mxnet/python/mxnet')
Hashtag not found. Not installed from pre-built package.
----------System Info----------
('Platform :', 'Linux-4.4.0-93-generic-x86_64-with-Ubuntu-16.04-xenial')
('system :', 'Linux')
('node :', 'node1')
('release :', '4.4.0-93-generic')
('version :', '#116-Ubuntu SMP Fri Aug 11 21:17:51 UTC 2017')
----------Hardware Info----------
('machine :', 'x86_64')
('processor :', 'x86_64')
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 1
On-line CPU(s) list: 0
Thread(s) per core: 1
Core(s) per socket: 1
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Platinum 8163 CPU @ 2.50GHz
Stepping: 4
CPU MHz: 2499.994
BogoMIPS: 4999.98
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 33792K
NUMA node0 CPU(s): 0
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl eagerfpu pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f rdseed adx smap avx512cd xsaveopt xsavec xgetbv1
----------Network Test----------
Setting timeout: 10
Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0002 sec, LOAD: 1.3775 sec.
Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0004 sec, LOAD: 2.9173 sec.
Timing for FashionMNIST: https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.5871 sec, LOAD: 1.2784 sec.
Timing for Conda: https://repo.continuum.io/pkgs/free/, DNS: 0.1930 sec, LOAD: 2.0958 sec.
Timing for Gluon Tutorial(en): http://gluon.mxnet.io, DNS: 0.5681 sec, LOAD: 2.3481 sec.
Timing for Gluon Tutorial(cn): https://zh.gluon.ai, DNS: 1.3597 sec, LOAD: 2.0816 sec.
node2 environment:
root@node2:/home/admin/mxnet/tools# python diagnose.py
----------Python Info----------
('Version :', '2.7.12')
('Compiler :', 'GCC 5.4.0 20160609')
('Build :', ('default', 'Nov 19 2016 06:48:10'))
('Arch :', ('64bit', 'ELF'))
------------Pip Info-----------
('Version :', '19.0.3')
('Directory :', '/usr/local/lib/python2.7/dist-packages/pip')
----------MXNet Info-----------
('Version :', '1.5.0')
('Directory :', '/home/admin/mxnet/python/mxnet')
Hashtag not found. Not installed from pre-built package.
----------System Info----------
('Platform :', 'Linux-4.4.0-93-generic-x86_64-with-Ubuntu-16.04-xenial')
('system :', 'Linux')
('node :', 'node2')
('release :', '4.4.0-93-generic')
('version :', '#116-Ubuntu SMP Fri Aug 11 21:17:51 UTC 2017')
----------Hardware Info----------
('machine :', 'x86_64')
('processor :', 'x86_64')
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 1
On-line CPU(s) list: 0
Thread(s) per core: 1
Core(s) per socket: 1
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 79
Model name: Intel(R) Xeon(R) CPU E5-2682 v4 @ 2.50GHz
Stepping: 1
CPU MHz: 2494.220
BogoMIPS: 4988.44
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 40960K
NUMA node0 CPU(s): 0
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl eagerfpu pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx smap xsaveopt
----------Network Test----------
Setting timeout: 10
Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0001 sec, LOAD: 2.0057 sec.
Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.1153 sec, LOAD: 29.6195 sec.
Timing for FashionMNIST: https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.1215 sec, LOAD: 2.8773 sec.
Timing for Conda: https://repo.continuum.io/pkgs/free/, DNS: 0.1650 sec, LOAD: 2.7257 sec.
Timing for Gluon Tutorial(en): http://gluon.mxnet.io, DNS: 0.2725 sec, LOAD: 5.3097 sec.
Timing for Gluon Tutorial(cn): https://zh.gluon.ai, DNS: 0.2257 sec, LOAD: 2.7739 sec.
Problem log
Attempt 1: run launch.py from the tools directory with the following command:
python launch.py -n 2 -H hosts --launcher ssh `which python` ../example/image-classification/train_mnist.py
Error output 1:
root@node1:/home/admin/mxnet/tools# python launch.py -n 2 -H hosts --launcher ssh `which python` ../example/image-classification/train_mnist.py
Traceback (most recent call last):
File "../example/image-classification/train_mnist.py", line 25, in <module>
from common import find_mxnet, fit
File "/home/admin/mxnet/example/image-classification/common/find_mxnet.py", line 20, in <module>
import mxnet as mx
File "/home/admin/mxnet/python/mxnet/__init__.py", line 91, in <module>
from . import kvstore_server
File "/home/admin/mxnet/python/mxnet/kvstore_server.py", line 85, in <module>
_init_kvstore_server_module()
File "/home/admin/mxnet/python/mxnet/kvstore_server.py", line 82, in _init_kvstore_server_module
server.run()
File "/home/admin/mxnet/python/mxnet/kvstore_server.py", line 73, in run
check_call(_LIB.MXKVStoreRunServer(self.handle, _ctrl_proto(self._controller()), None))
File "/home/admin/mxnet/python/mxnet/base.py", line 252, in check_call
raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [11:29:01] src/van.cc:291: Check failed: (my_node_.port) != (-1) bind failed
Stack trace returned 10 entries:
[bt] (0) /home/admin/mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::StackTrace[abi:cxx11]()+0x1bc) [0x7f62d3b7c54c]
[bt] (1) /home/admin/mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x28) [0x7f62d3b7d8c8]
[bt] (2) /home/admin/mxnet/python/mxnet/../../lib/libmxnet.so(ps::Van::Start(int)+0x9e8) [0x7f62d7076ea8]
[bt] (3) /home/admin/mxnet/python/mxnet/../../lib/libmxnet.so(ps::ZMQVan::Start(int)+0x6c) [0x7f62d708199c]
[bt] (4) /home/admin/mxnet/python/mxnet/../../lib/libmxnet.so(ps::Postoffice::Start(int, char const*, bool)+0x7c) [0x7f62d70714dc]
[bt] (5) /home/admin/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::kvstore::KVStoreDist::RunServer(std::function<void (int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)> const&)+0x1b4) [0x7f62d700c884]
[bt] (6) /home/admin/mxnet/python/mxnet/../../lib/libmxnet.so(MXKVStoreRunServer+0x65) [0x7f62d6f76315]
[bt] (7) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call_unix64+0x4c) [0x7f62dc5f0e40]
[bt] (8) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call+0x2eb) [0x7f62dc5f08ab]
[bt] (9) /usr/lib/python2.7/lib-dynload/_ctypes.x86_64-linux-gnu.so(_ctypes_callproc+0x48f) [0x7f62dc8003df]
Exception in thread Thread-1:
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
self.run()
File "/usr/lib/python2.7/threading.py", line 754, in run
self.__target(*self.__args, **self.__kwargs)
File "/home/admin/mxnet/tools/../3rdparty/dmlc-core/tracker/dmlc_tracker/tracker.py", line 366, in <lambda>
target=(lambda: subprocess.check_call(self.cmd, env=env, shell=True)), args=())
File "/usr/lib/python2.7/subprocess.py", line 541, in check_call
raise CalledProcessError(retcode, cmd)
CalledProcessError: Command '/usr/bin/python ../example/image-classification/train_mnist.py' returned non-zero exit status 1
root@node1:/home/admin/mxnet/tools# INFO:root:start with arguments Namespace(add_stn=False, batch_size=64, disp_batches=100, dtype='float32', gc_threshold=0.5, gc_type='none', gpus=None, image_shape='1, 28, 28', initializer='default', kv_store='device', load_epoch=None, loss='', lr=0.05, lr_factor=0.1, lr_step_epochs='10', macrobatch_size=0, model_prefix=None, mom=0.9, monitor=0, network='mlp', num_classes=10, num_epochs=20, num_examples=60000, num_layers=None, optimizer='sgd', profile_server_suffix='', profile_worker_suffix='', save_period=1, test_io=0, top_k=0, warmup_epochs=5, warmup_strategy='linear', wd=0.0001)
Traceback (most recent call last):
File "../example/image-classification/train_mnist.py", line 25, in <module>
from common import find_mxnet, fit
File "/home/admin/mxnet/example/image-classification/common/find_mxnet.py", line 20, in <module>
import mxnet as mx
File "/home/admin/mxnet/python/mxnet/__init__.py", line 91, in <module>
from . import kvstore_server
File "/home/admin/mxnet/python/mxnet/kvstore_server.py", line 85, in <module>
_init_kvstore_server_module()
File "/home/admin/mxnet/python/mxnet/kvstore_server.py", line 82, in _init_kvstore_server_module
server.run()
File "/home/admin/mxnet/python/mxnet/kvstore_server.py", line 73, in run
check_call(_LIB.MXKVStoreRunServer(self.handle, _ctrl_proto(self._controller()), None))
File "/home/admin/mxnet/python/mxnet/base.py", line 252, in check_call
raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [11:29:01] src/van.cc:291: Check failed: (my_node_.port) != (-1) bind failed
Stack trace returned 10 entries:
[bt] (0) /home/admin/mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::StackTrace[abi:cxx11]()+0x1bc) [0x7f6169f8a54c]
[bt] (1) /home/admin/mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x28) [0x7f6169f8b8c8]
[bt] (2) /home/admin/mxnet/python/mxnet/../../lib/libmxnet.so(ps::Van::Start(int)+0x9e8) [0x7f616d484ea8]
[bt] (3) /home/admin/mxnet/python/mxnet/../../lib/libmxnet.so(ps::ZMQVan::Start(int)+0x6c) [0x7f616d48f99c]
[bt] (4) /home/admin/mxnet/python/mxnet/../../lib/libmxnet.so(ps::Postoffice::Start(int, char const*, bool)+0x7c) [0x7f616d47f4dc]
[bt] (5) /home/admin/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::kvstore::KVStoreDist::RunServer(std::function<void (int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)> const&)+0x1b4) [0x7f616d41a884]
[bt] (6) /home/admin/mxnet/python/mxnet/../../lib/libmxnet.so(MXKVStoreRunServer+0x65) [0x7f616d384315]
[bt] (7) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call_unix64+0x4c) [0x7f61729fee40]
[bt] (8) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call+0x2eb) [0x7f61729fe8ab]
[bt] (9) /usr/lib/python2.7/lib-dynload/_ctypes.x86_64-linux-gnu.so(_ctypes_callproc+0x48f) [0x7f6172c0e3df]
Traceback (most recent call last):
File "../example/image-classification/train_mnist.py", line 25, in <module>
from common import find_mxnet, fit
File "/home/admin/mxnet/example/image-classification/common/find_mxnet.py", line 20, in <module>
import mxnet as mx
File "/home/admin/mxnet/python/mxnet/__init__.py", line 91, in <module>
from . import kvstore_server
File "/home/admin/mxnet/python/mxnet/kvstore_server.py", line 85, in <module>
_init_kvstore_server_module()
File "/home/admin/mxnet/python/mxnet/kvstore_server.py", line 82, in _init_kvstore_server_module
server.run()
File "/home/admin/mxnet/python/mxnet/kvstore_server.py", line 73, in run
check_call(_LIB.MXKVStoreRunServer(self.handle, _ctrl_proto(self._controller()), None))
File "/home/admin/mxnet/python/mxnet/base.py", line 252, in check_call
raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [11:29:02] src/van.cc:291: Check failed: (my_node_.port) != (-1) bind failed
Stack trace returned 10 entries:
[bt] (0) /home/admin/mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::StackTrace[abi:cxx11]()+0x1bc) [0x7f623a41254c]
[bt] (1) /home/admin/mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x28) [0x7f623a4138c8]
[bt] (2) /home/admin/mxnet/python/mxnet/../../lib/libmxnet.so(ps::Van::Start(int)+0x9e8) [0x7f623d90cea8]
[bt] (3) /home/admin/mxnet/python/mxnet/../../lib/libmxnet.so(ps::ZMQVan::Start(int)+0x6c) [0x7f623d91799c]
[bt] (4) /home/admin/mxnet/python/mxnet/../../lib/libmxnet.so(ps::Postoffice::Start(int, char const*, bool)+0x7c) [0x7f623d9074dc]
[bt] (5) /home/admin/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::kvstore::KVStoreDist::RunServer(std::function<void (int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)> const&)+0x1b4) [0x7f623d8a2884]
[bt] (6) /home/admin/mxnet/python/mxnet/../../lib/libmxnet.so(MXKVStoreRunServer+0x65) [0x7f623d80c315]
[bt] (7) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call_unix64+0x4c) [0x7f6242e86e40]
[bt] (8) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call+0x2eb) [0x7f6242e868ab]
[bt] (9) /usr/lib/python2.7/lib-dynload/_ctypes.x86_64-linux-gnu.so(_ctypes_callproc+0x48f) [0x7f62430963df]
INFO:root:start with arguments Namespace(add_stn=False, batch_size=64, disp_batches=100, dtype='float32', gc_threshold=0.5, gc_type='none', gpus=None, image_shape='1, 28, 28', initializer='default', kv_store='device', load_epoch=None, loss='', lr=0.05, lr_factor=0.1, lr_step_epochs='10', macrobatch_size=0, model_prefix=None, mom=0.9, monitor=0, network='mlp', num_classes=10, num_epochs=20, num_examples=60000, num_layers=None, optimizer='sgd', profile_server_suffix='', profile_worker_suffix='', save_period=1, test_io=0, top_k=0, warmup_epochs=5, warmup_strategy='linear', wd=0.0001)
INFO:root:Epoch[0] Batch [0-100] Speed: 21531.49 samples/sec accuracy=0.788057
INFO:root:Epoch[0] Batch [100-200] Speed: 21412.12 samples/sec accuracy=0.899062
INFO:root:Epoch[0] Batch [200-300] Speed: 22038.35 samples/sec accuracy=0.932031
INFO:root:Epoch[0] Batch [0-100] Speed: 25505.21 samples/sec accuracy=0.771813
INFO:root:Epoch[0] Batch [300-400] Speed: 20160.85 samples/sec accuracy=0.933594
INFO:root:Epoch[0] Batch [100-200] Speed: 22780.91 samples/sec accuracy=0.915937
INFO:root:Epoch[0] Batch [400-500] Speed: 22050.10 samples/sec accuracy=0.948750
INFO:root:Epoch[0] Batch [200-300] Speed: 25171.39 samples/sec accuracy=0.930000
INFO:root:Epoch[0] Batch [500-600] Speed: 22135.37 samples/sec accuracy=0.949063
INFO:root:Epoch[0] Batch [300-400] Speed: 25702.61 samples/sec accuracy=0.934219
INFO:root:Epoch[0] Batch [600-700] Speed: 21914.58 samples/sec accuracy=0.952969
INFO:root:Epoch[0] Batch [400-500] Speed: 23535.12 samples/sec accuracy=0.946406
INFO:root:Epoch[0] Batch [700-800] Speed: 22056.04 samples/sec accuracy=0.953750
INFO:root:Epoch[0] Batch [500-600] Speed: 25957.79 samples/sec accuracy=0.946719
INFO:root:Epoch[0] Batch [800-900] Speed: 21995.24 samples/sec accuracy=0.958125
INFO:root:Epoch[0] Batch [600-700] Speed: 25738.46 samples/sec accuracy=0.952500
INFO:root:Epoch[0] Train-accuracy=0.925323
INFO:root:Epoch[0] Time cost=2.781
INFO:root:Epoch[0] Validation-accuracy=0.955812
...
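One detail in the log above: the `Namespace(...)` line reports `kv_store='device'`, because the command in attempt 1 does not pass `--kv-store dist_sync`. The two interleaved sets of `Epoch[0]` lines therefore look like two workers each training on their own rather than one distributed job. A minimal sketch for pulling that value out of a log line is below; the helper names are made up for illustration, and the sample line is an abbreviated copy of the one in the log.

```python
import re

def kv_store_from_log(line):
    """Return the kv_store value from a 'start with arguments Namespace(...)' line, or None."""
    match = re.search(r"kv_store='([^']*)'", line)
    return match.group(1) if match else None

def is_distributed(kv_store):
    # Distributed kvstore types are named with a "dist" prefix (dist_sync, dist_async, ...);
    # 'local' and 'device' are single-machine types.
    return kv_store is not None and kv_store.startswith("dist")

# Abbreviated copy of the log line from attempt 1
line = "INFO:root:start with arguments Namespace(batch_size=64, kv_store='device', lr=0.05)"
kv = kv_store_from_log(line)
print(kv, is_distributed(kv))
```

If the value is not a `dist_*` type, the training speed and accuracy lines say nothing about whether the distributed setup works.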
Attempt 2: run the following command from the image-classification directory:
python /home/admin/mxnet/tools/launch.py -n 2 -H hosts --sync-dst-dir /tmp/mxnet python train_mnist.py --kv-store dist_sync
Error output 2:
root@node1:/home/admin/mxnet/example/image-classification# python /home/admin/mxnet/tools/launch.py -n 2 -H hosts --sync-dst-dir /tmp/mxnet python train_mnist.py --kv-store dist_sync
2019-03-13 11:55:01,733 INFO rsync /home/admin/mxnet/example/image-classification/ -> 39.98.181.196:/tmp/mxnet
2019-03-13 11:55:01,734 INFO rsync /home/admin/mxnet/example/image-classification/ -> 39.107.232.93:/tmp/mxnet
Traceback (most recent call last):
File "train_mnist.py", line 25, in <module>
from common import find_mxnet, fit
File "/home/admin/mxnet/example/image-classification/common/find_mxnet.py", line 20, in <module>
import mxnet as mx
File "/home/admin/mxnet/python/mxnet/__init__.py", line 91, in <module>
from . import kvstore_server
File "/home/admin/mxnet/python/mxnet/kvstore_server.py", line 85, in <module>
_init_kvstore_server_module()
File "/home/admin/mxnet/python/mxnet/kvstore_server.py", line 82, in _init_kvstore_server_module
server.run()
File "/home/admin/mxnet/python/mxnet/kvstore_server.py", line 73, in run
check_call(_LIB.MXKVStoreRunServer(self.handle, _ctrl_proto(self._controller()), None))
File "/home/admin/mxnet/python/mxnet/base.py", line 252, in check_call
raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [11:55:03] src/van.cc:291: Check failed: (my_node_.port) != (-1) bind failed
Stack trace returned 10 entries:
[bt] (0) /home/admin/mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::StackTrace[abi:cxx11]()+0x1bc) [0x7fee400ce54c]
[bt] (1) /home/admin/mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x28) [0x7fee400cf8c8]
[bt] (2) /home/admin/mxnet/python/mxnet/../../lib/libmxnet.so(ps::Van::Start(int)+0x9e8) [0x7fee435c8ea8]
[bt] (3) /home/admin/mxnet/python/mxnet/../../lib/libmxnet.so(ps::ZMQVan::Start(int)+0x6c) [0x7fee435d399c]
[bt] (4) /home/admin/mxnet/python/mxnet/../../lib/libmxnet.so(ps::Postoffice::Start(int, char const*, bool)+0x7c) [0x7fee435c34dc]
[bt] (5) /home/admin/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::kvstore::KVStoreDist::RunServer(std::function<void (int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)> const&)+0x1b4) [0x7fee4355e884]
[bt] (6) /home/admin/mxnet/python/mxnet/../../lib/libmxnet.so(MXKVStoreRunServer+0x65) [0x7fee434c8315]
[bt] (7) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call_unix64+0x4c) [0x7fee48b42e40]
[bt] (8) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call+0x2eb) [0x7fee48b428ab]
[bt] (9) /usr/lib/python2.7/lib-dynload/_ctypes.x86_64-linux-gnu.so(_ctypes_callproc+0x48f) [0x7fee48d523df]
Exception in thread Thread-1:
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
self.run()
File "/usr/lib/python2.7/threading.py", line 754, in run
self.__target(*self.__args, **self.__kwargs)
File "/home/admin/mxnet/tools/../3rdparty/dmlc-core/tracker/dmlc_tracker/tracker.py", line 366, in <lambda>
target=(lambda: subprocess.check_call(self.cmd, env=env, shell=True)), args=())
File "/usr/lib/python2.7/subprocess.py", line 541, in check_call
raise CalledProcessError(retcode, cmd)
CalledProcessError: Command 'python train_mnist.py --kv-store dist_sync' returned non-zero exit status 1
root@node1:/home/admin/mxnet/example/image-classification# Traceback (most recent call last):
File "train_mnist.py", line 25, in <module>
from common import find_mxnet, fit
File "/tmp/mxnet/common/find_mxnet.py", line 20, in <module>
import mxnet as mx
File "/home/admin/mxnet/python/mxnet/__init__.py", line 91, in <module>
from . import kvstore_server
Traceback (most recent call last):
File "train_mnist.py", line 97, in <module>
fit.fit(args, sym, get_mnist_iter)
File "/tmp/mxnet/common/fit.py", line 156, in fit
kv = mx.kvstore.create(args.kv_store)
File "/home/admin/mxnet/python/mxnet/kvstore_server.py", line 85, in <module>
_init_kvstore_server_module()
File "/home/admin/mxnet/python/mxnet/kvstore_server.py", line 82, in _init_kvstore_server_module
server.run()
File "/home/admin/mxnet/python/mxnet/kvstore_server.py", line 73, in run
check_call(_LIB.MXKVStoreRunServer(self.handle, _ctrl_proto(self._controller()), None))
File "/home/admin/mxnet/python/mxnet/base.py", line 252, in check_call
raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [11:55:04] src/van.cc:291: Check failed: (my_node_.port) != (-1) bind failed
Stack trace returned 10 entries:
[bt] (0) /home/admin/mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::StackTrace[abi:cxx11]()+0x1bc) [0x7fe87fcd054c]
[bt] (1) /home/admin/mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x28) [0x7fe87fcd18c8]
[bt] (2) /home/admin/mxnet/python/mxnet/../../lib/libmxnet.so(ps::Van::Start(int)+0x9e8) [0x7fe8831caea8]
[bt] (3) /home/admin/mxnet/python/mxnet/../../lib/libmxnet.so(ps::ZMQVan::Start(int)+0x6c) [0x7fe8831d599c]
[bt] (4) /home/admin/mxnet/python/mxnet/../../lib/libmxnet.so(ps::Postoffice::Start(int, char const*, bool)+0x7c) [0x7fe8831c54dc]
[bt] (5) /home/admin/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::kvstore::KVStoreDist::RunServer(std::function<void (int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)> const&)+0x1b4) [0x7fe883160884]
[bt] (6) /home/admin/mxnet/python/mxnet/../../lib/libmxnet.so(MXKVStoreRunServer+0x65) [0x7fe8830ca315]
[bt] (7) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call_unix64+0x4c) [0x7fe888744e40]
[bt] (8) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call+0x2eb) [0x7fe8887448ab]
[bt] (9) /usr/lib/python2.7/lib-dynload/_ctypes.x86_64-linux-gnu.so(_ctypes_callproc+0x48f) [0x7fe8889543df]
File "/home/admin/mxnet/python/mxnet/kvstore.py", line 674, in create
ctypes.byref(handle)))
File "/home/admin/mxnet/python/mxnet/base.py", line 252, in check_call
raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [11:55:04] src/van.cc:291: Check failed: (my_node_.port) != (-1) bind failed
Stack trace returned 10 entries:
[bt] (0) /home/admin/mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::StackTrace[abi:cxx11]()+0x1bc) [0x7f45d238c54c]
[bt] (1) /home/admin/mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x28) [0x7f45d238d8c8]
[bt] (2) /home/admin/mxnet/python/mxnet/../../lib/libmxnet.so(ps::Van::Start(int)+0x9e8) [0x7f45d5886ea8]
[bt] (3) /home/admin/mxnet/python/mxnet/../../lib/libmxnet.so(ps::ZMQVan::Start(int)+0x6c) [0x7f45d589199c]
[bt] (4) /home/admin/mxnet/python/mxnet/../../lib/libmxnet.so(ps::Postoffice::Start(int, char const*, bool)+0x7c) [0x7f45d58814dc]
[bt] (5) /home/admin/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::KVStore::Create(char const*)+0x701) [0x7f45d57f89b1]
[bt] (6) /home/admin/mxnet/python/mxnet/../../lib/libmxnet.so(MXKVStoreCreate+0x20) [0x7f45d5785510]
[bt] (7) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call_unix64+0x4c) [0x7f45dae00e40]
[bt] (8) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call+0x2eb) [0x7f45dae008ab]
[bt] (9) /usr/lib/python2.7/lib-dynload/_ctypes.x86_64-linux-gnu.so(_ctypes_callproc+0x48f) [0x7f45db0103df]
Traceback (most recent call last):
File "train_mnist.py", line 97, in <module>
fit.fit(args, sym, get_mnist_iter)
File "/tmp/mxnet/common/fit.py", line 156, in fit
kv = mx.kvstore.create(args.kv_store)
File "/home/admin/mxnet/python/mxnet/kvstore.py", line 674, in create
ctypes.byref(handle)))
File "/home/admin/mxnet/python/mxnet/base.py", line 252, in check_call
raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [11:55:04] src/van.cc:291: Check failed: (my_node_.port) != (-1) bind failed
Stack trace returned 10 entries:
[bt] (0) /home/admin/mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::StackTrace[abi:cxx11]()+0x1bc) [0x7f5c8e1ed54c]
[bt] (1) /home/admin/mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x28) [0x7f5c8e1ee8c8]
[bt] (2) /home/admin/mxnet/python/mxnet/../../lib/libmxnet.so(ps::Van::Start(int)+0x9e8) [0x7f5c916e7ea8]
[bt] (3) /home/admin/mxnet/python/mxnet/../../lib/libmxnet.so(ps::ZMQVan::Start(int)+0x6c) [0x7f5c916f299c]
[bt] (4) /home/admin/mxnet/python/mxnet/../../lib/libmxnet.so(ps::Postoffice::Start(int, char const*, bool)+0x7c) [0x7f5c916e24dc]
[bt] (5) /home/admin/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::KVStore::Create(char const*)+0x701) [0x7f5c916599b1]
[bt] (6) /home/admin/mxnet/python/mxnet/../../lib/libmxnet.so(MXKVStoreCreate+0x20) [0x7f5c915e6510]
[bt] (7) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call_unix64+0x4c) [0x7f5c96c61e40]
[bt] (8) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call+0x2eb) [0x7f5c96c618ab]
[bt] (9) /usr/lib/python2.7/lib-dynload/_ctypes.x86_64-linux-gnu.so(_ctypes_callproc+0x48f) [0x7f5c96e713df]
Traceback (most recent call last):
File "train_mnist.py", line 25, in <module>
from common import find_mxnet, fit
File "/tmp/mxnet/common/find_mxnet.py", line 20, in <module>
import mxnet as mx
File "/home/admin/mxnet/python/mxnet/__init__.py", line 91, in <module>
from . import kvstore_server
File "/home/admin/mxnet/python/mxnet/kvstore_server.py", line 85, in <module>
_init_kvstore_server_module()
File "/home/admin/mxnet/python/mxnet/kvstore_server.py", line 82, in _init_kvstore_server_module
server.run()
File "/home/admin/mxnet/python/mxnet/kvstore_server.py", line 73, in run
check_call(_LIB.MXKVStoreRunServer(self.handle, _ctrl_proto(self._controller()), None))
File "/home/admin/mxnet/python/mxnet/base.py", line 252, in check_call
raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [11:55:04] src/van.cc:291: Check failed: (my_node_.port) != (-1) bind failed
Stack trace returned 10 entries:
[bt] (0) /home/admin/mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::StackTrace[abi:cxx11]()+0x1bc) [0x7f6b1784854c]
[bt] (1) /home/admin/mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x28) [0x7f6b178498c8]
[bt] (2) /home/admin/mxnet/python/mxnet/../../lib/libmxnet.so(ps::Van::Start(int)+0x9e8) [0x7f6b1ad42ea8]
[bt] (3) /home/admin/mxnet/python/mxnet/../../lib/libmxnet.so(ps::ZMQVan::Start(int)+0x6c) [0x7f6b1ad4d99c]
[bt] (4) /home/admin/mxnet/python/mxnet/../../lib/libmxnet.so(ps::Postoffice::Start(int, char const*, bool)+0x7c) [0x7f6b1ad3d4dc]
[bt] (5) /home/admin/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::kvstore::KVStoreDist::RunServer(std::function<void (int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)> const&)+0x1b4) [0x7f6b1acd8884]
[bt] (6) /home/admin/mxnet/python/mxnet/../../lib/libmxnet.so(MXKVStoreRunServer+0x65) [0x7f6b1ac42315]
[bt] (7) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call_unix64+0x4c) [0x7f6b202bce40]
[bt] (8) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call+0x2eb) [0x7f6b202bc8ab]
[bt] (9) /usr/lib/python2.7/lib-dynload/_ctypes.x86_64-linux-gnu.so(_ctypes_callproc+0x48f) [0x7f6b204cc3df]
^C
root@node1:/home/admin/mxnet/example/image-classification#
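Several processes print into the same terminal here, so it is hard to tell which traceback belongs to which role. Judging from the tracebacks, the ones that go through `kvstore_server.py` fail during `import mxnet` (which looks like the server or scheduler side), while the ones that go through `mx.kvstore.create` come from workers. A debugging aid that might help, as a sketch: print the `DMLC_*` environment variables that the launcher passes to each process, at the very top of the training script and before `import mxnet`. The helper name `dmlc_env` is hypothetical.

```python
import os

def dmlc_env(environ=None):
    """Return the DMLC_* variables as a sorted list of (name, value) pairs."""
    environ = os.environ if environ is None else environ
    return sorted((k, v) for k, v in environ.items() if k.startswith("DMLC_"))

# Prints nothing when the script is started by hand rather than through launch.py
for name, value in dmlc_env():
    print("%s=%s" % (name, value))
```

Variables such as `DMLC_ROLE` and `DMLC_PS_ROOT_URI` should then show which role each failing process had and which address it was told to use for the scheduler.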
Steps to reproduce
1. Build MXNet on each node with the following changes to the config file:
USE_OPENCV = 1
USE_BLAS = openblas
USE_CUDA = 0
USE_MKLDNN = 1
USE_DIST_KVSTORE = 1
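After building on node1, I copied MXNet to node2 with scp.
2. Then installed the Python language packages.
3. The MNIST example runs correctly on a single machine.
4. Set up passwordless SSH login between the nodes:
node1: ssh-copy-id id_rsa root@node2
node2: ssh-copy-id id_rsa root@node1
5. Changed the hostnames
On node1:
Set the contents of /etc/hostname to node1
Added a mapping line to /etc/hosts: <public IP> node1
Same for node2.
6. Created a hosts file in both the tools directory and the image-classification directory on each node, containing the public IPs of all nodes (including the current node).
7. Ran this command from the tools directory:
python launch.py -n 2 -H hosts --launcher ssh `which python` ../example/image-classification/train_mnist.py
It prints the error messages, but MNIST training still runs in the end (I am not sure whether it only ran on a single machine), and the training process does not exit on its own after computing the final result.
8. Ran this command from the image-classification directory:
python /home/admin/mxnet/tools/launch.py -n 2 -H hosts --sync-dst-dir /tmp/mxnet python train_mnist.py --kv-store dist_sync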
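Every trace ends in `Check failed: (my_node_.port) != (-1) bind failed` inside `ps::Van::Start`, meaning the process could not bind its listening socket. Steps 5 and 6 use public IPs everywhere, so one thing that may be worth ruling out (an assumption, not a confirmed diagnosis) is that the public IP is not actually assigned to a network interface on the VM, which is common when a cloud provider maps it through NAT. The sketch below simply tries to bind a TCP socket to a given address; `can_bind` is a hypothetical helper, not part of MXNet.

```python
import socket

def can_bind(address, port=0):
    """Return True if a TCP socket can be bound to (address, port) on this machine."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        # Port 0 asks the OS for any free port, so only the address is being tested
        sock.bind((address, port))
        return True
    except (socket.error, OSError):
        return False
    finally:
        sock.close()

# Loopback is used here only so the example runs anywhere
print(can_bind("127.0.0.1"))
```

On each node, calling `can_bind` with the IP listed for that node in the hosts file shows whether the address is bindable locally. If it is not, putting the private (intranet) IPs into the hosts file and /etc/hosts might be worth trying.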
Fixes attempted
- Reinstalling MXNet
- Resetting the system
- Using ssh-agent
Current status:
1. Still unable to run distributed jobs