Ubuntu: distributed CPU deployment following the official tutorial works on a single machine but fails in distributed mode (solved)

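The addresses starting with 172 are each node's private (intranet) IP. Right now ssh cannot reach the other node's private IP; it seems the nodes have to be added to the same security group before they can talk to each other over the intranet. Let me try that.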

Well, I tried it.
The command I used:
python launch.py -n 3 -H host1 --launcher ssh python /home/admin/mxnet/example/image-classification/train_mnist.py --kv-store dist_sync

The result:

root@node1:/home/admin/mxnet/tools# ssh localhost
Welcome to Ubuntu 16.04.3 LTS (GNU/Linux 4.4.0-93-generic x86_64)

 * Documentation:  https://help.ubuntu.com
 * Management:     https://landscape.canonical.com
 * Support:        https://ubuntu.com/advantage
New release '18.04.2 LTS' available.
Run 'do-release-upgrade' to upgrade to it.


Welcome to Alibaba Cloud Elastic Compute Service !

Last login: Thu Mar 21 16:28:22 2019 from 218.17.207.121
root@node1:~# exit
logout
Connection to localhost closed.
root@node1:/home/admin/mxnet/tools# python launch.py -n 3 -H host1 --launcher ssh python /home/admin/mxnet/example/image-classification/train_mnist.py --kv-store dist_sync
Traceback (most recent call last):
  File "/home/admin/mxnet/example/image-classification/train_mnist.py", line 25, in <module>
    from common import find_mxnet, fit
  File "/home/admin/mxnet/example/image-classification/common/find_mxnet.py", line 20, in <module>
    import mxnet as mx
  File "/home/admin/mxnet/python/mxnet/__init__.py", line 91, in <module>
    from . import kvstore_server
  File "/home/admin/mxnet/python/mxnet/kvstore_server.py", line 85, in <module>
    _init_kvstore_server_module()
  File "/home/admin/mxnet/python/mxnet/kvstore_server.py", line 82, in _init_kvstore_server_module
    server.run()
  File "/home/admin/mxnet/python/mxnet/kvstore_server.py", line 73, in run
    check_call(_LIB.MXKVStoreRunServer(self.handle, _ctrl_proto(self._controller()), None))
  File "/home/admin/mxnet/python/mxnet/base.py", line 252, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [16:33:15] src/van.cc:291: Check failed: (my_node_.port) != (-1) bind failed

Stack trace returned 10 entries:
[bt] (0) /home/admin/mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::StackTrace[abi:cxx11]()+0x1bc) [0x7efd08357d3c]
[bt] (1) /home/admin/mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x28) [0x7efd083590b8]
[bt] (2) /home/admin/mxnet/python/mxnet/../../lib/libmxnet.so(ps::Van::Start(int)+0x9e8) [0x7efd0b856b38]
[bt] (3) /home/admin/mxnet/python/mxnet/../../lib/libmxnet.so(ps::ZMQVan::Start(int)+0x6c) [0x7efd0b86162c]
[bt] (4) /home/admin/mxnet/python/mxnet/../../lib/libmxnet.so(ps::Postoffice::Start(int, char const*, bool)+0x7c) [0x7efd0b85116c]
[bt] (5) /home/admin/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::kvstore::KVStoreDist::RunServer(std::function<void (int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)> const&)+0x1b4) [0x7efd0b7ec514]
[bt] (6) /home/admin/mxnet/python/mxnet/../../lib/libmxnet.so(MXKVStoreRunServer+0x65) [0x7efd0b755fa5]
[bt] (7) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call_unix64+0x4c) [0x7efd10798e40]
[bt] (8) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call+0x2eb) [0x7efd107988ab]
[bt] (9) /usr/lib/python2.7/lib-dynload/_ctypes.x86_64-linux-gnu.so(_ctypes_callproc+0x48f) [0x7efd109a83df]


Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/home/admin/mxnet/tools/../3rdparty/dmlc-core/tracker/dmlc_tracker/tracker.py", line 366, in <lambda>
    target=(lambda: subprocess.check_call(self.cmd, env=env, shell=True)), args=())
  File "/usr/lib/python2.7/subprocess.py", line 541, in check_call
    raise CalledProcessError(retcode, cmd)
CalledProcessError: Command 'python /home/admin/mxnet/example/image-classification/train_mnist.py --kv-store dist_sync' returned non-zero exit status 1

root@node1:/home/admin/mxnet/tools# 

The contents of the host1 file:

127.0.0.1
127.0.0.1
127.0.0.1

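It works now, thanks for the pointers :pray:. I was never sure what the problem was, and it turns out that setting up intranet connectivity was enough. Actually, that is not quite accurate either. Following your debugging approach, what matters is the network interfaces shown by ifconfig: the nodes just need passwordless ssh access to each other on the IPs assigned to those interfaces. On a physical machine that is the public IP; on a cloud server it is probably the private IP.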

One question: when it finally ran successfully, did your hosts file use the public IPs starting with 39 or the private IPs starting with 172?

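The ones starting with 172. When I tried yesterday I had not set up passwordless ssh over the public IPs, only over the private ones. I then wrote the nodes' private IPs (starting with 172) into the hosts file, and the distributed run worked right away.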

So there is still no way to run distributed training properly with the public IPs starting with 39, is that right?

Right, public IPs do not work.


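OK, based on the information you have given, my guess at the root cause is this:
The ps-lite server tries to bind to a network interface. The address list it gets should match what ifconfig shows, but the IPs in your hosts file are public IPs, which differ from the ones in ifconfig. The server therefore cannot find that address in the list and reports an error.
For now there may be no good way to run distributed training over public IPs, so private IPs are the only option.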
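
If that guess is right, a quick way to test a candidate address is to try binding a socket to it on the node itself. The snippet below is only a sketch of that idea, not how ps-lite does its own binding; the function name `is_local_address` is made up for illustration. An address that is not assigned to any local interface (typically a cloud server's public IP) will normally fail to bind.

```python
import socket


def is_local_address(ip):
    """Return True if a TCP socket can be bound to `ip` on this machine.

    Hypothetical helper for illustration; it only checks the address,
    not any particular port used by MXNet or ps-lite.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        # Port 0 lets the OS pick any free port; only the address matters here.
        sock.bind((ip, 0))
        return True
    except socket.error:
        # Usually "Cannot assign requested address": the IP is not
        # configured on any local network interface.
        return False
    finally:
        sock.close()


if __name__ == "__main__":
    # The loopback address exists on every node, so it serves as a sanity check.
    print("127.0.0.1 bindable:", is_local_address("127.0.0.1"))
```

Running `is_local_address(...)` on each node with that node's entry from the hosts file should show whether the entry is an address the node can actually bind to.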

May I ask how you set up passwordless ssh over the intranet? On my side, ssh cannot reach the other machine by IP.

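Sorry, this was an MXNet topic I worked on for my graduation project, and it was so long ago that I do not remember the details. Rereading the problem description above, though, I vaguely recall that this was not the sticking point at the time. Try Googling passwordless ssh setup; I just had a look and there is plenty of material to refer to. The key point should be to configure it with the private IPs, and you may also need to adjust the security group and firewall settings.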
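
Once the keys are in place, something like the following can confirm that every entry in the hosts file is reachable without a password. It is a rough sketch: the helper names are invented here, and it assumes the hosts file holds one address per line, as in the host1 file shown earlier. `BatchMode=yes` and `ConnectTimeout` are standard OpenSSH client options.

```python
import subprocess


def read_hosts(path):
    """Return the non-empty lines of a hosts file, one host per line."""
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]


def ssh_check_command(host):
    # BatchMode=yes makes ssh fail instead of prompting for a password,
    # so exit status 0 means key-based login works for this host.
    return ["ssh", "-o", "BatchMode=yes", "-o", "ConnectTimeout=5", host, "true"]


def check_hosts(path):
    """Map each distinct host in the file to True if passwordless ssh succeeded."""
    results = {}
    for host in read_hosts(path):
        if host not in results:
            results[host] = subprocess.call(ssh_check_command(host)) == 0
    return results
```

Calling `check_hosts("host1")` from the tools directory on each node should return `True` for every private IP before launch.py is expected to work with `--launcher ssh`.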

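Thank you. I ran my experiments on cloud servers, and when I asked their support they said their machines do not support connecting to each other by IP.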