Asynchronous Computation: Discussion Thread

http://zh.diveintodeeplearning.org/chapter_computational-performance/async-computation.html

Running the program from the tutorial, I get different results:

import time
from mxnet import nd

start = time.time()
x = nd.random_uniform(shape=(2000, 2000))
for i in range(1000):
    y = x + 1
    y.wait_to_read()  # block until each individual result is ready
print('No lazy evaluation: %f sec' % (time.time() - start))

start = time.time()
x = nd.random_uniform(shape=(2000, 2000))
for i in range(1000):
    y = x + 1
nd.waitall()  # block once, after all the work has been pushed to the backend
print('With lazy evaluation: %f sec' % (time.time() - start))

On my machine the results were:

No lazy evaluation: 2.929023 sec
With lazy evaluation: 2.712635 sec
  1. But on my machine, the times for the earlier examples are all close to those in the tutorial. What could be the reason?
  2. Also, my impression is that nd operations execute relatively slowly (I ran the simple comparison experiment below).
    @szha Could you give some advice on when nd can be used for efficient operations? (My feeling is that the speedup inside nn can easily be swallowed by the surrounding nd operations.) Thanks :grinning:
import time
import numpy as np
import torch
import mxnet as mx
from mxnet import nd

batch_size = 20
hidden_size = 1024
num_layers = 1

# choose cpu or gpu --- default cpu
gpu = True
ctx = mx.gpu() if gpu else mx.cpu()

# warm up both frameworks so CUDA context creation is not timed below
x = torch.zeros(1).cuda()
x = nd.zeros((1,), ctx=ctx)

t1 = time.time()
states = [torch.zeros((num_layers, batch_size, hidden_size)).cuda(),
          torch.zeros((num_layers, batch_size, hidden_size)).cuda()]
print('torch:', time.time() - t1)

t1 = time.time()
states = [nd.zeros((num_layers, batch_size, hidden_size), ctx=ctx),
          nd.zeros((num_layers, batch_size, hidden_size), ctx=ctx)]
states[0].sum().asscalar(), states[1].sum().asscalar()  # asscalar forces a sync
print('gluon without wait:', time.time() - t1)

t1 = time.time()
states = [nd.zeros((num_layers, batch_size, hidden_size), ctx=ctx),
          nd.zeros((num_layers, batch_size, hidden_size), ctx=ctx)]
states[0].wait_to_read(), states[1].wait_to_read()
states[0].sum().asscalar(), states[1].sum().asscalar()
print('gluon with wait:', time.time() - t1)

t1 = time.time()
states = [np.zeros((num_layers, batch_size, hidden_size)),
          np.zeros((num_layers, batch_size, hidden_size))]
print('numpy:', time.time() - t1)

t1 = time.time()
states = [np.zeros((num_layers, batch_size, hidden_size)),
          np.zeros((num_layers, batch_size, hidden_size))]
states = nd.array(states, ctx=ctx)
states[0].sum().asscalar(), states[1].sum().asscalar()
print('numpy to nd:', time.time() - t1)

The corresponding times are as follows:

torch: 0.0001513957977294922
gluon without wait: 0.0022842884063720703
gluon with wait: 0.0009772777557373047
numpy: 4.792213439941406e-05
numpy to nd: 0.0009074211120605469
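
One thing to note: PyTorch also executes GPU kernels asynchronously, so timing GPU code without an explicit synchronization can be misleading in both frameworks. A fairer timing (a minimal sketch, assuming a CUDA-capable GPU) synchronizes both backends before reading the clock:

import time
import torch
import mxnet as mx
from mxnet import nd

ctx = mx.gpu()
shape = (1, 20, 1024)

t1 = time.time()
a = torch.zeros(shape).cuda()
torch.cuda.synchronize()          # wait for all queued CUDA work to finish
print('torch (synced):', time.time() - t1)

t1 = time.time()
b = nd.zeros(shape, ctx=ctx)
nd.waitall()                      # wait for the MXNet backend to finish
print('gluon (synced):', time.time() - t1)

With both frameworks forced to finish, the comparison reflects actual execution rather than just the call overhead.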

A quick question: what tool was used to draw the SVG computation graphs in the tutorial? They look great, and I'd like to use the same style in a work report @@~

It's roughly this file: …/img/frontend-backend.svg

I've updated the image and am waiting for a reply @@... Flowcharts drawn with Visio look too dated...

@mli

My results are exactly the same as yours; sometimes the lazy-evaluation version even takes longer.

But the timings for all the other examples are normal.

from time import time
from mxnet import nd

x = nd.random_uniform(shape=(2000, 2000))  # same input as the tutorial

start = time()
for i in range(1000):
    y = x + 1
    y.wait_to_read()
print('No lazy evaluation: %f sec' % (time() - start))

start = time()
for i in range(1000):
    y = x + 1
nd.waitall()
print('With lazy evaluation: %f sec' % (time() - start))

No lazy evaluation: 3.856193 sec
With lazy evaluation: 4.174078 sec

Could this be an issue introduced by the new release? :joy: Nobody has replied to me so far, and I haven't found the cause myself. Honestly, I'm still somewhat confused about this deferred execution.

The workload feels too small; each run should take at least about 1 second.
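
For example (a minimal sketch; the matrix size and loop count are illustrative), replacing x + 1 with a compute-heavy nd.dot makes the benefit of a single final synchronization much easier to see:

from time import time
from mxnet import nd

x = nd.random_uniform(shape=(2000, 2000))

start = time()
for i in range(100):
    y = nd.dot(x, x)   # a compute-heavy op instead of x + 1
    y.wait_to_read()   # sync after every iteration
print('sync every step: %f sec' % (time() - start))

start = time()
for i in range(100):
    y = nd.dot(x, x)
nd.waitall()           # single sync at the end
print('sync once at the end: %f sec' % (time() - start))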

I used draw.io.

Which version are you using? Which CPU?

The workload feels too small; each run should take at least about 1 second.

I scaled up the sizes.
On the GPU (because of memory limits, these can't reach 1 s or more):

torch: 0.03922462463378906
gluon without wait: 0.009024620056152344
gluon with wait: 0.004993438720703125
numpy: 3.4332275390625e-05
numpy to nd: 0.05630612373352051

On the CPU:

torch: 0.07934093475341797
gluon without wait: 1.4026386737823486
gluon with wait: 1.4024074077606201
numpy: 0.0018773078918457031
numpy to nd: 1.586693525314331

This little experiment probably does have problems. Thanks, Mu!

Which version are you using? Which CPU?

I'm using MXNet 0.12.0 Release Candidate 0 (installed via conda), and the CPU is an i5-6500. This problem doesn't seem to be unique to me; some others may simply not have reported it.

Thanks again for the answer, Mu~

OK~ I'm already drawing with it... Thanks a lot!

Why is asscalar or asnumpy called a synchronous function? In other words, what is a synchronous function?
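
One way to see what "synchronous" means here (a minimal sketch): a plain operator call returns as soon as the work has been queued in the backend, while asnumpy must wait until the value actually exists before it can copy it out:

from time import time
from mxnet import nd

x = nd.random_uniform(shape=(2000, 2000))

start = time()
y = nd.dot(x, x)          # returns immediately; work is queued in the backend
print('enqueue: %f sec' % (time() - start))

start = time()
z = y.asnumpy()           # must wait for the result before copying it out
print('asnumpy (synchronous): %f sec' % (time() - start))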

I ran into this problem as well. Did you ever figure it out?

@mli Could you explain why the memory usage drops? Thanks a lot!

Something like wait_to_read()? That way we wouldn't push too much work to the backend all at once.
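
A minimal sketch of that idea, using a toy model so it runs on its own (the sizes and hyperparameters here are illustrative):

from mxnet import nd, autograd, gluon

# toy setup standing in for the tutorial's net/data/trainer
net = gluon.nn.Dense(1)
net.initialize()
loss = gluon.loss.L2Loss()
trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.1})
X, y = nd.random_uniform(shape=(512, 8)), nd.random_uniform(shape=(512,))

for epoch in range(10):
    with autograd.record():
        l = loss(net(X), y)
    l.backward()
    trainer.step(X.shape[0])
    l.wait_to_read()  # sync once per batch, capping the backend queue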


Maybe the drop is because the network model reached the end of its lifetime, so the memory it occupied was released?


In the last code block, trainer.step(x.shape[0]) updates the parameters inside net(x), right?
If it cannot execute synchronously, and each batch's net(x) is still lazily evaluated, wouldn't the whole training computation come out wrong?
Or, under deferred execution, are the parameters updated after the first batch still passed on to the second batch?
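
My current understanding (a sketch of my own, not from the tutorial): the backend tracks read and write dependencies between NDArrays, so deferred execution still runs dependent operations in program order; laziness changes when things run, not what they compute. For example:

from mxnet import nd

w = nd.ones((1,))
for _ in range(5):
    w = w * 2          # each multiply depends on the previous result
print(w.asscalar())    # 32.0: one sync at the end, order still preserved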

Can the code in this section only run on Linux? How would I modify it for Windows?


Does anyone know why this happens? My machine has 256 GB of memory in total.

Hello, I have a question: if I use 4 GPUs to compute independently, with the computation on each card separate, but the program running on every card contains synchronous functions such as asnumpy, will that block the computation on the other cards?
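
My understanding (a hedged sketch, assuming at least two GPUs are available): a synchronous function only blocks the Python frontend thread, while work already queued on the other cards keeps running in the backend:

from time import time
import mxnet as mx
from mxnet import nd

xs = [nd.random_uniform(shape=(4000, 4000), ctx=mx.gpu(i)) for i in range(2)]
ys = [nd.dot(x, x) for x in xs]  # both matmuls are queued asynchronously

ys[0].asnumpy()                  # frontend waits on gpu(0) only
start = time()
ys[1].wait_to_read()             # usually near-instant: gpu(1) kept computing
print('extra wait for gpu(1): %f sec' % (time() - start))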

The last two code blocks produce the following error on a Windows machine without a GPU:


FileNotFoundError                         Traceback (most recent call last)
in <module>()
----> 1 mem = get_mem()
      2 for epoch in range(1, 3):
      3     l_sum = 0
      4     for X, y in data_iter():
      5         with autograd.record():

in get_mem()
      1 def get_mem():
----> 2     res = subprocess.check_output(['ps', 'u', '-p', str(os.getpid())])
      3     return int(str(res).split()[15]) / 1e3

~\AppData\Local\Continuum\Anaconda3\envs\gluon\lib\subprocess.py in check_output(timeout, *popenargs, **kwargs)
    334
    335     return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
--> 336                **kwargs).stdout
    337
    338

~\AppData\Local\Continuum\Anaconda3\envs\gluon\lib\subprocess.py in run(input, timeout, check, *popenargs, **kwargs)
    401         kwargs['stdin'] = PIPE
    402
--> 403     with Popen(*popenargs, **kwargs) as process:
    404         try:
    405             stdout, stderr = process.communicate(input, timeout=timeout)

~\AppData\Local\Continuum\Anaconda3\envs\gluon\lib\subprocess.py in __init__(self, args, bufsize, executable, stdin, stdout, stderr, preexec_fn, close_fds, shell, cwd, env, universal_newlines, startupinfo, creationflags, restore_signals, start_new_session, pass_fds, encoding, errors)
    707                                 c2pread, c2pwrite,
    708                                 errread, errwrite,
--> 709                                 restore_signals, start_new_session)
    710             except:
    711                 # Cleanup if the child failed starting.

~\AppData\Local\Continuum\Anaconda3\envs\gluon\lib\subprocess.py in _execute_child(self, args, executable, preexec_fn, close_fds, pass_fds, cwd, env, startupinfo, creationflags, shell, p2cread, p2cwrite, c2pread, c2pwrite, errread, errwrite, unused_restore_signals, unused_start_new_session)
    995                                          env,
    996                                          os.fspath(cwd) if cwd is not None else None,
--> 997                                          startupinfo)
    998         finally:
    999             # Child is launched. Close the parent's copy of those pipe

FileNotFoundError: [WinError 2] The system cannot find the file specified
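
The error happens because get_mem() shells out to the Unix ps command, which does not exist on Windows. A portable alternative (a sketch assuming the third-party psutil package is installed) reads the process's resident memory directly:

import os
import psutil  # pip install psutil

def get_mem():
    """Return this process's resident memory usage in MB."""
    return psutil.Process(os.getpid()).memory_info().rss / 1e6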