Deep Convolutional Neural Networks (AlexNet) — Discussion


Looks like I'll have to buy a new machine...
I took a look at my wallet... Folks, would an RTX 2060 be enough?

1 Like

I ran the AlexNet chapter and, surprisingly, it reported that memory was insufficient. My GPU is an NVIDIA 1080 Ti; how much memory does it take to run this?

I ran into a similar problem. Later I found that CUDA 9.2 has a patch; after applying it, the message went away. My laptop also has only a 2 GB GPU, so I had to reduce the batch size as well; you could give that a try.
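For reference, a minimal sketch of shrinking the batch size with the book's d2lzh helper (32 here is just an arbitrary smaller value; pick whatever fits your card):

import d2lzh as d2l

# A smaller batch size trades training speed for a lower peak GPU memory footprint.
batch_size = 32  # instead of the chapter's 128
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size, resize=224)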

With resize=None I get an error, and I don't know why:

(mxnet_p36) ubuntu@ip-10-0-0-0:~/d2l-zh/chapter_convolutional-neural-networks$ python 1.py 
training on gpu(0)
infer_shape error. Arguments:
  data: (128, 1, 28, 28)
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/gluon/block.py", line 803, in _call_cached_op
    for is_arg, i in self._cached_op_args]
  File "/home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/gluon/block.py", line 803, in <listcomp>
    for is_arg, i in self._cached_op_args]
  File "/home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/gluon/parameter.py", line 494, in data
    return self._check_and_get(self._data, ctx)
  File "/home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/gluon/parameter.py", line 208, in _check_and_get
    "num_features, etc., for network layers."%(self.name))
mxnet.gluon.parameter.DeferredInitializationError: Parameter 'conv0_weight' has not been initialized yet because initialization was deferred. Actual initialization happens during the first forward pass. Please pass one batch of data through the network before accessing Parameters. You can also avoid deferred initialization by specifying in_units, num_features, etc., for network layers.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/gluon/block.py", line 789, in _deferred_infer_shape
    self.infer_shape(*args)
  File "/home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/gluon/block.py", line 862, in infer_shape
    self._infer_attrs('infer_shape', 'shape', *args)
  File "/home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/gluon/block.py", line 851, in _infer_attrs
    **{i.name: getattr(j, attr) for i, j in zip(inputs, args)})
  File "/home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/symbol/symbol.py", line 996, in infer_shape
    res = self._infer_shape_impl(False, *args, **kwargs)
  File "/home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/symbol/symbol.py", line 1126, in _infer_shape_impl
    ctypes.byref(complete)))
  File "/home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/base.py", line 252, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: Error in operator pool1_fwd: [14:40:49] src/operator/nn/pooling.cc:155: Check failed: param.kernel[0] <= dshape[2] + 2 * param.pad[0] kernel size (3) exceeds input (2 padded to 2)

Stack trace returned 10 entries:
[bt] (0) /home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x3da6c2) [0x7fdac0c2b6c2]
[bt] (1) /home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x3dac98) [0x7fdac0c2bc98]
[bt] (2) /home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x9bebec) [0x7fdac120fbec]
[bt] (3) /home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x3195e3f) [0x7fdac39e6e3f]
[bt] (4) /home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x31989b8) [0x7fdac39e99b8]
[bt] (5) /home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/libmxnet.so(MXSymbolInferShape+0x1549) [0x7fdac39535d9]
[bt] (6) /home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/lib-dynload/../../libffi.so.6(ffi_call_unix64+0x4c) [0x7fdaf8bbeec0]
[bt] (7) /home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/lib-dynload/../../libffi.so.6(ffi_call+0x22d) [0x7fdaf8bbe87d]
[bt] (8) /home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so(_ctypes_callproc+0x2ce) [0x7fdaf8dd3e2e]
[bt] (9) /home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so(+0x12865) [0x7fdaf8dd4865]



During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "1.py", line 59, in <module>
    d2l.train_ch5(net, train_iter, test_iter, batch_size, trainer, ctx, num_epochs)
  File "../d2lzh/utils.py", line 687, in train_ch5
    y_hat = net(X)
  File "/home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/gluon/block.py", line 540, in __call__
    out = self.forward(*args)
  File "/home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/gluon/block.py", line 907, in forward
    return self._call_cached_op(x, *args)
  File "/home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/gluon/block.py", line 805, in _call_cached_op
    self._deferred_infer_shape(*args)
  File "/home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/gluon/block.py", line 793, in _deferred_infer_shape
    raise ValueError(error_msg)
ValueError: Deferred initialization failed because shape cannot be inferred. Error in operator pool1_fwd: [14:40:49] src/operator/nn/pooling.cc:155: Check failed: param.kernel[0] <= dshape[2] + 2 * param.pad[0] kernel size (3) exceeds input (2 padded to 2)

Stack trace returned 10 entries:
[bt] (0) /home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x3da6c2) [0x7fdac0c2b6c2]
[bt] (1) /home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x3dac98) [0x7fdac0c2bc98]
[bt] (2) /home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x9bebec) [0x7fdac120fbec]
[bt] (3) /home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x3195e3f) [0x7fdac39e6e3f]
[bt] (4) /home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x31989b8) [0x7fdac39e99b8]
[bt] (5) /home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/libmxnet.so(MXSymbolInferShape+0x1549) [0x7fdac39535d9]
[bt] (6) /home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/lib-dynload/../../libffi.so.6(ffi_call_unix64+0x4c) [0x7fdaf8bbeec0]
[bt] (7) /home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/lib-dynload/../../libffi.so.6(ffi_call+0x22d) [0x7fdaf8bbe87d]
[bt] (8) /home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so(_ctypes_callproc+0x2ce) [0x7fdaf8dd3e2e]
[bt] (9) /home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so(+0x12865) [0x7fdaf8dd4865]

My 4 GB of GPU memory is enough... It may be that Jupyter isn't releasing the memory. Try restarting Jupyter and see?

Has this problem been solved?

from mxnet.gluon import nn,data as gdata,utils as gutils,loss as gloss
from mxnet import autograd as ag
from mxnet import nd,image
import mxnet as mx
import os,sys,time

net = nn.Sequential()
with net.name_scope():
    net.add(
        # Stage 1
        nn.Conv2D(channels=96,kernel_size=11,strides=4,activation='relu'),
        nn.MaxPool2D(pool_size=3,strides=2),
        # Stage 2
        nn.Conv2D(channels=256,kernel_size=5,padding=2,activation='relu'),
        nn.MaxPool2D(pool_size=3,strides=2),
        # Stage 3
        nn.Conv2D(channels=384,kernel_size=3,padding=1,activation='relu'),
        nn.Conv2D(channels=384,kernel_size=3,padding=1,activation='relu'),
        nn.Conv2D(channels=256,kernel_size=3,padding=1,activation='relu'),
        nn.MaxPool2D(pool_size=3,strides=2),
        # Stage 4
        nn.Flatten(),
        nn.Dense(4096,activation='relu'),
        nn.Dropout(0.5),
        # Stage 5
        nn.Dense(4096,activation='relu'),
        nn.Dropout(0.5),
        nn.Dense(10),
    )

def load_data_fashion_mnist(batch_size, resize=None, root=os.path.join(
        '~', '.mxnet', 'datasets', 'fashion-mnist')):
    """Download the fashion mnist dataset and then load into memory."""
    root = os.path.expanduser(root)
    # https://mxnet.apache.org/api/python/docs/api/gluon/data/vision/transforms/index.html#module-mxnet.gluon.data.vision.transforms
    transformer = []
    if resize:
        transformer += [gdata.vision.transforms.Resize(resize)]
    transformer += [gdata.vision.transforms.ToTensor()]
    # Compose takes a list of transforms, so transformer starts out as a Python list;
    # each new transform is appended to it, and Compose finally combines them into a single transform.
    transformer = gdata.vision.transforms.Compose(transformer)

    mnist_train = gdata.vision.FashionMNIST(root=root, train=True)
    mnist_test = gdata.vision.FashionMNIST(root=root, train=False)
    # Number of worker processes used for data loading
    num_workers = 0 if sys.platform.startswith('win32') else 4

    # https://mxnet.apache.org/api/python/docs/api/gluon/data/index.html#mxnet.gluon.data.Dataset.transform_first
    # transform_first is used so that only the image (the first element) is transformed and the label stays unchanged. See the source for details.
    train_iter = gdata.DataLoader(mnist_train.transform_first(transformer),
                                  batch_size, shuffle=True,
                                  num_workers=num_workers)
    test_iter = gdata.DataLoader(mnist_test.transform_first(transformer),
                                 batch_size, shuffle=False,
                                 num_workers=num_workers)

    return train_iter, test_iter

epochs,batch_size,lr,ctx = 10,128,0.01,mx.gpu()
net.initialize(ctx=ctx,force_reinit=True)

train_iter,test_iter = load_data_fashion_mnist(batch_size,resize=224)
trainer = mx.gluon.Trainer(net.collect_params(),'sgd',{'learning_rate':lr})

def accuracy(output,label):
    return nd.mean(output.argmax(axis=1)==label,ctx=mx.gpu()).asscalar()

def evaluate_accuracy(data_iter,net):
    acc = 0
    for data,label in data_iter:
        output=net(data)
        acc+=accuracy(output,label)
    return acc/len(data_iter)

def train_ch5(net, train_iter, test_iter, batch_size, trainer, ctx,
              num_epochs):
    """Train and evaluate a model with CPU or GPU."""
    print('training on', ctx)
    loss = gloss.SoftmaxCrossEntropyLoss()
    for epoch in range(num_epochs):
        train_l_sum, train_acc_sum, n, start = 0.0, 0.0, 0, time.time()
        for X, y in train_iter:
            X, y = X.as_in_context(ctx), y.as_in_context(ctx) # copy the batch to the GPU
            with ag.record():
                y_hat = net(X)
                l = loss(y_hat, y) # per-sample losses for this batch
            l.backward()
            trainer.step(batch_size)
            y = y.astype('float32')
            train_l_sum += nd.mean(l).asscalar() # accumulate the mean loss of this batch
            train_acc_sum += nd.mean((y_hat.argmax(axis=1) == y)).asscalar() # accumulate this batch's accuracy
            # n += y.size # total number of samples processed so far
        test_acc = evaluate_accuracy(test_iter, net)
        print('epoch %d, loss %.4f, train acc %.3f, test acc %.3f, '
              'time %.1f sec'
              % (epoch + 1, train_l_sum / len(train_iter), train_acc_sum / len(train_iter), test_acc,
                 time.time() - start))

train_ch5(net, train_iter, test_iter, batch_size, trainer, ctx, epochs)

Could someone tell me why this errors out here?

training on gpu(0)
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-74-f4549b4f9a86> in <module>()
----> 1 train_ch5(net, train_iter, test_iter, batch_size, trainer, ctx, epochs)

8 frames
/usr/local/lib/python3.6/dist-packages/mxnet/gluon/parameter.py in _check_and_get(self, arr_list, ctx)
    225                 "Parameter '%s' was not initialized on context %s. "
    226                 "It was only initialized on %s."%(
--> 227                     self.name, str(ctx), str(self._ctx_list)))
    228         if self._deferred_init:
    229             raise DeferredInitializationError(

RuntimeError: Parameter 'sequential10_conv0_weight' was not initialized on context cpu_shared(0). It was only initialized on [gpu(0)].

Everything here is initialized on the GPU, so I can't make sense of this error message. Could anyone point me in the right direction?

[quote="hehaitao, post:109, topic:1258"]
train_l_sum += nd.mean(l).asscalar() # accumulate the mean loss of this batch
train_acc_sum += nd.mean((y_hat.argmax(axis=1) == y)).asscalar() # accumulate this batch's accuracy
[/quote]

It seemed at first to be because the GPU wasn't set here... but no, I ran it again and that's not it. As long as the GPU is specified when the ndarray data is created, the later computations should all run on the GPU. Also, I'm using a Tesla K80 too, so why is my training so slow, and why does the loss even come out as NaN?
The figure below was the original result [image not shown], but I couldn't reproduce it; GPU training won't even run properly for me...
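For what it's worth, a minimal sketch of one possible fix, assuming the cpu_shared(0) in the error above comes from evaluate_accuracy feeding DataLoader batches into the GPU-initialized net without copying them to the GPU first (ctx = mx.gpu(), as in the posted code):

def evaluate_accuracy(data_iter, net, ctx=mx.gpu()):
    acc_sum, n = 0.0, 0
    for data, label in data_iter:
        # Copy each batch to the context the parameters were initialized on.
        data = data.as_in_context(ctx)
        label = label.as_in_context(ctx).astype('float32')
        output = net(data)
        acc_sum += (output.argmax(axis=1) == label).sum().asscalar()
        n += label.size
    return acc_sum / n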

The accuracy is much better than LeNet's. After training for 50 epochs it reaches 0.926; with LeNet, no matter how hard I tuned, the best I could get was around 0.9.

Has this problem been solved?

When the 28×28 images are resized to 224×224, what are the extra pixels filled with?
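I may be wrong, but as far as I can tell gluon's transforms.Resize does not pad at all: it resamples the image (bilinear interpolation by default), so the extra pixels are interpolated from the original ones. A quick check on a tiny made-up image:

from mxnet import nd
from mxnet.gluon.data.vision import transforms

# A 2x2 single-channel uint8 "image" with arbitrary values.
img = nd.array([[0, 64], [128, 255]]).reshape((2, 2, 1)).astype('uint8')
big = transforms.Resize(4)(img)   # 4x4 output; the new pixels are interpolated, not zero-filled
print(big.reshape((4, 4)))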

@NoGpu @zys0620 @jameson
It's an issue with how the fully connected layers (nn.Dense) in net were initialized. Before the resize we had already run the code below, so net already had its input sizes and the model's parameter shapes were fixed at that point.

Quote
We construct a single-channel data sample with both height and width of 224 to observe the output shape of each layer.
X = nd.random.uniform(shape=(1, 1, 224, 224))
net.initialize()
for layer in net:
    X = layer(X)
    print(layer.name, 'output shape:\t', X.shape)

The initialize call below with force_reinit, by contrast, only re-initializes net under the assumption that the parameter shapes stay unchanged.

Quote
net.initialize(force_reinit=True, ctx=ctx, init=init.Xavier())

There is a similar issue in the exercises of Section 4.3.

Quote
4.3.4. Exercises
What happens if, before the next forward computation net(X), you change the shape of the input X, including the batch size and the number of inputs?

Solution (assuming resize = 96):
1. Don't run the block quoted above; run the later code directly, so that the net's input sizes are derived from 96.
2. Or change the 224 in the code above to 96 (see the sketch after the note below):
X = nd.random.uniform(shape=(1, 1, 96, 96))

Note: looking ahead to the ResNet section, for convolutional layers such as nn.Conv2D the features are the channels, so when the image is resized to 96 no error occurs as long as the number of channels stays the same.
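A minimal sketch of option 2 above, reusing net, ctx, batch_size, and load_data_fashion_mnist from the posted code, and assuming the net has not already been shape-fixed by a 224 forward pass (otherwise re-create the net first):

from mxnet import init, nd

# Probe the layer shapes with the same spatial size the training data will have (96),
# so deferred initialization fixes the Dense layers' input sizes consistently.
net.initialize(init=init.Xavier(), ctx=ctx, force_reinit=True)
X = nd.random.uniform(shape=(1, 1, 96, 96), ctx=ctx)
for layer in net:
    X = layer(X)
    print(layer.name, 'output shape:\t', X.shape)

train_iter, test_iter = load_data_fashion_mnist(batch_size, resize=96)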

1 Like

If this is wrong, I hope the instructor will correct me.

1) No matter how many channels the input has, a single convolution kernel produces a single-channel output.
2) If multiple output channels are needed, the convolution can be run with multiple filters and the results concatenated.
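A minimal sketch of that idea using plain NDArray loops (the shapes are made up for illustration): one kernel collapses all input channels into a single-channel output, and stacking the results of several kernels gives a multi-channel output.

from mxnet import nd

def corr2d(X, K):
    # 2-D cross-correlation of a single channel.
    h, w = K.shape
    Y = nd.zeros((X.shape[0] - h + 1, X.shape[1] - w + 1))
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            Y[i, j] = (X[i:i + h, j:j + w] * K).sum()
    return Y

def corr2d_multi_in(X, K):
    # One kernel spanning all input channels -> one output channel (point 1).
    return nd.add_n(*[corr2d(x, k) for x, k in zip(X, K)])

def corr2d_multi_in_out(X, K):
    # One single-channel result per kernel, stacked -> multi-channel output (point 2).
    return nd.stack(*[corr2d_multi_in(X, k) for k in K])

X = nd.random.uniform(shape=(3, 8, 8))      # 3 input channels
K = nd.random.uniform(shape=(5, 3, 2, 2))   # 5 kernels, each spanning 3 channels
print(corr2d_multi_in_out(X, K).shape)      # (5, 7, 7)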

My MacBook Pro has 16 GB of RAM; the fans roared away for a long time and it still never produced a result. So sad.

Section 5.6.1 of the book says:

Deep models with many features need a large amount of labeled data in order to outperform other classical methods.

Why is that? Is there a theoretical explanation? And how is "a large amount" defined?

AlexNet's learning rate is much smaller than LeNet's. Does that mean the more parameters a model has, the smaller the learning rate should be, because with more parameters an update is more likely to take too large a gradient-descent step?