Deep Convolutional Neural Networks (AlexNet) Discussion


#42

It's probably running out of RAM or GPU memory. Try a smaller batch size.


#43

To output 256 channels, you need 256 convolution kernels of shape 96×5×5. Question 1: the 96-channel input shares each filter across all spatial positions during convolution.
Question 2: the input is convolved with the shared filter, the per-channel results are summed and passed through ReLU to give one output value at each position; once the convolution is done, the formula [(n+2p-f)/s]+1 gives a 1×22×22 feature map per filter, and with 256 filters the final output is 256×22×22.
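
A minimal sketch of that output-size formula; the helper function and the example numbers (a 26×26 input with a 5×5 kernel, no padding, which yields the 22×22 mentioned above) are my own illustration, not taken from the tutorial code:

def conv_out_size(n, f, p=0, s=1):
    # [(n + 2p - f) / s] + 1 for one spatial dimension
    # n: input size, f: kernel size, p: padding, s: stride
    return (n + 2 * p - f) // s + 1

# e.g. a 26x26 input, 5x5 kernel, no padding, stride 1 -> 22x22 per filter;
# with 256 filters the output volume is 256 x 22 x 22
print(conv_out_size(26, 5))  # 22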


#44

I saw the function has been modified, but after loading with load_data_fashion_mnist the data still all becomes 0; only after setting resize to 96 can the network train:

def load_data_fashion_mnist(batch_size, resize=None, root="~/.mxnet/datasets/fashion-mnist"):
    """download the fashion mnist dataset and then load it into memory"""
    def transform_mnist(data, label):
        # transform a batch of examples
        if resize:
            n = data.shape[0]
            new_data = nd.zeros((n, resize, resize, data.shape[3]))
            for i in range(n):
                new_data[i] = image.imresize(data[i], resize, resize)
            data = new_data
        # change data from batch x height x width x channel to batch x channel x height x width
        return nd.transpose(data.astype('float32'), (0,3,1,2))/255, label.astype('float32')
    mnist_train = gluon.data.vision.FashionMNIST(root=root, train=True, transform=transform_mnist)
    mnist_test = gluon.data.vision.FashionMNIST(root=root, train=False, transform=transform_mnist)
    train_data = DataLoader(mnist_train, batch_size, shuffle=True)
    test_data = DataLoader(mnist_test, batch_size, shuffle=False)
    return (train_data, test_data)

Another problem: this program's memory usage is abnormally large. This function performs the resize before the DataLoader, which may give a small speedup, but the memory cost is too high. I suggest moving the resize check into the DataLoader and resizing the data of each batch only when that batch is fetched. Otherwise, resizing this dataset to 224 and then running the model uses more than 32 GB of RAM. @szha


#45

With the following changes to DataLoader and load_data_fashion_mnist in utils.py, the excessive memory usage no longer occurs. At the same time, resize = 224 also works now (the data no longer becomes all zeros; I don't yet know why). On my GPU, one epoch of GoogLeNet training takes about 175 s with resize = 224 and about 57 s with resize = 96.

class DataLoader(object):
    """Similar to gluon.data.DataLoader, but might be faster.

    The main difference is that this data loader tries to read more examples each
    time. But the limits are: 1) all examples in the dataset have the same shape,
    2) the data transformer needs to process multiple examples at a time.
    """
    def __init__(self, dataset, batch_size, shuffle, resize):
        self.dataset = dataset
        self.batch_size = batch_size
        self.shuffle = shuffle
        self.resize = resize

    def __iter__(self):
        data = self.dataset[:]
        X = data[0]
        y = nd.array(data[1])
        n = X.shape[0]
        if self.shuffle:
            idx = np.arange(n)
            np.random.shuffle(idx)
            X = nd.array(X.asnumpy()[idx])
            y = nd.array(y.asnumpy()[idx])

        for i in range(n//self.batch_size):
            batch_x = X[i*self.batch_size:(i+1)*self.batch_size].astype('float32')
            if self.resize:
                new_data = nd.zeros((self.batch_size, self.resize, self.resize, batch_x.shape[3]))
                for j in range(self.batch_size):
                    new_data[j] = image.imresize(batch_x[j], self.resize, self.resize)
                batch_x = new_data
            yield (nd.transpose(batch_x, (0,3,1,2))/255,
                   y.astype('float32')[i*self.batch_size:(i+1)*self.batch_size])

    def __len__(self):
        return len(self.dataset)//self.batch_size

def load_data_fashion_mnist(batch_size, resize=None, root="~/.mxnet/datasets/fashion-mnist"):
    mnist_train = gluon.data.vision.FashionMNIST(root=root, train=True, transform=None)
    mnist_test = gluon.data.vision.FashionMNIST(root=root, train=False, transform=None)
    train_data = DataLoader(mnist_train, batch_size, shuffle=True, resize=resize)
    test_data = DataLoader(mnist_test, batch_size, shuffle=False, resize=resize)
    return (train_data, test_data)
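
For what it's worth, a quick sanity check of the modified loader (the batch size and resize values here are just example settings; it assumes the code above lives in utils.py with its usual imports):

from utils import load_data_fashion_mnist

train_data, test_data = load_data_fashion_mnist(batch_size=64, resize=224)
for data, label in train_data:
    # each batch is resized on the fly, so only one enlarged batch is in memory at a time
    print(data.shape, label.shape)  # expected: (64, 1, 224, 224) (64,)
    break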


16 GB of RAM blows up??


Networks Using Repeating Elements (VGG) Discussion


#46

Running this chapter as in the tutorial, my computer froze completely; following the approach of the previous two posters and moving the resize into DataLoader fixed it. They already explained the reason well :+1::+1::+1:
I have two questions: 1. Fashion-MNIST images are originally 28x28; how does imresize turn them into 224x224? I couldn't find the API implementation, and my own attempt to plt.show() a resized image didn't work. 2. With resize=None and a modified pool_size, training ran, but accuracy was only about 0.1x. Is a 224x224 input optimal for a model like AlexNet? Training with 96x96 input also seems to give worse results than with 224x224.
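
On question 1: as far as I understand, imresize just interpolates the 28x28 image up to the requested size. A small sketch for looking at one resized Fashion-MNIST image (the plotting part is my own, not from the tutorial):

import matplotlib.pyplot as plt
from mxnet import gluon, image

mnist_train = gluon.data.vision.FashionMNIST(train=True)
img, label = mnist_train[0]          # img: (28, 28, 1), uint8
big = image.imresize(img, 224, 224)  # interpolated up to (224, 224, 1)
plt.imshow(big.asnumpy().squeeze(), cmap='gray')
plt.show()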


#47


I'm here to pay my respects to the poster in #44... After copying the code directly there were indentation errors, but a small tweak fixed it...
Resizing to 224 used to crash for sure; 16 GB of RAM was nowhere near enough... Now it uses at most about 3 GB... What a difference...

Also, even after the kernel died, it kept on running... Why is that? :thinking:


#48

After setting this up, the GPU memory blows up as soon as batch_size reaches 64...
With batch_size=32, the kernel died but it kept running... Why is that? :thinking:


#49

One more thing: after def load_data_fashion_mnist there are another 10 lines of code before mnist_train=…
Should those 10 lines in between be left unchanged, deleted, or commented out?


#50

I deleted them and moved the corresponding functionality into class DataLoader.

A reminder: originally the transform was passed as a parameter to gluon.data.vision.FashionMNIST, whereas I moved the transform operations inside class DataLoader and only added a resize parameter on the outside.
Written that way it loses a lot of flexibility: if the transform ever needs to change, you have to edit the definition of class DataLoader itself. So if you want the same flexibility as the transform parameter of gluon.data.vision.FashionMNIST, it is better to pass the whole transform function as a parameter of class DataLoader and call it in the yield.

The modification is as follows:

class DataLoader(object):
    """similiar to gluon.data.DataLoader, but might be faster.

    The main difference this data loader tries to read more exmaples each
    time. But the limits are 1) all examples in dataset have the same shape, 2)
    data transfomer needs to process multiple examples at each time
    """
    def __init__(self, dataset, batch_size, shuffle, transform):
        self.dataset = dataset
        self.batch_size = batch_size
        self.shuffle = shuffle
        self.transform = transform

    def __iter__(self):
        data = self.dataset[:]
        X = data[0]
        y = nd.array(data[1])
        n = X.shape[0]
        if self.shuffle:
            idx = np.arange(n)
            np.random.shuffle(idx)
            X = nd.array(X.asnumpy()[idx])
            y = nd.array(y.asnumpy()[idx])

        for i in range(n//self.batch_size):
            if self.transform is not None:
                yield self.transform(X[i*self.batch_size:(i+1)*self.batch_size], 
                                     y[i*self.batch_size:(i+1)*self.batch_size])
            else:
                yield (X[i*self.batch_size:(i+1)*self.batch_size],
                       y[i*self.batch_size:(i+1)*self.batch_size])

    def __len__(self):
        return len(self.dataset)//self.batch_size

def load_data_fashion_mnist(batch_size, resize=None, root="~/.mxnet/datasets/fashion-mnist"):
    """download the fashion mnist dataest and then load into memory"""
    def transform_mnist(data, label):
        # transform a batch of examples
        if resize:
            n = data.shape[0]
            new_data = nd.zeros((n, resize, resize, data.shape[3]))
            for i in range(n):
                new_data[i] = image.imresize(data[i], resize, resize)
            data = new_data
        # change data from batch x height x width x channel to batch x channel x height x width
        return nd.transpose(data.astype('float32'), (0,3,1,2))/255, label.astype('float32')
    
    mnist_train = gluon.data.vision.FashionMNIST(root=root, train=True, transform=None)
    mnist_test = gluon.data.vision.FashionMNIST(root=root, train=False, transform=None)
    train_data = DataLoader(mnist_train, batch_size, shuffle=True, transform = transform_mnist)
    test_data = DataLoader(mnist_test, batch_size, shuffle=False, transform = transform_mnist)
    return (train_data, test_data)
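
As an illustration of that flexibility (my own example, not part of the tutorial), a different transform can now be plugged in without touching the DataLoader class at all:

from mxnet import nd, gluon

# hypothetical alternative transform: scale to [-1, 1] instead of [0, 1]
def my_transform(data, label):
    data = nd.transpose(data.astype('float32'), (0, 3, 1, 2)) / 127.5 - 1
    return data, label.astype('float32')

# assumes the DataLoader class defined above is in scope
mnist_train = gluon.data.vision.FashionMNIST(train=True, transform=None)
train_data = DataLoader(mnist_train, batch_size=64, shuffle=True, transform=my_transform)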

#51

I just tried it and ran into this problem too. What's even stranger is that when resizing to 96, dividing the ndarray by 255 is fine, but when resizing to 224, dividing by 255 gives 0. Hope @mli @szha can take a look. My setup is mxnet 1.0, CUDA 8.0, running on Ubuntu 16.04 with a GTX 1070.


#52

Why would it be dividing by zero?


#53

It's not dividing by zero; it's that after resizing to 224, dividing the ndarray by 255 gives all zeros. Maybe I didn't explain it clearly before. I have also seen many people who couldn't get AlexNet to train: the loss stays around 2.3 and accuracy around 0.1, and no learning rate (0.01, 0.001) helps. It turns out the problem is in the load_data_fashion_mnist function in utils.py: the matrices returned by this function are all zeros. For example, when I print one row of it, the variable cc prints as 0, while dd, computed with a numpy array, gives the correct result.

In short, calling load_data_fashion_mnist(batch_size=64, resize=96) has no bug, while load_data_fashion_mnist(batch_size=64, resize=224) does. The cause is that after the resize, NDArray / 255 becomes all zeros, whereas dividing by 255 as a numpy array and then converting back to NDArray is fine.

The debug code is as follows:

def load_data_fashion_mnist(batch_size, resize=None, root="~/.mxnet/datasets/fashion-mnist"):
    """download the fashion mnist dataset and then load it into memory"""
    def transform_mnist(data, label):
        # transform a batch of examples
        if resize:
            n = data.shape[0]
            new_data = nd.zeros((n, resize, resize, data.shape[3]))
            for i in range(n):
                new_data[i] = image.imresize(data[i], resize, resize)
            data = new_data

            print("data")
            print(np.unique(data[0,:,:,:].asnumpy()))

            bb = nd.transpose(data.astype('float32'), (0,3,1,2))
            print("bb")
            print(np.unique(bb[0,:,:,:].asnumpy()))  # correct result

            cc = bb.astype('float32') / 255
            print("cc")
            print(np.unique(cc[0,:,:,:].asnumpy()))  # result: [0.0]

            dd = bb.asnumpy()
            dd = dd / 255
            print("dd")
            print(np.unique(dd[0,:,:,:]))  # correct result
            sys.exit(1)
        # change data from batch x height x width x channel to batch x channel x height x width
        return nd.transpose(data.astype('float32'), (0,3,1,2))/255.0, label.astype('float32')

    mnist_train = gluon.data.vision.FashionMNIST(root=root, train=True, transform=transform_mnist)
    mnist_test = gluon.data.vision.FashionMNIST(root=root, train=False, transform=transform_mnist)
    train_data = DataLoader(mnist_train, batch_size, shuffle=True)
    test_data = DataLoader(mnist_test, batch_size, shuffle=False)
    return (train_data, test_data)

[Screenshot of the printed debug output omitted]


#55

Thanks for the detailed reply. @astonzhang could you take a look?


#56

Thanks for the reply. As mentioned in the posts above, one possible cause is that resizing the entire dataset to 224 at once uses too much memory, and the bug appears when memory is exhausted. If the resize operation is moved into DataLoader, the problem no longer occurs.
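
A rough back-of-the-envelope estimate (my own, assuming float32 storage) supports this: one full-size copy of the resized training set is already on the order of 11 GB, and the transpose and the division each create further copies:

# approximate memory for resizing all of Fashion-MNIST's training set up front
n, h, w, c = 60000, 224, 224, 1
gb_per_copy = n * h * w * c * 4 / 1024**3  # 4 bytes per float32
print(round(gb_per_copy, 1))  # ~11.2 GB per copy; intermediate results multiply this several times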


#57

Is it CPU memory running out? That dataset isn't large, is it?


#58

Possibly; I've never paid attention to CPU memory. On AWS, memory can basically be treated as unlimited.


#59

Judging by the replies in #44 and #48, that may well be it. I just ran @xiaoming's code (below) and had no problem. The key is simply to do the transform step inside the DataLoader.

class DataLoader(object):
    """Similar to gluon.data.DataLoader, but might be faster.

    The main difference is that this data loader tries to read more examples each
    time. But the limits are: 1) all examples in the dataset have the same shape,
    2) the data transformer needs to process multiple examples at a time.
    """
    def __init__(self, dataset, batch_size, shuffle, transform):
        self.dataset = dataset
        self.batch_size = batch_size
        self.shuffle = shuffle
        self.transform = transform

    def __iter__(self):
        data = self.dataset[:]
        X = data[0]
        y = nd.array(data[1])
        n = X.shape[0]
        if self.shuffle:
            idx = np.arange(n)
            np.random.shuffle(idx)
            X = nd.array(X.asnumpy()[idx])
            y = nd.array(y.asnumpy()[idx])

        for i in range(n//self.batch_size):
            if self.transform is not None:
                yield self.transform(X[i*self.batch_size:(i+1)*self.batch_size],
                                     y[i*self.batch_size:(i+1)*self.batch_size])
            else:
                yield (X[i*self.batch_size:(i+1)*self.batch_size],
                       y[i*self.batch_size:(i+1)*self.batch_size])

    def __len__(self):
        return len(self.dataset)//self.batch_size

def load_data_fashion_mnist(batch_size, resize=None, root="~/.mxnet/datasets/fashion-mnist"):
    """download the fashion mnist dataset and then load it into memory"""
    def transform_mnist(data, label):
        # transform a batch of examples
        if resize:
            n = data.shape[0]
            new_data = nd.zeros((n, resize, resize, data.shape[3]))
            for i in range(n):
                new_data[i] = image.imresize(data[i], resize, resize)
            data = new_data
        # change data from batch x height x width x channel to batch x channel x height x width
        return nd.transpose(data.astype('float32'), (0,3,1,2))/255, label.astype('float32')

    mnist_train = gluon.data.vision.FashionMNIST(root=root, train=True, transform=None)
    mnist_test = gluon.data.vision.FashionMNIST(root=root, train=False, transform=None)
    train_data = DataLoader(mnist_train, batch_size, shuffle=True, transform=transform_mnist)
    test_data = DataLoader(mnist_test, batch_size, shuffle=False, transform=transform_mnist)
    return (train_data, test_data)

The difference between this code and the problematic code provided is that in the original, the transform step happens here:

mnist_train = gluon.data.vision.FashionMNIST(root=root, train=True, transform=transform)
train_data = DataLoader(mnist_train, batch_size, shuffle=True)

and the mnist_train returned this way is an all-zero array. But if the transform is done inside DataLoader, there is no problem:

mnist_train = gluon.data.vision.FashionMNIST(root=root, train=True, transform=None)
train_data = DataLoader(mnist_train, batch_size, shuffle=True, transform = transform_mnist)

My machine has 32 GB of RAM. If you think doing the transform inside DataLoader is okay, we could submit a pull request. After all, for people who don't use AWS and instead build their own machines or use lab machines, memory is probably mostly 32 GB or less.


#60

In fact, by default the resize should be done inside the transform. The tutorial does it up front mainly because it is a bit faster: the transform itself is fairly expensive and not multi-threaded, which would make simple experiments like those in the tutorial much slower. But we indeed forgot to take CPU memory usage into account.

Gluon's new image IO should solve this performance problem.


#61

Thanks, I tested the PR you submitted. However, best practice should be to do the transform at the very start; moving it into the loader may confuse newcomers.

One option is to reduce the 224 resize size in the tutorial. It wouldn't be AlexNet's input size, but since the dataset is different anyway, that probably doesn't matter.

@mli what do you think? Should it be merged?


#62

A question about the input size

Although the AlexNet paper writes 224x224 as the input size, (224 - 11) / 4 is not an integer, whereas (227 - 11) / 4 + 1 = 55 (the size passed to the next layer), so 227 is the correct input.

I'd like to ask: how does MXNet handle sizes that do not divide evenly?
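
As far as I know, MXNet simply floors the fractional part, so a 224x224 input with an 11x11 kernel and stride 4 gives floor((224 - 11) / 4) + 1 = 54. A quick check (a minimal sketch with random weights):

from mxnet import nd
from mxnet.gluon import nn

conv = nn.Conv2D(channels=96, kernel_size=11, strides=4)
conv.initialize()
x = nd.random.uniform(shape=(1, 1, 224, 224))
print(conv(x).shape)  # (1, 96, 54, 54): the remainder is dropped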