How do I change the batch_size in the machine translation code?

My own changes never run correctly. Below is my attempt at modifying the code from the course's Neural Machine Translation section — could anyone take a look and tell me where it goes wrong?
1. At the top, set

batch_size = 2

2. The Encoder part is unchanged.
3. The Decoder part is modified to:

class Decoder(Block):

    def __init__(self, hidden_dim, output_dim, num_layers, max_seq_len,
                 drop_prob, alignment_dim, encoder_hidden_dim):
        super(Decoder, self).__init__()
        self.max_seq_len = max_seq_len
        self.encoder_hidden_dim = encoder_hidden_dim
        self.hidden_size = hidden_dim
        self.num_layers = num_layers
        with self.name_scope():
            self.embedding = nn.Embedding(output_dim, hidden_dim)
            self.dropout = nn.Dropout(drop_prob)
            # Attention mechanism.
            self.attention = nn.Sequential()
            with self.attention.name_scope():
                self.attention.add(nn.Dense(
                    alignment_dim, in_units=hidden_dim + encoder_hidden_dim,
                    activation="tanh", flatten=False))
                self.attention.add(nn.Dense(1, in_units=alignment_dim,
                                            flatten=False))

            self.rnn = rnn.GRU(hidden_dim, num_layers, dropout=drop_prob,
                               input_size=hidden_dim)
            self.out = nn.Dense(output_dim, in_units=hidden_dim)
            self.rnn_concat_input = nn.Dense(
                hidden_dim, in_units=hidden_dim + encoder_hidden_dim,
                flatten=False)

    def forward(self, cur_input, state, encoder_outputs):
        # For a multi-layer RNN, take the hidden state of the layer closest
        # to the output layer.
        single_layer_state = [state[0][-1].expand_dims(0)]

        encoder_outputs = encoder_outputs.reshape((self.max_seq_len, batch_size,
                                                   self.encoder_hidden_dim))

        hidden_broadcast = nd.broadcast_axis(single_layer_state[0], axis=0,
                                             size=self.max_seq_len)

        encoder_outputs_and_hiddens = nd.concat(encoder_outputs,
                                                hidden_broadcast, dim=2)

        energy = self.attention(encoder_outputs_and_hiddens)

        batch_attention = nd.softmax(energy, axis=0).reshape(
            (batch_size, 1, self.max_seq_len))

        batch_encoder_outputs = encoder_outputs.swapaxes(0, 1)
        decoder_context = nd.batch_dot(batch_attention, batch_encoder_outputs)

        input_and_context = nd.concat(self.embedding(cur_input).reshape(
            (batch_size, 1, self.hidden_size)), decoder_context, dim=2)
        concat_input = self.rnn_concat_input(input_and_context)
        concat_input = self.dropout(concat_input)

        # For a multi-layer RNN, initialize every layer's hidden state from
        # the single-layer state.
        state = [nd.broadcast_axis(single_layer_state[0], axis=0,
                                   size=self.num_layers)]

        output, state = self.rnn(concat_input, state)
        output = self.dropout(output)
        output = self.out(output)

        return output, state

    def begin_state(self, *args, **kwargs):
        return self.rnn.begin_state(*args, **kwargs)

The main change is turning the parts whose size used to be 1 into 2 (the batch_size).

Then in the train part, the changes were: set batch_size in the DataLoader to 2, set batch_size in encoder.begin_state to 2, use 2 at each step, and finally

decoder_input = nd.array([output_vocab.token_to_idx[BOS],
                          output_vocab.token_to_idx[BOS]], ctx=ctx)

because I figured that if there are 2 examples, then the number of BOS tokens fed in should also be batch_size (?)

However, this version does not run correctly. The error is raised in the Decoder at output, state = self.rnn(concat_input, state), because the broadcast state
state = [nd.broadcast_axis(single_layer_state[0], axis=0, size=self.num_layers)] gets expanded to shape (batch_size, batch_size, num_layers), while the expected shape is (batch_size, 1, num_layers).

All this back-and-forth dimension shuffling is making my head spin, and I really can't spot the mistake. Thanks in advance, everyone ^_^

I agree — though changing the code will take some work; the tutorial uses batch_size 1 to keep the explanation clear.

The decoder is the part that needs changing. Fortunately, Gluon code is fairly Pythonic, so you can print every variable's shape at each step (the main thing is to keep straight which dimensions correspond to num_steps, batch_size, and hidden_size).
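For example, a throwaway shape-printing helper might look like this (a sketch using NumPy arrays as stand-ins for NDArrays; the variable names and sizes mirror the decoder but are only illustrative):

```python
import numpy as np

# Hypothetical stand-in shapes; the tutorial's layout convention is
# (num_steps, batch_size, hidden_size).
max_seq_len, batch_size, hidden_size = 7, 2, 256

single_layer_state = np.zeros((1, batch_size, hidden_size))
encoder_outputs = np.zeros((max_seq_len, batch_size, hidden_size))

def p(name, arr):
    """Print the shape and pass the array through, so it can wrap any expression."""
    print(name, arr.shape)
    return arr

hidden_broadcast = p('hidden_broadcast',
                     np.broadcast_to(single_layer_state,
                                     (max_seq_len, batch_size, hidden_size)))
concat = p('encoder_outputs_and_hiddens',
           np.concatenate([encoder_outputs, hidden_broadcast], axis=2))
```

Wrapping each intermediate result this way makes it immediately visible which axis holds num_steps and which holds batch_size.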

Another suggestion is to go from simple to complex: for example, drop attention first, and add it back once batch_size = 2 works.

Once you've got it working, do share it with us :grinning:

Encouraged by the teacher, I finally got it working!
I won't explain much here — the reason for every dimension transformation is spelled out in detail in the code comments.
The code is as follows:

import mxnet as mx
from mxnet import autograd, gluon, nd
from mxnet.gluon import nn, rnn, Block
from mxnet.contrib import text

from io import open
import collections
import datetime

batch_size = 2
PAD = '<pad>'
BOS = '<bos>'
EOS = '<eos>'
epochs = 100
epoch_period = 10

learning_rate = 0.005
# Maximum length of an input or output sequence (including the appended EOS token).

encoder_num_layers = 1
decoder_num_layers = 2

encoder_drop_prob = 0.1
decoder_drop_prob = 0.1

encoder_hidden_dim = 256
decoder_hidden_dim = 256
alignment_dim = 25
max_seq_len = 7  # for this test, 7 is the maximum length

ctx = mx.cpu()


def read_data(max_seq_len):
    input_tokens = []
    output_tokens = []
    input_seqs = []
    output_seqs = []

    with open('../data/fr-en-small.txt') as f:
        lines = f.readlines()
        for line in lines:
            input_seq, output_seq = line.rstrip().split('\t')
            cur_input_tokens = input_seq.split(' ')
            cur_output_tokens = output_seq.split(' ')

            if len(cur_input_tokens) < max_seq_len and \
                            len(cur_output_tokens) < max_seq_len:
                input_tokens.extend(cur_input_tokens)
                # Append the EOS token at the end of the sequence.
                cur_input_tokens.append(EOS)
                # Append PAD tokens so every sequence has the same length (max_seq_len).
                while len(cur_input_tokens) < max_seq_len:
                    cur_input_tokens.append(PAD)
                input_seqs.append(cur_input_tokens)
                output_tokens.extend(cur_output_tokens)
                cur_output_tokens.append(EOS)
                while len(cur_output_tokens) < max_seq_len:
                    cur_output_tokens.append(PAD)
                output_seqs.append(cur_output_tokens)

        fr_vocab = text.vocab.Vocabulary(collections.Counter(input_tokens),
                                         reserved_tokens=[PAD, BOS, EOS])
        en_vocab = text.vocab.Vocabulary(collections.Counter(output_tokens),
                                         reserved_tokens=[PAD, BOS, EOS])
    return fr_vocab, en_vocab, input_seqs, output_seqs


input_vocab, output_vocab, input_seqs, output_seqs = read_data(max_seq_len)
print(len(input_seqs))
# X and Y correspond to the two languages.
X = nd.zeros((len(input_seqs), max_seq_len), ctx=ctx)
Y = nd.zeros((len(output_seqs), max_seq_len), ctx=ctx)
for i in range(len(input_seqs)):
    X[i] = nd.array(input_vocab.to_indices(input_seqs[i]), ctx=ctx)
    Y[i] = nd.array(output_vocab.to_indices(output_seqs[i]), ctx=ctx)

dataset = gluon.data.ArrayDataset(X, Y)


class Encoder(Block):
    """编码器"""

    def __init__(self, input_dim, hidden_dim, num_layers, drop_prob):
        super(Encoder, self).__init__()
        with self.name_scope():
            self.embedding = nn.Embedding(input_dim, hidden_dim)
            self.dropout = nn.Dropout(drop_prob)
            self.rnn = rnn.GRU(hidden_dim, num_layers, dropout=drop_prob,
                               input_size=hidden_dim)

    def forward(self, inputs, state):
        # inputs shape: (batch_size, max_len); emb shape: (max_len, batch_size, 256)
        emb = self.embedding(inputs).swapaxes(0, 1)

        emb = self.dropout(emb)
        output, state = self.rnn(emb, state)
        return output, state

    def begin_state(self, *args, **kwargs):
        return self.rnn.begin_state(*args, **kwargs)


class Decoder(Block):
    """含注意力机制的解码器"""

    def __init__(self, hidden_dim, output_dim, num_layers, max_seq_len,
                 drop_prob, alignment_dim, encoder_hidden_dim, batch_size):
        super(Decoder, self).__init__()
        self.max_seq_len = max_seq_len
        self.encoder_hidden_dim = encoder_hidden_dim
        self.hidden_size = hidden_dim
        self.num_layers = num_layers
        self.batch_size = batch_size

        with self.name_scope():
            self.embedding = nn.Embedding(output_dim, hidden_dim)
            self.dropout = nn.Dropout(drop_prob)
            # Attention mechanism.
            self.attention = nn.Sequential()
            with self.attention.name_scope():
                self.attention.add(nn.Dense(
                    alignment_dim, in_units=hidden_dim + encoder_hidden_dim,
                    activation="tanh", flatten=False))
                self.attention.add(nn.Dense(1, in_units=alignment_dim,
                                            flatten=False))

            self.rnn = rnn.GRU(hidden_dim, num_layers, dropout=drop_prob,
                               input_size=hidden_dim)
            self.out = nn.Dense(output_dim, in_units=hidden_dim)
            self.rnn_concat_input = nn.Dense(
                hidden_dim, in_units=hidden_dim + encoder_hidden_dim,
                flatten=False)

    def forward(self, cur_input, state, encoder_outputs, is_train):

        # For a multi-layer RNN, take the hidden state of the layer closest to the output layer.
        single_layer_state = [state[0][-1].expand_dims(0)]

        # is_train distinguishes training from testing: training reads batch_size
        # examples at each step; testing generates one example at a time.
        if is_train:
            encoder_outputs = encoder_outputs.reshape((self.max_seq_len, self.batch_size,
                                                       self.encoder_hidden_dim))
        else:
            encoder_outputs = encoder_outputs.reshape((self.max_seq_len, 1,
                                                       self.encoder_hidden_dim))

        # single_layer_state shape: [(1, batch_size, decoder_hidden_dim)]; the 1 is the num_layers dim
        # hidden_broadcast shape: (max_seq_len, batch_size, decoder_hidden_dim)
        hidden_broadcast = nd.broadcast_axis(single_layer_state[0], axis=0,
                                             size=self.max_seq_len)
        # print('hidden_broadcast',hidden_broadcast.shape)
        # encoder_outputs_and_hiddens shape:
        # (max_seq_len, batch_size, encoder_hidden_dim + decoder_hidden_dim)
        encoder_outputs_and_hiddens = nd.concat(encoder_outputs,
                                                hidden_broadcast, dim=2)

        # energy shape: (max_seq_len, batch_size, 1); the 1 is the scalar e output by attention
        energy = self.attention(encoder_outputs_and_hiddens)

        if is_train:
            batch_attention = nd.softmax(energy, axis=0).reshape(
                (self.batch_size, 1, self.max_seq_len))
        else:
            batch_attention = nd.softmax(energy, axis=0).reshape(
                (1, 1, self.max_seq_len))

        # Reshape so that the batch dimension is dim 0.
        # batch_encoder_outputs shape: (batch_size, max_seq_len, encoder_hidden_dim)
        batch_encoder_outputs = encoder_outputs.swapaxes(0, 1)

        # decoder_context shape: (batch_size, 1, encoder_hidden_dim)
        decoder_context = nd.batch_dot(batch_attention, batch_encoder_outputs)

        # input_and_context shape: (batch_size, 1, encoder_hidden_dim + decoder_hidden_dim)
        # embedding shape: (batch_size, 1, 256)
        if is_train:
            input_and_context = nd.concat(self.embedding(cur_input).reshape(
                (self.batch_size, 1, self.hidden_size)), decoder_context, dim=2)
        else:
            input_and_context = nd.concat(self.embedding(cur_input).reshape(
                (1, 1, self.hidden_size)), decoder_context, dim=2)

        # concat_input shape: (batch_size, 1, decoder_hidden_dim)
        concat_input = self.rnn_concat_input(input_and_context)
        concat_input = self.dropout(concat_input)

        # For a multi-layer RNN, initialize every layer's hidden state from the single-layer state.
        state = [nd.broadcast_axis(single_layer_state[0], axis=0,
                                   size=self.num_layers)]

        # Reshape (batch_size, 1, decoder_hidden_dim) -> (1, batch_size, decoder_hidden_dim),
        # because rnn.GRU expects shape (sequence_length, batch_size, input_size) (when layout is 'TNC').
        concat_input_exchange = concat_input.swapaxes(0, 1)
        output, state = self.rnn(concat_input_exchange, state)
        output = self.dropout(output)

        # Swap axes again because self.out is a Dense layer, which expects batch_size in dim 0.
        output = output.swapaxes(0, 1)
        output = self.out(output)
        # output shape: (batch_size, output_dim); state shape: [(num_layers, batch_size, decoder_hidden_dim)]

        return output, state

    def begin_state(self, *args, **kwargs):
        return self.rnn.begin_state(*args, **kwargs)


class DecoderInitState(Block):
    """Initialization of the decoder's hidden state."""

    def __init__(self, encoder_hidden_dim, decoder_hidden_dim):
        super(DecoderInitState, self).__init__()
        with self.name_scope():
            self.dense = nn.Dense(decoder_hidden_dim,
                                  in_units=encoder_hidden_dim,
                                  activation="tanh", flatten=False)

    def forward(self, encoder_state):
        return [self.dense(encoder_state)]


def translate(encoder, decoder, decoder_init_state, fr_ens, ctx, max_seq_len):
    for fr_en in fr_ens:
        print('Input :', fr_en[0])
        input_tokens = fr_en[0].split(' ') + [EOS]
        # Append PAD tokens so every sequence has the same length (max_seq_len).
        while len(input_tokens) < max_seq_len:
            input_tokens.append(PAD)
        inputs = nd.array(input_vocab.to_indices(input_tokens), ctx=ctx)
        encoder_state = encoder.begin_state(func=mx.nd.zeros, batch_size=1,
                                            ctx=ctx)
        encoder_outputs, encoder_state = encoder(inputs.expand_dims(0),
                                                 encoder_state)
        encoder_outputs = encoder_outputs.flatten()
        # The decoder's first input is the BOS token.
        decoder_input = nd.array([output_vocab.token_to_idx[BOS]], ctx=ctx)
        decoder_state = decoder_init_state(encoder_state[0])
        output_tokens = []

        for i in range(max_seq_len):
            decoder_output, decoder_state = decoder(
                decoder_input, decoder_state, encoder_outputs, False)
            pred_i = int(decoder_output.argmax(axis=1).asnumpy())
            # Once the output at any time step is EOS, the output sequence is complete.
            if pred_i == output_vocab.token_to_idx[EOS]:
                break
            else:
                output_tokens.append(output_vocab.idx_to_token[pred_i])
            decoder_input = nd.array([pred_i], ctx=ctx)

        print('Output:', ' '.join(output_tokens))
        print('Expect:', fr_en[1], '\n')


def train(encoder, decoder, decoder_init_state, max_seq_len, ctx, eval_fr_ens):
    # Initialize the parameters of the three networks and define an optimizer for each.
    encoder.collect_params().initialize(mx.init.Xavier(), ctx=ctx)
    decoder.collect_params().initialize(mx.init.Xavier(), ctx=ctx)
    decoder_init_state.collect_params().initialize(mx.init.Xavier(), ctx=ctx)
    encoder_optimizer = gluon.Trainer(encoder.collect_params(), 'adam',
                                      {'learning_rate': learning_rate})
    decoder_optimizer = gluon.Trainer(decoder.collect_params(), 'adam',
                                      {'learning_rate': learning_rate})
    decoder_init_state_optimizer = gluon.Trainer(
        decoder_init_state.collect_params(), 'adam',
        {'learning_rate': learning_rate})

    softmax_cross_entropy = gluon.loss.SoftmaxCrossEntropyLoss()

    prev_time = datetime.datetime.now()
    data_iter = gluon.data.DataLoader(dataset, batch_size=batch_size, shuffle=True)

    total_loss = 0.0
    for epoch in range(1, epochs + 1):
        for x, y in data_iter:
            with autograd.record():
                loss = nd.array([0], ctx=ctx)
                encoder_state = encoder.begin_state(
                    func=mx.nd.zeros, batch_size=batch_size, ctx=ctx)

                encoder_outputs, encoder_state = encoder(x, encoder_state)

                encoder_outputs = encoder_outputs.flatten()
                # The decoder's first input is the BOS token.
                decoder_input = nd.array([[output_vocab.token_to_idx[BOS]] * batch_size], ctx=ctx)

                # [0] is taken here so that, if the encoder had multiple recurrent layers, the innermost layer would be used.
                decoder_state = decoder_init_state(encoder_state[0])
                # decoder_state shape: (1, 2, 256)
                # print(decoder_state)
                # for batch in range(batch_size):
                for i in range(max_seq_len):

                    # print('aaa',[decoder_state[0][:,batch,:].reshape(shape=(1,1,256))])
                    decoder_output, decoder_state = decoder(
                        decoder_input, decoder_state, encoder_outputs, True)
                    # The decoder uses the current time step's prediction as the next time step's input.
                    decoder_input = decoder_output.argmax(axis=1)
                    loss = loss + softmax_cross_entropy(decoder_output, y[:, i])
                    cnt = batch_size
                    for batch in range(batch_size):
                        if y[batch][i].asscalar() == output_vocab.token_to_idx[EOS]:
                            cnt -= 1
                            if cnt == 0:
                                break

            loss.backward()
            encoder_optimizer.step(batch_size=batch_size)
            decoder_optimizer.step(batch_size=batch_size)
            decoder_init_state_optimizer.step(batch_size=batch_size)
            total_loss += nd.sum(loss).asscalar() / max_seq_len

        if epoch % epoch_period == 0 or epoch == 1:
            cur_time = datetime.datetime.now()
            h, remainder = divmod((cur_time - prev_time).seconds, 3600)
            m, s = divmod(remainder, 60)
            time_str = 'Time %02d:%02d:%02d' % (h, m, s)
            if epoch == 1:
                print_loss_avg = total_loss / len(data_iter)
            else:
                print_loss_avg = total_loss / epoch_period / len(data_iter)
            loss_str = 'Epoch %d, Loss %f, ' % (epoch, print_loss_avg)
            print(loss_str + time_str)
            if epoch != 1:
                total_loss = 0.0
            prev_time = cur_time

            translate(encoder, decoder, decoder_init_state, eval_fr_ens, ctx, max_seq_len)


encoder = Encoder(len(input_vocab), encoder_hidden_dim, encoder_num_layers,
                  encoder_drop_prob)

decoder = Decoder(decoder_hidden_dim, len(output_vocab),
                  decoder_num_layers, max_seq_len, decoder_drop_prob,
                  alignment_dim, encoder_hidden_dim, batch_size=batch_size)
decoder_init_state = DecoderInitState(encoder_hidden_dim, decoder_hidden_dim)

eval_fr_ens = [['elle est japonaise .', 'she is japanese .'],
               ['ils regardent .', 'they are watching .']]
train(encoder, decoder, decoder_init_state, max_seq_len, ctx, eval_fr_ens)

I've tested it and it works :grinning:
But there is one thing I'd like to ask you about, teacher Zhang: when batch_size is greater than 1, how should the "stop generating once EOS appears" operation be carried out across a batch? I had a few ideas. One was to check every item in the batch during decoding, but that effectively turns the decoder back into batch_size = 1, which seemed inefficient, so I dropped it. My next idea was to go by the longest sentence in the batch — that is what my implementation does: at worst, the shorter sentences predict a few extra PADs. But in real data, sentence lengths within a batch can differ a lot. Would this hurt the training results, especially for the short sentences?
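(In case it helps the discussion: here is a loss mask I sketched but have not tried — plain NumPy, not wired into the code above; EOS_IDX and PAD_IDX are made-up indices — that would zero out the loss after each sentence's EOS:)

```python
import numpy as np

EOS_IDX, PAD_IDX = 3, 1  # hypothetical vocabulary indices

def loss_mask(targets, eos_idx):
    """mask[b, t] = 1.0 up to and including the first EOS in row b, 0.0 after it.

    targets: int array of shape (batch_size, max_seq_len), PAD-filled after EOS.
    """
    mask = np.ones(targets.shape, dtype=np.float32)
    for b in range(targets.shape[0]):
        hits = np.where(targets[b] == eos_idx)[0]
        if hits.size > 0:
            mask[b, hits[0] + 1:] = 0.0
    return mask

targets = np.array([[5, 6, EOS_IDX, PAD_IDX, PAD_IDX],
                    [7, 8, 9, 4, EOS_IDX]])
mask = loss_mask(targets, EOS_IDX)
```

Each time step's cross-entropy would then be multiplied by the corresponding mask column before summing, so the padded positions of short sentences contribute nothing to the gradient.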


Finally, a manual @astonzhang — thank you for the encouragement, teacher! I hope you can also take a look at my question above :smiley:

Nice!

Your second idea is quite good — for short sentences, you can also put EOS after the eos. An even better minibatching method: say batch_size = 5; sample more seqs at a time, e.g. 100, sort those 100 from shortest to longest, put 1–5 into batch 1, 6–10 into batch 2, and so on. Then process the batches one by one, so the sequence lengths within each batch are similar.
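In plain Python, that bucketing could be sketched roughly like this (seqs stands for any list of token lists; the names here are made up for illustration):

```python
import random

def length_bucketed_batches(seqs, batch_size=5, pool_size=100, shuffle_batches=True):
    """Sort each sampled pool of sequences by length, then cut it into batches,
    so that sequences within a batch have similar lengths."""
    batches = []
    for start in range(0, len(seqs), pool_size):
        pool = sorted(seqs[start:start + pool_size], key=len)
        batches.extend(pool[i:i + batch_size]
                       for i in range(0, len(pool), batch_size))
    if shuffle_batches:
        random.shuffle(batches)  # keep some randomness in batch order across epochs
    return batches

# e.g. 100 toy "sentences" of length 1..20
seqs = [[0] * random.randint(1, 20) for _ in range(100)]
batches = length_bucketed_batches(seqs, batch_size=5, shuffle_batches=False)
```

Within each batch, the lengths are then nearly uniform, so padding overhead per batch stays small.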


Hi OP, I ran into the same problem. I also ended up adapting the nmt code to allow batch_size > 1, in much the same way as you did, but it now feels rather slow. Have you made any further improvements to your approach since?

@eatingAbiscuit
@kenjewu
The tutorial has now been extended to batch size > 1.

Wow, so there is an elegantly implemented version now! Teacher, when do you plan to start season two? :grinning: @astonzhang

Still being planned.

If you have any good ideas, you are welcome to submit them in the feedback form :grinning:


I just submitted my feedback :grinning:~ I think season one already covered plenty of fundamentals; season two could look more at applications in industry and academia — for example, at the MXNet codebase itself, or at implementing a few recent papers together so we can see how it is really done. You could even livestream the process of implementing the model_zoo, showing step by step how a paper is reproduced — that would save a lot of time, too~


Thank you, teacher Zhang.

I had the same idea at first. You can indeed sort by length first and then batch. One drawback is that the batches would then be identical every epoch. You can sort, then shift the order head-to-tail by a few positions, and then batch — that achieves the goal.
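A rough sketch of that shift-and-rebatch idea in plain Python (the names and the shift value are made up; shift would change from epoch to epoch):

```python
def shifted_length_batches(seqs, batch_size, shift):
    """Sort by length once, then rotate the sorted order by `shift` positions
    before cutting into batches, so batch membership varies across epochs."""
    order = sorted(range(len(seqs)), key=lambda i: len(seqs[i]))
    order = order[shift:] + order[:shift]  # head-to-tail rotation
    return [[seqs[i] for i in order[j:j + batch_size]]
            for j in range(0, len(order), batch_size)]

# ten toy "sentences" of distinct lengths
seqs = [[0] * n for n in [3, 9, 1, 7, 5, 2, 8, 4, 6, 10]]
epoch0 = shifted_length_batches(seqs, batch_size=5, shift=0)
epoch1 = shifted_length_batches(seqs, batch_size=5, shift=2)
```

With shift = 0 the batches group the five shortest and five longest sentences; shifting by 2 regroups them while keeping lengths within a batch mostly similar.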