Keras：将MDN层添加到LSTM网络

Question

我的问题简要说明：根据舞蹈序列训练数据，下面详述的长短期记忆网络是否经过适当设计以产生新的舞蹈序列？

背景：我正在与一位希望使用神经网络产生新舞蹈序列的舞者合作。她向我发送了2016 chor-rnn paper，它使用最后的混合密度网络层的LSTM网络完成了这项任务。然而，在我的LSTM网络中添加MDN层之后，我的损失变为负值，结果看起来很混乱。这可能是由于非常小的训练数据，但我想在扩大训练数据大小之前验证模型基础。如果有人可以建议下面的模型是否忽略了一些基本的东西（很有可能），我会非常感谢他们的反馈。

我输入网络的样本数据（下面的X）具有形状（626,55,3），其对应于55个身体位置的626个时间快照，每个具有3个坐标（x，y，然后是z）。所以X1 [11] [2]是时间1的第11个身体部位的z位置：

import requests
import numpy as np

# download the data
requests.get('https://s3.amazonaws.com/duhaime/blog/dancing-with-robots/dance.npy')

# X.shape = time_intervals, n_body_parts, 3
X = np.load('dance.npy')

为了确保正确提取数据，我想象了X的前几帧：

import mpl_toolkits.mplot3d.axes3d as p3
import matplotlib.pyplot as plt
from IPython.display import html
from matplotlib import animation
import matplotlib

matplotlib.rcParams['animation.embed_limit'] = 2**128

def update_points(time, points, X):
  arr = np.array([[ X[time][i][0], X[time][i][1] ] for i in range(int(X.shape[1]))])
  points.set_offsets(arr) # set x, y values
  points.set_3d_properties(X[time][:,2][:], zdir='z') # set z value

def get_plot(X, lim=2, frames=200, duration=45):
  fig = plt.figure()
  ax = p3.Axes3D(fig)
  ax.set_xlim(-lim, lim)
  ax.set_ylim(-lim, lim)
  ax.set_zlim(-lim, lim)
  points = ax.scatter(X[0][:,0][:], X[0][:,1][:], X[0][:,2][:], depthshade=False) # x,y,z vals
  return animation.FuncAnimation(fig,
    update_points,
    frames,
    interval=duration,
    fargs=(points, X),
    blit=False  
  ).to_jshtml()

HTML(get_plot(X, frames=int(X.shape[0])))

这会产生一个像这样的小舞蹈序列：

到现在为止还挺好。接下来，我将x，y和z维度的特征居中：

X -= np.amin(X, axis=(0, 1))
X /= np.amax(X, axis=(0, 1))

使用X可视化生成的HTML(get_plot(X, frames=int(X.shape[0])))显示这些线条使数据居中。接下来，我使用Keras中的Sequential API构建模型本身：

from keras.models import Sequential, Model
from keras.layers import Dense, LSTM, Dropout, Activation
from keras.layers.advanced_activations import LeakyReLU
from keras.losses import mean_squared_error
from keras.optimizers import Adam
import keras, os

# config
look_back = 32 # number of previous time frames to use to predict the positions at time i
lstm_cells = 256 # number of cells in each LSTM "layer"
n_features = int(X.shape[1]) * int(X.shape[2]) # number of coordinate values to be predicted by each of `m` models
input_shape = (look_back, n_features) # shape of inputs
m = 32 # number of gaussian models to build

# set boolean controlling whether we use MDN or not
use_mdn = True

model = Sequential()
model.add(LSTM(lstm_cells, return_sequences=True, input_shape=input_shape))
model.add(LSTM(lstm_cells, return_sequences=True))
model.add(LSTM(lstm_cells))

if use_mdn:
  model.add(MDN(n_features, m))
  model.compile(loss=get_mixture_loss_func(n_features, m), optimizer=Adam(lr=0.000001))
else:
  model.add(Dense(n_features, activation='tanh'))
  model.compile(loss=mean_squared_error, optimizer='sgd')

model.summary()

建立模型后，我将数据安排在X中以备培训。在这里，我们想要通过检查先前look_back时间片上每个身体部位的位置来预测55个身体部位的x，y，z位置：

# get training data in right shape
train_x = []
train_y = []

n_time, n_obs, n_attrs = [int(i) for i in X.shape]

for i in range(look_back, n_time-1, 1):
  train_x.append( X[i-look_back:i].reshape(look_back, n_obs * n_attrs) )
  train_y.append( X[i+1].ravel() )

train_x = np.array(train_x)
train_y = np.array(train_y)

最后我训练模型：

from livelossplot import PlotLossesKeras

# fit the model
model.fit(train_x, train_y, epochs=1024, batch_size=1, callbacks=[PlotLossesKeras()])

训练后，我可视化模型创建的新时间片：

# generate `n_frames` of new output time slices
n_frames = 3000

# seed the data to plot with the first `look_back` animation frames
data = X[0:look_back]

x0, x1, x2 = [int(i) for i in train_x.shape]
d0, d1, d2 = [int(i) for i in data.shape]

for i in range(look_back, n_frames, 1):
  # get the model's prediction for the next position of points at time `i`
  result = model.predict(train_x[i].reshape(1, x1, x2))
  # if using the mixed density network, pull out vals that describe vertex positions
  if use_mdn:
    result = np.apply_along_axis(sample_from_output, 1, result, n_features, m, temp=1.0)
  # reshape the result into the form of rows in `X`
  result = result.reshape(1, d1, d2)
  # push the result into the shape of `train_x` observations
  stacked = np.vstack((data[i-look_back+1:i], result)).reshape(1, x1, x2)
  # add the result to the `train_x` observations
  train_x = np.vstack((train_x, stacked))
  # add the result to the dataset for plotting
  data = np.vstack((data[:i], result))

如果我将use_mdn设置为上面的False而是使用简单的平方误差损失（L2 Loss），那么得到的可视化似乎有点令人毛骨悚然，但仍然具有普遍的人类形状。

但是，如果我将use_mdn设置为True，并使用自定义MDN损失函数，则结果非常奇怪。我认识到MDN层添加了大量需要训练的参数，并且可能需要数量级更多的训练数据来实现与L2损失函数输出一样的人形输出。

也就是说，我想问一下，与神经网络模型合作的其他人是否比我自己更广泛地看到了上述方法的任何根本错误。对这个问题的任何见解都会非常有帮助。