Paper Close Reading: Benchmarking Deep Learning Interpretability in Time Series Predictions

Posted 2023-03-07 程序媛小哨



Abstract

Saliency methods are used extensively to highlight the importance of input features in model predictions. These methods are mostly used in vision and language tasks, and their application to time series data is relatively unexplored. In this paper, we set out to extensively compare the performance of various saliency-based interpretability methods across diverse neural architectures, including Recurrent Neural Networks, Temporal Convolutional Networks, and Transformers, in a new benchmark of synthetic time series data. We propose and report multiple metrics to empirically evaluate the performance of saliency methods for detecting feature importance over time using both precision (i.e., whether identified features contain meaningful signals) and recall (i.e., the number of features with signal identified as important). Through several experiments, we show that
(i) in general, network architectures and saliency methods fail to reliably and accurately identify feature importance over time in time series data,
(ii) this failure is mainly due to the conflation of time and feature domains, and
(iii) the quality of saliency maps can be improved substantially by using our proposed two-step temporal saliency rescaling (TSR) approach that first calculates the importance of each time step before calculating the importance of each feature at a time step.

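Saliency methods are widely used to highlight how much each input feature matters to a model's prediction. They have mostly been applied to vision and language tasks, and their use on time series data remains largely unexplored. The paper compares a range of saliency-based interpretability methods across different neural architectures (Recurrent Neural Networks, Temporal Convolutional Networks, and Transformers) on a new benchmark of synthetic time series data.
The authors propose and report several metrics that evaluate saliency methods empirically, using precision (do the identified features actually contain meaningful signal?) and recall (how many of the features that carry signal are identified as important?) to measure how well feature importance is detected over time. Their experiments show that:
(i) in general, network architectures and saliency methods fail to identify feature importance over time reliably and accurately in time series data,
(ii) this failure is mainly due to the conflation of the time and feature domains,
(iii) the quality of saliency maps can be improved substantially with the proposed two-step temporal saliency rescaling (TSR) approach, which first computes the importance of each time step and then the importance of each feature within a time step.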

1 Introduction

As the use of Machine Learning models increases in various domains [1, 2], the need for reliable model explanations is crucial [3, 4]. This need has resulted in the development of numerous interpretability methods that estimate feature importance [5–13]. As opposed to the task of understanding the prediction performance of a model, measuring and understanding the performance of interpretability methods is challenging [14–18] since there is no ground truth to use for such comparisons. For instance, while one could identify sets of informative features for a specific task a priori, models may not necessarily have to draw information from these features to make accurate predictions. In multivariate time series data, these challenges are even more profound since we cannot rely on human perception as one would when visualizing interpretations by overlaying saliency maps over images or when highlighting relevant words in a sentence.

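As machine learning models are used in more and more domains [1, 2], reliable model explanations become essential [3, 4]. This need has produced many interpretability methods that estimate feature importance [5–13]. Unlike a model's predictive performance, the performance of an interpretability method is hard to measure and understand [14–18], because there is no ground truth to compare against.
For example, although one can identify a set of informative features for a task a priori, the model does not necessarily have to use those features to predict accurately. The difficulty is greater for multivariate time series, where we cannot fall back on human perception as we do when overlaying a saliency map on an image or highlighting relevant words in a sentence.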

In this work, we compare the performance of different interpretability methods, both perturbation-based and gradient-based, across diverse neural architectures including Recurrent Neural Networks, Temporal Convolutional Networks, and Transformers when applied to the classification of multivariate time series. We quantify the performance of every (architecture, estimator) pair for time series data in a systematic way. We design and generate multiple synthetic datasets to capture different temporal-spatial aspects (e.g., Figure 1). Saliency methods must be able to distinguish important and non-important features at a given time, and capture changes in the importance of features over time. The positions of informative features in our synthetic datasets are known a priori (colored boxes in Figure 1); however, the model might not need all informative features to make a prediction. To identify features needed by the model, we progressively mask the features identified as important by each interpretability method and measure the accuracy degradation of the trained model.
We then calculate the precision and recall for (architecture, estimator) pairs at different masks by comparing them to the known set of informative features.

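The paper compares interpretability methods, both perturbation-based and gradient-based, across several neural architectures (Recurrent Neural Networks, Temporal Convolutional Networks, and Transformers) on multivariate time series classification. Every (architecture, estimator) pair is evaluated on time series data in a systematic way, using multiple synthetic datasets designed to capture different temporal and spatial aspects (e.g., Figure 1).

A saliency method must separate important from unimportant features at a given time and must also track how feature importance changes over time. The positions of the informative features in the synthetic datasets are known a priori (the colored boxes in Figure 1), but the model may not need all of them to make a prediction.

To find the features the model actually needs, the authors progressively mask the features that each interpretability method ranks as important and measure how much the trained model's accuracy degrades. Precision and recall for each (architecture, estimator) pair are then computed at each masking level by comparison with the known set of informative features.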
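
The comparison against the known informative features can be illustrated with a plain set-based precision and recall. This is only a sketch under a simplifying assumption: every selected position counts equally, whereas the paper's own metrics may weight positions differently. The function name `precision_recall` and the `(feature, time)` tuple encoding are hypothetical.

```python
from typing import Iterable, Tuple

Position = Tuple[int, int]  # (feature index i, time step t)


def precision_recall(selected: Iterable[Position],
                     informative: Iterable[Position]) -> Tuple[float, float]:
    """Unweighted precision and recall of selected positions vs. the known informative ones."""
    selected_set, informative_set = set(selected), set(informative)
    hits = len(selected_set & informative_set)
    # Guard against empty sets to avoid division by zero.
    precision = hits / len(selected_set) if selected_set else 0.0
    recall = hits / len(informative_set) if informative_set else 0.0
    return precision, recall


# Three positions flagged by a saliency method, four truly informative positions.
flagged = [(0, 0), (0, 1), (1, 1)]
truth = [(0, 0), (0, 1), (0, 2), (0, 3)]
p, r = precision_recall(flagged, truth)
```

Here two of the three flagged positions are informative, and two of the four informative positions were found.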

Figure 1: Different evaluation datasets used for benchmarking saliency methods. Some datasets have multiple variations shown as sub-levels. N/S: normal and small shapes, T/F: temporal and feature positions, M: moving shape. All datasets are trained for binary classification, except MNIST.
Examples are shown above each dataset, where dark red/blue shapes represent informative features.
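Notes on Figure 1: the figure shows the evaluation datasets used to benchmark the saliency methods. Some datasets have several variants, shown as sub-levels. N/S denotes normal and small shapes, T/F denotes temporal and feature positions, and M denotes a moving shape. All datasets except MNIST are binary classification tasks. An example is shown above each dataset, with dark red/blue shapes marking the informative features.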

Based on our extensive experiments, we report the following observations:
(i) feature importance estimators that produce high-quality saliency maps in images often fail to provide similar high-quality interpretation in time series data,
(ii) saliency methods tend to fail to distinguish important vs. non-important features in a given time step; if a feature at a given time is assigned high saliency, then almost all other features in that time step tend to have high saliency regardless of their actual values,
(iii) model architectures have significant effects on the quality of saliency maps.

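From these experiments, the authors report three observations:
(i) feature importance estimators that produce high-quality saliency maps on images often fail to give comparably good interpretations on time series data,
(ii) saliency methods tend not to distinguish important from unimportant features within a time step: once one feature at a given time receives high saliency, almost every other feature at that time step tends to receive high saliency as well, whatever its actual value,
(iii) the model architecture has a significant effect on the quality of the saliency maps.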

After the aforementioned analysis and to improve the quality of saliency methods in time series data, we propose a two-step Temporal Saliency Rescaling (TSR) approach that can be used on top of any existing saliency method adapting it to time series data. Briefly, the approach works as follows:
(a) we first calculate the time-relevance score for each time by computing the total change in saliency values if that time step is masked; then
(b) in each time step whose time-relevance score is above a certain threshold, we calculate the feature-relevance score for each feature by computing the total change in saliency values if that feature is masked.
The final (time, feature) importance score is the product of associated time and feature relevance scores. This approach substantially improves the quality of saliency maps produced by various methods when applied to time series data. Figure 4 shows the initial performance of multiple methods, while Figure 5 shows their performance coupled with our proposed TSR method.

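To improve saliency quality on time series data, the authors propose two-step Temporal Saliency Rescaling (TSR), which can be applied on top of any existing saliency method to adapt it to time series. In brief:
(a) compute a time-relevance score for each time step as the total change in saliency values when that time step is masked; then
(b) for each time step whose time-relevance score exceeds a threshold, compute a feature-relevance score for each feature as the total change in saliency values when that feature is masked.
The final (time, feature) importance score is the product of the corresponding time-relevance and feature-relevance scores. Applied to time series data, this substantially improves the saliency maps produced by a variety of methods.
Figure 4 shows the initial performance of several methods, and Figure 5 shows their performance when combined with TSR.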
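
Steps (a) and (b) can be sketched in a few lines of plain Python. This is an illustrative reading of the description above, not the authors' implementation. Several details are assumptions: masked entries are set to a constant `mask_value`, masking a feature means masking it at every time step, "total change" is taken as the sum of absolute differences, and `toy_saliency` is a hypothetical stand-in for a real saliency method.

```python
from typing import Callable, List

Matrix = List[List[float]]  # X[i][t]: feature i at time step t


def total_change(a: Matrix, b: Matrix) -> float:
    """Sum of absolute element-wise differences between two saliency maps."""
    return sum(abs(x - y) for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))


def temporal_saliency_rescaling(
    X: Matrix,
    saliency: Callable[[Matrix], Matrix],
    threshold: float,
    mask_value: float = 0.0,  # assumption: masked entries are set to a constant
) -> Matrix:
    n_features, n_steps = len(X), len(X[0])
    base = saliency(X)

    # Step (a): time-relevance = total change in saliency when a time step is masked.
    time_rel = []
    for t in range(n_steps):
        masked = [row[:] for row in X]
        for i in range(n_features):
            masked[i][t] = mask_value
        time_rel.append(total_change(base, saliency(masked)))

    # Step (b): feature-relevance = total change in saliency when a feature is masked.
    # Assumption: masking a feature means masking it at every time step.
    feat_rel = []
    for i in range(n_features):
        masked = [row[:] for row in X]
        masked[i] = [mask_value] * n_steps
        feat_rel.append(total_change(base, saliency(masked)))

    # Final score: product of both scores, zero for time steps at or below the threshold.
    return [
        [feat_rel[i] * time_rel[t] if time_rel[t] > threshold else 0.0
         for t in range(n_steps)]
        for i in range(n_features)
    ]


def toy_saliency(X: Matrix) -> Matrix:
    """Hypothetical stand-in for a real saliency method: |x| per entry."""
    return [[abs(x) for x in row] for row in X]


scores = temporal_saliency_rescaling(
    [[0.0, 5.0, 0.0], [0.0, 0.0, 0.0]], toy_saliency, threshold=1.0
)
```

In this toy input only feature 0 at time step 1 carries signal, so it is the only entry with a nonzero rescaled score. The sketch calls the saliency method once per time step and once per feature, which is the main cost of wrapping an existing method this way.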

2 Background and Related Work

The interest in interpretability has resulted in several diverse lines of research, all with the common goal of understanding how a network makes a prediction. [19–23] focus on making neural models more interpretable. [24, 9, 11, 6, 7, 25] estimate the importance of an input feature for a specified output. Kim et al. [26] provide an interpretation in terms of human concepts. One key question is whether or not interpretability methods are reliable. Kindermans et al. [17] show that the explanation can be manipulated by transformations that do not affect the decision-making process. Ghorbani et al. [15] introduce an adversarial attack that changes the interpretation without changing the prediction. Adebayo et al. [16] measure changes in the attribution when randomizing model parameters or labels.
Similar to our line of work, modification-based evaluation methods [27–29] involve applying a saliency method, ranking features according to their saliency values, recursively eliminating higher-ranked features, and measuring the degradation in the trained model's accuracy. Hooker et al. [14] propose retraining the model after feature elimination.
Recent work [23, 30, 31] has identified some limitations in time series interpretability. We provide the first benchmark that systematically evaluates different saliency methods across multiple neural architectures in a multivariate time series setting, identifies common limitations, and proposes a solution to adapt existing methods to time series.

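Interest in interpretability has led to several distinct lines of research that share one goal: understanding how a network arrives at a prediction.

[19–23] work on making neural models themselves more interpretable. [24, 9, 11, 6, 7, 25] estimate the importance of an input feature for a specified output.
Kim et al. [26] interpret models in terms of human concepts. A key question is whether interpretability methods are reliable.
Kindermans et al. [17] show that explanations can be manipulated by transformations that do not affect the decision-making process.
Ghorbani et al. [15] introduce an adversarial attack that changes the interpretation without changing the prediction.
Adebayo et al. [16] measure how attributions change when model parameters or labels are randomized.

Closest to this paper are the modification-based evaluation methods [27–29]: apply a saliency method, rank the features by saliency value, recursively eliminate the highest-ranked features, and measure the degradation in the trained model's accuracy.
Hooker et al. [14] propose retraining the model after feature elimination.
Recent work [23, 30, 31] has pointed out some limitations of time series interpretability.
The paper presents itself as the first benchmark that systematically evaluates different saliency methods across multiple neural architectures in a multivariate time series setting, identifies common limitations, and proposes a way to adapt existing methods to time series.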

2.1 Saliency Methods

We compare popular backpropagation-based and perturbation-based post-hoc saliency methods; each method assigns a feature importance, or relevance, score to each input feature at a given time step. All methods are compared with random assignment as a baseline control.
In this benchmark, the following saliency methods are included:

Gradient-based: Gradient (GRAD) [5] is the gradient of the output with respect to the input. Integrated Gradients (IG) [9] is the average gradient as the input changes from a non-informative reference point. SmoothGrad (SG) [10] computes the gradient n times, adding noise to the input each time. DeepLIFT (DL) [11] defines a reference point; relevance is the difference between each neuron's activation and its reference activation. Gradient SHAP (GS) [12] adds noise to each input, selects a point along the path between a reference point and the input, and computes the gradient of the outputs with respect to those points. Deep SHAP (DeepLIFT + Shapley values) (DLS) [12] takes a distribution of baselines, computes the attribution for each input-baseline pair, and averages the resulting attributions per input.

Perturbation-based: Feature Occlusion (FO) [24] computes attribution as the difference in output after replacing each contiguous region with a given baseline. For time series, we considered contiguous regions to be features within the same time step or multiple time steps grouped together. Feature Ablation (FA) [32] computes attribution as the difference in output after replacing each feature with a baseline. Input features can also be grouped and ablated together rather than individually. Feature Permutation (FP) [33] randomly permutes each feature's values within a batch and computes the change in output that results from this modification.

Other: Shapley Value Sampling (SVS) [34] is an approximation of Shapley values that samples random permutations of the input features and averages the marginal contribution of each feature based on the differences across these permutations.

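The paper compares popular backpropagation-based and perturbation-based post-hoc saliency methods.
Each method assigns an importance (relevance) score to every input feature at a given time step.
All methods are compared against random assignment as a baseline control.

The benchmark includes the following saliency methods:

Gradient-based:
Gradient (GRAD) [5]: the gradient of the output with respect to the input.
Integrated Gradients (IG) [9]: the average gradient as the input moves away from a non-informative reference point.
SmoothGrad (SG) [10]: the gradient is computed n times, with noise added to the input each time.
DeepLIFT (DL) [11]: defines a reference point; relevance is the difference between each neuron's activation and its reference activation.
Gradient SHAP (GS) [12]: adds noise to each input, picks a point on the path between a reference point and the input, and computes the gradient of the outputs with respect to those points.
Deep SHAP (DeepLIFT + Shapley values) (DLS) [12]: takes a distribution of baselines, computes the attribution for each input-baseline pair, and averages the attributions per input.
Perturbation-based:
Feature Occlusion (FO) [24]: attribution is the difference in output after each contiguous region is replaced with a given baseline. For time series, a contiguous region is a group of features within the same time step or across several time steps.
Feature Ablation (FA) [32]: attribution is the difference in output after each feature is replaced with a baseline. Features can also be grouped and ablated together instead of one at a time.
Feature Permutation (FP) [33]: randomly permutes each feature's values within a batch and computes the resulting change in output.
Other: Shapley Value Sampling (SVS) [34]: an approximation of Shapley values that samples random permutations of the input features and averages each feature's marginal contribution across them.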
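
To make the perturbation-based family concrete, here is a minimal sketch of Feature Ablation as described above: each entry is replaced by a baseline in turn, and the attribution is the resulting change in model output. The constant baseline of 0, the scalar-output `model` callable, and the toy linear model are assumptions for illustration; a library implementation would differ in interface and efficiency.

```python
from typing import Callable, List

Matrix = List[List[float]]  # X[i][t]: feature i at time step t


def feature_ablation(model: Callable[[Matrix], float], X: Matrix,
                     baseline: float = 0.0) -> Matrix:
    """Attribution of each (feature, time) entry = output drop when it is replaced by the baseline."""
    reference_output = model(X)
    attributions = [[0.0] * len(row) for row in X]
    for i, row in enumerate(X):
        for t in range(len(row)):
            ablated = [r[:] for r in X]   # copy so X itself is never modified
            ablated[i][t] = baseline
            attributions[i][t] = reference_output - model(ablated)
    return attributions


def toy_model(X: Matrix) -> float:
    """Hypothetical model that only looks at feature 0: output = 2 * sum of its values."""
    return 2.0 * sum(X[0])


attr = feature_ablation(toy_model, [[1.0, 2.0], [3.0, 4.0]])
```

Because the toy model ignores feature 1, its attributions are zero, while feature 0 receives attributions proportional to its values.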

2.2 Neural Net Architectures

In this benchmark, we consider three main groups of neural architectures: recurrent networks, convolutional neural networks (CNN), and Transformers. For each group we investigate a subset of models that are commonly used for time series data. Recurrent models include LSTM [35] and LSTM with Input-Cell Attention [23], a variant of LSTM that attends to inputs from different time steps.
For CNNs, we use the Temporal Convolutional Network (TCN) [36–38], a CNN that handles long time series sequences. Finally, we consider the original Transformers [39] implementation.

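The benchmark covers three main groups of neural architectures:
recurrent networks (RNN), convolutional neural networks (CNN), and Transformers.
Within each group, the authors study a subset of models commonly used for time series data.

Recurrent models: LSTM [35] and LSTM with Input-Cell Attention [23], an LSTM variant that attends to inputs from different time steps.

CNN: Temporal Convolutional Network (TCN) [36–38], a CNN that handles long time series sequences.

Transformer: the original Transformers [39] implementation.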

3 Problem Definition

We study a time series classification problem where all time steps contribute to making the final output; labels are available after the last time step. In this setting, a network takes multivariate time series input $X = [x_1, \ldots, x_T] \in \mathbb{R}^{N \times T}$, where $T$ is the number of time steps and $N$ is the number of features. Let $x_{i,t}$ be the input feature $i$ at time $t$. Similarly, let $X_{:,t} \in \mathbb{R}^{N}$ and $X_{i,:} \in \mathbb{R}^{T}$ be the feature vector at time $t$ and the time vector for feature $i$, respectively. The network produces an output $S(X) = [S_1(X), \ldots, S_C(X)]$, where $C$ is the total number of classes (i.e., outputs). Given a target class $c$, the saliency method finds the relevance $R(X) \in \mathbb{R}^{N \times T}$, which assigns a relevance score $R_{i,t}(X)$ to input feature $i$ at time $t$.

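In other words, the task is time series classification in which every time step contributes to the final output and the label becomes available only after the last step. The input is a multivariate time series $X = [x_1, \ldots, x_T] \in \mathbb{R}^{N \times T}$ with $T$ time steps and $N$ features, and $x_{i,t}$ is feature $i$ at time $t$. $X_{:,t} \in \mathbb{R}^{N}$ is the feature vector at time $t$, and $X_{i,:} \in \mathbb{R}^{T}$ is the time vector of feature $i$. The network outputs $S(X) = [S_1(X), \ldots, S_C(X)]$, where $C$ is the number of classes. For a target class $c$, a saliency method returns a relevance map $R(X) \in \mathbb{R}^{N \times T}$ whose entry $R_{i,t}(X)$ is the relevance score of feature $i$ at time $t$.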
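
The indexing conventions can be checked on a tiny example. The sketch below stores $X$ as a nested list with `X[i][t]` holding $x_{i,t}$; the sizes and the helper names `feature_vector` and `time_vector` are hypothetical and chosen only for illustration.

```python
from typing import List

Matrix = List[List[float]]

N, T = 3, 4  # illustrative sizes: 3 features, 4 time steps

# X[i][t] holds x_{i,t}; the value 10*i + t makes each index easy to read off.
X: Matrix = [[float(10 * i + t) for t in range(T)] for i in range(N)]


def feature_vector(X: Matrix, t: int) -> List[float]:
    """X_{:,t}: all N features at time step t."""
    return [row[t] for row in X]


def time_vector(X: Matrix, i: int) -> List[float]:
    """X_{i,:}: feature i across all T time steps."""
    return list(X[i])


# A relevance map R(X) has the same N x T shape as X, one score per (i, t).
R: Matrix = [[0.0] * T for _ in range(N)]
```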

4 Benchmark Design and Evaluation Metrics

4.1 Dataset Design

Since evaluating interpretability through saliency maps in multivariate time series datasets is nontrivial, we design multiple synthetic datasets where we can control and examine different design aspects that emerge in typical time series datasets. We extend the synthetic data proposed by Ismail et al [23] for binary classification. We consider how the discriminating signal is distributed over both time and feature axes, reflecting the importance of time and feature dimensions separately. We also examine how the signal is distributed between classes: difference in value, position, or shape. Additionally, we modify the classification difficulty by decreasing the number of informative features (reducing feature redundancy), i.e., small box datasets. Along with synthetic datasets, we included MNIST as a multivariate time series as a more general case (treating one of the image axes as time). Different dataset combinations are shown in Figure 1.

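Because evaluating interpretability through saliency maps is nontrivial on multivariate time series datasets, the authors design multiple synthetic datasets in which they can control and examine the design aspects that arise in typical time series data.
They extend the synthetic data proposed by Ismail et al. [23] for binary classification, and consider how the discriminating signal is distributed along both the time and feature axes, which reflects the importance of the time and feature dimensions separately.
They also examine how the signal differs between classes: in value, position, or shape. Classification difficulty is varied by reducing the number of informative features (less feature redundancy), which gives the small box datasets. In addition to the synthetic datasets, MNIST is included as a more general case by treating one of the image axes as time, so that each image becomes a multivariate time series. The dataset combinations are shown in Figure 1.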

Each synthetic dataset is generated by seven different processes as shown in Figure 2, giving a total of 70 datasets. Each feature is independently sampled from either:
(a) Gaussian with zero mean and unit variance.
(b) Independent sequences of a standard autoregressive time series with Gaussian noise.
(c) A standard continuous autoregressive time series with Gaussian noise.
(d) Sampled according to a Gaussian Process mixture model.
(e) Nonuniformly sampled from a harmonic function.
(f) Sequences of standard non–linear autoregressive moving average (NARMA) time series with Gaussian noise.
(g) Nonuniformly sampled from a pseudo-periodic function with Gaussian noise.
Informative features are then highlighted by the addition of a constant µ to the positive class and subtraction of µ from the negative class (unless specified, µ = 1); the embedding size for each sample is N = 50, and the number of time steps is T = 50. Figures throughout the paper show data generated as Gaussian noise unless otherwise specified. Further details are provided in the supplementary material.

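Each synthetic dataset is generated by seven different processes, shown in Figure 2, for a total of 70 datasets. Each feature is sampled independently from one of:
(a) a Gaussian with zero mean and unit variance;
(b) independent sequences of a standard autoregressive time series with Gaussian noise;
(c) a standard continuous autoregressive time series with Gaussian noise;
(d) a Gaussian Process mixture model;
(e) a harmonic function, sampled nonuniformly;
(f) sequences of a standard nonlinear autoregressive moving average (NARMA) time series with Gaussian noise;
(g) a pseudo-periodic function with Gaussian noise, sampled nonuniformly.
Informative features are then highlighted by adding a constant µ to the positive class and subtracting µ from the negative class (µ = 1 unless stated otherwise). Each sample has embedding size N = 50 and T = 50 time steps. Unless noted otherwise, the figures in the paper show data generated as Gaussian noise. Further details are in the supplementary material.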
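
As an illustration of process (a), the sketch below draws an N = 50 by T = 50 sample of standard Gaussian noise and shifts a box of informative entries by +µ for the positive class or -µ for the negative class. The box position, the function name `make_sample`, and the seeding scheme are assumptions; the paper's datasets place and shape the informative regions in several different ways.

```python
import random
from typing import List, Optional, Tuple

Matrix = List[List[float]]
Position = Tuple[int, int]  # (feature index i, time step t)


def make_sample(
    label: int,
    n_features: int = 50,
    n_steps: int = 50,
    box: Tuple[int, int, int, int] = (20, 30, 20, 30),  # hypothetical box: features 20-29, steps 20-29
    mu: float = 1.0,
    seed: Optional[int] = None,
) -> Tuple[Matrix, List[Position]]:
    """Gaussian-noise sample with an informative box shifted by +mu (label 1) or -mu (label 0)."""
    rng = random.Random(seed)
    f0, f1, t0, t1 = box
    sign = 1.0 if label == 1 else -1.0

    # Process (a): every entry is drawn independently from N(0, 1).
    X = [[rng.gauss(0.0, 1.0) for _ in range(n_steps)] for _ in range(n_features)]

    # Highlight the informative entries and remember where they are.
    informative = []
    for i in range(f0, f1):
        for t in range(t0, t1):
            X[i][t] += sign * mu
            informative.append((i, t))
    return X, informative


positive, informative_positions = make_sample(label=1, seed=0)
negative, _ = make_sample(label=0, seed=0)
```

With the same seed, the two samples share their noise and differ by 2µ inside the box only, which makes the ground-truth informative positions easy to verify.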

4.2 Feature Importance Identification

Modification-based evaluation metrics [27–29] have two main issues. First, they assume that feature ranking based on saliency faithfully represents feature importance. Consider the saliency distributions shown in Figure 3. Saliency decays exponentially with feature ranking, meaning that features that are closely ranked might have substantially different saliency values. A second issue, as discussed by Hooker et al [14], is that eliminating features changes the test data distribution violating the assumption that both training and testing data are independent and identically distributed (i.i.d.).
Hence, model accuracy degradation may be a result of changing data distribution rather than removing salient features. In our synthetic dataset benchmark, we address these two issues by the following:

Sort relevance $R(X)$, so that $R_e(x_{i,t})$ is the $e^{\text{th}}$ element in the ordered set $\{R_e(x_{i,t})\}_{e=1}^{T \times N}$.
Find the top $k$ relevant features in the ordered set such that $\frac{\sum_{e=1}^{k} R_e(x_{i,t})}{\sum_{i=1,t=1}^{N,T} R(x_{i,t})} \approx d$ (where $d$ is a pre-determined percentage).
Replace $x_{i,t}$, where $R_e(x_{i,t}) \in \{R_e(x_{i,t})\}_{e=1}^{k}$, with the original distribution (known since this is a synthetic dataset).
Calculate the drop in model accuracy after the masking; this is repeated at different values of $d = [0, 10, ..., 100]$.
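
The first three steps can be sketched as follows. This is a minimal illustration, not the authors' code. It assumes non-negative relevance scores, reads "approximately d" as "the smallest k whose cumulative share reaches d", takes `d` as a fraction in [0, 1] rather than a percentage, and uses a caller-supplied `sample_original` function as a stand-in for drawing from the known generating distribution. The accuracy measurement itself is omitted because it needs a trained model.

```python
from typing import Callable, List, Tuple

Matrix = List[List[float]]
Position = Tuple[int, int]  # (feature index i, time step t)


def top_positions_by_share(R: Matrix, d: float) -> List[Position]:
    """Top-ranked positions whose cumulative relevance first reaches fraction d of the total."""
    ranked = sorted(
        ((R[i][t], i, t) for i in range(len(R)) for t in range(len(R[i]))),
        reverse=True,  # highest relevance first
    )
    total = sum(r for r, _, _ in ranked)
    selected, running = [], 0.0
    for r, i, t in ranked:
        if running >= d * total:
            break
        selected.append((i, t))
        running += r
    return selected


def mask_positions(X: Matrix, positions: List[Position],
                   sample_original: Callable[[], float]) -> Matrix:
    """Return a copy of X with the selected entries redrawn from the original distribution."""
    masked = [row[:] for row in X]
    for i, t in positions:
        masked[i][t] = sample_original()
    return masked


relevance = [[4.0, 1.0], [3.0, 2.0]]
chosen = top_positions_by_share(relevance, d=0.5)
masked = mask_positions([[1.0, 2.0], [3.0, 4.0]], chosen, sample_original=lambda: 0.0)
```

Selecting by cumulative relevance share rather than by a fixed number of features is what addresses the first issue above: closely ranked features with very different saliency values are no longer treated as equally important.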

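Modification-based evaluation metrics [27–29] have two main problems.
First, they assume that a saliency-based feature ranking faithfully reflects feature importance. Consider the saliency distributions in Figure 3: saliency decays exponentially with rank, so features that are adjacent in the ranking can have substantially different saliency values.
Second, as Hooker et al. [14] discuss, eliminating features changes the test data distribution, which violates the assumption that training and test data are independent and identically distributed (i.i.d.).

The drop in model accuracy may therefore come from the shift in data distribution rather than from the removal of salient features. The synthetic benchmark addresses both problems as follows: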

Sort the relevance $R(X)$ so that $R_e(x_{i,t})$ is the $e^{\text{th}}$ element of the ordered set $\{R_e(x_{i,t})\}_{e=1}^{T \times N}$.
Find the top $k$ relevant features in the ordered set such that