Notes on DCANet: Learning Connected Attentions for Convolutional Neural Networks
This paper was published at ECCV 2020.

Self-attention considers only the current features and thus cannot fully exploit the advantages of the attention mechanism.

In this paper, the authors propose the Deep Connected Attention Network (DCANet), which enhances attention modules without modifying the internal structure of CNNs. To achieve this, adjacent attention blocks are interconnected so that information can flow between attention blocks.

With DCANet, the attention blocks in a CNN model are trained jointly, which improves the ability of attention learning. Moreover, DCANet generalizes well: it is not limited to a particular attention module or base network architecture.
One question needs to be answered:

Do we make full use of the self-attention mechanism?

The authors answer this question from two perspectives:

Experimental results show that two simultaneous stimuli in the human cortex are not processed independently; they interact with each other. However, this property of human visual attention was not considered when self-attention mechanisms were designed.

Existing attention networks simply place an attention block after each convolutional block; the attention block learns only from the current feature map and shares no information with the others.
In addition, the authors study self-attention using SENet, a simple attention network that models channel relationships.

Attention maps are visualized in Figure 1.

Interestingly, they observe that the SE blocks can hardly shift attention to the key regions, and even change focus significantly across stages.

The histogram shows that the SE values cluster around 0.5, which indicates insufficient learning in the attention blocks. A reasonable explanation is that the lack of extra information when learning self-attention weakens their discriminative ability.
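To make this concrete, here is a minimal NumPy sketch of an SE-style channel attention block (the weight shapes, reduction ratio, and function name are illustrative assumptions, not the paper's exact configuration). Since the excitation ends in a sigmoid, pre-activation values near zero yield attention values near 0.5, which is exactly the clustering the histogram reveals.

```python
import numpy as np

def se_block(x, w1, w2):
    """Minimal SE (Squeeze-and-Excitation) block on a (C, H, W) feature map.

    w1: (C, C//r) and w2: (C//r, C) are the two FC weight matrices
    (r is the reduction ratio)."""
    # Squeeze: global average pooling -> one descriptor per channel.
    z = x.mean(axis=(1, 2))                      # (C,)
    # Excitation: bottleneck FC -> ReLU -> FC -> sigmoid.
    s = np.maximum(z @ w1, 0.0)                  # (C//r,)
    a = 1.0 / (1.0 + np.exp(-(s @ w2)))          # (C,), attention in (0, 1)
    # Reweight each channel of the input feature map.
    return x * a[:, None, None]
```

Because each attention value lies in (0, 1), the block can only scale channels down, never amplify them.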
Therefore, the authors consider connecting attention blocks and propose DCANet to solve this problem.

The contributions are as follows:

The Deep Connected Attention Network is conceptually simple but empirically powerful. By analyzing the internal structure of various attention blocks, a general connection scheme is proposed that is not restricted to any particular attention block.

Different attention blocks serve different purposes and are implemented differently. For example, the SE block contains two fully connected layers, while the GC block contains several different convolutional layers.

Therefore, it is not easy to directly provide a standard connection scheme general enough to cover most attention blocks.
The authors study various attention modules and develop a general attention framework in which an attention block consists of three components: extraction, transformation, and fusion.
For a feature map X produced by a convolutional block, features are extracted by an extractor g:

G = g(X, θ)

where θ denotes the extraction parameters and G is the output. When g operates without parameters θ, this behaves like a pooling operation.

Define t as the feature transformation operation; the transformed output of an attention block can then be expressed as:

T = t(G, ω)

where ω denotes the parameters and T is the output.

The attention-guided output X' can be expressed as X' = X ⊛ T, where ⊛ denotes a fusion operation such as element-wise multiplication.

Ignoring implementation details, an attention block can therefore be expressed as:

X' = X ⊛ t(g(X, θ), ω)
The connected attention block proposed by the authors is expressed as:

G̃ = f(αG, βT')

where f is the connection function, α and β are learnable parameters, and T' is the output produced by the previous attention block. In some cases (such as the SE block and GE block), T' is scaled to the range (0, 1). For these attention blocks, T' is multiplied by the maximal value of the previous attention block's extraction output to match the scale. Note also that setting α to 1 and β to 0 disables the attention connection and reduces a DCA-enhanced attention block to a vanilla attention module; a vanilla network is thus a special case of a DCA-enhanced attention network.
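As a sketch, one DCA-enhanced channel-attention step might look like the following, assuming a direct additive connection for f and a simplified one-layer SE-style transformation (all function names and shapes here are hypothetical):

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def dca_attention(x, w, t_prev, alpha=1.0, beta=1.0):
    """One DCA-enhanced channel-attention step on a (C, H, W) feature map.

    The previous block's transformation output t_prev (shape (C,)) is merged
    into the current extraction G via the direct (additive) connection f."""
    g = x.mean(axis=(1, 2))                 # extraction: global avg pool -> G
    g_tilde = alpha * g + beta * t_prev     # attention connection: f(aG, bT')
    t = sigmoid(g_tilde @ w)                # transformation -> T in (0, 1)
    return x * t[:, None, None], t          # fused output, and T for the next block
```

With alpha = 1 and beta = 0 the connection term vanishes and the step reduces to a vanilla attention block, as noted above.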
Two ways of instantiating the connection function f are introduced.
The feature maps produced by a CNN differ across stages, and so do the associated attention maps. When the sizes differ, the feature maps are hard to connect. To solve this, the shapes of the attention maps are adaptively matched along the channel and spatial dimensions.

For channels, a fully connected layer (followed by layer normalization and a ReLU activation) is used to match the size, transforming C' channels into C channels, where C' and C denote the numbers of previous and current channels respectively; a single such layer costs C' × C parameters. To reduce the parameter burden, two lightweight fully connected layers with a smaller intermediate output size are used instead.
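A minimal sketch of the single-layer variant of this channel matching (fully connected layer, then layer normalization, then ReLU); the parameter shapes here are illustrative assumptions:

```python
import numpy as np

def match_channels(t_prev, w, gamma, beta_ln):
    """Map a C'-dim attention vector to C dims: FC -> LayerNorm -> ReLU.

    w: (C', C) FC weight; gamma, beta_ln: (C,) LayerNorm affine parameters."""
    h = t_prev @ w                                         # FC: C' -> C
    mu, var = h.mean(), h.var()
    h = gamma * (h - mu) / np.sqrt(var + 1e-5) + beta_ln   # layer normalization
    return np.maximum(h, 0.0)                              # ReLU
```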
To match the spatial resolution, a simple and effective strategy is an average pooling layer. Max pooling also works well, but it considers only part of the information rather than the whole attention map.

(Note: I do not fully understand the claim about considering the whole attention information; doesn't average pooling also look at only part of the information, evaluating within one kernel window at a time?)

Another option is a learnable convolution, but it introduces extra parameters.
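The average-pooling strategy can be sketched as follows, assuming the previous resolution divides evenly into the target resolution:

```python
import numpy as np

def match_spatial(a, out_h, out_w):
    """Downsample an (H, W) attention map to (out_h, out_w) by averaging
    over equal non-overlapping blocks (assumes the sizes divide evenly)."""
    h, w = a.shape
    kh, kw = h // out_h, w // out_w
    # Split into (out_h, kh, out_w, kw) blocks and average within each block.
    return a.reshape(out_h, kh, out_w, kw).mean(axis=(1, 3))
```

For example, pooling a 4×4 map down to 2×2 averages each non-overlapping 2×2 block.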
Some attention blocks attend to multiple attention dimensions. To build a multi-dimensional attention block, the attention maps are connected along each dimension, and the connections along different dimensions are kept independent of each other.

Keeping the attention connections separate brings two advantages: 1) it reduces the number of parameters and the computational overhead; 2) each dimension can focus on its intrinsic properties.
References:
DCANet: Learning Connected Attentions for Convolutional Neural Networks. ECCV 2020.
https://wemp.app/posts/529e101b-1082-4356-ae7b-6c29a7604209
Predictive learning vs. representation learning
Posted in Machine Learning. – February 4, 2013
When you take a machine learning class, there’s a good chance it’s divided into a unit on supervised learning and a unit on unsupervised learning. We certainly care about this distinction for a practical reason: often there’s orders of magnitude more data available if we don’t need to collect ground-truth labels. But we also tend to think it matters for more fundamental reasons. In particular, the following are some common intuitions:
- In supervised learning, the particular algorithm is usually less important than engineering and tuning it really well. In unsupervised learning, we’d think carefully about the structure of the data and build a model which reflects that structure.
- In supervised learning, except in small-data settings, we throw whatever features we can think of at the problem. In unsupervised learning, we carefully pick the features we think best represent the aspects of the data we care about.
- Supervised learning seems to have many algorithms with strong theoretical guarantees, and unsupervised learning very few.
- Off-the-shelf algorithms perform very well on a wide variety of supervised tasks, but unsupervised learning requires more care and expertise to come up with an appropriate model.
I’d argue that this is deceptive. I think real division in machine learning isn’t between supervised and unsupervised, but what I’ll term predictive learning and representation learning. I haven’t heard it described in precisely this way before, but I think this distinction reflects a lot of our intuitions about how to approach a given machine learning problem.
In predictive learning, we observe data drawn from some distribution, and we are interested in predicting some aspect of this distribution. In textbook supervised learning, for instance, we observe a bunch of pairs (x, y), and given some new example x, we're interested in predicting something about the corresponding y. In density modeling (a form of unsupervised learning), we observe unlabeled data x, and we are interested in modeling the distribution the data comes from, perhaps so we can perform inference in that distribution. In each of these cases, there is a well-defined predictive task where we try to predict some aspect of the observable values possibly given some other aspect.
In representation learning, our goal isn’t to predict observables, but to learn something about the underlying structure. In cognitive science and AI, a representation is a formal system which maps to some domain of interest in systematic ways. A good representation allows us to answer queries about the domain by manipulating that system. In machine learning, representations often take the form of vectors, either real- or binary-valued, and we can manipulate these representations with operations like Euclidean distance and matrix multiplication. For instance, PCA learns representations of data points as vectors. We can ask how similar two data points are by checking the Euclidean distance between them.
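As a concrete illustration of the PCA example above (a generic sketch, not tied to any particular library):

```python
import numpy as np

def pca_embed(x, k):
    """Project data of shape (n, d) onto its top-k principal components."""
    xc = x - x.mean(axis=0)                 # center the data
    # SVD of the centered data: rows of vt are the principal directions.
    _, _, vt = np.linalg.svd(xc, full_matrices=False)
    return xc @ vt[:k].T                    # (n, k) vector representation

def euclid(u, v):
    return float(np.linalg.norm(u - v))
```

Similarity queries on the domain then reduce to Euclidean distances between the embedded vectors; with k equal to the full dimensionality the projection is orthogonal, so pairwise distances are preserved exactly.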
In representation learning, the goal isn’t to make predictions about observables, but to learn a representation which would later help us to answer various queries. Sometimes the representations are meant for people, such as when we visualize data as a two-dimensional embedding. Sometimes they’re meant for machines, such as when the binary vector representations learned by deep Boltzmann machines are fed into a supervised classifier. In either case, what’s important is that mathematical operations map to the underlying relationships in the data in systematic ways.
Whether your goal is prediction or representation learning influences the sorts of techniques you’ll use to solve the problem. If you’re doing predictive learning, you’ll probably try to engineer a system which exploits as much information as possible about the data, carefully using a validation set to tune parameters and monitor overfitting. If you’re doing representation learning, there’s no good quantitative criterion, so you’ll more likely build a model based on your intuitions about the domain, and then keep staring at the learned representations to see if they make intuitive sense.
In other words, it parallels the differences I listed above between supervised and unsupervised learning. This shouldn’t be surprising, because the two dimensions are strongly correlated: most supervised learning is predictive learning, and most unsupervised learning is representation learning. So to see which of these dimensions is really the crux of the issue, let’s look at cases where the two differ.
Language modeling is a perfect example of an application which is unsupervised but predictive. The goal is to take a large corpus of unlabeled text (such as Wikipedia) and learn a distribution over English sentences. The problem is motivated by Bayesian models for speech recognition: a distribution over sentences can be used as a prior for what a person is likely to say. The goal, then, is to model the distribution, and any additional structure is unnecessary. Log-linear models, such as that of Mnih et al. [1], are very good at this, and recurrent neural nets [2] are even better. These are the sorts of approaches we’d normally apply in a supervised setting: very good at making predictions, but often hard to interpret. One state-of-the-art algorithm for density modeling of text is PAQ [3], which is a heavily engineered ensemble of sequential predictors, somewhat reminiscent of the winning entries of the Netflix competition.
On the flip side, supervised neural nets are often used to learn representations. One example is Collobert-Weston networks [4], which attempt to solve a number of supervised NLP tasks by learning representations which are shared between them. Some of the tasks are fairly simple and have a large amount of labeled data, such as predicting which of two words should be used to fill in the blank. Others are harder and have less data available, such as semantic role labeling. The simpler tasks are artificial, and they are there to help learn a representation of words and phrases as vectors, where similar words and phrases map to nearby vectors; this representation should then help performance on the harder tasks. We don’t care about the performance on those tasks per se; we care whether the learned embeddings reflect the underlying structure. To debug and tune the algorithm, we’d focus on whether the representations make intuitive sense, rather than on the quantitative performance. There are no theoretical guarantees that such an approach would work — it all depends on our intuitions of how the different tasks are related.
Based on these two examples, it seems like it’s the predictive/representation dimension which determines how we should approach the problem, rather than supervised/unsupervised.
In machine learning, we tend to think there’s no solid theoretical framework for unsupervised learning. But really, the problem is that we haven’t begun to formally characterize the problem of representation learning. If you just want to build a density modeler, that’s about as well understood as the supervised case. But if the goal is to learn representations which capture the underlying structure, that’s much harder to formalize. In my next post, I’ll try to take a stab at characterizing what representation learning is actually about.
[1] Mnih, A., and Hinton, G. E. Three new graphical models for statistical language modeling. NIPS 2009
[2] Sutskever, I., Martens, J., and Hinton, G. E. Generating text with recurrent neural networks. ICML 2011
[3] Mahoney, M. Adaptive weighting of context models for lossless data compression. Florida Institute of Technology Tech report, 2005
[4] Collobert, R., and Weston, J. A unified architecture for natural language processing: deep neural networks with multitask learning. ICML 2008