A branch of machine learning
Re-branded name for neural networks
Why deep? Many layers are chained together in modern deep learning models
Neural networks: historically inspired by the way computation works in the brain
- Consist of computation units called neurons

1.2 Feed-forward NN

Aka multilayer perceptrons
Each arrow carries a weight, reflecting its importance
Certain layers have nonlinear activation functions

1.3 Neuron

Each neuron is a function

- given input x, computes a real-valued (scalar) output h

- scales the input (with weights, w) and adds an offset (bias, b)

- applies a non-linear function g, so that h = g(w · x + b); common choices:

logistic sigmoid
hyperbolic tangent (tanh)
rectified linear unit (ReLU)

- w and b are parameters of the model
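The neuron described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the input, weights and bias values are made up, and the function names are hypothetical.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic sigmoid: squashes z into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def relu(z: float) -> float:
    """Rectified linear unit: max(0, z)."""
    return max(0.0, z)

def neuron(x, w, b, activation=math.tanh):
    """Scale the inputs by the weights, add the bias, apply the non-linearity."""
    z = sum(w_j * x_j for w_j, x_j in zip(w, x)) + b
    return activation(z)

x = [1.0, -2.0, 0.5]   # example input (made up)
w = [0.4, 0.1, -0.6]   # example weights (made up)
b = 0.6                # example bias (made up)

h_tanh = neuron(x, w, b)                          # tanh is the default here
h_sigmoid = neuron(x, w, b, activation=sigmoid)
h_relu = neuron(x, w, b, activation=relu)
```

Whatever the activation, the neuron returns a single scalar; only the range of that scalar changes (tanh gives values in (-1, 1), sigmoid in (0, 1), ReLU in [0, ∞)).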

1.4 Matrix Vector Notation

- Typically have several hidden units, i.e. h_i = g(w_i · x + b_i)

- Each with its own weights (w_i) and bias term (b_i)

- Can be expressed using matrix and vector operators: h = g(Wx + b)

- Where W is a matrix comprising the weight vectors, and b is a vector of all bias terms

- Non-linear function g applied element-wise
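A small sketch of the matrix-vector form, assuming the layer computes h = g(Wx + b) with g applied element-wise. The matrix is stored as a list of rows, one row per hidden unit; the numbers are made up.

```python
import math

def layer(x, W, b, activation=math.tanh):
    """Compute h = g(Wx + b), with g applied to each element.

    W is a list of rows; row i holds the weight vector of hidden unit i.
    """
    return [
        activation(sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i)
        for row, b_i in zip(W, b)
    ]

x = [1.0, 2.0]                                   # input with 2 features
W = [[0.5, -0.25], [0.0, 1.0], [1.0, 1.0]]       # 3 hidden units
b = [0.0, -2.0, 0.5]                             # one bias per hidden unit
h = layer(x, W, b)                               # vector of 3 activations
```

Each row of W reproduces the single-neuron computation, so the whole layer is just that computation repeated once per hidden unit.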

1.5 Output Layer

- Binary classification problem, e.g. classify whether a tweet is + or - in sentiment

sigmoid activation function

- Multi-class classification problem, e.g. native language identification

softmax ensures probabilities > 0 and sum to 1
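The two output activations can be sketched as follows. The scores are made-up values standing in for the last layer's pre-activation outputs; subtracting the maximum inside softmax is a common numerical-stability trick and does not change the result.

```python
import math

def sigmoid(z):
    """Binary case: probability of the positive class."""
    return 1.0 / (1.0 + math.exp(-z))

def softmax(scores):
    """Multi-class case: positive values that sum to 1."""
    m = max(scores)                              # for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

p_positive = sigmoid(0.8)                        # e.g. P(tweet is positive)
class_probs = softmax([2.0, 0.5, -1.0])          # e.g. three candidate classes
```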

1.6 Learning from Data

- How to learn the parameters from data?

- Consider how well the model "fits" the training data, in terms of the probability it assigns to the correct output

want to maximise total probability, L
equivalently minimise -log L with respect to parameters

- Trained using gradient descent

tools like TensorFlow, PyTorch and DyNet use autodiff to compute gradients automatically
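To make the training objective concrete, here is a minimal sketch of gradient descent on -log L for the simplest possible network: a single sigmoid output unit (logistic regression). The gradient is written out by hand rather than obtained by autodiff, and the data set, learning rate and number of steps are made up for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(x, w, b):
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

def nll(data, w, b):
    """-log L: negative log-probability assigned to the correct labels."""
    total = 0.0
    for x, y in data:
        p = predict(x, w, b)
        total -= math.log(p if y == 1 else 1.0 - p)
    return total

def gradient_step(data, w, b, lr):
    """One step of batch gradient descent on -log L."""
    grad_w = [0.0] * len(w)
    grad_b = 0.0
    for x, y in data:
        err = predict(x, w, b) - y               # derivative of -log L w.r.t. the pre-activation
        for j, xj in enumerate(x):
            grad_w[j] += err * xj
        grad_b += err
    new_w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
    return new_w, b - lr * grad_b

# Toy data: (features, label); values are made up.
data = [([1.0, 0.0], 1), ([0.9, 0.2], 1), ([0.1, 1.0], 0), ([0.0, 0.8], 0)]
w, b = [0.0, 0.0], 0.0
loss_before = nll(data, w, b)
for _ in range(200):
    w, b = gradient_step(data, w, b, lr=0.5)
loss_after = nll(data, w, b)
```

With deeper networks the update rule stays the same; only the gradient computation becomes more involved, which is the part autodiff tools take over.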

1.7 Regularisation

- Have many parameters, overfits easily

- Low bias, high variance

- Regularisation is very important in NNs

- L1-norm: sum of absolute values of all parameters (W, b, etc)

- L2-norm: sum of squares of all parameters

- Dropout: randomly zero-out some neurons of a layer
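The two norm penalties can be sketched as terms added to the training loss. The weighting coefficient `lam` and the way the penalty is combined with the data loss are assumptions for illustration; the notes only define the penalties themselves.

```python
def l1_penalty(params):
    """Sum of absolute values of all parameters."""
    return sum(abs(p) for p in params)

def l2_penalty(params):
    """Sum of squares of all parameters."""
    return sum(p * p for p in params)

def regularised_loss(data_loss, params, lam, penalty=l2_penalty):
    """Data loss plus a weighted penalty (lam is a hypothetical hyperparameter)."""
    return data_loss + lam * penalty(params)

params = [0.5, -1.5, 2.0]                        # all parameters flattened into one list
l1 = l1_penalty(params)
l2 = l2_penalty(params)
total = regularised_loss(1.0, params, lam=0.1)
```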

1.8 Dropout

- If dropout rate = 0.1, a random 10% of neurons now have 0 values

- Can apply dropout to any layer, but in practice, mostly to the hidden layers
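A minimal dropout sketch. Two details here go beyond the notes and are assumptions: each neuron is dropped independently with probability `rate` (so the zeroed fraction is only approximately the dropout rate), and surviving activations are rescaled by 1/(1 - rate), the "inverted dropout" convention, so that nothing needs to change at prediction time.

```python
import random

def dropout(activations, rate, rng, training=True):
    """Zero each activation independently with probability `rate`."""
    if not training or rate == 0.0:
        return list(activations)                 # dropout is switched off at prediction time
    keep = 1.0 - rate
    # Survivors are scaled by 1/keep (inverted dropout, an assumption here).
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)                           # fixed seed for reproducibility
hidden = [1.0] * 1000                            # made-up hidden-layer activations
dropped = dropout(hidden, rate=0.1, rng=rng)
zero_fraction = dropped.count(0.0) / len(dropped)
unchanged = dropout(hidden, rate=0.1, rng=rng, training=False)
```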

2. Applications in NLP

2.1 Topic Classification

Given a document, classify it into a predefined set of topics (e.g. economy, politics, sports)

Input: bag-of-words

2.2 Topic Classification - Training

2.3 Topic Classification - Prediction

2.4 Topic Classification - Improvements

- Add a bag of bigrams as input

- Preprocess text to lemmatise words and remove stopwords

- Instead of raw counts, we can weight words using TF-IDF or indicators (0 or 1 depending on presence of words)
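The input variants listed above can be sketched with the standard library. The documents are made up, tokenisation is plain whitespace splitting, and the TF-IDF formula used (raw term count times log(N / document frequency)) is one common variant among several, chosen here as an assumption.

```python
import math
from collections import Counter

docs = [
    "the market fell as the economy slowed",
    "the team won the match",
    "the election and the economy",
]
tokenised = [d.split() for d in docs]

def bag_of_words(tokens):
    """Raw counts per word."""
    return Counter(tokens)

def bag_of_bigrams(tokens):
    """Raw counts per pair of adjacent words."""
    return Counter(zip(tokens, tokens[1:]))

def indicators(tokens):
    """1 if the word is present; absent words are implicitly 0."""
    return {t: 1 for t in set(tokens)}

def tf_idf(tokens, all_docs):
    """Term count weighted by log(N / number of documents containing the term)."""
    n = len(all_docs)
    weights = {}
    for term, tf in Counter(tokens).items():
        df = sum(1 for d in all_docs if term in d)
        weights[term] = tf * math.log(n / df)
    return weights

counts0 = bag_of_words(tokenised[0])
bigrams0 = bag_of_bigrams(tokenised[0])
present0 = indicators(tokenised[0])
tfidf0 = tf_idf(tokenised[0], tokenised)
```

In this toy corpus "the" occurs in every document, so its TF-IDF weight is zero, while a word that occurs in only one document gets the largest weight per occurrence.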

2.5 Language Model Revisited

- Assign a probability to a sequence of words

- Framed as "sliding a window" over the sentence, predicting each word from finite context

E.g., n=3, a trigram model

- Training involves collecting frequency counts

- Difficulty with rare events, addressed by smoothing
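A count-based trigram model fits in a few lines. This sketch uses add-k smoothing as the smoothing method, which is one simple choice made here for illustration; the corpus, the padding symbols `<s>` and `</s>`, and the function names are all made up.

```python
from collections import Counter

def train_trigram(sentences):
    """Collect trigram counts, context (bigram) counts and the vocabulary."""
    tri, ctx, vocab = Counter(), Counter(), set()
    for sentence in sentences:
        tokens = ["<s>", "<s>"] + sentence.split() + ["</s>"]
        vocab.update(tokens)
        for i in range(2, len(tokens)):          # slide a window of 3 words
            tri[tuple(tokens[i - 2:i + 1])] += 1
            ctx[tuple(tokens[i - 2:i])] += 1
    return tri, ctx, vocab

def prob(word, context, tri, ctx, vocab, k=1.0):
    """Add-k smoothed estimate of P(word | context)."""
    return (tri[context + (word,)] + k) / (ctx[context] + k * len(vocab))

corpus = ["salt and pepper", "salt and vinegar", "salt and pepper please"]
tri, ctx, vocab = train_trigram(corpus)
p_pepper = prob("pepper", ("salt", "and"), tri, ctx, vocab)
p_unseen = prob("please", ("salt", "and"), tri, ctx, vocab)   # never seen after "salt and"
```

Without smoothing, the unseen continuation would get probability zero; with add-k it gets a small non-zero share, and the probabilities over the vocabulary still sum to 1.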

2.6 Language Models as Classifiers

- LMs can be considered simple classifiers, e.g. for a trigram model: P(w_i | w_{i-2} = “salt”, w_{i-1} = “and”)

- classifies the likely next word in a sequence, given “salt” and “and”

2.6 Feed-forward NN Language Model

- Use a neural network as a classifier to model P(w_i | w_{i-2}, w_{i-1})

- Input features = the previous two words

- Output class = the next word

- How to represent words? Embeddings
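The forward pass of such a model can be sketched as: look up the embeddings of the two previous words, concatenate them, pass them through one hidden layer, and apply softmax over the vocabulary. The vocabulary, layer sizes, tanh hidden activation and random untrained weights are all assumptions for illustration, so the output probabilities are meaningless until the model is trained.

```python
import math
import random

rng = random.Random(0)
vocab = ["<s>", "salt", "and", "pepper", "vinegar"]
word_to_id = {w: i for i, w in enumerate(vocab)}
EMB, HID = 4, 8                                  # hypothetical embedding and hidden sizes

def rand_matrix(rows, cols):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

E = rand_matrix(len(vocab), EMB)                 # input word embeddings, one row per word
W1, b1 = rand_matrix(HID, 2 * EMB), [0.0] * HID
W2, b2 = rand_matrix(len(vocab), HID), [0.0] * len(vocab)

def affine(W, x, b):
    return [sum(w * xj for w, xj in zip(row, x)) + bi for row, bi in zip(W, b)]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def next_word_probs(prev2, prev1):
    """P(next word | previous two words) under the current (untrained) weights."""
    x = E[word_to_id[prev2]] + E[word_to_id[prev1]]   # list concatenation of two embeddings
    h = [math.tanh(z) for z in affine(W1, x, b1)]
    return softmax(affine(W2, h, b2))

probs = next_word_probs("salt", "and")
```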

2.7 Word Embeddings

- Maps discrete word symbols to continuous vectors in a relatively low dimensional space

- Word embeddings allow the model to capture similarity between words

dog vs. cat
walking vs. running
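Similarity between embeddings is often measured with cosine similarity. The vectors below are hand-picked toy values, not learned embeddings; they only illustrate that vectors pointing in similar directions score close to 1.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 3-dimensional vectors (made up for illustration).
embeddings = {
    "dog": [0.9, 0.8, 0.1],
    "cat": [0.8, 0.9, 0.2],
    "running": [0.1, 0.2, 0.9],
}

sim_dog_cat = cosine(embeddings["dog"], embeddings["cat"])
sim_dog_running = cosine(embeddings["dog"], embeddings["running"])
```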

2.8 Topic Classification

2.9 Training a FFNN LM

2.10 Input and Output Word Embeddings

2.11 Language Model: Architecture

2.12 Advantages of FFNN LM

- Count-based N-gram models (lecture 3)

cheap to train (just collect counts)
problems with sparsity and scaling to larger contexts
don't adequately capture properties of words (grammatical and semantic similarity), e.g., film vs movie

- FFNN N-gram models

automatically capture word properties, leading to more robust estimates

3. Convolutional Networks

- Commonly used in computer vision

- Identify indicative local predictors

- Combine them to produce a fixed-size representation
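The two steps above can be sketched for text as a 1D convolution over word vectors followed by max-pooling over positions. The ReLU activation, the choice of max-pooling as the combining step, and the toy filters and embeddings are assumptions for illustration. Each filter spans two consecutive word vectors, so a sentence needs at least two words.

```python
def conv1d(embeddings, filt, bias=0.0):
    """Slide a filter over consecutive word vectors; one ReLU output per window."""
    width = len(filt)
    outputs = []
    for start in range(len(embeddings) - width + 1):
        window = embeddings[start:start + width]
        z = bias + sum(
            f * e
            for f_row, e_row in zip(filt, window)
            for f, e in zip(f_row, e_row)
        )
        outputs.append(max(0.0, z))
    return outputs

def encode(embeddings, filters):
    """Max over positions for each filter: one number per filter, whatever the length."""
    return [max(conv1d(embeddings, f)) for f in filters]

filters = [
    [[1.0, 0.0], [0.0, 1.0]],                    # toy filter 1 (width 2, embedding size 2)
    [[-1.0, 0.0], [1.0, 0.0]],                   # toy filter 2
]
short_sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
long_sentence = short_sentence + [[0.5, 0.5], [0.0, 2.0]]

rep_short = encode(short_sentence, filters)
rep_long = encode(long_sentence, filters)
```

Both sentences end up as vectors with one entry per filter, which is what makes the representation fixed-size regardless of sentence length.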

3.1 Convolutional Networks for NLP

4. Final Words

Pros

- Excellent performance

- Less hand-engineering of features

- Flexible: customised architecture for different tasks

Cons

- Much slower than classical ML models... needs GPU

- Lots of parameters due to vocabulary size

- Data hungry, not so good on tiny data sets

Pre-training on big corpora helps

Learning Invariant Deep Representation for NIR-VIS Face Recognition

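While looking into heterogeneous image matching, I came across several papers from one group, all on NIR-VIS recognition. They cover many methods for handling heterogeneous images, and the network designs and ideas are good, so I am recording one of them here.

Abstract

VIS-NIR (visible-light to near-infrared) face recognition remains a challenge in heterogeneous image recognition. This paper uses a single network to map both NIR and VIS images into a compact Euclidean space. The low-level layers of the network are trained only on large-scale VIS data. Each convolutional layer is implemented with a simple maxout operator. The high-level layer is divided into two orthogonal subspaces, containing modality-invariant identity information and modality-variant spectrum information respectively. The joint formulation leads to an alternating minimisation approach for learning the deep representation at training time, and to efficient computation on heterogeneous data at test time. Experiments on the CASIA NIR-VIS 2.0 face recognition database show 94 percent accuracy with only a 64-D representation, a 58 percent reduction in error rate compared with previous results.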

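1. Introduction

NIR images offer a cheap and simple way to improve face recognition in low-light conditions. They are less sensitive to illumination changes than VIS images, so they are widely used in applications such as security checks. In real applications, NIR images often have to be used together with VIS images, which creates a matching problem between the two. This is known as the NIR-VIS heterogeneous face recognition problem.

NIR and VIS lie in different spectra, so they naturally differ greatly in appearance. A deep network trained on VIS data therefore contains no NIR spectrum information and cannot handle NIR well. How to use large-scale VIS face data to explore a modality-invariant representation of NIR and VIS faces is worth considering. Thanks to web data, large amounts of VIS face data are easy to obtain, whereas paired NIR data are hard to collect. How to learn from small-scale NIR-VIS data is therefore also a central question.

Earlier NIR-VIS matching methods often rely on a trick to reduce the appearance gap: removing some principal subspaces that may contain spectrum information. Chen (2012) proposed that facial appearance consists of identity information and variation information (e.g. lighting, poses, expressions). Inspired by this, the paper proposes a network that learns an Invariant Deep Representation (IDR) covering both NIR and VIS face information. A single network maps NIR and VIS images into one compact Euclidean space, so that distances between NIR and VIS images in the embedding space correspond directly to face similarity.

The network is first trained on large-scale VIS data, with the convolutional and fully connected layers implemented by a simplified form of the maxout operator. This makes the learned representation robust to intra-class variation. The low-level layers are then fixed and the network is fine-tuned on NIR data. The high-level layer is divided into two orthogonal subspaces: modality-invariant identity information and modality-variant spectrum information. The orthogonality constraint and the maxout operator in the high-level layer shrink the parameter space, which avoids overfitting on small NIR-VIS datasets. The proposed IDR reaches state of the art (SOTA). The contributions are:

- An effective deep network architecture that learns modality-invariant representations, optimised efficiently by alternating minimisation. The architecture naturally combines earlier invariant feature extraction and subspace learning into one unified network.
- Two orthogonal subspaces embedded in the network to model identity and spectrum information. This yields a compact representation and reduces overfitting on small datasets.
- State-of-the-art results on the CASIA NIR-VIS 2.0 face database with a 64-dimensional representation.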

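2. Related Work

Many methods have been proposed to reduce the appearance gap between heterogeneous images. Most fall into three categories: image synthesis, subspace learning, and invariant feature extraction.

1) Image synthesis

Synthesises face images from one modality into the other, so that heterogeneous images can be compared in the same distance space.

2) Subspace learning

Learns mappings that project heterogeneous data into a common space. The current SOTA methods do this by removing some principal subspace components.

3) Invariant feature extraction

Looks for modality-invariant features that are robust to illumination. Traditional methods dominate here.

Despite the many methods, NIR-VIS recognition performance is still poor, far behind results on VIS data. Few deep learning methods address NIR-VIS, so the paper tackles the problem with deep learning.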

3. Invariant Deep Representation

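This section describes the subspace decomposition and invariant feature extraction used to learn a modality-invariant deep representation.

[Figure: network with mapping matrices W and P]

Note that removing spectrum information helps NIR-VIS recognition. We go further and use three mapping matrices (W and P, see the figure above) to model the modality-invariant identity information and the modality-variant spectrum information. The feature representation can then be written as:

[Equation image]

WX and PX denote the shared features and the modality-specific features respectively. Given the subspace-decomposition role of W and P, we further impose an orthogonality constraint so that the two are uncorrelated.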
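The decomposition into shared features WX and modality-specific features PX, together with an orthogonality requirement between W and P, can be illustrated with a small sketch. The exact form of the constraint is not given in these notes, so the penalty below (the squared Frobenius norm of W Pᵀ, which is zero exactly when every row of W is orthogonal to every row of P) is an assumption, as are the matrix shapes and all numbers.

```python
def matvec(M, x):
    """Matrix-vector product, with M stored as a list of rows."""
    return [sum(m * xj for m, xj in zip(row, x)) for row in M]

def split_features(W, P, x):
    """Return (shared, modality_specific) = (Wx, Px)."""
    return matvec(W, x), matvec(P, x)

def orthogonality_penalty(W, P):
    """Sum of squared dot products between rows of W and rows of P.

    Equals the squared Frobenius norm of W P^T (assumed form of the constraint).
    """
    return sum(
        sum(w * p for w, p in zip(w_row, p_row)) ** 2
        for w_row in W
        for p_row in P
    )

W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]           # toy shared projection
P_orthogonal = [[0.0, 0.0, 1.0]]                 # uses a direction W ignores
P_overlapping = [[1.0, 0.0, 1.0]]                # shares a direction with W
x = [0.3, -0.2, 0.7]                             # toy low-level feature vector

shared, specific = split_features(W, P_orthogonal, x)
penalty_ok = orthogonality_penalty(W, P_orthogonal)
penalty_bad = orthogonality_penalty(W, P_overlapping)
```

A penalty of this kind can be added to the training loss so that the optimiser is pushed towards projections that do not share directions.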

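The whole network is trained with the softmax function:

[Equation image]

Optimisation:

The objective above contains several non-convex variables, so we minimise it with an alternating optimisation method. First, the objective is rewritten using Lagrange multipliers:

[Equation image]

The parameters to optimise are the network parameters, W and P. They are updated alternately. The network parameters are initialised with Xavier initialisation, and W and P are initialised as:

[Equation image]

Network structure: the lightened CNN B network (from another work by the same authors: A Light CNN for Deep Face Representation with Noisy Labels). The network has 9 convolutional layers, 4 max-pooling layers and a fully connected layer. Dropout is set to 0.7. The initial learning rate is 0.001 and is decreased to 0.00001. The paper's method is built on this network; the feature layer maps the low-level features into the two orthogonal subspaces.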
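The notes mention a simplified maxout operator without spelling out its form. One simplified form, assumed here for illustration, splits a layer's outputs into two halves and keeps the element-wise maximum, which halves the number of features passed on. The function name and the numbers are hypothetical.

```python
def maxout_pairs(features):
    """Split the features into two halves and keep the element-wise maximum.

    Assumed simplified form of maxout; requires an even number of features.
    """
    if len(features) % 2 != 0:
        raise ValueError("expected an even number of features")
    half = len(features) // 2
    return [max(a, b) for a, b in zip(features[:half], features[half:])]

layer_output = [0.2, -1.0, 3.0, 0.5, 0.4, -2.0]  # made-up activations
reduced = maxout_pairs(layer_output)             # half as many features
```

Because the output has half as many features as the input, the layers that follow need fewer parameters, which is consistent with the notes' remark that maxout helps shrink the parameter space.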

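4. Other Points

Algorithm analysis: an analysis of the proposed invariant deep representation (IDR).

Two variants of IDR are implemented. DR denotes IDR without the NIR and VIS features, i.e. only the convolutional network is trained, with no subspace decomposition. This leaves a large number of parameters in the fully connected and feature layers, which leads to overfitting on the small NIR-VIS data. The maxout operator in the feature layer also helps reduce overfitting, so IDRm denotes IDR without the maxout operator in the feature layer.

[Figure: results for IDR and its variants]

The figure above shows that IDR gives the best results. Comparing IDR with IDRm, the maxout operator in the last convolutional layer further lowers the equal error rate and improves performance.

Finally, two figures showing how far ahead the method is:

[Figure: performance comparison]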
