SMART: Robust and Efficient Fine-Tuning for Pre-trained Natural Language Models

Posted 2022-12-09 Facico

SMART: Robust and Efficient Fine-Tuning for Pre-trained Natural Language Models through Principled Regularized Optimization

Smoothness-inducing Adversarial Regularization

The fine-tuning optimization problem is:
$\min_\theta F(\theta)=\mathcal{L}(\theta)+\lambda_S R_S(\theta), \quad \text{where} \quad \mathcal{L}(\theta)=\frac{1}{n}\sum_{i=1}^n \ell(f(x_i;\theta),y_i)$ (the loss function)

$\lambda_S$ is a tuning parameter (the weight of the regularizer).
$R_S(\theta)$ is the smoothness-inducing adversarial regularizer:
- $R_S(\theta)=\frac{1}{n}\sum_{i=1}^n \max_{\|\tilde{x}_i-x_i\|_p\leq\epsilon} \ell_S(f(\tilde{x}_i;\theta),f(x_i;\theta))$
- $\ell_S(A,B)$ measures how much two output distributions $A$ and $B$ differ.
- For regression models, where $f(x_i;\theta)$ is a scalar rather than a distribution, $\ell_S$ is the squared loss $\ell_S(p,q)=(p-q)^2$.
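This largely follows VAT (Virtual Adversarial Training), which turns the adversarial term into a regularizer that smooths the model around each data point; see VAT for details.
- As a result, the model outputs nearly the same distribution under any perturbation within the $\epsilon$-ball, which improves robustness.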
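
To make the regularizer concrete, here is a minimal sketch in plain Python. It assumes a toy softmax-linear model for $f(x;\theta)$ and takes $\ell_S$ to be the symmetrized KL divergence, one common way to compare two distributions. The inner maximization over the $\ell_\infty$ ball is approximated by brute force over the corners $x_i \pm \epsilon$, which is only practical in very low dimensions; real implementations typically run a few steps of projected gradient ascent instead. The names `predict`, `sym_kl`, and `smoothness_regularizer` are illustrative and not taken from the paper.

```python
import itertools
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(theta, x):
    """Toy model f(x; theta): softmax of a linear map (theta is a list of weight rows)."""
    return softmax([sum(w * xi for w, xi in zip(row, x)) for row in theta])


def sym_kl(p, q, eps=1e-12):
    """Symmetrized KL divergence KL(p||q) + KL(q||p) = sum((p - q) * log(p / q))."""
    return sum((pi - qi) * (math.log(pi + eps) - math.log(qi + eps))
               for pi, qi in zip(p, q))


def smoothness_regularizer(theta, xs, epsilon):
    """Approximate R_S: mean over inputs of the worst-case sym_kl in an l_inf ball.

    Assumption: the inner max is approximated by checking only the corners
    x + epsilon * s with s in {-1, +1}^d, not by gradient ascent.
    """
    total = 0.0
    for x in xs:
        clean = predict(theta, x)
        worst = 0.0
        for signs in itertools.product((-1.0, 1.0), repeat=len(x)):
            x_adv = [xi + epsilon * s for xi, s in zip(x, signs)]
            worst = max(worst, sym_kl(predict(theta, x_adv), clean))
        total += worst
    return total / len(xs)
```

With `epsilon = 0` the perturbed input equals the clean input, so the regularizer is exactly zero; for `epsilon > 0` it is positive whenever the model's output changes inside the ball.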

Bregman Proximal Point Optimization

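We solve the fine-tuning problem above with a Bregman proximal point style method. Each iteration adds a strong penalty term that keeps the model from being updated too aggressively. This makes the learned manifold smoother, makes the loss vary linearly, strengthens resistance to perturbations, and avoids catastrophic forgetting.
$\theta_{t+1}=\arg\min_\theta F(\theta)+\mu D_{\mathrm{Breg}}(\theta,\theta_t), \quad \text{where} \quad D_{\mathrm{Breg}}(\theta,\theta_t)=\frac{1}{n}\sum_{i=1}^n \ell_S(f(x_i;\theta),f(x_i;\theta_t))$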
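With momentum acceleration this becomes
$\theta_{t+1}=\arg\min_\theta F(\theta)+\mu D_{\mathrm{Breg}}(\theta,\tilde{\theta}_t), \quad \text{where} \quad \tilde{\theta}_t=(1-\beta)\theta_t+\beta\tilde{\theta}_{t-1}$
This is simply an exponential moving average of the parameters; $\beta$ is the momentum parameter.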
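
The moving-average update can be sketched in a few lines of Python. Parameters are represented as a flat list of floats, and `update_anchor` is a hypothetical helper name used only for illustration.

```python
def update_anchor(theta, anchor_prev, beta):
    """tilde_theta_t = (1 - beta) * theta_t + beta * tilde_theta_{t-1}, elementwise."""
    return [(1.0 - beta) * t + beta * a for t, a in zip(theta, anchor_prev)]


# Hypothetical trajectory: one parameter that stays at 1.0 for three iterations,
# with the averaged copy starting at 0.0.
anchor = [0.0]
history = []
for theta in ([1.0], [1.0], [1.0]):
    anchor = update_anchor(theta, anchor, beta=0.5)
    history.append(anchor[0])
# history is [0.5, 0.75, 0.875]: the averaged copy trails the current parameters.
```

A larger $\beta$ makes the averaged copy move more slowly, so the proximal penalty pulls the model toward an older, smoother version of itself.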

Final objective

The final loss function is
$\mathcal{F}(\theta)=\mathcal{L}(\theta)+\lambda_S R_S(\theta)+\mu D_{\mathrm{Breg}}(\theta,\theta_t)$
[Figure: pseudocode of the algorithm]
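
As a rough illustration of how the three terms fit together, the sketch below trains a toy 1-D logistic classifier in plain Python. Everything specific here is an assumption made for the example: the model, the symmetrized KL choice for $\ell_S$, checking only the two interval endpoints for the adversarial term, finite-difference gradients, and all hyperparameter values. It is not the paper's algorithm, only a small instance of the same objective.

```python
import math


def prob(theta, x):
    """Toy binary classifier f(x; theta) = sigmoid(w * x + b), returned as P(y = 1)."""
    w, b = theta
    return 1.0 / (1.0 + math.exp(-(w * x + b)))


def sym_kl(p, q, eps=1e-12):
    """Symmetrized KL divergence between Bernoulli(p) and Bernoulli(q)."""
    return sum((a - c) * (math.log(a + eps) - math.log(c + eps))
               for a, c in ((p, q), (1.0 - p, 1.0 - q)))


def objective(theta, anchor, data, lam, mu, epsilon):
    """F(theta) = L + lam * R_S + mu * D_Breg, averaged over data = [(x, y), ...]."""
    loss = reg = breg = 0.0
    for x, y in data:
        p = prob(theta, x)
        loss += -(y * math.log(p + 1e-12) + (1 - y) * math.log(1.0 - p + 1e-12))
        # In 1-D the l_inf ball is an interval; only its two endpoints are checked.
        reg += max(sym_kl(prob(theta, x + d), p) for d in (-epsilon, epsilon))
        breg += sym_kl(p, prob(anchor, x))
    return (loss + lam * reg + mu * breg) / len(data)


def numeric_grad(fn, theta, h=1e-5):
    """Central finite-difference gradient (stands in for backpropagation)."""
    grad = []
    for i in range(len(theta)):
        up, down = list(theta), list(theta)
        up[i] += h
        down[i] -= h
        grad.append((fn(up) - fn(down)) / (2.0 * h))
    return grad


def smart_like_fit(data, outer_steps=5, inner_steps=20, lr=0.5,
                   lam=1.0, mu=1.0, epsilon=0.1, beta=0.9):
    theta = [0.0, 0.0]
    anchor = list(theta)  # moving-average copy of the parameters
    for _ in range(outer_steps):
        frozen = list(anchor)  # the proximal anchor is fixed during the inner loop
        for _ in range(inner_steps):
            g = numeric_grad(
                lambda th: objective(th, frozen, data, lam, mu, epsilon), theta)
            theta = [t - lr * gi for t, gi in zip(theta, g)]
        anchor = [(1.0 - beta) * t + beta * a for t, a in zip(theta, anchor)]
    return theta


toy_data = [(-1.0, 0), (-0.5, 0), (0.5, 1), (1.0, 1)]
fitted = smart_like_fit(toy_data)
print(fitted)
```

The proximal term `mu * breg` is what keeps each outer step close to the averaged parameters; setting `mu = 0` reduces the loop to ordinary training with the adversarial regularizer alone.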

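Experiments

For ensemble models, fine-tuning with these techniques combined with MT-DNN reached the state of the art (SOTA) at the time.
For single models, combining them with RoBERTa reached SOTA.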

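Summary

The paper is very short, yet the results are strong.
It offers a new angle on adversarial training in NLP, especially for fine-tuning. The idea of adding an adversarial regularizer should be a useful starting point for follow-up work.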
