BERT Model Compression Methods

Posted shona




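Model compression reduces redundancy in a trained neural network. Because BERT and BERT-Large are hard to run as-is on a single GPU, let alone on a smartphone, compression methods matter a great deal for BERT's future applications.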




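1. Pruning: this includes weight magnitude pruning, attention head pruning, and pruning of whole layers or other parts of the network. Some methods also apply regularization during training to make the model easier to prune (layer dropout).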
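Weight magnitude pruning, the first technique named above, can be sketched in a few lines. This is a minimal illustration in plain Python, not the procedure of any specific paper listed here; the function name `magnitude_prune`, the toy matrix, and the 50% sparsity level are assumptions made for the example.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude entries of a weight matrix.

    weights:  list of rows, each a list of floats
    sparsity: fraction of entries to remove, between 0 and 1
    """
    magnitudes = sorted(abs(w) for row in weights for w in row)
    k = int(len(magnitudes) * sparsity)  # number of weights to remove
    if k == 0:
        return [list(row) for row in weights]
    # Largest magnitude that still gets pruned. Ties at the threshold
    # are pruned as well, so slightly more than k entries may be zeroed.
    threshold = magnitudes[k - 1]
    return [[0.0 if abs(w) <= threshold else w for w in row]
            for row in weights]


# Hypothetical 2x3 weight matrix, pruned to 50% sparsity.
W = [[0.9, -0.05, 0.3],
     [-0.01, 0.7, -0.2]]
pruned = magnitude_prune(W, sparsity=0.5)
print(pruned)
```

The pruned matrix keeps its original shape; only the values change, which is why this kind of pruning shrinks the stored model but does not automatically make dense matrix multiplication faster.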



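2. Weight factorization: splitting a weight matrix into two smaller matrices reduces the number of parameters, e.g. 5*5 ==> 3*3 and 3*3 (25 parameters versus 18).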
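The parameter saving from factorization is simple arithmetic. The sketch below assumes a low-rank factorization, where an m x n matrix is replaced by an m x r matrix times an r x n matrix. The sizes 768 and 3072 are the hidden and feed-forward widths of BERT-Base; the rank r = 128 is an arbitrary value chosen for illustration.

```python
def dense_params(m, n):
    """Parameters in a full m x n weight matrix."""
    return m * n


def factored_params(m, n, r):
    """Parameters in an (m x r) @ (r x n) low-rank replacement."""
    return m * r + r * n


m, n, r = 768, 3072, 128  # r is a hypothetical rank
full = dense_params(m, n)
low_rank = factored_params(m, n, r)
saving = 1 - low_rank / full

# Factorization only helps while r stays below this break-even rank.
break_even_rank = (m * n) / (m + n)

print(full, low_rank, round(saving, 3), break_even_rank)
```

With these numbers the factored layer stores roughly a fifth of the original parameters. Whether accuracy survives at that rank is an empirical question that this arithmetic does not answer.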

3. Knowledge distillation, also known as "Student Teacher".

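Train a much smaller Transformer from scratch on the pre-training or downstream data. Normally this would fail, but for reasons that are not well understood, using soft labels from the full-size model improves optimization. Some methods also distill BERT into other architectures with faster inference, such as LSTMs. Others dig deeper into the teacher's knowledge, looking not only at its outputs but also at its weight matrices and hidden activations.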
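The soft-label idea can be made concrete with a small loss function. This is a sketch of a common formulation (cross-entropy between temperature-softened teacher and student distributions), not the exact loss of any paper in this post; the temperature value and the example logits are assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy of the student against the teacher's soft labels."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(p_teacher, p_student))


# Hypothetical logits for a 3-class task.
teacher = [4.0, 1.0, 0.5]
good_student = [3.5, 1.2, 0.4]  # roughly agrees with the teacher
bad_student = [0.5, 1.0, 4.0]   # prefers the wrong class

loss_good = distillation_loss(good_student, teacher)
loss_bad = distillation_loss(bad_student, teacher)
print(loss_good, loss_bad)
```

The loss is lowest when the student reproduces the teacher's whole distribution, so the student also learns how the teacher ranks the wrong classes, which a one-hot label cannot express.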


4. Weight sharing: for example, ALBERT uses the same weight matrices for every self-attention layer in BERT.
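A rough parameter count shows why sharing self-attention weights across layers saves so much. The sketch assumes BERT-Base dimensions (hidden size 768, 12 layers) and counts only the four attention projection matrices per layer (query, key, value, output), ignoring biases; it is back-of-the-envelope arithmetic, not ALBERT's actual parameter accounting.

```python
hidden_size = 768  # BERT-Base hidden width
num_layers = 12    # BERT-Base depth

# Query, key, value and output projections, biases ignored.
attn_params_per_layer = 4 * hidden_size * hidden_size

unshared = num_layers * attn_params_per_layer  # one set per layer
shared = attn_params_per_layer                 # one set reused by all layers
reduction = 1 - shared / unshared

print(unshared, shared, round(reduction, 4))
```

Sharing cuts the stored parameters but not the computation: every layer still performs its own forward pass with the shared matrices.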



6. Pre-training vs. downstream: some methods compress BERT only with respect to a specific downstream task, while others compress it in a task-agnostic way.




| Paper | Prune | Factor | Distill | W. Sharing | Quant. | Pre-train | Downstream |
|---|---|---|---|---|---|---|---|
| Compressing BERT: Studying the Effects of Weight Pruning on Transfer Learning | ✓ | | | | | ✓ | ✓ |
| Are Sixteen Heads Really Better than One? | ✓ | | | | | | ✓ |
| Pruning a BERT-based Question Answering Model | ✓ | | | | | | ✓ |
| Reducing Transformer Depth on Demand with Structured Dropout | ✓ | | | | | ✓ | |
| Reweighted Proximal Pruning for Large-Scale Language Representation | ✓ | | | | | ✓ | |
| Structured Pruning of Large Language Models | | ✓ | | | | | ✓ |
| ALBERT: A Lite BERT for Self-supervised Learning of Language Representations | | ✓ | | ✓ | | ✓ | |
| Extreme Language Model Compression with Optimal Subwords and Shared Projections | | | ✓ | | | ✓ | |
| DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter | | | ✓ | | | ✓ | |
| Distilling Task-Specific Knowledge from BERT into Simple Neural Networks | | | ✓ | | | | ✓ |
| Distilling Transformers into Simple Neural Networks with Unlabeled Transfer Data | | | ✓ | | | | ✓ |
| Attentive Student Meets Multi-Task Teacher: Improved Knowledge Distillation for Pretrained Models | | | ✓ | | | Multi-task | ✓ |
| Patient Knowledge Distillation for BERT Model Compression | | | ✓ | | | | ✓ |
| TinyBERT: Distilling BERT for Natural Language Understanding | | | ✓ | | | ✓ | ✓ |
| MobileBERT: Task-Agnostic Compression of BERT by Progressive Knowledge Transfer | | | ✓ | | | ✓ | |
| Q8BERT: Quantized 8Bit BERT | | | | | ✓ | | ✓ |
| Q-BERT: Hessian Based Ultra Low Precision Quantization of BERT | | | | | ✓ | | ✓ |



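If I had to pick winners, I would say ALBERT, DistilBERT, MobileBERT, Q-BERT, LayerDrop, and RPP. Some of these methods can also be stacked on top of each other, but some of the pruning papers are more scientific than practical, so let's check the numbers as well: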




| Paper | Reduction | Speed-up | Accuracy | Comments |
|---|---|---|---|---|
| Compressing BERT: Studying the Effects of Weight Pruning on Transfer Learning | 30% params | ? | Same | Some interesting ablation experiments and fine-tuning analysis |
| Are Sixteen Heads Really Better than One? | 50-60% attn heads | 1.2x | Same | |
| Pruning a BERT-based Question Answering Model | 50% attn heads + FF | 2x | -1.5 F1 | |
| Reducing Transformer Depth on Demand with Structured Dropout | 50-75% layers | ? | Same | |
| Reweighted Proximal Pruning for Large-Scale Language Representation | 40-80% params | ? | Same | |
| Structured Pruning of Large Language Models | 35% params | ? | Same | |
| ALBERT: A Lite BERT for Self-supervised Learning of Language Representations | 90-95% params | 6-20x | Worse | Allows training larger models (BERT-xxlarge), so effectively 30% param reduction and 1.5x speedup with better acc. |
| Extreme Language Model Compression with Optimal Subwords and Shared Projections | 80-98% params | ? | Worse to much worse | |
| DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter | 40% params | 2.5x | 97% | Huggingface |
| Distilling Task-Specific Knowledge from BERT into Simple Neural Networks | 99% params | 15x | ELMO equiv. | Distills into Bi-LSTMs |
| Distilling Transformers into Simple Neural Networks with Unlabeled Transfer Data | 96% params | ? | ? | Low-resource only |
| Attentive Student Meets Multi-Task Teacher: Improved Knowledge Distillation for Pretrained Models | 90% params | 14x | Better than Tang et al. (above) | Distills into BiLSTMs |
| Patient Knowledge Distillation for BERT Model Compression | 50-75% layers | 2-4x | Worse | But better than vanilla KD |
| TinyBERT: Distilling BERT for Natural Language Understanding | 87% params | 9.4x | 96% | |
| MobileBERT: Task-Agnostic Compression of BERT by Progressive Knowledge Transfer | 77% params | 4x | Competitive | |
| Q8BERT: Quantized 8Bit BERT | 75% bits | ? | Negligible | "Need hardware to show speed-ups" but I don't think anyone has it |
| Q-BERT: Hessian Based Ultra Low Precision Quantization of BERT | 93% bits | ? | "at most 2.3% worse" | Probably same as above |
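The "75% bits" figure for 8-bit quantization follows from storing each weight in 8 bits instead of 32, assuming a 32-bit floating-point baseline. The sketch below shows a simple symmetric linear quantizer in plain Python to make that concrete. It is a generic illustration, not the quantization-aware training scheme of Q8BERT or the Hessian-based method of Q-BERT, and the function names and sample weights are hypothetical.

```python
def quantize_int8(values):
    """Map floats to integers in [-127, 127] with one shared scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the integer codes."""
    return [q * scale for q in quantized]


weights = [0.42, -1.27, 0.003, 0.9]  # hypothetical weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Storage saving when going from 32-bit floats to 8-bit integers.
bit_reduction = 1 - 8 / 32

print(q, restored, bit_reduction)
```

Each restored weight differs from the original by at most half a quantization step, which is one intuition for why the accuracy loss from 8-bit quantization can be small.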



  • Sparse Transformer: Concentrated Attention Through Explicit Selection (paper)
  • Lightweight and Efficient Neural Natural Language Processing with Quaternion Networks (paper)
  • Adaptively Sparse Transformers (paper)
  • Compressing BERT for Faster Prediction (blog post)

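1. Note that not all compression methods make models faster. Unstructured pruning is notoriously difficult to speed up through GPU parallelism. One of the papers argues that in Transformers, computation time is dominated by the softmax computation rather than by matrix multiplication.

2. It would be nice to have a better evaluation metric for model compression, something like F1.

3. Some of these percentages are measured against BERT-Large rather than BERT-Base, so treat them as rough reference points only.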
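The remark in note 1 about unstructured pruning can be illustrated with a toy comparison. In this sketch, unstructured pruning zeroes individual weights but leaves the matrix shape unchanged, while structured pruning removes whole rows so the matrix actually shrinks. The function names, the threshold, and the row-selection rule (largest L1 norm) are assumptions chosen for illustration.

```python
def unstructured_prune(matrix, threshold):
    """Zero individual small weights; the shape stays the same."""
    return [[0.0 if abs(w) < threshold else w for w in row]
            for row in matrix]


def structured_prune_rows(matrix, keep):
    """Keep only the `keep` rows with the largest L1 norm."""
    ranked = sorted(range(len(matrix)),
                    key=lambda i: sum(abs(w) for w in matrix[i]),
                    reverse=True)
    kept = sorted(ranked[:keep])  # preserve the original row order
    return [list(matrix[i]) for i in kept]


W = [[0.9, 0.01, -0.8],
     [0.02, 0.03, -0.01],
     [-0.7, 0.6, 0.02],
     [0.01, -0.02, 0.04]]

sparse = unstructured_prune(W, threshold=0.1)
small = structured_prune_rows(W, keep=2)

print(len(sparse), "rows after unstructured pruning")
print(len(small), "rows after structured pruning")
```

A dense matrix multiply over `sparse` does the same amount of work as before, whereas `small` is a genuinely smaller dense matrix, which is the kind of reduction standard GPU kernels can exploit.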






