PaddleSlim Model Quantization: A Source Code Walkthrough
Posted by 沉迷单车的追风少年
Preface: There is plenty of Chinese-language material on PaddleSlim, and its usage tutorials are excellent, but source-level walkthroughs and analysis are scarce. In this post we'll read the source code together to understand how Paddle's static post-training (offline) quantization works.
Principle Overview
The official manual explains this part really well and is worth reading: 量化 — PaddleSlim 文档 (the quantization chapter of the PaddleSlim documentation).
Static offline quantization is mainly a wrapper around Paddle's own interfaces. Basic usage looks like this:
# import path depends on the Paddle version; in recent releases it is
# from paddle.static.quantization import PostTrainingQuantization
ptq = PostTrainingQuantization(
    executor=exe,
    sample_generator=sample_generator,
    model_dir=model_dir,
    model_filename=model_filename,
    params_filename=params_filename,
    batch_size=batch_size,
    batch_nums=batch_nums,
    algo=algo,
    quantizable_op_type=quantizable_op_type)
ptq.quantize()
ptq.save_quantized_model(save_model_path)
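The executor and sample generator passed in above are assumed to already exist. A minimal sketch of how they might be set up (calibration_images is an illustrative placeholder, not a name from the source):

import paddle

paddle.enable_static()
exe = paddle.static.Executor(paddle.CPUPlace())   # or CUDAPlace(0) on GPU

def sample_generator():
    # yields one preprocessed calibration sample per call,
    # matching the model's input layout
    for img in calibration_images:
        yield [img]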
Supported Quantization Types
Paddle uses the optimize_model flag to control whether operator-fusion optimization is applied, but fusion is only supported on CPU, and only the conv2d/depthwise_conv2d + bn fusion. Compared with OpenPPL, which implements a great many operator fusions, Paddle looks rather bare-bones here...
Don't be misled by the name of the is_full_quantize parameter: it is not truly full quantization. Paddle's "partial quantization" supports only six op types: "conv2d", "depthwise_conv2d", "conv2d_transpose", "mul", "matmul", and "matmul_v2". The rationale is that these are the main compute-intensive operators, so quantizing them gives the largest payoff.
Full quantization also supports only a limited set of op types. Besides the six just listed, it includes the following (a sketch of how the list is chosen appears after the dictionary):
SUPPORT_ACT_QUANTIZATION_OP_DICT = {
"mul": [["X", "Y"], ["Out"]],
"matmul": [["X", "Y"], ["Out"]],
"matmul_v2": [["X", "Y"], ["Out"]],
"pool2d": [["X"], ["Out"]],
"elementwise_add": [["X", "Y"], ["Out"]],
"concat": [["X"], ["Out"]],
"softmax": [["X"], ["Out"]],
"argmax": [["X"], ["Out"]],
"transpose": [["X"], ["Out"]],
"equal": [["X", "Y"], ["Out"]],
"gather": [["X"], ["Out"]],
"greater_equal": [["X", "Y"], ["Out"]],
"greater_than": [["X", "Y"], ["Out"]],
"less_equal": [["X", "Y"], ["Out"]],
"less_than": [["X", "Y"], ["Out"]],
"mean": [["X"], ["Out"]],
"not_equal": [["X", "Y"], ["Out"]],
"reshape": [["X"], ["Out"]],
"reshape2": [["X"], ["Out"]],
"transpose2": [["X"], ["Out"]],
"nearest_interp": [["X"], ["Out"]],
"trilinear_interp": [["X"], ["Out"]],
"slice": [["Input"], ["Out"]],
"squeeze": [["X"], ["Out"]],
"elementwise_sub": [["X", "Y"], ["Out"]],
"relu": [["X"], ["Out"]],
"relu6": [["X"], ["Out"]],
"leaky_relu": [["X"], ["Out"]],
"prelu": [["X", "Alpha"], ["Out"]],
"tanh": [["X"], ["Out"]],
"swish": [["X"], ["Out"]],
"dropout": [["X"], ["Out"]],
"batch_norm": [["X"], ["Y"]],
"layer_norm": [["X"], ["Y"]],
"sigmoid": [["X"], ["Out"]],
"elementwise_mul": [["X", "Y"], ["Out"]],
"elementwise_pow": [["X", "Y"], ["Out"]],
"hard_swish": [["X"], ["Out"]],
"hard_sigmoid": [["X"], ["Out"]],
"gru": [["Input", "Weight"], ["Hidden"]],
"lstm": [["Input", "Weight"], ["Hidden"]],
"pad2d": [["X"], ["Out"]],
"pad3d": [["X"], ["Out"]],
"flatten": [["X"], ["Out"]],
"flatten2": [["X"], ["Out"]],
"unsqueeze2": [["X"], ["Out"]],
"flatten_contiguous_range": [["X"], ["Out"]],
"split": [["X"], ["Out"]],
"squeeze2": [["X"], ["Out"]],
"nearest_interp_v2": [["X"], ["Out"]],
"bilinear_interp": [["X"], ["Out"]],
"bilinear_interp_v2": [["X"], ["Out"]],
"fill_constant_batch_size_like": [["Input"], ["Out"]],
"arg_max": [["X"], ["Out"]],
"abs": [["X"], ["Out"]],
"assign": [["X"], ["Out"]],
"cast": [["X"], ["Out"]],
"clip": [["X"], ["Out"]],
"box_coder": [["PriorBox"], ["OutputBox"]],
"crop": [["X"], ["Out"]],
"cumsum": [["X"], ["Out"]],
"expand_v2": [["X"], ["Out"]],
"fill_any_like": [["X"], ["Out"]],
"fill_constant": [[], ["Out"]],
"gelu": [["X"], ["Out"]],
"instance_norm": [["X"], ["Y"]],
"lookup_table": [["W", "Ids"], ["Out"]],
"lookup_table_v2": [["W", "Ids"], ["Out"]],
"norm": [["X"], ["Norm"]],
"p_norm": [["X"], ["Out"]],
"pow": [["X"], ["Out"]],
"reduce_mean": [["X"], ["Out"]],
"stack": [["X"], ["Y"]],
"top_k_v2": [["X"], ["Out", "Indices"]],
"logical_and": [["X", "Y"], ["Out"]],
"logical_not": [["X"], ["Out"]],
"meshgrid": [["X"], ["Out"]],
"roi_align": [["X", "ROIs"], ["Out"]],
"strided_slice": [["Input"], ["Out"]],
"where": [["Condition", "X", "Y"], ["Out"]],
"grid_sampler": [["X", "Grid"], ["Output"]],
"tile": [["X"], ["Out"]],
"group_norm": [["X"], ["Y", "Mean", "Variance"]],
"reduce_sum": [["X"], ["Out"]],
"square": [["X"], ["Out"]],
"softplus": [["X"], ["Out"]],
"shuffle_channel": [["X"], ["Out"]],
"reduce_max": [["X"], ["Out"]],
"scale": [["X"], ["Out"]],
Supported Backends
The supported backends are:
support_deploy_backend = [None, "tensorrt", "mkldnn", "arm"]
The corresponding quantizer classes are BaseQuantizer, TensorRTQuantizer, MKLDNNQuantizer, and ARMCPUQuantizer. Compared with OpenPPL the set of supported backends is quite small; the differences in optimization strategy between backends are a topic I'll set aside for a future post.
The rest of this post uses BaseQuantizer as the running example.
Quantization Procedure
The heart of the process is the iterative search for the clipping boundaries. Since each sampling pass can follow a different strategy, let's first look at abs_max() as an example.
Preparation stage
if self._algo in ["KL", "hist"]:
    batch_id = 0
    with tqdm(
        total=self._batch_nums,
        bar_format='Preparation stage, Run batch:|{bar}| {n_fmt}/{total_fmt}',
        ncols=80,
    ) as t:
        for data in self._data_loader():
            self._executor.run(
                program=self._program,
                feed=data,
                fetch_list=self._fetch_list,
                return_numpy=False,
                scope=self._scope,
            )
            self._collect_activation_abs_min_max()
            batch_id += 1
            t.update()
            if self._batch_nums and batch_id >= self._batch_nums:
                break
    self._init_sampling_act_histogram()
Note that for KL and hist, the abs_min and abs_max of all activations must be computed first, via the _collect_activation_abs_min_max() method; the ranges gathered there are what _init_sampling_act_histogram() uses to set up the histograms.
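Conceptually, once an activation's range is known, a fixed-range histogram can be allocated and then filled batch by batch. A toy NumPy sketch (the bin count and the activation_batches name are assumptions, not PaddleSlim's exact code):

import numpy as np

bins = 2048                                   # assumed bin count
abs_max = 3.7                                 # range collected in the loop above
edges = np.linspace(0.0, abs_max, bins + 1)
hist = np.zeros(bins, dtype=np.int64)

for act in activation_batches:                # one entry per calibration batch
    hist += np.histogram(np.abs(act), bins=edges)[0]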
Sampling stage
First, load the weight values:
var_tensor = utils.load_variable_data(self._scope, var_name)
Then the boundary maximum is found in one of two ways, abs_max or channel_wise_abs_max:
if self._weight_quantize_type == "abs_max":
    abs_max_value = float(np.max(np.abs(var_tensor)))
elif self._weight_quantize_type == "channel_wise_abs_max":
    abs_max_value = []
    if (
        self._weight_op_pairs[var_name]
        in utils._channelwise_quant_axis1_ops
    ):
        for i in range(var_tensor.shape[1]):
            abs_max_value.append(
                float(np.max(np.abs(var_tensor[:, i])))
            )
    else:
        for i in range(var_tensor.shape[0]):
            abs_max_value.append(
                float(np.max(np.abs(var_tensor[i])))
            )
self._quantized_threshold[var_name] = abs_max_value
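As a quick sanity check, here is a toy NumPy demo of the two schemes on a fake conv weight (my own illustration, not PaddleSlim code):

import numpy as np

# fake conv weight with layout [out_channels, in_channels, kh, kw]
w = np.random.randn(8, 3, 3, 3).astype(np.float32)

per_tensor_max = float(np.max(np.abs(w)))                       # one threshold
per_channel_max = [float(np.max(np.abs(w[i]))) for i in range(w.shape[0])]

# with symmetric int8, scale = threshold / 127
print(per_tensor_max / 127.0)
print([v / 127.0 for v in per_channel_max])

The activation side is then sampled batch by batch: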
batch_id = 0
with tqdm(
    total=self._batch_nums,
    bar_format='Sampling stage, Run batch:|{bar}| {n_fmt}/{total_fmt}',
    ncols=80,
) as t:
    for data in self._data_loader():
        self._executor.run(
            program=self._program,
            feed=data,
            fetch_list=self._fetch_list,
            return_numpy=False,
            scope=self._scope,
        )
        self._sampling()
        batch_id += 1
        t.update()
        if self._batch_nums and batch_id >= self._batch_nums:
            break
Saving the Scales
The end product of quantization is a scale for every node. Earlier the scale was stashed inside each tensor_name, so at this point it just needs to be split back out:
real_tensor_name, opera, scalar = tensor_name.split('#')
Here max_scale has to be updated dynamically as entries are processed; it is used later when rescaling according to the opera token.
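Purely for illustration (the example name below is made up; the real encoding is produced by the upstream passes), the '#'-separated format implied by this split, and the running-max update, look like:

# hypothetical name: <real tensor name>#<operation>#<scalar index>
tensor_name = "conv2d_1.tmp_0#concat#0"
real_tensor_name, opera, scalar = tensor_name.split('#')

max_scale = {}                               # real_tensor_name -> largest scale
scale = 0.0123                               # scale recovered for this entry
max_scale[real_tensor_name] = max(max_scale.get(real_tensor_name, 0.0), scale)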
Inserting Quant/Dequant Nodes
In the figure from the original post (not reproduced here), D denotes the dequantize op and Q the quantize op. We need to insert these quantize and dequantize nodes into the computation graph:
# use QuantizationTransformPass to insert fake_quant/fake_dequantize ops
if not self._onnx_format:
    transform_pass = QuantizationTransformPass(
        scope=self._scope,
        place=self._place,
        weight_bits=self._weight_bits,
        activation_bits=self._activation_bits,
        activation_quantize_type=self._activation_quantize_type,
        weight_quantize_type=self._weight_quantize_type,
        quantizable_op_type=self.quant_config.weight_quant_operation_types,
    )
else:
    transform_pass = QuantizationTransformPassV2(
        scope=self._scope,
        place=self._place,
        weight_bits=self._weight_bits,
        activation_bits=self._activation_bits,
        activation_quantize_type=self._activation_quantize_type,
        weight_quantize_type=self._weight_quantize_type,
        quantizable_op_type=self.quant_config.weight_quant_operation_types,
    )
for sub_graph in graph.all_sub_graphs():
    # fake_quant/fake_dequantize ops must be inserted into a test graph,
    # so set each sub-graph's _for_test flag to True
    sub_graph._for_test = True
    transform_pass.apply(sub_graph)

# use AddQuantDequantPass to insert fake_quant_dequant ops
if not self._onnx_format:
    add_quant_dequant_pass = AddQuantDequantPass(
        scope=self._scope,
        place=self._place,
        quantizable_op_type=self.quant_config.activation_quant_operation_types,
    )
else:
    add_quant_dequant_pass = AddQuantDequantPassV2(
        scope=self._scope,
        place=self._place,
        quantizable_op_type=self.quant_config.activation_quant_operation_types,
    )
for sub_graph in graph.all_sub_graphs():
    sub_graph._for_test = True
    add_quant_dequant_pass.apply(sub_graph)
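The graph the passes rewrite is an IrGraph built from the program. A sketch of the surrounding setup (import paths follow paddle.fluid and may differ across Paddle versions):

from paddle.fluid import core
from paddle.fluid.framework import IrGraph

# wrap the program desc in an IrGraph so the passes can rewrite it
graph = IrGraph(core.Graph(self._program.desc), for_test=True)
# ... apply the two passes as shown above ...
self._program = graph.to_program()           # convert back to a Program

Note the division of labor: QuantizationTransformPass handles the weight-quantized op types, while AddQuantDequantPass inserts fake quant-dequant ops in front of the activation-only op types.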
Walking Through Activation Calibration
Weights are constants and need no calibration (a fixed tensor carries no sampling error), so the only values that have to be calibrated are the activations.
Quantization Formula
Let r be a floating-point value before quantization. The quantized integer q can be written as

q = clip(round(r / s) + z, q_min, q_max)

where round(·) and clip(·) denote the rounding and clipping operations, and q_min and q_max are the minimum and maximum values of the quantized range. s is the quantization step (scale) and z is the offset (zero point): quantization with z = 0 is called symmetric quantization, and quantization with z ≠ 0 asymmetric quantization. Symmetric quantization lets the quantized kernels skip the z-related terms during inference, lowering computational cost; asymmetric quantization determines the minimum and maximum from the actual data distribution, making fuller use of the quantized range and giving higher accuracy.
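A worked example of the formula for symmetric int8 (z = 0), using the abs_max scale from earlier:

import numpy as np

r = np.array([-0.62, 0.0, 0.31, 0.9], dtype=np.float32)
s = float(np.max(np.abs(r))) / 127.0          # abs_max scale, z = 0
q = np.clip(np.round(r / s), -127, 127).astype(np.int8)
r_hat = q.astype(np.float32) * s              # dequantize; r_hat ≈ r
print(q, r_hat)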
Detailed Procedure
- Use histogram statistics to obtain the distribution P_f of the original FP32 data;
- Pick several candidate q_min and q_max values from a given search space and quantize the activations with each pair, obtaining the quantized data Q_q;
- Use histogram statistics to obtain the distribution of each Q_q;
- Compute the divergence between each Q_q's distribution and P_f, and take the q_min and q_max with the lowest divergence to compute the corresponding quantization parameters (a toy sketch of this search follows below). Common metrics for measuring distribution divergence include KL divergence (Kullback-Leibler Divergence), symmetric KL divergence (Symmetric Kullback-Leibler Divergence), and JS divergence (Jensen-Shannon Divergence).
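To make the search concrete, here is a toy NumPy sketch of the last three steps (my own simplification, not PaddleSlim's actual loop; the bin count and candidate grid are arbitrary):

import numpy as np

def kl_divergence(p, q, eps=1e-10):
    # KL(P || Q) between two histograms, normalized inside
    p = p / (p.sum() + eps)
    q = q / (q.sum() + eps)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / (q[mask] + eps))))

acts = np.abs(np.random.randn(100000).astype(np.float32))   # fake activations
bins = 2048
full_range = float(acts.max())
p_f, _ = np.histogram(acts, bins=bins, range=(0.0, full_range))

best_t, best_d = None, float("inf")
for t in np.linspace(0.3, 1.0, 8) * full_range:              # candidate q_max
    s = t / 127.0
    q_acts = np.clip(np.round(acts / s), 0, 127) * s          # quant + dequant
    q_hist, _ = np.histogram(q_acts, bins=bins, range=(0.0, full_range))
    d = kl_divergence(p_f.astype(np.float64), q_hist.astype(np.float64))
    if d < best_d:
        best_t, best_d = t, d
print("chosen clipping threshold:", best_t)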