PaddleSlim Model Quantization: Source Code Walkthrough

Posted by 沉迷单车的追风少年


Preface: There is plenty of Chinese-language material on PaddleSlim, and the how-to tutorials are excellent, but source-code walkthroughs are scarce. This post goes through the source to study the principles and mechanics of Paddle's static post-training quantization.

Contents

Overview of the principle

Supported quantization types

Supported backends

Quantization procedure

Preparation stage

Sampling stage

Saving the scales

Inserting quant/dequant nodes

Activation calibration walkthrough

Quantization formula

Detailed workflow


Overview of the principle

The official manual explains this part very well and is recommended reading: 量化 — PaddleSlim 文档.

Static post-training quantization here is mostly a wrapper around Paddle's API; basic usage looks like this:

ptq = PostTrainingQuantization(
    executor=exe,
    sample_generator=sample_generator,
    model_dir=model_dir,
    model_filename=model_filename,
    params_filename=params_filename,
    batch_size=batch_size,
    batch_nums=batch_nums,
    algo=algo,
    quantizable_op_type=quantizable_op_type)
ptq.quantize()
ptq.save_quantized_model(save_model_path)

Supported quantization types

Paddle controls whether operator-fusion optimization is applied via optimize_model, but fusion is supported only on CPU, and only the conv2d/depthwise_conv2d + bn fusion. Compare this with OpenPPL's operator fusion, which implements a great many fusions:

OpenPPL PPQ quantization (3): loading and preprocessing the quantization compute graph, source analysis — 沉迷单车的追风少年's blog on CSDN

Paddle looks rather bare-bones by comparison…

Don't be misled by the name of the is_full_quantize parameter; it does not mean full quantization. Paddle's "partial quantization" supports only six op types: "conv2d", "depthwise_conv2d", "conv2d_transpose", "mul", "matmul", and "matmul_v2". The rationale: these are the main compute-intensive operators, so quantizing them brings the largest benefit.

Full quantization also supports only a limited set of operator types. Besides the six just listed, it covers the following:

SUPPORT_ACT_QUANTIZATION_OP_DICT = {
    "mul": [["X", "Y"], ["Out"]],
    "matmul": [["X", "Y"], ["Out"]],
    "matmul_v2": [["X", "Y"], ["Out"]],
    "pool2d": [["X"], ["Out"]],
    "elementwise_add": [["X", "Y"], ["Out"]],
    "concat": [["X"], ["Out"]],
    "softmax": [["X"], ["Out"]],
    "argmax": [["X"], ["Out"]],
    "transpose": [["X"], ["Out"]],
    "equal": [["X", "Y"], ["Out"]],
    "gather": [["X"], ["Out"]],
    "greater_equal": [["X", "Y"], ["Out"]],
    "greater_than": [["X", "Y"], ["Out"]],
    "less_equal": [["X", "Y"], ["Out"]],
    "less_than": [["X", "Y"], ["Out"]],
    "mean": [["X"], ["Out"]],
    "not_equal": [["X", "Y"], ["Out"]],
    "reshape": [["X"], ["Out"]],
    "reshape2": [["X"], ["Out"]],
    "transpose2": [["X"], ["Out"]],
    "nearest_interp": [["X"], ["Out"]],
    "trilinear_interp": [["X"], ["Out"]],
    "slice": [["Input"], ["Out"]],
    "squeeze": [["X"], ["Out"]],
    "elementwise_sub": [["X", "Y"], ["Out"]],
    "relu": [["X"], ["Out"]],
    "relu6": [["X"], ["Out"]],
    "leaky_relu": [["X"], ["Out"]],
    "prelu": [["X", "Alpha"], ["Out"]],
    "tanh": [["X"], ["Out"]],
    "swish": [["X"], ["Out"]],
    "dropout": [["X"], ["Out"]],
    "batch_norm": [["X"], ["Y"]],
    "layer_norm": [["X"], ["Y"]],
    "sigmoid": [["X"], ["Out"]],
    "elementwise_mul": [["X", "Y"], ["Out"]],
    "elementwise_pow": [["X", "Y"], ["Out"]],
    "hard_swish": [["X"], ["Out"]],
    "hard_sigmoid": [["X"], ["Out"]],
    "gru": [["Input", "Weight"], ["Hidden"]],
    "lstm": [["Input", "Weight"], ["Hidden"]],
    "pad2d": [["X"], ["Out"]],
    "pad3d": [["X"], ["Out"]],
    "flatten": [["X"], ["Out"]],
    "flatten2": [["X"], ["Out"]],
    "unsqueeze2": [["X"], ["Out"]],
    "flatten_contiguous_range": [["X"], ["Out"]],
    "split": [["X"], ["Out"]],
    "squeeze2": [["X"], ["Out"]],
    "nearest_interp_v2": [["X"], ["Out"]],
    "bilinear_interp": [["X"], ["Out"]],
    "bilinear_interp_v2": [["X"], ["Out"]],
    "fill_constant_batch_size_like": [["Input"], ["Out"]],
    "arg_max": [["X"], ["Out"]],
    "abs": [["X"], ["Out"]],
    "assign": [["X"], ["Out"]],
    "cast": [["X"], ["Out"]],
    "clip": [["X"], ["Out"]],
    "box_coder": [["PriorBox"], ["OutputBox"]],
    "crop": [["X"], ["Out"]],
    "cumsum": [["X"], ["Out"]],
    "expand_v2": [["X"], ["Out"]],
    "fill_any_like": [["X"], ["Out"]],
    "fill_constant": [[], ["Out"]],
    "gelu": [["X"], ["Out"]],
    "instance_norm": [["X"], ["Y"]],
    "lookup_table": [["W", "Ids"], ["Out"]],
    "lookup_table_v2": [["W", "Ids"], ["Out"]],
    "norm": [["X"], ["Norm"]],
    "p_norm": [["X"], ["Out"]],
    "pow": [["X"], ["Out"]],
    "reduce_mean": [["X"], ["Out"]],
    "stack": [["X"], ["Y"]],
    "top_k_v2": [["X"], ["Out", "Indices"]],
    "logical_and": [["X", "Y"], ["Out"]],
    "logical_not": [["X"], ["Out"]],
    "meshgrid": [["X"], ["Out"]],
    "roi_align": [["X", "ROIs"], ["Out"]],
    "strided_slice": [["Input"], ["Out"]],
    "where": [["Condition", "X", "Y"], ["Out"]],
    "grid_sampler": [["X", "Grid"], ["Output"]],
    "tile": [["X"], ["Out"]],
    "group_norm": [["X"], ["Y", "Mean", "Variance"]],
    "reduce_sum": [["X"], ["Out"]],
    "square": [["X"], ["Out"]],
    "softplus": [["X"], ["Out"]],
    "shuffle_channel": [["X"], ["Out"]],
    "reduce_max": [["X"], ["Out"]],
    "scale": [["X"], ["Out"]],
}

Supported backends

The supported backends are:

support_deploy_backend = [None, "tensorrt", "mkldnn", "arm"]

The corresponding quantizer classes are BaseQuantizer, TensorRTQuantizer, MKLDNNQuantizer, and ARMCPUQuantizer. Compared with OpenPPL, the number of supported backends is quite small; the differences in optimization strategy between backends are a topic for a future post.

The sections below use BaseQuantizer as the running example.

Quantization procedure

The key part is the iterative search for boundary values. Each sampling pass can apply a different strategy; let's first look at abs_max() as an example.
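
As a warm-up, here is what abs_max calibration boils down to, as a standalone numpy sketch (my own illustration, not PaddleSlim code): the threshold is the largest absolute value in the tensor, and the scale maps that range onto int8.

```python
import numpy as np

def abs_max_scale(x, num_bits=8):
    # Threshold = largest absolute value observed in the tensor.
    abs_max = float(np.max(np.abs(x)))
    # Scale maps the float range [-abs_max, abs_max] onto [-127, 127].
    return abs_max / (2 ** (num_bits - 1) - 1)

x = np.array([-0.5, 0.25, 1.27, -1.0], dtype=np.float32)
s = abs_max_scale(x)
# Quantize symmetrically with that scale.
q = np.clip(np.round(x / s), -127, 127).astype(np.int8)
```

Here the threshold is 1.27, so the scale is 1.27 / 127 = 0.01 and the largest element maps exactly to 127.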

Preparation stage

        if self._algo in ["KL", "hist"]:
            batch_id = 0
            with tqdm(
                total=self._batch_nums,
                bar_format='Preparation stage, Run batch:|{bar}| {n_fmt}/{total_fmt}',
                ncols=80,
            ) as t:
                for data in self._data_loader():
                    self._executor.run(
                        program=self._program,
                        feed=data,
                        fetch_list=self._fetch_list,
                        return_numpy=False,
                        scope=self._scope,
                    )
                    self._collect_activation_abs_min_max()
                    batch_id += 1
                    t.update()
                    if self._batch_nums and batch_id >= self._batch_nums:
                        break
            self._init_sampling_act_histogram()

Note that for the KL and hist algorithms, abs_min and abs_max must first be computed for all activations, via the _collect_activation_abs_min_max() method.

Sampling stage

First load the weight values:

var_tensor = utils.load_variable_data(self._scope, var_name)

The boundary maximum is then found in one of two modes, abs_max or channel_wise_abs_max:

                if self._weight_quantize_type == "abs_max":
                    abs_max_value = float(np.max(np.abs(var_tensor)))
                elif self._weight_quantize_type == "channel_wise_abs_max":
                    abs_max_value = []
                    if (
                        self._weight_op_pairs[var_name]
                        in utils._channelwise_quant_axis1_ops
                    ):
                        for i in range(var_tensor.shape[1]):
                            abs_max_value.append(
                                float(np.max(np.abs(var_tensor[:, i])))
                            )
                    else:
                        for i in range(var_tensor.shape[0]):
                            abs_max_value.append(
                                float(np.max(np.abs(var_tensor[i])))
                            )
                self._quantized_threshold[var_name] = abs_max_value
        batch_id = 0
        with tqdm(
            total=self._batch_nums,
            bar_format='Sampling stage, Run batch:|{bar}| {n_fmt}/{total_fmt}',
            ncols=80,
        ) as t:
            for data in self._data_loader():
                self._executor.run(
                    program=self._program,
                    feed=data,
                    fetch_list=self._fetch_list,
                    return_numpy=False,
                    scope=self._scope,
                )
                self._sampling()
                batch_id += 1
                t.update()
                if self._batch_nums and batch_id >= self._batch_nums:
                    break
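
The two weight-calibration modes above can be condensed into a small standalone sketch (a re-implementation for illustration, not PaddleSlim's code): per-tensor abs_max keeps one threshold for the whole weight, while channel_wise_abs_max keeps one per channel along the quantization axis.

```python
import numpy as np

def weight_thresholds(w, mode="abs_max", quant_axis=0):
    if mode == "abs_max":
        # One threshold for the entire tensor.
        return float(np.max(np.abs(w)))
    # One threshold per channel along quant_axis (axis 1 is used by ops
    # like conv2d_transpose/mul, axis 0 by conv2d and friends).
    if quant_axis == 1:
        return [float(np.max(np.abs(w[:, i]))) for i in range(w.shape[1])]
    return [float(np.max(np.abs(w[i]))) for i in range(w.shape[0])]

w = np.array([[1.0, -2.0], [0.5, 4.0]], dtype=np.float32)
t_tensor = weight_thresholds(w)                           # 4.0
t_chan = weight_thresholds(w, "channel_wise_abs_max")     # [2.0, 4.0]
```

Channel-wise thresholds avoid one outlier channel (here the 4.0) inflating the scale, and hence the rounding error, of every other channel.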

Saving the scales

The outcome of quantization is a scale for each node. The scale was previously stored inside each tensor_name, so here it only needs to be split back out:

real_tensor_name, opera, scalar = tensor_name.split('#')

max_scale is continuously updated here; it is used later when rescaling according to opera.
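
A tiny illustration of that split. The tensor name below is hypothetical (the exact naming format is internal to PaddleSlim); the point is simply that three fields are packed into one string with '#' as the separator:

```python
# Hypothetical packed name: real tensor name, an index for the rescaling
# operation, and the statistic kind.
tensor_name = "conv2d_0.tmp_0#0#max"
real_tensor_name, opera, scalar = tensor_name.split('#')
# real_tensor_name -> "conv2d_0.tmp_0", opera -> "0", scalar -> "max"
```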

Inserting quant/dequant nodes

For example, in the figure D is the dequantize operation and Q is the quantize operation.

We need to insert the quantize and dequantize nodes into the compute graph:

        # use QuantizationTransformPass to insert fake_quant/fake_dequantize op
        if not self._onnx_format:
            transform_pass = QuantizationTransformPass(
                scope=self._scope,
                place=self._place,
                weight_bits=self._weight_bits,
                activation_bits=self._activation_bits,
                activation_quantize_type=self._activation_quantize_type,
                weight_quantize_type=self._weight_quantize_type,
                quantizable_op_type=self.quant_config.weight_quant_operation_types,
            )
        else:
            transform_pass = QuantizationTransformPassV2(
                scope=self._scope,
                place=self._place,
                weight_bits=self._weight_bits,
                activation_bits=self._activation_bits,
                activation_quantize_type=self._activation_quantize_type,
                weight_quantize_type=self._weight_quantize_type,
                quantizable_op_type=self.quant_config.weight_quant_operation_types,
            )

        for sub_graph in graph.all_sub_graphs():
            # Insert fake_quant/fake_dequantize op must in test graph, so
            # set per graph's _for_test is True.
            sub_graph._for_test = True
            transform_pass.apply(sub_graph)

        # use AddQuantDequantPass to insert fake_quant_dequant op
        if not self._onnx_format:
            add_quant_dequant_pass = AddQuantDequantPass(
                scope=self._scope,
                place=self._place,
                quantizable_op_type=self.quant_config.activation_quant_operation_types,
            )
        else:
            add_quant_dequant_pass = AddQuantDequantPassV2(
                scope=self._scope,
                place=self._place,
                quantizable_op_type=self.quant_config.activation_quant_operation_types,
            )

        for sub_graph in graph.all_sub_graphs():
            sub_graph._for_test = True
            add_quant_dequant_pass.apply(sub_graph)
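
Conceptually, an inserted fake_quant/fake_dequant node quantizes a tensor and immediately dequantizes it, so downstream ops see exactly the rounding error that real int8 inference would introduce, while the graph still runs in float. A minimal numpy sketch (assuming symmetric abs_max quantization; not Paddle's actual implementation):

```python
import numpy as np

def fake_quant_dequant(x, scale, num_bits=8):
    bnt = 2 ** (num_bits - 1) - 1          # 127 for int8
    # Quantize: map [-scale, scale] onto [-127, 127] and round.
    q = np.clip(np.round(x / scale * bnt), -bnt, bnt)
    # Dequantize: map back to float, keeping the rounding error.
    return q * scale / bnt

x = np.array([0.1, -0.52, 0.9], dtype=np.float32)
y = fake_quant_dequant(x, scale=1.0)   # close to x, but snapped to 1/127 steps
```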

Activation calibration walkthrough

Weights are constants, so they need no calibration (their values are known exactly); only the activations have to be calibrated.

Quantization formula

Let r denote the floating-point value before quantization; the quantized integer q can be written as:

q = clip(round(r / s) + z, q_min, q_max)

round(·) and clip(·) denote the rounding and clamping operations, and q_min and q_max are the minimum and maximum values after quantization. s is the quantization step (scale), and z is the zero-point offset. Quantization with z = 0 is called symmetric quantization; with z ≠ 0 it is asymmetric quantization. Symmetric quantization spares the quantized operators from computing the z-related terms during inference, lowering computational cost; asymmetric quantization can choose the minimum and maximum according to the actual data distribution, making fuller use of the quantized range for higher accuracy.
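
The symmetric/asymmetric distinction can be sketched in a few lines (int8 assumed, with q_min/q_max = -128/127 here):

```python
import numpy as np

def quantize(r, s, z=0, q_min=-128, q_max=127):
    # q = clip(round(r / s) + z, q_min, q_max)
    return np.clip(np.round(r / s) + z, q_min, q_max).astype(np.int8)

r = np.array([-1.0, 0.0, 0.5], dtype=np.float32)
q_sym = quantize(r, s=0.5)          # symmetric: z = 0
q_asym = quantize(r, s=0.5, z=10)   # asymmetric: nonzero zero-point
```

With z = 0 the float 0.0 maps exactly to integer 0; with z = 10 the whole integer grid is shifted by the zero-point.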

Detailed workflow

  • Use a histogram to obtain the statistical distribution of the original FP32 data.

  • Pick several candidate thresholds from a given search space and quantize the activations with each of them.

  • Use histograms to obtain the statistical distribution of each quantized result.

  • Compute the difference between each quantized distribution and the original one, and take the threshold with the smallest difference to compute the quantization parameters. Common metrics for measuring distribution difference include KL divergence (Kullback-Leibler divergence), symmetric KL divergence, and JS divergence (Jensen-Shannon divergence).
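
The four steps above can be sketched as a simplified threshold search (illustrative only: a real KL calibrator, PaddleSlim's included, also merges and smooths histogram bins, which this sketch omits):

```python
import numpy as np

def kl_divergence(p, q, eps=1e-10):
    # Normalize both histograms into probability distributions.
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def search_threshold(acts, candidates, bins=2048):
    # Step 1: histogram of the original FP32 data.
    hist, edges = np.histogram(np.abs(acts), bins=bins)
    best_t, best_d = None, float("inf")
    for t in candidates:
        # Step 2: quantize the activations at candidate threshold t.
        clipped = np.clip(acts, -t, t)
        quantized = np.round(clipped / t * 127) / 127 * t
        # Step 3: histogram of the quantized data, on the same bins.
        q_hist, _ = np.histogram(np.abs(quantized), bins=edges)
        # Step 4: keep the threshold with the smallest KL divergence.
        d = kl_divergence(hist.astype(float), q_hist.astype(float))
        if d < best_d:
            best_t, best_d = t, d
    return best_t

np.random.seed(0)
acts = np.random.randn(10000)
best = search_threshold(acts, candidates=[1.0, 2.0, 3.0, 4.0])
```

A smaller threshold clips more outliers but gives finer resolution to the bulk of the distribution; the divergence metric arbitrates that trade-off.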
