Python Pipeline 核心功能

从设计上，Python Pipeline 框架实现轻量级的服务化部署，提供了丰富的核心功能，既能满足服务基本使用，又能满足特性需求。

安装与环境检查
服务启动与关闭
本地与远程推理
批量推理
单机多卡推理
多种计算芯片上推理
TensorRT 推理加速
MKLDNN 推理加速
低精度推理
复杂图结构 DAG 跳过某个 Op 运行

安装与环境检查

在运行 Python Pipeline 服务前，确保当前环境下可部署且通过安装指南已完成安装。其次，v0.8.0及以上版本提供了环境检查功能，检验环境是否安装正确。

输入以下命令，进入环境检查程序。

python3 -m paddle_serving_server.serve check

在环境检验程序中输入多条指令来检查，例如 check_pipeline，check_all等，完整指令列表如下。

指令	描述
check_all	检查 Paddle Inference、Pipeline Serving、C++ Serving。只打印检测结果，不记录日志
check_pipeline	检查 Pipeline Serving，只打印检测结果，不记录日志
check_cpp	检查 C++ Serving，只打印检测结果，不记录日志
check_inference	检查 Paddle Inference 是否安装正确，只打印检测结果，不记录日志
debug	发生报错后，该命令将打印提示日志到屏幕，并记录详细日志文件
exit	退出

程序会分别运行 cpu 和 gpu 示例。运行成功则打印 Pipeline cpu environment running success 和 Pipeline gpu environment running success。

/usr/local/lib/python3.7/runpy.py:125: RuntimeWarning: 'paddle_serving_server.serve' found in sys.modules after import of package 'paddle_serving_server', but prior to execution of 'paddle_serving_server.serve'; this may result in unpredictable behaviour
  warn(RuntimeWarning(msg))
Welcome to the check env shell.Type help to list commands.

(Cmd) check_pipeline
Pipeline cpu environment running success
Pipeline gpu environment running success

运行失败时，错误信息会记录到当前目录下 stderr.log 文件和 Pipeline_test_cpu/PipelineServingLogs 目录下。用户可根据错误信息调试。

(Cmd) check_all
PaddlePaddle inference environment running success
C++ cpu environment running success
C++ gpu environment running failure, if you need this environment, please refer to https://github.com/PaddlePaddle/Serving/blob/develop/doc/Install_CN.md
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/local/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.7/site-packages/paddle_serving_server/serve.py", line 541, in <module>
    Check_Env_Shell().cmdloop()
  File "/usr/local/lib/python3.7/cmd.py", line 138, in cmdloop
    stop = self.onecmd(line)
  File "/usr/local/lib/python3.7/cmd.py", line 217, in onecmd
    return func(arg)
  File "/usr/local/lib/python3.7/site-packages/paddle_serving_server/serve.py", line 501, in do_check_all
    check_env("all")
  File "/usr/local/lib/python3.7/site-packages/paddle_serving_server/env_check/run.py", line 94, in check_env
    run_test_cases(pipeline_test_cases, "Pipeline", is_open_std)
  File "/usr/local/lib/python3.7/site-packages/paddle_serving_server/env_check/run.py", line 66, in run_test_cases
    mv_log_to_new_dir(new_dir_path)
  File "/usr/local/lib/python3.7/site-packages/paddle_serving_server/env_check/run.py", line 48, in mv_log_to_new_dir
    shutil.move(file_path, dir_path)
  File "/usr/local/lib/python3.7/shutil.py", line 555, in move
    raise Error("Destination path '%s' already exists" % real_dst)
shutil.Error: Destination path '/home/work/Pipeline_test_cpu/PipelineServingLogs' already exists

服务启动与关闭

服务启动需要三类文件，PYTHON 程序、模型文件和配置文件。以Python Pipeline 快速部署案例为例，

.
├── config.yml
├── imgs
│   └── ggg.png
├── ocr_det_client
│   ├── serving_client_conf.prototxt
│   └── serving_client_conf.stream.prototxt
├── ocr_det_model
│   ├── inference.pdiparams
│   ├── inference.pdmodel
│   ├── serving_server_conf.prototxt
│   └── serving_server_conf.stream.prototxt
├── ocr_det.tar.gz
├── ocr_rec_client
│   ├── serving_client_conf.prototxt
│   └── serving_client_conf.stream.prototxt
├── ocr_rec_model
│   ├── inference.pdiparams
│   ├── inference.pdmodel
│   ├── serving_server_conf.prototxt
│   └── serving_server_conf.stream.prototxt
├── pipeline_http_client.py
├── pipeline_rpc_client.py
├── ppocr_keys_v1.txt
└── web_service.py

启动服务端程序运行 web_service.py，启动客户端程序运行 pipeline_http_client.py 或 pipeline_rpc_client.py。服务端启动的日志信息在 PipelineServingLogs 目录下可用于调试。

├── PipelineServingLogs
│   ├── pipeline.log
│   ├── pipeline.log.wf
│   └── pipeline.tracer

关闭程序可使用2种方式，

前台关闭程序：Ctrl+C 关停服务
后台关闭程序：

python3 -m paddle_serving_server.serve stop   # 触发 SIGINT 信号
python3 -m paddle_serving_server.serve kill   # 触发 SIGKILL 信号，强制关闭

本地与远程推理

本地推理是指在服务所在机器环境下开启多进程推理，而远程推理是指本地服务请求远程 C++ Serving 推理服务。

本地推理的优势是实现简单，一般本地处理相比于远程推理耗时更低。而远程推理的优势是可实现 Python Pipeline 较难实现的功能，如部署加密模型，大模型推理。

Python Pipeline 的本地推理可参考如下配置，在 uci op 中增加 local_service_conf 配置，并设置 client_type: local_predictor。

op:
    uci:
        #并发数，is_thread_op=True时，为线程并发；否则为进程并发
        concurrency: 10

        #当op配置没有server_endpoints时，从local_service_conf读取本地服务配置
        local_service_conf:

            #uci模型路径
            model_config: uci_housing_model

            #计算硬件类型: 空缺时由devices决定(CPU/GPU)，0=cpu, 1=gpu, 2=tensorRT, 3=arm cpu, 4=kunlun xpu
            device_type: 0

            #计算硬件ID，优先由device_type决定硬件类型。devices为""或空缺时为CPU预测；当为"0", "0,1,2"时为GPU预测，表示使用的GPU卡
            devices: "" # "0,1"

            #client类型，包括brpc, grpc和local_predictor.local_predictor不启动Serving服务，进程内预测
            client_type: local_predictor

            #Fetch结果列表，以client_config中fetch_var的alias_name为准
            fetch_list: ["price"]

Python Pipeline 的远程推理可参考如下配置，设置 client_type: brpc，server_endpoints，timeout 和本地 client_config。

op:
    bow:
        #并发数，is_thread_op=True时，为线程并发；否则为进程并发
        concurrency: 1
    
        #client连接类型，brpc
        client_type: brpc

        #Serving交互重试次数，默认不重试
        retry: 1

        #Serving交互超时时间, 单位ms
        timeout: 3000

        #Serving IPs
        server_endpoints: ["127.0.0.1:9393"]

        #bow模型client端配置
        client_config: "imdb_bow_client_conf/serving_client_conf.prototxt"

        #Fetch结果列表，以client_config中fetch_var的alias_name为准
        fetch_list: ["prediction"]

批量推理

Pipeline 支持批量推理，通过增大 batch size 可以提高 GPU 利用率。Python Pipeline 支持3种 batch 形式以及适用的场景如下：

场景1：客户端打包批量数据(Client Batch)
场景2：服务端合并多个请求动态合并批量(Server auto-batching)
场景3：拆分一个大批量的推理请求为多个小批量推理请求(Server mini-batch)

一.客户端打包批量数据

当输入数据是 numpy 类型，如shape 为[4, 3, 512, 512]的 numpy 数据，即4张图片，可直接作为输入数据。当输入数据的 shape 不同时，需要按最大的shape的尺寸 Padding 对齐后发送给服务端

二.服务端合并多个请求动态合并批量

有助于提升吞吐和计算资源的利用率，当多个请求的 shape 尺寸不相同时，不支持合并。当前有2种合并策略，分别是：

等待时间与最大批量结合（推荐）：结合batch_size和auto_batching_timeout配合使用，实际请求的批量条数超过batch_size时会立即执行，不超过时会等待auto_batching_timeout时间再执行

op:
    bow:
        # 并发数，is_thread_op=True时，为线程并发；否则为进程并发
        concurrency: 1

        # client连接类型，brpc, grpc和local_predictor
        client_type: brpc

        # Serving IPs
        server_endpoints: ["127.0.0.1:9393"]

        # bow模型client端配置
        client_config: "imdb_bow_client_conf/serving_client_conf.prototxt"

        # 批量查询Serving的数量, 默认1。batch_size>1要设置auto_batching_timeout，否则不足batch_size时会阻塞
        batch_size: 2

        # 批量查询超时，与batch_size配合使用
        auto_batching_timeout: 2000

阻塞式等待：仅设置batch_size，不设置auto_batching_timeout或auto_batching_timeout=0，会一直等待接受 batch_size 个请求后再推理。

op:
    bow:
        # 并发数，is_thread_op=True时，为线程并发；否则为进程并发
        concurrency: 1

        # client连接类型，brpc, grpc和local_predictor
        client_type: brpc

        # Serving IPs
        server_endpoints: ["127.0.0.1:9393"]

        # bow模型client端配置
        client_config: "imdb_bow_client_conf/serving_client_conf.prototxt"

        # 批量查询Serving的数量, 默认1。batch_size>1要设置auto_batching_timeout，否则不足batch_size时会阻塞
        batch_size: 2

        # 批量查询超时，与batch_size配合使用
        auto_batching_timeout: 2000

三.Mini-Batch

拆分一个批量数据推理请求成为多个小块推理：会降低批量数据 Padding 对齐的大小，从而提升速度。可参考 OCR 示例，核心思路是拆分数据成多个小批量，放入 list 对象 feed_list 并返回

def preprocess(self, input_dicts, data_id, log_id):
    (_, input_dict), = input_dicts.items()
    raw_im = input_dict["image"]
    data = np.frombuffer(raw_im, np.uint8)
    im = cv2.imdecode(data, cv2.IMREAD_COLOR)
    dt_boxes = input_dict["dt_boxes"]
    dt_boxes = self.sorted_boxes(dt_boxes)
    feed_list = []
    img_list = []
    max_wh_ratio = 0

    ## Many mini-batchs, the type of feed_data is list.
    max_batch_size = len(dt_boxes)

    # If max_batch_size is 0, skipping predict stage
    if max_batch_size == 0:
        return {}, True, None, ""
    boxes_size = len(dt_boxes)
    batch_size = boxes_size // max_batch_size
    rem = boxes_size % max_batch_size
    for bt_idx in range(0, batch_size + 1):
        imgs = None
        boxes_num_in_one_batch = 0
        if bt_idx == batch_size:
            if rem == 0:
                continue
            else:
                boxes_num_in_one_batch = rem
        elif bt_idx < batch_size:
            boxes_num_in_one_batch = max_batch_size
        else:
            _LOGGER.error("batch_size error, bt_idx={}, batch_size={}".
                          format(bt_idx, batch_size))
            break

        start = bt_idx * max_batch_size
        end = start + boxes_num_in_one_batch
        img_list = []
        for box_idx in range(start, end):
            boximg = self.get_rotate_crop_image(im, dt_boxes[box_idx])
            img_list.append(boximg)
            h, w = boximg.shape[0:2]
            wh_ratio = w * 1.0 / h
            max_wh_ratio = max(max_wh_ratio, wh_ratio)
        _, w, h = self.ocr_reader.resize_norm_img(img_list[0],
                                                  max_wh_ratio).shape

        imgs = np.zeros((boxes_num_in_one_batch, 3, w, h)).astype('float32')
        for id, img in enumerate(img_list):
            norm_img = self.ocr_reader.resize_norm_img(img, max_wh_ratio)
            imgs[id] = norm_img
        feed = {"x": imgs.copy()}
        feed_list.append(feed)

        return feed_list, False, None, ""

单机多卡推理

单机多卡推理与 config.yml 中配置4个参数关系紧密，is_thread_op、concurrency、device_type 和 devices，必须在进程模型和 GPU 模式，每张卡上可分配多个进程，即 M 个 Op 进程与 N 个 GPU 卡绑定。

dag:
    #op资源类型, True, 为线程模型；False，为进程模型
    is_thread_op: False

op:
    det:
        #并发数，is_thread_op=True时，为线程并发；否则为进程并发
        concurrency: 6

        #当op配置没有server_endpoints时，从local_service_conf读取本地服务配置
        local_service_conf:

            client_type: local_predictor

            # device_type, 0=cpu, 1=gpu, 2=tensorRT, 3=arm cpu, 4=kunlun xpu
            device_type: 0

            # 计算硬件 ID，当 devices 为""或不写时为 CPU 预测；当 devices 为"0", "0,1,2"时为 GPU 预测，表示使用的 GPU 卡
            devices: "0,1,2"

以上述案例为例，concurrency:6，即启动6个进程，devices:0,1,2，根据轮询分配机制，得到如下绑定关系：

进程ID: 0 绑定 GPU 卡0
进程ID: 1 绑定 GPU 卡1
进程ID: 2 绑定 GPU 卡2
进程ID: 3 绑定 GPU 卡0
进程ID: 4 绑定 GPU 卡1
进程ID: 5 绑定 GPU 卡2
进程ID: 6 绑定 GPU 卡0

对于更灵活的进程与 GPU 卡绑定方式，会持续开发。

多种计算芯片上推理

除了支持 CPU、GPU 芯片推理之外，Python Pipeline 还支持在多种计算硬件上推理。根据 config.yml 中的 device_type 和 devices来设置推理硬件和加速库如下：

CPU(Intel) : 0
GPU(GPU / Jetson / 海光 DCU) : 1
TensorRT : 2
CPU(Arm) : 3
XPU : 4
Ascend310 : 5
ascend910 : 6

当不设置device_type时，根据 devices 来设置，即当 device_type 为 "" 或空缺时为 CPU 推理；当有设定如"0,1,2"时，为 GPU 推理，并指定 GPU 卡。

以使用 XPU 的编号为0卡为例，配合 ir_optim 一同开启，config.yml详细配置如下：

# 计算硬件类型
device_type: 4

# 计算硬件ID，优先由device_type决定硬件类型
devices: "0"

# 开启ir优化
ir_optim: True

TensorRT 推理加速

TensorRT 是一个高性能的深度学习推理优化器，在 Nvdia 的 GPU 硬件平台运行的推理框架，为深度学习应用提供低延迟、高吞吐率的部署推理。

通过设置device_type、devices和ir_optim 字段即可实现 TensorRT 高性能推理。必须同时设置 ir_optim: True 才能开启 TensorRT。

op:
    imagenet:
        #并发数，is_thread_op=True时，为线程并发；否则为进程并发
        concurrency: 1

        #当op配置没有server_endpoints时，从local_service_conf读取本地服务配置
        local_service_conf:

            #uci模型路径
            model_config: serving_server/

            #计算硬件类型: 空缺时由devices决定(CPU/GPU)，0=cpu, 1=gpu, 2=tensorRT, 3=arm cpu, 4=kunlun xpu
            device_type: 2

            #计算硬件ID，当devices为""或不写时为CPU预测；当devices为"0", "0,1,2"时为GPU预测，表示使用的GPU卡
            devices: "1" # "0,1"

            #client类型，包括brpc, grpc和local_predictor.local_predictor不启动Serving服务，进程内预测
            client_type: local_predictor

            #Fetch结果列表，以client_config中fetch_var的alias_name为准
            fetch_list: ["score"]

            #开启 ir_optim
            ir_optim: True

MKL-DNN 推理加速

MKL-DNN 针对 Intel CPU 和 GPU 的数学核心库，对深度学习网络进行算子和指令集的性能优化，从而提升执行速度。Paddle 框架已集成了 MKL-DNN。

目前仅支持 Intel CPU 推理加速，通过设置device_type 和 devices 和 use_mkldnn 字段使用 MKL-DNN。

op:
    imagenet:
        #并发数，is_thread_op=True时，为线程并发；否则为进程并发
        concurrency: 1

        #当op配置没有server_endpoints时，从local_service_conf读取本地服务配置
        local_service_conf:

            #uci模型路径
            model_config: serving_server/

            #计算硬件类型: 空缺时由devices决定(CPU/GPU)，0=cpu, 1=gpu, 2=tensorRT, 3=arm cpu, 4=kunlun xpu
            device_type: 0

            #计算硬件ID，当devices为""或不写时为CPU预测；当devices为"0", "0,1,2"时为GPU预测，表示使用的GPU卡
            devices: ""

            #client类型，包括brpc, grpc和local_predictor.local_predictor不启动Serving服务，进程内预测
            client_type: local_predictor

            #Fetch结果列表，以client_config中fetch_var的alias_name为准
            fetch_list: ["score"]

            #开启 MKLDNN
            use_mkldnn: True

低精度推理

Pipeline Serving支持低精度推理，CPU、GPU和TensoRT支持的精度类型如下图所示：

低精度推理需要有量化模型，配合config.yml配置一起使用，以低精度示例为例

一.CPU 低精度推理

通过设置，device_type 和 devices 字段使用 CPU 推理，通过调整precision、thread_num和use_mkldnn参数选择低精度和性能调优。

op:
    imagenet:
        #并发数，is_thread_op=True时，为线程并发；否则为进程并发
        concurrency: 1

        #当op配置没有server_endpoints时，从local_service_conf读取本地服务配置
        local_service_conf:

            #uci模型路径
            model_config: serving_server/

            #计算硬件类型: 空缺时由devices决定(CPU/GPU)，0=cpu, 1=gpu, 2=tensorRT, 3=arm cpu, 4=kunlun xpu
            device_type: 0

            #计算硬件ID，当devices为""或不写时为CPU预测；当devices为"0", "0,1,2"时为GPU预测，表示使用的GPU卡
            devices: ""

            #client类型，包括brpc, grpc和local_predictor.local_predictor不启动Serving服务，进程内预测
            client_type: local_predictor

            #Fetch结果列表，以client_config中fetch_var的alias_name为准
            fetch_list: ["score"]

            #精度，CPU 支持: "fp32"(default), "bf16"(mkldnn); 不支持: "int8"
            precision: "bf16"

            #CPU 算数计算线程数，默认4线程
            thread_num: 10

            #开启 MKLDNN
            use_mkldnn: True

二.GPU 和 TensorRT 低精度推理

通过设置device_type 和 devices 字段使用原生 GPU 或 TensorRT 推理，通过调整precision、ir_optim和use_calib参数选择低精度和性能调优，如开启 TensorRT，必须一同开启ir_optim，use_calib仅配合 int8 使用。

op:
    imagenet:
        #并发数，is_thread_op=True时，为线程并发；否则为进程并发
        concurrency: 1

        #当op配置没有server_endpoints时，从local_service_conf读取本地服务配置
        local_service_conf:

            #uci模型路径
            model_config: serving_server/

            #计算硬件类型: 空缺时由devices决定(CPU/GPU)，0=cpu, 1=gpu, 2=tensorRT, 3=arm cpu, 4=kunlun xpu
            device_type: 2

            #计算硬件ID，当devices为""或不写时为CPU预测；当devices为"0", "0,1,2"时为GPU预测，表示使用的GPU卡
            devices: "1" # "0,1"

            #client类型，包括brpc, grpc和local_predictor.local_predictor不启动Serving服务，进程内预测
            client_type: local_predictor

            #Fetch结果列表，以client_config中fetch_var的alias_name为准
            fetch_list: ["score"]

            #精度，GPU 支持: "fp32"(default), "fp16", "int8"
            precision: "int8"

            #开启 TensorRT int8 calibration
            use_calib: True

            #开启 ir_optim
            ir_optim: True

三.性能测试

测试环境如下：

GPU 型号: A100-40GB
CPU 型号: Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz * 160
CUDA: CUDA Version: 11.2
CuDNN: 8.0

测试方法：

模型: Resnet50 量化模型
部署方法: Python Pipeline 部署
计时方法: 刨除第一次运行初始化，运行100次计算平均值

在此环境下测试不同精度推理结果，GPU 推理性能较好的配置是

GPU + int8 + ir_optim + TensorRT + use_calib : 15.1 ms
GPU + fp16 + ir_optim + TensorRT : 17.2 ms

CPU 推理性能较好的配置是

CPU + bf16 + MKLDNN : 18.2 ms
CPU + fp32 + thread_num=10 : 18.4 ms

完整性能指标如下：

复杂图结构 DAG 跳过某个 Op 运行

此应用场景一般在 Op 前后处理中有 if 条件判断时，不满足条件时，跳过后面处理。实际做法是在跳过此 Op 的 process 阶段，只要在 preprocess 做好判断，跳过 process 阶段，在和 postprocess 后直接返回即可。 preprocess 返回结果列表的第二个结果是 is_skip_process=True 表示是否跳过当前 Op 的 process 阶段，直接进入 postprocess 处理。

## Op::preprocess() 函数实现
def preprocess(self, input_dicts, data_id, log_id):
    """
    In preprocess stage, assembling data for process stage. users can 
    override this function for model feed features.
    Args:
        input_dicts: input data to be preprocessed
        data_id: inner unique id
        log_id: global unique id for RTT
    Return:
        input_dict: data for process stage
        is_skip_process: skip process stage or not, False default
        prod_errcode: None default, otherwise, product errores occured.
                      It is handled in the same way as exception. 
        prod_errinfo: "" default
    """
    # multiple previous Op
    if len(input_dicts) != 1:
        _LOGGER.critical(
            self._log(
                "Failed to run preprocess: this Op has multiple previous "
                "inputs. Please override this func."))
        os._exit(-1)
    (_, input_dict), = input_dicts.items()
    return input_dict, False, None, ""

以下示例 Jump::preprocess() 重载了原函数，返回了 True 字段

class JumpOp(Op):
    ## Overload func JumpOp::preprocess
    def preprocess(self, input_dicts, data_id, log_id):
        (_, input_dict), = input_dicts.items()
        if input_dict.has_key("jump"):
            return input_dict, True, None, ""
        else
            return input_dict, False, None, ""

PaddlePaddle / Serving

Python Pipeline 核心功能

安装与环境检查

服务启动与关闭

本地与远程推理

批量推理

单机多卡推理

多种计算芯片上推理

TensorRT 推理加速

MKL-DNN 推理加速

低精度推理

复杂图结构 DAG 跳过某个 Op 运行

About

Releases

Contributors

Activities

PaddlePaddle / Serving .gitee-modal { width: 500px !important; }

Python Pipeline 核心功能

安装与环境检查

服务启动与关闭

本地与远程推理

批量推理

单机多卡推理

多种计算芯片上推理

TensorRT 推理加速

MKL-DNN 推理加速

低精度推理

复杂图结构 DAG 跳过某个 Op 运行

About

Releases

The Open Source Evaluation Index is derived from the OSS Compass evaluation system, which evaluates projects around the following three dimensions

Contributors

Activities

Search

PaddlePaddle / Serving