(4) TensorRT | Python Inference on the GPU

1. Introduction to TensorRT and Installation

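TensorRT is a high-performance deep learning inference framework for NVIDIA hardware. This article walks through deploying a model with TensorRT on a general-purpose GPU.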

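CUDA must be installed locally first; this article uses CUDA 11.0 and TensorRT-8.2.5.1 as the example. Download the matching archive from the NVIDIA website (login required). The Python wheel (.whl) files are in the python folder at the root of the extracted archive; install the one that matches your environment with pip.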
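After installing the wheel, it can help to confirm that the package is importable before going further. The following is a minimal sketch and not part of the original project; the helper name tensorrt_version is hypothetical. It returns None instead of raising when TensorRT is missing or fails to load.

```python
import importlib
import importlib.util


def tensorrt_version():
    """Return the installed TensorRT version string, or None if unavailable."""
    # find_spec only locates the package; it does not import it.
    if importlib.util.find_spec("tensorrt") is None:
        return None
    try:
        trt = importlib.import_module("tensorrt")
    except (ImportError, OSError):
        # The wheel is present but its native libraries could not be loaded.
        return None
    return getattr(trt, "__version__", None)


version = tensorrt_version()
print(f"TensorRT version: {version}" if version else "TensorRT is not available")
```

If this prints that TensorRT is not available even though pip reported success, a likely cause is that the TensorRT lib directory from the extracted archive is not on the library search path.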

Most of the code in this article comes from the TensorRT repository: https://github.com/NVIDIA/TensorRT/tree/main/samples/python/yolov3_onnx.

2. Basic Usage of TensorRT

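As in a typical inference workflow, the code in this article follows five steps: preprocess the input image, copy the preprocessed data to the device (GPU), run model inference on the device, copy the inference results back to the host (local CPU), and postprocess the results.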
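The five steps can be read as a simple chain of functions. The sketch below only illustrates the order of the stages; run_pipeline, its parameter names, and the file name dog.jpg are hypothetical, and the lambdas are stand-ins for the real stages described in the following sections.

```python
def run_pipeline(image_path, preprocess, to_device, infer, to_host, postprocess):
    """Chain the five stages; every argument after image_path is a callable."""
    host_input = preprocess(image_path)    # 1. preprocess on the CPU
    device_input = to_device(host_input)   # 2. copy host -> device
    device_output = infer(device_input)    # 3. run the model on the GPU
    host_output = to_host(device_output)   # 4. copy device -> host
    return postprocess(host_output)        # 5. postprocess on the CPU


# Stand-in stages that just wrap their input so the call order is visible.
result = run_pipeline(
    "dog.jpg",
    preprocess=lambda x: f"pre({x})",
    to_device=lambda x: f"h2d({x})",
    infer=lambda x: f"infer({x})",
    to_host=lambda x: f"d2h({x})",
    postprocess=lambda x: f"post({x})",
)
print(result)  # post(d2h(infer(h2d(pre(dog.jpg)))))
```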

2.1 Model Conversion

This article starts from an ONNX model and converts it to the trt file format used for TensorRT inference. The key function is runtime.deserialize_cuda_engine, whose prototype is:

deserialize_cuda_engine(self: tensorrt.tensorrt.Runtime, serialized_engine: buffer)→ tensorrt.tensorrt.ICudaEngine

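It is a member function of the class tensorrt.tensorrt.Runtime and takes a single argument, serialized_engine. That argument can come either from reading an existing trt file or from serializing an ONNX model.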

2.1.1 Reading a trt File

with open(engine_file_path, "rb") as f:
    with trt.Runtime(TRT_LOGGER) as runtime:
        return runtime.deserialize_cuda_engine(f.read())

2.1.2 Serializing an ONNX Model

The key function for serializing an ONNX model is build_serialized_network, whose prototype is:

build_serialized_network(self: tensorrt.tensorrt.Builder, network: tensorrt.tensorrt.INetworkDefinition, config: tensorrt.tensorrt.IBuilderConfig)→ tensorrt.tensorrt.IHostMemory

It is a member function of the class tensorrt.tensorrt.Builder and takes two arguments, network and config. Its return value plays the same role as the f.read() result above: it is passed to deserialize_cuda_engine.

The first argument, network, has type tensorrt.tensorrt.INetworkDefinition and is obtained from create_network, whose prototype is:

create_network(self: tensorrt.tensorrt.Builder, flags: int = 0)→ tensorrt.tensorrt.INetworkDefinition

The flags argument relates to TensorRT's explicit-batch and implicit-batch modes; see the documentation for details. In Python, the form TensorRT recommends is:

network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
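The expression passed to create_network is a bitmask: each creation flag is an enum value i, and shifting 1 left by i sets bit i. The sketch below shows this with a stand-in enum so that it runs without TensorRT. CreationFlag and flags_to_bitmask are hypothetical names, and the value 0 for EXPLICIT_BATCH is an assumption that, to the best of our knowledge, matches TensorRT 8.x.

```python
from enum import IntEnum


class CreationFlag(IntEnum):
    """Stand-in for trt.NetworkDefinitionCreationFlag (illustration only)."""
    EXPLICIT_BATCH = 0  # assumed value, see the note above


def flags_to_bitmask(*flags):
    """OR together one bit per flag: bit i is set when a flag with value i is given."""
    mask = 0
    for flag in flags:
        mask |= 1 << int(flag)
    return mask


explicit_batch_mask = flags_to_bitmask(CreationFlag.EXPLICIT_BATCH)
print(explicit_batch_mask)  # 1
```

Under that assumption the recommended call is equivalent to passing flags=1, while the default flags=0 selects implicit-batch mode.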

The second argument, config, has type tensorrt.tensorrt.IBuilderConfig and is obtained from create_builder_config, whose prototype is:

create_builder_config(self: tensorrt.tensorrt.Builder)→ tensorrt.tensorrt.IBuilderConfig

The returned config object holds the build settings; see the documentation for the available options.

2.1.3 get_engine

Combining the two approaches gives the get_engine function, which returns the deserialized model.

import os

import tensorrt as trt

TRT_LOGGER = trt.Logger()


def get_engine(onnx_file_path, engine_file_path=""):
    # If engine_file_path does not exist, build the engine from the ONNX file
    def build_engine():
        builder = trt.Builder(TRT_LOGGER)
        # Create an explicit-batch network definition (INetworkDefinition)
        network = builder.create_network(
            1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
        # Create the builder configuration object (IBuilderConfig)
        config = builder.create_builder_config()
        # Create the ONNX parser that populates the network
        parser = trt.OnnxParser(network, TRT_LOGGER)
        # Create the TensorRT runtime used for deserialization
        runtime = trt.Runtime(TRT_LOGGER)
        # Settings (API as of TensorRT 8.2)
        config.max_workspace_size = 1 << 28  # 256 MiB
        builder.max_batch_size = 1
        # Parse the ONNX model; stop early if the file is missing
        if not os.path.exists(onnx_file_path):
            print(f"[ERROR] ONNX file {onnx_file_path} not found.")
            return None
        print(f"[INFO] Loading ONNX file from {onnx_file_path}.")
        with open(onnx_file_path, "rb") as model:
            print("[INFO] Beginning ONNX file parsing.")
            if not parser.parse(model.read()):
                print("[ERROR] Failed to parse the ONNX file.")
                for error in range(parser.num_errors):
                    print(parser.get_error(error))
                return None
        # Set the input shape expected by yolov3.onnx
        network.get_input(0).shape = [1, 3, 608, 608]
        print("[INFO] Completed parsing of ONNX file.")
        print(f"[INFO] Building an engine from {onnx_file_path}.")
        # Serialize the network into an engine plan
        plan = builder.build_serialized_network(network, config)
        if plan is None:
            print("[ERROR] Failed to build the engine.")
            return None
        # Deserialize the plan into an ICudaEngine
        engine = runtime.deserialize_cuda_engine(plan)
        print("[INFO] Completed creating engine.")
        # Cache the plan on disk only when a path was given
        if engine_file_path:
            with open(engine_file_path, "wb") as f:
                f.write(plan)
        return engine

    if os.path.exists(engine_file_path):
        print(f"[INFO] Reading engine from {engine_file_path}.")
        with open(engine_file_path, "rb") as f:
            with trt.Runtime(TRT_LOGGER) as runtime:
                return runtime.deserialize_cuda_engine(f.read())
    else:
        return build_engine()
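Because get_engine caches the built engine at engine_file_path, it is worth passing an explicit path so that later runs can skip the build step. One possible convention, shown here as a sketch, is to store the engine next to the ONNX file with a .trt suffix. The helper name default_engine_path and the example path are hypothetical.

```python
from pathlib import Path


def default_engine_path(onnx_file_path):
    """Return the ONNX path with its suffix replaced by .trt."""
    return str(Path(onnx_file_path).with_suffix(".trt"))


print(default_engine_path("yolov3.onnx"))  # yolov3.trt
```

A call would then look like get_engine(onnx_path, default_engine_path(onnx_path)). Note that a serialized engine is generally tied to the TensorRT version and GPU it was built with, so the cached file should be rebuilt after changing either.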

2.2 Input Image Preprocessing

The YOLOv3 model used in this article comes from DarkNet. Preprocessing consists mainly of two steps: resizing and normalization.

import numpy as np
from PIL import Image


class PreprocessYOLO:
    def __init__(self, input_resolution):
        # Assumed to be (height, width), given the swap in load_and_resize
        self.input_resolution = input_resolution

    def preprocess(self, image_path):
        image_raw, image_resized = self.load_and_resize(image_path)
        image_preprocessed = self.shuffle_and_normalize(image_resized)
        return image_raw, image_preprocessed

    def load_and_resize(self, image_path):
        image_raw = Image.open(image_path)
        # PIL expects the target size as (width, height)
        new_resolution = (self.input_resolution[1], self.input_resolution[0])
        # Image.CUBIC was removed in Pillow 10; BICUBIC is the same filter
        image_resized = image_raw.resize(new_resolution, resample=Image.BICUBIC)
        image_resized = np.array(image_resized, dtype=np.float32, order="C")
        return image_raw, image_resized

    def shuffle_and_normalize(self, image):
        # Normalize pixel values to [0, 1]
        image /= 255.0
        # (h, w, c) -> (c, h, w)
        image = np.transpose(image, [2, 0, 1])
        # (c, h, w) -> (n, c, h, w)
        image = np.expand_dims(image, axis=0)
        # Make the array contiguous in memory for the host-to-device copy
        image = np.array(image, dtype=np.float32, order="C")
        return image
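The normalization and layout change can be checked on a tiny image. An array built from a PIL image is laid out as (height, width, channels), and the model expects (batch, channels, height, width). The sketch below reproduces the same transformation with plain nested lists so that it runs without NumPy; hwc_to_nchw and shape_of are hypothetical helper names, not part of the original code.

```python
def hwc_to_nchw(image):
    """Convert a nested-list image from (H, W, C) in 0..255 to (1, C, H, W) in 0..1."""
    channels = len(image[0][0])
    chw = [
        [[pixel[c] / 255.0 for pixel in row] for row in image]
        for c in range(channels)
    ]
    return [chw]  # wrap once more to add the batch dimension


def shape_of(nested):
    """Return the shape of a regular nested list as a tuple."""
    shape = []
    while isinstance(nested, list):
        shape.append(len(nested))
        nested = nested[0]
    return tuple(shape)


# A 2x3 RGB image (H=2, W=3) in which every pixel is (255, 0, 51).
image = [[[255, 0, 51] for _ in range(3)] for _ in range(2)]
batch = hwc_to_nchw(image)
print(shape_of(image), "->", shape_of(batch))  # (2, 3, 3) -> (1, 3, 2, 3)
print(batch[0][0][0][0], batch[0][2][0][0])    # 1.0 0.2
```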

2.3 Inference

Before running inference, memory must be allocated both on the host (local CPU) and on the device (GPU).

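import pycuda.autoinit  # noqa: F401, creates the CUDA context on import
import pycuda.driver as cuda
import tensorrt as trt


class HostDeviceMem:
    # Pairs a page-locked host buffer with its device buffer
    def __init__(self, host_mem, device_mem):
        self.host = host_mem
        self.device = device_mem


def allocate_buffers(engine):
    inputs = []
    outputs = []
    bindings = []
    stream = cuda.Stream()
    for binding in engine:
        # Number of elements and NumPy dtype of this binding
        size = trt.volume(
            engine.get_binding_shape(binding)) * engine.max_batch_size
        dtype = trt.nptype(
            engine.get_binding_dtype(binding))
        # Allocate page-locked host memory and matching device memory
        host_mem = cuda.pagelocked_empty(size, dtype)
        device_mem = cuda.mem_alloc(host_mem.nbytes)
        # Record the device address in binding order
        bindings.append(int(device_mem))
        # Sort the buffer into inputs or outputs
        if engine.binding_is_input(binding):
            inputs.append(HostDeviceMem(host_mem, device_mem))
        else:
            outputs.append(HostDeviceMem(host_mem, device_mem))
    return inputs, outputs, bindings, stream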
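Each buffer holds the number of elements in the binding's shape multiplied by the size of one element. The sketch below works this out for the [1, 3, 608, 608] input from section 2.1.3, assuming float32 data (4 bytes per element). buffer_nbytes is a hypothetical helper that mirrors the arithmetic without requiring TensorRT or PyCUDA.

```python
import math


def buffer_nbytes(shape, itemsize, batch_size=1):
    """Bytes needed for one binding: element count times bytes per element."""
    return math.prod(shape) * batch_size * itemsize


# 1 * 3 * 608 * 608 = 1,108,992 elements, 4 bytes each
print(buffer_nbytes([1, 3, 608, 608], itemsize=4))  # 4435968
```

So the input buffer takes a little over 4 MiB on the host and the same amount again on the device.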

Then, at inference time, the image data is copied from the CPU to the GPU, inference is executed, and the results are copied from the GPU back to the CPU.

def do_inference(context, bindings, inputs, outputs, stream):
    # Copy the input data from the host to the device
    for inp in inputs:
        cuda.memcpy_htod_async(inp.device, inp.host, stream)
    # Run inference
    context.execute_async_v2(bindings=bindings, stream_handle=stream.handle)
    # Copy the output data from the device to the host
    for out in outputs:
        cuda.memcpy_dtoh_async(out.host, out.device, stream)
    # Wait for all queued work on the stream to finish
    stream.synchronize()
    # Return only the host-side outputs
    return [out.host for out in outputs]

2.4 Output Postprocessing

Once the inference results are available, they are postprocessed by the member functions of the class PostprocessYOLO.
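The code of PostprocessYOLO is not listed in this article. YOLO postprocessing typically ends with non-maximum suppression (NMS), which removes duplicate detections of the same object. The sketch below is a generic pure-Python version for illustration, not the repository's implementation; the (x1, y1, x2, y2) box format, the function names, and the 0.5 threshold are all assumptions.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: visit boxes by descending score, keep those not overlapping a kept box."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]
```

Here the second box overlaps the first with an IoU of 81/119, about 0.68, so it is suppressed, while the third box does not overlap anything and is kept.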

Repository for this project: https://github.com/zhangtaoshan/cv_inference_python. Feedback and discussion are welcome.

3. Summary

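This article covered TensorRT-based inference on the GPU. As with the deployment workflows in the previous articles, it has three main stages: input image preprocessing, model inference, and output postprocessing. ONNX serves as the intermediate format for converting other models into the format TensorRT uses at inference time; a later article will cover the basic workflow for building an ONNX model.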

Original article: https://blog.csdn.net/Skies_/article/details/126554430