There are currently two main ways to convert a PyTorch model to TensorRT. One is to first convert PyTorch to ONNX and then parse the ONNX format with TensorRT; the other is to convert PyTorch directly to TensorRT with the open-source project torch2trt. Both approaches run into missing-operator problems to some degree. For example, when converting PyTorch to ONNX, ONNX does not support einsum (einops.rearrange can be used instead), torch.nn.functional.fold (no good replacement found yet; both speed and accuracy suffer), or nn.AdaptiveAvgPool2d (it has to be changed to a plain nn.AvgPool2d(), with little impact on accuracy). There are also operators that ONNX supports but TensorRT does not, such as Mod/fmod, MaxPool3D, and torch.nn.functional.pad.
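As an illustration of the nn.AdaptiveAvgPool2d workaround, here is a minimal sketch of replacing it with a fixed-kernel nn.AvgPool2d; the Head module and the 16×16 feature-map size are hypothetical and only serve to show the idea (the replacement is valid only when the incoming feature-map size is known and fixed):

```python
import torch
import torch.nn as nn

class Head(nn.Module):
    """Hypothetical head whose input feature map is always 16x16."""
    def __init__(self, channels=64):
        super().__init__()
        # Original, ONNX/TensorRT-unfriendly version:
        # self.pool = nn.AdaptiveAvgPool2d(1)
        # Replacement: a fixed-kernel average pool gives the same 1x1 output
        # as long as the incoming feature map really is 16x16.
        self.pool = nn.AvgPool2d(kernel_size=16)
        self.fc = nn.Linear(channels, 10)

    def forward(self, x):
        x = self.pool(x)               # (N, C, 1, 1)
        return self.fc(x.flatten(1))   # (N, 10)

# quick check
out = Head()(torch.randn(2, 64, 16, 16))
print(out.shape)  # torch.Size([2, 10])
```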
Here are the links to the operators supported by TensorRT 7.0 and the operators supported by ONNX.
This post mainly covers the route of converting PyTorch to ONNX first. For converting PyTorch directly to TensorRT, see the torch2trt link.
Below is sample code for a model with a single input and a single output.
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# device = "cpu"

def model_converter(model, checkpoint_path, onnx_path):
    # Load the trained weights and switch to inference mode before exporting.
    model.load_state_dict(torch.load(checkpoint_path))
    model = model.to(device)
    model.eval()

    # Dummy input with the same shape as the real model input (1, 6, 512, 512).
    dummy_input = torch.randn(1, 6, 512, 512, device=device)
    input_names = ['input']
    output_names = ['cd_map']
    torch.onnx.export(model, dummy_input, onnx_path,
                      # export_params=True,
                      verbose=True,
                      input_names=input_names,
                      output_names=output_names)

if __name__ == '__main__':
    model = None           # replace with your model instance
    checkpoint_path = ''   # path to the .pth checkpoint
    onnx_path = ''         # where to write the .onnx file
    model_converter(model, checkpoint_path, onnx_path)
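After exporting, it is worth sanity-checking the resulting file before handing it to TensorRT. A minimal sketch using the onnx package (the file name test.onnx is a placeholder for whatever you exported):

```python
import onnx

onnx_model = onnx.load('test.onnx')    # placeholder file name
onnx.checker.check_model(onnx_model)   # raises an exception if the graph is malformed
print(onnx.helper.printable_graph(onnx_model.graph))  # human-readable graph dump
```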
Running this export often raises errors; the fixes for common ones are listed below.
RuntimeError: Failed to export an ONNX attribute 'onnx::Gather', since it's not constant, please try to make things (e.g., kernel size) static if possible
This error mainly occurs when the code calls avg_pool2d and takes the last two values of x.size() as the kernel size. The fix is simple (reference link):
# Original PyTorch code
def forward(self, x):
    batch, channels, height, width = x.size()
    out = F.avg_pool2d(x, kernel_size=[height, width]).view(batch, -1)
    return out
# Modified PyTorch code
def forward(self, x):
    batch, channels, height, width = x.size()
    if torch.is_tensor(height):
        height = height.item()  # fix: convert the traced tensor to a Python int
        width = width.item()    # fix: convert the traced tensor to a Python int
    out = F.avg_pool2d(x, kernel_size=[height, width]).view(batch, -1)
    return out
The detailed API documentation for this function can be found in torch.onnx ‒ PyTorch 1.11.0 documentation. torch.onnx.export is defined in torch.onnx.__init__.py as follows:
def export(model, args, f, export_params=True, verbose=False, training=TrainingMode.EVAL,
           input_names=None, output_names=None, aten=False, export_raw_ir=False,
           operator_export_type=None, opset_version=None, _retain_param_name=True,
           do_constant_folding=True, example_outputs=None, strip_doc_string=True,
           dynamic_axes=None, keep_initializers_as_inputs=None, custom_opsets=None,
           enable_onnx_checker=True, use_external_data_format=False):
The first three required arguments are the model, the model input, and the name of the exported ONNX file; we are already familiar with them. Let's focus on some of the commonly used optional arguments that follow.
export_params: whether to store the model weights in the file. An intermediate representation generally contains two kinds of information, the model structure and the model weights, which can be stored in a single file or in separate files. ONNX stores both the structure and the weights in one file, so for deployment we normally leave this argument as True. If the ONNX file is only used to pass the model between frameworks (say, PyTorch to TensorFlow) rather than for deployment, it can be set to False.
input_names, output_names: set the names of the input and output tensors. If they are not set, simple automatically generated names (such as numbers) are assigned. Every input and output tensor of an ONNX model has a name. Many inference engines expect input data as name–tensor pairs when running an ONNX file and fetch outputs by the names of the output tensors; the names are also needed for tensor-related settings such as adding dynamic axes. In a real deployment pipeline we should always set the input and output tensor names and make sure the same names are used in both ONNX and the inference engine.
opset_version: which ONNX operator-set version to target during conversion; the default is 9. The correspondence between PyTorch and ONNX operators is covered in detail later.
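Putting these arguments together, an export call might look like the following minimal sketch; the model, input shape, and file name are placeholders, not values from this post:

```python
import torch

model = torch.nn.Conv2d(3, 3, 3)            # placeholder model
dummy_input = torch.randn(1, 3, 224, 224)   # placeholder input shape

torch.onnx.export(model, dummy_input, 'example.onnx',
                  export_params=True,    # store the weights in the file
                  input_names=['in'],    # name the input tensor
                  output_names=['out'],  # name the output tensor
                  opset_version=11)      # target ONNX opset 11
```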
dynamic_axes: specifies which dimensions of the input and output tensors are dynamic. For efficiency, ONNX by default treats every tensor involved in the computation as static (its shape never changes). In practice, however, we often want the model input to be dynamic, especially for fully convolutional models that have no inherent shape restriction, so we have to state explicitly which dimensions of the input and output tensors are allowed to vary. Let's look at an example of setting dynamic_axes:
import torch

class Model(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = torch.nn.Conv2d(3, 3, 3)

    def forward(self, x):
        x = self.conv(x)
        return x

model = Model()
dummy_input = torch.rand(1, 3, 10, 10)
model_names = ['model_static.onnx',
               'model_dynamic_0.onnx',
               'model_dynamic_23.onnx']

dynamic_axes_0 = {
    'in': [0],
    'out': [0]
}
dynamic_axes_23 = {
    'in': [2, 3],
    'out': [2, 3]
}

torch.onnx.export(model, dummy_input, model_names[0],
                  input_names=['in'], output_names=['out'])
torch.onnx.export(model, dummy_input, model_names[1],
                  input_names=['in'], output_names=['out'], dynamic_axes=dynamic_axes_0)
torch.onnx.export(model, dummy_input, model_names[2],
                  input_names=['in'], output_names=['out'], dynamic_axes=dynamic_axes_23)
First, we export three ONNX models: one with no dynamic axes, one with axis 0 dynamic, and one with axes 2 and 3 dynamic. In this code we specify the dynamic axes as lists, for example:
dynamic_axes_0 = {
    'in': [0],
    'out': [0]
}
Because ONNX requires every dynamic axis to have a name, writing it this way triggers a UserWarning telling us that axes specified as a list will be assigned names automatically. One way to name the dynamic axes explicitly is:
dynamic_axes_0 = {
    'in': {0: 'batch'},
    'out': {0: 'batch'}
}
Since we do nothing further with the dynamic axes in this code, simply specifying them as lists is enough. Next, let's use the following code to see what the dynamic axes actually do:
import onnxruntime
import numpy as np

origin_tensor = np.random.rand(1, 3, 10, 10).astype(np.float32)
mult_batch_tensor = np.random.rand(2, 3, 10, 10).astype(np.float32)
big_tensor = np.random.rand(1, 3, 20, 20).astype(np.float32)

inputs = [origin_tensor, mult_batch_tensor, big_tensor]
exceptions = dict()

for model_name in model_names:
    for i, input in enumerate(inputs):
        try:
            ort_session = onnxruntime.InferenceSession(model_name)
            ort_inputs = {'in': input}
            ort_session.run(['out'], ort_inputs)
        except Exception as e:
            exceptions[(i, model_name)] = e
            print(f'Input[{i}] on model {model_name} error.')
        else:
            print(f'Input[{i}] on model {model_name} succeed.')
When exporting the computation graph we used a tensor of shape (1, 3, 10, 10). Now we try inputs of shapes (1, 3, 10, 10), (2, 3, 10, 10), and (1, 3, 20, 20), run the three models with ONNX Runtime, see which cases raise errors, and save the corresponding error messages. The output should look like this:
Input[0] on model model_static.onnx succeed.
Input[1] on model model_static.onnx error.
Input[2] on model model_static.onnx error.
Input[0] on model model_dynamic_0.onnx succeed.
Input[1] on model model_dynamic_0.onnx succeed.
Input[2] on model model_dynamic_0.onnx error.
Input[0] on model model_dynamic_23.onnx succeed.
Input[1] on model model_dynamic_23.onnx error.
Input[2] on model model_dynamic_23.onnx succeed.
As we can see, the input with the same shape (1, 3, 10, 10) succeeds on all three models, while inputs with a different batch size (axis 0) or different height/width (axes 2 and 3) only succeed when the corresponding dynamic axes are set. We can also tell from the error messages which dimensions caused the problem. For example, the following code prints the error for input[1] on model_static.onnx:
print(exceptions[(1, 'model_static.onnx')])
# output
# [ONNXRuntimeError] : 2 : INVALID_ARGUMENT : Got invalid dimensions for input: in for the following indices index: 0 Got: 2 Expected: 1 Please fix either the inputs or the model.
This error tells us that axis 0 of the input named in does not match: the model expects a length of 1, but our input has length 2. If we hit a similar error in a real deployment, we can fix it by setting the appropriate dynamic axes.
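For instance, re-exporting the model above with a named dynamic batch axis lets the (2, 3, 10, 10) input run without errors. A minimal sketch continuing the same script (the file name model_dynamic_batch.onnx is arbitrary):

```python
torch.onnx.export(model, dummy_input, 'model_dynamic_batch.onnx',
                  input_names=['in'], output_names=['out'],
                  dynamic_axes={'in': {0: 'batch'}, 'out': {0: 'batch'}})

ort_session = onnxruntime.InferenceSession('model_dynamic_batch.onnx')
out = ort_session.run(['out'], {'in': mult_batch_tensor})[0]
print(out.shape)  # (2, 3, 8, 8): the batch of 2 is now accepted
```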
Straight to the code. The data preprocessing here has to be adapted to your own model's input; in my case the model input is the channel-wise concatenation of the corresponding images from folders A and B, with truncated_linear_stretch applied.
import os
import cv2
import torch
# import tensorrt as trt
import numpy as np
import onnxruntime as rt
from torchvision import transforms as T
from model.res_unet_diff import Res_Unet_diff

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# device = "cpu"

def model_converter(model, checkpoint_path, onnx_path):
    model.load_state_dict(torch.load(checkpoint_path))
    model = model.to(device)
    # model = torch.load(checkpoint_path)
    model.eval()
    dummy_input = torch.randn(1, 6, 512, 512, device=device)
    input_names = ['input']
    output_names = ['cd_map']
    torch.onnx.export(model, dummy_input, onnx_path,
                      # export_params=True,
                      verbose=True,
                      input_names=input_names,
                      output_names=output_names)

def truncated_linear_stretch(image, truncated_value=0.2, max_out=255, min_out=0):
    # Percentile-based linear stretch applied to each channel separately.
    image_stretch = []
    for i in range(image.shape[2]):
        gray = image[:, :, i]
        truncated_down = np.percentile(gray, truncated_value)
        truncated_up = np.percentile(gray, 100 - truncated_value)
        gray = (gray - truncated_down) / (truncated_up - truncated_down) * (max_out - min_out) + min_out
        gray[gray < min_out] = min_out
        gray[gray > max_out] = max_out
        image_stretch.append(gray)
    image_stretch = cv2.merge(image_stretch)
    image_stretch = np.uint8(image_stretch)
    return image_stretch

def cv2tensor(img, mean=None, std=None):
    # Convert an HWC uint8 image into a normalized NCHW float32 array.
    image = img.astype(np.float32) / 255.0
    if mean and std:
        image = (image - mean) / std
    image = image.transpose((2, 0, 1))
    image = image[np.newaxis, :, :, :]
    image = np.array(image, dtype=np.float32)
    return image

def image_process(image1_path, image2_path):
    img1 = cv2.imread(image1_path)
    img1 = cv2.cvtColor(img1, cv2.COLOR_BGR2RGB)
    img2 = cv2.imread(image2_path)
    img2 = cv2.cvtColor(img2, cv2.COLOR_BGR2RGB)
    image_A = truncated_linear_stretch(img1)
    image_B = truncated_linear_stretch(img2)
    # Concatenate the two 3-channel images into one 6-channel input.
    image_A_B = np.concatenate((image_A, image_B), axis=2)
    image = cv2tensor(image_A_B)
    return image

def torch_image_process(image1_path, image2_path):
    img1 = cv2.imread(image1_path)
    img1 = cv2.cvtColor(img1, cv2.COLOR_BGR2RGB)
    img2 = cv2.imread(image2_path)
    img2 = cv2.cvtColor(img2, cv2.COLOR_BGR2RGB)
    image_A = truncated_linear_stretch(img1)
    image_B = truncated_linear_stretch(img2)
    image_A_B = np.concatenate((image_A, image_B), axis=2)
    as_tensor = T.Compose([
        T.ToTensor(),
    ])
    image = as_tensor(image_A_B)
    image = image.unsqueeze(0)
    return image

def onnx_runtime(img1_path, img2_path, onnx_path):
    imgdata = image_process(img1_path, img2_path)
    sess = rt.InferenceSession(onnx_path)
    input_name = sess.get_inputs()[0].name
    output_name = sess.get_outputs()[0].name
    pred_onnx = sess.run([output_name], {input_name: imgdata})
    print("ONNX outputs:")
    print(np.array(pred_onnx)[0, 0, 0, 0, 0])

def torch_runtime(img1_path, img2_path, model, checkpoint_path):
    img = torch_image_process(img1_path, img2_path)
    img = img.to(device)
    model.load_state_dict(torch.load(checkpoint_path))
    model = model.to(device)
    model.eval()
    pred_torch = model(img)
    print("Pytorch outputs:")
    print(pred_torch[0, 0, 0, 0])

if __name__ == '__main__':
    model = None           # replace with your model instance, e.g. Res_Unet_diff(...)
    checkpoint_path = ''   # path to the .pth checkpoint
    onnx_path = ''         # path to the exported .onnx file
    img1_path = 'A/141.tif'
    img2_path = 'B/141.tif'
    # model_converter(model, checkpoint_path, onnx_path)
    onnx_runtime(img1_path, img2_path, onnx_path)
    torch_runtime(img1_path, img2_path, model, checkpoint_path)
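Instead of eyeballing a single value, the two outputs can also be compared element-wise. A minimal sketch, assuming the same preprocessing as above, that model has already been loaded and put in eval mode as in torch_runtime, and that the paths have been filled in:

```python
# Element-wise comparison of ONNX Runtime and PyTorch outputs.
imgdata = image_process(img1_path, img2_path)
sess = rt.InferenceSession(onnx_path)
onnx_out = sess.run(None, {sess.get_inputs()[0].name: imgdata})[0]

with torch.no_grad():
    torch_out = model(torch.from_numpy(imgdata).to(device)).cpu().numpy()

# Exact equality is not expected; small numerical differences are normal.
np.testing.assert_allclose(onnx_out, torch_out, rtol=1e-3, atol=1e-5)
print("ONNX and PyTorch outputs match within tolerance.")
```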
Here is a reference link.
Do not skip this step under any circumstances: I skipped it during installation, and as a result the tensorrt library could not be used at all.
Here we use the trtexec tool that ships with the installed TensorRT-7.0.0.11 (in its bin directory) to do the conversion. Go into the TensorRT-7.0.0.11/bin directory and run:
./trtexec --onnx=test.onnx --saveEngine=test.trt
test.trt is then generated in the bin directory.
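If the ONNX model was exported with dynamic axes, trtexec needs an explicit shape range, and FP16 can be enabled for extra speed. A sketch of such a call; the input name input and the 6×512×512 shapes follow the export example above and must be adjusted to your own model, and depending on the exact 7.x version the --explicitBatch flag may also be required:

```bash
./trtexec --onnx=test.onnx --saveEngine=test_fp16.trt --fp16 \
          --minShapes=input:1x6x512x512 \
          --optShapes=input:1x6x512x512 \
          --maxShapes=input:4x6x512x512
```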
The inference code comes from a reference link; I copied it here as-is. Pay attention to the installed TensorRT version: different versions may require the code to be written slightly differently. The error at the end of this post occurred precisely because the TensorRT version changed from 7.x to 8.x.
import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit
import numpy as np
import time
import cv2

TRT_LOGGER = trt.Logger()

def get_img_np_nchw(image):
    image_cv = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    image_cv = cv2.resize(image_cv, (112, 112))
    mean = np.array([0.485, 0.456, 0.406])
    std = np.array([0.229, 0.224, 0.225])
    img_np = np.array(image_cv, dtype=float) / 255.
    img_np = (img_np - mean) / std
    img_np = img_np.transpose((2, 0, 1))
    img_np_nchw = np.expand_dims(img_np, axis=0)
    return img_np_nchw

class HostDeviceMem(object):
    def __init__(self, host_mem, device_mem):
        super(HostDeviceMem, self).__init__()
        self.host = host_mem
        self.device = device_mem

    def __str__(self):
        return "Host:\n" + str(self.host) + "\nDevice:\n" + str(self.device)

    def __repr__(self):
        return self.__str__()

def allocate_buffers(engine):
    inputs = []
    outputs = []
    bindings = []
    stream = cuda.Stream()  # pycuda stream used for the async copies
    for binding in engine:
        size = trt.volume(engine.get_binding_shape(binding)) * engine.max_batch_size
        dtype = trt.nptype(engine.get_binding_dtype(binding))
        host_mem = cuda.pagelocked_empty(size, dtype)
        device_mem = cuda.mem_alloc(host_mem.nbytes)  # allocate device memory
        bindings.append(int(device_mem))
        if engine.binding_is_input(binding):
            inputs.append(HostDeviceMem(host_mem, device_mem))
        else:
            outputs.append(HostDeviceMem(host_mem, device_mem))
    return inputs, outputs, bindings, stream

def get_engine(engine_file_path=""):
    print("Reading engine from file {}".format(engine_file_path))
    with open(engine_file_path, "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
        return runtime.deserialize_cuda_engine(f.read())

def do_inference(context, bindings, inputs, outputs, stream, batch_size=1):
    [cuda.memcpy_htod_async(inp.device, inp.host, stream) for inp in inputs]  # copy the inputs to the device
    context.execute_async(batch_size=batch_size, bindings=bindings, stream_handle=stream.handle)  # run inference
    [cuda.memcpy_dtoh_async(out.host, out.device, stream) for out in outputs]  # copy the predictions back from the device
    stream.synchronize()  # wait for the stream to finish
    return [out.host for out in outputs]

def postprocess_the_outputs(h_outputs, shape_of_output):
    h_outputs = h_outputs.reshape(*shape_of_output)
    return h_outputs

def landmark_detection(image_path):
    trt_engine_path = './models/landmark_detect_106.trt'
    engine = get_engine(trt_engine_path)
    context = engine.create_execution_context()
    inputs, outputs, bindings, stream = allocate_buffers(engine)

    image = cv2.imread(image_path)
    image = cv2.resize(image, (112, 112))
    img_np_nchw = get_img_np_nchw(image)
    img_np_nchw = img_np_nchw.astype(dtype=np.float32)
    inputs[0].host = img_np_nchw.reshape(-1)

    t1 = time.time()
    trt_outputs = do_inference(context, bindings=bindings, inputs=inputs, outputs=outputs, stream=stream)
    t2 = time.time()
    print('used time: ', t2 - t1)

    shape_of_output = (1, 212)
    landmarks = postprocess_the_outputs(trt_outputs[1], shape_of_output)
    landmarks = landmarks.reshape(landmarks.shape[0], -1, 2)

    height, width = image.shape[:2]
    pred_landmark = landmarks[0] * [height, width]

    for (x, y) in pred_landmark.astype(np.int32):
        cv2.circle(image, (x, y), 1, (0, 255, 255), -1)
    cv2.imshow('landmarks', image)
    cv2.waitKey(0)
    return pred_landmark

if __name__ == '__main__':
    image_path = './images/3766_20190805_12_10.png'
    landmarks = landmark_detection(image_path)
'NoneType' object has no attribute 'create_execution_context'
This is mainly a TensorRT version issue. If you are using an 8.x version, you need to add the following call before the engine is deserialized:
trt.init_libnvinfer_plugins(TRT_LOGGER, '')
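In context, get_engine would then look roughly like the sketch below; the rest of the script stays unchanged:

```python
def get_engine(engine_file_path=""):
    print("Reading engine from file {}".format(engine_file_path))
    with open(engine_file_path, "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
        # Register TensorRT's built-in plugins before deserializing; without
        # this call, deserialize_cuda_engine can return None on TensorRT 8.x.
        trt.init_libnvinfer_plugins(TRT_LOGGER, '')
        return runtime.deserialize_cuda_engine(f.read())
```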