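YOLOv5 ONNX inference example and notes (with a walkthrough of the latest detect.py source)

Preface

I recently exported a YOLOv5 model to ONNX and wanted a script to verify the results, that is, to check whether they differ from inference run directly on the .pt file. The official detect.py can run inference on many model formats, but using it left me feeling that I did not really understand what happens inside, and it pulls in a lot of external libraries. So I decided to write a standalone inference script. Feeding the model an all-zero dummy tensor is easy, of course, but I wanted to run prediction directly on an image and save the result.

If this is too long to read, skip to the final code section at the end. The source code is also on GitHub.

2022/09/07: a CPU version of the inference script was added on GitHub.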

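Preparation and approach

First, onnxruntime must be installed; I use the GPU build (onnxruntime-gpu). You also need torch, torchvision, opencv-python, and the matching CUDA toolkit. I won't go into the installation details here.
The approach is as follows:
1. Preprocess the image and resize it to the input size of the ONNX model.
2. Run inference to get all candidate boxes, then use non_max_suppression to discard boxes whose confidence is too low and boxes that overlap a better box too much, based on the confidence and IoU thresholds.
3. Convert the box coordinates back to the original image size (the image was resized during preprocessing).
4. Draw the boxes and label names on the original image and save it (using OpenCV, with cv2.rectangle and cv2.putText).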

Reading the source

First, let's look at how detect.py does it, and locate the pieces we need for our task.

1. Loading the ONNX model

# detect.py
device = select_device(device)
model = DetectMultiBackend(weights, device=device, dnn=dnn, data=data, fp16=half)
stride, names, pt = model.stride, model.names, model.pt
imgsz = check_img_size(imgsz, s=stride)  # check image size

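Line 1, device, selects whether to run on the CPU or the GPU. Since we have CUDA, we run on CUDA, so device should be set to cuda.
Line 2, DetectMultiBackend, is the class that the latest YOLOv5 source (August 2022) uses to load the various model formats. This differs from older versions: in some 2021 YOLOv5 projects you find online, this step may be a call to attempt_load instead. So we need to look at how it is implemented.
Line 3 reads some basic model information: stride is usually 32, names holds the class label names, and pt says whether inference uses PyTorch .pt weights. Since we use ONNX, pt=False, which matters later.
Line 4 checks that the input image size is a multiple of the stride (32). YOLOv5 requires this at training time; if the size is not a multiple of 32, it is adjusted.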
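The stride check on line 4 essentially rounds the requested size up to the next multiple of the stride. A minimal sketch of that arithmetic, assuming a hypothetical helper name `round_up_to_stride` (this is not the actual check_img_size code):

```python
import math


def round_up_to_stride(size, stride=32):
    """Return the smallest multiple of `stride` that is >= `size`.

    `round_up_to_stride` is a hypothetical helper for illustration only.
    """
    return math.ceil(size / stride) * stride


if __name__ == "__main__":
    for requested in (640, 600, 416, 1000):
        print(requested, "->", round_up_to_stride(requested))
```

Sizes that are already multiples of 32, such as 640 or 416, come back unchanged.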
Now let's take a closer look at DetectMultiBackend:

# common.py
elif onnx:  # ONNX Runtime
     LOGGER.info(f'Loading {w} for ONNX Runtime inference...')
     cuda = torch.cuda.is_available()
     check_requirements(('onnx', 'onnxruntime-gpu' if cuda else 'onnxruntime'))
     import onnxruntime
     providers = ['CUDAExecutionProvider', 'CPUExecutionProvider'] if cuda else ['CPUExecutionProvider']
     session = onnxruntime.InferenceSession(w, providers=providers)
     meta = session.get_modelmeta().custom_metadata_map  # metadata
     if 'stride' in meta:
         stride, names = int(meta['stride']), eval(meta['names'])

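This code lives in common.py under the models folder. The earlier part is long, so I omit it and show only the ONNX branch we care about.
It first checks whether CUDA is available and whether the matching onnxruntime package is installed (onnxruntime-gpu with CUDA, otherwise onnxruntime). If CUDA is usable, 'CUDAExecutionProvider' is used for inference. It also reads stride and names from the model metadata when they are present. In short, this block is what loads the ONNX model.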
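The provider choice can also be made from what the installed onnxruntime build reports, instead of asking torch. The sketch below assumes a hypothetical helper `pick_providers`; `onnxruntime.get_available_providers()` is the only onnxruntime call used, and the import is guarded so the snippet still runs where onnxruntime is missing.

```python
def pick_providers(available, prefer_gpu=True):
    """Order execution providers: CUDA first when present, CPU as fallback.

    `pick_providers` is a hypothetical helper, not part of onnxruntime.
    """
    providers = []
    if prefer_gpu and "CUDAExecutionProvider" in available:
        providers.append("CUDAExecutionProvider")
    providers.append("CPUExecutionProvider")
    return providers


if __name__ == "__main__":
    try:
        import onnxruntime  # only used to query the installed build
        available = onnxruntime.get_available_providers()
    except ImportError:
        available = ["CPUExecutionProvider"]
    print(pick_providers(available))
    # The result can be passed as providers=... to onnxruntime.InferenceSession.
```

Listing the CPU provider last keeps a fallback in place if the CUDA provider cannot be initialized.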

2. Preprocessing the image

# detect.py
if webcam:
    view_img = check_imshow()
    cudnn.benchmark = True  # set True to speed up constant image size inference
    dataset = LoadStreams(source, img_size=imgsz, stride=stride, auto=pt)
    bs = len(dataset)  # batch_size
else:
    dataset = LoadImages(source, img_size=imgsz, stride=stride, auto=pt)
    bs = 1  # batch_size

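Since we only run inference on a single image, the else branch with LoadImages is the one that applies. Two parameters matter here, img_size and auto:
img_size is the input size of the ONNX model. The default export size is (640, 640), and you can change it when exporting the ONNX model.
auto is set from pt, that is, whether a .pt model is used. We use ONNX, so auto should be False.
Next, the implementation of LoadImages (in dataloaders.py under utils):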

# dataloaders.py
else:
    # Read image
    self.count += 1
    img0 = cv2.imread(path)  # BGR
    assert img0 is not None, f'Image Not Found {path}'
    s = f'image {self.count}/{self.nf} {path}: '

# Padded resize
img = letterbox(img0, self.img_size, stride=self.stride, auto=self.auto)[0]

# Convert
img = img.transpose((2, 0, 1))[::-1]  # HWC to CHW, BGR to RGB
img = np.ascontiguousarray(img)

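With a single image we go straight to the else branch. It reads the image with cv2.imread, uses letterbox to resize and pad it to 640x640, transposes the dimensions (HWC to CHW, BGR to RGB), and finally makes the result a contiguous array. So for our own script it is enough to copy the letterbox function (in augmentations.py under utils) with the necessary adjustments, and preprocessing works.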
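To see why auto must be False for a fixed-size ONNX model, it helps to work through the letterbox arithmetic on its own. The sketch below only computes the scale and padding and does no image work; `letterbox_geometry` is a hypothetical helper that follows the same formulas as letterbox.

```python
def letterbox_geometry(shape, new_shape=(640, 640), stride=32, auto=False):
    """Compute the resize and padding that letterbox applies.

    shape and new_shape are (height, width). Returns
    (scale, (resized_w, resized_h), (pad_w, pad_h)), where the padding is the
    total per axis before it is split over the two sides.
    `letterbox_geometry` is a hypothetical helper for illustration.
    """
    r = min(new_shape[0] / shape[0], new_shape[1] / shape[1])
    resized = (int(round(shape[1] * r)), int(round(shape[0] * r)))  # (w, h)
    dw = new_shape[1] - resized[0]
    dh = new_shape[0] - resized[1]
    if auto:  # minimum rectangle: pad only up to the next stride multiple
        dw, dh = dw % stride, dh % stride
    return r, resized, (dw, dh)


if __name__ == "__main__":
    # A 1280x720 (width x height) image
    print(letterbox_geometry((720, 1280)))             # padded to 640x640
    print(letterbox_geometry((720, 1280), auto=True))  # padded to 640x384
```

With auto=False the 640x360 resized image gets 280 rows of padding and ends up at exactly 640x640. With auto=True it gets only 24 rows and ends up at 640x384, which a model exported with a fixed 640x640 input would not accept.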

3. Running inference

# detect.py
im = torch.from_numpy(im).to(device)
im = im.half() if model.fp16 else im.float()  # uint8 to fp16/32
im /= 255  # 0 - 255 to 0.0 - 1.0
if len(im.shape) == 3:
   im = im[None]  # expand for batch dim
t2 = time_sync()
dt[0] += t2 - t1

# Inference
visualize = increment_path(save_dir / Path(path).stem, mkdir=True) if visualize else False
pred = model(im, augment=augment, visualize=visualize)

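Finally, the inference part. The first line turns the array into a tensor for the following steps. On the second line we use im.float(), because the ONNX model here takes float32 input (model.fp16 is False). The next lines scale the pixel values to 0.0 - 1.0 and add a batch dimension, and our own script keeps both; the timing lines (t2, dt) and the visualize line can be skipped. The last line is the actual inference. To see what happens there, we jump to common.py under models: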

# common.py
elif self.onnx:  # ONNX Runtime
     im = im.cpu().numpy()  # torch to numpy
     y = self.session.run([self.session.get_outputs()[0].name], {self.session.get_inputs()[0].name: im})[0]

It is simple, just two statements: convert the tensor back to numpy, then feed it to the ONNX session for inference.
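Since the ONNX session takes a numpy array anyway, the input can also be prepared without a torch round trip. A minimal numpy-only sketch, assuming a hypothetical helper `to_model_input` and a model that expects float32 NCHW input scaled to 0.0 - 1.0:

```python
import numpy as np


def to_model_input(img_bgr_hwc):
    """Turn a letterboxed uint8 BGR image (H, W, 3) into a float32 NCHW batch.

    `to_model_input` is a hypothetical helper name.
    """
    chw_rgb = img_bgr_hwc.transpose((2, 0, 1))[::-1]  # HWC to CHW, BGR to RGB
    x = np.ascontiguousarray(chw_rgb, dtype=np.float32) / 255.0  # 0-255 to 0.0-1.0
    return x[None]  # add the batch dimension -> (1, 3, H, W)


if __name__ == "__main__":
    dummy = np.zeros((640, 640, 3), dtype=np.uint8)
    dummy[..., 0] = 255  # pure blue in BGR order
    batch = to_model_input(dummy)
    print(batch.shape, batch.dtype)
```

The returned array can be passed to session.run in place of the im.cpu().numpy() result.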

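4. Removing redundant boxes

I use yolov5s here, and step 3 ends with an output of shape (1, 25200, 7). You can read this as 25200 candidate boxes, each with four box values (center x, center y, width, height), one objectness confidence, and one score per class (my model has two classes, so 4 + 1 + 2 = 7). For a 640x640 input, 25200 comes from 3 anchors on each cell of the 80x80, 40x40 and 20x20 grids: 3 x (6400 + 1600 + 400). Most of these boxes have to be thrown away, and detect.py does that in a single line: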

# detect.py
pred = non_max_suppression(pred, conf_thres, iou_thres, classes, agnostic_nms, max_det=max_det)

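Here pred is the prediction from step 3. conf_thres defaults to 0.25 and iou_thres to 0.45: boxes with confidence below conf_thres are dropped, and a box whose IoU with a higher-scoring box exceeds iou_thres is suppressed as a duplicate. classes restricts prediction to the classes you specify and defaults to None (no filtering). agnostic_nms defaults to False; when it is True, NMS is class-agnostic, meaning boxes of different classes can suppress each other. max_det defaults to 300 in the function signature and is the maximum number of boxes kept per image.
The function is in general.py under utils. The implementation is fairly involved, so we can remove features we do not need; for example, classes is rarely needed, and the related code can be deleted.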
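The core of the suppression step can be sketched in a few lines of numpy. This is a plain greedy NMS for illustration, not the torchvision.ops.nms implementation that YOLOv5 calls, and it assumes the confidence filtering has already been done; `greedy_nms` is a hypothetical name.

```python
import numpy as np


def greedy_nms(boxes, scores, iou_thres=0.45):
    """Return indices of kept boxes, highest score first.

    boxes: (n, 4) array in (x1, y1, x2, y2) format; scores: (n,) array.
    A plain NumPy sketch of greedy NMS, not torchvision's implementation.
    """
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    order = scores.argsort()[::-1]  # indices sorted by descending score
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        # Intersection of the current best box with every remaining box
        w = np.clip(np.minimum(boxes[i, 2], boxes[rest, 2]) - np.maximum(boxes[i, 0], boxes[rest, 0]), 0, None)
        h = np.clip(np.minimum(boxes[i, 3], boxes[rest, 3]) - np.maximum(boxes[i, 1], boxes[rest, 1]), 0, None)
        inter = w * h
        iou = inter / (areas[i] + areas[rest] - inter + 1e-7)
        order = rest[iou <= iou_thres]  # drop boxes that overlap the best one too much
    return keep


if __name__ == "__main__":
    boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [50, 50, 60, 60]], dtype=np.float32)
    scores = np.array([0.9, 0.8, 0.7], dtype=np.float32)
    print(greedy_nms(boxes, scores))  # the second box overlaps the first (IoU about 0.68) and is dropped
```

Raising iou_thres above that overlap, for example to 0.7, would keep all three boxes.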

# general.py
box = xywh2xyxy(x[:, :4])
iou = box_iou(boxes[i], boxes)

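I single out these two lines because they call other helper functions. The first converts boxes from (center x, center y, width, height) to (x1, y1, x2, y2) corner format, and the second computes the IoU between boxes. xywh2xyxy is also in general.py, while box_iou is in metrics.py under utils.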
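A small worked example makes the two helpers concrete. The scalar functions below are simplified stand-ins with hypothetical names (`xywh_to_xyxy`, `iou`), not the batched tensor versions from YOLOv5:

```python
def xywh_to_xyxy(box):
    """Convert (center x, center y, width, height) to (x1, y1, x2, y2)."""
    cx, cy, w, h = box
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


def iou(a, b):
    """IoU of two boxes given in (x1, y1, x2, y2) format."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


if __name__ == "__main__":
    a = xywh_to_xyxy((50, 50, 40, 20))  # corners (30, 40) and (70, 60)
    b = xywh_to_xyxy((60, 50, 40, 20))  # corners (40, 40) and (80, 60)
    # Overlap is 30 x 20 = 600, union is 800 + 800 - 600 = 1000
    print(a, b, iou(a, b))
```

With an IoU of 0.6, the lower-scoring of these two boxes would be suppressed at the default iou_thres of 0.45.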

5. Annotating and saving the image

# detect.py
for i, det in enumerate(pred):
    det[:, :4] = scale_coords(im.shape[2:], det[:, :4], img0.shape).round()
#initialize annotator
annotator = Annotator(img0, line_width=3)
#annotate the image
for *xyxy, conf, cls in reversed(det):
    c = int(cls)  # integer class
    label = f'{names[c]} {conf:.2f}'
    annotator.box_label(xyxy, label, color=colors(c, True))

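I modified this slightly to make it easier to read.
The first step converts the coordinates in the tensor from step 4 back to the original image size, using scale_coords. Annotation uses the Annotator class: once names (the label list) is defined, Annotator draws the boxes and labels, and the image can then be saved.
scale_coords is in general.py under utils, and Annotator is in plots.py under utils.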
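The coordinate conversion simply undoes the letterbox step: subtract the padding, divide by the scale, and clip to the image. A single-box sketch, assuming a hypothetical helper `unletterbox_box` that mirrors the arithmetic of scale_coords and clip_coords:

```python
def unletterbox_box(box, model_shape, orig_shape):
    """Map an (x1, y1, x2, y2) box from the letterboxed image back to the original.

    model_shape and orig_shape are (height, width). Assumes the padding was
    split evenly over both sides, as letterbox does.
    `unletterbox_box` is a hypothetical helper for illustration.
    """
    gain = min(model_shape[0] / orig_shape[0], model_shape[1] / orig_shape[1])
    pad_x = (model_shape[1] - orig_shape[1] * gain) / 2
    pad_y = (model_shape[0] - orig_shape[0] * gain) / 2

    def clip(value, upper):
        # Keep the coordinate inside [0, upper]
        return min(max(value, 0.0), float(upper))

    x1, y1, x2, y2 = box
    return (
        clip((x1 - pad_x) / gain, orig_shape[1]),
        clip((y1 - pad_y) / gain, orig_shape[0]),
        clip((x2 - pad_x) / gain, orig_shape[1]),
        clip((y2 - pad_y) / gain, orig_shape[0]),
    )


if __name__ == "__main__":
    # 1280x720 original letterboxed to 640x640: gain 0.5, 140 rows of padding top and bottom
    print(unletterbox_box((100, 200, 300, 400), (640, 640), (720, 1280)))
```

For the 1280x720 example, a box at (100, 200, 300, 400) in the 640x640 input maps to (200, 120, 600, 520) in the original image.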

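Summary

Reading the source is still the best way to level up. If you only ever run other people's code, you stay a hyperparameter tweaker. Keep at it!

Final code

After all the reading above and a long debugging session, we end up with the following script (it is a bit long):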

#inference only for onnx
import onnxruntime
import torch
import torchvision
import cv2
import numpy as np
import time
w = 'best.onnx'  # model file name; change as needed
cuda = torch.cuda.is_available()
providers = ['CUDAExecutionProvider', 'CPUExecutionProvider'] if cuda else ['CPUExecutionProvider']
session = onnxruntime.InferenceSession(w, providers=providers)
#warmup to reduce the first inference time but useless in fact.
# t1 = time.time()
# im = torch.zeros((1,3,640,640), dtype=torch.float, device=torch.device('cuda'))
# im = im.cpu().numpy()  # torch to numpy
# y = session.run([session.get_outputs()[0].name], {session.get_inputs()[0].name: im})[0]
# t2 = time.time()
# print(t2-t1)
#preprocess img to array
def letterbox(im, new_shape=(640, 640), color=(114, 114, 114), auto=True, scaleFill=False, scaleup=True, stride=32):
    # Resize and pad image while meeting stride-multiple constraints
    shape = im.shape[:2]  # current shape [height, width]
    if isinstance(new_shape, int):
        new_shape = (new_shape, new_shape)

    # Scale ratio (new / old)
    r = min(new_shape[0] / shape[0], new_shape[1] / shape[1])
    if not scaleup:  # only scale down, do not scale up (for better val mAP)
        r = min(r, 1.0)

    # Compute padding
    ratio = r, r  # width, height ratios
    new_unpad = int(round(shape[1] * r)), int(round(shape[0] * r))
    dw, dh = new_shape[1] - new_unpad[0], new_shape[0] - new_unpad[1]  # wh padding
    if auto:  # minimum rectangle
        dw, dh = np.mod(dw, stride), np.mod(dh, stride)  # wh padding
    elif scaleFill:  # stretch
        dw, dh = 0.0, 0.0
        new_unpad = (new_shape[1], new_shape[0])
        ratio = new_shape[1] / shape[1], new_shape[0] / shape[0]  # width, height ratios

    dw /= 2  # divide padding into 2 sides
    dh /= 2

    if shape[::-1] != new_unpad:  # resize
        im = cv2.resize(im, new_unpad, interpolation=cv2.INTER_LINEAR)
    top, bottom = int(round(dh - 0.1)), int(round(dh + 0.1))
    left, right = int(round(dw - 0.1)), int(round(dw + 0.1))
    im = cv2.copyMakeBorder(im, top, bottom, left, right, cv2.BORDER_CONSTANT, value=color)  # add border
    return im
def xywh2xyxy(x):
    # Convert nx4 boxes from [x, y, w, h] to [x1, y1, x2, y2] where xy1=top-left, xy2=bottom-right
    y = x.clone() if isinstance(x, torch.Tensor) else np.copy(x)
    y[:, 0] = x[:, 0] - x[:, 2] / 2  # top left x
    y[:, 1] = x[:, 1] - x[:, 3] / 2  # top left y
    y[:, 2] = x[:, 0] + x[:, 2] / 2  # bottom right x
    y[:, 3] = x[:, 1] + x[:, 3] / 2  # bottom right y
    return y
def box_area(box):
    # box = xyxy(4,n)
    return (box[2] - box[0]) * (box[3] - box[1])
def box_iou(box1, box2, eps=1e-7):
    # inter(N,M) = (rb(N,M,2) - lt(N,M,2)).clamp(0).prod(2)
    (a1, a2), (b1, b2) = box1[:, None].chunk(2, 2), box2.chunk(2, 1)
    inter = (torch.min(a2, b2) - torch.max(a1, b1)).clamp(0).prod(2)

    # IoU = inter / (area1 + area2 - inter)
    return inter / (box_area(box1.T)[:, None] + box_area(box2.T) - inter + eps)
def non_max_suppression(prediction,
                        conf_thres=0.25,
                        iou_thres=0.45,
                        agnostic=False,
                        max_det=300):
    bs = prediction.shape[0]  # batch size
    xc = prediction[..., 4] > conf_thres  # candidates
    # Settings
    # min_wh = 2  # (pixels) minimum box width and height
    max_wh = 7680  # (pixels) maximum box width and height
    max_nms = 30000  # maximum number of boxes into torchvision.ops.nms()
    redundant = True  # require redundant detections
    merge = False  # use merge-NMS
    output = [torch.zeros((0, 6), device = prediction.device)] * bs
    for xi, x in enumerate(prediction):  # image index, image inference
        # Apply constraints
        # x[((x[..., 2:4] < min_wh) | (x[..., 2:4] > max_wh)).any(1), 4] = 0  # width-height
        x = x[xc[xi]]  # confidence
        # If none remain process next image
        if not x.shape[0]:
            continue

        # Compute conf
        x[:, 5:] *= x[:, 4:5]  # conf = obj_conf * cls_conf

        # Box (center x, center y, width, height) to (x1, y1, x2, y2)
        box = xywh2xyxy(x[:, :4])

        # Detections matrix nx6 (xyxy, conf, cls)
        conf, j = x[:, 5:].max(1, keepdim=True)
        x = torch.cat((box, conf, j.float()), 1)[conf.view(-1) > conf_thres]
        # Apply finite constraint
        # if not torch.isfinite(x).all():
        #     x = x[torch.isfinite(x).all(1)]

        # Check shape
        n = x.shape[0]  # number of boxes
        if not n:  # no boxes
            continue
        elif n > max_nms:  # excess boxes
            x = x[x[:, 4].argsort(descending=True)[:max_nms]]  # sort by confidence

        # Batched NMS
        c = x[:, 5:6] * (0 if agnostic else max_wh)  # classes
        boxes, scores = x[:, :4] + c, x[:, 4]  # boxes (offset by class), scores
        i = torchvision.ops.nms(boxes, scores, iou_thres)  # NMS
        if i.shape[0] > max_det:  # limit detections
            i = i[:max_det]
        if merge and (1 < n < 3E3):  # Merge NMS (boxes merged using weighted mean)
            # update boxes as boxes(i,4) = weights(i,n) * boxes(n,4)
            iou = box_iou(boxes[i], boxes) > iou_thres  # iou matrix
            weights = iou * scores[None]  # box weights
            x[i, :4] = torch.mm(weights, x[:, :4]).float() / weights.sum(1, keepdim=True)  # merged boxes
            if redundant:
                i = i[iou.sum(1) > 1]  # require redundancy

        output[xi] = x[i]
    return output
def scale_coords(img1_shape, coords, img0_shape, ratio_pad=None):
    # Rescale coords (xyxy) from img1_shape to img0_shape
    if ratio_pad is None:  # calculate from img0_shape
        gain = min(img1_shape[0] / img0_shape[0], img1_shape[1] / img0_shape[1])  # gain  = old / new
        pad = (img1_shape[1] - img0_shape[1] * gain) / 2, (img1_shape[0] - img0_shape[0] * gain) / 2  # wh padding
    else:
        gain = ratio_pad[0][0]
        pad = ratio_pad[1]

    coords[:, [0, 2]] -= pad[0]  # x padding
    coords[:, [1, 3]] -= pad[1]  # y padding
    coords[:, :4] /= gain
    clip_coords(coords, img0_shape)
    return coords
def clip_coords(boxes, shape):
    # Clip bounding xyxy bounding boxes to image shape (height, width)
    if isinstance(boxes, torch.Tensor):  # faster individually
        boxes[:, 0].clamp_(0, shape[1])  # x1
        boxes[:, 1].clamp_(0, shape[0])  # y1
        boxes[:, 2].clamp_(0, shape[1])  # x2
        boxes[:, 3].clamp_(0, shape[0])  # y2
    else:  # np.array (faster grouped)
        boxes[:, [0, 2]] = boxes[:, [0, 2]].clip(0, shape[1])  # x1, x2
        boxes[:, [1, 3]] = boxes[:, [1, 3]].clip(0, shape[0])  # y1, y2
class Annotator:
    def __init__(self, im, line_width=None):
        assert im.data.contiguous, 'Image not contiguous. Apply np.ascontiguousarray(im) to Annotator() input images.'
        self.im = im
        self.lw = line_width or max(round(sum(im.shape) / 2 * 0.003), 2)  # line width

    def box_label(self, box, label='', color=(128, 128, 128), txt_color=(255, 255, 255)):
        # Add one xyxy box to image with label
        p1, p2 = (int(box[0]), int(box[1])), (int(box[2]), int(box[3]))
        cv2.rectangle(self.im, p1, p2, color, thickness=self.lw, lineType=cv2.LINE_AA)
        if label:
            tf = max(self.lw - 1, 1)  # font thickness
            w, h = cv2.getTextSize(label, 0, fontScale=self.lw / 3, thickness=tf)[0]  # text width, height
            outside = p1[1] - h >= 3
            p2 = p1[0] + w, p1[1] - h - 3 if outside else p1[1] + h + 3
            cv2.rectangle(self.im, p1, p2, color, -1, cv2.LINE_AA)  # filled
            cv2.putText(self.im,
                        label, (p1[0], p1[1] - 2 if outside else p1[1] + h + 2),
                        0,
                        self.lw / 3,
                        txt_color,
                        thickness=tf,
                        lineType=cv2.LINE_AA)

    # The PIL-only rectangle() and text() helpers from utils/plots.py are left out:
    # they need self.draw and self.font, which this OpenCV-only class never creates.

    def result(self):
        # Return annotated image as array
        return np.asarray(self.im)
class Colors:
    def __init__(self):
        # hex = matplotlib.colors.TABLEAU_COLORS.values()
        hexs = ('FF3838', 'FF9D97', 'FF701F', 'FFB21D', 'CFD231', '48F90A', '92CC17', '3DDB86', '1A9334', '00D4BB',
                '2C99A8', '00C2FF', '344593', '6473FF', '0018EC', '8438FF', '520085', 'CB38FF', 'FF95C8', 'FF37C7')
        self.palette = [self.hex2rgb(f'#{c}') for c in hexs]
        self.n = len(self.palette)

    def __call__(self, i, bgr=False):
        c = self.palette[int(i) % self.n]
        return (c[2], c[1], c[0]) if bgr else c

    @staticmethod
    def hex2rgb(h):  # rgb order (PIL)
        return tuple(int(h[1 + i:1 + i + 2], 16) for i in (0, 2, 4))


colors = Colors()  # create instance for 'from utils.plots import colors'
img0 = cv2.imread('test.png')  # input image; change the file name as needed
assert img0 is not None, 'Image not found: test.png'
img = letterbox(img0, (640, 640), stride=32, auto=False)  # only .pt models use auto=True; ONNX needs the fixed export size
img = img.transpose((2, 0, 1))[::-1]  # HWC to CHW, BGR to RGB
img = np.ascontiguousarray(img)
device = torch.device('cuda' if cuda else 'cpu')  # fall back to CPU when CUDA is unavailable
im = torch.from_numpy(img).to(device)
im = im.float()
im /= 255  # 0 - 255 to 0.0 - 1.0
if len(im.shape) == 3:
    im = im[None]  # expand for batch dim
im = im.cpu().numpy()  # torch to numpy
y = session.run([session.get_outputs()[0].name], {session.get_inputs()[0].name: im})[0]  # run the ONNX model to get the raw output
# non_max_suppression to remove redundant boxes
y = torch.from_numpy(y).to(device)
pred = non_max_suppression(y, conf_thres=0.25, iou_thres=0.45, agnostic=False, max_det=1000)
# transform coordinates back to the original picture size
for i, det in enumerate(pred):
    det[:, :4] = scale_coords(im.shape[2:], det[:, :4], img0.shape).round()
print(det)
# class labels; change to match your own model
names = ['nofall', 'fall']
# initialize annotator
annotator = Annotator(img0, line_width=3)
# annotate the image
for *xyxy, conf, cls in reversed(det):
    c = int(cls)  # integer class
    label = f'{names[c]} {conf:.2f}'
    annotator.box_label(xyxy, label, color=colors(c, True))
# output file name; change as needed (as written, this overwrites the input image)
cv2.imwrite('test.png', img0)

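A note on the warmup part: YOLOv5's detect.py first feeds the model an all-zero dummy tensor to warm it up, so that the first real prediction has lower latency. In my test, the dummy pass took 0.73 s and the real prediction took 0.01 s.
Without warmup, however, the first prediction simply takes about the sum of the two, and every prediction from the second one on still takes around 0.01 s. So warmup did not seem useful in my scenario, and I commented it out.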
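To repeat this kind of measurement, it is enough to time several consecutive calls and compare the first with the rest. A minimal sketch, assuming a hypothetical helper `time_calls`; the workload here is a stand-in, and with a real model fn would wrap session.run on a fixed input:

```python
import time


def time_calls(fn, n=3):
    """Call fn() n times and return the duration of each call in seconds.

    `time_calls` is a hypothetical helper for illustration.
    """
    durations = []
    for _ in range(n):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations


if __name__ == "__main__":
    # Stand-in workload; replace the lambda with a call that runs the model.
    durations = time_calls(lambda: sum(range(100_000)), n=3)
    print([f"{d:.4f}s" for d in durations])
```

If the first duration is much larger than the later ones, that gap is the one-time cost that warmup moves out of the first real prediction.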

Original article: https://blog.csdn.net/weixin_43945848/article/details/126503453