• SlowFast


    A real gem, covering the whole pipeline from model conversion to deployment. Let me walk through it step by step.

    In recent years, research on deep-learning-based human action recognition has grown rapidly. The SlowFast model, with its slow and fast pathway design, performs remarkably well on action recognition datasets. This article covers SlowFast data preparation and training, ONNX inference, and, in particular detail, TensorRT inference, person tracking with YOLOv5 and DeepSort, and C++ deployment.

    1. Data preparation

    1.1 Trimming the videos

    Prepare several groups of video data. IN_DATA_DIR is the directory holding the raw videos, and OUT_DATA_DIR is the directory for the trimmed clips. This step ensures that all videos have the same length.

    IN_DATA_DIR="/project/train/src_repo/data/video"
    OUT_DATA_DIR="/project/train/src_repo/data/splitvideo"
    str="_"
    if [[ ! -d "${OUT_DATA_DIR}" ]]; then
      echo "${OUT_DATA_DIR} doesn't exist. Creating it.";
      mkdir -p ${OUT_DATA_DIR}
    fi
    for video in $(ls -A1 -U ${IN_DATA_DIR}/*)
    do
      for i in {0..10}
      do
        index=$(expr $i \* 10)
        out_name="${OUT_DATA_DIR}/${i}${str}${video##*/}"
        if [ ! -f "${out_name}" ]; then
          ffmpeg -ss ${index} -t 80 -i "${video}" "${out_name}"
        fi
      done
    done

    1.2 Extracting keyframes

    One keyframe is extracted from each second of video. IN_DATA_DIR is the directory of clips produced in the previous step, and OUT_DATA_DIR is where the extracted keyframes are stored.

    # extract frames, one per second

    IN_DATA_DIR="/project/train/src_repo/data/splitvideo/"
    OUT_DATA_DIR="/project/train/src_repo/data/splitimages/"

    if [[ ! -d "${OUT_DATA_DIR}" ]]; then
      echo "${OUT_DATA_DIR} doesn't exist. Creating it.";
      mkdir -p ${OUT_DATA_DIR}
    fi

    for video in $(ls -A1 -U ${IN_DATA_DIR}/*)
    do
      video_name=${video##*/}

      if [[ $video_name = *".webm" ]]; then
        video_name=${video_name::-5}
      else
        video_name=${video_name::-4}
      fi

      out_video_dir=${OUT_DATA_DIR}/${video_name}/
      mkdir -p "${out_video_dir}"

      out_name="${out_video_dir}/${video_name}_%06d.jpg"

      ffmpeg -i "${video}" -r 1 -q:v 1 "${out_name}"
    done

    1.3 Splitting videos into frames

    Split the videos generated in step one into frames with ffmpeg, at 30 frames per second. IN_DATA_DIR is the directory holding the videos, and OUT_DATA_DIR is the output directory.

    IN_DATA_DIR="/project/train/src_repo/video"
    OUT_DATA_DIR="/project/train/src_repo/spiltvideo"
    if [[ ! -d "${OUT_DATA_DIR}" ]]; then
      echo "${OUT_DATA_DIR} doesn't exist. Creating it.";
      mkdir -p ${OUT_DATA_DIR}
    fi
    for video in $(ls -A1 -U ${IN_DATA_DIR}/*)
    do
      out_name="${OUT_DATA_DIR}/${video##*/}"
      if [ ! -f "${out_name}" ]; then
        ffmpeg -ss 0 -t 100 -i "${video}" "${out_name}"
      fi
    done

    1.4 Directory layout

    ava  # top-level folder holding the video annotations
    —person_box_67091280_iou90  # second-level folder holding the person-detection files
    ——ava_detection_train_boxes_and_labels_include_negative_v2.2.csv  # detection boxes and labels used for training
    ——ava_detection_val_boxes_and_labels.csv  # detection boxes and labels used for validation
    —ava_action_list_v2.2_for_activitynet_2019.pbtxt  # action label definitions
    —ava_val_excluded_timestamps_v2.2.csv  # frames without people; these are discarded during training
    —ava_train_v2.2.csv  # keyframe annotations for training
    —ava_val_v2.2.csv  # keyframe annotations for validation

    frame_lists  # top-level folder holding the paths of the frames generated in 1.3
    —train.csv
    —val.csv

    frames  # top-level folder holding the frames generated in 1.3
    —A
    ——A_000001.jpg
    ——A_000002.jpg
    ...
    ——A_000090.jpg
    —B
    ——B_000001.jpg
    ——B_000002.jpg
    ...
    ——B_000090.jpg

    2. Environment setup

    2.1 Python dependencies

    pip install iopath
    pip install fvcore
    pip install simplejson
    pip install pytorchvideo

    2.2 Installing detectron2

    !python -m pip install pyyaml==5.1
    import sys, os, distutils.core
    # Note: This is a faster way to install detectron2 in Colab, but it does not include all functionalities.
    # See https://detectron2.readthedocs.io/tutorials/install.html for full installation instructions
    !git clone 'https://github.com/facebookresearch/detectron2'
    dist = distutils.core.run_setup("./detectron2/setup.py")
    !python -m pip install {' '.join([f"'{x}'" for x in dist.install_requires])}
    sys.path.insert(0, os.path.abspath('./detectron2'))
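    A quick import check (run in the same session) confirms that the clone is picked up from sys.path:

    import detectron2
    print(detectron2.__version__)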

    3. SlowFast training

    3.1 Training

    python tools/run_net.py --cfg configs/AVA/SLOWFAST_32x2_R50_SHORT.yaml

    SLOWFAST_32x2_R50_SHORT.yaml

    TRAIN:
  ENABLE: True
      DATASET: ava
      BATCH_SIZE: 8 #64
      EVAL_PERIOD: 5
      CHECKPOINT_PERIOD: 1
      AUTO_RESUME: True
  CHECKPOINT_FILE_PATH: '/content/SLOWFAST_32x2_R101_50_50.pkl'  # path to the pretrained model
      CHECKPOINT_TYPE: pytorch
    DATA:
      NUM_FRAMES: 32
      SAMPLING_RATE: 2
      TRAIN_JITTER_SCALES: [256, 320]
      TRAIN_CROP_SIZE: 224
      TEST_CROP_SIZE: 224
      INPUT_CHANNEL_NUM: [3, 3]
      PATH_TO_DATA_DIR: '/content/ava'
    DETECTION:
      ENABLE: True
      ALIGNED: True
    AVA:
  FRAME_DIR: '/content/ava/frames'   # directory generated during data preparation
      FRAME_LIST_DIR: '/content/ava/frame_lists'
      ANNOTATION_DIR: '/content/ava/annotations'
      DETECTION_SCORE_THRESH: 0.5
      FULL_TEST_ON_VAL: True
      TRAIN_PREDICT_BOX_LISTS: [
        "ava_train_v2.2.csv",
        "person_box_67091280_iou90/ava_detection_train_boxes_and_labels_include_negative_v2.2.csv",
      ]
      TEST_PREDICT_BOX_LISTS: [
        "person_box_67091280_iou90/ava_detection_val_boxes_and_labels.csv"]
      
     
    SLOWFAST:
      ALPHA: 4
      BETA_INV: 8
      FUSION_CONV_CHANNEL_RATIO: 2
      FUSION_KERNEL_SZ: 7
    RESNET:
      ZERO_INIT_FINAL_BN: True
      WIDTH_PER_GROUP: 64
      NUM_GROUPS: 1
      DEPTH: 50
      TRANS_FUNC: bottleneck_transform
      STRIDE_1X1: False
      NUM_BLOCK_TEMP_KERNEL: [[3, 3], [4, 4], [6, 6], [3, 3]]
      SPATIAL_DILATIONS: [[1, 1], [1, 1], [1, 1], [2, 2]]
      SPATIAL_STRIDES: [[1, 1], [2, 2], [2, 2], [1, 1]]
    NONLOCAL:
      LOCATION: [[[], []], [[], []], [[], []], [[], []]]
      GROUP: [[1, 1], [1, 1], [1, 1], [1, 1]]
      INSTANTIATION: dot_product
      POOL: [[[1, 2, 2], [1, 2, 2]], [[1, 2, 2], [1, 2, 2]], [[1, 2, 2], [1, 2, 2]], [[1, 2, 2], [1, 2, 2]]]
    BN:
      USE_PRECISE_STATS: False
      NUM_BATCHES_PRECISE: 20
    SOLVER:
      BASE_LR: 0.1
      LR_POLICY: steps_with_relative_lrs
      STEPS: [0, 10, 15, 20]
      LRS: [1, 0.1, 0.01, 0.001]
      MAX_EPOCH: 20
      MOMENTUM: 0.9
      WEIGHT_DECAY: 1e-7
      WARMUP_EPOCHS: 5.0
      WARMUP_START_LR: 0.000125
      OPTIMIZING_METHOD: sgd
    MODEL:
      NUM_CLASSES: 1
      ARCH: slowfast
      MODEL_NAME: SlowFast
      LOSS_FUNC: bce
      DROPOUT_RATE: 0.5
      HEAD_ACT: sigmoid
    TEST:
      ENABLE: False
      DATASET: ava
      BATCH_SIZE: 8
    DATA_LOADER:
      NUM_WORKERS: 0
      PIN_MEMORY: True
    NUM_GPUS: 1
    NUM_SHARDS: 1
    RNG_SEED: 0
    OUTPUT_DIR: .

    3.2 Common training errors

    1. Change AVA_VALID_FRAMES in slowfast/datasets/ava_helper.py to match the length of your videos, for example:
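    A sketch of the change (the commented value is the upstream default; the replacement range is just an illustration for 100-second clips whose keyframes are numbered from 0):

    # slowfast/datasets/ava_helper.py
    # upstream default: AVA annotates seconds 902-1798 of each movie
    # AVA_VALID_FRAMES = range(902, 1799)
    AVA_VALID_FRAMES = range(0, 100)  # match your own clip length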

    2. pytorchvideo.layers.distributed import error

    from pytorchvideo.layers.distributed import ( # noqa
    ImportError: cannot import name 'cat_all_gather' from 'pytorchvideo.layers.distributed' 
    (/site-packages/pytorchvideo/layers/distributed.py)

    3. pytorchvideo.losses import error

    File "SlowFast/slowfast/models/losses.py", line 11, in
    from pytorchvideo.losses.soft_target_cross_entropy import (
    ModuleNotFoundError: No module named 'pytorchvideo.losses'

    Errors 2 and 3 can be resolved by following reference link one; a minimal local workaround for error 3 is also sketched below.
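    For error 3 in particular, if upgrading pytorchvideo is not an option, a minimal stand-in loss can be placed in slowfast/models/losses.py in place of the failing import. This is a sketch under the assumption that only the soft-target cross-entropy itself is needed:

    import torch
    import torch.nn.functional as F

    class SoftTargetCrossEntropyLoss(torch.nn.Module):
        """Minimal replacement: cross-entropy against soft (or one-hot) targets."""
        def __init__(self, reduction="mean"):
            super().__init__()
            self.reduction = reduction

        def forward(self, x, y):
            loss = torch.sum(-y * F.log_softmax(x, dim=-1), dim=-1)
            if self.reduction == "mean":
                return loss.mean()
            if self.reduction == "sum":
                return loss.sum()
            return loss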

    4. SlowFast inference

    Option 1: run inference with the official demo script

    python tools/run_net.py --cfg demo/AVA/SLOWFAST_32x2_R101_50_50.yaml

    Option 2: because of detectron2 installation issues and a series of deployment problems that follow from it, you can instead combine YOLOv5 with SlowFast for inference.

    First, let's look at the SlowFast inference pipeline.

    Step 1: read 64 consecutive frames, checking that a full 64 frames were read

    while was_read:
        frames = []
        seq_length = 64
        while was_read and len(frames) < seq_length:
            was_read, frame = cap.read()
            frames.append(frame)

    Step 2: person detection with YOLOv5

    1. YOLOv5 inference code. Change the sys.path.insert path and the weights path to match your setup.

    import argparse
    import os
    import platform
    import shutil
    import time
    from pathlib import Path
    import sys
    import json
    sys.path.insert(1, '/content/drive/MyDrive/yolov5/')
    import cv2
    import torch
    import torch.backends.cudnn as cudnn
    import numpy as np
    from numpy import random
    from models.common import DetectMultiBackend
    from utils.augmentations import letterbox
    from utils.general import check_img_size, non_max_suppression, scale_coords, set_logging
    from utils.torch_utils import select_device
    # ####### parameter settings
    conf_thres = 0.6
    iou_thres = 0.5
    #######
    imgsz = 640
    weights = "/content/yolov5l.pt"
    device = '0'
    stride = 32
    names = ["person"]
    def init():
        # Initialize
        global imgsz, device, stride
        set_logging()
        device = select_device('0')
        half = device.type != 'cpu'  # half precision only supported on CUDA
        model = DetectMultiBackend(weights, device=device, dnn=False)
        stride, pt, jit, engine = model.stride, model.pt, model.jit, model.engine
        imgsz = check_img_size(imgsz, s=stride)  # check img_size
        model.half()  # to FP16
        model.eval()
        return model
    def process_image(model, input_image=None, args=None, **kwargs):
        img0 = input_image
        img = letterbox(img0, new_shape=imgsz, stride=stride, auto=True)[0]
        img = img.transpose((2, 0, 1))[::-1]  # HWC to CHW, BGR to RGB
        img = np.ascontiguousarray(img)
        img = torch.from_numpy(img).to(device)
        img = img.half()
        img /= 255.0  # 0 - 255 to 0.0 - 1.0
        if len(img.shape) == 3:
            img = img[None]
        pred = model(img, augment=False, val=True)[0]
        pred = non_max_suppression(pred, conf_thres, iou_thres, agnostic=False)
        result = []
        for i, det in enumerate(pred):  # detections per image
            gn = torch.tensor(img0.shape)[[1, 0, 1, 0]]  # normalization gain whwh
            if det is not None and len(det):
                # Rescale boxes from img_size to im0 size
                det[:, :4] = scale_coords(img.shape[2:], det[:, :4], img0.shape).round()
                for *xyxy, conf, cls in det:
                    if cls == 0:  # class 0 = person
                        result.append([float(xyxy[0]), float(xyxy[1]), float(xyxy[2]), float(xyxy[3])])
        if len(result) == 0:
            return None
        return torch.from_numpy(np.array(result))
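    A quick usage sketch (the image path is a placeholder): init() loads the detector once, and process_image() returns one [x1, y1, x2, y2] row per detected person, or None:

    model = init()
    frame = cv2.imread("/content/sample.jpg")  # placeholder test image
    boxes = process_image(model, frame)
    print(boxes)  # tensor of person boxes in pixel coordinates, or None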

    2. bbox preprocessing

    import math

    def scale_boxes(size, boxes, height, width):
        """
        Scale the short side of the box to size.
        Args:
            size (int): size to scale the image.
            boxes (ndarray): bounding boxes to perform scale. The dimension is
            `num boxes` x 4.
            height (int): the height of the image.
            width (int): the width of the image.
        Returns:
            boxes (ndarray): scaled bounding boxes.
        """
        if (width <= height and width == size) or (
            height <= width and height == size
        ):
            return boxes
        new_width = size
        new_height = size
        if width < height:
            new_height = int(math.floor((float(height) / width) * size))
            boxes *= float(new_height) / height
        else:
            new_width = int(math.floor((float(width) / height) * size))
            boxes *= float(new_width) / width
        return boxes

    Step 3: image preprocessing

    1. Resize the image

    def scale(size, image):
        """
        Scale the short side of the image to size.
        Args:
            size (int): size to scale the image.
            image (array): image to perform short side scale. Dimension is
                `height` x `width` x `channel`.
        Returns:
            (ndarray): the scaled image with dimension of
                `height` x `width` x `channel`.
        """
        height = image.shape[0]
        width = image.shape[1]
        if (width <= height and width == size) or (
            height <= width and height == size
        ):
            return image
        new_width = size
        new_height = size
        if width < height:
            new_height = int(math.floor((float(height) / width) * size))
        else:
            new_width = int(math.floor((float(width) / height) * size))
        img = cv2.resize(
            image, (new_width, new_height), interpolation=cv2.INTER_LINEAR
        )
        return img.astype(np.float32)

    2. Normalization

    def tensor_normalize(tensor, mean, std, func=None):
        """
        Normalize a given tensor by subtracting the mean and dividing the std.
        Args:
            tensor (tensor): tensor to normalize.
            mean (tensor or list): mean value to subtract.
            std (tensor or list): std to divide.
        """
        if tensor.dtype == torch.uint8:
            tensor = tensor.float()
            tensor = tensor / 255.0
        if type(mean) == list:
            mean = torch.tensor(mean)
        if type(std) == list:
            std = torch.tensor(std)
        if func is not None:
            tensor = func(tensor)
        tensor = tensor - mean
        tensor = tensor / std
        return tensor

    3. Build the slow and fast pathway inputs

    The idea is to sample 32 of the 64 frames as the fast pathway input, then sample 8 of those 32 as the slow pathway input, and permute T H W C -> C T H W. The final fast_pathway tensor is therefore (b, 3, 32, h, w) and slow_pathway is (b, 3, 8, h, w).

    def process_cv2_inputs(frames):
        """
        Normalize and prepare inputs as a list of tensors. Each tensor
        correspond to a unique pathway.
        Args:
            frames (list of array): list of input images (correspond to one clip) in range [0, 255].
        """
        inputs = torch.from_numpy(np.array(frames)).float() / 255
        inputs = tensor_normalize(inputs, [0.45, 0.45, 0.45], [0.225, 0.225, 0.225])
        # T H W C -> C T H W.
        inputs = inputs.permute(3, 0, 1, 2)
        # Sample 32 frames for the fast pathway.
        index = torch.linspace(0, inputs.shape[1] - 1, 32).long()
        inputs = torch.index_select(inputs, 1, index)
        fast_pathway = inputs
        # Take every fourth fast frame (8 in total) for the slow pathway.
        slow_pathway = torch.index_select(
                inputs,
                1,
                torch.linspace(
                    0, inputs.shape[1] - 1, inputs.shape[1] // 4
                ).long(),
            )
        frame_list = [slow_pathway, fast_pathway]
        inputs = [inp.unsqueeze(0) for inp in frame_list]
        return inputs
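    A quick shape check of the function above, with dummy frames standing in for a real 64-frame clip:

    frames = [np.zeros((256, 455, 3), dtype=np.uint8) for _ in range(64)]
    slow, fast = process_cv2_inputs(frames)
    print(slow.shape)  # torch.Size([1, 3, 8, 256, 455])
    print(fast.shape)  # torch.Size([1, 3, 32, 256, 455])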

    5. SlowFast ONNX inference

    5.1 Exporting the ONNX file

    import os
    import sys
    from collections import OrderedDict
    import torch
    import argparse
    work_root = os.path.split(os.path.realpath(__file__))[0]
    from slowfast.config.defaults import get_cfg
    import slowfast.utils.checkpoint as cu
    from slowfast.models import build_model
    def parser_args():
        parser = argparse.ArgumentParser()
        parser.add_argument(
            "--cfg",
            dest="cfg_file",
            type=str,
            default=os.path.join(
                work_root, "/content/drive/MyDrive/SlowFast/demo/AVA/SLOWFAST_32x2_R101_50_50.yaml"),
            help="Path to the config file",
        )
        parser.add_argument(
            '--half',
            type=bool,
            default=False,
            help='use half mode',
        )
        parser.add_argument(
            '--checkpoint',
            type=str,
            default=os.path.join(work_root,
                                 "/content/SLOWFAST_32x2_R101_50_50.pkl"),
            help='test model file path',
        )
        parser.add_argument(
            '--save',
            type=str,
            default=os.path.join(work_root, "/content/SLOWFAST_head.onnx"),
            help='save model file path',
        )
        return parser.parse_args()
    def main():
        args = parser_args()
        print(args)
        cfg_file = args.cfg_file
        checkpoint_file = args.checkpoint
        save_checkpoint_file = args.save
        half_flag = args.half
        cfg = get_cfg()
        cfg.merge_from_file(cfg_file)
        cfg.TEST.CHECKPOINT_FILE_PATH = checkpoint_file
        print(cfg.DATA)
        print("export pytorch model to onnx!\n")
        device = "cuda:0"
        with torch.no_grad():
            model = build_model(cfg)
            model = model.to(device)
            model.eval()
            cu.load_test_checkpoint(cfg, model)
            if half_flag:
                model.half()
            fast_pathway = torch.randn(1, 3, 32, 256, 455)
            slow_pathway = torch.randn(1, 3, 8, 256, 455)
            bbox = torch.randn(32, 5).to(device)
            fast_pathway = fast_pathway.to(device)
            slow_pathway = slow_pathway.to(device)
            inputs = [slow_pathway, fast_pathway]
            for p in model.parameters():
                p.requires_grad = False
            torch.onnx.export(model, (inputs, bbox), save_checkpoint_file,
                              input_names=['slow_pathway', 'fast_pathway', 'bbox'],
                              output_names=['output'], opset_version=12)
            onnx_check()
    def onnx_check():
        import onnx
        args = parser_args()
        print(args)
        onnx_model_path = args.save
        model = onnx.load(onnx_model_path)
        onnx.checker.check_model(model)
    if __name__ == '__main__':
        main()
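    Before moving on, it is worth sanity-checking the exported file with onnxruntime. The input names and shapes below follow the export script above; the dummy boxes keep batch index 0 so roi_align stays in range:

    import numpy as np
    import onnxruntime

    sess = onnxruntime.InferenceSession("/content/SLOWFAST_head.onnx")
    print([i.name for i in sess.get_inputs()], [o.name for o in sess.get_outputs()])
    bbox = np.zeros((32, 5), dtype=np.float32)
    bbox[:, 1:] = [10, 10, 100, 200]  # one identical dummy box per row
    outputs = sess.run(None, {
        "slow_pathway": np.random.randn(1, 3, 8, 256, 455).astype(np.float32),
        "fast_pathway": np.random.randn(1, 3, 32, 256, 455).astype(np.float32),
        "bbox": bbox,
    })
    print(outputs[0].shape)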

    5.2 ONNX inference

    import torch
    import math
    import onnxruntime
    from torchvision.ops import roi_align
    import argparse
    import os
    import platform
    import shutil
    import time
    from pathlib import Path
    import sys
    import json
    sys.path.insert(1, '/content/drive/MyDrive/yolov5/')
    import cv2
    import torch.backends.cudnn as cudnn
    import numpy as np
    from numpy import random
    from models.common import DetectMultiBackend
    from utils.augmentations import letterbox
    from utils.general import check_img_size, non_max_suppression, scale_coords, set_logging
    from utils.torch_utils import select_device
    # ####### parameter settings
    conf_thres = 0.6
    iou_thres = 0.5
    #######
    imgsz = 640
    weights = "/content/yolov5l.pt"
    device = '0'
    stride = 32
    names = ["person"]
    def init():
        # Initialize
        global imgsz, device, stride
        set_logging()
        device = select_device('0')
        half = device.type != 'cpu'  # half precision only supported on CUDA
        model = DetectMultiBackend(weights, device=device, dnn=False)
        stride, pt, jit, engine = model.stride, model.pt, model.jit, model.engine
        imgsz = check_img_size(imgsz, s=stride)  # check img_size
        model.half()  # to FP16
        model.eval()
        return model
    def process_image(model, input_image=None, args=None, **kwargs):
        img0 = input_image
        img = letterbox(img0, new_shape=imgsz, stride=stride, auto=True)[0]
        img = img.transpose((2, 0, 1))[::-1]  # HWC to CHW, BGR to RGB
        img = np.ascontiguousarray(img)
        img = torch.from_numpy(img).to(device)
        img = img.half()
        img /= 255.0  # 0 - 255 to 0.0 - 1.0
        if len(img.shape) == 3:
            img = img[None]
        pred = model(img, augment=False, val=True)[0]
        pred = non_max_suppression(pred, conf_thres, iou_thres, agnostic=False)
        result = []
        for i, det in enumerate(pred):  # detections per image
            gn = torch.tensor(img0.shape)[[1, 0, 1, 0]]  # normalization gain whwh
            if det is not None and len(det):
                # Rescale boxes from img_size to im0 size
                det[:, :4] = scale_coords(img.shape[2:], det[:, :4], img0.shape).round()
                for *xyxy, conf, cls in det:
                    if cls == 0:  # class 0 = person
                        result.append([float(xyxy[0]), float(xyxy[1]), float(xyxy[2]), float(xyxy[3])])
        if len(result) == 0:
            return None
        # pad to 32 boxes: the exported model expects a fixed (32, 5) bbox input
        for i in range(32 - len(result)):
            result.append([float(0), float(0), float(0), float(0)])
        return torch.from_numpy(np.array(result))
    def scale(size, image):
        """
        Scale the short side of the image to size.
        Args:
            size (int): size to scale the image.
            image (array): image to perform short side scale. Dimension is
                `height` x `width` x `channel`.
        Returns:
            (ndarray): the scaled image with dimension of
                `height` x `width` x `channel`.
        """
        height = image.shape[0]
        width = image.shape[1]
        if (width <= height and width == size) or (
            height <= width and height == size
        ):
            return image
        new_width = size
        new_height = size
        if width < height:
            new_height = int(math.floor((float(height) / width) * size))
        else:
            new_width = int(math.floor((float(width) / height) * size))
        img = cv2.resize(
            image, (new_width, new_height), interpolation=cv2.INTER_LINEAR
        )
        return img.astype(np.float32)
    def tensor_normalize(tensor, mean, std, func=None):
        """
        Normalize a given tensor by subtracting the mean and dividing the std.
        Args:
            tensor (tensor): tensor to normalize.
            mean (tensor or list): mean value to subtract.
            std (tensor or list): std to divide.
        """
        if tensor.dtype == torch.uint8:
            tensor = tensor.float()
            tensor = tensor / 255.0
        if type(mean) == list:
            mean = torch.tensor(mean)
        if type(std) == list:
            std = torch.tensor(std)
        if func is not None:
            tensor = func(tensor)
        tensor = tensor - mean
        tensor = tensor / std
        return tensor
    def scale_boxes(size, boxes, height, width):
        """
        Scale the short side of the box to size.
        Args:
            size (int): size to scale the image.
            boxes (ndarray): bounding boxes to perform scale. The dimension is
            `num boxes` x 4.
            height (int): the height of the image.
            width (int): the width of the image.
        Returns:
            boxes (ndarray): scaled bounding boxes.
        """
        if (width <= height and width == size) or (
            height <= width and height == size
        ):
            return boxes
        new_width = size
        new_height = size
        if width < height:
            new_height = int(math.floor((float(height) / width) * size))
            boxes *= float(new_height) / height
        else:
            new_width = int(math.floor((float(width) / height) * size))
            boxes *= float(new_width) / width
        return boxes
    def process_cv2_inputs(frames):
        """
        Normalize and prepare inputs as a list of tensors. Each tensor
        correspond to a unique pathway.
        Args:
            frames (list of array): list of input images (correspond to one clip) in range [0, 255].
        """
        inputs = torch.from_numpy(np.array(frames)).float() / 255
        inputs = tensor_normalize(inputs, [0.45, 0.45, 0.45], [0.225, 0.225, 0.225])
        # T H W C -> C T H W.
        inputs = inputs.permute(3, 0, 1, 2)
        # Sample 32 frames for the fast pathway.
        index = torch.linspace(0, inputs.shape[1] - 1, 32).long()
        inputs = torch.index_select(inputs, 1, index)
        fast_pathway = inputs
        # Take every fourth fast frame (8 in total) for the slow pathway.
        slow_pathway = torch.index_select(
                inputs,
                1,
                torch.linspace(
                    0, inputs.shape[1] - 1, inputs.shape[1] // 4
                ).long(),
            )
        frame_list = [slow_pathway, fast_pathway]
        inputs = [inp.unsqueeze(0) for inp in frame_list]
        return inputs
    # load the models
    yolov5 = init()
    slowfast = onnxruntime.InferenceSession('/content/SLOWFAST_32x2_R101_50_50.onnx')
    # read the video and run inference
    cap = cv2.VideoCapture("/content/atm_125.mp4")
    was_read = True
    while was_read:
        frames = []
        seq_length = 64
        while was_read and len(frames) < seq_length:
            was_read, frame = cap.read()
            frames.append(frame)

        bboxes = process_image(yolov5, frames[64 // 2])
        if bboxes is not None:
            frames = [cv2.cvtColor(frame, cv2.COLOR_BGR2RGB) for frame in frames]
            frames = [scale(256, frame) for frame in frames]
            inputs = process_cv2_inputs(frames)
            bboxes = scale_boxes(256, bboxes, 1080, 1920)
            index_pad = torch.full(
                size=(bboxes.shape[0], 1),
                fill_value=float(0),
                device=bboxes.device,
            )
            # Pad frame index for each box.
            bboxes = torch.cat([index_pad, bboxes], axis=1)
            for i in range(len(inputs)):
                inputs[i] = inputs[i].numpy()
            outputs = slowfast.run(None, {'slow_pathway': inputs[0],
                                          'fast_pathway': inputs[1],
                                          'bbox': bboxes.numpy().astype(np.float32)})
            for i in range(80):
                if outputs[0][0][i] > 0.3:
                    print(i)
        else:
            print("no person detected")

    6. SlowFast Python TensorRT inference

    6.1 Exporting to TensorRT

    What follows is the novel part of this article.

    At first I tried to convert the ONNX model directly into TensorRT, but the conversion failed. The cause is that roi_align was not yet implemented in TensorRT at the time (it is planned for the next TensorRT release).

    Inspecting the exported ONNX graph shows that roi_align is only used in the head.

    So the idea is to carve the roi_align module out and leave it un-accelerated: split SlowFast into two networks, where the backbone extracts features and the head performs the action classification. One way to realize the split is sketched below.
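    On the PyTorch side, one way to realize this split before exporting body.onnx is to swap the RoI head for a pass-through module, so that the exported graph ends at the two feature maps. This is only a sketch of the idea, assuming the detection-enabled SlowFast model calls self.head(x, bboxes) last, as in the official repo:

    import torch

    class FeaturePassthrough(torch.nn.Module):
        """Stands in for the RoI head: returns the [slow, fast] feature maps untouched."""
        def forward(self, x, bboxes=None):
            return x

    # As in section 5.1: model = build_model(cfg); cu.load_test_checkpoint(cfg, model)
    # model.head = FeaturePassthrough()   # body.onnx now stops right before roi_align
    # torch.onnx.export(model, (inputs, bbox), "body.onnx", ...)
    # head.onnx is exported analogously from the head's post-roi_align layers.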

    6.2 TensorRT inference code

    import ctypes
    import os
    import numpy as np
    import cv2
    import random
    import tensorrt as trt
    import pycuda.autoinit
    import pycuda.driver as cuda
    import threading
    import time
    class TrtInference():
        _batch_size = 1
        def __init__(self, model_path=None, cuda_ctx=None):
            self._model_path = model_path
            if self._model_path is None:
                print("please set trt model path!")
                exit()
            self.cuda_ctx = cuda_ctx
            if self.cuda_ctx is None:
                self.cuda_ctx = cuda.Device(0).make_context()
            if self.cuda_ctx:
                self.cuda_ctx.push()
            self.trt_logger = trt.Logger(trt.Logger.INFO)
            self._load_plugins()
            self.engine = self._load_engine()
            try:
                self.context = self.engine.create_execution_context()
                self.stream = cuda.Stream()
                for index, binding in enumerate(self.engine):
                    if self.engine.binding_is_input(binding):
                        batch_shape = list(self.engine.get_binding_shape(binding)).copy()
                        batch_shape[0] = self._batch_size
                        self.context.set_binding_shape(index, batch_shape)
                self.host_inputs, self.host_outputs, self.cuda_inputs, self.cuda_outputs, self.bindings = self._allocate_buffers()
            except Exception as e:
                raise RuntimeError('fail to allocate CUDA resources') from e
            finally:
                if self.cuda_ctx:
                    self.cuda_ctx.pop()
        def _load_plugins(self):
            pass
        def _load_engine(self):
            with open(self._model_path, 'rb') as f, trt.Runtime(self.trt_logger) as runtime:
                return runtime.deserialize_cuda_engine(f.read())
        def _allocate_buffers(self):
            host_inputs, host_outputs, cuda_inputs, cuda_outputs, bindings = \
                [], [], [], [], []
            for index, binding in enumerate(self.engine):
                size = trt.volume(self.context.get_binding_shape(index)) * \
                       self.engine.max_batch_size
                host_mem = cuda.pagelocked_empty(size, np.float32)
                cuda_mem = cuda.mem_alloc(host_mem.nbytes)
                bindings.append(int(cuda_mem))
                if self.engine.binding_is_input(binding):
                    host_inputs.append(host_mem)
                    cuda_inputs.append(cuda_mem)
                else:
                    host_outputs.append(host_mem)
                    cuda_outputs.append(cuda_mem)
            return host_inputs, host_outputs, cuda_inputs, cuda_outputs, bindings
        def destroy(self):
            """Free CUDA memories and context."""
            del self.cuda_outputs
            del self.cuda_inputs
            del self.stream
            if self.cuda_ctx:
                self.cuda_ctx.pop()
                del self.cuda_ctx
        def inference(self, inputs):
            np.copyto(self.host_inputs[0], inputs[0].ravel())
            np.copyto(self.host_inputs[1], inputs[1].ravel())
            if self.cuda_ctx:
                self.cuda_ctx.push()
            cuda.memcpy_htod_async(
                self.cuda_inputs[0], self.host_inputs[0], self.stream)
            cuda.memcpy_htod_async(
                self.cuda_inputs[1], self.host_inputs[1], self.stream)
            self.context.execute_async(
                batch_size=1,
                bindings=self.bindings,
                stream_handle=self.stream.handle)
            cuda.memcpy_dtoh_async(
                self.host_outputs[0], self.cuda_outputs[0], self.stream)
            cuda.memcpy_dtoh_async(
                self.host_outputs[1], self.cuda_outputs[1], self.stream)
            self.stream.synchronize()
            if self.cuda_ctx:
                self.cuda_ctx.pop()
            output = [self.host_outputs[0], self.host_outputs[1]]
            return output
    class TrtInference_head():
        _batch_size = 1
        def __init__(self, model_path=None, cuda_ctx=None):
            self._model_path = model_path
            if self._model_path is None:
                print("please set trt model path!")
                exit()
            self.cuda_ctx = cuda_ctx
            if self.cuda_ctx is None:
                self.cuda_ctx = cuda.Device(0).make_context()
            if self.cuda_ctx:
                self.cuda_ctx.push()
            self.trt_logger = trt.Logger(trt.Logger.INFO)
            self._load_plugins()
            self.engine = self._load_engine()
            try:
                self.context = self.engine.create_execution_context()
                self.stream = cuda.Stream()
                for index, binding in enumerate(self.engine):
                    if self.engine.binding_is_input(binding):
                        batch_shape = list(self.engine.get_binding_shape(binding)).copy()
                        batch_shape[0] = self._batch_size
                        self.context.set_binding_shape(index, batch_shape)
                self.host_inputs, self.host_outputs, self.cuda_inputs, self.cuda_outputs, self.bindings = self._allocate_buffers()
            except Exception as e:
                raise RuntimeError('fail to allocate CUDA resources') from e
            finally:
                if self.cuda_ctx:
                    self.cuda_ctx.pop()
        def _load_plugins(self):
            pass
        def _load_engine(self):
            with open(self._model_path, 'rb') as f, trt.Runtime(self.trt_logger) as runtime:
                return runtime.deserialize_cuda_engine(f.read())
        def _allocate_buffers(self):
            host_inputs, host_outputs, cuda_inputs, cuda_outputs, bindings = \
                [], [], [], [], []
            for index, binding in enumerate(self.engine):
                size = trt.volume(self.context.get_binding_shape(index)) * \
                       self.engine.max_batch_size
                host_mem = cuda.pagelocked_empty(size, np.float32)
                cuda_mem = cuda.mem_alloc(host_mem.nbytes)
                bindings.append(int(cuda_mem))
                if self.engine.binding_is_input(binding):
                    host_inputs.append(host_mem)
                    cuda_inputs.append(cuda_mem)
                else:
                    host_outputs.append(host_mem)
                    cuda_outputs.append(cuda_mem)
            return host_inputs, host_outputs, cuda_inputs, cuda_outputs, bindings
        def destroy(self):
            """Free CUDA memories and context."""
            del self.cuda_outputs
            del self.cuda_inputs
            del self.stream
            if self.cuda_ctx:
                self.cuda_ctx.pop()
                del self.cuda_ctx
        def inference(self, inputs):
            np.copyto(self.host_inputs[0], inputs[0].ravel())
            np.copyto(self.host_inputs[1], inputs[1].ravel())
            if self.cuda_ctx:
                self.cuda_ctx.push()
            cuda.memcpy_htod_async(
                self.cuda_inputs[0], self.host_inputs[0], self.stream)
            cuda.memcpy_htod_async(
                self.cuda_inputs[1], self.host_inputs[1], self.stream)
            self.context.execute_async(
                batch_size=1,
                bindings=self.bindings,
                stream_handle=self.stream.handle)
            cuda.memcpy_dtoh_async(
                self.host_outputs[0], self.cuda_outputs[0], self.stream)
            self.stream.synchronize()
            if self.cuda_ctx:
                self.cuda_ctx.pop()
            output = self.host_outputs[0]
            return output
    import torch
    import math
    from torchvision.ops import roi_align
    import argparse
    import os
    import platform
    import shutil
    import time
    from pathlib import Path
    import sys
    import json
    sys.path.insert(1, '/content/drive/MyDrive/yolov5/')
    import cv2
    import torch.backends.cudnn as cudnn
    import numpy as np
    from numpy import random
    from models.common import DetectMultiBackend
    from utils.augmentations import letterbox
    from utils.general import check_img_size, non_max_suppression, scale_coords, set_logging
    from utils.torch_utils import select_device
    # ####### parameter settings
    conf_thres = 0.89
    iou_thres = 0.5
    #######
    imgsz = 640
    weights = "/content/yolov5l.pt"
    device = '0'
    stride = 32
    names = ["person"]
    def init():
        # Initialize
        global imgsz, device, stride
        set_logging()
        device = select_device('0')
        half = device.type != 'cpu'  # half precision only supported on CUDA
        model = DetectMultiBackend(weights, device=device, dnn=False)
        stride, pt, jit, engine = model.stride, model.pt, model.jit, model.engine
        imgsz = check_img_size(imgsz, s=stride)  # check img_size
        model.half()  # to FP16
        model.eval()
        return model
    def process_image(model, input_image=None, args=None, **kwargs):
        img0 = input_image
        img = letterbox(img0, new_shape=imgsz, stride=stride, auto=True)[0]
        img = img.transpose((2, 0, 1))[::-1]  # HWC to CHW, BGR to RGB
        img = np.ascontiguousarray(img)
        img = torch.from_numpy(img).to(device)
        img = img.half()
        img /= 255.0  # 0 - 255 to 0.0 - 1.0
        if len(img.shape) == 3:
            img = img[None]
        pred = model(img, augment=False, val=True)[0]
        pred = non_max_suppression(pred, conf_thres, iou_thres, agnostic=False)
        result = []
        for i, det in enumerate(pred):  # detections per image
            gn = torch.tensor(img0.shape)[[1, 0, 1, 0]]  # normalization gain whwh
            if det is not None and len(det):
                # Rescale boxes from img_size to im0 size
                det[:, :4] = scale_coords(img.shape[2:], det[:, :4], img0.shape).round()
                for *xyxy, conf, cls in det:
                    if cls == 0:  # class 0 = person
                        result.append([float(xyxy[0]), float(xyxy[1]), float(xyxy[2]), float(xyxy[3])])
        if len(result) == 0:
          return None
        # pad to 32 boxes: the head engine expects a fixed (32, 5) bbox input
        for i in range(32 - len(result)):
          result.append([float(0), float(0), float(0), float(0)])
        return torch.from_numpy(np.array(result))
    def scale(size, image):
        """
        Scale the short side of the image to size.
        Args:
            size (int): size to scale the image.
            image (array): image to perform short side scale. Dimension is
                `height` x `width` x `channel`.
        Returns:
            (ndarray): the scaled image with dimension of
                `height` x `width` x `channel`.
        """
        height = image.shape[0]
        width = image.shape[1]
        if (width <= height and width == size) or (
            height <= width and height == size
        ):
            return image
        new_width = size
        new_height = size
        if width < height:
            new_height = int(math.floor((float(height) / width) * size))
        else:
            new_width = int(math.floor((float(width) / height) * size))
        img = cv2.resize(
            image, (new_width, new_height), interpolation=cv2.INTER_LINEAR
        )
        return img.astype(np.float32)
    def tensor_normalize(tensor, mean, std, func=None):
        """
        Normalize a given tensor by subtracting the mean and dividing the std.
        Args:
            tensor (tensor): tensor to normalize.
            mean (tensor or list): mean value to subtract.
            std (tensor or list): std to divide.
        """
        if tensor.dtype == torch.uint8:
            tensor = tensor.float()
            tensor = tensor / 255.0
        if type(mean) == list:
            mean = torch.tensor(mean)
        if type(std) == list:
            std = torch.tensor(std)
        if func is not None:
            tensor = func(tensor)
        tensor = tensor - mean
        tensor = tensor / std
        return tensor
    def scale_boxes(size, boxes, height, width):
        """
        Scale the short side of the box to size.
        Args:
            size (int): size to scale the image.
            boxes (ndarray): bounding boxes to perform scale. The dimension is
            `num boxes` x 4.
            height (int): the height of the image.
            width (int): the width of the image.
        Returns:
            boxes (ndarray): scaled bounding boxes.
        """
        if (width <= height and width == size) or (
            height <= width and height == size
        ):
            return boxes
        new_width = size
        new_height = size
        if width < height:
            new_height = int(math.floor((float(height) / width) * size))
            boxes *= float(new_height) / height
        else:
            new_width = int(math.floor((float(width) / height) * size))
            boxes *= float(new_width) / width
        return boxes
    def process_cv2_inputs(frames):
        """
        Normalize and prepare inputs as a list of tensors. Each tensor
        correspond to a unique pathway.
        Args:
            frames (list of array): list of input images (correspond to one clip) in range [0, 255].
        """
        inputs = torch.from_numpy(np.array(frames)).float() / 255
        inputs = tensor_normalize(inputs, [0.45, 0.45, 0.45], [0.225, 0.225, 0.225])
        # T H W C -> C T H W.
        inputs = inputs.permute(3, 0, 1, 2)
        # Sample 32 frames for the fast pathway.
        index = torch.linspace(0, inputs.shape[1] - 1, 32).long()
        inputs = torch.index_select(inputs, 1, index)
        fast_pathway = inputs
        # Take every fourth fast frame (8 in total) for the slow pathway.
        slow_pathway = torch.index_select(
                inputs,
                1,
                torch.linspace(
                    0, inputs.shape[1] - 1, inputs.shape[1] // 4
                ).long(),
            )
        frame_list = [slow_pathway, fast_pathway]
        inputs = [inp.unsqueeze(0) for inp in frame_list]
        return inputs
    # load the models
    yolov5 = init()
    slowfast = TrtInference('/content/SLOWFAST_32x2_R101_50_50.engine', None)
    head = TrtInference_head('/content/SLOWFAST_head.engine', None)

    # read the video and run inference

    cap = cv2.VideoCapture("/content/atm_125.mp4")
    was_read = True
    while was_read:
        frames = []
        seq_length = 64
        while was_read and len(frames) < seq_length:
            was_read, frame = cap.read()
            frames.append(frame)

        bboxes = process_image(yolov5, frames[64 // 2])
        if bboxes is not None:
            frames = [cv2.cvtColor(frame, cv2.COLOR_BGR2RGB) for frame in frames]
            frames = [scale(256, frame) for frame in frames]
            inputs = process_cv2_inputs(frames)
            bboxes = scale_boxes(256, bboxes, 1080, 1920)
            index_pad = torch.full(
                size=(bboxes.shape[0], 1),
                fill_value=float(0),
                device=bboxes.device,
            )
            # Pad frame index for each box.
            bboxes = torch.cat([index_pad, bboxes], axis=1)
            for i in range(len(inputs)):
                inputs[i] = inputs[i].numpy()
            outputs = slowfast.inference(inputs)
            outputs[0] = outputs[0].reshape(1, 2048, 16, 29)
            outputs[1] = outputs[1].reshape(1, 256, 16, 29)
            outputs[0] = torch.from_numpy(outputs[0])
            outputs[1] = torch.from_numpy(outputs[1])
            outputs[0] = roi_align(outputs[0], bboxes.to(dtype=outputs[0].dtype), 7, 1.0 / 16, 0, True)
            outputs[1] = roi_align(outputs[1], bboxes.to(dtype=outputs[1].dtype), 7, 1.0 / 16, 0, True)
            outputs[0] = outputs[0].numpy()
            outputs[1] = outputs[1].numpy()
            prd = head.inference(outputs)
            prd = prd.reshape(32, 80)
            for i in range(80):
                if prd[0][i] > 0.3:
                    print(i)
        else:
            print("no person detected")

    Reading through the code above: slow_pathway and fast_pathway go through the SlowFast backbone, the outputs are reshaped to the dimensions roi_align expects, and the reshaped features, the bboxes, and the matching parameters are passed to roi_align to produce the inputs the head model needs.
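    As a cross-check of the shapes involved: the 256x455 input is downsampled 16x by the backbone (hence the 16x29 feature maps and spatial_scale = 1/16), and each box yields a 7x7 feature patch:

    import torch
    from torchvision.ops import roi_align

    feat = torch.randn(1, 2048, 16, 29)                 # slow-path features from the body engine
    boxes = torch.tensor([[0., 10., 10., 100., 200.]])  # (batch_index, x1, y1, x2, y2) in input pixels
    pooled = roi_align(feat, boxes, output_size=7, spatial_scale=1.0 / 16,
                       sampling_ratio=0, aligned=True)
    print(pooled.shape)  # torch.Size([1, 2048, 7, 7])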

    7. SlowFast C++ TensorRT deployment

    7.1 YOLOv5 C++ object detection

    I won't cover YOLOv5 itself here; I directly use the platform's built-in YOLOv5 TensorRT code:

    https://github.com/ExtremeMart/ev_sdk_demo4.0_pedestrian_intrusion_yolov5

    7.2 DeepSort C++ object tracking

    This article builds on the following DeepSort code:

    https://github.com/RichardoMrMu/deepsort-tensorrt

    Since this part is not the focus of this article, it is enough to know how to use the code: write the CMakeLists file, and then DeepSort can be called as follows.

    #include "deepsort.h"
    /**
     DeepSortBox holds the YOLOv5 detections.
     DeepSortBox layout:
     {
      x1,
      y1,
      x2,
      y2,
      score,
      label,
      trackID
     }
     img is the original image.
     The final results are written back into DeepSortBox.
    */
    DS->sort(img, DeepSortBox);

    7.3 SlowFast C++ action recognition

    Runtime environment:

    TensorRT 8.4

    OpenCV 4.1.1

    cuDNN 8.0

    CUDA 11.1

    Required files:

    body.onnx

    head.onnx

    [Figure: SlowFast inference flowchart]

    We implement the TensorRT inference code by following this inference flowchart.

    [Figure: body.onnx inputs and outputs, viewed with an ONNX visualizer]

    [Figure: head.onnx inputs and outputs]

    Step 1: model loading

    body.onnx and head.onnx are loaded through TensorRT, and TensorRT execution contexts are created for them. The code is as follows:

    void loadheadOnnx(const std::string strModelName)
    {
        Logger gLogger;
        // build the network through the TensorRT pipeline
        IBuilder* builder = createInferBuilder(gLogger);
        builder->setMaxBatchSize(1);
        const auto explicitBatch = 1U << static_cast<uint32_t>(NetworkDefinitionCreationFlag::kEXPLICIT_BATCH);
        INetworkDefinition* network = builder->createNetworkV2(explicitBatch);
        nvonnxparser::IParser* parser = nvonnxparser::createParser(*network, gLogger);
        parser->parseFromFile(strModelName.c_str(), static_cast<int>(ILogger::Severity::kWARNING));
        IBuilderConfig* config = builder->createBuilderConfig();
        config->setMaxWorkspaceSize(1ULL << 30);
        m_CudaheadEngine = builder->buildEngineWithConfig(*network, *config);
        // serialize the engine to a .trt file next to the onnx
        std::string strTrtName = strModelName;
        size_t sep_pos = strTrtName.find_last_of(".");
        strTrtName = strTrtName.substr(0, sep_pos) + ".trt";
        IHostMemory *gieModelStream = m_CudaheadEngine->serialize();
        std::string serialize_str;
        std::ofstream serialize_output_stream;
        serialize_str.resize(gieModelStream->size());
        memcpy((void*)serialize_str.data(), gieModelStream->data(), gieModelStream->size());
        serialize_output_stream.open(strTrtName.c_str());
        serialize_output_stream << serialize_str;
        serialize_output_stream.close();
        m_CudaheadContext = m_CudaheadEngine->createExecutionContext();
        parser->destroy();
        network->destroy();
        config->destroy();
        builder->destroy();
    }

    Step 2: allocate buffers for inputs and outputs

    body.onnx takes slow_pathway and fast_pathway, both shaped (B, C, T, H, W): slow_pathway has T = 8 and its output is (B, 2048, 16, 29); fast_pathway has T = 32 and its output is (B, 256, 16, 29). The head takes (32, 2048, 7, 7) and (32, 256, 7, 7) and outputs (32, 80). The code is as follows:

    slow_pathway_InputIndex = m_CudaslowfastEngine->getBindingIndex(slow_pathway_NAME);
    fast_pathway_InputIndex = m_CudaslowfastEngine->getBindingIndex(fast_pathway_NAME);
    slow_pathway_OutputIndex = m_CudaslowfastEngine->getBindingIndex(slow_pathway_OUTPUT);
    fast_pathway_OutputIndex = m_CudaslowfastEngine->getBindingIndex(fast_pathway_OUTPUT);
    dims_i = m_CudaslowfastEngine->getBindingDimensions(slow_pathway_InputIndex);
    SDKLOG(INFO) << "slow_pathway dims " << dims_i.d[0] << " " << dims_i.d[1] << " " << dims_i.d[2] << " " << dims_i.d[3] << " " << dims_i.d[4];
    size = dims_i.d[0] * dims_i.d[1] * dims_i.d[2] * dims_i.d[3] * dims_i.d[4];
    cudaMalloc(&slowfast_ArrayDevMemory[slow_pathway_InputIndex], size * sizeof(float));
    slowfast_ArrayHostMemory[slow_pathway_InputIndex] = malloc(size * sizeof(float));
    slowfast_ArraySize[slow_pathway_InputIndex] = size * sizeof(float);

    dims_i = m_CudaslowfastEngine->getBindingDimensions(fast_pathway_InputIndex);
    SDKLOG(INFO) << "fast_pathway dims " << dims_i.d[0] << " " << dims_i.d[1] << " " << dims_i.d[2] << " " << dims_i.d[3] << " " << dims_i.d[4];
    size = dims_i.d[0] * dims_i.d[1] * dims_i.d[2] * dims_i.d[3] * dims_i.d[4];
    cudaMalloc(&slowfast_ArrayDevMemory[fast_pathway_InputIndex], size * sizeof(float));
    slowfast_ArrayHostMemory[fast_pathway_InputIndex] = malloc(size * sizeof(float));
    slowfast_ArraySize[fast_pathway_InputIndex] = size * sizeof(float);

    dims_i = m_CudaslowfastEngine->getBindingDimensions(slow_pathway_OutputIndex);
    SDKLOG(INFO) << "slow_out dims " << dims_i.d[0] << " " << dims_i.d[1] << " " << dims_i.d[2] << " " << dims_i.d[3];
    size = dims_i.d[0] * dims_i.d[1] * dims_i.d[2] * dims_i.d[3];
    cudaMalloc(&slowfast_ArrayDevMemory[slow_pathway_OutputIndex], size * sizeof(float));
    slowfast_ArrayHostMemory[slow_pathway_OutputIndex] = malloc(size * sizeof(float));
    slowfast_ArraySize[slow_pathway_OutputIndex] = size * sizeof(float);

    dims_i = m_CudaslowfastEngine->getBindingDimensions(fast_pathway_OutputIndex);
    SDKLOG(INFO) << "fast_out dims " << dims_i.d[0] << " " << dims_i.d[1] << " " << dims_i.d[2] << " " << dims_i.d[3];
    size = dims_i.d[0] * dims_i.d[1] * dims_i.d[2] * dims_i.d[3];
    cudaMalloc(&slowfast_ArrayDevMemory[fast_pathway_OutputIndex], size * sizeof(float));
    slowfast_ArrayHostMemory[fast_pathway_OutputIndex] = malloc(size * sizeof(float));
    slowfast_ArraySize[fast_pathway_OutputIndex] = size * sizeof(float);

    // buffers for the head engine: two roi_align outputs and one classification output
    size = 32 * 2048 * 7 * 7;
    cudaMalloc(&ROIAlign_ArrayDevMemory[0], size * sizeof(float));
    ROIAlign_ArrayHostMemory[0] = malloc(size * sizeof(float));
    ROIAlign_ArraySize[0] = size * sizeof(float);

    size = 32 * 256 * 7 * 7;
    cudaMalloc(&ROIAlign_ArrayDevMemory[1], size * sizeof(float));
    ROIAlign_ArrayHostMemory[1] = malloc(size * sizeof(float));
    ROIAlign_ArraySize[1] = size * sizeof(float);

    size = 32 * 80;
    cudaMalloc(&ROIAlign_ArrayDevMemory[2], size * sizeof(float));
    ROIAlign_ArrayHostMemory[2] = malloc(size * sizeof(float));
    ROIAlign_ArraySize[2] = size * sizeof(float);
    size = 32 * 5;
    boxes_data = malloc(size * sizeof(float));
    dims_i = m_CudaheadEngine->getBindingDimensions(0);

    Step 3: input preprocessing

    First, since I exported the ONNX files without dynamic shapes, the input image size is fixed at 256x455 (1080x1920 scaled down proportionally). SlowFast expects RGB, so the image is converted from BGR to RGB and then resized to 256x455. The code is as follows:

    cv::Mat framesimg = img.clone();
    cv::cvtColor(framesimg, framesimg, cv::COLOR_BGR2RGB);  // BGR -> RGB
    int height = framesimg.rows;
    int width = framesimg.cols;
    // preprocess the image: scale the short side to 256, keeping the aspect ratio
    int size = 256;
    int new_width = width;
    int new_height = height;
    if ((width <= height && width == size) || (height <= width && height == size)) {
        // already at the target size, nothing to do
    }
    else {
        new_width = size;
        new_height = size;
        if (width < height) {
            new_height = int((float(height) / width) * size);
        } else {
            new_width = int((float(width) / height) * size);
        }
        cv::resize(framesimg, framesimg, cv::Size(new_width, new_height), 0, 0, cv::INTER_LINEAR);
    }

    Next, the image is normalized and laid out in C, T, H, W order, where C is the channel, T the frame index, H the image height, and W the image width. SlowFast has two inputs. fast_pathway takes 32 frames, shaped (b, c, T, h, w) with T = 32, so one of every two of the 64 frames is written into it. slow_pathway takes 8 frames, shaped (b, c, T, h, w) with T = 8, so one of every four fast-pathway frames (8 of the 64) is written into it. The code is as follows:

    float *data = (float *)slowfast_ArrayHostMemory[fast_pathway_InputIndex];
    new_width = framesimg.cols;
    new_height = framesimg.rows;
    for (size_t c = 0; c < 3; c++)
    {
        for (size_t h = 0; h < new_height; h++)
        {
            for (size_t w = 0; w < new_width; w++)
            {
                float v = ((float)framesimg.at<cv::Vec3b>(h, w)[c]) / 255.0f;
                v -= 0.45;
                v /= 0.225;
                data[c * 32 * 256 * 455 + fast_index * new_width * new_height + h * new_width + w] = v;
            }
        }
    }
    fast_index++;
    // a subset of 8 of the 64 frames also goes into the slow pathway
    if (frames == 0 || frames == 8 || frames == 16 || frames == 26 || frames == 34 || frames == 44 || frames == 52 || frames == 63) {
        data = (float *)slowfast_ArrayHostMemory[slow_pathway_InputIndex];
        for (size_t c = 0; c < 3; c++)
        {
            for (size_t h = 0; h < new_height; h++)
            {
                for (size_t w = 0; w < new_width; w++)
                {
                    float v = ((float)framesimg.at<cv::Vec3b>(h, w)[c]) / 255.0f;
                    v -= 0.45;
                    v /= 0.225;
                    data[c * 8 * 256 * 455 + slow_index * new_width * new_height + h * new_width + w] = v;
                }
            }
        }
        slow_index++;
    }

    Step 4: roi_align implementation

    As described in the previous section, roi_align is not implemented in the TensorRT version used here. torchvision.ops provides roi_align, which the Python inference code can call directly, but the C++ code has to implement roi_align itself. I won't go into the theory; roughly speaking, roi_align is a crop-and-resize: it extracts the bbox-aligned features from the feature map and resizes them to 7x7. The code is as follows:

    void ROIAlignForwardCpu(const float* bottom_data, const float spatial_scale, const int num_rois,
                         const int height, const int width, const int channels,
                         const int aligned_height, const int aligned_width, const float * bottom_rois,
                         float* top_data)
    {
        const int output_size = num_rois * aligned_height * aligned_width * channels;
        int idx = 0;
        for (idx = 0; idx < output_size; ++idx)
        {
            int pw = idx % aligned_width;
            int ph = (idx / aligned_width) % aligned_height;
            int c = (idx / aligned_width / aligned_height) % channels;
            int n = idx / aligned_width / aligned_height / channels;
            float roi_batch_ind = 0;
            float roi_start_w = bottom_rois[n * 5 + 1] * spatial_scale;
            float roi_start_h = bottom_rois[n * 5 + 2] * spatial_scale;
            float roi_end_w = bottom_rois[n * 5 + 3] * spatial_scale;
            float roi_end_h = bottom_rois[n * 5 + 4] * spatial_scale;
            float roi_width = fmaxf(roi_end_w - roi_start_w + 1., 0.);
            float roi_height = fmaxf(roi_end_h - roi_start_h + 1., 0.);
            float bin_size_h = roi_height / (aligned_height - 1.);
            float bin_size_w = roi_width / (aligned_width - 1.);
            float h = (float)(ph) * bin_size_h + roi_start_h;
            float w = (float)(pw) * bin_size_w + roi_start_w;
            int hstart = fminf(floor(h), height - 2);
            int wstart = fminf(floor(w), width - 2);
            int img_start = roi_batch_ind * channels * height * width;
            if (h < 0 || h >= height || w < 0 || w >= width)
            {
                top_data[idx] = 0.;
            }
            else
            {
                // bilinear interpolation over the four neighboring feature cells
                float h_ratio = h - (float)(hstart);
                float w_ratio = w - (float)(wstart);
                int upleft = img_start + (c * height + hstart) * width + wstart;
                int upright = upleft + 1;
                int downleft = upleft + width;
                int downright = downleft + 1;
                top_data[idx] = bottom_data[upleft] * (1. - h_ratio) * (1. - w_ratio)
                    + bottom_data[upright] * (1. - h_ratio) * w_ratio
                    + bottom_data[downleft] * h_ratio * (1. - w_ratio)
                    + bottom_data[downright] * h_ratio * w_ratio;
            }
        }
    }

    Step 5: inference

    First run the body network on the data prepared in Step 3, extract the bbox-aligned features from the body outputs with the roi_align function from Step 4, and finally run the head model on the extracted features to obtain the output. The code is as follows:

    cudaMemcpyAsync(slowfast_ArrayDevMemory[slow_pathway_InputIndex], slowfast_ArrayHostMemory[slow_pathway_InputIndex], slowfast_ArraySize[slow_pathway_InputIndex], cudaMemcpyHostToDevice, m_CudaStream);
    cudaMemcpyAsync(slowfast_ArrayDevMemory[fast_pathway_InputIndex], slowfast_ArrayHostMemory[fast_pathway_InputIndex], slowfast_ArraySize[fast_pathway_InputIndex], cudaMemcpyHostToDevice, m_CudaStream);
    m_CudaslowfastContext->enqueueV2(slowfast_ArrayDevMemory, m_CudaStream, nullptr);
    cudaMemcpyAsync(slowfast_ArrayHostMemory[slow_pathway_OutputIndex], slowfast_ArrayDevMemory[slow_pathway_OutputIndex], slowfast_ArraySize[slow_pathway_OutputIndex], cudaMemcpyDeviceToHost, m_CudaStream);
    cudaMemcpyAsync(slowfast_ArrayHostMemory[fast_pathway_OutputIndex], slowfast_ArrayDevMemory[fast_pathway_OutputIndex], slowfast_ArraySize[fast_pathway_OutputIndex], cudaMemcpyDeviceToHost, m_CudaStream);
    cudaStreamSynchronize(m_CudaStream);
    data = (float*)slowfast_ArrayHostMemory[fast_pathway_OutputIndex];
    // CPU roi_align: (features, spatial_scale, num_rois, height, width, channels, aligned_h, aligned_w, rois, output)
    ROIAlignForwardCpu((float*)slowfast_ArrayHostMemory[slow_pathway_OutputIndex], 0.0625, 32, 16, 29, 2048, 7, 7, (float*)boxes_data, (float*)ROIAlign_ArrayHostMemory[0]);
    ROIAlignForwardCpu((float*)slowfast_ArrayHostMemory[fast_pathway_OutputIndex], 0.0625, 32, 16, 29, 256, 7, 7, (float*)boxes_data, (float*)ROIAlign_ArrayHostMemory[1]);
    data = (float*)ROIAlign_ArrayHostMemory[0];
    cudaMemcpyAsync(ROIAlign_ArrayDevMemory[0], ROIAlign_ArrayHostMemory[0], ROIAlign_ArraySize[0], cudaMemcpyHostToDevice, m_CudaStream);
    cudaMemcpyAsync(ROIAlign_ArrayDevMemory[1], ROIAlign_ArrayHostMemory[1], ROIAlign_ArraySize[1], cudaMemcpyHostToDevice, m_CudaStream);
    m_CudaheadContext->enqueueV2(ROIAlign_ArrayDevMemory, m_CudaStream, nullptr);
    cudaMemcpyAsync(ROIAlign_ArrayHostMemory[2], ROIAlign_ArrayDevMemory[2], ROIAlign_ArraySize[2], cudaMemcpyDeviceToHost, m_CudaStream);
    cudaStreamSynchronize(m_CudaStream);

    Reference links

    https://github.com/facebookresearch/SlowFast

    Open-source project 1

    https://github.com/facebookresearch/SlowFast

    Open-source project 2

    https://github.com/wufan-tb/yolo_slowfast

     

  • Original article: https://blog.csdn.net/qq_29788741/article/details/127798953