When running a DeepStream application, we may run into latency, stuttering video, even freezes, or the application shutting down. Here I list in detail the possible causes of these symptoms and workable remedies. The code below is based on DeepStream Python, but the same reasoning applies to the C/C++ platform.
Environment:
The last step of installing DeepStream is to boost the clocks:
sudo nvpmodel -m 0
sudo jetson_clocks
Generally, if we installed DeepStream through SDK Manager, or simply run the corresponding container image, we don't need to worry about this step.
The batched-push-timeout parameter
batched-push-timeout is a parameter of Gst-nvstreammux. The official documentation defines it as:
Timeout in microseconds to wait after the first buffer is available to push the batch even if a complete batch is not formed.
In plain terms: even if a complete batch has not been formed, the batch is pushed this many microseconds after the first buffer becomes available.
As we know, Gst-nvstreammux muxes the individual video streams into a batch, which then flows into the nvinfer element. batched-push-timeout must not be set too large, otherwise pushing the batch itself adds latency. A common choice is 1/max_fps. Note that the unit is microseconds, so in code we write:
streammux.set_property('batched-push-timeout', int(1000000/CfgVidSource.fps) )
We also need to check whether the model's own inference time already dooms the system to lag. For example, with two cameras connected, each at 25 FPS, inference must finish within 1/50 of a second per frame. We can mitigate this with nvtracker or by multi-threading the elements, but ultimately the model's own performance has to be considered first.
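As a quick back-of-the-envelope check (plain Python arithmetic, not DeepStream API; the numbers simply mirror the example above):
num_sources = 2   # two cameras
fps = 25          # each camera runs at 25 FPS
# Frames from all sources arrive interleaved, so inference must keep up with
# num_sources * fps frames per second overall.
budget_s = 1.0 / (num_sources * fps)
print(f"Per-frame inference budget: {budget_s * 1000:.1f} ms")   # 20.0 ms, i.e. 1/50 s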
Another possibility is the input image size. We can adjust the width and height parameters of Gst-nvstreammux to control the frame size that nvstreammux hands to nvinfer; the defaults are 1280 and 720. Personally I don't think tuning these changes overall performance much, although an overly large size may make nvinfer spend more time. Since most models have fairly small input resolutions anyway, we can set these two parameters to any value no smaller than the model's input size, but there is no need to go much larger. The code:
streammux.set_property('width', streammux_width)
streammux.set_property('height', streammux_height)
The live-source parameter
live-source is a parameter of Gst-nvstreammux. The official documentation defines it as:
Indicates to muxer that sources are live, e.g. live feeds like an RTSP or USB camera.
In other words, if the video input is an RTSP IP camera or a USB camera (this should also cover CSI cameras), this parameter should be set to 1. Its default value is 0, so for live sources we set it explicitly:
if is_live: # For live video input, we should switch on parameter 'live-source' of streammux to 1.
    streammux.set_property('live-source', 1)
The sink's sync parameter
I didn't find this in the official NVIDIA documentation proper, but it is recorded in the Troubleshooting page, and I use it in my own projects. Output (sink) elements, whether nvoverlaysink, nveglglessink, or fakesink, all have a sync property, and it should be set to 0/False. For udpsink there is also an async property that should be set to False, while sync is set to 1; I don't know why. The code:
sink_rtsp = Gst.ElementFactory.make("udpsink", "udpsink") # example rtsp stream: rtsp://<server IP>:8554/ds-test
sink_rtsp.set_property('async', False)
sink_rtsp.set_property('sync', True)
sink_localdisplay = Gst.ElementFactory.make("nvoverlaysink", "nvvideo-renderer")
sink_localdisplay.set_property("sync", False)
sink_localdisplay = Gst.ElementFactory.make("nveglglessink", "nvvideo-renderer")
sink_localdisplay.set_property("sync", False)
sink_localdisplay = Gst.ElementFactory.make("fakesink", "fakesink") # fakesink is a kind of video output which output nothing.
sink_localdisplay.set_property("sync", False)
Using nvinfer together with nvtracker
Generally speaking, inference takes relatively long, while object tracking is fast.
For nvinfer we can set the interval parameter. Its official definition:
Specifies the number of consecutive batches to be skipped for inference
We can set this parameter either in the config file, e.g. dstest1_pgie_config.txt, or in code, for example:
pgie = Gst.ElementFactory.make("nvinfer", "primary-inference")
pgie.set_property("interval",interval)
Of course, we also need to create the nvtracker element in the pipeline and link it with nvinfer, so the tracker keeps objects updated on the frames where inference is skipped.
tracker = Gst.ElementFactory.make("nvtracker", "tracker")
#...
pgie.link(tracker)
tracker.link(nvvidconv1)
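The "#..." above stands for the tracker's property setup, which I have omitted. As a rough sketch of what it might look like (the tracker-width/height values, library path, and config file name below are assumptions for a DeepStream 6.x install; adjust them to your version):
tracker = Gst.ElementFactory.make("nvtracker", "tracker")
tracker.set_property("tracker-width", 640)    # assumed tracker resolution
tracker.set_property("tracker-height", 384)
# assumed low-level tracker library and sample config shipped with DeepStream 6.x
tracker.set_property("ll-lib-file",
    "/opt/nvidia/deepstream/deepstream/lib/libnvds_nvmultiobjecttracker.so")
tracker.set_property("ll-config-file", "config_tracker_NvDCF_perf.yml")
pipeline.add(tracker)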
Setting qos
The official FAQ has this question:
When deepstream-app is run in loop on Jetson AGX Xavier using “while true; do deepstream-app -c <config_file>; done;”, after a few iterations I see low FPS for certain iterations. Why is that?
The answer:
This may happen when you are running thirty 1080p streams at 30 frames/second. The issue is caused by initial load. I/O operations bog down the CPU, and with qos=1 as a default property of the [sink0] group, decodebin starts dropping frames. To avoid this, set qos=0 in the [sink0] group in the configuration file.
So this problem rarely arises in practice, because we normally don't connect that many cameras. Just in case, though, we can set it in code as well.
The AV Sync in DeepStream page lists many examples with RTSP/RTMP/video-file input and RTSP/RTMP/video-file output, and you can see qos being used there, for example:
RTMP_IN -> RTMP_OUT:
gst-launch-1.0 uridecodebin3 uri=$input1 name=demux1 ! queue ! nvvideoconvert ! "video/x-raw(memory:NVMM)" ! mux1.sink_0 nvstreammux batch-size=2 max-latency=250000000 batched-push-timeout=33333 width=1920 height=1080 sync-inputs=1 name=mux1 ! queue ! nvmultistreamtiler width=480 height=360 ! nvvideoconvert ! "video/x-raw(memory:NVMM)" ! nvv4l2h264enc ! h264parse ! queue ! flvmux name=mux streamable=true ! rtmpsink location=$output async=0 qos=0 sync=1 uridecodebin3 uri=$input2 name=demux2 ! queue ! nvvideoconvert ! "video/x-raw(memory:NVMM)" ! mux1.sink_1 demux1. ! queue ! audioconvert ! mixer.sink_0 audiomixer name=mixer latency=250000000 ! queue ! avenc_aac ! aacparse ! queue ! mux. demux2. ! queue ! audioconvert ! mixer. fakesrc num-buffers=0 is-live=1 ! mixer. -e
RTMP_IN->FILE_OUT:
gst-launch-1.0 uridecodebin3 uri=$input1 name=demux1 ! queue ! nvvideoconvert ! "video/x-raw(memory:NVMM)" ! mux1.sink_0 nvstreammux batch-size=2 batched-push-timeout=33333 width=1920 height=1080 sync-inputs=1 max-latency=250000000 name=mux1 ! queue ! nvmultistreamtiler width=480 height=360 ! nvvideoconvert ! "video/x-raw(memory:NVMM)" ! nvv4l2h264enc ! h264parse ! queue ! flvmux name=mux streamable=true ! filesink location=out.flv async=0 qos=0 sync=1 uridecodebin3 uri=$input2 name=demux2 ! queue ! nvvideoconvert ! "video/x-raw(memory:NVMM)" ! mux1.sink_1 demux1. ! queue ! audioconvert ! mixer.sink_0 audiomixer latency=250000000 name=mixer ! queue ! avenc_aac ! aacparse ! queue ! mux. demux2. ! queue ! audioconvert ! mixer. fakesrc num-buffers=0 is-live=1 ! mixer. -e
RTSP_IN->RTSP_OUT:
gst-launch-1.0 uridecodebin3 uri=$input1 name=demux1 ! queue ! nvvideoconvert ! "video/x-raw(memory:NVMM)" ! mux1.sink_0 nvstreammux batch-size=2 batched-push-timeout=33333 width=1920 height=1080 sync-inputs=1 name=mux1 ! queue ! nvmultistreamtiler width=480 height=360 ! nvrtspoutsinkbin name=r uridecodebin3 uri=$input2 name=demux2 ! queue ! nvvideoconvert ! "video/x-raw(memory:NVMM)" ! mux1.sink_1 demux1. ! queue ! audioconvert ! mixer.sink_0 audiomixer name=mixer ! queue ! r. demux2. ! queue ! audioconvert ! mixer. -e
As you can see, we only need to set qos to 0 on the corresponding sink. In Python this can be written as:
sink_localdisplay = Gst.ElementFactory.make("nveglglessink", "nvvideo-renderer")
sink_localdisplay.set_property("qos", 0)
Setting drop-on-latency
The Troubleshooting page says:
For RTSP streaming input, if the input has high jitter the GStreamer rtpjitterbuffer element might drop packets which are late. Increase the latency property of rtspsrc, for deepstream-app set latency in [source*] group. Alternatively, if using RTSP type source (type=4) with deepstream-app, turn off drop-on-latency in deepstream_source_bin.c. These steps may add cumulative delay in frames reaching the renderer and memory accumulation in the rtpjitterbuffer if the pipeline is not fast enough.
Digging a bit further, I checked the GStreamer page for rtspsrc, which describes drop-on-latency as:
Tells the jitterbuffer to never exceed the given latency in size
In other words, when rtspsrc is our source element, drop-on-latency=1 makes the jitterbuffer drop late packets so it never grows beyond the configured latency; following the troubleshooting note above, turning it off (0) avoids dropping frames from a jittery stream, at the cost of extra delay and memory accumulation in the jitterbuffer. If we use uridecodebin instead, there is nothing to set, because it does not expose this property.
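A minimal sketch (only applicable when the pipeline builds its own rtspsrc; the URL and latency value below are placeholders):
src = Gst.ElementFactory.make("rtspsrc", "rtsp-source")
src.set_property("location", "rtsp://<camera IP>:554/stream")  # placeholder address
src.set_property("latency", 200)              # jitterbuffer latency in milliseconds
src.set_property("drop-on-latency", False)    # keep late packets; may add delay and memory use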
The probe() function
This cause puzzled me for the longest time, until I finally found a clear answer here:
Pipeline unable to perform at real time
WARNING: A lot of buffers are being dropped. (13): gstbasesink.c(2902):
gst_base_sink_is_too_late (): /GstPipeline:pipeline0/GstEglGlesSink:nvvideo-renderer:
There may be a timestamping problem, or this computer is too slow.
Answer:
This could be thrown from any GStreamer app when the pipeline through-put is low
resulting in late buffers at the sink plugin.
This could be a hardware capability limitation or expensive software calls
that hinder pipeline buffer flow.
With python, there is a possibility for such an induced delay in software.
This is with regards to the probe() callbacks an user could leverage
for metadata extraction and other use-cases as demonstrated in our test-apps.
Please NOTE:
a) probe() callbacks are synchronous and thus holds the buffer
(info.get_buffer()) from traversing the pipeline until user return.
b) loops inside probe() callback could be costly in python.
The key point: probe() callbacks are synchronous. Typically we attach a probe after some element, for example after the OSD, and do post-processing there, or collect metadata to send to the cloud. The problem is that this probe function is synchronous: the longer it takes, the more it hurts the real-time behaviour of the whole pipeline. There are several ways to mitigate or solve this. For instance, keep the probe on the main pipeline as small as possible, with as few loops as possible, and instead use tee and queue to create a new branch (and thus a new streaming thread) with its own probe that holds the heavy logic. Or simply spawn a separate thread that runs the heavy logic, with the probe fetching results from it when needed. In short, do not put time-consuming logic in a probe on the main path.
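A minimal sketch of the second option, keeping the probe cheap and doing the heavy work in a worker thread (do_expensive_postprocessing and summarize are hypothetical functions you would write yourself):
import queue
import threading
import pyds
from gi.repository import Gst

meta_queue = queue.Queue(maxsize=100)

def heavy_worker():
    while True:
        item = meta_queue.get()            # blocks until the probe hands over data
        do_expensive_postprocessing(item)  # hypothetical user function (cloud upload, etc.)

threading.Thread(target=heavy_worker, daemon=True).start()

def osd_sink_pad_buffer_probe(pad, info, u_data):
    gst_buffer = info.get_buffer()
    if not gst_buffer:
        return Gst.PadProbeReturn.OK
    batch_meta = pyds.gst_buffer_get_nvds_batch_meta(hash(gst_buffer))
    # Extract only the minimum needed, then return quickly; the buffer is held
    # until this callback returns.
    try:
        meta_queue.put_nowait(summarize(batch_meta))  # hypothetical cheap summary
    except queue.Full:
        pass                               # drop rather than block the pipeline
    return Gst.PadProbeReturn.OK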
In the sample deepstream_test_3.py there is this code:
queue1=Gst.ElementFactory.make("queue","queue1")
queue2=Gst.ElementFactory.make("queue","queue2")
queue3=Gst.ElementFactory.make("queue","queue3")
queue4=Gst.ElementFactory.make("queue","queue4")
queue5=Gst.ElementFactory.make("queue","queue5")
pipeline.add(queue1)
pipeline.add(queue2)
pipeline.add(queue3)
pipeline.add(queue4)
pipeline.add(queue5)
print("Creating Pgie \n ")
pgie = Gst.ElementFactory.make("nvinfer", "primary-inference")
if not pgie:
    sys.stderr.write(" Unable to create pgie \n")
print("Creating tiler \n ")
tiler=Gst.ElementFactory.make("nvmultistreamtiler", "nvtiler")
if not tiler:
    sys.stderr.write(" Unable to create tiler \n")
print("Creating nvvidconv \n ")
nvvidconv = Gst.ElementFactory.make("nvvideoconvert", "convertor")
if not nvvidconv:
    sys.stderr.write(" Unable to create nvvidconv \n")
print("Creating nvosd \n ")
nvosd = Gst.ElementFactory.make("nvdsosd", "onscreendisplay")
if not nvosd:
    sys.stderr.write(" Unable to create nvosd \n")
nvosd.set_property('process-mode',OSD_PROCESS_MODE)
nvosd.set_property('display-text',OSD_DISPLAY_TEXT)
if(is_aarch64()):
    print("Creating transform \n ")
    transform=Gst.ElementFactory.make("nvegltransform", "nvegl-transform")
    if not transform:
        sys.stderr.write(" Unable to create transform \n")
print("Creating EGLSink \n")
sink = Gst.ElementFactory.make("nveglglessink", "nvvideo-renderer")
if not sink:
    sys.stderr.write(" Unable to create egl sink \n")
if is_live:
    print("Atleast one of the sources is live")
    streammux.set_property('live-source', 1)
streammux.set_property('width', 1920)
streammux.set_property('height', 1080)
streammux.set_property('batch-size', number_sources)
streammux.set_property('batched-push-timeout', 4000000)
pgie.set_property('config-file-path', "dstest3_pgie_config.txt")
pgie_batch_size=pgie.get_property("batch-size")
if(pgie_batch_size != number_sources):
    print("WARNING: Overriding infer-config batch-size",pgie_batch_size," with number of sources ", number_sources," \n")
    pgie.set_property("batch-size",number_sources)
tiler_rows=int(math.sqrt(number_sources))
tiler_columns=int(math.ceil((1.0*number_sources)/tiler_rows))
tiler.set_property("rows",tiler_rows)
tiler.set_property("columns",tiler_columns)
tiler.set_property("width", TILED_OUTPUT_WIDTH)
tiler.set_property("height", TILED_OUTPUT_HEIGHT)
sink.set_property("qos",0)
sink.set_property("sync", 0)
print("Adding elements to Pipeline \n")
pipeline.add(pgie)
pipeline.add(tiler)
pipeline.add(nvvidconv)
pipeline.add(nvosd)
if is_aarch64():
    pipeline.add(transform)
pipeline.add(sink)
print("Linking elements in the Pipeline \n")
streammux.link(queue1)
queue1.link(pgie)
pgie.link(queue2)
queue2.link(tiler)
tiler.link(queue3)
queue3.link(nvvidconv)
nvvidconv.link(queue4)
queue4.link(nvosd)
if is_aarch64():
    nvosd.link(queue5)
    queue5.link(transform)
    transform.link(sink)
else:
    nvosd.link(queue5)
    queue5.link(sink)
Notice how many queue elements appear when the pipeline is assembled. Why? This is how GStreamer forces the use of new threads in a pipeline. Here is the original text from the GStreamer documentation:
There are several reasons to force the use of threads. However, for performance reasons, you never want to use one thread for every element out there, since that will create some overhead. Let’s now list some situations where threads can be particularly useful:
Above, we’ve mentioned the “queue” element several times now. A queue is the thread boundary element through which you can force the use of threads. It does so by using a classic provider/consumer model as learned in threading classes at universities all around the world. By doing this, it acts both as a means to make data throughput between threads threadsafe, and it can also act as a buffer.
Note that for a queue feeding a heavy probe, its leaky property should be set to 2. The idea is that since the probe is too expensive to process every frame, we deliberately drop some frames; a value of 2 ("downstream") drops the oldest buffers.
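Putting the two ideas together, a minimal sketch of a tee/queue side branch for a heavy probe (it reuses the pipeline, nvosd, and transform objects from the sample above and would replace the nvosd -> queue5 link there; heavy_probe_callback is a user-defined probe function):
tee = Gst.ElementFactory.make("tee", "probe-tee")
queue_main = Gst.ElementFactory.make("queue", "queue-main")
queue_probe = Gst.ElementFactory.make("queue", "queue-probe")
probe_sink = Gst.ElementFactory.make("fakesink", "probe-sink")
queue_probe.set_property("leaky", 2)      # 2 = downstream: drop the oldest buffers
probe_sink.set_property("sync", False)
for elem in (tee, queue_main, queue_probe, probe_sink):
    pipeline.add(elem)
nvosd.link(tee)                 # main path: nvosd -> tee
tee.link(queue_main)            # branch 1 continues toward the renderer
queue_main.link(transform)      # (link to sink directly on non-Jetson platforms)
tee.link(queue_probe)           # branch 2 only feeds the heavy probe
queue_probe.link(probe_sink)
probe_pad = queue_probe.get_static_pad("src")
probe_pad.add_probe(Gst.PadProbeType.BUFFER, heavy_probe_callback, 0)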
In the References, the official DeepStream documentation also lists some other possible causes of display latency, stutter, freezes, or application crashes. I think they are unlikely, or I don't fully understand them, but I list them here as well: