When running a DeepStream application, we may run into latency, stuttering video, even freezes, or the application shutting down. Here I list in detail the possible causes of these symptoms and workable remedies. The code below is based on DeepStream Python, but the same reasoning applies to the C/C++ platform.
Environment:
The last step of installing DeepStream is to boost the clocks:
sudo nvpmodel -m 0
sudo jetson_clocks
Generally, if we installed DeepStream through SDK Manager, or simply run the corresponding container image, we don't need to worry about this step.
The batched-push-timeout parameter
batched-push-timeout is a parameter of Gst-nvstreammux. The official documentation defines it as:
Timeout in microseconds to wait after the first buffer is available to push the batch even if a complete batch is not formed.
In plain terms: even if a complete batch has not been formed, the batch is pushed this many microseconds after the first buffer becomes available.
As we know, Gst-nvstreammux muxes the individual video streams into a batch, which then flows into the nvinfer element. batched-push-timeout must not be set too large, otherwise pushing the batch itself adds latency. A common choice is 1/max_fps. Note that the unit is microseconds, so in code we write:
streammux.set_property('batched-push-timeout', int(1000000/CfgVidSource.fps) )
We also need to check whether the model's own inference time already dooms the system to lag. For example, with two cameras connected, each at 25 FPS, inference must finish within 1/50 of a second per frame. We can mitigate this with nvtracker or by multi-threading the elements, but ultimately the model's own performance has to be considered first.
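As a quick back-of-the-envelope check (plain Python arithmetic, not DeepStream API; the numbers simply mirror the example above):
num_sources = 2   # two cameras
fps = 25          # each camera runs at 25 FPS
# Frames from all sources arrive interleaved, so inference must keep up with
# num_sources * fps frames per second overall.
budget_s = 1.0 / (num_sources * fps)
print(f"Per-frame inference budget: {budget_s * 1000:.1f} ms")   # 20.0 ms, i.e. 1/50 s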
Another possibility is the input image size. We can adjust the width and height parameters of Gst-nvstreammux to control the frame size that nvstreammux hands to nvinfer; the defaults are 1280 and 720. Personally I don't think tuning these changes overall performance much, although an overly large size may make nvinfer spend more time. Since most models have fairly small input resolutions anyway, we can set these two parameters to any value no smaller than the model's input size, but there is no need to go much larger. The code:
streammux.set_property('width', streammux_width)
streammux.set_property('height', streammux_height)
The live-source parameter
live-source is a parameter of Gst-nvstreammux. The official documentation defines it as:
Indicates to muxer that sources are live, e.g. live feeds like an RTSP or USB camera.
In other words, if the video input is an RTSP IP camera or a USB camera (this should also cover CSI cameras), this parameter should be set to 1. Its default value is 0, so for live sources we set it explicitly:
if is_live: # For live video input, we should switch on parameter 'live-source' of streammux to 1.
    streammux.set_property('live-source', 1)
The sink's sync parameter
I didn't find this in the official NVIDIA documentation proper, but it is recorded in the Troubleshooting page, and I use it in my own projects. Output (sink) elements, whether nvoverlaysink, nveglglessink, or fakesink, all have a sync property, and it should be set to 0/False. For udpsink there is also an async property that should be set to False, while sync is set to 1; I don't know why. The code:
sink_rtsp = Gst.ElementFactory.make("udpsink", "udpsink") # example rtsp stream: rtsp://<server IP>:8554/ds-test
sink_rtsp.set_property('async', False)
sink_rtsp.set_property('sync', True)
sink_localdisplay = Gst.ElementFactory.make("nvoverlaysink", "nvvideo-renderer")
sink_localdisplay.set_property("sync", False)
sink_localdisplay = Gst.ElementFactory.make("nveglglessink", "nvvideo-renderer")
sink_localdisplay.set_property("sync", False)
sink_localdisplay = Gst.ElementFactory.make("fakesink", "fakesink") # fakesink is a kind of video output which output nothing.
sink_localdisplay.set_property("sync", False)
Using nvinfer together with nvtracker
Generally speaking, inference takes relatively long, while object tracking is fast.
For nvinfer we can set the interval parameter. Its official definition:
Specifies the number of consecutive batches to be skipped for inference
We can set this parameter either in the config file, e.g. dstest1_pgie_config.txt, or in code, for example:
pgie = Gst.ElementFactory.make("nvinfer", "primary-inference")
pgie.set_property("interval",interval)
Of course, we also need to create the nvtracker element in the pipeline and link it with nvinfer, so the tracker keeps objects updated on the frames where inference is skipped.
tracker = Gst.ElementFactory.make("nvtracker", "tracker")
#...
pgie.link(tracker)
tracker.link(nvvidconv1)
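The "#..." above stands for the tracker's property setup, which I have omitted. As a rough sketch of what it might look like (the tracker-width/height values, library path, and config file name below are assumptions for a DeepStream 6.x install; adjust them to your version):
tracker = Gst.ElementFactory.make("nvtracker", "tracker")
tracker.set_property("tracker-width", 640)    # assumed tracker resolution
tracker.set_property("tracker-height", 384)
# assumed low-level tracker library and sample config shipped with DeepStream 6.x
tracker.set_property("ll-lib-file",
    "/opt/nvidia/deepstream/deepstream/lib/libnvds_nvmultiobjecttracker.so")
tracker.set_property("ll-config-file", "config_tracker_NvDCF_perf.yml")
pipeline.add(tracker)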
Setting qos
The official FAQ has this question:
When deepstream-app is run in loop on Jetson AGX Xavier using “while true; do deepstream-app -c <config_file>; done;”, after a few iterations I see low FPS for certain iterations. Why is that?
The answer:
This may happen when you are running thirty 1080p streams at 30 frames/second. The issue is caused by initial load. I/O operations bog down the CPU, and with qos=1 as a default property of the [sink0] group, decodebin starts dropping frames. To avoid this, set qos=0 in the [sink0] group in the configuration file.
So this problem rarely arises in practice, because we normally don't connect that many cameras. Just in case, though, we can set it in code as well.
The AV Sync in DeepStream page lists many examples with RTSP/RTMP/video-file input and RTSP/RTMP/video-file output, and you can see qos being used there, for example:
RTMP_IN -> RTMP_OUT:
gst-launch-1.0 uridecodebin3 uri=$input1 name=demux1 ! queue ! nvvideoconvert ! "video/x-raw(memory:NVMM)" ! mux1.sink_0 nvstreammux batch-size=2 max-latency=250000000 batched-push-timeout=33333 width=1920 height=1080 sync-inputs=1 name=mux1 ! queue ! nvmultistreamtiler width=480 height=360 ! nvvideoconvert ! "video/x-raw(memory:NVMM)" ! nvv4l2h264enc ! h264parse ! queue ! flvmux name=mux streamable=true ! rtmpsink location=$output async=0 qos=0 sync=1 uridecodebin3 uri=$input2 name=demux2 ! queue ! nvvideoconvert ! "video/x-raw(memory:NVMM)" ! mux1.sink_1 demux1. ! queue ! audioconvert ! mixer.sink_0 audiomixer name=mixer latency=250000000 ! queue ! avenc_aac ! aacparse ! queue ! mux. demux2. ! queue ! audioconvert ! mixer. fakesrc num-buffers=0 is-live=1 ! mixer. -e
RTMP_IN->FILE_OUT:
gst-launch-1.0 uridecodebin3 uri=$input1 name=demux1 ! queue ! nvvideoconvert ! "video/x-raw(memory:NVMM)" ! mux1.sink_0 nvstreammux batch-size=2 batched-push-timeout=33333 width=1920 height=1080 sync-inputs=1 max-latency=250000000 name=mux1 ! queue ! nvmultistreamtiler width=480 height=360 ! nvvideoconvert ! "video/x-raw(memory:NVMM)" ! nvv4l2h264enc ! h264parse ! queue ! flvmux name=mux streamable=true ! filesink location=out.flv async=0 qos=0 sync=1 uridecodebin3 uri=$input2 name=demux2 ! queue ! nvvideoconvert ! "video/x-raw(memory:NVMM)" ! mux1.sink_1 demux1. ! queue ! audioconvert ! mixer.sink_0 audiomixer latency=250000000 name=mixer ! queue ! avenc_aac ! aacparse ! queue ! mux. demux2. ! queue ! audioconvert ! mixer. fakesrc num-buffers=0 is-live=1 ! mixer. -e
RTSP_IN->RTSP_OUT:
gst-launch-1.0 uridecodebin3 uri=$input1 name=demux1 ! queue ! nvvideoconvert ! "video/x-raw(memory:NVMM)" ! mux1.sink_0 nvstreammux batch-size=2 batched-push-timeout=33333 width=1920 height=1080 sync-inputs=1 name=mux1 ! queue ! nvmultistreamtiler width=480 height=360 ! nvrtspoutsinkbin name=r uridecodebin3 uri=$input2 name=demux2 ! queue ! nvvideoconvert ! "video/x-raw(memory:NVMM)" ! mux1.sink_1 demux1. ! queue ! audioconvert ! mixer.sink_0 audiomixer name=mixer ! queue ! r. demux2. ! queue ! audioconvert ! mixer. -e
As you can see, we only need to set qos to 0 on the corresponding sink. In Python this can be written as:
sink_localdisplay = Gst.ElementFactory.make("nveglglessink", "nvvideo-renderer")
sink_localdisplay.set_property("qos", 0)
Setting drop-on-latency
The Troubleshooting page says:
For RTSP streaming input, if the input has high jitter the GStreamer rtpjitterbuffer element might drop packets which are late. Increase the latency property of rtspsrc, for deepstream-app set latency in [source*] group. Alternatively, if using RTSP type source (type=4) with deepstream-app, turn off drop-on-latency in deepstream_source_bin.c. These steps may add cumulative delay in frames reaching the renderer and memory accumulation in the rtpjitterbuffer if the pipeline is not fast enough.
Digging a bit further, I checked the GStreamer page for rtspsrc, which describes drop-on-latency as:
Tells the jitterbuffer to never exceed the given latency in size
In other words, when rtspsrc is our source element, drop-on-latency=1 makes the jitterbuffer drop late packets so it never grows beyond the configured latency; following the troubleshooting note above, turning it off (0) avoids dropping frames from a jittery stream, at the cost of extra delay and memory accumulation in the jitterbuffer. If we use uridecodebin instead, there is nothing to set, because it does not expose this property.
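A minimal sketch (only applicable when the pipeline builds its own rtspsrc; the URL and latency value below are placeholders):
src = Gst.ElementFactory.make("rtspsrc", "rtsp-source")
src.set_property("location", "rtsp://<camera IP>:554/stream")  # placeholder address
src.set_property("latency", 200)              # jitterbuffer latency in milliseconds
src.set_property("drop-on-latency", False)    # keep late packets; may add delay and memory use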
The probe() function
This cause puzzled me for the longest time, until I finally found a clear answer here:
Pipeline unable to perform at real time
WARNING: A lot of buffers are being dropped. (13): gstbasesink.c(2902):
gst_base_sink_is_too_late (): /GstPipeline:pipeline0/GstEglGlesSink:nvvideo-renderer:
There may be a timestamping problem, or this computer is too slow.
Answer:
This could be thrown from any GStreamer app when the pipeline through-put is low
resulting in late buffers at the sink plugin.
This could be a hardware capability limitation or expensive software calls
that hinder pipeline buffer flow.
With python, there is a possibility for such an induced delay in software.
This is with regards to the probe() callbacks an user could leverage
for metadata extraction and other use-cases as demonstrated in our test-apps.
Please NOTE:
a) probe() callbacks are synchronous and thus holds the buffer
(info.get_buffer()) from traversing the pipeline until user return.
b) loops inside probe() callback could be costly in python.
The key point: probe() callbacks are synchronous. Typically we attach a probe after some element, for example after the OSD, and do post-processing there, or collect metadata to send to the cloud. The problem is that this probe function is synchronous: the longer it takes, the more it hurts the real-time behaviour of the whole pipeline. There are several ways to mitigate or solve this. For instance, keep the probe on the main pipeline as small as possible, with as few loops as possible, and instead use tee and queue to create a new branch (and thus a new streaming thread) with its own probe that holds the heavy logic. Or simply spawn a separate thread that runs the heavy logic, with the probe fetching results from it when needed. In short, do not put time-consuming logic in a probe on the main path.
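A minimal sketch of the second option, keeping the probe cheap and doing the heavy work in a worker thread (do_expensive_postprocessing and summarize are hypothetical functions you would write yourself):
import queue
import threading
import pyds
from gi.repository import Gst

meta_queue = queue.Queue(maxsize=100)

def heavy_worker():
    while True:
        item = meta_queue.get()            # blocks until the probe hands over data
        do_expensive_postprocessing(item)  # hypothetical user function (cloud upload, etc.)

threading.Thread(target=heavy_worker, daemon=True).start()

def osd_sink_pad_buffer_probe(pad, info, u_data):
    gst_buffer = info.get_buffer()
    if not gst_buffer:
        return Gst.PadProbeReturn.OK
    batch_meta = pyds.gst_buffer_get_nvds_batch_meta(hash(gst_buffer))
    # Extract only the minimum needed, then return quickly; the buffer is held
    # until this callback returns.
    try:
        meta_queue.put_nowait(summarize(batch_meta))  # hypothetical cheap summary
    except queue.Full:
        pass                               # drop rather than block the pipeline
    return Gst.PadProbeReturn.OK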
In the sample deepstream_test_3.py there is this code:
queue1=Gst.ElementFactory.make("queue","queue1")
queue2=Gst.ElementFactory.make("queue","queue2")
queue3=Gst.ElementFactory.make("queue","queue3")
queue4=Gst.ElementFactory.make("queue","queue4")
queue5=Gst.ElementFactory.make("queue","queue5")
pipeline.add(queue1)
pipeline.add(queue2)
pipeline.add(queue3)
pipeline.add(queue4)
pipeline.add(queue5)
print("Creating Pgie \n ")
pgie = Gst.ElementFactory.make("nvinfer", "primary-inference")
if not pgie:
    sys.stderr.write(" Unable to create pgie \n")
print("Creating tiler \n ")
tiler=Gst.ElementFactory.make("nvmultistreamtiler", "nvtiler")
if not tiler:
    sys.stderr.write(" Unable to create tiler \n")
print("Creating nvvidconv \n ")
nvvidconv = Gst.ElementFactory.make("nvvideoconvert", "convertor")
if not nvvidconv:
    sys.stderr.write(" Unable to create nvvidconv \n")
print("Creating nvosd \n ")
nvosd = Gst.ElementFactory.make("nvdsosd", "onscreendisplay")
if not nvosd:
    sys.stderr.write(" Unable to create nvosd \n")
nvosd.set_property('process-mode',OSD_PROCESS_MODE)
nvosd.set_property('display-text',OSD_DISPLAY_TEXT)
if(is_aarch64()):
    print("Creating transform \n ")
    transform=Gst.ElementFactory.make("nvegltransform", "nvegl-transform")
    if not transform:
        sys.stderr.write(" Unable to create transform \n")
print("Creating EGLSink \n")
sink = Gst.ElementFactory.make("nveglglessink", "nvvideo-renderer")
if not sink:
    sys.stderr.write(" Unable to create egl sink \n")
if is_live:
    print("Atleast one of the sources is live")
    streammux.set_property('live-source', 1)
streammux.set_property('width', 1920)
streammux.set_property('height', 1080)
streammux.set_property('batch-size', number_sources)
streammux.set_property('batched-push-timeout', 4000000)
pgie.set_property('config-file-path', "dstest3_pgie_config.txt")
pgie_batch_size=pgie.get_property("batch-size")
if(pgie_batch_size != number_sources):
    print("WARNING: Overriding infer-config batch-size",pgie_batch_size," with number of sources ", number_sources," \n")
    pgie.set_property("batch-size",number_sources)
tiler_rows=int(math.sqrt(number_sources))
tiler_columns=int(math.ceil((1.0*number_sources)/tiler_rows))
tiler.set_property("rows",tiler_rows)
tiler.set_property("columns",tiler_columns)
tiler.set_property("width", TILED_OUTPUT_WIDTH)
tiler.set_property("height", TILED_OUTPUT_HEIGHT)
sink.set_property("qos",0)
sink.set_property("sync", 0)
print("Adding elements to Pipeline \n")
pipeline.add(pgie)
pipeline.add(tiler)
pipeline.add(nvvidconv)
pipeline.add(nvosd)
if is_aarch64():
    pipeline.add(transform)
pipeline.add(sink)
print("Linking elements in the Pipeline \n")
streammux.link(queue1)
queue1.link(pgie)
pgie.link(queue2)
queue2.link(tiler)
tiler.link(queue3)
queue3.link(nvvidconv)
nvvidconv.link(queue4)
queue4.link(nvosd)
if is_aarch64():
    nvosd.link(queue5)
    queue5.link(transform)
    transform.link(sink)
else:
    nvosd.link(queue5)
    queue5.link(sink)
Notice how many queue elements appear when the pipeline is assembled. Why? This is how GStreamer forces the use of new threads in a pipeline. Here is the original text from the GStreamer documentation:
There are several reasons to force the use of threads. However, for performance reasons, you never want to use one thread for every element out there, since that will create some overhead. Let’s now list some situations where threads can be particularly useful:
Above, we’ve mentioned the “queue” element several times now. A queue is the thread boundary element through which you can force the use of threads. It does so by using a classic provider/consumer model as learned in threading classes at universities all around the world. By doing this, it acts both as a means to make data throughput between threads threadsafe, and it can also act as a buffer.
Note that for a queue feeding a heavy probe, its leaky property should be set to 2. The idea is that since the probe is too expensive to process every frame, we deliberately drop some frames; a value of 2 ("downstream") drops the oldest buffers.
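Putting the two ideas together, a minimal sketch of a tee/queue side branch for a heavy probe (it reuses the pipeline, nvosd, and transform objects from the sample above and would replace the nvosd -> queue5 link there; heavy_probe_callback is a user-defined probe function):
tee = Gst.ElementFactory.make("tee", "probe-tee")
queue_main = Gst.ElementFactory.make("queue", "queue-main")
queue_probe = Gst.ElementFactory.make("queue", "queue-probe")
probe_sink = Gst.ElementFactory.make("fakesink", "probe-sink")
queue_probe.set_property("leaky", 2)      # 2 = downstream: drop the oldest buffers
probe_sink.set_property("sync", False)
for elem in (tee, queue_main, queue_probe, probe_sink):
    pipeline.add(elem)
nvosd.link(tee)                 # main path: nvosd -> tee
tee.link(queue_main)            # branch 1 continues toward the renderer
queue_main.link(transform)      # (link to sink directly on non-Jetson platforms)
tee.link(queue_probe)           # branch 2 only feeds the heavy probe
queue_probe.link(probe_sink)
probe_pad = queue_probe.get_static_pad("src")
probe_pad.add_probe(Gst.PadProbeType.BUFFER, heavy_probe_callback, 0)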
In the References, the official DeepStream documentation also lists some other possible causes of display latency, stutter, freezes, or application crashes. I think they are unlikely, or I don't fully understand them, but I list them here as well: