• [Deep Learning Frameworks - Paddle] Cause of ExternalError: CUDNN error(4), CUDNN_STATUS_INTERNAL_ERROR


    Error description

    System Environment:
    Ubuntu 16
    Version: Paddle 2.3.1; PaddleOCR: latest version from git clone
    Related components:
    Error when running joint SER+RE inference
    Command Code:

    export CUDA_VISIBLE_DEVICES=0
    python3 tools/infer_vqa_token_ser_re.py -c configs/vqa/re/layoutxlm.yml -o Architecture.Backbone.checkpoints=pretrain/re_LayoutXLM_xfun_zh/ Global.infer_img=doc/vqa/input/zh_val_21.jpg -c_ser configs/vqa/ser/layoutxlm.yml -o_ser Architecture.Backbone.checkpoints=pretrain/ser_LayoutXLM_xfun_zh/
    

    Complete Error Message:

    Traceback (most recent call last):
      File "tools/infer_vqa_token_ser_re.py", line 193, in <module>
        result = ser_re_engine(img_path)
      File "tools/infer_vqa_token_ser_re.py", line 135, in __call__
        ser_results, ser_inputs = self.ser_engine(img_path)
      File "/data1/liushu/test_PPOCR/PaddleOCR/tools/infer_vqa_token_ser.py", line 98, in __call__
        batch = transform(data, self.ops)
      File "/data1/liushu/test_PPOCR/PaddleOCR/ppocr/data/imaug/__init__.py", line 51, in transform
        data = op(data)
      File "/data1/liushu/test_PPOCR/PaddleOCR/ppocr/data/imaug/label_ops.py", line 885, in __call__
        ocr_info = self._load_ocr_info(data)
      File "/data1/liushu/test_PPOCR/PaddleOCR/ppocr/data/imaug/label_ops.py", line 988, in _load_ocr_info
        ocr_result = self.ocr_engine.ocr(data['image'], cls=False)
      File "/data1/liushu/test_PPOCR/PaddleOCR/paddleocr.py", line 480, in ocr
        dt_boxes, rec_res = self.__call__(img, cls)
      File "/data1/liushu/test_PPOCR/PaddleOCR/tools/infer/predict_system.py", line 69, in __call__
        dt_boxes, elapse = self.text_detector(img)
      File "/data1/liushu/test_PPOCR/PaddleOCR/tools/infer/predict_det.py", line 218, in __call__
        self.predictor.run()
    OSError: In user code:
    
        File "tools/export_model.py", line 172, in <module>
          main()
        File "tools/export_model.py", line 165, in main
          sub_model_save_path, logger)
        File "tools/export_model.py", line 99, in export_single_model
          paddle.jit.save(model, save_path)
        File "", line 2, in save
          
        File "/usr/local/lib/python3.7/site-packages/paddle/fluid/wrapped_decorator.py", line 25, in __impl__
          return wrapped_func(*args, **kwargs)
        File "/usr/local/lib/python3.7/site-packages/paddle/fluid/dygraph/base.py", line 51, in __impl__
          return func(*args, **kwargs)
        File "/usr/local/lib/python3.7/site-packages/paddle/fluid/dygraph/jit.py", line 744, in save
          inner_input_spec)
        File "/usr/local/lib/python3.7/site-packages/paddle/fluid/dygraph/dygraph_to_static/program_translator.py", line 517, in concrete_program_specify_input_spec
          *desired_input_spec)
        File "/usr/local/lib/python3.7/site-packages/paddle/fluid/dygraph/dygraph_to_static/program_translator.py", line 427, in get_concrete_program
          concrete_program, partial_program_layer = self._program_cache[cache_key]
        File "/usr/local/lib/python3.7/site-packages/paddle/fluid/dygraph/dygraph_to_static/program_translator.py", line 723, in __getitem__
          self._caches[item] = self._build_once(item)
        File "/usr/local/lib/python3.7/site-packages/paddle/fluid/dygraph/dygraph_to_static/program_translator.py", line 714, in _build_once
          **cache_key.kwargs)
        File "", line 2, in from_func_spec
          
        File "/usr/local/lib/python3.7/site-packages/paddle/fluid/wrapped_decorator.py", line 25, in __impl__
          return wrapped_func(*args, **kwargs)
        File "/usr/local/lib/python3.7/site-packages/paddle/fluid/dygraph/base.py", line 51, in __impl__
          return func(*args, **kwargs)
        File "/usr/local/lib/python3.7/site-packages/paddle/fluid/dygraph/dygraph_to_static/program_translator.py", line 662, in from_func_spec
          outputs = static_func(*inputs)
        File "/paddle/debug/PaddleOCR/ppocr/modeling/architectures/base_model.py", line 79, in forward
          x = self.backbone(x)
        File "/usr/local/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py", line 917, in __call__
          return self._dygraph_call_func(*inputs, **kwargs)
        File "/usr/local/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py", line 907, in _dygraph_call_func
          outputs = self.forward(*inputs, **kwargs)
        File "/paddle/debug/PaddleOCR/ppocr/modeling/backbones/det_mobilenet_v3.py", line 146, in forward
          x = self.conv(x)
        File "/usr/local/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py", line 917, in __call__
          return self._dygraph_call_func(*inputs, **kwargs)
        File "/usr/local/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py", line 907, in _dygraph_call_func
          outputs = self.forward(*inputs, **kwargs)
        File "/paddle/debug/PaddleOCR/ppocr/modeling/backbones/det_mobilenet_v3.py", line 179, in forward
          x = self.conv(x)
        File "/usr/local/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py", line 917, in __call__
          return self._dygraph_call_func(*inputs, **kwargs)
        File "/usr/local/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py", line 907, in _dygraph_call_func
          outputs = self.forward(*inputs, **kwargs)
        File "/usr/local/lib/python3.7/site-packages/paddle/nn/layer/conv.py", line 677, in forward
          use_cudnn=self._use_cudnn)
        File "/usr/local/lib/python3.7/site-packages/paddle/nn/functional/conv.py", line 148, in _conv_nd
          type=op_type, inputs=inputs, outputs=outputs, attrs=attrs)
        File "/usr/local/lib/python3.7/site-packages/paddle/fluid/layer_helper.py", line 43, in append_op
          return self.main_program.current_block().append_op(*args, **kwargs)
        File "/usr/local/lib/python3.7/site-packages/paddle/fluid/framework.py", line 3184, in append_op
          attrs=kwargs.get("attrs", None))
        File "/usr/local/lib/python3.7/site-packages/paddle/fluid/framework.py", line 2224, in __init__
          for frame in traceback.extract_stack():
    
        ExternalError: CUDNN error(4), CUDNN_STATUS_INTERNAL_ERROR. 
          [Hint: 'CUDNN_STATUS_INTERNAL_ERROR'.  An internal cuDNN operation failed.  ] (at /paddle/paddle/phi/backends/gpu/gpu_resources.cc:211)
          [operator < conv2d_fusion > error]
    

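    Troubleshooting

    This had not happened before: OCR+SER inference on its own ran fine, but joint OCR+SER+RE inference failed with the error above.

    1. My first reaction was to search online. The suggested fixes fell into two groups: 1) the versions and the environment are incompatible and need reinstalling; 2) insufficient resources may cause the error. I only came across the second after ruling out the first.
    2. Running SER alone produced no error, which indicates that the environment configuration and versions were fine.
    3. To check whether my own code changes were the cause, I re-cloned the PPOCR code from GitHub and ran it again. It still failed.
    4. The failure in step 3 showed that the code was not at fault. With code and environment ruled out, only resources remained. I freed some GPU memory and, to my surprise, the run succeeded.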
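The resource check in step 4 can be made explicit before launching inference. The following is a minimal sketch, not part of the original post: it assumes `nvidia-smi` is on the PATH, and the helper names (`parse_gpu_memory`, `query_gpu_memory`, `has_enough_memory`) and any memory threshold passed to them are hypothetical, chosen only for illustration.

```python
import shutil
import subprocess

# nvidia-smi query for per-GPU memory; with "nounits" the values are in MiB.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_memory(csv_text):
    """Parse 'index, used, total' CSV lines into a list of dicts (MiB)."""
    gpus = []
    for line in csv_text.strip().splitlines():
        # Assumes every field is a plain integer, e.g. "0, 10240, 11264".
        index, used, total = (int(field.strip()) for field in line.split(","))
        gpus.append(
            {"index": index, "used": used, "total": total, "free": total - used}
        )
    return gpus


def query_gpu_memory():
    """Run nvidia-smi and return per-GPU memory; empty list if it is missing."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_memory(result.stdout)


def has_enough_memory(gpus, device_index, required_mib):
    """Return True if the given GPU reports at least required_mib MiB free."""
    for gpu in gpus:
        if gpu["index"] == device_index:
            return gpu["free"] >= required_mib
    return False
```

For example, `has_enough_memory(query_gpu_memory(), 0, 4096)` would check the GPU selected by `CUDA_VISIBLE_DEVICES=0` against a 4096 MiB threshold. That threshold is an arbitrary placeholder, since the post does not say how much memory the SER+RE pipeline needs, and the check assumes the `nvidia-smi` index matches the CUDA device index on the machine.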

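    Summary

    The troubleshooting steps look simple when written down, but the problem took 3 hours to solve. Painful.
    Still, solving it taught me that errors come from essentially three sources: 1) logic errors and matrix-operation errors, 2) versions (environment configuration), 3) compute resources.
    This is only a rough division, of course; each category has finer subcategories.
    Going forward, I can classify the problems I run into along these lines, so that bugs become fewer and fewer, hehe.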

  • Original article: https://blog.csdn.net/qq_36287702/article/details/126348997