• gunicorn+flask+PaddleOCR


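    Preface

    Our company's business is 2G (to-government), so paid public-cloud OCR APIs are off the table (they are also a security concern). As a result we tried several open-source OCR frameworks internally. The first was gosseract, a Go wrapper around Tesseract. With the English model it does somewhat better on digits and letters, but accuracy was still only a little over 80%. After that we used gunicorn+flask+PaddleOCR to build a simple web service.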

    gosseract (build your own Ubuntu base image)

    Dockerfile

    # Assumed base image: the bionic sources below imply Ubuntu 18.04
    FROM ubuntu:18.04
    RUN echo 'deb http://mirrors.163.com/ubuntu/ bionic main restricted universe multiverse \n\
    deb http://mirrors.163.com/ubuntu/ bionic-security main restricted universe multiverse \n\
    deb http://mirrors.163.com/ubuntu/ bionic-updates main restricted universe multiverse \n\
    deb http://mirrors.163.com/ubuntu/ bionic-proposed main restricted universe multiverse \n\
    deb http://mirrors.163.com/ubuntu/ bionic-backports main restricted universe multiverse \n\
    deb-src http://mirrors.163.com/ubuntu/ bionic main restricted universe multiverse \n\
    deb-src http://mirrors.163.com/ubuntu/ bionic-security main restricted universe multiverse \n\
    deb-src http://mirrors.163.com/ubuntu/ bionic-updates main restricted universe multiverse \n\
    deb-src http://mirrors.163.com/ubuntu/ bionic-proposed main restricted universe multiverse \n\
    deb-src http://mirrors.163.com/ubuntu/ bionic-backports main restricted universe multiverse' > /etc/apt/sources.list
    ENV DEBIAN_FRONTEND=noninteractive
    # Common tools plus the Asia/Shanghai time zone
    RUN apt-get update && \
        apt-get -y install vim wget net-tools curl sudo make telnet iputils-ping tzdata git gcc libtbb2 zip && \
        ln -sf /usr/share/zoneinfo/Asia/Shanghai /etc/localtime && echo "Asia/Shanghai" > /etc/timezone
    # Build dependencies for Tesseract
    RUN apt-get -y install automake ca-certificates g++ git libtool libleptonica-dev make pkg-config
    # Build and install Tesseract from source
    RUN git clone https://github.com/tesseract-ocr/tesseract.git && cd tesseract && ./autogen.sh && ./configure && make && make install && ldconfig
    # libleptonica needs a symlink before it can be used
    RUN ln -s /usr/lib/x86_64-linux-gnu/liblept.so /usr/lib/x86_64-linux-gnu/libleptonica.so

    Build a base image from the Dockerfile above, then build the production image for your own Go code on top of that base image.

    gunicorn+flask+PaddleOCR

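    gunicorn is a WSGI server. It sits between the network and the Flask application, somewhat like a gateway (roughly what PHP-FPM is to PHP), and it runs the application as multiple worker processes.

    Dockerfile (base image)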
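To make the WSGI point concrete, here is a minimal sketch of the callable interface that gunicorn talks to, using only the standard library. The names `application` and `call_app` are illustrative and not part of the original setup; Flask's `app` object exposes the same kind of callable.

```python
from wsgiref.util import setup_testing_defaults


def application(environ, start_response):
    """A bare WSGI app: the server calls this once per HTTP request."""
    body = b"Hello World!"
    headers = [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    return [body]


def call_app(app, path="/"):
    """Invoke a WSGI app in-process, the way a server would (hypothetical helper)."""
    environ = {}
    setup_testing_defaults(environ)  # fills in a minimal valid WSGI environ
    environ["PATH_INFO"] = path
    captured = {}

    def start_response(status, response_headers, exc_info=None):
        captured["status"] = status
        captured["headers"] = dict(response_headers)

    body = b"".join(app(environ, start_response))
    return captured["status"], captured["headers"], body
```

Saved as, say, `wsgi_demo.py`, a callable like this could be served with `gunicorn -w 3 wsgi_demo:application`; each of the three workers is a separate process that imports the module and calls `application` for the requests it receives.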

    FROM registry.baidubce.com/paddlepaddle/paddle:2.1.3-gpu-cuda10.2-cudnn7
    RUN echo 'deb http://mirrors.163.com/ubuntu/ bionic main restricted universe multiverse \n\
    deb http://mirrors.163.com/ubuntu/ bionic-security main restricted universe multiverse \n\
    deb http://mirrors.163.com/ubuntu/ bionic-updates main restricted universe multiverse \n\
    deb http://mirrors.163.com/ubuntu/ bionic-proposed main restricted universe multiverse \n\
    deb http://mirrors.163.com/ubuntu/ bionic-backports main restricted universe multiverse \n\
    deb-src http://mirrors.163.com/ubuntu/ bionic main restricted universe multiverse \n\
    deb-src http://mirrors.163.com/ubuntu/ bionic-security main restricted universe multiverse \n\
    deb-src http://mirrors.163.com/ubuntu/ bionic-updates main restricted universe multiverse \n\
    deb-src http://mirrors.163.com/ubuntu/ bionic-proposed main restricted universe multiverse \n\
    deb-src http://mirrors.163.com/ubuntu/ bionic-backports main restricted universe multiverse' > /etc/apt/sources.list
    # Remove the other apt configuration, otherwise extra sources are loaded and apt fails
    RUN rm -rf /etc/apt/apt.conf.d/* /etc/apt/sources.list.d/
    # The libssl shipped in the image did not work (version mismatch), so reinstall it
    RUN apt update && apt remove -y libssl-dev && apt install -y libssl-dev
    RUN python3 -m pip install paddlepaddle "paddleocr>=2.0.1" -i https://mirror.baidu.com/pypi/simple
    RUN pip3 install gunicorn gevent flask -i https://mirror.baidu.com/pypi/simple
    RUN echo "from paddleocr import PaddleOCR" >> download.py
    # Preload the English model at build time so it is not downloaded after the service starts.
    # To preload the Chinese model as well, duplicate the next line with lang="ch".
    RUN echo "PaddleOCR(use_angle_cls=True, lang=\"en\")" >> download.py
    RUN python3 download.py

    Production image

    FROM ****/ocr_base:0.0.2
    WORKDIR /workspace
    COPY ./app/ocr/app.py /workspace/app.py
    # Start 3 workers. Do not use too many: once the model is loaded it can consume nearly 2 GB of physical memory
    CMD cd /workspace && gunicorn -b 0.0.0.0:8000 -w 3 -k gevent --access-logfile - app:app
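The flags on the CMD line can also live in a gunicorn config file, which is plain Python. The sketch below mirrors the command above; the file name `gunicorn.conf.py` and the `OCR_WORKERS` environment variable are assumptions for illustration, not part of the original image.

```python
# gunicorn.conf.py: a sketch of the same settings as the CMD line
import os

# -b 0.0.0.0:8000
bind = "0.0.0.0:8000"

# -w 3; OCR_WORKERS is a hypothetical override so the count can be
# changed per deployment without rebuilding the image
workers = int(os.environ.get("OCR_WORKERS", "3"))

# -k gevent
worker_class = "gevent"

# --access-logfile - (write the access log to stdout)
accesslog = "-"
```

With this file in `/workspace`, the command would become `gunicorn -c gunicorn.conf.py app:app`. Keeping the worker count in one place makes it easier to match it to the memory available on each host.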

    Here app.py is the Flask entry file:

    import urllib.error
    import urllib.request

    from flask import Flask, request
    from paddleocr import PaddleOCR


    def save_image(url, outputfile):
        """Download url to outputfile; return True on success (helper, not used by the routes below)."""
        try:
            response = urllib.request.urlopen(url)
            data = response.read()
            with open(outputfile, "wb") as file:
                file.write(data)
            return True
        except urllib.error.URLError as e:
            print("Error occurred while retrieving the URL:", e)
            return False


    app = Flask(__name__)
    # Load the model once at import time, so each gunicorn worker holds one copy
    ocr = PaddleOCR(use_angle_cls=True, lang="en")


    @app.route("/")
    def hello():
        print("-------------------")
        return "Hello World!"


    @app.post("/ocr/check")
    def check_post():
        ret = {}
        # silent=True returns None instead of an error page for a missing or invalid JSON body
        req = request.get_json(silent=True) or {}
        print(req, type(req))
        url = req.get("url")
        if url is None:
            ret["code"] = -1
            ret["msg"] = "param url lost"
            ret["data"] = []
            return ret
        result = ocr.ocr(url, cls=True)
        if not result:
            ret["code"] = -3
            ret["msg"] = "ocr result empty"
            ret["data"] = []
            return ret
        data = []
        for res in result:
            # Some PaddleOCR versions return None for an image with no detected text
            if not res:
                continue
            for line in res:
                # The last element of each line is (text, score)
                text, score = line[-1]
                data.append({"text": text, "score": score})
        ret["code"] = 0
        ret["msg"] = ""
        ret["data"] = data
        return ret


    if __name__ == "__main__":
        app.run()
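For completeness, here is a sketch of how a client might call the `/ocr/check` route using only the standard library. The base address `http://127.0.0.1:8000` and the function names `build_ocr_request` and `recognize` are assumptions for illustration; the request body and the `code`/`msg`/`data` response fields follow app.py.

```python
import json
import urllib.request


def build_ocr_request(image_url, base="http://127.0.0.1:8000"):
    """Build the JSON POST request for /ocr/check (host and port are assumed)."""
    payload = json.dumps({"url": image_url}).encode("utf-8")
    return urllib.request.Request(
        base + "/ocr/check",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def recognize(image_url, base="http://127.0.0.1:8000", timeout=30):
    """Send one image URL to the service and return the recognized texts."""
    req = build_ocr_request(image_url, base)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        body = json.loads(resp.read().decode("utf-8"))
    # The service signals errors through a non-zero "code" field
    if body["code"] != 0:
        raise RuntimeError(body["msg"])
    return [item["text"] for item in body["data"]]
```

Calling `recognize("http://example.com/captcha.png")` (a placeholder URL) against a running container would return the list of text lines found in that image. The image URL has to be reachable from inside the container, because the service fetches the image itself.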

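    At this point you have a working OCR web service.

    For ordinary inputs such as CAPTCHAs it recognizes about 10 images per second. For higher throughput, raise the worker count in the production image, at the cost of more CPU and memory. (PaddleOCR also supports GPUs; this setup runs on the CPU by default.)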
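One way to check a figure like "10 images per second" on your own hardware is to time a batch of calls. The following is a rough sketch: `measure_throughput` is a hypothetical helper, and `recognize` stands for any callable that submits one image to the service and returns its result.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def measure_throughput(recognize, items, concurrency=4):
    """Run recognize(item) over items in a thread pool.

    Returns (items_per_second, results), with results in input order.
    """
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(recognize, items))
    elapsed = time.perf_counter() - start
    rate = len(results) / elapsed if elapsed > 0 else float("inf")
    return rate, results
```

Setting `concurrency` at or above the gunicorn worker count keeps every worker busy, so the measured rate reflects the service rather than the client. Repeating the measurement with different `-w` values shows whether adding workers still pays off on a given machine.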

  • Original article: https://blog.csdn.net/qq_29493353/article/details/133697895