gunicorn+flask+PaddleOCR

Preface

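Our company is a 2G (government-facing) business, so paid public-cloud OCR APIs are not an option (they are also a security concern). As a result we tried several open source OCR frameworks in-house. The first attempt was gosseract, a Go wrapper around Tesseract OCR. With the English model it did somewhat better on digits and letters, but accuracy was still only a little over 80%. After that we used gunicorn + flask + PaddleOCR to put together a simple web service.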

gosseract (build your own Ubuntu base image)

Dockerfile


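# Assumption: the original listing omits FROM; the bionic sources below imply Ubuntu 18.04
FROM ubuntu:18.04

RUN echo 'deb http://mirrors.163.com/ubuntu/ bionic main restricted universe multiverse \n\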
    deb http://mirrors.163.com/ubuntu/ bionic-security main restricted universe multiverse \n\
    deb http://mirrors.163.com/ubuntu/ bionic-updates main restricted universe multiverse \n\
    deb http://mirrors.163.com/ubuntu/ bionic-proposed main restricted universe multiverse \n\
    deb http://mirrors.163.com/ubuntu/ bionic-backports main restricted universe multiverse \n\
    deb-src http://mirrors.163.com/ubuntu/ bionic main restricted universe multiverse \n\
    deb-src http://mirrors.163.com/ubuntu/ bionic-security main restricted universe multiverse \n\
    deb-src http://mirrors.163.com/ubuntu/ bionic-updates main restricted universe multiverse \n\
    deb-src http://mirrors.163.com/ubuntu/ bionic-proposed main restricted universe multiverse \n\
    deb-src http://mirrors.163.com/ubuntu/ bionic-backports main restricted universe multiverse' > /etc/apt/sources.list
 
ENV DEBIAN_FRONTEND=noninteractive
RUN apt-get update && \
    apt-get -y install vim wget net-tools curl sudo make telnet iputils-ping tzdata git gcc libtbb2 zip && \
    ln -sf /usr/share/zoneinfo/Asia/Shanghai  /etc/localtime && echo "Asia/Shanghai" > /etc/timezone
 
RUN apt-get -y install automake ca-certificates g++ git libtool libleptonica-dev make pkg-config
RUN git clone https://github.com/tesseract-ocr/tesseract.git && cd tesseract && ./autogen.sh && ./configure && make && make install && ldconfig
# libleptonica needs a symlink under this name before it can be used
RUN ln -s  /usr/lib/x86_64-linux-gnu/liblept.so /usr/lib/x86_64-linux-gnu/libleptonica.so

Build a base image from the Dockerfile above, then build the production image for your own Go code on top of that base image.

gunicorn+flask+PaddleOCR

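gunicorn is a WSGI server: it sits between incoming HTTP requests and the Python application, acting as the gateway in roughly the way PHP-FPM does for PHP. It manages the application with multiple worker processes.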
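To make the WSGI contract concrete, here is a minimal sketch of the kind of callable a WSGI server such as gunicorn loads. The names `simple_app` and `call_app` are made up for illustration and are not part of the service in this post; Flask's `app` object exposes the same interface.

```python
# A bare WSGI application: a callable taking (environ, start_response).
# All names here are illustrative only.

def simple_app(environ, start_response):
    """Return a plain-text body that echoes the request path."""
    body = ("path=" + environ.get("PATH_INFO", "/")).encode("utf-8")
    headers = [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    return [body]


def call_app(app, path):
    """Invoke a WSGI app by hand, roughly as a server worker would."""
    captured = {}

    def start_response(status, headers, exc_info=None):
        # Record what the application reports back to the server.
        captured["status"] = status
        captured["headers"] = dict(headers)

    environ = {"REQUEST_METHOD": "GET", "PATH_INFO": path}
    chunks = app(environ, start_response)
    return captured["status"], b"".join(chunks)
```

If this were saved as `demo.py`, `gunicorn -w 3 demo:simple_app` would serve it, which is the same `module:callable` form as the `app:app` used later in the production image.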

Dockerfile (base image)


FROM registry.baidubce.com/paddlepaddle/paddle:2.1.3-gpu-cuda10.2-cudnn7
 
RUN echo 'deb http://mirrors.163.com/ubuntu/ bionic main restricted universe multiverse \n\
    deb http://mirrors.163.com/ubuntu/ bionic-security main restricted universe multiverse \n\
    deb http://mirrors.163.com/ubuntu/ bionic-updates main restricted universe multiverse \n\
    deb http://mirrors.163.com/ubuntu/ bionic-proposed main restricted universe multiverse \n\
    deb http://mirrors.163.com/ubuntu/ bionic-backports main restricted universe multiverse \n\
    deb-src http://mirrors.163.com/ubuntu/ bionic main restricted universe multiverse \n\
    deb-src http://mirrors.163.com/ubuntu/ bionic-security main restricted universe multiverse \n\
    deb-src http://mirrors.163.com/ubuntu/ bionic-updates main restricted universe multiverse \n\
    deb-src http://mirrors.163.com/ubuntu/ bionic-proposed main restricted universe multiverse \n\
    deb-src http://mirrors.163.com/ubuntu/ bionic-backports main restricted universe multiverse' > /etc/apt/sources.list
# Without this, apt loads other sources and fails
RUN rm -rf /etc/apt/apt.conf.d/* /etc/apt/sources.list.d/
# For some reason the preinstalled libssl is unusable (version mismatch), so reinstall it
RUN apt update && apt remove -y libssl-dev && apt install -y libssl-dev
 
RUN python3 -m pip install paddlepaddle "paddleocr>=2.0.1" -i https://mirror.baidu.com/pypi/simple
RUN pip3 install gunicorn gevent flask  -i https://mirror.baidu.com/pypi/simple
 
RUN echo "from paddleocr import PaddleOCR" >> download.py
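# Preload the English model at build time so it is not downloaded after the service starts.
# To preload the Chinese model as well, add a similar line (for example with lang="ch").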
RUN echo "PaddleOCR(use_angle_cls=True, lang=\"en\")" >> download.py
RUN python3 download.py

Production image


FROM ****/ocr_base:0.0.2
 
WORKDIR /workspace
 
COPY ./app/ocr/app.py /workspace/app.py
# Start 3 workers here. Do not use too many: once the model is loaded, a worker can use nearly 2 GB of physical memory
CMD cd /workspace && gunicorn -b 0.0.0.0:8000 -w 3 -k gevent --access-logfile - app:app

Here app.py is the Flask entry file:


import urllib.error
import urllib.request
 
from flask import Flask, request
from paddleocr import PaddleOCR
 
 
def save_image(url, outputfile):
    """Download url to outputfile. Helper only; the routes below do not call it."""
    try:
        response = urllib.request.urlopen(url)
        data = response.read()
        with open(outputfile, "wb") as file:
            file.write(data)
        return True
    except urllib.error.URLError as e:
        print("Error occurred while retrieving the URL:", e)
        return False
 
app = Flask(__name__)
ocr = PaddleOCR(use_angle_cls=True, lang="en")
 
@app.route("/")
def hello():
    print("-------------------")
    return "Hello World!"
 
@app.post("/ocr/check")
def check_post():
    ret = {}
    # silent=True returns None instead of raising when the body is not valid JSON
    req = request.get_json(silent=True) or {}
    url = req.get("url")
    if url is None:
        ret["code"] = -1
        ret["msg"] = "param url lost"
        ret["data"] = []
        return ret
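    # The URL is passed straight to PaddleOCR, which fetches the image itself
    result = ocr.ocr(url, cls=True)
    data = []
    # One entry per image; an entry can be None when no text was detected
    for res in result or []:
        for line in res or []:
            # Each line is [box, (text, score)]
            text, score = line[-1]
            # float() keeps the score JSON-serializable if it is a numpy scalar
            data.append({"text": text, "score": float(score)})
    if not data:
        ret["code"] = -3
        ret["msg"] = "ocr result empty"
        ret["data"] = []
        return ret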
    ret["code"] = 0
    ret["msg"] = ""
    ret["data"] = data    
    return ret
 
if __name__ == "__main__" :
    app.run()
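
The nested loop in `check_post` is easier to follow with the result layout written out. The sketch below assumes the layout that the indexing in the code above implies: one list per image, each line being `[box, (text, score)]`. The helper name `flatten_ocr_result` and the sample values are made up for illustration and are not real OCR output.

```python
def flatten_ocr_result(result):
    """Flatten PaddleOCR-style output into [{"text": ..., "score": ...}, ...]."""
    data = []
    for page in result or []:
        # A page may be None or empty when nothing was detected.
        for line in page or []:
            text, score = line[-1]
            data.append({"text": text, "score": float(score)})
    return data


# Made-up sample: one image containing two detected text lines.
sample = [[
    [[[10, 10], [90, 10], [90, 30], [10, 30]], ("A7K9", 0.98)],
    [[[10, 40], [90, 40], [90, 60], [10, 60]], ("2024", 0.91)],
]]

flat = flatten_ocr_result(sample)
```

Here `flat` holds two dictionaries, one per text line, which is the shape the service returns under the `data` key.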

At this point you have a working OCR web service.
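A small client makes it easier to try the service. This is a sketch using only the standard library; the function names `build_ocr_request` and `recognize` and the base URL in the usage note are assumptions, while the `/ocr/check` path and the `url` field come from app.py above.

```python
import json
import urllib.request


def build_ocr_request(base_url, image_url):
    """Build the POST request for /ocr/check without sending it."""
    payload = json.dumps({"url": image_url}).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + "/ocr/check",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def recognize(base_url, image_url, timeout=30):
    """Send the request and return the decoded JSON reply as a dict."""
    req = build_ocr_request(base_url, image_url)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With the container listening on port 8000, a call such as `recognize("http://127.0.0.1:8000", image_url)` should return a dictionary with the `code`, `msg`, and `data` keys that `check_post` fills in.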

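For ordinary images such as CAPTCHAs, the service recognizes about 10 images per second. If you need more throughput, raise the worker count in the production image, at the cost of more CPU and memory. (PaddleOCR also supports GPU; this setup uses the CPU by default.)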
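One way to express this trade-off is a gunicorn config file written in Python, sketched below. The roughly 2 GB per worker comes from the comment in the production Dockerfile; the 8 GB memory budget and the helper name `pick_workers` are assumptions to adjust for your own host. `bind`, `workers`, `worker_class`, and `accesslog` are standard gunicorn settings and mirror the flags in the CMD line.

```python
# gunicorn.conf.py (sketch)
import multiprocessing


def pick_workers(mem_budget_gb, per_worker_gb=2.0, cpu_count=None):
    """Cap the worker count by both memory and CPU cores, with a floor of 1."""
    if cpu_count is None:
        cpu_count = multiprocessing.cpu_count()
    by_memory = int(mem_budget_gb // per_worker_gb)
    return max(1, min(by_memory, cpu_count))


bind = "0.0.0.0:8000"
workers = pick_workers(mem_budget_gb=8)  # assumed budget, not from the original post
worker_class = "gevent"
accesslog = "-"  # write the access log to stdout
```

It would be started with `gunicorn -c gunicorn.conf.py app:app` in place of the inline flags.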

Original article: https://blog.csdn.net/qq_29493353/article/details/133697895