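Because our company is 2G (to-government), paid public-cloud OCR APIs are off the table (they are also a security concern), so we tried several open-source OCR frameworks in-house. The first was gosseract, an OCR module for golang that wraps Tesseract. With the English model it does somewhat better on digits and letters, but accuracy was still only a little over 80%. After that we put together a simple web service with gunicorn + flask + PaddleOCR.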
dockerfile
- RUN echo 'deb http://mirrors.163.com/ubuntu/ bionic main restricted universe multiverse \n\
- deb http://mirrors.163.com/ubuntu/ bionic-security main restricted universe multiverse \n\
- deb http://mirrors.163.com/ubuntu/ bionic-updates main restricted universe multiverse \n\
- deb http://mirrors.163.com/ubuntu/ bionic-proposed main restricted universe multiverse \n\
- deb http://mirrors.163.com/ubuntu/ bionic-backports main restricted universe multiverse \n\
- deb-src http://mirrors.163.com/ubuntu/ bionic main restricted universe multiverse \n\
- deb-src http://mirrors.163.com/ubuntu/ bionic-security main restricted universe multiverse \n\
- deb-src http://mirrors.163.com/ubuntu/ bionic-updates main restricted universe multiverse \n\
- deb-src http://mirrors.163.com/ubuntu/ bionic-proposed main restricted universe multiverse \n\
- deb-src http://mirrors.163.com/ubuntu/ bionic-backports main restricted universe multiverse' > /etc/apt/sources.list
-
- ENV DEBIAN_FRONTEND noninteractive
- RUN apt-get update && \
- apt-get -y install vim wget net-tools curl sudo make telnet iputils-ping tzdata git gcc libtbb2 zip && \
- ln -sf /usr/share/zoneinfo/Asia/Shanghai /etc/localtime && echo "Asia/Shanghai" > /etc/timezone
-
- RUN apt-get -y install automake ca-certificates g++ git libtool libleptonica-dev make pkg-config
- RUN git clone https://github.com/tesseract-ocr/tesseract.git && cd tesseract && ./autogen.sh && ./configure && make && make install && ldconfig
- # libleptonica needs a symlink before it can be used
- RUN ln -s /usr/lib/x86_64-linux-gnu/liblept.so /usr/lib/x86_64-linux-gnu/libleptonica.so
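Then build a base image from the Dockerfile above, and have your own golang code build its production image on top of that base image.
gunicorn is a WSGI server. It sits in front of the Python application much like a gateway (compare how PHP is served) and manages the application with multiple worker processes.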
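To make "WSGI server" a bit more concrete, here is a minimal sketch of the interface gunicorn talks to. `demo_app` and `call_wsgi` are hypothetical names used only for illustration. A flask `app` object is the same kind of callable, which is why `app:app` can be handed to gunicorn later on.

```python
from wsgiref.util import setup_testing_defaults


def demo_app(environ, start_response):
    """A bare WSGI application: the callable a WSGI server invokes once per request."""
    body = b"Hello World!"
    headers = [("Content-Type", "text/plain"), ("Content-Length", str(len(body)))]
    start_response("200 OK", headers)
    return [body]


def call_wsgi(app, path="/"):
    """Call a WSGI app by hand, roughly as a server would, and return (status, body)."""
    environ = {}
    setup_testing_defaults(environ)  # fills in the required WSGI environ keys
    environ["PATH_INFO"] = path
    captured = {}

    def start_response(status, headers, exc_info=None):
        captured["status"] = status
        captured["headers"] = headers

    body = b"".join(app(environ, start_response))
    return captured["status"], body


status, body = call_wsgi(demo_app)
print(status, body)
```

gunicorn does this same call from each of its worker processes, so adding workers means more copies of the application running side by side.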
dockerfile (base image)
- FROM registry.baidubce.com/paddlepaddle/paddle:2.1.3-gpu-cuda10.2-cudnn7
-
- RUN echo 'deb http://mirrors.163.com/ubuntu/ bionic main restricted universe multiverse \n\
- deb http://mirrors.163.com/ubuntu/ bionic-security main restricted universe multiverse \n\
- deb http://mirrors.163.com/ubuntu/ bionic-updates main restricted universe multiverse \n\
- deb http://mirrors.163.com/ubuntu/ bionic-proposed main restricted universe multiverse \n\
- deb http://mirrors.163.com/ubuntu/ bionic-backports main restricted universe multiverse \n\
- deb-src http://mirrors.163.com/ubuntu/ bionic main restricted universe multiverse \n\
- deb-src http://mirrors.163.com/ubuntu/ bionic-security main restricted universe multiverse \n\
- deb-src http://mirrors.163.com/ubuntu/ bionic-updates main restricted universe multiverse \n\
- deb-src http://mirrors.163.com/ubuntu/ bionic-proposed main restricted universe multiverse \n\
- deb-src http://mirrors.163.com/ubuntu/ bionic-backports main restricted universe multiverse' > /etc/apt/sources.list
- # Otherwise apt loads the other sources and fails
- RUN rm -rf /etc/apt/apt.conf.d/* /etc/apt/sources.list.d/
- # For some reason the preinstalled ssl does not work (version mismatch), so reinstall it
- RUN apt update && apt remove -y libssl-dev && apt install -y libssl-dev
-
- RUN python3 -m pip install paddlepaddle "paddleocr>=2.0.1" -i https://mirror.baidu.com/pypi/simple
- RUN pip3 install gunicorn gevent flask -i https://mirror.baidu.com/pypi/simple
-
- RUN echo "from paddleocr import PaddleOCR" >> download.py
- # Preload the English model so it is not downloaded after the service starts; for the Chinese model, duplicate these two lines
- RUN echo "PaddleOCR(use_angle_cls=True, lang=\"en\")" >> download.py
- RUN python3 download.py
Production image
- FROM ****/ocr_base:0.0.2
-
- WORKDIR /workspace
-
- COPY ./app/ocr/app.py /workspace/app.py
- # Start 3 workers; don't add too many, since a loaded model can take nearly 2 GB of physical memory
- CMD cd /workspace && gunicorn -b 0.0.0.0:8000 -w 3 -k gevent --access-logfile - app:app
app.py is the flask entry file:
- import urllib.error
- import urllib.request
-
- from flask import Flask, request
- from paddleocr import PaddleOCR
-
-
- def save_image(url, outputfile):
-     """Download url to outputfile; return True on success, False on a URL error."""
-     try:
-         with urllib.request.urlopen(url) as response:
-             data = response.read()
-         with open(outputfile, "wb") as file:
-             file.write(data)
-         return True
-     except urllib.error.URLError as e:
-         print("Error occurred while retrieving the URL:", e)
-         return False
-
- app = Flask(__name__)
- ocr = PaddleOCR(use_angle_cls=True, lang="en")
-
- @app.route("/")
- def hello():
-     print("-------------------")
-     return "Hello World!"
-
- @app.post("/ocr/check")
- def check_post():
-     # silent=True yields None instead of raising on a missing or invalid JSON body
-     req = request.get_json(silent=True) or {}
-     print(req, type(req))
-     url = req.get("url")
-     if url is None:
-         return {"code": -1, "msg": "param url lost", "data": []}
-     # The URL is passed straight to PaddleOCR, which fetches the image itself
-     result = ocr.ocr(url, cls=True)
-     if not result:
-         return {"code": -3, "msg": "ocr result empty", "data": []}
-     data = []
-     for res in result:
-         # An image with no detected text may come back as None
-         for line in res or []:
-             # The last element of each line is (text, score)
-             text, score = line[-1]
-             data.append({"text": text, "score": float(score)})
-     return {"code": 0, "msg": "", "data": data}
-
- if __name__ == "__main__":
-     app.run()
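The nested loop in `check_post` is easier to follow with the result shape in front of you. The sketch below uses a hand-written sample in the shape the handler relies on (one list per image, each line ending in a `(text, score)` pair); the sample values, the `flatten_ocr_result` name and the `min_score` parameter are all made up for illustration and are not part of PaddleOCR.

```python
def flatten_ocr_result(result, min_score=0.0):
    """Flatten a PaddleOCR-style result into [{"text": ..., "score": ...}, ...]."""
    items = []
    for page in result or []:
        # Assumption: a page with no detected text may be None, so treat it as empty.
        for line in page or []:
            text, score = line[-1]  # last element of a line is (text, score)
            if score >= min_score:
                items.append({"text": text, "score": float(score)})
    return items


# Hand-written sample: one image, two lines, each as [box, (text, score)].
sample = [[
    [[[10, 10], [90, 10], [90, 40], [10, 40]], ("A7K9", 0.98)],
    [[[10, 50], [90, 50], [90, 80], [10, 80]], ("x", 0.41)],
]]

print(flatten_ocr_result(sample))
print(flatten_ocr_result(sample, min_score=0.5))
```

Filtering on the score is optional; for captcha-style input it can be a cheap way to drop low-confidence noise before returning the data.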
With that, you have an OCR web service up and running.
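Calling the service could look roughly like the sketch below, using only the standard library. The `/ocr/check` route, the `url` field and the `code`/`msg`/`data` response keys come from app.py; the helper names, the base URL and the timeout are assumptions for illustration.

```python
import json
import urllib.request


def build_ocr_request(base_url, image_url):
    """Build the POST request for /ocr/check with a JSON body {"url": ...}."""
    payload = json.dumps({"url": image_url}).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + "/ocr/check",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def recognize(base_url, image_url, timeout=10):
    """Send the request and return the recognized texts; raise on a non-zero code."""
    req = build_ocr_request(base_url, image_url)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        body = json.loads(resp.read().decode("utf-8"))
    if body.get("code") != 0:
        raise RuntimeError(f"OCR service error {body.get('code')}: {body.get('msg')}")
    return [item["text"] for item in body["data"]]


# Example (needs the service to be running; host and image URL are placeholders):
# print(recognize("http://localhost:8000", "http://example.com/captcha.png"))
```

The image URL has to be reachable from inside the container, because the service downloads the image itself.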
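For ordinary captcha-style images it recognizes about 10 images per second. If you need more throughput, raise the worker count in the production image, at the cost of more CPU and memory. (PaddleOCR also supports GPU; this setup runs on CPU by default.)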
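A rough way to pick the worker count is to let memory be the limit, taking the "nearly 2 GB once the model is loaded" figure as the per-worker cost. The helper below is only a back-of-the-envelope sketch: the function name, the 1 GB reserve for the OS and the cap at the CPU count are my own assumptions, not measured values.

```python
import os


def suggest_workers(total_mem_gb, per_worker_gb=2.0, reserve_gb=1.0, cpu_count=None):
    """Estimate a gunicorn worker count bounded by memory and by CPU cores."""
    cpus = cpu_count if cpu_count is not None else (os.cpu_count() or 1)
    usable = max(total_mem_gb - reserve_gb, 0.0)  # leave some memory for everything else
    by_memory = int(usable // per_worker_gb)      # how many ~2 GB workers fit
    return max(1, min(by_memory, cpus))           # always run at least one worker


# An 8 GB machine with 8 cores: (8 - 1) // 2 = 3 workers, matching `-w 3` above.
print(suggest_workers(8, cpu_count=8))
```

Treat the number as a starting point and watch actual memory use under load before raising `-w`.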