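I recently needed a feature that reads the text out of documents and found some handy Python libraries for it. There were quite a few pitfalls along the way, so here are my notes.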
- # Install the dependencies
- # pywt may require a kernel restart after installation
- pip install pywt -i https://mirror.baidu.com/pypi/simple
-
- pip install "paddleocr>=2.2" --no-deps -r requirements.txt
- pip install -U https://paddleocr.bj.bcebos.com/whl/layoutparser-0.0.0-py3-none-any.whl
-
- pip install PyMuPDF==1.20.2
-
- # Flask is only needed for the web service
- pip install Flask
-
- pip install paddleocr==2.6.0.1
-
- The versions are pinned on purpose: mismatches between several of these packages caused quite a few problems earlier.
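Since the pinned versions matter, it can help to confirm what is actually installed before debugging anything else. This is a minimal sketch using the standard library's importlib.metadata (Python 3.8+). The helper name check_versions and the PINNED table are made up for illustration; the version numbers are the ones from the install commands.

```python
from importlib import metadata

# Pins copied from the install commands; adjust if yours differ.
PINNED = {"PyMuPDF": "1.20.2", "paddleocr": "2.6.0.1"}


def check_versions(pinned):
    """Return {package: (expected, installed)} for every mismatch.

    installed is None when the package is not installed at all.
    (Hypothetical helper, not part of any of the libraries above.)
    """
    mismatches = {}
    for name, expected in pinned.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != expected:
            mismatches[name] = (expected, installed)
    return mismatches


for name, (expected, installed) in check_versions(PINNED).items():
    print(f"{name}: expected {expected}, found {installed}")
```

If nothing is printed, the installed versions match the pins.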
- # -*- coding=utf-8 -*-
- import time
-
- import fitz  # PyMuPDF
- from flask import Flask, jsonify, request
- from paddleocr import PaddleOCR
-
- app = Flask(__name__)
- # Load the OCR model once at startup instead of on every request
- ocr = PaddleOCR(use_angle_cls=True, use_gpu=False)
-
-
- @app.route("/resume", methods=['POST'])
- def convertText():
-     start_time = time.time()  # start of request handling
-     file = request.files.get('file')
-     if file is None:
-         return jsonify({"error": "missing 'file' field"}), 400
-     result = []
-     # Open the uploaded PDF directly from memory
-     pdfdoc = fitz.open(stream=file.read(), filetype="pdf")
-     for pg in range(pdfdoc.page_count):
-         page = pdfdoc[pg]
-         rotate = 0
-         # A zoom factor of 2 on each axis renders at 144 dpi instead of
-         # the default 72, i.e. an image with four times as many pixels.
-         zoom_x = 2.0
-         zoom_y = 2.0
-         trans = fitz.Matrix(zoom_x, zoom_y).prerotate(rotate)
-         pm = page.get_pixmap(matrix=trans, alpha=False)
-         # Use the public Pixmap.save() rather than the private _writeIMG();
-         # the format is inferred from the file extension (PNG here).
-         pm.save('temp.png')
-
-         # Run OCR on the rendered page
-         page_result = ocr.ocr('temp.png', cls=True)
-         result.append(page_result)
-     pdfdoc.close()
-     end_time = time.time()  # end of request handling
-     run_time = end_time - start_time  # elapsed time in seconds
-     print(run_time)
-     return jsonify({"data": result})
-
-
- if __name__ == "__main__":
-     app.config['JSON_AS_ASCII'] = False
-     app.run(host='0.0.0.0', port=8059)
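The endpoint returns the raw OCR output, one entry per page, so a caller usually still has to pull the text out of it. This is a rough sketch of that step. It assumes each recognized line arrives as [box, [text, score]] after JSON serialization. The nesting depth around those lines may differ between PaddleOCR versions, so the helper walks the structure recursively instead of relying on a fixed depth. extract_text is a hypothetical helper and the sample data is hand-written, not real OCR output.

```python
def extract_text(node):
    """Collect recognized strings from a nested OCR result.

    Assumption: a recognized line looks like [box, [text, score]],
    with text a str and score a number. Anything else is treated
    as a container and searched recursively.
    """
    if not isinstance(node, (list, tuple)):
        return []  # numbers, strings, None (e.g. an empty page)
    if (
        len(node) == 2
        and isinstance(node[1], (list, tuple))
        and len(node[1]) == 2
        and isinstance(node[1][0], str)
        and isinstance(node[1][1], (int, float))
    ):
        return [node[1][0]]
    texts = []
    for child in node:
        texts.extend(extract_text(child))
    return texts


# Hand-written sample in the assumed shape: a list of pages,
# each page a list of [box, [text, score]] entries.
sample = [
    [
        [[[10, 10], [90, 10], [90, 30], [10, 30]], ["Zhang San", 0.98]],
        [[[10, 40], [90, 40], [90, 60], [10, 60]], ["Python developer", 0.95]],
    ],
]
lines = extract_text(sample)
print("\n".join(lines))
```

On the client side this would be applied to the "data" field of the JSON response.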
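1. Flask is used as the web layer here because it makes it quick to stand up a service.
The JSON_AS_ASCII setting is there because of an encoding issue: by default, non-ASCII characters are escaped as \uXXXX when the result is converted to JSON, which reads as garbled text, so ASCII-only output has to be switched off. Note that Flask 2.3 and later no longer read this config key; there the equivalent is app.json.ensure_ascii = False.
host also has to be set to 0.0.0.0. If it is left out, Flask listens only on 127.0.0.1 and the service cannot be reached from other machines on the LAN.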
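The encoding issue from the JSON note can be reproduced with the standard json module alone. This is a minimal sketch, assuming Flask's setting behaves like the ensure_ascii argument of json.dumps; the payload is a made-up sample.

```python
import json

payload = {"data": "简历"}  # made-up sample value

escaped = json.dumps(payload)                       # default: ensure_ascii=True
readable = json.dumps(payload, ensure_ascii=False)  # keep characters as-is

print(escaped)   # non-ASCII characters appear as \uXXXX escapes
print(readable)  # characters are written unchanged

# Both forms decode to the same object; only the wire format differs.
assert json.loads(escaped) == json.loads(readable) == payload
```

The escaped form is still valid JSON and a proper JSON client decodes it correctly, so the setting matters mostly when the raw response is read by a person or by a tool that does not unescape it.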