Python 3: Extracting the Contents of a PDF File with PyPDF2

1 A brief introduction to the PDF file format

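PDF stands for Portable Document Format. As a file format it is independent of the operating system: it is supported on Windows, Unix/Linux, macOS, and nearly every other mainstream platform. It is also designed so that a document looks the same wherever it is displayed or printed, reproducing the text, colors, and images of the original.
Unlike a plain text file, whose contents can be read directly, a PDF has to be opened with dedicated software, much as a Word document needs Office.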

This post shows how to extract the text of a PDF file with Python.

2 A brief introduction to the PyPDF2 library

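PyPDF2 is a third-party Python library for working with PDF files. Among other things, it can:

Extract text from a PDF file
Split a document page by page
Merge documents page by page
Crop pages
Merge multiple pages into a single page
Encrypt and decrypt PDF documents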

For details, see:

https://pypi.org/project/PyPDF2/

Installing PyPDF2:

pip install PyPDF2

3 Using PyPDF2: extracting the contents of a PDF file

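(1) Core code:

# Create a reader from a file object opened in binary mode
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

# Get page N (page_index is zero-based)
pageObj = pdfReader.getPage(page_index)

# Extract the text of that page
dataStr = pageObj.extractText()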
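
The three calls above use PyPDF2's original camelCase names. Newer PyPDF2 releases expose the same steps as `PdfReader`, `reader.pages`, and `page.extract_text()`, and PyPDF2 3.0 removed the old names. What follows is a minimal sketch of the same loop under the newer naming. `iter_page_text` and `read_pdf_text` are hypothetical helper names introduced here for illustration; they are not part of the library.

```python
def iter_page_text(reader, limit=None):
    """Yield (index, text) for each page of a reader-like object.

    `reader` only needs a `.pages` sequence whose items provide
    `.extract_text()`, which is the shape of PyPDF2's newer `PdfReader`.
    """
    for index, page in enumerate(reader.pages):
        if limit is not None and index >= limit:
            break
        yield index, page.extract_text()


def read_pdf_text(path, limit=None):
    """Open `path` with PyPDF2 and return a list of (index, text) pairs."""
    # Imported lazily so iter_page_text() stays usable without PyPDF2 installed.
    from PyPDF2 import PdfReader  # assumes PyPDF2 2.x or later

    with open(path, "rb") as stream:
        # Build the list inside the with-block: pages are read from the
        # stream while it is still open.
        return list(iter_page_text(PdfReader(stream), limit))
```

Assuming a file named `some.pdf` exists, `read_pdf_text('some.pdf', limit=1)` would return the text of the first page only. Because `iter_page_text` accepts any object with a `.pages` sequence, it can be exercised without a real PDF.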

(2) The complete code:


# -*- coding: utf-8 -*-
# Written against the PyPDF2 1.x names (PdfFileReader, getPage, extractText),
# which PyPDF2 3.0 removed; run it with a PyPDF2 release older than 3.0.
import PyPDF2


def read_pdf(filename):
    '''Print the text content of a PDF file.'''
    # Must be binary mode: 'rw' is rejected by open(), 'r+' gives a text stream
    pdfFileObj = open(filename, 'rb')
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

    print("pages cnt:", pdfReader.numPages)

    for i in range(pdfReader.numPages):
        pageObj = pdfReader.getPage(i)
        dataStr = pageObj.extractText()

        print("current page index:", i)
        print("============text===============")
        print(dataStr)

        if i == 0:
            # For testing only: stop after the first page
            break
    # Close the file when done
    pdfFileObj.close()


if __name__ == '__main__':
    filename = 'Effective C++ 英文版.pdf'
    read_pdf(filename)

Output:

% python3 pdf1.py

pages cnt: 251

current page index: 0

============text===============

E ffe c tiv eC + +byScottMeyers

Back

to

Dedication

Continue

... (remaining characters omitted)
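
The extracted text above comes back fragmented, with words such as "Back", "to", and "Dedication" on separate lines. One possible post-processing step, sketched here with only the standard library, is to collapse runs of whitespace into single spaces. `normalize_extracted_text` is a hypothetical helper name, and this does not repair spacing inside words (the "E ffe c tiv eC + +" fragment would stay broken).

```python
import re


def normalize_extracted_text(text):
    """Collapse runs of whitespace (including newlines) into single spaces."""
    return re.sub(r"\s+", " ", text).strip()


sample = "Back\n\nto\n\nDedication\n\nContinue"
print(normalize_extracted_text(sample))  # Back to Dedication Continue
```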


Notes:

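(1) pdfFileObj = open(filename, 'rb')

The file must be opened with 'rb', that is, for reading in binary mode. 'rw' is not a valid mode, so open() itself raises:

"ValueError: must have exactly one of create/read/write/append mode"

'r+' is accepted by open(), but it yields a text-mode stream, whereas PdfFileReader needs a binary one.

(2) pdfReader = PyPDF2.PdfFileReader(pdfFileObj): creates the PdfFileReader object used to read the data.

(3) pdfReader.numPages: the number of pages in the PDF.

(4) pageObj = pdfReader.getPage(i): returns the page object for page i (zero-based).

(5) dataStr = pageObj.extractText(): extracts the text of that page.

(6) pdfFileObj.close(): always close the file once you are done with it.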
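
Notes (1) and (6) can be checked without PyPDF2. The sketch below uses only the standard library and a small stand-in file rather than a real PDF. It shows which open modes are rejected, what type of data each accepted mode returns, and that a `with` block closes the file automatically. `check_open_mode` is a hypothetical helper name.

```python
import os
import tempfile


def check_open_mode(path, mode):
    """Open `path` with `mode`; report the type of data read, or the error."""
    try:
        with open(path, mode) as f:
            data = f.read(5)
        # Leaving the with-block closes the file; no explicit close() needed.
        return type(data).__name__, f.closed
    except ValueError as exc:
        return "ValueError", str(exc)


# Stand-in file containing only the 5-byte header that PDF files start with.
fd, demo_path = tempfile.mkstemp(suffix=".pdf")
with os.fdopen(fd, "wb") as out:
    out.write(b"%PDF-")

print(check_open_mode(demo_path, "rb"))  # ('bytes', True)
print(check_open_mode(demo_path, "r+"))  # ('str', True)
print(check_open_mode(demo_path, "rw")[0])  # ValueError
os.remove(demo_path)
```

Only 'rb' yields bytes, which is what a PDF parser needs. 'r+' opens successfully but decodes the content to str, and 'rw' never gets past open().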


Original article: https://blog.csdn.net/liranke/article/details/126525804