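For a project, I needed to provide a feature that parses the contents of docx files. I assumed this would be fairly simple, but parsing checkbox options tripped me up: for quite a while, nothing I found through various channels actually solved the problem, and I ended up taking detours.
Eventually I found the key idea and clues in the Issues of the python-docx module on GitHub, and after some further work I managed to parse the text, images, and checkboxes of a docx/Word file.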
Feature: Read checkboxes in Word forms · Issue #224 · python-openxml/python-docx · GitHub
    # Install the python-docx module first:
    #   pip install python-docx
    import os
    import time

    import docx

    # Directory where extracted images are saved
    image_save_path = 'appendix_dir'


    # Read the data in the docx tables: images and text
    def read_table_from_docx(file_path):
        """
        :param file_path: path to the docx/Word file
        :return: table_data, images
        """
        # Open the docx/Word file
        doc = docx.Document(file_path)

        # Table objects in the document
        tables = doc.tables
        table_data = []
        images = []

        # Collect the raw bytes of every embedded image in the images list
        # (externally linked images have no target part, so skip them)
        for rel in doc.part.rels.values():
            if "image" in rel.reltype and not rel.is_external:
                images.append(rel.target_part.blob)

        # Read the text of every table cell.
        # Special symbols and checkboxes cannot be parsed this way.
        # Note that the text of a merged cell is repeated for every row and column it spans.
        for table in tables:
            for row in table.rows:
                row_data = []
                for cell in row.cells:
                    # print(cell, cell.text)
                    row_data.append(cell.text)
                table_data.append(row_data)

        return table_data, images


    table_data, images = read_table_from_docx('template.docx')
    print(table_data)

    # Save the docx images to disk
    os.makedirs(image_save_path, exist_ok=True)
    for i, image_data in enumerate(images):
        # Build the path of the saved image; the index keeps names unique
        # even when several images are written within the same millisecond
        image_name = f"expert_{int(time.time() * 1000)}_{i}.jpg"
        with open(os.path.join(image_save_path, image_name), "wb") as f:
            f.write(image_data)
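The merged-cell caveat noted in the code comments above can be handled by skipping repeated cells. Here is a minimal sketch, assuming python-docx hands back the same underlying `_tc` XML element for every grid position a merged cell covers. `unique_cell_texts` is a hypothetical helper name, not part of python-docx.

```python
def unique_cell_texts(cells):
    """Return the text of each cell once, skipping repeats of a merged cell.

    Assumption: every cell object exposes `_tc` (its underlying XML element)
    and `text`, and repeats of one merged cell share the same `_tc` object.
    """
    seen = {}  # id(_tc) -> _tc; holding the element keeps its id stable
    texts = []
    for cell in cells:
        tc = cell._tc
        if id(tc) in seen:
            continue  # same merged cell seen earlier in this sequence
        seen[id(tc)] = tc
        texts.append(cell.text)
    return texts
```

Inside the table loop, `row_data = unique_cell_texts(row.cells)` would replace the inner `for cell in row.cells` loop. This only removes repeats within the sequence passed in, so a vertically merged cell still shows up once per row.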
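As for docx checkboxes, this project involved an unusual checkbox style that was not created with the [Checkbox Content Control] in WPS, which left me without a direction for a while.
This is a checkbox added the usual way in WPS.
Clearly, it is not quite the same as my target.
Neither kind can be read through basic python-docx operations, so I had to keep digging. As mentioned at the start, I eventually went to the comments on the module's GitHub issue and found the clue there: analyze the docx file directly as XML and read the checkbox state from it.
To save space (I'm lazy), here is the code directly!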
    from docx import Document

    # Namespace of the Word 2010 extensions that define w14:checkbox
    W14_NS = {'w14': 'http://schemas.microsoft.com/office/word/2010/wordml'}

    document = Document('template1.docx')
    tables = document.tables
    content = []
    for table in tables:
        for row in table.rows:
            for cell in row.cells:
                for paragraph in cell.paragraphs:
                    p = paragraph._element

                    # Print the paragraph's underlying XML
                    # print(p.xml)

                    # Collect every <w14:checkbox> element in this paragraph
                    checkBoxes = p.xpath('.//w14:checkbox')
                    if checkBoxes:
                        # Look inside each <w14:checkbox>
                        for checkBox in checkBoxes:
                            # Try to match the <w14:checked> element, i.e. the checkbox created in WPS above
                            checked_state = checkBox.xpath('.//w14:checked/@w14:val', namespaces=W14_NS)
                            if checked_state:
                                checked_value = checked_state[0]  # first matching attribute value
                                print(paragraph.text, "Checked value:", checked_value)
                                break

                    # Approach for the checkbox style used in the original template
                    # checkBoxes = p.xpath('.//w:r')
                    # if checkBoxes:
                    #     for checkBox in checkBoxes:
                    #         checked_state = checkBox.xpath('.//w:sym/@w:char')
                    #         if checked_state:
                    #             checked_value = checked_state[0]  # first matching attribute value
                    #             print(paragraph.text, "Checked value:", checked_value)
                    #             break
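Because a docx file is a zip archive whose main body lives in `word/document.xml`, the same lookup can also be sketched with only the standard library. This is an illustration rather than the approach from the GitHub issue. `checkbox_states` and `read_docx_checkboxes` are hypothetical helper names, and the sketch assumes each checkbox is a `w14:checkbox` content control with a `w14:checked` child, as in the code above.

```python
import zipfile
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W14 = "http://schemas.microsoft.com/office/word/2010/wordml"


def checkbox_states(document_xml):
    """Return a list of (paragraph text, checked) pairs.

    `document_xml` is the content of word/document.xml (str or bytes).
    A paragraph is reported once per w14:checkbox it contains, and its
    text includes the glyph shown inside the content control.
    """
    root = ET.fromstring(document_xml)
    results = []
    for p in root.iter(f"{{{W}}}p"):
        text = "".join(t.text or "" for t in p.iter(f"{{{W}}}t"))
        for box in p.iter(f"{{{W14}}}checkbox"):
            checked = box.find(f"{{{W14}}}checked")
            value = checked.get(f"{{{W14}}}val") if checked is not None else None
            # Assumption: a missing w14:checked or value counts as unchecked
            results.append((text, value in ("1", "true")))
    return results


def read_docx_checkboxes(path):
    """Open a docx file as a zip archive and read its checkbox states."""
    with zipfile.ZipFile(path) as archive:
        return checkbox_states(archive.read("word/document.xml"))
```

Calling `read_docx_checkboxes('template1.docx')` would walk the whole document body, tables included, instead of only the table cells visited above.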
This is my result (1 means checked, 0 means unchecked).
This is the XML of the parsed docx paragraph, abridged to the elements the code reads (the remaining namespace declarations, the run and paragraph formatting properties, the bookmarks, and the drawing markup of the embedded image are omitted). Work out for yourself how the code relates to it.

    <w:p xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" xmlns:w14="http://schemas.microsoft.com/office/word/2010/wordml">
      <w:r>
        <w:sym w:font="Wingdings" w:char="00A8"/>
      </w:r>
      <w:r>
        <w:t>这是选项一</w:t>
      </w:r>
      <w:sdt>
        <w:sdtPr>
          <w:id w:val="147457823"/>
          <w14:checkbox>
            <w14:checked w14:val="1"/>
            <w14:checkedState w14:val="2612" w14:font="MS Gothic"/>
            <w14:uncheckedState w14:val="2610" w14:font="MS Gothic"/>
          </w14:checkbox>
        </w:sdtPr>
        <w:sdtContent>
          <w:r>
            <w:t>☒</w:t>
          </w:r>
        </w:sdtContent>
      </w:sdt>
    </w:p>
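For the symbol-style checkbox, the `w:char` value of the `w:sym` element still has to be interpreted. Here is a minimal sketch, assuming the Wingdings code points 0xA8 (empty box), 0xFD (crossed box), and 0xFE (ticked box) are the only glyphs the template uses. That mapping and the helper name `symbol_checkbox_state` are assumptions, so check them against your own template.

```python
# Assumed meaning of Wingdings code points used as checkbox glyphs.
WINGDINGS_CHECKBOX = {
    0xA8: False,  # empty box
    0xFD: True,   # box with a cross
    0xFE: True,   # box with a tick
}


def symbol_checkbox_state(char_hex, font="Wingdings"):
    """Map a w:sym w:char value such as "00A8" to True, False, or None.

    Returns None when the font or code point is not a known checkbox glyph.
    Symbol characters may be stored in the private-use range (for example
    "F0A8"), so only the low byte is compared.
    """
    if font != "Wingdings":
        return None
    code = int(char_hex, 16) & 0xFF
    return WINGDINGS_CHECKBOX.get(code)
```

The value returned by the `.//w:sym/@w:char` query in the commented-out branch could be passed straight in, for example `symbol_checkbox_state(checked_state[0])`.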