Scrapy and Distributed Development (2.3): lxml + XPath Basic Expressions and Extraction Methods Explained


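1. Introduction to XPath

XPath, short for XML Path Language, is a language for locating information in XML documents. It lets you navigate a document with path expressions. XPath is not limited to XML; it is also commonly used on HTML documents.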

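2. Basic expressions and extraction methods

Selecting nodes

With XPath you can select nodes in an XML document.
* Root node: /
* Child node: /parent/child
* All element nodes: //*
* Descendants at any depth: //name selects every element called name anywhere in the document, regardless of level.
* Following siblings: /sibling1/following-sibling::sibling2 selects the sibling2 elements that come after sibling1 under the same parent; append [1] to keep only the nearest one.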
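
The selections above can be tried out with lxml. The following is a minimal sketch; the sample document and the element names parent, child1, and child2 are made up for illustration only.

```python
from lxml import etree

# Hypothetical sample document, used only for illustration.
xml = "<parent><child1>a</child1><child2>b</child2><child2>c</child2></parent>"
root = etree.fromstring(xml)

# "/parent/child1" walks down from the document root one level at a time.
direct = root.xpath("/parent/child1")

# "//*" matches every element in the document.
everything = root.xpath("//*")

# "//child2" matches child2 elements at any depth.
anywhere = root.xpath("//child2")

# following-sibling:: returns all later siblings with that name...
siblings = root.xpath("/parent/child1/following-sibling::child2")

# ...while "*[1]" keeps only the sibling immediately after child1.
adjacent = root.xpath("/parent/child1/following-sibling::*[1]")

print([e.tag for e in everything])  # ['parent', 'child1', 'child2', 'child2']
print([e.text for e in siblings])   # ['b', 'c']
print(adjacent[0].text)             # b
```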

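Using axes

XPath provides several axes that let you select nodes by their relationship to other nodes.
* Child axis: /parent/child (short for /parent/child::child)
* Sibling axis: /parent/child1/following-sibling::child2
* Attribute axis: /parent/child/@attribute (@attribute is short for attribute::attribute)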
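
As a rough illustration of how axes behave, the sketch below writes the child and attribute axes out in full and also evaluates the parent and ancestor axes relative to an element. The document and the id attribute are hypothetical.

```python
from lxml import etree

# Hypothetical sample document, used only for illustration.
root = etree.fromstring(
    '<parent><child1 id="x">a</child1><child2>b</child2></parent>'
)
child1 = root.xpath("/parent/child1")[0]

# Child axis written out in full; equivalent to "/parent/child1".
same = root.xpath("/parent/child::child1")

# Attribute axis written out in full; equivalent to "/parent/child1/@id".
ids = root.xpath("/parent/child1/attribute::id")

# Axes can also be evaluated relative to an element you already hold.
parent_tag = child1.xpath("parent::*")[0].tag
ancestor_tags = [e.tag for e in child1.xpath("ancestor::*")]
next_tags = [e.tag for e in child1.xpath("following-sibling::*")]

print(ids)            # ['x']
print(parent_tag)     # parent
print(next_tags)      # ['child2']
```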

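Using predicates

Predicates filter a node set so that you can pinpoint nodes more precisely.
* First node: /parent/child[1]
* Nodes with a specific attribute value: /parent/child[@attribute='value']
* Several nodes matching a condition: /parent/child[position() > 1]
* /parent/child[@attribute] keeps only the child nodes that have the attribute, whereas /parent/child/@attribute (a path step, not a predicate) selects the attribute nodes themselves.
* Position predicates select by a node's position under its parent: [1] is the first child and [last()] is the last. Positions start at 1.
* /parent/child[text()='value'] selects nodes whose text equals a specific value.
* Combine conditions with and / or, for example /parent/child[@attribute1='value1' and @attribute2='value2'].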
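
The predicates listed above can be compared side by side. This is a small sketch on a made-up document; the attributes kind and size and the helper texts() are hypothetical names chosen for the example.

```python
from lxml import etree

# Hypothetical sample document, used only for illustration.
xml = """<parent>
  <child kind="a" size="1">one</child>
  <child kind="b" size="2">two</child>
  <child kind="a" size="2">three</child>
</parent>"""
root = etree.fromstring(xml)


def texts(expr):
    """Return the text of every element matched by expr."""
    return [e.text for e in root.xpath(expr)]


first = texts("/parent/child[1]")                        # ['one']
last = texts("/parent/child[last()]")                    # ['three']
rest = texts("/parent/child[position() > 1]")            # ['two', 'three']
by_attr = texts("/parent/child[@kind='a']")              # ['one', 'three']
by_text = texts("/parent/child[text()='two']")           # ['two']
both = texts("/parent/child[@kind='a' and @size='2']")   # ['three']
either = texts("/parent/child[@kind='b' or @size='1']")  # ['one', 'two']
```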

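Extracting text

XPath can extract a node's text content as well as locate the node.
* text() selects a node's own text nodes, as in /parent/child/text().
* string() returns the string value of a node, that is, the concatenated text of the node and all of its descendants, which suits nested structures.
* /@attribute extracts an attribute value, as in /parent/child/@attribute.
* Use the union operator | to combine several expressions in one query, as in /parent/child1 | /parent/child2 | /parent/@attribute. The comma form /parent/(child1, child2, @attribute) is XPath 2.0 syntax, which lxml (XPath 1.0) does not accept.
* . refers to the current node, and the node test node() matches any node of any kind (elements, text, comments).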
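
The difference between text(), string(), and node() is easiest to see on an element with mixed content. The snippet below is a minimal sketch on a hypothetical one-line document.

```python
from lxml import etree

# Hypothetical mixed-content element, used only for illustration.
root = etree.fromstring('<p class="note">Hello <b>bold</b> world</p>')

# text() returns only the text nodes that are direct children of <p>.
pieces = root.xpath("/p/text()")

# string() concatenates the text of <p> and all of its descendants.
whole = root.xpath("string(/p)")

# @class returns the attribute value.
cls = root.xpath("/p/@class")

# node() matches every kind of child node: text, element, text.
children = root.xpath("/p/node()")

print(pieces)         # ['Hello ', ' world']
print(whole)          # Hello bold world
print(cls)            # ['note']
print(len(children))  # 3
```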

3. Worked examples

The following XPath queries show how to extract data from an XML document.

Sample XML document:
```xml
<bookstore>
    <book>
        <title lang="en">Harry Potter</title>
        <author>J.K. Rowling</author>
        <price>29.99</price>
    </book>
    <book>
        <title lang="en">Learning XML</title>
        <author>Erik T. Ray</author>
        <price>39.95</price>
    </book>
    <book>
        <title lang="zh-CN">西游记</title>
        <author>吴承恩</author>
        <price>28.80</price>
    </book>
</bookstore>
```
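Select all book titles
XPath expression: /bookstore/book/title
Result: the three `<title>` elements, whose text is Harry Potter, Learning XML, and 西游记

Select the price of the second book
XPath expression: /bookstore/book[2]/price
Result: the `<price>` element containing 39.95

Select all English titles
XPath expression: /bookstore/book/title[@lang='en']
Result: Harry Potter, Learning XML

Select all books priced above 30
XPath expression: /bookstore/book[price > 30]
Result: the `<book>` element for Learning XML (price 39.95), the only book above 30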
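
To check these results yourself, the sample document can be parsed from a string and queried with lxml. This is a minimal sketch; the variable names are arbitrary, and the document is the bookstore sample from this section written on fewer lines.

```python
from lxml import etree

xml = """<bookstore>
  <book><title lang="en">Harry Potter</title><author>J.K. Rowling</author><price>29.99</price></book>
  <book><title lang="en">Learning XML</title><author>Erik T. Ray</author><price>39.95</price></book>
  <book><title lang="zh-CN">西游记</title><author>吴承恩</author><price>28.80</price></book>
</bookstore>"""
root = etree.fromstring(xml)

# The predicate compares the numeric value of each <price> with 30.
expensive = root.xpath("/bookstore/book[price > 30]")
titles = [book.findtext("title") for book in expensive]

# text() returns strings; number() converts inside XPath and yields a float.
second_price = root.xpath("/bookstore/book[2]/price/text()")
as_number = root.xpath("number(/bookstore/book[2]/price)")

english = root.xpath("/bookstore/book/title[@lang='en']/text()")

print(titles)        # ['Learning XML']
print(second_price)  # ['39.95']
print(english)       # ['Harry Potter', 'Learning XML']
```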

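Select the author name of every book
XPath expression: /bookstore/book/author/text()
Result: J.K. Rowling, Erik T. Ray, 吴承恩

Select the title text of the first book
XPath expression: /bookstore/book[1]/title/text()
Result: Harry Potter

Select the price of every book (as text)
XPath expression: /bookstore/book/price/text()
Result: 29.99, 39.95, 28.80

Select all title nodes that have an attribute
XPath expression: //title[@*]
Result: every `<title>` node that carries at least one attribute, such as `<title lang="en">Harry Potter</title>`; in this document that is all three

Extract several nodes and return their text
XPath expression: /bookstore/book/title/text() | /bookstore/book/author/text()
Result: the title and author text of every book. The union operator | is the XPath 1.0 way to combine expressions; the XPath 2.0 form /bookstore/book/(title/text(), author/text()) is rejected by lxml. The result is one flat list rather than one pair per book, so iterate over the `<book>` nodes if you need the values grouped.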
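
One way to get the title and author grouped per book, sketched here under the assumption that every book has exactly one title and one author, is to select the `<book>` nodes first and read the children from each one. The names flat and pairs are arbitrary.

```python
from lxml import etree

xml = """<bookstore>
  <book><title lang="en">Harry Potter</title><author>J.K. Rowling</author></book>
  <book><title lang="en">Learning XML</title><author>Erik T. Ray</author></book>
  <book><title lang="zh-CN">西游记</title><author>吴承恩</author></book>
</bookstore>"""
root = etree.fromstring(xml)

# Union: one flat list that mixes titles and authors.
flat = root.xpath(
    "/bookstore/book/title/text() | /bookstore/book/author/text()"
)

# Grouped: evaluate relative lookups against each <book> element.
pairs = [
    (book.findtext("title"), book.findtext("author"))
    for book in root.xpath("/bookstore/book")
]

print(len(flat))  # 6
print(pairs[0])   # ('Harry Potter', 'J.K. Rowling')
```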

Extract the direct children of a node
XPath expression: /bookstore/book/price
Result: all `<price>` nodes, because `<price>` is a direct child of `<book>`.

Extract all children of a node
XPath expression: /bookstore/book/*
Result: for each book, all of its child elements, namely `<title>`, `<author>`, and `<price>`.

Extract an attribute of a node
XPath expression: /bookstore/book/title/@lang
Result: the `lang` attribute value of every `<title>` node, for example "en" and "zh-CN".

Extract the parent of a node
XPath expression: /bookstore/book/price/parent::book
Result: the parent `<book>` of each `<price>` node.

Extract the previous or next sibling of a node
XPath expressions: /bookstore/book[2]/preceding-sibling::book[1]/title and /bookstore/book[2]/following-sibling::book[1]/title
Result: the titles of the books immediately before and after the second book, that is, Harry Potter and 西游记. The axes are named preceding-sibling and following-sibling; XPath has no previous-sibling or next-sibling axis. A `<title>` has no sibling `<title>` inside its own `<book>`, so the step is taken at the `<book>` level.

Extract an ancestor of a node
XPath expression: /bookstore/book/title/ancestor::bookstore
Result: the `<bookstore>` ancestor of the `<title>` nodes.

Extract a node together with all of its descendants
XPath expression: /bookstore/book[1]
Result: the first `<book>` element; since the element carries its whole subtree, this gives you the complete information for the first book.

Extract the set of nodes that satisfy a condition
XPath expression: /bookstore/book[price > 30]
Result: all `<book>` nodes whose price is greater than 30.

4. Using XPath with lxml

In Python, `lxml` is a powerful library for parsing XML and HTML documents. Combined with XPath, it makes locating and extracting specific information from a document straightforward. This section walks through XML parsing and data extraction with `lxml` and XPath, focusing on practical expressions and text extraction.

Installing lxml

First make sure `lxml` is installed. If it is not, install it with pip:

```bash
pip install lxml
```

Loading an XML document

Load the XML document with the `etree` module of `lxml`:

```python
from lxml import etree

# Load the XML document
tree = etree.parse('example.xml')
```

Extracting data with XPath

1. Selecting nodes: select all `<book>` nodes:

```python
books = tree.xpath('/bookstore/book')
```

2. Selecting a specific node: select the first `<book>` node (the result is still a list):

```python
first_book = tree.xpath('/bookstore/book[1]')
```

3. Selecting an attribute: select the `lang` attribute value of every `<title>` node:

```python
langs = tree.xpath('/bookstore/book/title/@lang')
```

4. Selecting text content: select the text of every `<title>` node:

```python
titles_text = tree.xpath('/bookstore/book/title/text()')
```

5. Selecting several nodes and their text: select the `<title>` and `<author>` text of every `<book>` node:

```python
# lxml implements XPath 1.0, so combine expressions with the union operator |
books_info = tree.xpath('/bookstore/book/title/text() | /bookstore/book/author/text()')
```

6. Conditional selection: select the `<book>` nodes whose price is greater than 30:

```python
expensive_books = tree.xpath('/bookstore/book[price > 30]')
```

7. Selecting descendants: select all `<price>` descendant nodes:

```python
prices = tree.xpath('//price')
```

Hands-on examples
Case 1: Extracting blog article titles

```python
from lxml import etree

# Assume html_content holds the HTML of a blog page
html_content = """
<html>
<head>
    <title>My Blog</title>
</head>
<body>
    <h1>Welcome to My Blog</h1>
    <div>
        <h2>Article 1 Title</h2>
        <p>Article 1 content...</p>
    </div>
    <div>
        <h2>Article 2 Title</h2>
        <p>Article 2 content...</p>
    </div>
</body>
</html>
"""

# Parse the HTML
tree = etree.HTML(html_content)

# Use XPath to locate every <h2> element and extract its text
article_titles = tree.xpath('//h2/text()')

# Print the article titles
for title in article_titles:
    print(title.strip())  # strip() removes any surrounding whitespace
```
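Case 2: Extracting links and link text
```python
from lxml import etree

# The href values below are placeholders
html_content = """
<html>
<head>
    <title>Links Page</title>
</head>
<body>
    <p>Here are some links:</p>
    <a href="https://example.com/1">Link 1</a>
    <a href="https://example.com/2">Link 2</a>
    <a href="https://example.com/3">Link 3</a>
</body>
</html>
"""

# Parse the HTML
tree = etree.HTML(html_content)

# Use XPath to extract every link and its text
links = tree.xpath('//a')
for link in links:
    # link.text is None when the <a> has no leading text, so fall back to ""
    link_text = (link.text or "").strip()
    link_href = link.get('href')  # read the href attribute
    print(f"Link Text: {link_text}, Link: {link_href}")
```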
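
Real pages often wrap link text in nested tags or omit the href, in which case reading `.text` directly returns None or misses text. The sketch below shows one tolerant approach using the XPath functions string() and normalize-space(); the markup and URLs are hypothetical.

```python
from lxml import etree

# Hypothetical markup: the second link wraps its text in <b>,
# and the third link has no href attribute.
html_content = """
<p>
  <a href="https://example.com/1">Link 1</a>
  <a href="https://example.com/2"><b>Link 2</b></a>
  <a>Link 3</a>
</p>
"""
tree = etree.HTML(html_content)

results = []
for link in tree.xpath("//a"):
    # string(.) gathers text from the element and all of its descendants;
    # normalize-space() trims and collapses whitespace.
    text = link.xpath("normalize-space(string(.))")
    href = link.get("href")  # None when the attribute is missing
    results.append((text, href))

for text, href in results:
    print(f"Link Text: {text}, Link: {href}")
```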
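Case 3: Extracting table data
```python
from lxml import etree

html_content = """
<html>
<head>
    <title>Table Page</title>
</head>
<body>
    <table>
        <tr>
            <th>Header 1</th>
            <th>Header 2</th>
            <th>Header 3</th>
        </tr>
        <tr>
            <td>Row 1, Col 1</td>
            <td>Row 1, Col 2</td>
            <td>Row 1, Col 3</td>
        </tr>
        <tr>
            <td>Row 2, Col 1</td>
            <td>Row 2, Col 2</td>
            <td>Row 2, Col 3</td>
        </tr>
    </table>
</body>
</html>
"""

# Parse the HTML
tree = etree.HTML(html_content)

# Use XPath to extract every row of the table
table_rows = tree.xpath('//table/tr')

# Walk the rows and extract the cell data
for row in table_rows:
    # Take header and data cells alike; assumes every row has the same number of columns
    cells = row.xpath('td|th')
    # cell.text is None for an empty cell, so fall back to ""
    row_data = [(cell.text or "").strip() for cell in cells]
    print(row_data)
```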
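
If you want each data row keyed by its column header, one possible approach is sketched below. It assumes the first row holds the `<th>` headers; the helper cell_text() and the sample table are hypothetical, and itertext() is used so that cells with nested markup or no text are handled.

```python
from lxml import etree

# Hypothetical table: one cell contains nested markup, one is empty.
html_content = """
<table>
  <tr><th>Header 1</th><th>Header 2</th></tr>
  <tr><td>Row 1, Col 1</td><td><b>Row 1</b>, Col 2</td></tr>
  <tr><td>Row 2, Col 1</td><td></td></tr>
</table>
"""
tree = etree.HTML(html_content)


def cell_text(cell):
    """Join all text inside a cell, including text in nested tags."""
    return "".join(cell.itertext()).strip()


rows = tree.xpath("//table//tr")
headers = [cell_text(c) for c in rows[0].xpath("th")]

# Pair every data cell with the header of its column.
records = [
    dict(zip(headers, (cell_text(c) for c in row.xpath("td"))))
    for row in rows[1:]
]

for record in records:
    print(record)
```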
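Notes
- XPath expressions are case-sensitive; make sure tag names match the case used in the XML document.
- If the XML document uses namespaces, you may need to handle them in your XPath expressions.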
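
Both notes can be demonstrated in a few lines. In this sketch the namespace URI http://example.com/books and the prefix b are invented for the example; the `namespaces` argument of `xpath()` maps a prefix of your choosing to the document's namespace URI.

```python
from lxml import etree

# Hypothetical document with a default namespace.
xml = (
    '<bookstore xmlns="http://example.com/books">'
    "<book><title>Learning XML</title></book>"
    "</bookstore>"
)
root = etree.fromstring(xml)

# Without a prefix the expression looks for <title> in no namespace.
plain = root.xpath("//title")

# Bind a prefix (any name works) to the namespace URI and use it in the path.
ns = {"b": "http://example.com/books"}
titles = root.xpath("//b:title/text()", namespaces=ns)

# Names are case-sensitive: "Title" does not match "title".
wrong_case = root.xpath("//b:Title", namespaces=ns)

print(plain)       # []
print(titles)      # ['Learning XML']
print(wrong_case)  # []
```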
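Practical tips

Getting an XPath quickly from the browser

Open the browser's developer tools and select the element you want to extract. In Chrome, right-click the element in the Elements panel and choose Copy > Copy XPath to get an XPath expression for that position.

XPath Helper

The XPath Helper browser extension shows what an expression matches on the live page, so you can see whether the expression is reasonable.

Works in the browser but not in code?

Sometimes an expression copied straight from the browser, or one you have confirmed in the browser, still fails to extract anything in code. A common cause is that the HTML returned by the server has structure A, while the browser renders it into structure B; an expression written for B finds nothing in A. This is common with tables, especially at the table > tbody > tr level: if the page source has no tbody, browsers generally insert one.
How to resolve it:
- Debug in code: print the HTML your program actually received and test the expression against it.
- View the page source rather than the rendered DOM.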
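
The tbody mismatch is easy to reproduce. In the sketch below, raw_html stands in for HTML as a server might send it, without a tbody; the variable names are arbitrary. Using // between table and tr gives an expression that matches whether or not a tbody is present.

```python
from lxml import etree

# Hypothetical raw HTML as received from the server: no <tbody>.
raw_html = "<table><tr><td>A</td></tr><tr><td>B</td></tr></table>"
tree = etree.HTML(raw_html)

# An expression copied from the browser usually includes tbody
# and finds nothing in the raw HTML.
browser_style = tree.xpath("//table/tbody/tr/td/text()")

# "//tr" below the table matches with or without a tbody in between.
tolerant = tree.xpath("//table//tr/td/text()")

# Printing the parsed tree shows the structure your code really sees.
print(etree.tostring(tree, pretty_print=True).decode())
print(browser_style)  # []
print(tolerant)       # ['A', 'B']
```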
Original article: https://blog.csdn.net/weixin_43845191/article/details/136454143
