Extension 3: Using XPath Expressions in Python

XPath Expressions

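An XPath expression describes the path from one node to another. You first locate an ancestor tag by its name or attributes, then follow a path from that tag to any of its descendant tags.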
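As a minimal sketch of this idea, the snippet below parses a small made-up HTML fragment (the `menu` and `footer` ids and the link targets are hypothetical, not from the original article). It locates a parent `div` by its attribute, then follows a path to its descendants.

```python
from lxml import etree

# Hypothetical HTML fragment used only for illustration.
SAMPLE = """
<div id="menu">
  <ul>
    <li><a href="/home">Home</a></li>
    <li><a href="/about">About</a></li>
  </ul>
</div>
<div id="footer"><a href="/contact">Contact</a></div>
"""

html = etree.HTML(SAMPLE)

# Step 1: locate the parent tag by its attribute.
menu = html.xpath('//div[@id="menu"]')[0]

# Step 2: follow a path from that parent to its descendants.
# "/" moves exactly one level down; "//" matches at any depth below.
items = html.xpath('//div[@id="menu"]/ul/li')
links = html.xpath('//div[@id="menu"]//a')

print(menu.tag, len(items), [a.text for a in links])
```

The link inside the `footer` div is not returned, because both paths start from the `div` whose `id` is `menu`.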

1. Two ways to create an etree object

(1) etree.parse(): load a local HTML document

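from lxml import etree

# etree.parse() uses an XML parser by default, which rejects HTML that is not well-formed XML.
# Passing an HTMLParser lets it load ordinary HTML files.
html = etree.parse("test1.html", etree.HTMLParser())
print(html)  # prints the parsed tree object (an lxml.etree._ElementTree)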
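The following sketch makes the `etree.parse()` route reproducible without a pre-existing `test1.html`. It writes a small temporary file (its content is made up for illustration) and parses it with an `HTMLParser`. The file deliberately contains unclosed tags, which the default XML parser would typically reject.

```python
import os
import tempfile
from lxml import etree

# Deliberately sloppy HTML: <br> and <p> are never closed.
content = "<html><body><p class='title'>Hello<br>world</body></html>"

with tempfile.NamedTemporaryFile(
    "w", suffix=".html", delete=False, encoding="utf-8"
) as f:
    f.write(content)
    path = f.name

try:
    # The HTMLParser tolerates real-world HTML and repairs the structure.
    tree = etree.parse(path, etree.HTMLParser())
    root = tree.getroot()
    titles = tree.xpath('//p[@class="title"]')
    print(type(tree).__name__, root.tag, len(titles))
finally:
    os.remove(path)  # clean up the temporary file
```

Note that `etree.parse()` returns a tree object, and `getroot()` gives its root element. Both support `xpath()`.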

(2) etree.HTML(): parse the source of a web page

import requests
from lxml import etree

# Send a browser-like User-Agent so the site returns the normal page
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36 Edg/97.0.1072.69"
}
response = requests.get(url="https://www.baidu.com/", headers=headers)
result = response.text      # page source as a string
html = etree.HTML(result)   # parse the string into an element tree
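To try `etree.HTML()` without a network request, here is a small sketch in which a hard-coded string stands in for `response.text`. The string is an assumption for illustration and is incomplete on purpose, to show that the parser fills in the missing structure.

```python
from lxml import etree

# Stand-in for response.text: no <html>/<body>, and the <p> tags are unclosed.
source = "<title>Demo</title><p class='title'>first<p>second"

# etree.HTML() takes a string and returns the root element of the parsed tree.
html = etree.HTML(source)

print(html.tag)
print(etree.tostring(html, encoding="unicode"))

paragraphs = html.xpath("//p")
print([p.text for p in paragraphs])
```

Unlike `etree.parse()`, which returns a tree object, `etree.HTML()` returns the root element directly.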

2. Locating tags and extracting data with XPath expressions

(1) Locating tags

import requests
from lxml import etree

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36 Edg/97.0.1072.69"
}
response = requests.get(url="https://www.baidu.com/",headers=headers)
result = response.text
html = etree.HTML(result)
print(html.xpath('//*[@class="title"]'))
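What `xpath()` returns is easier to see on a small, fixed input. The sketch below uses a made-up HTML string rather than a live page. It shows that the result is a list of element objects, and an empty list when nothing matches.

```python
from lxml import etree

html = etree.HTML(
    '<div><h2 class="title">News</h2><span class="title">Sports</span></div>'
)

matches = html.xpath('//*[@class="title"]')
# xpath() returns a list of element objects in document order.
for el in matches:
    print(el.tag, el.text, dict(el.attrib))

# When nothing matches, the result is an empty list, not an error.
missing = html.xpath('//*[@class="no-such-class"]')
print(missing)
```

Because the result is always a list, index into it (or loop over it) only after checking that it is not empty.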

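You can also replace * with a specific tag name, e.g. html.xpath('//p[@class="title"]'), to match only tags of that kind.
Logical operators can be used inside the predicate as well: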
html.xpath('//p[@class="title" and @name = "color"]')
html.xpath('//p[@class="title" or @name = "color"]')
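The difference between `and` and `or` can be checked with a short sketch. The HTML below is invented for illustration, and it includes a `div` to show that naming the tag `p` excludes other tags even when their attributes match.

```python
from lxml import etree

SAMPLE = """
<p class="title" name="color">both</p>
<p class="title">class only</p>
<p name="color">name only</p>
<div class="title" name="color">not a p</div>
"""
html = etree.HTML(SAMPLE)

# "and": both conditions must hold on the same <p>.
both = html.xpath('//p[@class="title" and @name="color"]')
# "or": at least one condition must hold.
either = html.xpath('//p[@class="title" or @name="color"]')

print([p.text for p in both])
print([p.text for p in either])
```

Note that `@class="title"` is an exact string comparison, so an element with `class="title big"` would not match it.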

(2) Extracting text content and attribute values

html.xpath('//*[@class="title"]/text()')
html.xpath('//*[@class="title"]//text()')
html.xpath('//*[@class="title"]/@id')

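/text(): extracts the text that is a direct child of the node
//text(): extracts all text under the node, including text inside descendant tags
/@attribute_name: extracts the value of the named attribute of the node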
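The three forms can be compared side by side on one element. This is a minimal sketch with a made-up `<p>` that contains a nested `<b>` tag, so the difference between `/text()` and `//text()` becomes visible.

```python
from lxml import etree

html = etree.HTML('<p class="title" id="t1">Hello <b>big</b> world</p>')

# Direct text children only: the text inside <b> is skipped.
direct = html.xpath('//*[@class="title"]/text()')
# All text nodes under the element, at any depth.
all_text = html.xpath('//*[@class="title"]//text()')
# Attribute values come back as a list of strings.
ids = html.xpath('//*[@class="title"]/@id')

print(direct)
print(all_text)
print(ids)

# Joining the pieces is a common way to get the full visible text.
full = "".join(all_text)
print(full)
```

All three calls return lists of strings, even when there is only one match.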

3. Quickly obtaining an XPath expression

[Figure: screenshot showing how to obtain an XPath expression quickly]
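The original illustrates this step only with a screenshot. As a programmatic alternative, and not necessarily the method the author had in mind, lxml can generate a positional XPath for an element you already hold, through the tree's `getpath()` method. The HTML string below is an assumption for illustration.

```python
from lxml import etree

html = etree.HTML('<div><p>first</p><p class="title">second</p></div>')

# Find the element once by any convenient expression.
target = html.xpath('//p[@class="title"]')[0]

# getpath() builds an absolute, position-based XPath for that element.
tree = html.getroottree()
path = tree.getpath(target)
print(path)

# The generated path resolves back to the same element.
found = html.xpath(path)
print(found[0].text)
```

Generated paths of this kind depend on element positions, so they can break when the page layout changes. Attribute-based expressions are usually more stable.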

Original article: https://blog.csdn.net/m0_56126722/article/details/122716794